I have a site with only a login page. I want to log in at http://145.100.108.148/login2/login.php and then crawl the next page, http://145.100.108.148/login2/index.php. Both pages must be saved to disk as .html files.
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = 'testspider'
    login_page = 'http://145.100.108.148/login2/login.php'
    start_urls = ['http://145.100.108.148/login2/index.php']

    rules = (
        Rule(LinkExtractor(allow=r'.*'),
             callback='parse_item', follow=True),
    )

    login_user = 'test@hotmail.com'
    login_pass = 'test'

    def start_request(self):
        """This function is called before crawling starts"""
        return [Request(url=self.login_page, callback=self.login)]

    def login(self, response):
        """Generate a login request"""
        return FormRequest.from_response(
            response,
            formdata={'email': self.login_user,
                      'pass': self.login_pass},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in"""
        if b"Dashboard" in response.body:
            self.logger.info("Successfully logged in. Let's start crawling!")
            return self.initialized()
        else:
            self.logger.info("NOT LOGGED IN :(")
            # Something went wrong; we couldn't log in, so nothing happens.
            return

    def parse_item(self, response):
        """Save pages to disk"""
        self.logger.info('Hi, this is an item page! %s', response.url)
        page = response.url.split("/")[-2]
        filename = 'scraped-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Output: while crawling, there is no output from check_login_response telling me whether or not the spider is logged in, even though that if/else statement was added. So I am not sure whether the crawler is actually authenticated. There is also only one saved file, named scraped-login2.html, while I expected at least three files: the register page, the login page, and the index page.
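Separately, note that parse_item derives the filename from the second-to-last URL path segment, so every page under /login2/ maps to the same name and each save overwrites the previous one. A minimal sketch of that naming logic outside Scrapy (the register.php URL here is a guess at the register page's address):

```python
def filename_for(url: str) -> str:
    """Mirror parse_item's naming: use the second-to-last path segment."""
    page = url.split("/")[-2]
    return 'scraped-%s.html' % page

# All pages under /login2/ collapse to the same filename,
# so each save overwrites the last:
print(filename_for('http://145.100.108.148/login2/login.php'))     # → scraped-login2.html
print(filename_for('http://145.100.108.148/login2/index.php'))     # → scraped-login2.html
print(filename_for('http://145.100.108.148/login2/register.php'))  # → scraped-login2.html
```

Using the last segment (e.g. `url.split("/")[-1]`) would give each page its own file.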
CrawlSpider inherits from Spider, and the init_request/initialized() hooks only exist when inheriting from InitSpider. Also, Scrapy calls start_requests (plural), not start_request, so the login request is never scheduled; the method needs to be renamed. Next, the response you get in response.body is bytes, so the check must compare against b"Dashboard" rather than a plain string.
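The bytes point can be checked in plain Python, without Scrapy: a membership test that mixes str and bytes raises rather than silently returning False.

```python
body = b"<h1>Dashboard</h1>"  # response.body is always bytes in Scrapy

# Testing a str against bytes raises TypeError on Python 3:
try:
    "Dashboard" in body
except TypeError as e:
    print("str vs bytes:", e)

# Compare bytes with bytes instead:
print(b"Dashboard" in body)  # → True
```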
Thanks to @Tarun Lalwani and some trial and error, the result is as follows: