<p><strong>Don't override the <code>parse</code> function in <code>CrawlSpider</code>:</strong></p>
<p>When using a <code>CrawlSpider</code>, you should not override the <code>parse</code> function. There is a warning about this in the <code>CrawlSpider</code> documentation here: <a href="http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule" rel="noreferrer">http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule</a></p>
<p>This is because with a <code>CrawlSpider</code>, <code>parse</code> (the default callback of any request) sends the response to be processed by the <code>Rule</code>s.</p>
<hr/>
<p><strong>Logging in before crawling:</strong></p>
<p>In order to have some kind of initialization before a spider starts crawling, you can use an <code>InitSpider</code> (which inherits from <code>CrawlSpider</code>) and override the <code>init_request</code> function. This function will be called when the spider is initializing, before it starts crawling.</p>
<p>In order for the spider to begin crawling, you need to call <code>self.initialized</code>.</p>
<p>You can read the code that's responsible for this <a href="https://github.com/scrapy/scrapy/blob/master/scrapy/spiders/init.py" rel="noreferrer"><strong>here</strong></a> (it has helpful docstrings).</p>
<hr/>
<p><strong>An example:</strong></p>
<pre><code>from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com/login'
    start_urls = ['http://www.example.com/useful_page/',
                  'http://www.example.com/another_useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from page
        pass
</code></pre>
<hr/>
<p><strong>Saving items:</strong></p>
<p>Items your spider returns are passed along to the pipeline, which is responsible for doing whatever you want done with the data. I recommend you read the documentation: <a href="http://doc.scrapy.org/en/0.14/topics/item-pipeline.html" rel="noreferrer">http://doc.scrapy.org/en/0.14/topics/item-pipeline.html</a></p>
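<p>As a rough sketch (the class name and what it does with the items are assumptions, not something from your project), a pipeline is just a class with a <code>process_item</code> method that Scrapy calls once for each item the spider yields:</p>

<pre><code># A hypothetical pipeline sketch: collects items in memory.
# A real pipeline would more likely write to a file or database.
class SaveItemsPipeline(object):

    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Called once per item the spider yields; do whatever you
        # need with the data here.
        self.items.append(dict(item))
        # Returning the item passes it on to any later pipelines.
        return item
</code></pre>

<p>You would then enable it in your project's settings, e.g. <code>ITEM_PIPELINES = {'myproject.pipelines.SaveItemsPipeline': 300}</code> (the module path here is hypothetical).</p>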
<p>If you have any problems/questions regarding <code>Item</code>s, don't hesitate to pop open a new question and I'll do my best to help.</p>