Crawling with an authenticated session in Scrapy

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/login/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        if not "Hi Herman" in response.body:
            return self.login(response)
        else:
            return self.parse_item(response)

    def login(self, response):
        return [FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.parse)]

    def parse_item(self, response):
        i['url'] = response.url
        # ... do more things
        return i

3 Answers

User

#1 · edited 2024-09-29 08:29:40

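To get the solution above to work, I had to make CrawlSpider inherit from InitSpider instead of BaseSpider by changing the following in the Scrapy source code. In the file scrapy/contrib/spiders/crawl.py:

Add: from scrapy.contrib.spiders.init import InitSpider
Change class CrawlSpider(BaseSpider) to class CrawlSpider(InitSpider)

Otherwise the spider never calls the init_request method.

Is there an easier way?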

User

#2 · edited 2024-09-29 08:29:40

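Do not override the parse function in a CrawlSpider:

When you use a CrawlSpider, you should not override the parse function. The CrawlSpider documentation has a warning about this here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule

This is because, in a CrawlSpider, parse (the default callback of any request) is what sends the response on to be processed by the Rules.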
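To see why overriding parse breaks the rules, here is a toy model in plain Python. It is not Scrapy's code: ToyCrawlSpider, BrokenSpider, and the list-of-links input are all made up for illustration (a real parse receives a response object), and the sketch only assumes that rule dispatch lives inside the default callback, as described above.

```python
import re


class ToyCrawlSpider:
    """Toy stand-in for CrawlSpider: parse() is where the rules get applied."""

    # Each rule is (compiled link pattern, callback method name).
    rules = ((re.compile(r"-\w+\.html$"), "parse_item"),)

    def parse(self, links):
        # Default callback: route every matching link to its rule's callback.
        results = []
        for link in links:
            for pattern, callback_name in self.rules:
                if pattern.search(link):
                    results.append(getattr(self, callback_name)(link))
        return results

    def parse_item(self, link):
        return {"url": link}


class BrokenSpider(ToyCrawlSpider):
    def parse(self, links):
        # Overriding parse() replaces the rule dispatch, so parse_item never runs.
        return []


links = ["http://example.com/post-abc.html", "http://example.com/about"]
print(ToyCrawlSpider().parse(links))
print(BrokenSpider().parse(links))
```

The subclass that overrides parse returns nothing for the same links, which mirrors what happens when a CrawlSpider's parse is replaced by login logic.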

Logging in before crawling:

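To do some kind of initialisation before a spider starts crawling, you can use an InitSpider and override its init_request function. (In Scrapy's source, InitSpider subclasses BaseSpider rather than CrawlSpider.) This function is called when the spider is initialised, before crawling starts.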

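For the spider to begin crawling, you need to call self.initialized.

You can read the code responsible for this in the scrapy.contrib.spiders.init module (it has helpful docstrings).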
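The hand-off between init_request and initialized can be pictured with a simplified model. This is a sketch under assumptions, not Scrapy's implementation: ToyInitSpider is a hypothetical class, and requests are modelled as plain strings instead of Request objects.

```python
class ToyInitSpider:
    """Simplified model of the InitSpider flow (not Scrapy's actual code)."""

    start_urls = ["http://www.example.com/useful_page/"]

    def start_requests(self):
        # Hold back the normal start requests until initialization is done.
        self._postinit_reqs = list(self.start_urls)
        return [self.init_request()]

    def init_request(self):
        # A subclass overrides this to return its first setup request.
        return "GET http://www.example.com/login"

    def initialized(self):
        # Release the held-back start requests exactly once.
        return self.__dict__.pop("_postinit_reqs")


spider = ToyInitSpider()
first = spider.start_requests()      # only the login request goes out
after_login = spider.initialized()   # now the real start URLs are released
```

The point of the design is that nothing from start_urls is scheduled until the last initialization callback returns the result of initialized().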

Example:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com/login'
    start_urls = ['http://www.example.com/useful_page/',
                  'http://www.example.com/another_useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from the page here.
        pass

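Saving items:

Items returned by your spider are passed along to the pipeline, which is responsible for doing whatever you want with the data. I recommend reading the documentation: http://doc.scrapy.org/en/0.14/topics/item-pipeline.html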
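As a rough illustration of what a pipeline looks like, here is a minimal sketch. A pipeline is a plain class with a process_item(item, spider) method; the class name JsonLinesPipeline and the choice to collect JSON lines in memory are assumptions made for this example, and the pipeline would still have to be enabled through the ITEM_PIPELINES setting in a real project.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that collects each item as one JSON line."""

    def __init__(self):
        self.lines = []

    def process_item(self, item, spider):
        # Scrapy calls process_item() for every item a spider returns.
        self.lines.append(json.dumps(dict(item), sort_keys=True))
        # Return the item so later pipelines can keep processing it.
        return item


# Stand-alone usage with a plain dict in place of a Scrapy Item.
pipeline = JsonLinesPipeline()
item = pipeline.process_item(
    {"url": "http://www.example.com/post-abc.html"}, spider=None
)
```

Outside Scrapy the class can be exercised directly like this, which also makes it easy to unit test.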

If you run into any problems with Items, don't hesitate to open a new question and I'll do my best to help.

User

#3 · edited 2024-09-29 08:29:40

If what you need is HTTP authentication, use the provided middleware hook.

In settings.py:

DOWNLOADER_MIDDLEWARES = {'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300}

And in your spider class, add the attributes:

http_user = "user"
http_pass = "pass"
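For context, the sketch below shows the header that HTTP Basic authentication puts on each request, assuming Basic auth is the scheme the middleware sends. The helper basic_auth_header is defined locally for illustration and is not taken from this thread.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic `Authorization` header."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Same placeholder credentials as the http_user / http_pass attributes above.
header = basic_auth_header("user", "pass")
print(header)
```

Because the credentials are only base64-encoded, not encrypted, this scheme should be used over HTTPS.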
