If a captcha appears, log in again and retry immediately

Published 2024-09-24 22:29:57


I'm working on a spider that needs to log in first and then parse a list of orders. The site being scraped occasionally shows a captcha: after a login attempt it either asks for just the captcha, or asks for the login details again together with a captcha.

The spider below works as expected: it tries to log in, checks in the check_login_response method whether the login succeeded, and if not calls self.login() again. The spider normally receives a list of order URLs, which are loaded into start_urls in the __init__ method at runtime.

What happens now is that execution runs through the parse_page method and then stops: I can see the URLs printed by the line log.msg('request %s' % url), but the spider never executes the parse method for the start URL list.

This only happens in the captcha-retry case; in the normal login scenario everything works fine and the parse method is called.

Any suggestions?

Also, I tried both the Spider and CrawlSpider classes, with the same result.
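(For context on why a re-issued request might silently vanish — this is an assumption on my part, not something the logs below confirm: Scrapy's default duplicate filter drops any request whose fingerprint, built from the method, URL, and body, has already been seen, so a retried login request to the same URL can be filtered out unless it sets dont_filter=True. A rough model of that behaviour, not Scrapy's actual RFPDupeFilter code:)

```python
# Toy model of request deduplication: hash (method, URL, body) and drop
# any request whose fingerprint was already scheduled, unless the
# request explicitly opts out with dont_filter=True.
import hashlib


def fingerprint(method, url, body=b""):
    """Hash the parts of a request that a dedupe filter typically compares."""
    h = hashlib.sha1()
    for part in (method.encode(), url.encode(), body):
        h.update(part)
    return h.hexdigest()


seen = set()


def should_download(method, url, body=b"", dont_filter=False):
    """Return True if the request would be scheduled, False if dropped."""
    fp = fingerprint(method, url, body)
    if dont_filter:
        return True
    if fp in seen:
        return False
    seen.add(fp)
    return True


# First login POST is scheduled, an identical retry is dropped...
print(should_download("POST", "https://www.example.com/login/"))  # True
print(should_download("POST", "https://www.example.com/login/"))  # False
# ...unless the retry opts out of filtering.
print(should_download("POST", "https://www.example.com/login/", dont_filter=True))  # True
```

If this were the cause, passing dont_filter=True when re-issuing the login Request/FormRequest would bypass the filter.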

class SomeCrawlSpider(CrawlSpider):
    """ SomeSpider
    """
    name = 'some_order_details'
    allowed_domains = ['example.com']
    start_urls = []
    login_url = 'https://www.example.com/login/'
    login_attempts = 0

    def __init__(self, qwargs, *args, **kwargs):
        # order detail URLs are passed in at runtime and become the start URLs
        self.start_urls = qwargs.get('order_details_urls')
        super(SomeCrawlSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        """ start_requests
        @return:
        """
        log.msg('starting requests')
        return self.init_request()

    def init_request(self):
        """ init_request
        @return: list
        """
        log.msg('init requests')
        return [Request(url=self.login_url, callback=self.login)]

    def check_login_response(self, response):
        """ check if response is logged in
        @param response:
        @return:
        """
        log.msg('check requests')

        # first we check if we get the captcha page again
        if "Type the characters you see in this image" in response.body_as_unicode() \
                or "What is your e-mail address?" in response.body_as_unicode():
            return self.login(response)

        return self.parse_page(response)

    def parse_page(self, response):
        """ parse_page
        @param response:
        @return:
        """
        self.login_attempts += 1
        if "Your Orders" in response.body_as_unicode():
            log.msg('user is logged in')
            self.credentials.last_used_at = datetime.utcnow().replace(tzinfo=utc)
            self.credentials.save()
            for url in self.start_urls:
                log.msg('request %s' % url)
                yield self.make_requests_from_url(url)

        else:
            yield OrderItem(auth_failed=True)

    def login(self, response):
        """ login
        @param response:
        @return:
        """
        # check the existence of credentials:
        if not self.credentials or not all([self.credentials.username, self.credentials.password]):
            log.msg('Credentials are not set correctly')
            return OrderItem(auth_failed=True)

        log.msg('Trying to login')

        # check if response is captcha
        if "Type the characters you see in this image" in response.body_as_unicode():
            # captcha page before login
            log.msg('Captcha detected and guessing pass through')
            self.crawler.engine.pause()
            captcha = select_captcha_from_image(response)
            self.crawler.engine.unpause()
            log.msg('captcha detected: %s' % str(captcha))
            if not captcha:
                # if captcha returns Null
                log.msg('captcha was not decoded')
                raise OrderDetailsNoCaptchaException

            if "What is your e-mail address?" in response.body_as_unicode():
                log.msg('logging in via form: captcha + credentials')
                return FormRequest.from_response(response,
                                                 formdata={'guess': str(captcha),
                                                           'email': 'XXXX',
                                                           'password': 'XXXX'},
                                                 callback=self.check_login_response)
            else:
                log.msg('posting captcha')
                return FormRequest.from_response(response,
                                                 formdata={'field-keywords': str(captcha)},
                                                 callback=self.check_login_response)

        if 'What is your e-mail address?' in response.body_as_unicode():
            log.msg('logging in via form')
            return FormRequest.from_response(response,
                                             formdata={'email': 'XXX',
                                                       'password': 'XXXX'},
                                             callback=self.check_login_response)
        return OrderItem(auth_failed=True)

    def parse(self, response):
        """ parse
        @param response: Response object
        @return: OrderItem
        """
        log.msg('Parsing items invoked')
        # here I parse the response into an item and yield it back below
        yield OrderItem(**item)
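One subtlety in parse_page worth spelling out: inside a generator, a yield placed after the if/for block runs regardless of whether the branch was taken, so a trailing failure item needs an else (or an early return) to be emitted only on failure. A minimal demonstration with plain tuples standing in for requests and items:

```python
# A yield after the loop, without an else, fires on success AND failure.
def parse_page(logged_in, start_urls):
    if logged_in:
        for url in start_urls:
            yield ("request", url)
    yield ("auth_failed", True)  # runs even when login succeeded


print(list(parse_page(True, ["u1", "u2"])))
# [('request', 'u1'), ('request', 'u2'), ('auth_failed', True)]


# Guarding with else emits the failure marker only when login failed.
def parse_page_fixed(logged_in, start_urls):
    if logged_in:
        for url in start_urls:
            yield ("request", url)
    else:
        yield ("auth_failed", True)


print(list(parse_page_fixed(True, ["u1", "u2"])))
# [('request', 'u1'), ('request', 'u2')]
```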

EDIT

This is the case when a captcha is detected:

^{pr2}$

And this is the case where there is none:

[2014-12-15 18:10:43,229: INFO/MainProcess] Received task: product.parse_messages_to_purchases[c06bafdc-0f39-4e43-8f74-ad899f30e799]
[2014-12-15 18:10:44,557: WARNING/Worker-1] 2014-12-15 18:10:44-0600 [scrapy] INFO: Enabled extensions: LogStats, CloseSpider, SpiderState
[2014-12-15 18:10:45,560: ERROR/Worker-1] Unable to read instance data, giving up
[2014-12-15 18:10:45,561: WARNING/Worker-1] 2014-12-15 18:10:45-0600 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
[2014-12-15 18:10:45,561: WARNING/Worker-1] 2014-12-15 18:10:45-0600 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
[2014-12-15 18:10:45,561: WARNING/Worker-1] 2014-12-15 18:10:45-0600 [scrapy] INFO: Enabled item pipelines: PurchaseWriterPipeline
[2014-12-15 18:10:45,570: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:45-0600 [scrapy] INFO: starting requests
[2014-12-15 18:10:45,570: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:45-0600 [scrapy] INFO: init requests
[2014-12-15 18:10:45,571: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:45-0600 [some_order_details] INFO: Spider opened
[2014-12-15 18:10:45,574: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:45-0600 [some_order_details] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
/Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py:159: DeprecationWarning: <scrapy.core.downloader.contextfactory.ScrapyClientContextFactory instance at 0x11311c3f8> was passed as the HTTPS policy for an Agent, but it does not provide IPolicyForHTTPS.  Since Twisted 14.0, you must pass a provider of IPolicyForHTTPS.
  connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)

[2014-12-15 18:10:45,580: WARNING/ProductCrawlerScript-1:1] /Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py:159: DeprecationWarning: <scrapy.core.downloader.contextfactory.ScrapyClientContextFactory instance at 0x11311c3f8> was passed as the HTTPS policy for an Agent, but it does not provide IPolicyForHTTPS.  Since Twisted 14.0, you must pass a provider of IPolicyForHTTPS.
  connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)

[2014-12-15 18:10:50,563: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:50-0600 [scrapy] INFO: Trying to login
[2014-12-15 18:10:50,563: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:50-0600 [scrapy] INFO: logging in via form
[2014-12-15 18:10:52,899: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:52-0600 [scrapy] INFO: check requests
[2014-12-15 18:10:52,904: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:52-0600 [scrapy] INFO: user is logged in
[2014-12-15 18:10:52,924: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:52-0600 [scrapy] INFO: request https://www.some.com/gp/css/summary/edit.html?ie=UTF8&orderID=111-1260932-6725022&ref_=oh_aui_or_o02_&
[2014-12-15 18:10:54,164: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:54-0600 [scrapy] INFO: Parsing items invoked
/Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py:903: RuntimeWarning: DateTimeField Purchase.shipping_date received a naive datetime (2014-12-08 00:00:00) while time zone support is active.
  RuntimeWarning)

[2014-12-15 18:10:54,191: WARNING/ProductCrawlerScript-1:1] /Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py:903: RuntimeWarning: DateTimeField Purchase.shipping_date received a naive datetime (2014-12-08 00:00:00) while time zone support is active.
  RuntimeWarning)

[2014-12-15 18:10:54,262: INFO/MainProcess] Received task: semantic.get_product_and_create[939f73f6-f5bd-4d31-8a1b-6544de37b7b2] eta:[2014-12-16 00:12:34.246354+00:00]
/Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py:903: RuntimeWarning: DateTimeField Purchase.shipping_date received a naive datetime (2014-12-09 00:00:00) while time zone support is active.
  RuntimeWarning)

[2014-12-15 18:10:54,268: WARNING/ProductCrawlerScript-1:1] /Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py:903: RuntimeWarning: DateTimeField Purchase.shipping_date received a naive datetime (2014-12-09 00:00:00) while time zone support is active.
  RuntimeWarning)

[2014-12-15 18:10:54,286: INFO/MainProcess] Received task: semantic.get_product_and_create[acae6d98-2769-4d27-bcaa-a30ded095e4a] eta:[2014-12-16 00:12:34.285050+00:00]
[2014-12-15 18:10:54,288: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:54-0600 [some_order_details] INFO: Closing spider (finished)
[2014-12-15 18:10:54,289: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:54-0600 [some_order_details] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 7527,
     'downloader/request_count': 6,
     'downloader/request_method_count/GET': 5,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 79284,
     'downloader/response_count': 6,
     'downloader/response_status_count/200': 3,
     'downloader/response_status_count/301': 1,
     'downloader/response_status_count/302': 2,
     'request_depth_max': 2,
     'scheduler/dequeued': 6,
     'scheduler/dequeued/memory': 6,
     'scheduler/enqueued': 6,
     'scheduler/enqueued/memory': 6}
[2014-12-15 18:10:54,290: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:54-0600 [some_order_details] INFO: Spider closed (finished)
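A separate robustness gap in the spider above: login_attempts is incremented but never checked, so a captcha that repeatedly fails to decode could bounce between login() and check_login_response() indefinitely. A bounded-retry guard could look like the sketch below (MAX_LOGIN_ATTEMPTS and next_action are hypothetical names, not part of the original spider):

```python
# Decide the spider's next step from the attempt counter and whether the
# response is still a captcha page. Caps retries at a fixed limit.
MAX_LOGIN_ATTEMPTS = 3


def next_action(attempts, captcha_page):
    """Return what the spider should do after inspecting a response."""
    if not captcha_page:
        return "parse"        # logged in, proceed to the order URLs
    if attempts >= MAX_LOGIN_ATTEMPTS:
        return "give_up"      # emit OrderItem(auth_failed=True) and stop
    return "retry_login"      # call self.login() again


print(next_action(1, captcha_page=True))   # retry_login
print(next_action(3, captcha_page=True))   # give_up
print(next_action(5, captcha_page=False))  # parse
```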

Tags: in, self, info, log, url, return, init, response