I am working on a spider that needs to log in first and then parse a list of order pages. The site tries to block scraping with a captcha: occasionally, after a successful login, it asks either for just the captcha again, or for the login credentials together with a captcha.

The spider below works as expected up to a point: it attempts to log in and checks in the check_login_response method whether the login succeeded; if not, it calls self.login() again. The spider receives a list of order URLs at runtime, which are loaded into start_urls in the __init__ method.

What happens now is that the spider executes the parse_page method and then stops; I can see the URL printed by the line log.msg('request %s' % url). But the spider never goes on to execute the parse method for the list of start URLs.

The problem only occurs in the captcha-retry case; in the normal login scenario everything works fine and the parse method is called.

Any suggestions?

Note: I tried both the Spider and CrawlSpider base classes, with the same result.
from datetime import datetime

from django.utils.timezone import utc
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request, FormRequest

# project-local names used below: OrderItem, OrderDetailsNoCaptchaException,
# select_captcha_from_image


class SomeCrawlSpider(CrawlSpider):
    """ SomeSpider
    """
    name = 'some_order_details'
    allowed_domains = ['example.com']
    start_urls = []
    login_url = 'https://www.example.com/login/'
    login_attempts = 0

    def __init__(self, qwargs, *args, **kwargs):
        # start_urls is populated at runtime from the passed-in dict
        self.start_urls = qwargs.get('order_details_urls')
        super(SomeCrawlSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        """ start_requests
        @return:
        """
        log.msg('starting requests')
        return self.init_request()

    def init_request(self):
        """ init_request
        @return: list
        """
        log.msg('init requests')
        return [Request(url=self.login_url, callback=self.login)]

    def check_login_response(self, response):
        """ check if response is logged in
        @param response:
        @return:
        """
        log.msg('check requests')
        # first we check if we get the captcha page again
        if "Type the characters you see in this image" in response.body_as_unicode() \
                or "What is your e-mail address?" in response.body_as_unicode():
            return self.login(response)
        return self.parse_page(response)

    def parse_page(self, response):
        """ parse_page
        @param response:
        @return:
        """
        self.login_attempts += 1
        if "Your Orders" in response.body_as_unicode():
            log.msg('user is logged in')
            self.credentials.last_used_at = datetime.utcnow().replace(tzinfo=utc)
            self.credentials.save()
            for url in self.start_urls:
                log.msg('request %s' % url)
                yield self.make_requests_from_url(url)
        yield OrderItem(auth_failed=True)

    def login(self, response):
        """ login
        @param response:
        @return:
        """
        # check the existence of credentials:
        if not any([self.credentials, self.credentials.username, self.credentials.password]):
            log.msg('Credentials is not set correctly')
            return OrderItem(auth_failed=True)
        log.msg('Trying to login')
        # check if response is captcha
        if "Type the characters you see in this image" in response.body_as_unicode():
            # captcha page before login
            log.msg('Captcha detected and guessing pass through')
            self.crawler.engine.pause()
            captcha = select_captcha_from_image(response)
            self.crawler.engine.unpause()
            log.msg('captcha detected: %s' % str(captcha))
            if not captcha:
                # if captcha returns Null
                log.msg('captcha was not decoded')
                raise OrderDetailsNoCaptchaException
            if "What is your e-mail address?" in response.body_as_unicode():
                log.msg('logging in via form: captcha + credentials')
                return FormRequest.from_response(response,
                                                 formdata={'guess': str(captcha),
                                                           'email': 'XXXX',
                                                           'password': 'XXXX'},
                                                 callback=self.check_login_response)
            else:
                log.msg('posting captcha')
                return FormRequest.from_response(response,
                                                 formdata={'field-keywords': str(captcha)},
                                                 callback=self.check_login_response)
        if 'What is your e-mail address?' in response.body_as_unicode():
            log.msg('logging in via form')
            return FormRequest.from_response(response,
                                             formdata={'email': 'XXX',
                                                       'password': 'XXXX'},
                                             callback=self.check_login_response)
        return OrderItem(auth_failed=True)

    def parse(self, response):
        """ this function returns
        @param response: Response Object
        @return: Dictionary
        """
        log.msg('Parsing items invoked')
        # here the response is parsed into `item` (elided) and yielded back below
        yield OrderItem(**item)
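One detail worth flagging in parse_page above, at least as transcribed: because the method is a generator and the final yield OrderItem(auth_failed=True) is not in an else branch, it runs even after a successful login, once the for loop finishes. A minimal stdlib-only sketch of that fall-through (parse_page_flow, logged_in, and urls are hypothetical names, not from the spider):

```python
def parse_page_flow(logged_in, urls):
    """Mimics the control flow of parse_page above, without Scrapy."""
    if logged_in:
        for url in urls:
            yield ('request', url)
    # not inside an else branch, so this runs on the success path too
    yield ('item', {'auth_failed': True})

results = list(parse_page_flow(True, ['url-1', 'url-2']))
print(results[-1])  # ('item', {'auth_failed': True}) even though logged_in is True
```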
Update

This is the log for the case when a captcha is detected:

[log output missing from the original post]

And this is the case where no captcha appears:
[2014-12-15 18:10:43,229: INFO/MainProcess] Received task: product.parse_messages_to_purchases[c06bafdc-0f39-4e43-8f74-ad899f30e799]
[2014-12-15 18:10:44,557: WARNING/Worker-1] 2014-12-15 18:10:44-0600 [scrapy] INFO: Enabled extensions: LogStats, CloseSpider, SpiderState
[2014-12-15 18:10:45,560: ERROR/Worker-1] Unable to read instance data, giving up
[2014-12-15 18:10:45,561: WARNING/Worker-1] 2014-12-15 18:10:45-0600 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
[2014-12-15 18:10:45,561: WARNING/Worker-1] 2014-12-15 18:10:45-0600 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
[2014-12-15 18:10:45,561: WARNING/Worker-1] 2014-12-15 18:10:45-0600 [scrapy] INFO: Enabled item pipelines: PurchaseWriterPipeline
[2014-12-15 18:10:45,570: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:45-0600 [scrapy] INFO: starting requests
[2014-12-15 18:10:45,570: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:45-0600 [scrapy] INFO: init requests
[2014-12-15 18:10:45,571: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:45-0600 [some_order_details] INFO: Spider opened
[2014-12-15 18:10:45,574: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:45-0600 [some_order_details] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[2014-12-15 18:10:45,580: WARNING/ProductCrawlerScript-1:1] /Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py:159: DeprecationWarning: <scrapy.core.downloader.contextfactory.ScrapyClientContextFactory instance at 0x11311c3f8> was passed as the HTTPS policy for an Agent, but it does not provide IPolicyForHTTPS. Since Twisted 14.0, you must pass a provider of IPolicyForHTTPS.
connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
[2014-12-15 18:10:50,563: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:50-0600 [scrapy] INFO: Trying to login
[2014-12-15 18:10:50,563: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:50-0600 [scrapy] INFO: logging in via form
[2014-12-15 18:10:52,899: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:52-0600 [scrapy] INFO: check requests
[2014-12-15 18:10:52,904: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:52-0600 [scrapy] INFO: user is logged in
[2014-12-15 18:10:52,924: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:52-0600 [scrapy] INFO: request https://www.some.com/gp/css/summary/edit.html?ie=UTF8&orderID=111-1260932-6725022&ref_=oh_aui_or_o02_&
[2014-12-15 18:10:54,164: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:54-0600 [scrapy] INFO: Parsing items invoked
[2014-12-15 18:10:54,191: WARNING/ProductCrawlerScript-1:1] /Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py:903: RuntimeWarning: DateTimeField Purchase.shipping_date received a naive datetime (2014-12-08 00:00:00) while time zone support is active.
RuntimeWarning)
[2014-12-15 18:10:54,262: INFO/MainProcess] Received task: semantic.get_product_and_create[939f73f6-f5bd-4d31-8a1b-6544de37b7b2] eta:[2014-12-16 00:12:34.246354+00:00]
[2014-12-15 18:10:54,268: WARNING/ProductCrawlerScript-1:1] /Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py:903: RuntimeWarning: DateTimeField Purchase.shipping_date received a naive datetime (2014-12-09 00:00:00) while time zone support is active.
RuntimeWarning)
[2014-12-15 18:10:54,286: INFO/MainProcess] Received task: semantic.get_product_and_create[acae6d98-2769-4d27-bcaa-a30ded095e4a] eta:[2014-12-16 00:12:34.285050+00:00]
[2014-12-15 18:10:54,288: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:54-0600 [some_order_details] INFO: Closing spider (finished)
[2014-12-15 18:10:54,289: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:54-0600 [some_order_details] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 7527,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 5,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 79284,
'downloader/response_count': 6,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 2,
'request_depth_max': 2,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6}
[2014-12-15 18:10:54,290: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:54-0600 [some_order_details] INFO: Spider closed (finished)
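For anyone stepping through the callbacks above: check_login_response returns self.parse_page(response), and since parse_page is a generator function, that call only builds a generator object; none of its body (including the log.msg calls) executes until Scrapy iterates it. A stdlib-only sketch of that laziness (callback and trace are hypothetical names):

```python
trace = []

def callback():
    """Generator function standing in for parse_page."""
    trace.append('body ran')
    yield 'item'

g = callback()      # returns a generator; 'body ran' is NOT appended yet
assert trace == []
items = list(g)     # iterating the generator is what executes the body
assert trace == ['body ran']
```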