Scrapy returns a different status code than Python requests for a GET request with the same headers and method


For a while now I have been using the cloudscraper package to scrape data from a website behind Cloudflare protection. This recently stopped working. While investigating the issue, it appears that Scrapy fails (it gets a 503 and a captcha page), while performing the same request with requests.get, with the same headers, "succeeds".

To reproduce:

>>> fetch('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:13:37 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94> (referer: None)

So simply trying to fetch that page returns a Cloudflare protection page. If I use Python requests, the result is the same (a 503 response):

>>> import requests
>>> rq_response = requests.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:21:47 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): targetlaos.com:443
2020-11-23 16:21:47 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 503 None

So I use cloudscraper to obtain the proper headers/cookies:

>>> import cloudscraper
>>> cs = cloudscraper.create_scraper()
>>> cs_response = cs.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:19:16 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): targetlaos.com:443
2020-11-23 16:19:17 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 200 None
# Note: I do the below for the second time, so I can capture the request-headers for testing. I'm aware I can do this more efficiently
>>> cs_response = cs.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:25:19 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 200 None
>>> cs_response.request.headers
{'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:65.0) Gecko/20100101 Firefox/65.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en-US,en;q=0.5', 'Accept-Encoding': 'gzip, deflate', 'Cookie': '__cf_bm=03ac0515966d1267f42bd7d107562a0b9d5d3362-1606148357-1800-AYHPir56JSKk63KtQUR1vGzVipMSH3fHiDSDO/ireLEQ; __cfduid=dead0a14752f609eba026cd5f982dccec1606148356; wppas_pvbl=%5B%2252735%22%2C52736%5D; wppas_user_stats=%7B%221606089600%22%3A%7B%22impressions%22%3A%7B%22banners%22%3A%5B%2252735%22%2C52736%5D%7D%2C%22clicks%22%3A%7B%22banners%22%3A%5B%5D%7D%7D%7D'}
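
As the comment in the snippet above already notes, the request is issued a second time only to capture the request headers. A shortcut (a sketch, assuming cloudscraper's scraper object behaves like the requests.Session it subclasses) is to read the cookies and default headers straight off the session:

>>> import requests.utils
>>> # the Cloudflare clearance cookies from the first GET live in the session's cookie jar
>>> cookie_dict = requests.utils.dict_from_cookiejar(cs.cookies)
>>> # the default headers (User-Agent, Accept, ...) that the scraper sends
>>> session_headers = dict(cs.headers)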

So, if this request now works with those cookies, I should be able to use them in Scrapy to fetch the data:

>>> fetch('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94', headers=cs_response.request.headers)
2020-11-23 16:29:49 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94> (referer: None)
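
A side note on how the cookies are passed here: with the downloader middlewares disabled (see below) the raw Cookie header goes out untouched, but in a normal crawl Scrapy's CookiesMiddleware manages the Cookie header itself, and cookies are usually handed over through the cookies argument of scrapy.Request. A minimal sketch of that variant (an assumption to rule out, not a confirmed fix):

>>> import scrapy
>>> url = 'https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94'
>>> # strip the raw Cookie header and hand the cookies to Scrapy separately
>>> hdrs = {k: v for k, v in cs_response.request.headers.items() if k.lower() != 'cookie'}
>>> req = scrapy.Request(url, headers=hdrs, cookies={c.name: c.value for c in cs.cookies})
>>> fetch(req)  # fetch() in the Scrapy shell also accepts a Request object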

So it does not work in Scrapy. To check that it did not change the headers (I disabled all downloader middlewares for this test):

>>> response.request.headers
{b'User-Agent': [b'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:65.0) Gecko/20100101 Firefox/65.0'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en-US,en;q=0.5'], b'Accept-Encoding': [b'gzip, deflate'], b'Cookie': [b'__cf_bm=03ac0515966d1267f42bd7d107562a0b9d5d3362-1606148357-1800-AYHPir56JSKk63KtQUR1vGzVipMSH3fHiDSDO/ireLEQ; __cfduid=dead0a14752f609eba026cd5f982dccec1606148356; wppas_pvbl=%5B%2252735%22%2C52736%5D; wppas_user_stats=%7B%221606089600%22%3A%7B%22impressions%22%3A%7B%22banners%22%3A%5B%2252735%22%2C52736%5D%7D%2C%22clicks%22%3A%7B%22banners%22%3A%5B%5D%7D%7D%7D']}
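
To compare the two header sets programmatically rather than by eye, one quick check (a sketch; it assumes Scrapy's Headers.to_unicode_dict(), which converts the byte keys and values back to strings) is:

>>> sh = {k.lower(): v for k, v in response.request.headers.to_unicode_dict().items()}
>>> rh = {k.lower(): v for k, v in cs_response.request.headers.items()}
>>> # list any header that differs between the Scrapy request and the requests request
>>> [(k, sh.get(k), rh.get(k)) for k in set(sh) | set(rh) if sh.get(k) != rh.get(k)]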

If I try the same request with Python requests:

>>> requests.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94', headers=cs_response.request.headers)
2020-11-23 16:33:13 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): targetlaos.com:443
2020-11-23 16:33:14 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 200 None
<Response [200]>

So the "same" request (or rather: what I expect to be the exact same request) works in Python requests, but not in Scrapy. Does anyone know why this happens?

Note: I tested all of the above on an AWS EC2 instance running Ubuntu Server 20.04.1 and Python 3.7.9, since the Cloudflare page does not show up from my local IP.

What I have tried:

  • Disabling all downloader middlewares (a sketch of that setting follows this list)
  • The latest version of Scrapy (2.4.1)
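
For reference, a sketch of what "disabling all downloader middlewares" can look like in settings.py: the built-in middlewares are removed by mapping their default entries to None (only two are shown; the same pattern applies to the remaining entries of DOWNLOADER_MIDDLEWARES_BASE). This is a hedged illustration, not the exact configuration used above.

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': None,
    # ...and so on for the other built-in downloader middlewares
}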
