Scrapy returns a different status code than Python requests for a GET request with the same headers and method


For a while now I have been using the cloudscraper package to scrape data from a website behind Cloudflare protection. This recently stopped working. While investigating the issue, it appears that Scrapy fails (it gets a 503 and a captcha page), while performing the same request with requests.get, with the same headers, "succeeds".

To reproduce:

>>> fetch('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:13:37 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94> (referer: None)

So simply trying to fetch that page returns a Cloudflare protection page. If I use Python requests, the result is the same (a 503 response):

>>> import requests
>>> rq_response = requests.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:21:47 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): targetlaos.com:443
2020-11-23 16:21:47 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 503 None

So I use cloudscraper to obtain the proper headers/cookies:

>>> import cloudscraper
>>> cs = cloudscraper.create_scraper()
>>> cs_response = cs.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:19:16 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): targetlaos.com:443
2020-11-23 16:19:17 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 200 None
# Note: I do the below for the second time, so I can capture the request-headers for testing. I'm aware I can do this more efficiently
>>> cs_response = cs.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:25:19 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 200 None
>>> cs_response.request.headers
{'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:65.0) Gecko/20100101 Firefox/65.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en-US,en;q=0.5', 'Accept-Encoding': 'gzip, deflate', 'Cookie': '__cf_bm=03ac0515966d1267f42bd7d107562a0b9d5d3362-1606148357-1800-AYHPir56JSKk63KtQUR1vGzVipMSH3fHiDSDO/ireLEQ; __cfduid=dead0a14752f609eba026cd5f982dccec1606148356; wppas_pvbl=%5B%2252735%22%2C52736%5D; wppas_user_stats=%7B%221606089600%22%3A%7B%22impressions%22%3A%7B%22banners%22%3A%5B%2252735%22%2C52736%5D%7D%2C%22clicks%22%3A%7B%22banners%22%3A%5B%5D%7D%7D%7D'}
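
As the comment in the snippet above already notes, the request is issued a second time only to capture the request headers. A shortcut (a sketch, assuming cloudscraper's scraper object behaves like the requests.Session it subclasses) is to read the cookies and default headers straight off the session:

>>> import requests.utils
>>> # the Cloudflare clearance cookies from the first GET live in the session's cookie jar
>>> cookie_dict = requests.utils.dict_from_cookiejar(cs.cookies)
>>> # the default headers (User-Agent, Accept, ...) that the scraper sends
>>> session_headers = dict(cs.headers)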

So, if this request now works with those cookies, I should be able to use them in Scrapy to fetch the data:

>>> fetch('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94', headers=cs_response.request.headers)
2020-11-23 16:29:49 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94> (referer: None)
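
A side note on how the cookies are passed here: with the downloader middlewares disabled (see below) the raw Cookie header goes out untouched, but in a normal crawl Scrapy's CookiesMiddleware manages the Cookie header itself, and cookies are usually handed over through the cookies argument of scrapy.Request. A minimal sketch of that variant (an assumption to rule out, not a confirmed fix):

>>> import scrapy
>>> url = 'https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94'
>>> # strip the raw Cookie header and hand the cookies to Scrapy separately
>>> hdrs = {k: v for k, v in cs_response.request.headers.items() if k.lower() != 'cookie'}
>>> req = scrapy.Request(url, headers=hdrs, cookies={c.name: c.value for c in cs.cookies})
>>> fetch(req)  # fetch() in the Scrapy shell also accepts a Request object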

So it does not work in Scrapy. To check that it did not change the headers (I disabled all downloader middlewares for this test):

>>> response.request.headers
{b'User-Agent': [b'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:65.0) Gecko/20100101 Firefox/65.0'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en-US,en;q=0.5'], b'Accept-Encoding': [b'gzip, deflate'], b'Cookie': [b'__cf_bm=03ac0515966d1267f42bd7d107562a0b9d5d3362-1606148357-1800-AYHPir56JSKk63KtQUR1vGzVipMSH3fHiDSDO/ireLEQ; __cfduid=dead0a14752f609eba026cd5f982dccec1606148356; wppas_pvbl=%5B%2252735%22%2C52736%5D; wppas_user_stats=%7B%221606089600%22%3A%7B%22impressions%22%3A%7B%22banners%22%3A%5B%2252735%22%2C52736%5D%7D%2C%22clicks%22%3A%7B%22banners%22%3A%5B%5D%7D%7D%7D']}
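
To compare the two header sets programmatically rather than by eye, one quick check (a sketch; it assumes Scrapy's Headers.to_unicode_dict(), which converts the byte keys and values back to strings) is:

>>> sh = {k.lower(): v for k, v in response.request.headers.to_unicode_dict().items()}
>>> rh = {k.lower(): v for k, v in cs_response.request.headers.items()}
>>> # list any header that differs between the Scrapy request and the requests request
>>> [(k, sh.get(k), rh.get(k)) for k in set(sh) | set(rh) if sh.get(k) != rh.get(k)]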

If I try the same request with Python requests:

>>> requests.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94', headers=cs_response.request.headers)
2020-11-23 16:33:13 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): targetlaos.com:443
2020-11-23 16:33:14 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 200 None
<Response [200]>

So the "same" request (or rather: what I expect to be the exact same request) works in Python requests, but not in Scrapy. Does anyone know why this happens?

Note: I tested all of the above on an AWS EC2 instance running Ubuntu Server 20.04.1 and Python 3.7.9, since the Cloudflare page does not show up from my local IP.

What I have tried:

  • Disabling all downloader middlewares (a sketch of that setting follows this list)
  • The latest version of Scrapy (2.4.1)
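
For reference, a sketch of what "disabling all downloader middlewares" can look like in settings.py: the built-in middlewares are removed by mapping their default entries to None (only two are shown; the same pattern applies to the remaining entries of DOWNLOADER_MIDDLEWARES_BASE). This is a hedged illustration, not the exact configuration used above.

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': None,
    # ...and so on for the other built-in downloader middlewares
}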
