Why do proxies fail in Scrapy but succeed when the request is made with the python requests library

Posted 2024-05-18 15:18:59


I have a list of, say, 100 proxies, and to test them I make a request to Google and check the response. When these requests are run through python requests, every one of them returns successfully, but when I try the same thing under Scrapy, 99% of the proxies fail. Am I missing something in Scrapy, or am I using the proxies incorrectly?

The proxies are stored in a file in the following format:

http://123.123.123.123:8080
https://234.234.234.234:8080
http://321.321.321.321:8080
...

And here is the script I use to test them with python requests:

import time

import requests

proxyPool = []
with open("proxy_pool.txt", "r") as f:
    proxyPool = f.readlines()

proxyPool = [x.strip() for x in proxyPool]

for proxyItem in proxyPool:
    # Strip the http/s from the ip
    proxy = proxyItem.rsplit("/")[-1].split(":")
    proxy = "{proxy}:{port}".format(proxy=proxy[0], port=proxy[1])
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36", }

    proxySession = requests.Session()
    proxySession.proxies = {"http://": proxy, "https://": proxy}
    proxySession.headers.update(headers)
    resp = proxySession.get("https://www.google.com/")

    if resp.status_code == 200:
        print(f"Requests with proxies: {proxySession.proxies} - Successful")
    else:
        print(f"Requests with proxies: {proxySession.proxies} - Unsuccessful")
    time.sleep(3)
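One thing worth double-checking in the script above: `requests` matches proxies by URL scheme, so the keys of the proxies mapping should be plain `"http"` / `"https"` and the values should be the full proxy URL, scheme included. With keys like `"http://"` the mapping matches nothing and the session may not route through the proxy at all. A minimal sketch of building the mapping from one line of `proxy_pool.txt` (the helper name `make_proxies` is made up for illustration):

```python
def make_proxies(line):
    # requests selects a proxy by the scheme of the target URL, so
    # the dict keys must be "http" / "https" and the values the full
    # proxy URL (scheme kept, not stripped off)
    url = line.strip()
    return {"http": url, "https": url}

print(make_proxies("http://123.123.123.123:8080\n"))
```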

And the Scrapy spider:

import scrapy
from scrapy import Request


class ProxySpider(scrapy.Spider):
    name = "proxyspider"

    start_urls = ["https://www.google.com/"]

    def start_requests(self):
        with open("proxy_pool.txt", "r") as f:
            for proxy in f.readlines():
                proxy = proxy.strip()
                headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36", }

                yield Request(url=self.start_urls[0], callback=self.parse, headers=headers, meta={"proxy": proxy}, dont_filter=True)

    def parse(self, response):
        self.logger.info(f'Parsing: {response.url}')
        if response.status == 200:
            print(f"Requests with proxies: {response.meta['proxy']} - Successful")
        else:
            print(f"Requests with proxies: {response.meta['proxy']} - Unsuccessful")

1 Answer
Answered 2024-05-18 15:18:59

In your requests-based code sample you create multiple sessions (one session per proxy).

Under Scrapy's default settings, however, the application uses a single cookiejar for all proxies, so it sends the same cookie data through every proxy. To keep the sessions separate, you need to use the `cookiejar` meta key in your requests.

If a web server receives requests from multiple IPs that all carry the same session id in the Cookie header, that looks suspicious: the server can identify it as a bot and ban every IP involved. Most likely that is exactly what happened here.
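A minimal sketch of that fix, assuming the same proxy list as in the question: pair each proxy with its own `cookiejar` id and pass the resulting dict as the request `meta`, so Scrapy's cookies middleware keeps one jar per proxy instead of one shared jar for everything. The helper name `per_proxy_meta` is made up for illustration:

```python
def per_proxy_meta(proxies):
    # each proxy gets a distinct cookiejar id; Scrapy's cookies
    # middleware then tracks a separate cookie session per proxy
    return [{"proxy": proxy, "cookiejar": jar_id}
            for jar_id, proxy in enumerate(proxies)]

# In the spider's start_requests, each yielded Request would then use
# one of these dicts:
#     yield Request(url=self.start_urls[0], callback=self.parse,
#                   headers=headers, meta=meta, dont_filter=True)
```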
