I have a list of, say, 100 proxies. To test them, I make a request to Google and check the response. When I run these requests through python requests, every single one returns successfully, but when I try the same thing under Scrapy, 99% of the proxies fail. Am I missing something in Scrapy, or am I using the proxies incorrectly?
The proxies are stored in a file in the following format:
http://123.123.123.123:8080
https://234.234.234.234:8080
http://321.321.321.321:8080
...
Here is the script I use to test them with python requests:
import time

import requests

proxyPool = []
with open("proxy_pool.txt", "r") as f:
    proxyPool = f.readlines()
proxyPool = [x.strip() for x in proxyPool]

for proxyItem in proxyPool:
    # Strip the http/s scheme from the proxy URL, leaving ip:port
    proxy = proxyItem.rsplit("/")[-1].split(":")
    proxy = "{proxy}:{port}".format(proxy=proxy[0], port=proxy[1])
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"}
    proxySession = requests.Session()
    # requests expects the scheme names ("http"/"https") as keys, not "http://"
    proxySession.proxies = {"http": proxy, "https": proxy}
    proxySession.headers.update(headers)
    resp = proxySession.get("https://www.google.com/")
    if resp.status_code == 200:
        print(f"Requests with proxies: {proxySession.proxies} - Successful")
    else:
        print(f"Requests with proxies: {proxySession.proxies} - Unsuccessful")
    time.sleep(3)
And here is the Scrapy spider:
import scrapy
from scrapy import Request


class ProxySpider(scrapy.Spider):
    name = "proxyspider"
    start_urls = ["https://www.google.com/"]

    def start_requests(self):
        with open("proxy_pool.txt", "r") as f:
            for proxy in f.readlines():
                proxy = proxy.strip()
                headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"}
                yield Request(url=self.start_urls[0], callback=self.parse, headers=headers, meta={"proxy": proxy}, dont_filter=True)

    def parse(self, response):
        self.logger.info(f'Parsing: {response.url}')
        if response.status == 200:
            print(f"Requests with proxies: {response.meta['proxy']} - Successful")
        else:
            print(f"Requests with proxies: {response.meta['proxy']} - Unsuccessful")
In the code example you built with requests, you create multiple sessions (one session per proxy). Under Scrapy's default settings, however, the application uses a single cookie session for all proxies and therefore sends the same cookie data through every one of them. You need to use the cookiejar meta key in your requests (see the sketch below).

If a web server receives requests from multiple IPs that all transmit a single sessionId in the Cookie header, that looks suspicious: the server can recognize the client as a bot and ban every IP involved. Most likely that is exactly what happened here.
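A minimal sketch of that fix, reusing the proxy_pool.txt file and spider from the question: giving each proxy its own numbered cookie jar via the cookiejar meta key keeps its cookies isolated from the other proxies. This relies on Scrapy's built-in CookiesMiddleware, i.e. the default COOKIES_ENABLED=True.

import scrapy
from scrapy import Request


class ProxySpider(scrapy.Spider):
    name = "proxyspider"
    start_urls = ["https://www.google.com/"]

    def start_requests(self):
        with open("proxy_pool.txt", "r") as f:
            proxies = [line.strip() for line in f if line.strip()]
        for jar_id, proxy in enumerate(proxies):
            yield Request(
                url=self.start_urls[0],
                callback=self.parse,
                # One cookie jar per proxy, so no sessionId is ever
                # shared across different exit IPs.
                meta={"proxy": proxy, "cookiejar": jar_id},
                dont_filter=True,
            )

    def parse(self, response):
        self.logger.info(f"Proxy {response.meta['proxy']} -> {response.status}")
        # Any follow-up request through the same proxy should carry the
        # same jar id: meta={"proxy": ..., "cookiejar": response.meta["cookiejar"]}

With separate jars, the target site sees an independent cookie session behind each IP instead of one session hopping between a hundred addresses, which is the pattern described above as triggering the ban.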