How do I use a proxy in a Scrapy crawler?



I'm trying to crawl a website that is only reachable through a proxy. I created a Scrapy project called scrapy_crawler with the following structure:

[screenshot: project structure]

I've read that I need to enable HttpProxyMiddleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}
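
If I understand the docs correctly, HttpProxyMiddleware also picks up the standard proxy environment variables, so one option might be to set those before the crawler process starts. Something like the sketch below (the URL is just a placeholder for my real proxy, and I haven't verified this):

import os

# Set the proxy environment variables that HttpProxyMiddleware reads on startup
# (placeholder credentials/host/port).
os.environ['http_proxy'] = 'http://username:password@myproxy:port'
os.environ['https_proxy'] = 'http://username:password@myproxy:port'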

From there I'm a bit lost. I think I need to attach the proxy to each request, but I'm not sure where to do that. I tried the following in the middleware.py file:

def process_start_requests(self, start_requests, spider):
    # Called with the start requests of the spider, and works
    # similarly to the process_spider_output() method, except
    # that it doesn't have a response associated.

    # Must return only requests (not items).
    for r in start_requests:
        r.meta['proxy'] = 'http://username:password@myproxy:port'
        yield r
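
Since process_start_requests is a spider middleware hook, I assume the class it lives in also has to be registered in SPIDER_MIDDLEWARES for it to run at all. Something like this in settings.py, where the module and class names are just my guess at what the project template generated:

SPIDER_MIDDLEWARES = {
    # Placeholder path: whichever class in middleware.py contains process_start_requests
    'scrapy_crawler.middleware.ScrapyCrawlerSpiderMiddleware': 543,
}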

Here is the digtionary.py file for reference:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ImdbCrawler(CrawlSpider):
    name = 'digtionary'
    allowed_domains = ['www.mywebsite.com']
    start_urls = ['https://mywebsite.com/digital/pages/start.aspx#']
    rules = (Rule(LinkExtractor()),)
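
I also wondered whether the requests generated by the rules would get the proxy at all, since process_start_requests only sees the start requests. A sketch of what I'm considering instead is a small custom downloader middleware that tags every outgoing request, registered before HttpProxyMiddleware (the class name and proxy URL are placeholders, not tested):

# middleware.py (sketch)
class CustomProxyMiddleware:
    def process_request(self, request, spider):
        # Attach the proxy to every request so HttpProxyMiddleware
        # can pick it up and handle the credentials.
        request.meta['proxy'] = 'http://username:password@myproxy:port'

# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawler.middleware.CustomProxyMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}

My thinking is that the lower number (90) makes it run before HttpProxyMiddleware, so meta['proxy'] is already set when that middleware looks for it.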

Any help would be greatly appreciated. Thanks in advance.

