I am trying to crawl a website that is only reachable through a proxy. I created a Scrapy project named scrapy_crawler with the following structure:
I have read that I need to enable HttpProxyMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}
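I also saw that HttpProxyMiddleware can pick up the standard proxy environment variables, so setting those before the crawl starts might be an alternative to per-request configuration. A small sketch of that idea, reusing my placeholder proxy URL:

```python
import os

# Sketch only: HttpProxyMiddleware reads the standard proxy environment
# variables (via urllib's getproxies()) when it is initialized, so setting
# them before the crawl starts is an alternative to per-request meta.
# The URL below is the same placeholder as above, not a real proxy.
os.environ["http_proxy"] = "http://username:password@myproxy:port"
os.environ["https_proxy"] = "http://username:password@myproxy:port"

print(os.environ["http_proxy"])
```

Note these variables are read when the middleware is created, so they must be set before the crawler process starts.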
From there I am a bit lost. I think I need to attach the proxy to each request, but I am not sure where to do that. I tried the following in my middleware.py file:
def process_start_requests(self, start_requests, spider):
    # Called with the start requests of the spider, and works
    # similarly to the process_spider_output() method, except
    # that it doesn't have a response associated.
    # Must return only requests (not items).
    for r in start_requests:
        r.meta['proxy'] = 'http://username:password@myproxy:port'
        yield r
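From what I understand, process_start_requests is a spider-middleware hook, so the class containing it presumably needs to be registered under SPIDER_MIDDLEWARES (not DOWNLOADER_MIDDLEWARES) before it runs at all. A minimal self-contained sketch of that idea; DummyRequest stands in for scrapy.Request so it runs without Scrapy, and ProxySpiderMiddleware is a name of my own choosing:

```python
# Minimal sketch of the spider-middleware approach, runnable without Scrapy.
# ProxySpiderMiddleware is a hypothetical class that would be registered in
# settings.py, e.g.:
#   SPIDER_MIDDLEWARES = {'scrapy_crawler.middlewares.ProxySpiderMiddleware': 543}

class DummyRequest:
    """Stand-in for scrapy.Request: just a url and a meta dict."""
    def __init__(self, url):
        self.url = url
        self.meta = {}

class ProxySpiderMiddleware:
    PROXY = 'http://username:password@myproxy:port'  # placeholder from above

    def process_start_requests(self, start_requests, spider):
        # Tag every start request with the proxy; HttpProxyMiddleware then
        # routes each request through whatever meta['proxy'] names.
        for r in start_requests:
            r.meta['proxy'] = self.PROXY
            yield r

mw = ProxySpiderMiddleware()
requests = [DummyRequest('https://mywebsite.com/digital/pages/start.aspx#')]
tagged = list(mw.process_start_requests(iter(requests), spider=None))
print(tagged[0].meta['proxy'])
```

One caveat with this approach: it only touches the start requests, not the follow-up requests that the CrawlSpider rules extract.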
Here is the digtionary.py file for reference:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ImdbCrawler(CrawlSpider):
    name = 'digtionary'
    allowed_domains = ['www.mywebsite.com']
    start_urls = ['https://mywebsite.com/digital/pages/start.aspx#']
    rules = (Rule(LinkExtractor()),)
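An alternative I have been considering is setting the proxy from a custom downloader middleware's process_request instead, which would cover every request the CrawlSpider rules generate, not only the start requests. It would need a priority number lower than HttpProxyMiddleware's (100 in my settings) so it runs first. A runnable sketch with a stand-in request object; the class name is my own, not Scrapy's:

```python
# Sketch of a custom downloader middleware that attaches the proxy to every
# outgoing request. Hypothetical registration in settings.py:
#   DOWNLOADER_MIDDLEWARES = {
#       'scrapy_crawler.middlewares.CustomProxyMiddleware': 50,  # before 100
#       'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
#   }

class DummyRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""
    def __init__(self, url):
        self.url = url
        self.meta = {}

class CustomProxyMiddleware:
    PROXY = 'http://username:password@myproxy:port'  # placeholder credentials

    def process_request(self, request, spider):
        # Returning None tells Scrapy to continue down the middleware chain,
        # where HttpProxyMiddleware will honor request.meta['proxy'].
        request.meta['proxy'] = self.PROXY
        return None

mw = CustomProxyMiddleware()
req = DummyRequest('https://mywebsite.com/digital/pages/start.aspx#')
mw.process_request(req, spider=None)
print(req.meta['proxy'])
```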
Any help would be greatly appreciated. Thanks in advance.