Scraping all hostnames with Scrapy

Posted 2024-10-06 15:22:52


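I noticed that some of the sites I am trying to scrape redirect me to a different hostname: https://www.citibank.com.au/ redirects to, for example, https://www1.citibank.com.au/. Scrapy does scrape regular subdomains (www.subdomain.example.com), but it skips www2.example.com.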

This is apparently how Scrapy is designed to work (https://doc.ebichu.cc/scrapy/), namely:

OffsiteMiddleware
class scrapy.spidermiddlewares.offsite.OffsiteMiddleware
Filters out Requests for URLs outside the domains covered by the spider.
This middleware filters out every request whose host names aren’t in the spider’s allowed_domains attribute. All subdomains of any domain in the list are also allowed. E.g. the rule www.example.org will also allow bob.www.example.org but not www2.example.com nor example.com.
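
The subdomain rule quoted above can be illustrated with a small plain-Python sketch. This is an approximation written for this page, not Scrapy's actual source: the helper names build_host_regex and is_allowed are hypothetical, and the assumption is that a hostname passes when it equals an allowed domain or ends with a dot followed by it, and that an empty list lets everything through.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Compile a pattern matching each allowed domain and any of its subdomains."""
    if not allowed_domains:
        # Assumption: no allowed domains means no restriction at all.
        return re.compile("")
    alternatives = "|".join(re.escape(domain) for domain in allowed_domains)
    return re.compile(rf"^(.*\.)?({alternatives})$")


def is_allowed(url, allowed_domains):
    """Return True if the URL's hostname passes the allowed-domains filter."""
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


# www1 is a sibling of www, not a subdomain of it, so it is filtered out.
print(is_allowed("https://www1.citibank.com.au/", ["www.citibank.com.au"]))  # False
# Listing the parent domain covers www, www1, www2 and so on.
print(is_allowed("https://www1.citibank.com.au/", ["citibank.com.au"]))  # True
# A deeper subdomain of an allowed entry still passes.
print(is_allowed("https://bob.www.example.org/", ["www.example.org"]))  # True
```

Under this rule, www1.citibank.com.au fails against www.citibank.com.au because it is a sibling hostname, while it passes against the parent domain citibank.com.au.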

My question: how do I make sure that all subdomains with a different hostname (e.g. www2.example.com) get scraped?

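The only solution I can think of is to fill the allowed_domains list with every variant of the URL (e.g. [www.example.com, www1.example.com, www2.example.com, etc.]). Is that the way to go? Or am I overlooking an option in the Scrapy docs that solves this better?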


1 answer

User

#1 · Posted 2024-10-06 15:22:52

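That is how allowed_domains works in Scrapy. Its purpose is to filter out offsite requests, meaning requests to any host that is not covered by the domains you have allowed.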

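If you want Scrapy to visit multiple offsite domains, you do not need the allowed_domains attribute at all: remove it from the spider (or leave it empty) and the requests will go through.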

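In the specific case you mention, where the hosts are all part of the same domain and the problem is only with mirrors ("www1...", "www2..."), use the actual domain, without the www prefix: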

allowed_domains = ['example.com']
