Conditional link extraction with LinkExtractor - Q&A - Python中文网

Conditional link extraction with LinkExtractor

Posted 2024-09-30 12:17:57



I have a spider that takes a list of URLs and follows the next-page link from each of the start URLs, and it works.

rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="pagnNext"]',)), callback="parse_start_url", follow=True),)

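However, as you can imagine, at some point I start getting CAPTCHAs for some of the URLs. I have heard that pages may contain honeypot links that are invisible to a human but present in the HTML, designed so that anything that follows them is identified as a bot.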

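I would like the extractor to extract links conditionally, for example to skip a link (neither extract nor follow it) if its CSS style is display: none or something similar.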

Is this feasible?


1 answer

Anonymous user

#1 · Posted 2024-09-30 12:17:57

I would do it like this:

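# Methods of a scrapy.Spider subclass; the module needs `import scrapy`.
def parse_page1(self, response):
    # Placeholder selector: replace it with the condition you want to test.
    if response.css("thing i want to check exists"):
        # Request needs a URL string, so extract the href and make it absolute.
        next_href = response.xpath('//a[@class="pagnNext"]/@href').get()
        if next_href:
            return scrapy.Request(response.urljoin(next_href),
                                  callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)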

Official documentation: https://doc.scrapy.org/en/latest/topics/request-response.html
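For the "display: none" condition from the question, one possible approach is to inspect each anchor's own attributes before following it. The sketch below is only an illustration: the helper names `is_hidden_inline` and `visible_hrefs` are hypothetical, and it assumes the honeypot is hidden through an inline `style` attribute or the `hidden` attribute. Scrapy does not apply stylesheets, so links hidden by a CSS class or by JavaScript would not be caught this way.

```python
import re

# Matches declarations such as "display: none" or "visibility:hidden"
# inside an inline style attribute (case-insensitive, whitespace-tolerant).
_HIDDEN_STYLE = re.compile(
    r"(?:^|;)\s*(?:display\s*:\s*none|visibility\s*:\s*hidden)\b",
    re.IGNORECASE,
)


def is_hidden_inline(attrs):
    """Return True if an element's own attributes mark it as hidden."""
    if "hidden" in attrs:  # HTML boolean attribute, e.g. <a hidden>
        return True
    return bool(_HIDDEN_STYLE.search(attrs.get("style") or ""))


def visible_hrefs(anchors):
    """Return the href values of anchors that are not hidden inline.

    `anchors` is any iterable of attribute mappings, for example the
    `.attrib` dict of each selector matched by an XPath or CSS query.
    """
    return [
        a["href"]
        for a in anchors
        if a.get("href") and not is_hidden_inline(a)
    ]
```

Inside a callback this could be called as `visible_hrefs(a.attrib for a in response.xpath('//a[@class="pagnNext"]'))`, with each result passed through `response.urljoin` before building a request. For the CrawlSpider rule in the question, a similar inline-only check could probably be pushed into `restrict_xpaths` with a predicate such as `//a[@class="pagnNext"][not(contains(translate(@style, " ", ""), "display:none"))]`, with the same limitation regarding stylesheets.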

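Note: as for your CAPTCHA problem, try experimenting with your settings. At the very least, make sure DOWNLOAD_DELAY is set to a value other than 0. See the other options at https://doc.scrapy.org/en/latest/topics/settings.html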
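As a starting point, a throttling section in the project's settings.py might look like the sketch below. The setting names are standard Scrapy settings, but the values are assumptions chosen for illustration and would need tuning for the target site; slower crawling may reduce CAPTCHAs but does not guarantee it.

```python
# settings.py (illustrative values, tune them for the target site)

# Minimum number of seconds to wait between requests to the same site.
DOWNLOAD_DELAY = 2

# Wait a random time between 0.5x and 1.5x DOWNLOAD_DELAY instead of a fixed one.
RANDOMIZE_DOWNLOAD_DELAY = True

# Send at most one request at a time to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let the AutoThrottle extension adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30
```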
