from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider

class BrokenLinksSpider(CrawlSpider):
    name = 'test'
    start_urls = ["your_url"]

    def parse(self, response):
        # 'flag' is True for internal pages, False for external ones;
        # it is None on the very first response (from start_urls).
        flag = response.meta.get('flag')
        if flag or flag is None:
            extractor = LinkExtractor(deny_domains="")
            links = extractor.extract_links(response)
            for link in links:
                # Internal links start with the site's own base URL
                # ("your_url" is a placeholder).
                if link.url.startswith("your_url"):
                    new_request = Request(link.url, callback=self.parse, meta={'flag': True})
                else:
                    new_request = Request(link.url, callback=self.parse, meta={'flag': False})
                yield new_request
process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.
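As a sketch of what such a process_links filter could look like: the callable receives the list of extracted links and returns the subset to follow. The names here (Link stand-in, filter_internal, the example domain) are illustrative, not from the question.

```python
from urllib.parse import urlparse

# Minimal stand-in for scrapy.link.Link: only the .url attribute matters here.
class Link:
    def __init__(self, url):
        self.url = url

def filter_internal(links, allowed_domain="example.com"):
    """process_links-style callable: keep only links on the allowed domain."""
    return [link for link in links if urlparse(link.url).netloc == allowed_domain]

# The filter would be wired into a rule roughly like:
# rules = (Rule(LinkExtractor(), callback='parse_item',
#               process_links='filter_internal', follow=True),)
links = [Link("https://example.com/a"), Link("https://other.org/b")]
internal = filter_internal(links)
```

Links dropped by the filter are never scheduled, so the crawl cannot recurse into external sites.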
I found a solution by passing a parameter to the callback function. If the URL is an internal link, I set the flag to True (otherwise False). If the flag comes back False (an external link), the crawler does not extract new links from that page. Here is my sample code (the spider shown above):
This is not a built-in solution; I believe you have to break the recursion yourself. You can do this easily by keeping a set of domains in the spider and breaking out of, or ignoring, requests to any other domain. Something like:
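A minimal sketch of that idea, with purely illustrative names (ALLOWED_DOMAINS, should_follow, and the example domain are assumptions, not from the answer):

```python
from urllib.parse import urlparse

# Domains the spider is allowed to recurse into ("example.com" is a placeholder).
ALLOWED_DOMAINS = {"example.com"}

def should_follow(url):
    """Return True only for links whose domain is in the allowed set."""
    return urlparse(url).netloc in ALLOWED_DOMAINS
```

Inside the spider's parse method, a link would only be turned into a new Request when should_follow(link.url) is true.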
You can base your spider on the CrawlSpider class and use a Rule with an implemented process_links method passed to the Rule. That method filters out unwanted links before they are followed. From the documentation (quoted above):