Scrapy: follow external links only one level deep

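User

Reply #1 · edited 2024-10-05 14:24:51

I found a solution by passing an argument to the callback. If the URL is an internal link, I set a flag to True (otherwise False). When the flag comes back False (the page was reached through an external link), the spider does not extract any new links from it. Here is my sample code: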

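import scrapy
from scrapy.linkextractors import LinkExtractor


class BrokenLinksSpider(scrapy.Spider):
    # A plain Spider is used here: CrawlSpider relies on parse() for its own
    # rule handling, so overriding parse() in a CrawlSpider is discouraged.
    name = "test"
    start_urls = ["your_url"]

    def parse(self, response):
        # flag is None for the start URLs, True for pages reached through an
        # internal link, and False for pages reached through an external link.
        flag = response.meta.get("flag")
        if flag or flag is None:
            extractor = LinkExtractor()
            for link in extractor.extract_links(response):
                # External pages are still requested once, but carry
                # flag=False so their own links are not extracted.
                is_internal = link.url.startswith("your_url")
                yield scrapy.Request(link.url, callback=self.parse,
                                     meta={"flag": is_internal})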
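
A prefix comparison on the URL string can misclassify links: a hostile or merely unlucky host such as example.com.evil.test starts with the same characters as example.com. A minimal sketch of a stricter check follows, comparing host names parsed with the standard library. The helper name is_internal and the example URLs are assumptions made for illustration; they are not part of the answer above.

```python
from urllib.parse import urlparse


def is_internal(url, base_url):
    """Return True when url points at the same host as base_url.

    Hypothetical helper: it compares the network location (host[:port])
    of both URLs, ignoring case, instead of comparing string prefixes.
    """
    return urlparse(url).netloc.lower() == urlparse(base_url).netloc.lower()
```

Inside the callback, the result could be stored in meta in place of the prefix test, for example meta={"flag": is_internal(link.url, response.url)} when the current page is known to be internal.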

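User

Reply #2 · edited 2024-10-05 14:24:51

This is not a built-in solution, but I believe you have to break the recursion yourself. You can do that easily by keeping a collection of domains (a set) in the spider and stopping, or ignoring the response, once it grows past a limit.

Something like this: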

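from urllib.parse import urlparse

self.track = set()  # domains seen so far, created once in the spider

...
domain = urlparse(response.url).netloc
self.track.add(domain)
if len(self.track) > MAX_RECURSION:
    self.track.remove(domain)
    # A plain return also ends a generator callback; do not raise
    # StopIteration inside a generator (it becomes a RuntimeError in Python 3.7+).
    return None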
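
The fragment above can be packaged as a small standalone object so the bookkeeping is easy to test outside a spider. This is only a sketch: the class name DomainTracker, its allow method, and the default limit are assumptions introduced for illustration, and it caps the number of distinct domains rather than a true link depth, just as the fragment does.

```python
from urllib.parse import urlparse

MAX_DOMAINS = 3  # hypothetical limit, playing the role of MAX_RECURSION


class DomainTracker:
    """Remember which domains were seen and refuse new ones past a limit."""

    def __init__(self, limit=MAX_DOMAINS):
        self.limit = limit
        self.seen = set()

    def allow(self, url):
        """Return True if the URL's domain is known or still fits the limit."""
        domain = urlparse(url).netloc
        if domain in self.seen:
            return True
        if len(self.seen) >= self.limit:
            # Limit reached: the new domain is not recorded and is rejected.
            return False
        self.seen.add(domain)
        return True
```

In a spider callback, one might create the tracker once in __init__ and return early when self.tracker.allow(response.url) is False.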

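User

Reply #3 · edited 2024-10-05 14:24:51

You can base your spider on the CrawlSpider class and use a Rule together with a process_links method that you implement and pass to the Rule. That method filters out unwanted links before they are followed. From the documentation: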

process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.
