I'm trying to write a custom link extractor based on Scrapy's LxmlLinkExtractor. The goal is to add a maxpages parameter that stops following links for a domain once the limit is reached (and moves on to the next domain). However, I can't get the custom link extractor to work:
from scrapy.linkextractors.lxmlhtml import *

class LimitedLinkExtractor(FilteringLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=False,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=(),
                 strip=True, maxpages=10):  # added maxpages
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs
        lx = LxmlParserLinkExtractor(
            tag=tag_func,
            attr=attr_func,
            unique=unique,
            process=process_value,
            strip=strip,
            canonicalized=canonicalize,
        )

        super(FilteringLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions, maxpages=maxpages)  # added maxpages

    def extract_links(self, response):
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [subdoc
                    for x in self.restrict_xpaths
                    for subdoc in response.xpath(x)]
        else:
            docs = [response.selector]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            links = links[0:self.max_pages]  # added maxpages
            all_links.extend(self._process_links(links))
        return unique_list(all_links)
Apart from the lines marked with # added maxpages comments, everything is identical to the LxmlLinkExtractor that Scrapy ships by default in lxmlhtml.py. The error I get is:
"TypeError: object.__init__() takes exactly one argument (the instance to initialize)"
Change super(FilteringLinkExtractor, self) to super(LimitedLinkExtractor, self). Since LimitedLinkExtractor subclasses FilteringLinkExtractor directly, super(FilteringLinkExtractor, self) skips past FilteringLinkExtractor in the MRO and resolves to object, so all of your keyword arguments are handed to object.__init__(), which is exactly the TypeError you are seeing.
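With that change the call actually reaches FilteringLinkExtractor.__init__, but two more problems in the posted code will surface right away: maxpages is forwarded to the parent __init__, which has no such parameter, and extract_links slices with self.max_pages, which is never assigned. Below is a minimal sketch of a corrected class, assuming Scrapy 1.x internals (where FilteringLinkExtractor and LxmlParserLinkExtractor live under scrapy.linkextractors); the limit is stored on the instance as self.maxpages instead of being passed up:

from scrapy.linkextractors import FilteringLinkExtractor
from scrapy.linkextractors.lxmlhtml import LxmlParserLinkExtractor
from scrapy.utils.misc import arg_to_iter
from scrapy.utils.python import unique as unique_list
from scrapy.utils.response import get_base_url

class LimitedLinkExtractor(FilteringLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=False,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=(),
                 strip=True, maxpages=10):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        lx = LxmlParserLinkExtractor(
            tag=lambda x: x in tags,
            attr=lambda x: x in attrs,
            unique=unique,
            process=process_value,
            strip=strip,
            canonicalized=canonicalize,
        )
        # Keep the limit on the instance; the parent __init__ does not know about it.
        self.maxpages = maxpages
        # super(LimitedLinkExtractor, ...) resolves to FilteringLinkExtractor here.
        super(LimitedLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)

    def extract_links(self, response):
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [subdoc
                    for x in self.restrict_xpaths
                    for subdoc in response.xpath(x)]
        else:
            docs = [response.selector]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            links = links[:self.maxpages]  # keep at most maxpages links per response
            all_links.extend(self._process_links(links))
        return unique_list(all_links)

One design note: slicing in extract_links caps the number of links taken from each response, not the number of pages crawled per domain. A true per-domain page budget would need a counter in the spider or a downloader middleware that drops requests once the domain's limit is hit.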