Scrapy: subclassing LinkExtractor raises TypeError: MyLinkExtractor() got an unexpected keyword argument 'allow'


I'm scraping a news site with Scrapy and saving the scraped items to a database with SQLAlchemy. The crawl job runs periodically, and I want to ignore URLs that have not changed since the last crawl.

I'm trying to subclass LinkExtractor and return an empty list in case response.url was crawled more recently than it was updated.

But when I run "scrapy crawl spider_name", I get:

TypeError: MyLinkExtractor() got an unexpected keyword argument 'allow'

The code:

def MyLinkExtractor(LinkExtractor):
    '''This class should redefine the method extract_links to
    filter out all links from pages which were not modified since
    the last crawling'''
    def __init__(self, *args, **kwargs):
        """
        Initializes database connection and sessionmaker.
        """
        engine = db_connect()
        self.Session = sessionmaker(bind=engine)
        super(MyLinkExtractor, self).__init__(*args, **kwargs)

    def extract_links(self, response):
        all_links = super(MyLinkExtractor, self).extract_links(response)

        # Return empty list if current url was recently crawled
        session = self.Session()
        url_in_db = session.query(Page).filter(Page.url==response.url).all()
        if url_in_db and url_in_db[0].last_crawled.replace(tzinfo=pytz.UTC) > item['header_last_modified']:
            return []

        return all_links

...

class MySpider(CrawlSpider):

    def __init__(self, *args, **kwargs):
        """
        Initializes database connection and sessionmaker.
        """
        engine = db_connect()
        self.Session = sessionmaker(bind=engine)
        super(MySpider, self).__init__(*args, **kwargs)

    ...

    # Define list of regex of links that should be followed
    links_regex_to_follow = [
        r'some_url_pattern',
        ]

    rules = (Rule(MyLinkExtractor(allow=links_regex_to_follow),
                  callback='handle_news',
                  follow=True),    
             )

    def handle_news(self, response):

        item = MyItem()
        item['url'] = response.url
        session = self.Session()

        # ... Process the item and extract meaningful info

        # Register when the item was crawled
        item['last_crawled'] = datetime.datetime.utcnow().replace(tzinfo=pytz.UTC)

        # Register when the page was last-modified
        date_string = response.headers.get('Last-Modified', None).decode('utf-8')
        item['header_last_modified'] = get_datetime_from_http_str(date_string)

        yield item

The strangest thing is that if I replace MyLinkExtractor with the plain LinkExtractor in the rules definition, it runs.

But if I leave MyLinkExtractor in the rules definition and redefine MyLinkExtractor as:

def MyLinkExtractor(LinkExtractor):
    '''This class should redefine the method extract_links to
    filter out all links from pages which were not modified since
    the last crawling'''
    pass

I get the same error.


1 Answer

Your MyLinkExtractor is not a class but a function, because you declared it with def rather than class. That's hard to spot, because Python allows you to declare functions inside other functions, and none of those names is really reserved.
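You can reproduce the mechanics without Scrapy at all. The def creates a function whose only parameter happens to be named LinkExtractor, so calling it with the keyword argument allow is rejected before the body ever runs:

def MyLinkExtractor(LinkExtractor):
    pass

MyLinkExtractor(allow=['some_url_pattern'])
# TypeError: MyLinkExtractor() got an unexpected keyword argument 'allow'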

In any case, the stack trace would look a bit different if a properly declared class were failing to instantiate: you would see the name of the last function that failed (MyLinkExtractor's __init__).
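The fix is to declare it with the class keyword. A minimal sketch of the corrected declaration, assuming the same db_connect and sessionmaker helpers that the question's code already uses:

from scrapy.linkextractors import LinkExtractor

class MyLinkExtractor(LinkExtractor):  # class, not def
    '''Filter out links from pages that were not modified since
    the last crawl.'''

    def __init__(self, *args, **kwargs):
        # Initialize the database connection and sessionmaker, then
        # forward everything else (including allow=...) to
        # LinkExtractor.__init__ via *args/**kwargs.
        engine = db_connect()
        self.Session = sessionmaker(bind=engine)
        super(MyLinkExtractor, self).__init__(*args, **kwargs)

With that change, MyLinkExtractor(allow=links_regex_to_follow) in the rules definition accepts the allow keyword and passes it through to the base LinkExtractor.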
