Scrapy cancels all requests when using a custom dupefilter

Published 2024-10-02 16:30:46


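I am trying to scrape a webshop in which every product has multiple variants. Each variant has its own URL (e.g. 10050545/V009). However, every individual variant URL also exposes all the other variants of that product, so the spider really only needs to visit one variant per product.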

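To achieve this, I tried to implement a Dupefilter that strips the variant ID from the URL. I assumed this would let the first request through, e.g. NL/10050545/V009, but not NL/10050545/V010.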

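What it actually does is let the requests through, but they then appear to be cancelled for some reason. The one thing I am sure of is that the callback is never called.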

Of course, I could check this myself before sending each request, but a DupeFilter seems like the nicer solution :)

from scrapy.dupefilters import RFPDupeFilter


class Dupefilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        if 'product/' in request.url:
            # NL/10050545/V009 > NL/10050545, where V009 is the variant id
            request = request.replace(url=request.url.rsplit('/', 1)[0])
        return super().request_fingerprint(request)
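
The URL rewrite that `request_fingerprint` performs can be checked in isolation, without Scrapy. The following is a minimal sketch of just that step. The helper name `strip_variant` and the `example.com` URLs are made up for illustration, and the path layout is assumed to mirror the NL/10050545/V009 example.

```python
def strip_variant(url: str) -> str:
    """Drop the last path segment (the variant id) from product URLs only."""
    if 'product/' in url:
        return url.rsplit('/', 1)[0]
    return url


# Hypothetical URLs following the NL/<product id>/<variant id> layout.
first = strip_variant('https://example.com/product/NL/10050545/V009')
second = strip_variant('https://example.com/product/NL/10050545/V010')

print(first)            # https://example.com/product/NL/10050545
print(first == second)  # True: both variants map to the same key
```

Because both variants collapse to the same string, otherwise identical GET requests for them should get the same fingerprint, so only the first request per product gets past the filter. Scrapy only uses the subclass if it is named in the `DUPEFILTER_CLASS` setting.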

Relevant Scrapy stats:

'dupefilter/filtered': 40,
'response_received_count': 27,
'scheduler/dequeued': 52,
'scheduler/dequeued/memory': 52,
'scheduler/enqueued': 52,
'scheduler/enqueued/memory': 52,
'request_depth_max': 26,

And the requests themselves are very simple:

for href in res.xpath('.//a[contains(@class,"-product-status-ACTIVE")]/@href').getall():
    url = response.urljoin(href)
    self.logger.debug(f"Visiting {url}")
    yield Request(url=url, callback=self.parse_details)
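
The pre-check mentioned earlier, where the spider itself decides whether a product has already been requested, could look roughly like this. It is a sketch under assumptions: `unseen_products` and the `seen` set are hypothetical names, and it avoids Scrapy imports so that it runs standalone. In the loop above, the `yield Request(...)` would sit inside a `for` over its result.

```python
from typing import Iterable, Iterator, Set


def unseen_products(urls: Iterable[str], seen: Set[str]) -> Iterator[str]:
    """Yield only the first variant URL encountered for each product."""
    for url in urls:
        # Everything before the last '/' identifies the product.
        product_key = url.rsplit('/', 1)[0]
        if product_key in seen:
            continue
        seen.add(product_key)
        yield url


base = 'https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD'
urls = [f'{base}/10018381/0003', f'{base}/10018381/0004', f'{base}/10004713/0003']
kept = list(unseen_products(urls, set()))
print(kept)  # the 10018381/0004 variant is skipped
```

In a real spider the `seen` set would presumably live on the spider instance, so that it persists across category pages.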

Log:

2020-10-22 21:11:48 [sloggi] INFO: Found 265 products on category page  (url https://b2b.triumph.com/webstore/v2/products/NL_sloggiPROD?6))
2020-10-22 21:11:48 [sloggi] DEBUG: Visiting https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0003
2020-10-22 21:11:48 [sloggi] DEBUG: Visiting https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0004
2020-10-22 21:11:48 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0004> (referer: https://b2b.triumph.com/webstore/v2/products/NL_sloggiPROD?6)
2020-10-22 21:11:48 [sloggi] DEBUG: Visiting https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0034
2020-10-22 21:11:48 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0034> (referer: https://b2b.triumph.com/webstore/v2/products/NL_sloggiPROD?6)
2020-10-22 21:11:48 [sloggi] DEBUG: Visiting https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10004713/0003
2020-10-22 21:11:48 [sloggi] DEBUG: Visiting https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10004713/0004
2020-10-22 21:11:48 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10004713/0004> (referer: https://b2b.triumph.com/webstore/v2/products/NL_sloggiPROD?6)
