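I'm trying to scrape a web shop in which every product has several variants. Each variant has its own URL (e.g. 10050545/V009). However, every single variant URL also exposes all the other variants of that product, which means the spider really only needs to visit one variant per product.
To achieve that, I tried to implement a Dupefilter that strips the variant ID from the URL. I assumed this would let the first request through, e.g. NL/10050545/V009, but not NL/10050545/V010.
What it actually does is let the requests through, but they then seem to get cancelled for some reason. The one thing I'm sure of is that the callback is never called.
Of course I could do this check before sending the request, but a DupeFilter looks prettier :)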
from scrapy.dupefilters import RFPDupeFilter


class Dupefilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        if 'product/' in request.url:
            # NL/10050545/V009 > NL/10050545/ where V009 is variant id
            request = request.replace(url='/'.join(request.url.split('/')[:-1:]))
        return super().request_fingerprint(request)
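To make the intent of the fingerprint override easier to check in isolation, here is a minimal sketch of just the URL normalisation, written without Scrapy. The helper name `strip_variant` is hypothetical, and the sketch assumes the variant ID is always the last path segment of a `/product/` URL, as in the URLs from the log. It is not the filter itself, only the key it would compute.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_variant(url: str) -> str:
    """Return the URL without its last path segment (assumed to be the variant id)."""
    parts = urlsplit(url)
    if "/product/" not in parts.path:
        # Category pages and other URLs are left untouched.
        return url
    # Drop the final segment: .../10018381/0003 -> .../10018381
    product_path = parts.path.rstrip("/").rsplit("/", 1)[0]
    return urlunsplit(
        (parts.scheme, parts.netloc, product_path, parts.query, parts.fragment)
    )


first = strip_variant(
    "https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0003"
)
second = strip_variant(
    "https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0004"
)
other = strip_variant(
    "https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10004713/0003"
)

print(first == second)  # variants of one product share a key
print(first == other)   # different products get different keys
```

If the filter keys requests this way, the second and later variants of a product would be expected to show up as "Filtered duplicate request", while the first variant should still be scheduled.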
Relevant Scrapy stats:
'dupefilter/filtered': 40
'response_received_count': 27,
'scheduler/dequeued': 52,
'scheduler/dequeued/memory': 52,
'scheduler/enqueued': 52,
'scheduler/enqueued/memory': 52,
'request_depth_max': 26,
And the requests are very simple:
for href in res.xpath('.//a[contains(@class,"-product-status-ACTIVE")]/@href').getall():
    url = response.urljoin(href)
    self.logger.debug(f"Visiting {url}")
    yield Request(url=url, callback=self.parse_details)
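For comparison, the alternative mentioned earlier (checking before the request is sent) might look roughly like the sketch below. The function `first_variant_per_product` and the `seen_products` set are hypothetical names, and the sketch again assumes URLs end in `<product id>/<variant id>`. It is plain Python so it can be tried without a running spider.

```python
from typing import Iterable, Iterator, Set


def first_variant_per_product(urls: Iterable[str], seen: Set[str]) -> Iterator[str]:
    """Yield only the first URL encountered for each product.

    Assumes each URL ends in .../<product id>/<variant id>.
    """
    for url in urls:
        # Everything before the last path segment identifies the product.
        product_key = url.rstrip("/").rsplit("/", 1)[0]
        if product_key in seen:
            continue
        seen.add(product_key)
        yield url


seen_products: Set[str] = set()
urls = [
    "https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0003",
    "https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0004",
    "https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0034",
    "https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10004713/0003",
    "https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10004713/0004",
]
to_visit = list(first_variant_per_product(urls, seen_products))
print(to_visit)
```

In a spider, the set would probably live on the spider instance so it persists across category pages, and the loop would yield a `Request` for each URL that survives the check.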
Log:
2020-10-22 21:11:48 [sloggi] INFO: Found 265 products on category page (url https://b2b.triumph.com/webstore/v2/products/NL_sloggiPROD?6))
2020-10-22 21:11:48 [sloggi] DEBUG: Visiting https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0003
2020-10-22 21:11:48 [sloggi] DEBUG: Visiting https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0004
2020-10-22 21:11:48 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0004> (referer: https://b2b.triumph.com/webstore/v2/products/NL_sloggiPROD?6)
2020-10-22 21:11:48 [sloggi] DEBUG: Visiting https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0034
2020-10-22 21:11:48 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10018381/0034> (referer: https://b2b.triumph.com/webstore/v2/products/NL_sloggiPROD?6)
2020-10-22 21:11:48 [sloggi] DEBUG: Visiting https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10004713/0003
2020-10-22 21:11:48 [sloggi] DEBUG: Visiting https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10004713/0004
2020-10-22 21:11:48 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://b2b.triumph.com/webstore/v2/product/NL_sloggiPROD/10004713/0004> (referer: https://b2b.triumph.com/webstore/v2/products/NL_sloggiPROD?6)