URLs expiring due to timestamp authentication in Scrapy


I'm trying to crawl Amazon Grocery UK and collect the grocery categories, using the Associates Product Advertising API. My requests get queued, but because a signed request expires after 15 minutes, some requests are only crawled more than 15 minutes after entering the queue, which means they have already expired by the time they are fetched and come back with a 400 error. I was considering a solution that queues requests in batches, but even that would fail if implemented as control over batch processing, because the problem is preparing the requests in batches, not processing them in batches. Unfortunately Scrapy has next to no documentation for this use case, so how can requests be prepared in batches?

from scrapy.spiders import XMLFeedSpider
from scrapy.utils.misc import arg_to_iter
from scrapy.loader.processors import TakeFirst


from crawlers.http import AmazonApiRequest
from crawlers.items import (AmazonCategoryItemLoader)
from crawlers.spiders import MySpider


class AmazonCategorySpider(XMLFeedSpider, MySpider):
    name = 'amazon_categories'
    allowed_domains = ['amazon.co.uk', 'ecs.amazonaws.co.uk']
    marketplace_domain_name = 'amazon.co.uk'
    download_delay = 1
    rotate_user_agent = 1

    grocery_node_id = 344155031

    # XMLSpider attributes
    iterator = 'xml'
    itertag = 'BrowseNodes/BrowseNode/Children/BrowseNode'

    def start_requests(self):
        return arg_to_iter(
            AmazonApiRequest(
                qargs=dict(Operation='BrowseNodeLookup',
                           BrowseNodeId=self.grocery_node_id),
                meta=dict(ancestor_node_id=self.grocery_node_id)
            ))

    def parse(self, response):
        response.selector.remove_namespaces()
        has_children = bool(response.xpath('//BrowseNodes/BrowseNode/Children'))
        if not has_children:
            return response.meta['category']
        # here the request should be configurable to allow batching
        return super(AmazonCategorySpider, self).parse(response)

    def parse_node(self, response, node):
        category = response.meta.get('category')
        l = AmazonCategoryItemLoader(selector=node)
        l.add_xpath('name', 'Name/text()')
        l.add_value('parent', category)
        node_id = l.get_xpath('BrowseNodeId/text()', TakeFirst(), lambda x: int(x))
        l.add_value('node_id', node_id)
        category_item = l.load_item()
        return AmazonApiRequest(
            qargs=dict(Operation='BrowseNodeLookup',
                       BrowseNodeId=node_id),
            meta=dict(ancestor_node_id=node_id,
                      category=category_item)
        )
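
For context, the expiry comes from how these API URLs are signed: a Product Advertising API request embeds a Timestamp parameter and an HMAC-SHA256 signature over the sorted query string, so the 15-minute window starts when the request object is built, not when it is downloaded. A minimal sketch of that signing scheme (the sign_paapi_url helper is hypothetical; AmazonApiRequest presumably does something equivalent internally):

import base64
import hashlib
import hmac
from datetime import datetime
from urllib.parse import quote


def sign_paapi_url(qargs, access_key, secret_key, associate_tag,
                   host='ecs.amazonaws.co.uk'):
    # The Timestamp is fixed here, at signing time; Amazon rejects the
    # URL with 400 RequestExpired once it is ~15 minutes old.
    params = dict(qargs,
                  Service='AWSECommerceService',
                  AWSAccessKeyId=access_key,
                  AssociateTag=associate_tag,
                  Timestamp=datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))
    # Canonical query string: parameters sorted by name, RFC 3986 encoded
    canonical = '&'.join('%s=%s' % (quote(str(k), safe='-_.~'),
                                    quote(str(v), safe='-_.~'))
                         for k, v in sorted(params.items()))
    to_sign = 'GET\n%s\n/onca/xml\n%s' % (host, canonical)
    digest = hmac.new(secret_key.encode(), to_sign.encode(),
                      hashlib.sha256).digest()
    signature = quote(base64.b64encode(digest), safe='')
    return 'http://%s/onca/xml?%s&Signature=%s' % (host, canonical, signature)

So batching only helps if each batch is signed right before it is scheduled; anything that sits in the scheduler longer than the window goes stale regardless.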

1 Answer

One way to do this:

Since there are two places where requests are generated, you can use the priority attribute to make the requests coming from the parse method jump ahead of those from start_requests:

from scrapy import Spider, Request


class MySpider(Spider):
    name = 'myspider'

    def start_requests(self):
        # the long backlog of initial requests (default priority 0)
        for url in very_long_list:
            yield Request(url)

    def parse(self, response):
        # follow-up requests jump the queue thanks to the higher priority
        for url in short_list:
            yield Request(url, self.parse_item, priority=1000)

    def parse_item(self, response):
        pass  # parse the item here

In this case Scrapy will prioritize requests coming from parse, which should let you stay inside the 15-minute window.
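
Applied to the spider in the question, that would mean bumping the priority of the follow-up request built in parse_node, along these lines (this assumes your AmazonApiRequest subclass forwards extra keyword arguments on to scrapy.Request):

        # in AmazonCategorySpider.parse_node, replacing the original return
        return AmazonApiRequest(
            qargs=dict(Operation='BrowseNodeLookup',
                       BrowseNodeId=node_id),
            meta=dict(ancestor_node_id=node_id,
                      category=category_item),
            # higher priority: scheduled ahead of the pending backlog,
            # so the request is fetched while its signature is still fresh
            priority=1000)

That way the requests signed most recently are also the ones fetched first.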

More on Request.priority:

priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.

From the Scrapy docs.
