Scrapy: waiting before the pipeline

Context

Suppose there is a site, https://example.com, that I want to scrape.

Its structure is as follows:

<body>
  <ul>
    <li>
      title_foo
      <a href="https://example.com/title_foo">a description</a>
    </li>
    <li>
      title_bar
      <a href="https://example.com/title_bar">another description</a>
    </li>
  </ul>
</body>

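By following the <a> links, I can fetch the description I need before creating the item and sending it to my pipeline, which will store it in my database.

For example, say that when I follow https://example.com/title_foo, I retrieve the description from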

<div class="a-descrption"> a description </div>

In {}, I have:

import scrapy


class MyItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()

My spider should look like this:

import scrapy
from scrapy_project.items import MyItem


class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        urls = [
            'https://example.com',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Only the title is scraped here; the description lives on the detail page
        for li in response.xpath('//li/text()').getall():
            yield {"title": li}

At least, I hope so. I have not tested it, so please correct me if anything is wrong.

1 Answer

User

#1 · Posted on 2024-10-04 03:21:50

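In your example, the only field scraped is title, so I am not entirely sure, but it sounds like you want to scrape the title at https://example.com, request a detail page (such as https://example.com/title_foo) to scrape the description there, and then yield an item that contains both description and title.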

If that is the case, the common solution to this kind of problem is to use cb_kwargs or meta. (If you are using Scrapy v1.7+, cb_kwargs is recommended.)

cb_kwargs lets you pass arbitrary data to a request's callback function. Note that the data is passed as keyword arguments. For example:

import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"

    start_urls = ['https://example.com',]

    def parse(self, response):
        for li in response.xpath('//li'):
            # Text directly inside <li> (the title) and the link to the detail page
            title = li.xpath('text()').get()
            url_to_detail_page = li.xpath('a/@href').get()
            yield scrapy.Request(
                url=url_to_detail_page,
                callback=self.parse_detail_page,
                cb_kwargs={
                    'title': title
                })

    def parse_detail_page(self, response, title):  # Notice title as a keyword arg
        # getall() returns a list of every text node inside the description div
        description = response.xpath('//div[@class="a-descrption"]//text()').getall()
        yield {
            'title': title,
            'description': description,
        }

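Here, the data from the first page, stored in title, "travels along" with the request for the detail page. When the callback is invoked, title is received as an argument, so you can access it inside the function.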
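Because the item is only yielded from parse_detail_page, the pipeline never sees it until both fields are filled in. The question does not show its pipeline, so the following is only a sketch under assumptions: the class name SQLitePipeline, the items.db file, and the items table are hypothetical, and SQLite stands in for whatever database is actually used. The method names open_spider, close_spider, and process_item are the hooks Scrapy calls on an item pipeline.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores each scraped item in a SQLite table."""

    def __init__(self, db_path="items.db"):
        # db_path is an assumed default; Scrapy creates the pipeline without arguments.
        self.db_path = db_path
        self.connection = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database and ensure the table exists.
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS items (title TEXT, description TEXT)"
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: persist the rows and release the connection.
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # The spider yields the description as a list of text nodes; join it into one string.
        description = item["description"]
        if isinstance(description, list):
            description = " ".join(
                part.strip() for part in description if part.strip()
            )
        title = (item["title"] or "").strip()
        self.connection.execute(
            "INSERT INTO items (title, description) VALUES (?, ?)",
            (title, description),
        )
        # Return the item so that any later pipeline still receives it.
        return item
```

To enable it, the class would be listed in the ITEM_PIPELINES setting, for example {"scrapy_project.pipelines.SQLitePipeline": 300}, assuming it lives in a pipelines module of the scrapy_project package from the question.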
