Waiting for all requests to complete

    import scrapy

    # MyItem, Category and SubItem are the asker's item classes (definitions not shown).


    class SpiderCrawler(scrapy.Spider):
        name = "spiderman"
        allowed_domains = ["mywebsite.com"]
        start_urls = [
            "https://www.mywebsite.com/items",
        ]

        def parse(self, response):
            for sel in response.xpath('//div[@id="col"]'):
                items = MyItem()
                items['categories'] = []

                sections = sel.xpath('//tbody')
                category_count = 5  # filler

                for count in range(1, category_count):
                    category = Category()
                    # set categories

                    for item, link in zip(items.xpath("text()"), items.xpath("@href")):
                        subItem = SubItem()
                        # set subItems
                        subItem['link'] = "www.mywebsite.com/nexturl"

                        # the problem
                        request = scrapy.Request(subItem['link'], callback=self.parse_sub_item)
                        request.meta['sub_item'] = subItem
                        yield request

                        category['sub_items'].append(subItem)
                    items['categories'].append(category)

                # I want this yield to not be executed until ALL requests are complete
                yield items

        def parse_sub_item(self, response):
            fields = ...  # some xpath
            subItem = response.meta["sub_item"]
            subItem['fields'] = ...  # some xpath
            subItem['another_field'] = ...  # some xpath

1 answer

Anonymous user

#1 · Posted 2024-09-22 20:39:10

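The idea behind Scrapy is to export items as the responses to its requests come in. What you are trying to do is collect everything together and return a single item, and that is not possible from within one parse callback, because the requests you yield are scheduled and handled asynchronously.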

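However, with a small change to your code you can get the result you want. Export the items as they are now and create an item pipeline that, for example, turns the items generated in the parse methods into one large item (a dictionary?) containing the categories and their sub_items, and exports everything together when the close_spider method is called.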

This way you can let the items be processed asynchronously and still group the results together.
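A minimal sketch of such a pipeline follows, using the classic Scrapy pipeline hooks `open_spider`, `process_item` and `close_spider`. The class name `GroupByCategoryPipeline`, the `category` key on each sub-item, and the `items.json` output path are assumptions made for illustration; none of them come from the question. It also assumes that `parse_sub_item` yields each completed sub-item instead of only mutating it.

```python
import json
from collections import defaultdict


class GroupByCategoryPipeline:
    """Collects sub-items as they arrive and writes one combined record at the end."""

    # Hypothetical output location, not taken from the original post.
    output_path = "items.json"

    def open_spider(self, spider):
        # Maps a category name to the list of sub-items seen so far.
        self.categories = defaultdict(list)

    def process_item(self, item, spider):
        # Assumption: the spider stores the category name under a 'category' key.
        data = dict(item)
        category = data.pop("category", "uncategorized")
        self.categories[category].append(data)
        return item

    def close_spider(self, spider):
        # Runs once when the spider closes, i.e. after all requests have been handled,
        # so the grouping is complete at this point.
        combined = {
            "categories": [
                {"name": name, "sub_items": sub_items}
                for name, sub_items in self.categories.items()
            ]
        }
        with open(self.output_path, "w", encoding="utf-8") as f:
            json.dump(combined, f, ensure_ascii=False, indent=2)
```

To try this, the pipeline would need to be enabled through the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {"myproject.pipelines.GroupByCategoryPipeline": 300}`, where the module path is a placeholder for your own project. Holding everything in memory until the spider closes is fine for a small crawl; a large one may prefer to write incrementally.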
