How to keep unwanted fields out of the response items (Scrapy)

Posted 2024-10-03 02:46:31


Hi everyone, and thanks in advance.

When I run Scrapy I save the items to a .json file, but along with the items I want I also get some fields I don't: download_latency, download_timeout, depth and download_slot.

import scrapy

class LibresSpider(scrapy.Spider):
    name = 'libres'
    allowed_domains = ['www.todostuslibros.com']
    start_urls = ['https://www.todostuslibros.com/mas_vendidos/']

    def parse(self, response):
        for tfg in response.css('li.row-fluid'):
            doc = {}
            data = tfg.css('book-basics')
            doc['titulo'] = tfg.css('h2 a::text').extract_first()
            doc['url'] = response.urljoin(tfg.css('h2 a::attr(href)').extract_first())

            yield scrapy.Request(doc['url'], callback=self.parse_detail, meta=doc)

        next = response.css('a.next::attr(href)').extract_first()
        if next is not None:
            next_page = response.urljoin(next)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_detail(self, response):
        detail = response.meta
        detail['page_count'] = ' '.join(response.css('dd.page.count::text').extract())
        detail['keywords'] = ' '.join(response.css('div.descripcio a::text').extract())

        yield detail

I understand that this unwanted data arrives with the response (it is picked up from response.meta in parse_detail), but I would like to know how to keep it out of the resulting JSON.
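For context, those extra keys are added to request.meta by Scrapy's own downloader middlewares, so yielding the whole meta dict exports them too. A minimal, framework-free sketch of the situation (the injected key names are the real ones from the question; the title and URL values are made up for illustration):

```python
# Hypothetical snapshot of response.meta after Scrapy's built-in
# middlewares have run: the user's own keys plus bookkeeping keys
# such as download_timeout, download_latency, download_slot, depth.
meta = {
    'titulo': 'Example title',
    'url': 'https://www.todostuslibros.com/libro/x',
    'download_timeout': 180.0,
    'download_latency': 0.35,
    'download_slot': 'www.todostuslibros.com',
    'depth': 1,
}

# One quick workaround: drop the Scrapy-injected keys before yielding.
SCRAPY_KEYS = {'download_timeout', 'download_latency', 'download_slot', 'depth'}
item = {k: v for k, v in meta.items() if k not in SCRAPY_KEYS}
```

This filters the item down to only the user-defined fields, though keeping your data under its own key in meta (as the answer below does) avoids the problem at the source.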


1 Answer

Answered on 2024-10-03 02:46:31

Please use a more descriptive title to help other people who may have the same problem; "garbage" is a very vague word.

You can find more information about the meta attribute in the Scrapy documentation here:

A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.

If you want to keep Scrapy from filling your JSON with all that information, you can do the following:

def parse(self, response):
    for tfg in response.css('li.row-fluid'):
        doc = {}
        data = tfg.css('book-basics')
        doc['titulo'] = tfg.css('h2 a::text').extract_first()
        doc['url'] = response.urljoin(tfg.css('h2 a::attr(href)').extract_first())

        request = scrapy.Request(doc['url'], callback=self.parse_detail)
        request.meta['detail'] = doc
        yield request

    next = response.css('a.next::attr(href)').extract_first()
    if next is not None:
        next_page = response.urljoin(next)
        yield scrapy.Request(next_page, callback=self.parse)

def parse_detail(self, response):
    detail = response.meta['detail']
    detail['page_count'] = ' '.join(response.css('dd.page.count::text').extract())
    detail['keywords'] = ' '.join(response.css('div.descripcio a::text').extract())
    yield detail
