How to keep unwanted fields out of the response items (Scrapy)

Posted 2024-10-03 02:46:31


Hi everyone, and thanks in advance.

When I run Scrapy I save the items to a .json file, but along with the items I want I also get some fields I don't: download_latency, download_timeout, depth and download_slot.

import scrapy

class LibresSpider(scrapy.Spider):
    name = 'libres'
    allowed_domains = ['www.todostuslibros.com']
    start_urls = ['https://www.todostuslibros.com/mas_vendidos/']

    def parse(self, response):
        for tfg in response.css('li.row-fluid'):
            doc = {}
            data = tfg.css('book-basics')
            doc['titulo'] = tfg.css('h2 a::text').extract_first()
            doc['url'] = response.urljoin(tfg.css('h2 a::attr(href)').extract_first())

            yield scrapy.Request(doc['url'], callback=self.parse_detail, meta=doc)

        next = response.css('a.next::attr(href)').extract_first()
        if next is not None:
            next_page = response.urljoin(next)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_detail(self, response):
        detail = response.meta
        detail['page_count'] = ' '.join(response.css('dd.page.count::text').extract())
        detail['keywords'] = ' '.join(response.css('div.descripcio a::text').extract())

        yield detail

I understand that this unwanted data arrives with the response (it is picked up from response.meta in parse_detail), but I would like to know how to keep it out of the resulting JSON.
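For context, those extra keys are added to request.meta by Scrapy's own downloader middlewares, so yielding the whole meta dict exports them too. A minimal, framework-free sketch of the situation (the injected key names are the real ones from the question; the title and URL values are made up for illustration):

```python
# Hypothetical snapshot of response.meta after Scrapy's built-in
# middlewares have run: the user's own keys plus bookkeeping keys
# such as download_timeout, download_latency, download_slot, depth.
meta = {
    'titulo': 'Example title',
    'url': 'https://www.todostuslibros.com/libro/x',
    'download_timeout': 180.0,
    'download_latency': 0.35,
    'download_slot': 'www.todostuslibros.com',
    'depth': 1,
}

# One quick workaround: drop the Scrapy-injected keys before yielding.
SCRAPY_KEYS = {'download_timeout', 'download_latency', 'download_slot', 'depth'}
item = {k: v for k, v in meta.items() if k not in SCRAPY_KEYS}
```

This filters the item down to only the user-defined fields, though keeping your data under its own key in meta (as the answer below does) avoids the problem at the source.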


1 Answer

Answered on 2024-10-03 02:46:31

Please use a more descriptive title to help other people who may have the same problem; "garbage" is a very vague word.

You can find more information about the meta attribute in the Scrapy documentation here:

A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.

If you want to keep Scrapy from filling your JSON with all that information, you can do the following:

def parse(self, response):
    for tfg in response.css('li.row-fluid'):
        doc = {}
        data = tfg.css('book-basics')
        doc['titulo'] = tfg.css('h2 a::text').extract_first()
        doc['url'] = response.urljoin(tfg.css('h2 a::attr(href)').extract_first())

        request = scrapy.Request(doc['url'], callback=self.parse_detail)
        request.meta['detail'] = doc
        yield request

    next = response.css('a.next::attr(href)').extract_first()
    if next is not None:
        next_page = response.urljoin(next)
        yield scrapy.Request(next_page, callback=self.parse)

def parse_detail(self, response):
    detail = response.meta['detail']
    detail['page_count'] = ' '.join(response.css('dd.page.count::text').extract())
    detail['keywords'] = ' '.join(response.css('div.descripcio a::text').extract())
    yield detail
