XPath cannot select just one HTML tag

# -*- coding: utf-8 -*-
import scrapy

from gumtree.items import GumtreeItem


class FlatSpider(scrapy.Spider):
    name = "flat"
    allowed_domains = ["gumtree.com"]
    start_urls = (
        'https://www.gumtree.com/flats-for-sale',
    )

    def parse(self, response):
        item = GumtreeItem()
        item['title'] = response.xpath('//*[@class="listing-title"][1]/text()').extract()
        return item

2 answers

User

#1 · edited 2024-10-01 22:33:10

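Strictly speaking, it should be response.xpath('(//*[@class="listing-title"])[1]/text()'). In the original expression, [1] applies to each matching element relative to its own parent, so it does not pick the first match in the whole document. However, if you want the title of every ad (for example, to create one item per ad), you should probably do this instead: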

for article in response.xpath('//article[@data-q]'):
    item = GumtreeItem()
    item['title'] = article.css('.listing-title::text').extract_first()
    yield item
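To make the difference between the two XPath forms concrete, here is a minimal offline sketch. It assumes a hypothetical three-listing page (not Gumtree's real markup) and uses lxml directly rather than a Scrapy response. Scrapy selectors are built on lxml, so the XPath semantics should be the same.

```python
from lxml import html

# Hypothetical markup: each <article> wraps exactly one listing title.
PAGE = """
<html><body>
  <article><h2 class="listing-title">Flat A</h2></article>
  <article><h2 class="listing-title">Flat B</h2></article>
  <article><h2 class="listing-title">Flat C</h2></article>
</body></html>
"""

tree = html.fromstring(PAGE)

# [1] binds to the last step: "the first matching child of each parent".
# Every <h2> is the first match inside its own <article>, so all three match.
per_parent = tree.xpath('//*[@class="listing-title"][1]/text()')

# Parentheses build the whole node-set first, then [1] picks its first member.
first_only = tree.xpath('(//*[@class="listing-title"])[1]/text()')

print(per_parent)  # ['Flat A', 'Flat B', 'Flat C']
print(first_only)  # ['Flat A']
```

In a spider, the parenthesized expression can be passed to response.xpath() unchanged.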

User

#2 · edited 2024-10-01 22:33:10

This happens because the first element is actually empty. Keep only the non-empty values and use extract_first(). This worked for me:

$ scrapy shell "https://www.gumtree.com/flats-for-sale" -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.113 Safari/537.36"
In [1]: response.xpath('//*[@class="listing-title"][1]/text()[normalize-space(.)]').extract_first().strip()
Out[1]: u'REDUCED to sell! Stunning Hove sea view flat.'
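The effect of the normalize-space(.) predicate can be reproduced offline. The sketch below assumes a hypothetical title element whose first text node is only whitespace (the real page's markup may differ) and again uses lxml directly in place of a Scrapy response.

```python
from lxml import html

# Hypothetical markup: the title text comes after a nested <span>,
# so the first text node of the <h2> is whitespace only.
PAGE = """
<html><body>
  <h2 class="listing-title">
    <span class="badge">Featured</span>
    REDUCED to sell!
  </h2>
</body></html>
"""

tree = html.fromstring(PAGE)

# Without the filter, the whitespace-only node comes back first.
all_text = tree.xpath('//*[@class="listing-title"]/text()')

# normalize-space(.) is an empty string (false) for whitespace-only nodes,
# so the predicate keeps only text nodes with visible content.
non_empty = tree.xpath('//*[@class="listing-title"]/text()[normalize-space(.)]')

# Equivalent of extract_first().strip() on a Scrapy selector list.
title = non_empty[0].strip() if non_empty else None
print(repr(title))
```

The predicate only filters nodes. It does not trim them, which is why the strip() call is still needed.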
