What is the correct Scrapy XPath for a <p> element that is incorrectly placed inside an <h> tag?

def parse(self, response):
    chinesetitle = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/h2/text()').extract()
    englishtitle = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/h2/p').extract()
    chinesereleasedate = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[4]').extract()
    productionregions = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[6]').extract()
    chineseboxoffice = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[1]/span/text()[2]').extract()
    yield {
        'chinesetitle': chinesetitle,
        'englishtitle': englishtitle,
        'chinesereleasedate': chinesereleasedate,
        'productionregions': productionregions,
        'chineseboxoffice': chineseboxoffice
    }

def parse(self, response):
    chinesetitle = response.css('.cont h2::text').extract_first()
    englishtitle = response.css('.cont h2 + p::text').extract_first()
    chinaboxoffice = response.xpath('//span[@class="m-span"]/text()[2]').extract_first()
    chinaboxoffice = chinaboxoffice.split('万')[0]
    chinareleasedate = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
    chinareleasedate = chinareleasedate.split('：')[1].split('（')[0]
    countryoforigin = response.xpath('//div[@class="ziliaofr"]/div/p')[6].xpath('text()').extract_first()
    countryoforigin = countryoforigin.split('：')[1]
    genre = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"类型")]/text()').extract_first()
    genre = genre.split('：')[1]
    director = response.xpath('//*[@id="tabcont1"]/dl/dd[1]/p/a/text()').extract()

2 answers

User
Answer 1 · edited 2024-09-30 22:28:05

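Here are some examples; you can work out the last one from them. Remember to always identify HTML elements by their class or id attributes. A positional path such as /div[3]/div[2]/div/div[1]/.. is not good practice, because it breaks as soon as the page layout changes.

chinesetitle = response.xpath('//div[@class="ziliaofr"]/div/h2/text()').extract_first()
englishtitle = response.xpath('//div[@class="ziliaofr"]/div/p/text()').extract_first()
chinesereleasedate = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
productionregions = response.xpath('//div[@class="ziliaofr"]/div/p')[6].xpath('text()').extract_first()

To find chinesereleasedate, I selected the p element whose text contains '上映时间'. You still have to parse the result to get the exact value.
To find productionregions, I took the seventh selector from the list, response.xpath('//div[@class="ziliaofr"]/div/p')[6], and selected its text. A better approach is to check whether the text contains a known label, as described above.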
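The label-based lookup can also be done in plain Python once the text of every p element has been extracted. The following is only a minimal sketch: the helper name field_by_label and the sample strings are hypothetical, and it assumes each field is written as a label followed by a full-width colon, as in the release date string.

```python
def field_by_label(texts, label, sep='：'):
    """Return the value that follows `label` in the first matching text, or None."""
    for text in texts:
        if text and label in text:
            # Keep everything after the first separator and trim whitespace.
            _, found, value = text.partition(sep)
            return value.strip() if found else None
    return None


# Hypothetical sample of what the text of each <p> might look like.
sample = ['\r\n 上映时间：2017-7-27（中国）\r\n ', '\r\n 类型：动作\r\n ']
print(field_by_label(sample, '类型'))
```

Unlike indexing with [6], this does not depend on the position of the p element, so it should keep working if the page adds or drops a field.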
Edit: answering the question in the comments

response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()

returns a string like '\r\n 上映时间：2017-7-27（中国）\r\n ', which is not the string you are looking for. You can clean it up like this:

chinesereleasedate = chinesereleasedate.split('：')[1].split('（')[0]

This gives us the correct date.
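One caveat: extract_first() returns None when nothing matches, and the chained split raises an error if the colon is missing. A slightly more defensive version might look like the sketch below; the function name clean_release_date is hypothetical, and it assumes the same "label：value（region）" layout as the string shown above.

```python
def clean_release_date(raw):
    """Extract the date from a string like '上映时间：2017-7-27（中国）'."""
    if raw is None:
        # extract_first() found no matching node.
        return None
    parts = raw.split('：', 1)
    if len(parts) < 2:
        # No full-width colon, so the layout is not the one we expect.
        return None
    # Drop the parenthesised region, if any, and surrounding whitespace.
    return parts[1].split('（')[0].strip()


print(clean_release_date('\r\n 上映时间：2017-7-27（中国）\r\n '))
```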

User
Answer 2 · edited 2024-09-30 22:28:05

By the way, you don't have to torture yourself with XPath; you can use CSS selectors:

response.css('.cont h2::text').extract_first()      # '战狼2'
response.css('.cont h2 + p::text').extract_first()  # 'Wolf Warriors 2'
