Python Scrapy get article body, extract_first() returns None

Posted 2024-06-29 00:42:12

I am trying to get the article body from a news website using Scrapy.

import scrapy
import sys 
import json

class ReutersPage(scrapy.Spider):
    name = "reutersPage"
    start_urls = [
        'https://www.reuters.com/article/chile-sqm-stocks/lithium-miner-sqm-shares-up-2-7-pct-chile-court-clears-way-for-tianqi-stake-purchase-idUSC0N1OX01C'
    ]


    def parse(self, response):
        articleBody = response.css('div.StandardArticleBody_body::text').extract_first()
        print('######## Article body ##########')
        print(articleBody)
        yield {
            'body': articleBody
        }  

I am trying to get the text inside the div StandardArticleBody_body, but I always get None.

2 answers

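None of the text belongs directly to the div you selected; it belongs to the div's descendants. Putting a space between the selector path and ::text selects the text of all descendants, not just the text of the selected node.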

Try this:

articleBody = response.css('div.StandardArticleBody_body ::text').extract_first()

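This matches the text nodes of all of the div's descendants. Note that extract_first() returns only the first matching text node; use extract() to get all of them.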

In [27]: response.css('div.StandardArticleBody_body > p::text').extract()
Out[27]: 
['SANTIAGO, Oct 26 (Reuters) - Shares in lithium miner SQM jumped 2.7 percent on          Friday after Chile’s Constitutional Court rejected a lawsuit to block Chinese miner Tianqi Lithium Corp’s $4.1 billion purchase of a stake in the Chilean lithium miner. ',
'SQM’s B-series shares touched 29,400 pesos ($42.55) at the open of Santiago’s Stock Exchange. '] 
