How do I extract text without brackets using the xpath method in Python?

import datetime as dt
import pandas as pd

# Find all the articles by using inspect element and create a blank list.
# (articles is the list of article elements found in that step.)
n = 0
newslist = []

# Loop through each article to find the title, subtitle, link, date and author.
# try/except is used because repeated articles from other sources have different h tags.
for item in articles:
    try:
        newsitem = item.find('h3', first=True)
        title = newsitem.text
        link = newsitem.absolute_links
        subtitle = item.xpath('//a[@class="epigraph page-link"]//text()')
        author = item.xpath('//span[@class="oculto"]/span//text()')
        date = item.xpath('//meta[@itemprop="datePublished"]/@content')
        date_scrap = dt.datetime.utcnow().strftime("%d/%b/%Y")
        hour_scrap = dt.datetime.utcnow().strftime("%H:%M:%S")
        print(n, '\n', title, '\n', subtitle, '\n', link, '\n', author, '\n', date, '\n', date_scrap, '\n', hour_scrap)
        newsarticle = {
            'title': title,
            'subtitle': subtitle,
            'link': link,
            'autor': author,
            'fecha': date,
            'date_scrap': date_scrap,
            'hour_scrap': hour_scrap
        }
        newslist.append(newsarticle)
        n += 1
    except AttributeError:
        # item.find returned None because this article has no h3 tag
        pass

news_db = pd.DataFrame(newslist)
news_db.to_excel(r'db_article.xlsx', index=False, header=True)
news_db.head(10)

1 Answer

Anonymous user

Reply #1 · Posted 2024-06-16 13:52:55

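The xpath method returns a list of the items it finds, e.g. ['Author'] rather than 'Author'. The same is true of item.find. This is useful when you are searching for multiple elements (e.g. ['Author1', 'Author2']). To get only a single value, use the first parameter: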

subtitle = item.xpath('//a[@class="epigraph page-link"]//text()', first=True)
author = item.xpath('//span[@class="oculto"]/span//text()', first=True)
date = item.xpath('//meta[@itemprop="datePublished"]/@content', first=True)
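
To make the difference concrete without hitting a live page, here is a minimal sketch in plain Python. The helper first_or_none is hypothetical (it is not part of any library) and only imitates what first=True appears to do: return the first match, or None when nothing matched. The lists stand in for what xpath returns without first=True.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def first_or_none(values: Sequence[T]) -> Optional[T]:
    """Hypothetical helper: first item of a result list, or None if it is empty."""
    return values[0] if values else None


# Stand-ins for xpath(...) results without first=True (assumed sample data).
authors = ["Author"]
missing = []

print(authors)                 # the list form, printed with brackets
print(first_or_none(authors))  # the bare string, no brackets
print(first_or_none(missing))  # None when nothing matched
```

A helper like this can also be handy for result lists that come from code where no first parameter is available.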

absolute_links is a set, so to get a single link out of it you can use either of:

link = next(iter(newsitem.absolute_links))
# or
link = newsitem.absolute_links.pop()
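
Both forms above raise an exception on an empty set (StopIteration and KeyError respectively). As a hedged sketch, the variant below passes a default to next so that an empty set gives None instead. The function name pick_link and the example.com URL are made up for illustration. Note that a set is unordered, so if it holds several links, which one you get is arbitrary.

```python
def pick_link(links, default=None):
    """Hypothetical helper: return one element of a set, or default if it is empty."""
    return next(iter(links), default)


# Assumed sample data standing in for newsitem.absolute_links.
links = {"https://example.com/article-1"}

print(pick_link(links))  # the single URL as a plain string
print(pick_link(set()))  # None, instead of an exception
```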
