How to add spaces around removed tags in BeautifulSoup

Posted 2024-10-03 00:26:14


from BeautifulSoup import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''


soup = BeautifulSoup(html)
all_poems = soup.findAll("div", {"class": "thisText"})
for poems in all_poems:
    print(poems.text)

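I have this sample code, but I can't figure out how to add spaces around the removed tags, so that the text inside <a href...> stays readable once the markup is stripped instead of running into the surrounding words like this: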

PoemThe RavenOnce upon a midnight dreary, while I pondered, weak and weary...

In the greenest of our valleys By good angels tenanted..., part ofThe Haunted Palace


3 answers

Here is an alternative that uses lxml and its xpath function to search for all text nodes:

from lxml import etree

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

root = etree.fromstring(html, etree.HTMLParser())
print(' '.join(root.xpath("//text()")))

It yields the text nodes joined by single spaces, so the link text no longer runs into the words around it.
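Joining the raw text nodes keeps the newlines and doubled spaces that were already in the markup. A minimal sketch of one way to tidy that up, assuming lxml is installed: collect the text nodes per div and collapse runs of whitespace. The helper name div_text is hypothetical and not part of lxml.

```python
from lxml import etree

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

root = etree.fromstring(html, etree.HTMLParser())


def div_text(div):
    """Hypothetical helper: join a div's text nodes, then collapse whitespace."""
    joined = " ".join(div.xpath(".//text()"))
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace, so re-joining normalizes spacing.
    return " ".join(joined.split())


poems = [div_text(div) for div in root.xpath('//div[@class="thisText"]')]
for poem in poems:
    print(poem)
```

Working per div rather than on the whole tree keeps each poem on its own line, which matches the loop in the question.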

One option is to find all text nodes and join them with spaces:

" ".join(item.strip() for item in poems.find_all(text=True))
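As a rough, self-contained version of the one-liner above, assuming BeautifulSoup 4 and the built-in html.parser: the sketch below also skips text nodes that are empty after stripping, since otherwise a whitespace-only node would leave a stray space in the result. The helper name spaced_text is hypothetical, and string=True is the newer spelling of the text=True argument.

```python
from bs4 import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")


def spaced_text(tag):
    """Hypothetical helper: join the tag's non-empty text nodes with spaces."""
    pieces = (s.strip() for s in tag.find_all(string=True))
    # Drop nodes that were whitespace only, so no extra spaces appear.
    return " ".join(p for p in pieces if p)


poems = [spaced_text(div) for div in soup.find_all("div", {"class": "thisText"})]
for poem in poems:
    print(poem)
```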

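Also, you are using the BeautifulSoup 3 package, which is outdated and no longer maintained. Upgrade to BeautifulSoup 4 (published on PyPI as beautifulsoup4)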

and replace:

from BeautifulSoup import BeautifulSoup

with:

from bs4 import BeautifulSoup

In BeautifulSoup 4, get_text() takes an optional argument named separator. You can use it as follows:

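from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
text = soup.get_text(separator=' ')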
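Applied to the question's loop, this might look like the sketch below. It assumes BeautifulSoup 4 with the built-in html.parser and additionally passes strip=True, which strips each text fragment and skips empty ones before they are joined with the separator.

```python
from bs4 import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# One string per div: fragments are stripped, then joined with a single space.
poems = [
    div.get_text(separator=" ", strip=True)
    for div in soup.find_all("div", class_="thisText")
]
for poem in poems:
    print(poem)
```

Calling get_text() on each div rather than on the whole soup keeps the two poems separate.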
