How to add space around removed tags in BeautifulSoup

from BeautifulSoup import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

soup = BeautifulSoup(html)
all_poems = soup.findAll("div", {"class": "thisText"})

for poems in all_poems:
    print(poems.text)

3 Answers

User

Answer 1 · Edited 2024-10-03 00:26:14

Here is an alternative that uses lxml and its xpath function to find all text nodes:

from lxml import etree

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

root = etree.fromstring(html, etree.HTMLParser())
print(' '.join(root.xpath("//text()")))

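It prints the contents of every text node in the document, joined by single spaces, so the link text no longer runs into the words around it.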
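The same idea can be applied one poem at a time. The following is a minimal sketch, assuming lxml is installed; the per-div XPath and the whitespace normalization with split() are additions for illustration and are not part of the answer above.

```python
from lxml import etree

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

root = etree.fromstring(html, etree.HTMLParser())

poems = []
for div in root.xpath('//div[@class="thisText"]'):
    # Join the text nodes inside this div with spaces ...
    joined = " ".join(div.xpath(".//text()"))
    # ... then collapse runs of whitespace (newlines, double spaces) into one space.
    poems.append(" ".join(joined.split()))

for poem in poems:
    print(poem)
```

Splitting and re-joining removes the leading newline and the doubled spaces that a plain join leaves behind.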

User

Answer 2 · Edited 2024-10-03 00:26:14

One option is to find all the text nodes and join them with a space:

" ".join(item.strip() for item in poems.find_all(text=True))

Also, you are using the BeautifulSoup 3 package, which is outdated and no longer maintained. Upgrade to BeautifulSoup 4:

pip install beautifulsoup4

and replace:

from BeautifulSoup import BeautifulSoup

with:

from bs4 import BeautifulSoup
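Putting this answer's two suggestions together might look like the sketch below. It assumes BeautifulSoup 4.4.0 or later, where the string= argument is the newer name for text=. The poem_texts helper is a hypothetical name, and the explicit "html.parser" argument and the filtering of empty pieces are additions for illustration.

```python
from bs4 import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")


def poem_texts(soup):
    """Return one space-joined string per div with class "thisText" (hypothetical helper)."""
    texts = []
    for poem in soup.find_all("div", {"class": "thisText"}):
        # Strip each text node, then drop the ones that were only whitespace.
        pieces = (item.strip() for item in poem.find_all(string=True))
        texts.append(" ".join(piece for piece in pieces if piece))
    return texts


for text in poem_texts(soup):
    print(text)
```

Dropping the empty pieces avoids a trailing space when a div ends with a newline after its last tag.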

User

Answer 3 · Edited 2024-10-03 00:26:14

In BeautifulSoup 4, get_text() takes an optional argument named separator. You can use it as follows:

soup = BeautifulSoup(html)
text = soup.get_text(separator=' ')
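To see the effect of separator, here is a small sketch on a shortened, made-up snippet (the href is a placeholder, not from the question). It also passes strip=True, which get_text() accepts to trim each piece of text before joining. That flag and the explicit "html.parser" argument are additions to the answer above.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup modeled on the second poem in the question.
html = '<div class="thisText">part of<a href="#">The Haunted Palace</a></div>'

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "thisText"})

# Default: text nodes are concatenated with nothing between them.
plain = div.get_text()

# With a separator, a space is inserted between adjacent text nodes;
# strip=True trims each node and skips nodes that are only whitespace.
spaced = div.get_text(separator=" ", strip=True)

print(repr(plain))
print(repr(spaced))
```

Without strip=True, any whitespace already present at the edges of a text node is kept, so the result can contain doubled spaces or newlines.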
