Get the text but completely exclude tables

Posted 2024-06-28 19:27:17


I am loading XML with BeautifulSoup. I only need the text, without the tags, and the text attribute gives me that.

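However, I want to completely exclude everything inside <table></table> tags. I considered using a regular expression to strip out everything in between, but I would like to know whether there is a cleaner solution, partly because of "Don't parse [X]HTML with regex!". For example: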

s =""" <content><p>Hasselt ( ) is a <link target="Belgium">Belgian</link> <link target="city">city</link> and <link target="Municipalities in Belgium">municipality</link>. 
<table><cell>Passenger growth
<cell>Year</cell><cell>Passengers</cell><cell>Percentage </cell></cell>
<cell>1996</cell><cell>360 000</cell><cell>100%</cell>
<cell>1997</cell><cell>1 498 088</cell><cell>428%</cell>
</table>"""
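from bs4 import BeautifulSoup as Soup  # import missing from the original snippet

clean = Soup(s)  # no parser named, so bs4 picks a default one
print(clean.text)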

This gives:

Hasselt ( ) is a Belgian city and municipality. 
Passenger growth
YearPassengersPercentage 
1996360 000100%
19971 498 088428%

whereas I only want:

Hasselt ( ) is a Belgian city and municipality.

1 answer
User
#1 · Posted 2024-06-28 19:27:17

You can locate the content element, remove every table element from it, and then get the text:

from bs4 import BeautifulSoup

s =""" <content><p>Hasselt ( ) is a <link target="Belgium">Belgian</link> <link target="city">city</link> and <link target="Municipalities in Belgium">municipality</link>.
<table><cell>Passenger growth
<cell>Year</cell><cell>Passengers</cell><cell>Percentage </cell></cell>
<cell>1996</cell><cell>360 000</cell><cell>100%</cell>
<cell>1997</cell><cell>1 498 088</cell><cell>428%</cell>
</table>"""
soup = BeautifulSoup(s, "xml")

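content = soup.content
# Calling a tag is shorthand for find_all(); detach every <table> from the tree.
for table in content("table"):
    table.extract()

# Only the text outside the removed tables is left.
print(content.get_text().strip())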

Prints:

Hasselt ( ) is a Belgian city and municipality.
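
If BeautifulSoup or lxml is not available, the same idea (parse the markup and drop whatever sits inside a table, rather than using a regex) can be sketched with the standard library's html.parser. This is not the method from the answer above, only a rough alternative under some assumptions: the names TextOutsideTags and text_without are hypothetical, the sample string is a shortened version of the one in the question, and tag names must be given in lowercase because HTMLParser lowercases them.

```python
from html.parser import HTMLParser


class TextOutsideTags(HTMLParser):
    """Collect text, skipping anything nested inside the excluded tags."""

    def __init__(self, excluded=("table",)):
        super().__init__(convert_charrefs=True)
        self.excluded = set(excluded)
        self.depth = 0    # number of excluded tags currently open
        self.chunks = []  # text pieces found outside excluded tags

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.excluded and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while no excluded tag is open.
        if self.depth == 0:
            self.chunks.append(data)


def text_without(markup, excluded=("table",)):
    """Return the text of `markup` minus the content of excluded tags."""
    parser = TextOutsideTags(excluded)
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks).strip()


# Shortened sample in the spirit of the question (unclosed tags included).
sample = (
    '<content><p>Hasselt is a <link target="Belgium">Belgian</link> city.\n'
    "<table><cell>Year</cell><cell>1996</cell></table>"
)
print(text_without(sample))  # Hasselt is a Belgian city.
```

The depth counter is there so that nested excluded tags are handled: text is kept again only after the outermost one has been closed. Unlike the BeautifulSoup version, this sketch builds no tree, so it fits only when plain text is all that is needed.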
