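I am loading XML with BeautifulSoup. I only need the text and want to ignore the tags and attributes. However, I want to completely exclude anything inside <table></table> tags. I thought about using a regular expression to replace everything in between, but I wonder whether there is a cleaner solution, partly because of "Don't parse [X]HTML with regex!". For example: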
s =""" <content><p>Hasselt ( ) is a <link target="Belgium">Belgian</link> <link target="city">city</link> and <link target="Municipalities in Belgium">municipality</link>.
<table><cell>Passenger growth
<cell>Year</cell><cell>Passengers</cell><cell>Percentage </cell></cell>
<cell>1996</cell><cell>360 000</cell><cell>100%</cell>
<cell>1997</cell><cell>1 498 088</cell><cell>428%</cell>
</table>"""
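from bs4 import BeautifulSoup as Soup  # assumption: Soup is an alias for bs4's BeautifulSoup
clean = Soup(s, "html.parser")  # name the parser explicitly to avoid a warning
print(clean.text)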
This gives:
Hasselt ( ) is a Belgian city and municipality.
Passenger growth
YearPassengersPercentage
1996360 000100%
19971 498 088428%
whereas I only want:
Hasselt ( ) is a Belgian city and municipality.
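You can find the content element, remove all table elements from it, and then get the text. For the example above, that leaves only the sentence about Hasselt.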
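A minimal sketch of that approach, not necessarily the answer's original code. It assumes the bs4 package and the built-in "html.parser" backend; the helper name text_without_tables and the shortened sample string are made up for illustration. If lxml is installed, passing "xml" as the parser may suit real XML input better.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample from the question (illustrative only).
s = """<content><p>Hasselt ( ) is a <link target="Belgium">Belgian</link> <link target="city">city</link> and <link target="Municipalities in Belgium">municipality</link>.
<table><cell>Year</cell><cell>Passengers</cell>
<cell>1996</cell><cell>360 000</cell>
</table></p></content>"""


def text_without_tables(markup):
    """Return the text of the <content> element, skipping every <table>."""
    soup = BeautifulSoup(markup, "html.parser")
    content = soup.find("content")
    # Remove each table (and everything nested in it) from the tree.
    for table in content.find_all("table"):
        table.decompose()
    # Only the text outside the removed tables is left.
    return content.get_text().strip()


print(text_without_tables(s))
```

decompose() destroys the removed elements. If the tables are needed later, extract() detaches them from the tree and returns them instead.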