Filtering some HTML tags from a string with a Python regular expression - Q&A - Python中文网


Posted 2024-09-28 18:47:31


I have a string like this:

original_str="SnO<sub>2</sub>solution-based (a<100 <sup>o</sup> AAAC>u_test)abcdhhhh"

The rule is:

Convert "<" or ">" to "&lt;" or "&gt;" when they are not part of an HTML tag.
PS: The only HTML tags in the string are <sup></sup> and <sub></sub>.

So the processed string should be:

process_str="SnO<sub>2</sub>solution-based (a&lt;100 <sup>o</sup> AAAC&gt;u_test)abcdhhhh"

I don't know how to handle this with a regular expression.


1 answer

User

Answer 1 · Posted 2024-09-28 18:47:31

Parsing HTML with a regex is not a good idea; see this answer for details.

Instead, use a fault-tolerant HTML parser to read the string, then have it generate conforming output.

In [7]: import bs4

In [8]: original_str="SnO<sub>2</sub>solution-based (a<100 <sup>o</sup> AAAC>u_test)abcdhhhh"

In [9]: soup = bs4.BeautifulSoup(original_str, 'lxml')

In [10]: print(soup)
<html><body><p>SnO<sub>2</sub>solution-based (a&lt;100 <sup>o</sup> AAAC&gt;u_test)abcdhhhh</p></body></html>

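If you only need the fragment as originally posted, use decode_contents() (the older renderContents() returns bytes under Python 3):

In [18]: soup.body.p.decode_contents()
Out[18]: 'SnO<sub>2</sub>solution-based (a&lt;100 <sup>o</sup> AAAC&gt;u_test)abcdhhhh'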
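
Since the question states that the string contains only <sub> and <sup> tags, a regex can also work under that narrow assumption. The following is a minimal sketch, not the approach recommended in the answer above: it matches either one of the four allowed tags or a lone angle bracket, keeps the tags, and escapes everything else. The names TAG_OR_BRACKET and escape_stray_brackets are hypothetical, chosen only for illustration.

```python
import re

# Assumption: the only legitimate tags are <sub>, </sub>, <sup>, </sup>,
# written without attributes or extra whitespace.
# Group 1 captures an allowed tag; group 2 captures any other "<" or ">".
TAG_OR_BRACKET = re.compile(r"(</?su[bp]>)|([<>])")


def escape_stray_brackets(text):
    """Escape "<" and ">" that are not part of a <sub>/<sup> tag."""

    def replace(match):
        if match.group(1):
            # An allowed tag: leave it untouched.
            return match.group(1)
        # A stray angle bracket: replace it with its HTML entity.
        return "&lt;" if match.group(2) == "<" else "&gt;"

    return TAG_OR_BRACKET.sub(replace, text)


original_str = "SnO<sub>2</sub>solution-based (a<100 <sup>o</sup> AAAC>u_test)abcdhhhh"
print(escape_stray_brackets(original_str))
```

Because the alternation tries the tag pattern first at each position, a "<" that starts a real tag is consumed together with the tag and never reaches the second branch. The sketch breaks down as soon as other tags, attributes, or comments appear, which is where the parser-based approach is the safer choice.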
