Remove tags from HTML, except specific ones (but keep their contents)

Posted 2024-06-03 02:45:11


I use the code below to remove all tags from an HTML string. I need to keep <br> and <br/>, so I use this code:

import re
MyString = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub(r'(?i)(<br\/?>)|<[^>]*>', r'\1', MyString)  # raw string avoids an invalid-escape warning for \/
print(MyString)

Output:

aaaRadio and<BR> television.<br>very<br/> popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb

The result is correct, but now I also want to keep <p>, </p>, <br>, and <br/>.

How should I modify the code?


3 answers

I'm not sure regex is the right solution here, but since you asked:

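import re
# `html` is assumed to hold the input markup (MyString in the question).
# Swap the p tags for placeholders, strip every remaining tag, then restore them.
html = html.replace("<p>", "{p}").replace("</p>", "{/p}")
txt = re.sub("<[^>]*>", "", html)
txt = txt.replace("{p}", "<p>").replace("{/p}", "</p>")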

In effect, I swap the p tags for placeholder tokens and restore them after all other tags have been removed.
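The same placeholder idea can be extended to the br variants from the question. This is only a minimal sketch: the function name `strip_tags_keep_p_br` is made up for illustration, and it assumes the `{...}` placeholder tokens never occur in the input text.

```python
import re

def strip_tags_keep_p_br(html):
    # Step 1: protect <p>, </p>, <br>, <br/> (any letter case) by swapping
    # the angle brackets for braces. Assumption: the input contains no
    # literal "{p}", "{/p}" or "{br}" tokens of its own.
    protected = re.sub(r'(?i)<(/?p|br\s*/?)>', r'{\1}', html)
    # Step 2: remove every tag that is still written with angle brackets.
    stripped = re.sub(r'<[^>]*>', '', protected)
    # Step 3: turn the placeholders back into real tags.
    return re.sub(r'(?i)\{(/?p|br\s*/?)\}', r'<\1>', stripped)

print(strip_tags_keep_p_br('a<p>x<BR>y<span>z</span></p>'))
```

Tags that carry attributes, such as `<p class="x">`, are not protected by this pattern and would be stripped like any other tag.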

In general, parsing HTML with regex is not a good idea.

Using an HTML parser is much more robust than using regex. Regex should not be used to parse nested structures such as HTML.

Here is a working implementation that iterates over all HTML tags and unwraps those that are not p or br, so their contents stay in place:

from bs4 import BeautifulSoup

mystring = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'

soup = BeautifulSoup(mystring,'html.parser')
for e in soup.find_all():
    if e.name not in ['p','br']:
        e.unwrap()
print(soup)

Output:

aaa<p>Radio and<br/> television.<br/></p><p>very<br> popular in the world today.</br></p><p>Millions of people watch TV. </p><p>That’s because a radio is very small 98.2%</p><p>and it‘s easy to carry. haha100%</p>bb
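If installing BeautifulSoup is not an option, a similar result can probably be had with the standard library's `html.parser`. The sketch below is not part of the answer above; the class `KeepTagsParser` and the helper `strip_tags` are hypothetical names, and it assumes that attributes on kept tags may be dropped.

```python
from html.parser import HTMLParser

class KeepTagsParser(HTMLParser):
    """Rebuilds the input, emitting only the allowed tags plus all text."""

    def __init__(self, keep=("p", "br")):
        # convert_charrefs=False lets entities such as &amp; pass through as-is.
        super().__init__(convert_charrefs=False)
        self.keep = set(keep)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BR> arrives here as "br".
        if tag in self.keep:
            self.parts.append(f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing forms such as <br/>.
        if tag in self.keep:
            self.parts.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_tags(html, keep=("p", "br")):
    parser = KeepTagsParser(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_tags('aaa<p>Radio and<BR> television.<br/></p><span style="color: black;">98%</span>bb'))
```

Because the parser normalizes tag names, kept tags come out in lowercase and without their attributes; whether that is acceptable depends on the use case.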

Now I know how to modify it, but the first <p> is missing.

My code:

import re
MyString = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
# MyString = re.sub('(?i)(<br\/?>)|<[^>]*>',r'\1', MyString)
MyString = re.sub(r'(?i)(<br\/?>)|<[^>]*>(<\/?p>)|<[^>]*>', r'\1\2', MyString)  # raw string; pattern unchanged
print(MyString)

Output:

aaaRadio and<BR> television.<br><p>very<br/> popular in the world today.<p>Millions of people watch TV. <p>That’s because a radio is very small 98.2%</p>and it‘s easy to carry. haha100%</p>bb
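The likely reason some p tags disappear: the second alternative, `<[^>]*>(<\/?p>)`, only keeps a p tag that directly follows another tag (which it consumes). A p tag preceded by plain text, like the first `<p>` after `aaa`, falls through to the last alternative and is removed. One possible fix is to put every tag to keep into a single capture group. This is a sketch; the names `KEEP_P_BR` and `strip_except_p_br` are illustrative, and it relies on Python 3.5+ substituting an empty string for an unmatched group.

```python
import re

# Group 1 captures the tags to keep. Any other tag matches the second
# alternative, where group 1 is unmatched and is replaced by nothing.
KEEP_P_BR = re.compile(r'(</?p>|<br\s*/?>)|<[^>]*>', re.IGNORECASE)

def strip_except_p_br(html):
    return KEEP_P_BR.sub(r'\1', html)

sample = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular.</p><span>98.2%</span>bb'
print(strip_except_p_br(sample))
# prints: aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular.</p>98.2%bb
```

The order of the alternatives matters: the capture group has to come first so that p and br tags are tried before the catch-all `<[^>]*>`.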
