Remove tags from HTML, except for specific tags (but keep their contents)

import re

MyString = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
# Keep <br> and <br/> (group 1); every other tag is replaced by an empty group.
MyString = re.sub(r'(?i)(<br\/?>)|<[^>]*>', r'\1', MyString)
print(MyString)

3 answers

User

#1 · edited 2024-06-03 02:45:11

I'm not sure regex is the right solution here, but since you asked:

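import re

# `html` is assumed to already hold the input markup as a str.
html = html.replace("<p>", "{p}").replace("</p>", "{/p}")  # protect the <p> tags
txt = re.sub("<[^>]*>", "", html)                           # delete every remaining tag
txt = txt.replace("{p}", "<p>").replace("{/p}", "</p>")     # restore the <p> tags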

In effect, I swap the p tags for placeholder markers, delete all remaining tags, and then swap the placeholders back.
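The placeholder idea can be stretched to cover the case in the question, where both p and br should survive in any letter case and with an optional trailing slash. The following is only a sketch: the function name `strip_tags_keep_p_br` and the sample string are made up for illustration, and it assumes the control characters `\x00` and `\x01` never occur in the input.

```python
import re

# Matches <p>, </p>, <br>, <br/> in any letter case (no attributes).
KEEP = re.compile(r"(?i)<(/?(?:p|br)/?)>")

def strip_tags_keep_p_br(html):
    """Hypothetical helper: drop every tag except bare p and br tags."""
    # Step 1: hide the tags to keep behind placeholder characters.
    # Assumption: \x00 and \x01 do not appear anywhere in the input.
    hidden = KEEP.sub(lambda m: "\x00" + m.group(1) + "\x01", html)
    # Step 2: delete every tag that is still written with < and >.
    stripped = re.sub(r"<[^>]*>", "", hidden)
    # Step 3: turn the placeholders back into angle brackets.
    return stripped.replace("\x00", "<").replace("\x01", ">")

sample = 'aaa<p>Radio and<BR> television.<br/></p><span class="x">98.2%</span>bb'
result = strip_tags_keep_p_br(sample)
print(result)
```

Tags that carry attributes, such as `<p class="x">`, are not protected by this pattern and would be removed along with the rest.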

In general, parsing HTML with regex is not a good idea.

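User

#2 · edited 2024-06-03 02:45:11

Using an HTML parser is far more robust than using regex. Regex should not be used to parse nested structures such as HTML.

Here is a working implementation that iterates over all HTML tags and unwraps every tag that is not p or br, so the tag is removed while its contents stay in place: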

from bs4 import BeautifulSoup

mystring = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'

soup = BeautifulSoup(mystring,'html.parser')
for e in soup.find_all():
    if e.name not in ['p','br']:
        e.unwrap()
print(soup)

Output:

aaa<p>Radio and<br/> television.<br/></p><p>very<br> popular in the world today.</br></p><p>Millions of people watch TV. </p><p>That’s because a radio is very small 98.2%</p><p>and it‘s easy to carry. haha100%</p>bb
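If installing BeautifulSoup is not an option, a similar result can be sketched with the standard library's `html.parser` module. This is an alternative to the answer's approach, not a reproduction of it: the class name `KeepTagsParser`, the helper `strip_tags`, and the sample string are hypothetical, and the sketch deliberately drops attributes on the tags it keeps.

```python
from html.parser import HTMLParser

class KeepTagsParser(HTMLParser):
    """Hypothetical parser that re-emits text plus a small set of allowed tags."""

    def __init__(self, keep=("p", "br")):
        # convert_charrefs=False so entities reach the handlers below unchanged.
        super().__init__(convert_charrefs=False)
        self.keep = set(keep)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are intentionally dropped.
        if tag in self.keep:
            self.parts.append(f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing syntax such as <br/>.
        if tag in self.keep:
            self.parts.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

def strip_tags(html, keep=("p", "br")):
    parser = KeepTagsParser(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

sample = 'aaa<p>Radio and<BR> television.<br/></p><span style="color: black;">98.2%</span>bb'
result = strip_tags(sample)
print(result)
```

Because the parser lowercases tag names, `<BR>` comes back as `<br>`; comments and doctype declarations are discarded since no handler re-emits them.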

User

#3 · edited 2024-06-03 02:45:11

I now know how to modify it, but the first <p> is still missing.

My code:

import re
MyString = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
# MyString = re.sub('(?i)(<br\/?>)|<[^>]*>',r'\1', MyString)
MyString = re.sub('(?i)(<br\/?>)|<[^>]*>(<\/?p>)|<[^>]*>',r'\1\2', MyString)
print(MyString)

Output:

aaaRadio and<BR> television.<br><p>very<br/> popular in the world today.<p>Millions of people watch TV. <p>That’s because a radio is very small 98.2%</p>and it‘s easy to carry. haha100%</p>bb
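The lost tags in that output come from the second alternative, which only keeps a p tag when another tag sits directly in front of it. One possible way around this, offered as a sketch and not as what any poster in the thread used, is a negative lookahead that refuses to match p and br tags at all. The names `TAGS_EXCEPT_P_BR` and `strip_other_tags` and the shortened sample string are invented for this example.

```python
import re

# (?i)                 case-insensitive matching
# <(?!/?(?:p|br)\b)    a "<" that is NOT followed by p, /p, br or /br as a whole word
# [^>]*>               the rest of the tag up to the closing ">"
TAGS_EXCEPT_P_BR = re.compile(r"(?i)<(?!/?(?:p|br)\b)[^>]*>")

def strip_other_tags(html):
    """Hypothetical helper: delete every tag except p and br."""
    return TAGS_EXCEPT_P_BR.sub("", html)

sample = ('aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular.</p>'
          '<span style="color: black;">98.2%</span>bb')
cleaned = strip_other_tags(sample)
print(cleaned)
```

The `\b` matters: without it, tags such as `<pre>` would be mistaken for p and kept. The usual caveat from the other answers still applies, since a regex like this can be fooled by a `>` inside an attribute value or by comments.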
