re.sub does nothing even though the regex pattern is found?

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

tstr = r'''
<div class="thebibliography">
[1] <a id="Xtester"></a>Akegctor, P. D. H. testöng ... . Draftin: <a
href="http://www.example.com/test.html" class="url" >http://www.example.com/test.html</a> (2001).
</div>
'''

# remove <a id>
tout2 = re.sub(r'''<a[\s]*?id=['"].*?['"][\s]*?></a>''', " ", tstr, re.DOTALL)

# remove class= in <a
regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
print( re.findall(regstr, tout2, re.DOTALL))  # finds
print("------")
print( re.sub(regstr, "AAAAAAA", tout2, re.DOTALL ))  # does nothing?

------

<div class="thebibliography">
[1]  Akegctor, P. D. H. testöng ... . Draftin: <a
href="http://www.example.com/test.html" class="url" >http://www.example.com/test.html</a> (2001).
</div>

3 Answers

User

#1 · Edited 2024-09-30 20:27:47

Why not use an HTML parser to parse and modify the HTML?

For example, using BeautifulSoup and replace_with():

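from bs4 import BeautifulSoup

data = """Your html here"""
# Name the parser explicitly; without it bs4 guesses one and emits a warning
soup = BeautifulSoup(data, "html.parser")

# soup('a', id=True) is shorthand for soup.find_all('a', id=True)
for link in soup('a', id=True):
    link.replace_with('AAAAAA')

print(soup.prettify())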

This replaces every link that has an id attribute with the text AAAAAA:

<div class="thebibliography">
<p class="bibitem">
<span class="biblabel">
 [1]
 <span class="bibsp">
 </span>
</span>
AAAAAA
<span class="cmcsc-10">
...
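
If the goal is the one in the question (drop the empty `<a id=...>` anchors and strip `class` from the remaining links) rather than replacing links with text, a minimal sketch along the same lines could look like this. The sample markup is an assumption modelled on the question's input, and `html.parser` is chosen only because it ships with Python:

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's input (an assumption, not the real file).
data = '''<div class="thebibliography">
[1] <a id="Xtester"></a>Akegctor, P. D. H. Draftin: <a
href="http://www.example.com/test.html" class="url" >http://www.example.com/test.html</a> (2001).
</div>'''

soup = BeautifulSoup(data, "html.parser")

# Remove the empty <a id="..."> anchors entirely.
for anchor in soup.find_all("a", id=True):
    anchor.decompose()

# Strip the class attribute from every remaining link that has one.
for link in soup.find_all("a", class_=True):
    del link["class"]

result = str(soup)
print(result)
```

Because the parser works on tags rather than on raw text, the line break inside the `<a` tag does not matter here.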

See also:

RegEx match open tags except XHTML self-contained tags

User

#2 · Edited 2024-09-30 20:27:47

Your replacement does not work because re.sub is being called incorrectly. If you look at the documentation:

re.sub(pattern, repl, string, count=0, flags=0)

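But in your code, you put the flags where count belongs. That is why the re.DOTALL flag is ignored: it sits in the wrong position and is taken as the count argument.

Since you do not need the count parameter, you can drop the re.DOTALL flag and use an inline modifier instead: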

regstr = r'''(?s)(<a.*?)(class=['"].*?['"])([\s]*>)'''
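
To see the difference the inline modifier makes, here is a small sketch. The `html` sample (an `<a` tag split across two lines) and the `\1\3` replacement that keeps everything except the class attribute are assumptions for illustration, not taken from the question:

```python
import re

# Hypothetical sample: the <a> tag is split across two lines.
html = '<a\nhref="http://www.example.com/test.html" class="url" >link</a>'

plain = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
inline = r'''(?s)''' + plain  # same pattern with DOTALL switched on inline

# Without DOTALL, "." stops at the newline, so nothing is replaced.
unchanged = re.sub(plain, r"\1\3", html)

# With (?s), "." also matches the newline; keeping groups 1 and 3
# drops the class attribute (group 2).
stripped = re.sub(inline, r"\1\3", html)

print(unchanged == html)
print(stripped)
```

The inline form keeps the flag attached to the pattern itself, so it cannot end up in the wrong argument slot.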

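That said, using something like bs4 is probably more convenient (as shown in @alecxe's answer).

User

#3 · Edited 2024-09-30 20:27:47

It is simple: the Python Standard Library reference gives the syntax of re.sub as re.sub(pattern, repl, string, count=0, flags=0). So your last sub call is actually (since re.DOTALL == 16):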

re.sub(regstr, "AAAAAAA", tout2, count = 16, flags = 0 )

when what you want is:

re.sub(regstr, "AAAAAAA", tout2, flags = re.DOTALL )

That last sub works fine.
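
The two calls can be compared side by side. In this sketch the `text` sample is an assumption (an `<a` tag split across two lines), and the `re.compile` variant is an extra option not mentioned above, shown only as another way to keep the flag out of the count slot:

```python
import re

regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
# Hypothetical sample with the <a> tag split across two lines.
text = '<a\nhref="http://www.example.com/test.html" class="url" >link</a>'

# re.DOTALL is an integer flag with the value 16.
dotall_value = int(re.DOTALL)

# What the positional call amounts to: at most 16 replacements, no flags.
as_count = re.sub(regstr, "AAAAAAA", text, count=16, flags=0)

# What was intended: the flag passed by keyword.
as_flags = re.sub(regstr, "AAAAAAA", text, flags=re.DOTALL)

# Compiling the pattern first also makes the flag's role unambiguous.
compiled = re.compile(regstr, re.DOTALL)
as_compiled = compiled.sub("AAAAAAA", text)

print(dotall_value)
print(as_count == text)
print(as_flags)
```

With `count=16` the text comes back unchanged because "." cannot cross the newline; with `flags=re.DOTALL` the whole opening tag is replaced.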
