Scraping paragraphs from old HTML with a Python regular expression

We have no need to fear the future." So said bishop-elect H. George Anderson at a news conference immediately following his election as bishop of the Evangelical Lutheran Church in America. "[The future] belongs to God, untouched by human hands." At the beginning of a new ministry of leadership and pastoral oversight, such words from a bishop are obviously designed to project confidence and a profound sense of trust in the mission of the Church. They are words designed to inspire and empower the people of God for ministry.<o:p></o:p>

2 Answers

User

Answer 1 · edited 2024-09-30 01:22:44

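Although (as a commenter noted) you should not parse HTML this way, for a one-off job like this the following solution may be good enough.

Your regex fails on the first paragraph because . does not match a newline, and there is a line break inside your tag. You can use a trick such as [\S\s] to match every character, newlines included.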
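To make the newline issue concrete, here is a minimal sketch. The sample string and the simplified patterns are made up for illustration and are not the markup from the question; the sketch only contrasts ., [\S\s], and the re.DOTALL flag.

```python
import re

# Hypothetical sample: the opening tag contains a line break.
html = '<p class=bodyDC\nstyle="margin:0">Some text</p>'

# "." stops at the newline, so this pattern finds nothing.
dot_matches = re.findall(r"<p.*?>(.*?)</p>", html)

# [\S\s] matches any character at all, including "\n".
any_matches = re.findall(r"<p[\S\s]*?>([\S\s]*?)</p>", html)

# re.DOTALL makes "." behave the same way.
dotall_matches = re.findall(r"<p.*?>(.*?)</p>", html, flags=re.DOTALL)

print(dot_matches)     # []
print(any_matches)     # ['Some text']
print(dotall_matches)  # ['Some text']
```

Either form works; [\S\s] keeps the behaviour local to one part of the pattern, while re.DOTALL changes every . in it.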

This does not strip the tags left at the end of each paragraph, but I hope it helps anyway:

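import re
# str1 holds the HTML source; every backslash is doubled because this is not a raw string
for g1, g2, content in re.findall("<p (class=bodyDC|class=BODY)[^><]*>(<[\\S\\s]*?>)*([\\S\\s]*?)<\\/p>", str1):
    print(content)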

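Explanation:

<p (class=bodyDC|class=BODY)[^><]*> matches the opening paragraph tag:
<p : start of the tag
(class=bodyDC|class=BODY): one of the two class attributes
[^><]*: any other attributes inside the tag
>: end of the tag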

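(<[\S\s]*?>)* matches any tags that immediately follow the opening tag:
<: start of a tag
[\S\s]*?: the rest of the tag, matched lazily ([^><]* would also work)
>: end of the tag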

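([\S\s]*?) matches any text. This is the third group, which is essentially the content (plus any trailing tags).

<\/p> matches the closing paragraph tag. (Note that in the code it actually appears as <\\/p>, because a backslash must be escaped in a Python string literal unless it is a raw string.)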
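As a rough illustration of the breakdown above, the sketch below runs the same pattern against a small made-up sample and then strips the leftover tags with a second, much simpler pattern. The sample HTML and the names PARAGRAPH and TAG are assumptions for this example. The pattern is written as a raw string, so the closing tag can be written as plain </p>.

```python
import re

# Hypothetical sample input modelled on old Word-exported HTML.
html = """<p class=bodyDC style='margin:0'><span
lang=EN>We have no need to fear the future.<o:p></o:p></span></p>
<p class=other>skip me</p>
<p class=BODY>Second paragraph.</p>"""

PARAGRAPH = re.compile(
    r"<p (class=bodyDC|class=BODY)[^><]*>"  # opening <p ...> tag
    r"(<[\S\s]*?>)*"                        # tags right after it
    r"([\S\s]*?)"                           # group 3: the content
    r"</p>"                                 # closing tag
)

# Any remaining tag, e.g. <o:p> or </span> (an addition, not part of the answer).
TAG = re.compile(r"<[^>]*>")

paragraphs = [
    TAG.sub("", content).strip()
    for _class_attr, _leading_tags, content in PARAGRAPH.findall(html)
]
print(paragraphs)
# ['We have no need to fear the future.', 'Second paragraph.']
```

The tag-stripping step is deliberately naive: it would also remove anything that merely looks like a tag inside the text, which is usually acceptable for a one-off cleanup.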

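User
Answer 2 · edited 2024-09-30 01:22:44

I would take a two-step approach to this problem.
First, collect all the paragraphs of interest.
Second, extract the text from each paragraph.
Step one
Parse out all paragraphs that have the desired class.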
<p\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=(['"]?)(?:body|bodydc)\1(?:\s|>)(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)*(?=<\/p>)
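This regex does the following:
Finds all paragraph tags with the given class, up to but not including the closing </p>
Avoids some awkward edge cases, such as a > inside a quoted attribute value
Because of the limitations of regex, it will not work for nested paragraph tags such as <p>outside paragraph<p>inside paragraph</p>more text in the outside</p>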
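A minimal Python sketch of this first step follows. The pattern is copied from above and only split across lines; the sample HTML and the name STEP_ONE are made up for illustration. The pattern spells the class names in lowercase, so the sketch assumes the re.IGNORECASE flag is wanted in order to match bodyDC and BODY.

```python
import re

# Pattern from step one, split into pieces for readability only.
STEP_ONE = re.compile(
    r"""<p\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?"""
    r"""\sclass=(['"]?)(?:body|bodydc)\1(?:\s|>)"""
    r"""(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)*"""
    r"""(?=<\/p>)""",
    re.IGNORECASE,  # assumption: class names should match regardless of case
)

# Hypothetical sample input.
html = (
    "<p class=bodyDC style='margin:0'><span lang=EN>"
    "We have no need to fear the future.<o:p></o:p></span></p>\n"
    "<p class=other>skip me</p>\n"
    "<p class=BODY>Second paragraph.</p>"
)

# group(0) is the whole paragraph, from "<p" up to (not including) "</p>".
blocks = [m.group(0) for m in STEP_ONE.finditer(html)]
for block in blocks:
    print(block)
```

Each entry in blocks still contains the opening tag and any inner tags. On long input that has a matching opening tag but no closing </p>, a pattern of this shape can backtrack heavily, so it is best tried on a small sample first.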
Step two
Extract the raw text from each paragraph.
(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)
This regex does the following:
Matches both raw text and tags
Puts the raw text into capture group 1
Avoids the difficult edge cases
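The second step could be sketched in Python as below. The block string and the name STEP_TWO are hypothetical; the block is shaped like one match from the first step. The sketch relies on re.findall returning capture group 1 for every match, and assumes Python 3.7 or later, where a tag can still be matched right after an empty match at the same position.

```python
import re

# Pattern from step two, unchanged.
STEP_TWO = re.compile(
    r"""(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)"""
)

# Hypothetical block, shaped like one match from the first step.
block = "<p class=BODY><b>Second</b> paragraph.<o:p></o:p>"

# With a single capture group, findall returns group 1 for each match.
# Matches that consumed a tag leave group 1 empty, so joining keeps only text.
text = "".join(STEP_TWO.findall(block))
print(text)  # Second paragraph.
```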
