Python regex findall hangs and never returns

Posted 2024-09-25 02:31:53


I've just run into something strange. I'm building a text-crawling prototype that uses the Open ANC as its corpus.

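On some texts the re module never returns. I'd be glad if someone could confirm whether the re module can handle a regular expression of this complexity.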

The regular expression is: preceding(?:[^A-Za-z0-9\n\r]*\w+[^A-Za-z0-9\n\r]*)+acquired

The text that triggers the problem is:

My claim is that Lincoln’s address expresses the same idea that was then current in Europe. Each people of common history and language constitutes a nation, and the natural form for the nation’s survival was in a state structure. The idea that Americans constituted an organic national unit explained, implicitly, why the eleven Southern states could not go their own way. As he assumed the presidency, Lincoln still spoke of the Union rather than a nation; but in the course of the debates in the decades immediately preceding, the notion of union had acquired the metaphysical qualities of nationhood. In his first inaugural address, Lincoln invoked the “bonds of affection,” and even before shots were fired on Fort Sumter in Charleston Harbor, he stressed the unbreakable ties of historical struggle:

The Python code that reproduces the problem:

import re

txt = "post text here"
regex = r"preceding(?:[^A-Za-z0-9\n\r]*\w+[^A-Za-z0-9\n\r]*)+acquired"
re.findall(regex, txt)

1 answer
User
#1 · Posted 2024-09-25 02:31:53

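Your pattern suffers from catastrophic backtracking. The \w+ sits inside a repeated group whose surrounding separators are optional, so a single run of word characters can be split across iterations of the group in exponentially many ways, and the engine tries them all before giving up on a position.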
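To see the blow-up on a small scale, here is a minimal sketch that times the original pattern on inputs that cannot match. The helper name `time_search` and the test strings are assumptions made for illustration, not part of the question. A run of n letters can be divided between group iterations in 2^(n-1) ways, so the time should grow roughly 16-fold for every four extra letters; the exact timings depend on the machine.

```python
import re
import time

# The pattern from the question, unchanged.
slow = re.compile(r"preceding(?:[^A-Za-z0-9\n\r]*\w+[^A-Za-z0-9\n\r]*)+acquired")

def time_search(n):
    """Time a failing search on 'preceding ' followed by n letters (hypothetical helper)."""
    text = "preceding " + "a" * n  # no "acquired", so every split of the run is tried
    start = time.perf_counter()
    result = slow.search(text)
    return result, time.perf_counter() - start

# Keep n small: each extra letter roughly doubles the work.
for n in (10, 14, 18):
    result, elapsed = time_search(n)
    print(f"n={n:2d}  match={result}  {elapsed:.4f}s")
```

In the full paragraph from the question, the greedy group first runs to the end of the text, and the long tail after "acquired" has to be backed out of in the same exponential fashion, which is why the call appears to hang.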

Here is an alternative pattern that works for your input:

regex = r"preceding[^A-Za-z0-9\n\r]+(?:\w+[^A-Za-z0-9\n\r]+)+?acquired"

This assumes that words are always separated by at least one non-word character (otherwise it would only match one long, unbroken word).
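As a quick check, the alternative pattern can be run against the relevant sentence. This is only a sketch: `excerpt` is a shortened copy of the passage from the question, chosen here for illustration.

```python
import re

# Shortened excerpt of the passage from the question (illustrative assumption).
excerpt = (
    "in the decades immediately preceding, the notion of union had "
    "acquired the metaphysical qualities of nationhood."
)

# Alternative pattern from the answer: separators are mandatory (+),
# and the group is lazy (+?) so it stops at the first "acquired".
regex = r"preceding[^A-Za-z0-9\n\r]+(?:\w+[^A-Za-z0-9\n\r]+)+?acquired"

matches = re.findall(regex, excerpt)
print(matches)  # ['preceding, the notion of union had acquired']
```

Because every word must now be followed by at least one separator, a run of letters can only be consumed one way, which removes the ambiguity that caused the backtracking.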

(See also: How can I recognize an evil regex?)
