Splitting a text document into sections in Python with regex matching on section headings

import re

doc_splitter = re.compile(r"(?<=\n)(?P<secname>[\w]+ )(\d+\.\d+ .*?)(?<=\n)(?P<secname2>[\w]+ )(?=\d+\.\d+|\Z)", re.DOTALL)
text = """

Section 1.1 Lorem Ipsum

Blah blah blah
9
Bleh bleh bleh Section 1.1 hey hey hey

Section 1.2 Lorem Ipsumus
ref Section 1.3

Blah blah blah

Section 1.3 hey hey

Section 1.4

"""
for match in doc_splitter.finditer(text):
    print([match.group()])

3 answers

User

Answer 1 · edited 2024-07-05 14:07:22

The regex you are looking for is probably something like:

doc_splitter = re.compile(r"(?<=\n)(\d+\.\d+ .*?)(?<=\n)(?=\d+\.\d+|$)", re.DOTALL)

Given your Python code, you can run it over the whole document with finditer and print each match. The output:

['1.1 Lorem Ipsum\n\nBlah blah blah\n9 (page break, never will have a period in it though)\nBleh bleh bleh\n\n']
['1.2 Lorem Ipsumus\n\nBlah blah blah\n']

This seems to be what you want.
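As a minimal sketch of such a run: the sample text below is a shortened stand-in (an assumption, not the asker's full document), with bare headings such as "1.1" at the start of a line, which is what this pattern expects.

```python
import re

doc_splitter = re.compile(r"(?<=\n)(\d+\.\d+ .*?)(?<=\n)(?=\d+\.\d+|$)", re.DOTALL)

# Hypothetical sample; the lone "9" imitates a page-break line without a period.
sample = (
    "\n1.1 Lorem Ipsum\n\nBlah blah blah\n9\nBleh bleh bleh\n\n"
    "1.2 Lorem Ipsumus\n\nBlah blah blah\n"
)

# Collect the full text of every match, one entry per section.
matches = [m.group() for m in doc_splitter.finditer(sample)]
for section in matches:
    print([section])
```

The "9" line does not end a section because the lookahead requires digits, a period, and more digits.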

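If you iterate over the data differently, you may be able to get rid of the cumbersome lookaround assertions, which may not translate cleanly to other languages that require fixed-length lookarounds. The core is (\d+\.\d+ .*?) combined with enforcing a full match.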
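One possible reading of "iterating differently" is to walk the document line by line and test each line for a heading, which needs no lookarounds at all. This is only a sketch of that idea; the function name split_by_lines and the sample text are assumptions made for illustration.

```python
import re

# re.match only matches at the start of the string, i.e. the start of each line here.
heading = re.compile(r"\d+\.\d+ ")

def split_by_lines(text):
    """Group lines into sections; a line starting with 'N.N ' opens a new one."""
    sections = []
    for line in text.splitlines(keepends=True):
        if heading.match(line):
            sections.append(line)      # start a new section at this heading
        elif sections:
            sections[-1] += line       # otherwise extend the current section
        # lines before the first heading are dropped
    return sections

sample = (
    "\n1.1 Lorem Ipsum\n\nBlah blah blah\n9\nBleh bleh bleh\n\n"
    "1.2 Lorem Ipsumus\n\nBlah blah blah\n"
)
result = split_by_lines(sample)
```

Because the heading test is an ordinary anchored match, the same logic should port to regex engines with stricter lookbehind rules.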

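Alternative solution

Jan's answer is good, but I would also like to add a solution that solves the problem without lookahead conditions, since they seem redundant: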

import re
doc_splitter = re.compile(r"^(?:Section\ )?\d+\.\d+", re.MULTILINE)
text = """

Section 1.1 Lorem Ipsum

Blah blah blah
9
Bleh bleh bleh Section 1.1 hey hey hey

Section 1.2 Lorem Ipsumus 
ref Section 1.3

Blah blah blah

Section 1.3 hey hey

Section 1.4

"""
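# Offsets where each section heading starts, plus the end of the text as a sentinel.
starts = [match.start() for match in doc_splitter.finditer(text)] + [len(text)]
# Each section runs from one heading start up to the next one.
sections = [text[begin:end] for begin, end in zip(starts, starts[1:])]
for section in sections:
    print([section])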

Prints:

['Section 1.1 Lorem Ipsum\n\nBlah blah blah\n9\nBleh bleh bleh Section 1.1 hey hey hey\n\n']
['Section 1.2 Lorem Ipsumus \nref Section 1.3\n\nBlah blah blah\n\n']
['Section 1.3 hey hey\n\n']
['Section 1.4\n\n']

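The regex only searches for the start of each new section, so it should be easy to maintain and extend. The price is one extra step: we have to slice text by hand at each new start, which is also the end of the previous section.

Although a regex is perfectly capable of handling this match in a single step, I personally prefer to keep regexes as short as possible. They are hard enough to understand already.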
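To illustrate the "easy to extend" point, here is a hedged sketch that wraps the same start-offset idea in a function and also captures the section number. The function name split_sections, the named group number, and the sample text are assumptions added for this example.

```python
import re

# Same heading pattern as above, with the number captured in a named group.
heading = re.compile(r"^(?:Section\ )?(?P<number>\d+\.\d+)", re.MULTILINE)

def split_sections(text):
    """Return (section_number, section_text) pairs in document order."""
    matches = list(heading.finditer(text))
    # Each section ends where the next heading starts; the last ends at len(text).
    ends = [m.start() for m in matches[1:]] + [len(text)]
    return [(m.group("number"), text[m.start():end]) for m, end in zip(matches, ends)]

sample = "\nSection 1.1 Intro\nbody Section 1.2 inline\n\nSection 1.2 Next\nmore\n"
pairs = split_sections(sample)
```

The inline mention "Section 1.2 inline" does not start a new section, because ^ with re.MULTILINE only matches at the start of a line.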

User

Answer 2 · edited 2024-07-05 14:07:22

Here are my two cents. You could use:

^
(?:Section\ )?\d+\.\d+
[\s\S]*?
(?=^(?:Section\ )?\d+\.\d+|\Z)

Use it with the verbose and multiline modifiers (re.VERBOSE and re.MULTILINE in Python); see a demo on regex101.com.
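A minimal sketch of how the pattern above might be compiled in Python. The inline comments and the shortened sample text are additions for illustration, not part of the original answer.

```python
import re

section_re = re.compile(
    r"""
    ^
    (?:Section\ )?\d+\.\d+           # heading at the start of a line
    [\s\S]*?                         # section body, as short as possible
    (?=^(?:Section\ )?\d+\.\d+|\Z)   # stop before the next heading or at the very end
    """,
    re.VERBOSE | re.MULTILINE,
)

# Hypothetical sample; "ref Section 1.2" must not start a new section.
sample = (
    "\nSection 1.1 Lorem Ipsum\n\nBlah blah\nref Section 1.2\n\n"
    "Section 1.2 Lorem Ipsumus\n\nBleh\n"
)

# The pattern has no capturing groups, so findall returns the full matches.
sections = section_re.findall(sample)
for section in sections:
    print([section])
```

In verbose mode unescaped whitespace is ignored, which is why the space after "Section" is written as an escaped space.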

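User

Answer 3 · edited 2024-07-05 14:07:22

I suggest you try regex101.com, which helps you visualize your regular expression. The documentation for re is also very useful for learning (or remembering) how the special characters work.

For your example, I would use the following regex (with named groups):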

(?P<section_number>\d\.\d) (?P<section_title>[\w ]+)\n\n\s*(?P<body>.+?)\s*(?=\d\.\d[\w ]+|$)

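Breakdown:

For the section number and title, I used the named groups (?P<section_number>\d\.\d) and (?P<section_title>[\w ]+), separated by a space.

The body (?P<body>.+?) is followed by a positive lookahead, (?=\d\.\d[\w ]+|$). This means it stops capturing text when another section is about to start or when the document ends. It has to be non-greedy (+?); otherwise you would get a single section with the rest of the document as its body.

Note: when compiling or searching for matches, you need to enable re.DOTALL; otherwise the dot will not match newline characters.

If you want section headings to match only at the beginning of a line, you can also add a ^ to the lookahead, but then you need to enable re.MULTILINE. You must also change the trailing $ to \Z so that it matches only the end of the document rather than the end of every line.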
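The line-anchored variant described above could look roughly like this. It is a sketch under the stated changes (^ added to the lookahead, $ replaced by \Z, both flags enabled); the sample text is made up for illustration.

```python
import re

section_re = re.compile(
    r"(?P<section_number>\d\.\d) (?P<section_title>[\w ]+)\n\n\s*"
    r"(?P<body>.+?)\s*(?=^\d\.\d[\w ]+|\Z)",
    re.DOTALL | re.MULTILINE,
)

# Hypothetical sample; the inline "1.2" reference must stay inside the first body.
sample = (
    "1.1 Lorem Ipsum\n\nBlah blah blah\nsee 1.2 below\n\n"
    "1.2 Lorem Ipsumus\n\nBleh bleh\n"
)

# groupdict() returns the named groups of each match as a dictionary.
parsed = [m.groupdict() for m in section_re.finditer(sample)]
for entry in parsed:
    print(entry)
```

Note that \d\.\d accepts single-digit numbers only; headings such as 1.10 would need \d+\.\d+ instead.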
