Splitting a text document into sections in Python by regex-matching the section headings


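I have a document whose sections are clearly marked by headings, and I want to use those headings to split the file into sections. Example:

<pre><code>1.1 Lorem Ipsum

Blah blah blah
9 (page break, never will have a period in it though)
Bleh bleh bleh as referenced in Section 1.3 hey hey hey

1.2 Lorem Ipsumus

Blah blah blah
</code></pre>

I would like a regex that captures a heading and the text that follows it, up to the next heading. For this example, the desired result is the section that starts at the 1.1 heading and runs up to (but not including) the 1.2 heading, as well as

<pre><code>1.2 Lorem Ipsumus

Blah blah blah
</code></pre>

The one thing I can always count on is that a section heading starts on a new line, begins with a number of the form x.x, and is followed by a few words. Since that is what makes the headings unique, that is what I want to search for.

Basically, if I see a new line of the form "Section 1.2 Definitions", I know it starts a new section, and I want to grab all the text from there until the next line that begins with "Section 1.3 Examples" or "Section 2.1 Terms". Section headings always start on a new line and take the form "Section 1.3 Example", "Article 1.3 Example", or "1.3 Example".

Sometimes a heading is referenced in the middle of a line, and I want to ignore those references. This can be seen in the example.

Does anyone know how to do this? Preferably in Python, but a plain regex would suffice.

P.S. Keeping the page numbers is optional, but the regex should preferably not create a new section based on a page number.

<hr/>

Edit: Here is the MWE I have running so far. It is not quite there.

<pre><code>import re

doc_splitter = re.compile(r"(?<=\n)(?P<secname>[\w]+ )(\d+\.\d+ .*?)(?<=\n)(?P<secname2>[\w]+ )(?=\d+\.\d+|\Z)", re.DOTALL)

text = """
Section 1.1 Lorem Ipsum

Blah blah blah
9
Bleh bleh bleh Section 1.1 hey hey hey

Section 1.2 Lorem Ipsumus ref Section 1.3

Blah blah blah

Section 1.3 hey hey

Section 1.4
"""

for match in doc_splitter.finditer(text):
    print([match.group()])
</code></pre>

Ideally, it would return:

<pre><code>['Section 1.1 Lorem Ipsum Blah blah blah 9 Bleh bleh bleh Section 1.1 hey hey hey']
['Section 1.2 Lorem Ipsumus ref Section 1.3 Blah blah blah']
['Section 1.3 hey hey']
['Section 1.4']
</code></pre>

But instead it returns:

<pre><code>['Section 1.1 Lorem Ipsum\n\nBlah blah blah\n9\nBleh bleh bleh Section 1.1 hey hey hey\n\nSection ']
['Section 1.3 hey hey\n\nSection ']
</code></pre>

Thanks for the help, everyone! If anyone has any ideas on how to solve this last problem, it would be much appreciated.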
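One possible way to get the desired split, offered as a sketch and not a definitive solution: the pattern in the MWE consumes the prefix word of the following heading (the trailing `Section `), so every second section is skipped. An alternative is to match only the start of each heading with `re.MULTILINE` and slice the text between consecutive heading positions. The names `HEADING`, `split_sections`, and `sample` are hypothetical, and the prefix words `Section` and `Article` are assumptions based on the heading formats described in the question.

```python
import re

# A heading must begin a line (re.MULTILINE makes ^ match after every newline).
# It may carry an optional prefix word ("Section" or "Article" are assumed here)
# and must then contain a number of the form x.x.
HEADING = re.compile(r"^(?:(?:Section|Article)[ \t]+)?\d+\.\d+\b", re.MULTILINE)


def split_sections(text):
    """Return each heading together with the text up to the next heading."""
    starts = [m.start() for m in HEADING.finditer(text)]
    # Each section ends where the next one starts; the last runs to the end.
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


sample = """
Section 1.1 Lorem Ipsum

Blah blah blah
9
Bleh bleh bleh Section 1.1 hey hey hey

Section 1.2 Lorem Ipsumus ref Section 1.3

Blah blah blah

Section 1.3 hey hey

Section 1.4
"""

for section in split_sections(sample):
    print([section])
```

Because the pattern is anchored to the start of a line, a mid-line reference such as `ref Section 1.3` does not start a new section, and a bare page number such as `9` does not match because it lacks the x.x form. Page numbers stay inside the section they appear in, and any text before the first heading is dropped.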


1 answer

Anonymous, 1 day ago