How do I construct a regex to match the various strings between repeating delimiters?

Posted 2024-05-17 19:43:16


I have a string in the following format:

GENESIS 1:1 In the beginning God created the heavens ... the ground. 2:7 And the LORD ... I buried Leah. 49:32 The purchase of the field and of the cave ... and he was put in a coffin in Egypt. EXODUS 1:1 Now these are the names ...

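Using only a single regular expression, I would like to match

  1. The book name
  2. The chapter number (above: 1, 2, 49, 1)
  3. The verse number (above: 1, 7, 32, 1)
  4. The verse itself
    • Taking the first one as an example: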

    (GENESIS)g1 (1)g2:(1)g3 (In the beginning God created the heavens ...)g4

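This requires me to match everything between the number:number pairs individually while keeping the other groups, and to stay within the restriction of fixed-length lookaheads/lookbehinds. That last part is the one proving difficult.

So far my expression is (%(BOOK1)s) (\d+):(\d+)\s?(.+?)\s?(?=\d|%(BOOK2)s|$), where BOOK1 and BOOK2 change as I iterate through a predetermined list. The $ is there because the last book has no BOOK2 after it. I call re.finditer() with this expression on the whole string and then iterate over the match objects to produce my groups.

The functional part of my expression is currently (\d+):(\d+)\s?(.+?)\s?(?=\d|%(BOOK2)s|$), but on its own it effectively always treats GENESIS as BOOK1 and matches everything from just after ^ up to BOOK2.

Alternatively, keeping my full expression (%(BOOK1)s) (\d+):(\d+)\s?(.+?)\s?(?=\d|%(BOOK2)s|$) as it is returns only the first desired match.

I suspect that some of my greedy/non-greedy terms are malformed, or that I could make better use of leading/trailing expressions. Any feedback would be greatly appreciated.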
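
Both behaviours described above can be reproduced with a short sketch. It assumes BOOK1 is GENESIS and BOOK2 is EXODUS for one pass over the book list; the variable names are illustrative and not taken from the original code.

```python
import re

s = ("GENESIS 1:1 In the beginning God created the heavens ... the ground. "
     "2:7 And the LORD ... I buried Leah. "
     "49:32 The purchase of the field and of the cave ... "
     "and he was put in a coffin in Egypt. "
     "EXODUS 1:1 Now these are the names ...")

# Assumed values for a single iteration over the book list.
books = {"BOOK1": "GENESIS", "BOOK2": "EXODUS"}

full = r"(%(BOOK1)s) (\d+):(\d+)\s?(.+?)\s?(?=\d|%(BOOK2)s|$)" % books
partial = r"(\d+):(\d+)\s?(.+?)\s?(?=\d|%(BOOK2)s|$)" % books

full_matches = [m.groups() for m in re.finditer(full, s)]
partial_matches = [m.groups() for m in re.finditer(partial, s)]

# The book name occurs once, so a pattern that requires it can match once.
print(len(full_matches))                  # 1
# Without the book name every verse matches, but the book is not captured.
print([m[:2] for m in partial_matches])   # [('1', '1'), ('2', '7'), ('49', '32'), ('1', '1')]
```

The full pattern demands the literal book name at the start of every match, which is why only the first verse of each book is found.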


2 answers

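One option is to use the regex module from PyPI together with the \G anchor.

Capture group 1 holds the book name; groups 2, 3 and 4 hold the chapter number, the verse number and the verse text that follow it.

When looping over the results, you can check whether group 1 is present.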

\b(?:([A-Z]{2,})(?= \d+:\d)|\G(?!^))(?:(\d+):(\d+))?\s*((?:[^\dA-Z]+|\d++(?!:\d)|[A-Z](?![A-Z]+ \d+:\d))*)

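Explanation

  • \b A word boundary
  • (?: Non-capturing group
    • ([A-Z]{2,})(?= \d+:\d) Capture group 1: match 2 or more uppercase characters, and assert that directly to the right are a space, 1+ digits, a colon and a digit
    • | Or
    • \G(?!^) Assert the position at the end of the previous match, but not at the start of the string
  • ) Close the non-capturing group
  • (?: Non-capturing group
    • (\d+):(\d+) Capture 1 or more digits in group 2 and in group 3, separated by a colon
  • )?\s* Close the group, make it optional, and match optional whitespace characters
  • ( Capture group 4
    • (?: Non-capturing group
      • [^\dA-Z]+ Match 1+ times any character except a digit or A-Z
      • | Or
      • \d++(?!:\d) Match 1+ digits possessively, and assert that what follows is not a colon followed by a digit
      • | Or
      • [A-Z](?![A-Z]+ \d+:\d) Match a character A-Z and assert that directly to the right is not 1+ characters A-Z, a space, 1+ digits, a colon and a digit
    • )* Close the group and repeat it 0+ times
  • ) Close group 4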
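
Two of the sub-patterns explained above can be tried in isolation. The sketch below uses the standard re module rather than regex, so it leaves out \G and the possessive quantifier; the first part shows what an ordinary \d+ does in place of \d++, and the variable names are illustrative.

```python
import re

# With an ordinary (non-possessive) \d+, the engine may give back a digit
# so that the negative lookahead passes on a shorter run of digits.
backtracking = re.match(r"\d+(?!:\d)", "49:32 The purchase")
print(backtracking.group())  # 4

# The uppercase branch: a capital letter is consumed only when it is not
# followed by more capitals, a space and "digits:digit" (a book title).
verse_body = re.compile(r"(?:[^\dA-Z]+|[A-Z](?![A-Z]+ \d+:\d))*")
sample = "And the LORD ... I buried Leah. EXODUS 1:1 Now these"
print(repr(verse_body.match(sample).group()))
# 'And the LORD ... I buried Leah. '
```

The first result suggests why the digit branch is written possessively: without it, part of a chapter number could leak into the verse text. The second shows that LORD is consumed as ordinary text while the match stops right before EXODUS.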


For example

import regex

pattern = r"\b(?:([A-Z]{2,})(?= \d+:\d)|\G(?!^))(?:(\d+):(\d+))?\s*((?:[^\dA-Z]+|\d++(?!:\d)|[A-Z](?![A-Z]+ \d+:\d))*)"
s = ("GENESIS 1:1 In the beginning God created the heavens ... the ground. 2:7 And the LORD ... I buried Leah. 49:32 The purchase of the field and of the cave ... and he was put in a coffin in Egypt. EXODUS 1:1 Now these are the names ...\n")
matches = regex.finditer(pattern, s)

for matchNum, match in enumerate(matches, start=1):
    if (match.group(1)):
        print(f"Book name: {match.group(1)}")
        print("               ")
    else:
        print(f"Chapter Nr: {match.group(2)}\nVerse Nr: {match.group(3)}\nThe verse: {match.group(4)}\n")

Output

Book name: GENESIS
               
Chapter Nr: 1
Verse Nr: 1
The verse: In the beginning God created the heavens ... the ground. 

Chapter Nr: 2
Verse Nr: 7
The verse: And the LORD ... I buried Leah. 

Chapter Nr: 49
Verse Nr: 32
The verse: The purchase of the field and of the cave ... and he was put in a coffin in Egypt. 

Book name: EXODUS
               
Chapter Nr: 1
Verse Nr: 1
The verse: Now these are the names ...

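I came up with a solution that uses only Python's built-in re module. The answer above put me on the right track. It turns out that the wrench I tried to throw in by testing LORD 2:8 ... is not actually a problem: throughout the whole string, capital letters that are not part of a title never appear directly before a number in that way, without punctuation between the [A-Z] and the \d.

Using the same example with the derived pattern: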

import re

pattern = r"(?:([A-Z]{2,})(?= \d+:\d)|(?!^))(?:(\d+):(\d+))?\s*((?:[^\dA-Z]+(?!:\d)|[A-Z](?![A-Z]+ \d+:\d))+)"
s = ("GENESIS 1:1 In the beginning God created the heavens ... the ground. 2:7 And the LORD ... I buried Leah. 49:32 The purchase of the field and of the cave ... and he was put in a coffin in Egypt. EXODUS 1:1 Now these are the names ...\n")
matches = re.finditer(pattern, s)
for matchNum, match in enumerate(matches, start=1):
    if (match.group(1)):
        print(f"Book name: {match.group(1)}")
        print("               ")
    else:
        print(f"Chapter Nr: {match.group(2)}\nVerse Nr: {match.group(3)}\nThe verse: {match.group(4)}\n")

As with regex, the output is:

Book name: GENESIS
               
Chapter Nr: 1
Verse Nr: 1
The verse: In the beginning God created the heavens ... the ground.

Chapter Nr: 2
Verse Nr: 7
The verse: And the LORD ... I buried Leah.

Chapter Nr: 49
Verse Nr: 32
The verse: The purchase of the field and of the cave ... and he was put in a coffin in Egypt.

Book name: EXODUS
               
Chapter Nr: 1
Verse Nr: 1
The verse: Now these are the names ...
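
If the goal is one record per verse rather than printed text, the loop can carry the most recent book name forward. This is a minimal sketch built on the re pattern above; parse_verses is a hypothetical helper name, and the code assumes the verse text itself contains no digits, as in the sample.

```python
import re

# Pattern taken from the pure-re answer above, split over two lines.
PATTERN = re.compile(
    r"(?:([A-Z]{2,})(?= \d+:\d)|(?!^))(?:(\d+):(\d+))?\s*"
    r"((?:[^\dA-Z]+(?!:\d)|[A-Z](?![A-Z]+ \d+:\d))+)"
)

def parse_verses(text):
    """Return (book, chapter, verse, verse_text) tuples (hypothetical helper)."""
    rows = []
    book = None
    for m in PATTERN.finditer(text):
        if m.group(1):
            # A book-name match: remember it for the verses that follow.
            book = m.group(1)
        elif m.group(2):
            rows.append((book, int(m.group(2)), int(m.group(3)),
                         m.group(4).strip()))
    return rows

s = ("GENESIS 1:1 In the beginning God created the heavens ... the ground. "
     "2:7 And the LORD ... I buried Leah. "
     "49:32 The purchase of the field and of the cave ... "
     "and he was put in a coffin in Egypt. "
     "EXODUS 1:1 Now these are the names ...\n")

rows = parse_verses(s)
for row in rows:
    print(row)
```

Each tuple then holds the four groups asked for in the question, with the book name repeated for every verse that belongs to it.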
