Regular expression for capturing scientific citations

Posted 2024-09-30 18:16:11

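I am trying to capture text inside parentheses that contains at least one digit (think citations). This is my current regular expression, and it works fine on regex101: https://regex101.com/r/oOHPvO/5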

\((?=.*\d).+?\)

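So I want it to capture (Author 2000) and (2000), but not (Author).

I am trying to capture all of these parenthesized citations with Python, but in Python the pattern also captures parenthesized text that contains no digits.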

import re

with open('text.txt') as f:
    text = f.read()

# Raw string, so the backslashes reach the regex engine unchanged
s = r"\((?=.*\d).*?\)"

citations = re.findall(s, text)

# Drop duplicate matches
citations = list(set(citations))

for c in citations:
    print(c)

Do you know what I am doing wrong?


2 Answers

You can use

re.findall(r'\([^()\d]*\d[^()]*\)', s)

Details

  • \( - a ( character
  • [^()\d]* - zero or more characters other than (, ) and digits
  • \d - a digit
  • [^()]* - zero or more characters other than ( and )
  • \) - a ) character.
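
A likely reason the original pattern misbehaves on a real file: the lookahead (?=.*\d) is not limited to the current pair of parentheses. It is satisfied by any digit later on the same line, so (Author) matches whenever a year appears further along. The sketch below compares the two patterns on a hypothetical one-line sample string (the sample text and variable names are assumptions for illustration only):

```python
import re

# Hypothetical sample: several citations on a single line
text = "Some (Author) and (Author 2000) and (2000)"

# Original pattern: .*\d in the lookahead scans to the end of the line,
# so a digit after the closing parenthesis still satisfies it.
lookahead = re.compile(r"\((?=.*\d).*?\)")

# Pattern from this answer: the digit must occur before any ( or ).
bounded = re.compile(r"\([^()\d]*\d[^()]*\)")

loose_matches = lookahead.findall(text)
strict_matches = bounded.findall(text)

print(loose_matches)   # ['(Author)', '(Author 2000)', '(2000)']
print(strict_matches)  # ['(Author 2000)', '(2000)']
```

On regex101 each test citation probably sat on its own line, which would hide the problem, since . does not cross line breaks by default.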

Python demo

import re
rx = re.compile(r"\([^()\d]*\d[^()]*\)")
s = "Some (Author) and (Author 2000)"
print(rx.findall(s)) # => ['(Author 2000)']

To get the results without the parentheses, add a capturing group:

rx = re.compile(r"\(([^()\d]*\d[^()]*)\)")
                    ^                ^

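
As a minimal sketch of how the capturing-group version behaves: when a pattern has exactly one group, re.findall returns the group contents rather than the whole match. The sample string is hypothetical, and dict.fromkeys is swapped in for the question's list(set(...)) on the assumption that keeping the citations in order of first appearance is useful:

```python
import re

rx = re.compile(r"\(([^()\d]*\d[^()]*)\)")

# Hypothetical sample with a repeated citation
s = "Some (Author) and (Author 2000), then (2000) and (Author 2000) again"

# With one capturing group, findall returns only the text inside the parentheses
inner = rx.findall(s)

# dict.fromkeys drops duplicates while keeping first-seen order
unique_inner = list(dict.fromkeys(inner))

print(inner)         # ['Author 2000', '2000', 'Author 2000']
print(unique_inner)  # ['Author 2000', '2000']
```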

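A more robust way to handle this expression, in case it needs to grow later, may be to add boundaries. For example, we can use character classes to spell out which characters we want to collect in each part: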

(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\)).

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))."

test_str = "some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)"

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for match_num, match in enumerate(matches, start=1):
    print("Match {match_num} was found at {start}-{end}: {match}".format(match_num=match_num, start=match.start(), end=match.end(), match=match.group()))

    # Group numbers start at 1; group 0 is the whole match
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {group_num} found at {start}-{end}: {group}".format(group_num=group_num, start=match.start(group_num), end=match.end(group_num), group=match.group(group_num)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
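
One thing worth checking before adopting this second pattern: it requires letters, then a separator, then digits, so it appears to be stricter than what the question asks for. The sketch below runs it against a few hypothetical inputs (the sample list and variable names are assumptions, not part of the original answer):

```python
import re

pattern = re.compile(r"(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\)).", re.IGNORECASE)

# Hypothetical inputs covering the cases from the question
samples = ["(Author 2000)", "(Author, 2000)", "(Author; 2000)", "(Author)", "(2000)"]

# True where the pattern finds a match, False otherwise
results = {sample: bool(pattern.search(sample)) for sample in samples}

for sample, matched in results.items():
    print(sample, matched)

# The three groups hold the name, the separator, and the year
groups = pattern.search("(Author, 2000)").groups()
print(groups)  # ('Author', ', ', '2000')
```

Under these inputs, a bare year such as (2000) is not matched because [a-z]+ needs at least one letter. If bare years must be captured, the first answer's pattern is the closer fit.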
