Regular expression pattern in Python

Published 2024-10-03 04:34:30


I am looking for a regular expression pattern in Python that does the following.

For text formatted like this:

2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!

I would like to get back:

[(2021-01-01,10:00:05,Surname1 Name1,Comment,Blablabla\nBlabla),
(2021-01-01,23:00:05,Surname2 SurnameBis Name2,WorkNotes,What?\nI don't know?),
(2021-01-02,03:00:05,Surname1 Name1,Comment,Blablabla!)]

I managed to get a result that is quite close:

import re

text2 = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
Can you be clear?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""
# Raw string, so the backslashes reach the regex engine unchanged
LangTag = re.findall(r"(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\s-\s(.*?)\((.*)\)\n(.*)(?:\n|$)", text2)
print(LangTag)

But I am completely stuck on getting all of the text I need to show up: only the first line of each message is captured.

One solution would be to remove the \n characters from the initial text, but I would like to avoid that because I need them later. Any ideas?


3 answers

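My solution is almost the same as yours, but group 5 is changed from .* to \D*, so it matches everything up to the next digit (here, the start of the next timestamp):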

import re
text = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""
result = re.findall(r"(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\s-\s(.*?)\((.*)\)\n(\D*)(?:\n|$)", text)
print(result)

Output:

[('2021-01-01', '10:00:05', 'Surname1 Name1 ', 'Comment', 'Blablabla\nBlabla'),
 ('2021-01-01', '23:00:05', 'Surname2 SurnameBis Name2 ', 'WorkNotes', "What?\nI don't know?"), 
 ('2021-01-02', '03:00:05', 'Surname1 Name1 ', 'Comment', 'Blablabla!')]
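
Because \D* stops at the first digit, a message that itself contains a digit would not be captured as intended. A possible variant, sketched here as an assumption rather than as part of the answer above, is to let group 5 run lazily until a lookahead sees the next timestamp line or the end of the text. The names DATE, TIME, ENTRY and the sample text are made up for illustration, and the sketch assumes every entry has at least one message line and that the text does not end with a newline.

```python
import re

# Hypothetical building blocks for the timestamp format used in the question
DATE = r"\d{4}-\d{2}-\d{2}"
TIME = r"\d{2}:\d{2}:\d{2}"

ENTRY = re.compile(
    # header line: date, time, name, (kind); [^\n] keeps these groups on one line
    r"^(" + DATE + r") (" + TIME + r") - ([^\n]*?) \(([^\n]*)\)\n"
    # message body: as little as possible ...
    r"(.*?)"
    # ... up to a newline followed by the next timestamp, or the end of the text
    r"(?=\n" + DATE + r" " + TIME + r" - |\Z)",
    re.MULTILINE | re.DOTALL,
)

# Made-up sample whose messages contain digits
sample = (
    "2021-01-01 10:00:05 - Surname1 Name1 (Comment)\n"
    "Call me at 10 o'clock\n"
    "Room 42\n"
    "2021-01-02 03:00:05 - Surname2 Name2 (WorkNotes)\n"
    "Done!"
)

entries = ENTRY.findall(sample)
for entry in entries:
    print(entry)
```

The name group here does not keep the trailing space, because the space before the opening parenthesis is matched outside the group.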

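You can solve your problem by solving it for the first block only, then repeating that solution until the end of the data. With this divide-and-conquer strategy the code stays easy to understand, yet it can handle larger inputs and is easy to extend.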

import re

data = '''2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!'''.splitlines()

first_line_patt = re.compile(r'^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) - (.*)(?= \() \((.*)\)$')


def parse_block(lines, idx):
    # parse the meta line
    res = first_line_patt.findall(lines[idx])

    # get the message
    message = []
    while idx < len(lines)-1:
        line = lines[idx + 1]
        idx += 1

        # check if next line is a meta line
        if first_line_patt.match(line):
            break

        # if not, it is a message line
        message.append(line)

    res.append('\n'.join(message))
    return res, idx


idx = 0
while True:
    res, idx = parse_block(data, idx)
    if not res[0]:
        break
    print(res)

This produces the following result:

[('2021-01-01', '10:00:05', 'Surname1 Name1', 'Comment'), 'Blablabla\nBlabla']
[('2021-01-01', '23:00:05', 'Surname2 SurnameBis Name2', 'WorkNotes'), "What?\nI don't know?"]
[('2021-01-02', '03:00:05', 'Surname1 Name1', 'Comment'), 'Blablabla!']
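
Each result above is a list holding the header tuple and the message, while the question asks for one flat 5-tuple per entry. A minimal sketch of that last step follows; flatten is a hypothetical helper name, not part of the answer, and the blocks list simply repeats two of the results shown above.

```python
def flatten(block):
    """Turn [(date, time, name, kind), message] into one 5-tuple."""
    meta, message = block
    return (*meta, message)


# Two of the results printed above, reused as input
blocks = [
    [('2021-01-01', '10:00:05', 'Surname1 Name1', 'Comment'), 'Blablabla\nBlabla'],
    [('2021-01-02', '03:00:05', 'Surname1 Name1', 'Comment'), 'Blablabla!'],
]

flat = [flatten(block) for block in blocks]
print(flat)
```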

You can parse the data line by line like this:

import re

data = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""

def parse(data):
    text = ""
    match = None
    messages = []
    for line in data.split("\n"):
        m = re.match(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) - (.*?) \((.*?)\)$", line)
        if m:
            if match:
                # a new header line closes the previous entry
                msg = (match.group(1), match.group(2), match.group(3), match.group(4), text)
                messages.append(msg)
                text = ""  # reset so the next message does not inherit earlier lines
            match = m
        else:
            text += line + "\n"
    if match:
        # flush the last entry (skipped if no header line was found at all)
        msg = (match.group(1), match.group(2), match.group(3), match.group(4), text)
        messages.append(msg)
    return messages

for message in parse(data):
    print(message)

This outputs:

('2021-01-01', '10:00:05', 'Surname1 Name1', 'Comment', 'Blablabla\nBlabla\n')
('2021-01-01', '23:00:05', 'Surname2 SurnameBis Name2', 'WorkNotes', "What?\nI don't know?\n")
('2021-01-02', '03:00:05', 'Surname1 Name1', 'Comment', 'Blablabla!\n')
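
A variant of the same line-by-line idea, offered as a sketch rather than as the answer's own code: giving each entry its own list of message lines and joining them with "\n" at the end keeps the messages separate and leaves no trailing newline, which is the format the question asks for. HEADER and parse_entries are names invented for this example, and ignoring any lines that appear before the first header is an assumption.

```python
import re

# Same header format as in the question; the name HEADER is made up here
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) - (.*?) \((.*?)\)$")


def parse_entries(text):
    """Return one (date, time, name, kind, message) tuple per entry."""
    entries = []    # pairs of (header groups, list of message lines)
    current = None  # the entry currently being filled
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            current = (m.groups(), [])
            entries.append(current)
        elif current is not None:
            current[1].append(line)
        # assumption: lines before the first header are ignored
    return [(*groups, "\n".join(lines)) for groups, lines in entries]


sample = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""

parsed = parse_entries(sample)
for entry in parsed:
    print(entry)
```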
