Parsing a transcript with regular expressions

Posted 2024-09-28 21:23:43


I have text formatted like this example:

PAUL: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.

LEONARD: Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu.

EVIL NINJA [on the roof]: In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim.

PAUL [SCREAMING]: Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus.

along with a regular expression meant to parse the script into lines of dialogue:

[A-Z]+([:]|[ ]{1}[[][A-Z]*[]])

I am trying to capture every speaker, so the regex should match things like

"PAUL:", 
"LEONARD [some context]:" 

As you can see, I did not manage to capture every speaker. For example, this one is not matched:

EVIL NINJA [on the roof]:

How can I capture the line above? Are regular expressions the right approach?

Edit: all speaker names are written in uppercase and end with a colon. That is the format of every transcript I am working with.


3 answers

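The problem with your regex is that it does not allow any spaces, so it matches neither "EVIL NINJA" nor "on the roof".

But yes, regex is definitely the right approach. You can try this: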

([A-Z][A-Z ]*)(?: \[([\w ]+)\])?:

Usage:

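import re

# text is assumed to hold the transcript shown in the question
regex = r'([A-Z][A-Z ]*)(?: \[([\w ]+)\])?:'

for match in re.finditer(regex, text):
    print('person:', match.group(1))    # speaker name
    print('context:', match.group(2))   # bracketed context, or None
    print()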

Output:

person: PAUL
context: None

person: LEONARD
context: None

person: EVIL NINJA
context: on the roof

person: PAUL
context: SCREAMING
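
The pattern above only finds the speaker headers. If the goal is also to recover what each speaker says, one possible approach is to slice the text between consecutive matches. This is a minimal sketch, not part of the original answer: the names `HEADER_RE` and `split_dialogue` and the short `sample` string are made up for illustration.

```python
import re

# Same pattern as in the answer above, compiled once.
HEADER_RE = re.compile(r'([A-Z][A-Z ]*)(?: \[([\w ]+)\])?:')

def split_dialogue(text):
    """Return (speaker, context, speech) tuples (hypothetical helper)."""
    matches = list(HEADER_RE.finditer(text))
    turns = []
    for i, m in enumerate(matches):
        # The speech runs from the end of this header to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        turns.append((m.group(1), m.group(2), text[m.end():end].strip()))
    return turns

# Shortened stand-in for the transcript in the question.
sample = (
    "PAUL: Hello there.\n\n"
    "EVIL NINJA [on the roof]: Hi.\nStill talking.\n\n"
    "PAUL [SCREAMING]: Run!"
)

for turn in split_dialogue(sample):
    print(turn)
```

Note that the pattern is not anchored to the start of a line, so an uppercase letter or all-caps word directly followed by a colon inside someone's speech would also be treated as a header. Anchoring with ^ and re.MULTILINE is one way to tighten it.
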

[A-Z ]+(:|\[[a-zA-Z ]+\]:)

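I think what went wrong is that you were not matching lowercase letters inside the [], so [on the roof] did not match. I added a-z to the character class, and now it matches. You also did not allow spaces in speaker names, so I changed the start to [A-Z ].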
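
To see the difference, the two patterns can be run side by side on the four speaker headers from the question. This is only a sketch: the question's pattern is rewritten here with escaped brackets, which should behave the same and avoids the "possible nested set" warning Python emits for `[[]`, and the helper name `matched_text` is made up for illustration.

```python
import re

headers = [
    "PAUL:",
    "LEONARD:",
    "EVIL NINJA [on the roof]:",
    "PAUL [SCREAMING]:",
]

# The question's pattern, with [:] , [ ]{1}, [[] and []] written as plain/escaped characters.
original = re.compile(r"[A-Z]+(:| \[[A-Z]*\])")
# The pattern proposed in this answer.
fixed = re.compile(r"[A-Z ]+(:|\[[a-zA-Z ]+\]:)")

def matched_text(pattern, s):
    """Return the matched substring, or None when nothing matches (hypothetical helper)."""
    m = pattern.search(s)
    return m.group(0) if m else None

for h in headers:
    print(repr(h), "| original:", matched_text(original, h), "| fixed:", matched_text(fixed, h))
```

The original pattern finds nothing in the EVIL NINJA header and stops before the colon in PAUL [SCREAMING], while the adjusted pattern covers all four headers in full.
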


Regular expression

"^([A-Z\s]+)(?:\[(?:[\w ]+)\])?:(.*?)$"
  • A-Z can be changed to \w
  • To capture the context, change (?:[\w ]+) to ([\w ]+)

Code

import re

regex = r"^([A-Z\s]+)(?:\[(?:[\w ]+)\])?:(.*?)$"

test_str = ("PAUL: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. \n\n"
        "LEONARD: Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. \n\n"
        "EVIL NINJA [on the roof]: In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim. \n\n"
        "PAUL [SCREAMING]: Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. ")

# re.MULTILINE lets ^ and $ match at the start and end of every line
matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print(f"Match {match_num} was found at {match.start()}-{match.end()}: {match.group()}")

    # Groups are numbered from 1; group 0 is the whole match
    for group_num in range(1, len(match.groups()) + 1):
        print(f"Group {group_num} found at {match.start(group_num)}-{match.end(group_num)}: {match.group(group_num)}")

Output

Match 1 was found at 0-100: PAUL: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.     
Group 1 found at 0-4: PAUL
Group 2 found at 5-97:  Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.

Match 2 was found at 100-381: LEONARD: Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. 
Group 1 found at 100-107: LEONARD
Group 2 found at 108-378:  Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu.

Match 3 was found at 381-684: EVIL NINJA [on the roof]: In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim.     
Group 1 found at 381-392: EVIL NINJA 
Group 2 found at 406-681:  In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim.

Match 4 was found at 684-767: PAUL [SCREAMING]: Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. 
Group 1 found at 684-689: PAUL 
Group 2 found at 701-767:  Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus.
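
If structured records are more useful than printed offsets, the same pattern can feed a small parser. The sketch below makes the bracketed group capturing, as the second bullet above suggests, and strips surrounding whitespace, since `\s` in the first group also matches newlines and the speaker group can pick up a leading blank line or a trailing space. `Turn`, `TURN_RE`, `parse_turns` and the shortened `sample` are illustrative names, not part of the original answer.

```python
import re
from typing import NamedTuple, Optional

class Turn(NamedTuple):
    speaker: str
    context: Optional[str]
    text: str

# The answer's pattern with (?:[\w ]+) changed to ([\w ]+) so the context is captured.
TURN_RE = re.compile(r"^([A-Z\s]+)(?:\[([\w ]+)\])?:(.*?)$", re.MULTILINE)

def parse_turns(transcript):
    """Return one Turn per 'SPEAKER [context]: text' line (hypothetical helper)."""
    turns = []
    for m in TURN_RE.finditer(transcript):
        speaker, context, text = m.groups()
        # strip() removes the newline or space that [A-Z\s]+ may have swallowed.
        turns.append(Turn(speaker.strip(), context, text.strip()))
    return turns

# Shortened stand-in for the transcript in the question.
sample = (
    "PAUL: Lorem ipsum dolor sit amet.\n\n"
    "EVIL NINJA [on the roof]: In enim justo.\n\n"
    "PAUL [SCREAMING]: Aliquam lorem ante.\n"
)

for turn in parse_turns(sample):
    print(turn)
```

This version assumes each turn sits on a single line, as in the sample. Speech that wraps onto further lines would need a different strategy, such as slicing between consecutive header matches.
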
