Getting sentences from a transcript

Posted 2024-09-27 00:20:22

I have transcript files in the following format:

(name 1): (sentence)\n (<-- There can be multiples of this pattern)

(name 2): (sentence)\n (sentence)\n

And so on. I need all of the sentences. So far I have it working by hard-coding the names that appear in the file, but I need it to be generic.

utterances = re.findall(r'(?:CALLER: |\nCALLER:\nCRO: |\nCALLER:\nOPERATOR: |\nCALLER:\nRECORDER: |RECORDER: |CRO: |OPERATOR: )(.*?)(?:CALLER: |RECORDER : |CRO: |OPERATOR: |\nCALLER:\n)', raw_calls, re.DOTALL)

I am using Python 3.6 with re. Alternatively, if anyone knows how to do this with spaCy, that would be a big help. Thanks.

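For an empty utterance, I just want to capture the \n that follows it and put it in its own string. I suppose I will simply have to capture the tape information given at the end as well, for example, because I can't think of a way to tell whether such a line is part of someone's speech. Also, there is sometimes more than one word between the start of the line and the colon.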

Mock data:

CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!

2 answers

You can use a lookahead expression that looks for the same name pattern at the start of a line, followed by a colon:

s = '''CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!'''
import re
from pprint import pprint
pprint(re.findall(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)', s, flags=re.MULTILINE | re.DOTALL), width=200)

This outputs:

[('CRO', 'How far are you from the World Trade Center, how many blocks, about? Three or four blocks?\n63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01\n'),
 ('CALLER', ''),
 ('CRO', "You're welcome. Thank you.\n"),
 ('OPERATOR', 'Bye.\n'),
 ('CRO', 'Bye.\n'),
 ('RECORDER', 'The preceding portion of tape concludes at 0913 hours, 36 seconds.\nThis tape will continue on side B.\n'),
 ('OPERATOR NEWELL', 'blah blah.\n'),
 ('GUY IN DESK', 'I speak words!')]
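
If only the utterances themselves are needed, the pairs returned above can be post-processed. The following is a minimal sketch, assuming that surrounding whitespace should be stripped and that a label with no text (such as the bare "CALLER:" line) should be dropped. The function name `extract_utterances` and the `sample` string are hypothetical and only serve the illustration.

```python
import re

# Same pattern as in the answer above: a speaker label at the start of a line,
# then everything up to the next label or the end of the text.
SPEAKER_RE = re.compile(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)',
                        flags=re.MULTILINE | re.DOTALL)

def extract_utterances(text):
    """Return (speaker, utterance) pairs, skipping empty utterances."""
    pairs = []
    for speaker, utterance in SPEAKER_RE.findall(text):
        utterance = utterance.strip()
        if utterance:  # assumption: a label with no text is not wanted
            pairs.append((speaker, utterance))
    return pairs

# Hypothetical sample in the same shape as the mock data.
sample = "CRO: Bye.\nCALLER:\nOPERATOR NEWELL: blah blah.\nThis continues.\n"
print(extract_utterances(sample))
```

Note that any continuation line containing a colon would still be treated as the start of a new speaker, because the pattern accepts any text before the colon as a name.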

I would use regular expressions and nested for loops in a list comprehension to get all of the sentences, as shown in the code below.

import re

s = '''(name 1): (sentence1 here)\n (<  There can be multiples of this pattern)

(name 2): (sentence2 here)\n (sentence3 here)\n'''

# Split on the "(name N):" labels, then collect every parenthesized group in each chunk.
[y.strip('()') for x in re.split(r'\(name \d+\):', s) for y in re.findall(r'\([^)]+\)', x)]

>>> ['sentence1 here',
    '<  There can be multiples of this pattern',
    'sentence2 here',
    'sentence3 here']
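
The comprehension above targets the parenthesized placeholder format from the question rather than the mock data. For transcript lines like those in the mock data, a line-by-line pass is one possible alternative. This is only a sketch, assuming that speaker labels consist solely of upper-case letters and spaces (true of the mock data, but not stated in the question), so that a colon inside running text does not start a new speaker. The names `LABEL_RE` and `parse_transcript` and the `sample` string are hypothetical.

```python
import re

# Assumption: a speaker label is upper-case letters and spaces only,
# as in the mock data; the question does not guarantee this.
LABEL_RE = re.compile(r'^([A-Z][A-Z ]*):\s*(.*)$')

def parse_transcript(text):
    """Group lines into (speaker, utterance) entries, line by line."""
    entries = []
    for line in text.splitlines():
        match = LABEL_RE.match(line)
        if match:
            # A new speaker label starts a new entry.
            entries.append([match.group(1), match.group(2)])
        elif entries and line.strip():
            # No label: treat the line as a continuation of the previous entry.
            entries[-1][1] = (entries[-1][1] + '\n' + line.strip()).strip()
    return [tuple(entry) for entry in entries]

# Hypothetical sample with colons inside the spoken text.
sample = ("CRO: Call logged at 09:13.\n"
          "It ended at 09:14: no reply.\n"
          "GUY IN DESK: I speak words!")
print(parse_transcript(sample))
```

Multi-word labels such as "GUY IN DESK" are still recognized, while the second line is kept as part of the CRO entry despite containing a colon. Header lines without a label, such as the tape information, are still attached to the preceding speaker.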
