Python: how to extract sentences containing a citation marker from a text file

Published 2024-09-28 21:08:26


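For example, I have three sentences like the ones below, and the middle one contains a citation marker (Warren and Pereira, 1982). The citation is always enclosed in parentheses and has this format: (~string~comma(,)~space~number~)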

He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits.

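I am using a regular expression to extract only the middle sentence, but it keeps printing all three sentences. The result should look like this: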

The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).


2 answers
text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."

You can split the text into a list of sentences and then select the sentences that end with ")".
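A minimal sketch of this split-and-filter idea, not necessarily the exact code the answer used. It assumes sentences are separated by a period, question mark, or exclamation mark followed by whitespace; the names `sentences` and `cited` are illustrative. An abbreviation such as "et al." would break this naive split.

```python
import re

text = (
    "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
    "The system, called BusTUC is built upon the classical system CHAT-80 "
    "(Warren and Pereira, 1982). "
    "CHAT-80 was a state of the art natural language system that was "
    "impressive on its own merits."
)

# Split on whitespace that follows sentence-ending punctuation
# (assumption: every sentence ends with ".", "!" or "?").
sentences = re.split(r"(?<=[.!?])\s+", text)

# Keep sentences whose last character, ignoring trailing periods, is ")".
cited = [s for s in sentences if s.rstrip(".").endswith(")")]

print(cited)
```

This only catches citations placed at the very end of a sentence; a citation in the middle of a sentence needs a pattern that looks inside the sentence.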


Setup: two texts that represent the cases of interest:

text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."

t2 = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural. CHAT-80 was a state of the art natural language system that was impressive on its own merits."

First, match the case where the citation is at the end of the sentence:

p1 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"

Then match the case where the citation is not at the end of the sentence:

p2 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"

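Combine the two cases with the regex "|" operator:

import re

# Raw strings keep the backslashes literal; [A-Za-z] matches letters only.
p_main = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
                    r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)")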

In action:

>>> print(re.findall(p_main, text))
[('The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).', '')]

>>> print(re.findall(p_main, t2))
[('', 'The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural.')]

In both cases you get the sentence that contains the citation.
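Because the pattern has two capture groups, `re.findall` returns tuples in which one element is always empty. A small helper can flatten them into plain strings. The sketch below is illustrative: `cited_sentences` is a hypothetical name, and the pattern repeats the two alternatives of `p_main` as raw strings with the letter class written as `[A-Za-z]`.

```python
import re

# Same two alternatives as p_main, written as raw strings.
p_main = re.compile(
    r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
    r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^.]+\.+?)"
)


def cited_sentences(s):
    """Return each matched sentence, taking whichever group is non-empty."""
    return [first or second for first, second in p_main.findall(s)]


text = (
    "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
    "The system, called BusTUC is built upon the classical system CHAT-80 "
    "(Warren and Pereira, 1982). "
    "CHAT-80 was a state of the art natural language system that was "
    "impressive on its own merits."
)
t2 = (
    "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
    "The system, called BusTUC is built upon the classical system CHAT-80 "
    "(Warren and Pereira, 1982) fgbhdr was a state of the art natural. "
    "CHAT-80 was a state of the art natural language system that was "
    "impressive on its own merits."
)

print(cited_sentences(text))
print(cited_sentences(t2))
```

Note that both alternatives begin with `\. `, so a citing sentence at the very start of the text, with no preceding period and space, would not be matched.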

A good resource is the Python regular expression documentation and the accompanying Regex HOWTO page.

Cheers
