Python: extracting sentences from a line with a regular expression, based on criteria

import re

txt_list = []
with open('sample.txt', 'r') as txt:
    patt = r'.*}[.!?]\s?\n?|.*}.+[.!?]\s?\n?'
    read_txt = txt.readlines()
    for line in read_txt:
        if line == "\n":
            # keep blank lines so they can be reported below
            txt_list.append("\n")
        else:
            found = re.findall(patt, line)
            for f in found:
                txt_list.append(f)

for line in txt_list:
    if line == "\n":
        print("newline")
    else:
        print(line)

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said. newline I am the {very last|last} sentence for this {instance|example}.

2 Answers

User

Answer 1 · edited 2024-09-27 21:28:41

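If you don't mind adding a dependency, the NLTK library has a sent_tokenize function that should do what you need, although I'm not entirely sure whether the curly braces will interfere.

The paper describing the NLTK approach runs to more than 40 pages. Detecting sentence boundaries is not a trivial task.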
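A minimal sketch of that suggestion follows. It assumes NLTK is installed and that its sentence tokenizer data has been downloaded once with nltk.download (the resource name depends on the NLTK version). The helper name split_sentences is made up for illustration, and how the tokenizer treats the {a|b} groups is untested here.

```python
def split_sentences(text):
    """Split text into sentences with NLTK's sent_tokenize.

    The import is done lazily so that NLTK is only required
    when this helper is actually called.
    """
    from nltk.tokenize import sent_tokenize  # third-party dependency

    return sent_tokenize(text)
```

Calling split_sentences(line) on each non-blank line of sample.txt would replace the re.findall call in the question's loop.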

User

Answer 2 · edited 2024-09-27 21:28:41

The most intuitive solution I came up with is this. Essentially, you need to treat the Dr. and Mr. tokens as atoms in their own right.

patt = r'(?:Dr\.|Mr\.|.)*?[.!?]\s?\n?'

It says:

Find the smallest number of Mr.s, Dr.s, or single characters up to a punctuation mark (., ! or ?), followed by zero or one whitespace character, followed by zero or one newline.
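To see why the atoms matter, here is a small sketch that compares the pattern above with a naive variant that lacks the Dr\./Mr\. alternatives. The naive pattern and the variable names are illustrative additions, not part of the original answer; the test string is taken from the question's sample text.

```python
import re

text = ("{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! "
        "What {will|shall|should} we {eat|have} for lunch?")

# Naive variant: any character, lazily, up to the first ., ! or ?
naive = r'.*?[.!?]\s?\n?'
# Pattern from the answer: "Dr." and "Mr." are consumed as single units
atomic = r'(?:Dr\.|Mr\.|.)*?[.!?]\s?\n?'

naive_parts = [s.strip() for s in re.findall(naive, text)]
atomic_parts = [s.strip() for s in re.findall(atomic, text)]

print(naive_parts)   # the first piece ends right after "Dr."
print(atomic_parts)  # two complete sentences
```

Because the alternation tries Dr\. before the single-character fallback, the period inside the title is swallowed before the lazy quantifier gets a chance to stop on it.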

Applied to sample.txt (with one line added), it gives:

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}!
What {will|shall|should} we {eat|have} for lunch?
Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

newline
But there are no {misters|doctors} here good sir!
Help us if there is an emergency.

newline
I am the {very last|last} sentence for this {instance|example}.
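The line-by-line processing can be sketched end to end as below. This is an illustration under assumptions: io.StringIO stands in for sample.txt, the two input lines are taken from the sample sentences, and split_lines is a made-up helper name.

```python
import io
import re

PATT = r'(?:Dr\.|Mr\.|.)*?[.!?]\s?\n?'

def split_lines(lines):
    """Collect the sentences of each line; a blank line becomes 'newline'."""
    out = []
    for line in lines:
        if line == "\n":
            out.append("newline")
        else:
            # strip the trailing space or newline kept by the pattern
            out.extend(s.strip() for s in re.findall(PATT, line))
    return out

# Hypothetical stand-in for sample.txt
sample = io.StringIO(
    "Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.\n"
    "\n"
    "I am the {very last|last} sentence for this {instance|example}.\n"
)

result = split_lines(sample)
for item in result:
    print(item)
```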
