<p>Use a regular expression: add \n after every 5 matches of "one or more characters that are not a period, followed by a period".</p>
<pre><code>import re
text = 'The puppy is cute. Summer is great. Happy friday. sentence4. sentence5. sentence6. sentence7.'
print(re.sub(r'((?:[^.]+\.\s*){5})',r'\1\n',text))
</code></pre>
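<p>If the group size needs to vary, the same pattern can be wrapped in a small helper. This is only a sketch under the same naive assumption as above (a sentence is any run of non-period characters ending in a period); the function name <code>break_every_n_sentences</code> and its <code>n</code> parameter are hypothetical, introduced here for illustration.</p>

```python
import re

def break_every_n_sentences(text, n=5):
    """Insert a newline after every n naive sentences.

    A "sentence" here is one or more non-period characters followed by a
    period and any trailing whitespace, exactly as in the pattern above.
    """
    # Build the same pattern as above, with the repeat count made configurable.
    pattern = r'((?:[^.]+\.\s*){%d})' % n
    return re.sub(pattern, r'\1\n', text)

text = 'One. Two. Three. Four. Five. Six.'
print(break_every_n_sentences(text, n=2))
```

<p>Note that the whitespace after the last sentence of each group is kept before the inserted newline, and a trailing group with fewer than <code>n</code> sentences is left unchanged.</p>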
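<p>A more advanced regex sentence matcher that handles abbreviations and other punctuation by matching on the sentence-ending punctuation.<br/>
Reference: <a href="https://mikedombrowski.com/2017/04/regex-sentence-splitter/" rel="nofollow noreferrer">https://mikedombrowski.com/2017/04/regex-sentence-splitter/</a><br/>
Note: there are still some edge cases this does not handle. For example, T.V. followed by Mr. needs a double space to be treated as separate sentences, and quotations that contain sentences will be split.</p>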
<pre><code>import re
sentence_regex = r'((.*?([\.\?!][\'\"\u2018\u2019\u201c\u201d\)\]]*\s*(?<!\w\.\w.)(?<![A-Z][a-z][a-z]\.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)\s+)){5})'
text = 'The puppy is cute. Watch T.V. Mr. Summers is great. Say "my name." My name is. Or not... Happy friday? Sentence4. Sentence5. Sentence6. Sentence7.'
text += " " + text
print(re.sub(sentence_regex,r'\1\n',text))
</code></pre>
<p>For anything more complex than this, you will probably want to look at a natural language processing toolkit.</p>