JavaScript or Python: newline after each sentence

Posted 2024-09-28 19:24:07


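I'm curious whether there is a Python or JavaScript library that can tokenize the sentences in a block of text and put each sentence on a new line.

For example, given the following text, each sentence should end up on its own line: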

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum aliquet leo in urna hendrerit placerat. Donec adipiscing dignissim adipiscing. Duis adipiscing mollis cursus. Etiam fringilla elit nec enim sagittis a auctor nisi gravida. Nunc sollicitudin, leo sit amet consequat pharetra, mi orci vestibulum mi, a suscipit odio tellus tincidunt erat. Suspendisse a consequat turpis. Morbi eget ante leo, a dignissim mi.


3 answers

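You are looking for a natural language library.

For Python, there is the Natural Language Toolkit (NLTK). For example, you could take a look at PunktSentenceTokenizer: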

The PunktSentenceTokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used. The algorithm for this tokenizer is described in Kiss & Strunk (2006):

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.

The NLTK data package includes a pre-trained Punkt tokenizer for English.

If you are just looking for JavaScript that can do this, you could do the following:

var str = "Lorem ipsum 4.00 dolor sit amet, consectetur adipiscing elit. Vestibulum aliquet leo in urna hendrerit placerat. Donec adipiscing dignissim adipiscing. Duis adipiscing mollis cursus. Etiam fringilla elit nec enim sagittis a auctor nisi gravida. Nunc etc.... sollicitudin, leo sit amet consequat pharetra, mi orci vestibulum mi, a suscipit odio tellus tincidunt erat. Suspendisse a consequat turpis. Morbi eget ante leo, a dignissim mi."

str = str.replace(/(\S\.)\s*([A-Z])/g, "$1\n$2");

You can see it working here: http://jsfiddle.net/jfriend00/NR5Nc/.

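This particular algorithm only inserts a line break where a non-whitespace character is followed by a period, then optional whitespace, then an uppercase letter. So it is not thrown off by things like $4.00 and etc.... in the sample string, which don't actually end a sentence. It is also flexible about the amount of whitespace between sentences.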
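For comparison, the same rule can be sketched in Python with re.sub. This is only a port of the regular expression above, not a real sentence tokenizer, and the helper name sentences_to_lines is made up for illustration.

```python
import re

# Same pattern as the JavaScript snippet: a non-whitespace character and a
# period, optional whitespace, then an uppercase letter.
_SENTENCE_BREAK = re.compile(r"(\S\.)\s*([A-Z])")


def sentences_to_lines(text):
    """Hypothetical helper: put a newline between the period and the capital."""
    return _SENTENCE_BREAK.sub(r"\1\n\2", text)


sample = (
    "Lorem ipsum 4.00 dolor sit amet. Vestibulum aliquet leo. "
    "Nunc etc.... sollicitudin, leo sit amet."
)
print(sentences_to_lines(sample))
```

Like the JavaScript version, it still breaks after abbreviations that are followed by a capitalized word (for example "Dr. Smith"), which is the kind of case a trained tokenizer is meant to handle.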

In Python, use str.replace():

>>> s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum aliquet leo in urna hendrerit placerat. Donec adipiscing dignissim adipiscing. Duis adipiscing mollis cursus. Etiam fringilla elit nec enim sagittis a auctor nisi gravida. Nunc sollicitudin, leo sit amet consequat pharetra, mi orci vestibulum mi, a suscipit odio tellus tincidunt erat. Suspendisse a consequat turpis. Morbi eget ante leo, a dignissim mi."
>>> print(s.replace('. ', '.\n'))
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Vestibulum aliquet leo in urna hendrerit placerat.
Donec adipiscing dignissim adipiscing.
Duis adipiscing mollis cursus.
Etiam fringilla elit nec enim sagittis a auctor nisi gravida.
Nunc sollicitudin, leo sit amet consequat pharetra, mi orci vestibulum mi, a suscipit odio tellus tincidunt erat.
Suspendisse a consequat turpis.
Morbi eget ante leo, a dignissim mi.

Also, you may be interested in the textwrap module.
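As a rough sketch of how textwrap could fit in here: after splitting on '. ' as above, each sentence can be wrapped to a fixed width with textwrap.fill. The helper name wrap_sentences and the width of 40 are arbitrary choices for illustration, not something the answer specifies.

```python
import textwrap


def wrap_sentences(text, width=40):
    """Hypothetical helper: one sentence per line, long sentences wrapped."""
    # Naive split: a period followed by a single space ends a sentence.
    sentences = text.replace('. ', '.\n').split('\n')
    # textwrap.fill breaks each sentence into lines of at most `width` columns.
    return '\n'.join(textwrap.fill(sentence, width=width) for sentence in sentences)


s = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
     "Duis adipiscing mollis cursus.")
print(wrap_sentences(s))
```

Each sentence still starts on a fresh line, and any sentence longer than the chosen width continues on the following lines.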
