<p>You are looking for a natural language processing library.</p>
<p>For Python, there is the <a href="http://www.nltk.org/" rel="nofollow">Natural Language Toolkit</a> (NLTK). For example, take a look at <a href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt.PunktSentenceTokenizer-class.html" rel="nofollow"><code>PunktSentenceTokenizer</code></a>:</p>
<blockquote>
<p>The PunktSentenceTokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used. The algorithm for this tokenizer is described in Kiss &amp; Strunk (2006):</p>
<p>Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
Boundary Detection. Computational Linguistics 32: 485-525.</p>
<p>The NLTK data package includes a pre-trained Punkt tokenizer for English.</p>
</blockquote>
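<p>As a rough illustration, here is a minimal sketch of calling the tokenizer from Python. It assumes NLTK is installed. The helper name <code>split_sentences</code> and the sample text are made up for this example. To avoid depending on the separate NLTK data download, the sketch builds a tokenizer with no training text. Such a tokenizer knows no abbreviations, so for real text you would probably want to pass a large plaintext corpus as <code>train_text</code> or use the pre-trained English model mentioned above.</p>

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer


def split_sentences(text, train_text=None):
    """Split text into sentences with Punkt.

    split_sentences is a hypothetical helper for this example.
    If train_text is given, Punkt learns abbreviations, collocations
    and sentence starters from it; otherwise the tokenizer starts with
    empty parameters (an assumption made here to keep the sketch
    independent of the NLTK data package).
    """
    if train_text is not None:
        tokenizer = PunktSentenceTokenizer(train_text)
    else:
        tokenizer = PunktSentenceTokenizer()
    return tokenizer.tokenize(text)


if __name__ == "__main__":
    sample = "Punkt splits text into sentences. It needs no labels."
    for sentence in split_sentences(sample):
        print(sentence)
```

<p>Without training, a text containing abbreviations such as "Dr." is likely to be split in the wrong places, which is why the documentation recommends training on a large corpus in the target language.</p>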