<p>I believe <code>PlaintextCorpusReader</code> already splits its input into sentences with the Punkt tokenizer, at least when the input language is English.</p>
<p>From the <a href="http://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.plaintext.PlaintextCorpusReader" rel="nofollow noreferrer">documentation of the <code>PlaintextCorpusReader</code> constructor</a>:</p>
<pre><code>def __init__(self, root, fileids,
             word_tokenizer=WordPunctTokenizer(),
             sent_tokenizer=nltk.data.LazyLoader(
                 'tokenizers/punkt/english.pickle'),
             para_block_reader=read_blankline_block,
             encoding='utf8'):
</code></pre>
<p>You can pass both a word tokenizer and a sentence tokenizer to the reader, but the sentence tokenizer already defaults to <code>nltk.data.LazyLoader('tokenizers/punkt/english.pickle')</code>.</p>
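<p>If you do want to override that default, a minimal sketch could look like the following. The regex-based splitter, the temporary directory and the file name <code>sample.txt</code> are assumptions made purely for illustration; the splitter is a crude stand-in for Punkt, not a replacement for it.</p>

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Hypothetical stand-in for Punkt: every run of text ending in '.', '!' or '?'
# counts as one sentence. It needs no downloaded model, but it is far cruder
# than Punkt (it would split "Mr. Smith" in two, for example).
naive_sent_tokenizer = RegexpTokenizer(r'[^.!?]+[.!?]')

with tempfile.TemporaryDirectory() as root:
    # Write a tiny throwaway corpus file (the file name is made up).
    with open(os.path.join(root, 'sample.txt'), 'w', encoding='utf8') as f:
        f.write('First sentence. Second one here.\n')

    # Same constructor as above, with only the sentence tokenizer overridden.
    reader = PlaintextCorpusReader(root, r'.*\.txt',
                                   sent_tokenizer=naive_sent_tokenizer)

    # sents() returns a lazy view over the file, so materialise it
    # before the temporary directory is removed.
    sentences = [list(sent) for sent in reader.sents()]

print(sentences)
```

<p>Each entry of <code>sentences</code> is a list of word tokens, because the reader still applies its default <code>WordPunctTokenizer</code> to every sentence the sentence tokenizer returns.</p>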
<p>For a single string, the tokenizer is used as follows (explained <a href="https://www.nltk.org/api/nltk.tokenize.html" rel="nofollow noreferrer">here</a>; see section 5 for the Punkt tokenizer).</p>
<pre><code>>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())
</code></pre>