<p><code>text1</code> in the <a href="http://www.nltk.org/book/ch01.html" rel="nofollow noreferrer">nltk book</a> is a collection of tokens (words, punctuation), unlike <code>text1</code> in your code example, which is a string (a collection of Unicode codepoints):</p>
<pre><code>>>> from nltk.book import text1
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1[99] # 100th token in the text
','
>>> from nltk import FreqDist
>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024,
'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
</code></pre>
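<p>To see the difference, here is a minimal stdlib sketch using <code>Counter</code> (of which <code>FreqDist</code> is a subclass): counting a plain string tallies its characters, while counting a list of tokens tallies whole words.</p>

```python
from collections import Counter

# A string is iterated character by character (Unicode code points)...
char_freq = Counter("to be or not to be")
# ...while a list of tokens is counted word by word.
word_freq = Counter("to be or not to be".split())

print(char_freq["o"])   # 4 occurrences of the letter 'o'
print(word_freq["to"])  # 2 occurrences of the word 'to'
```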
<p>If your input is indeed space-separated words, then to find the frequencies, use <a href="https://stackoverflow.com/a/29052467/4279">@Boa's answer</a>:</p>
<pre><code>freq = Counter(text_with_space_separated_words.split())
</code></pre>
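<p>A short usage sketch (the input string here is a hypothetical example):</p>

```python
from collections import Counter

text = "hello hi hello heloo he"  # hypothetical space-separated input
freq = Counter(text.split())
print(freq)                 # Counter({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})
print(freq.most_common(1))  # [('hello', 2)]
```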
<p>Note: <code>FreqDist</code> is a <code>Counter</code>, but it also defines additional methods such as <code>.plot()</code>.</p>
<p>If you want to use the <code>nltk</code> tokenizers instead:</p>
<pre><code>#!/usr/bin/env python3
from itertools import chain
from nltk import FreqDist, sent_tokenize, word_tokenize # $ pip install nltk
with open('your_text.txt') as file:
    text = file.read()

words = chain.from_iterable(map(word_tokenize, sent_tokenize(text)))
freq = FreqDist(map(str.casefold, words))
freq.pprint()
# -> FreqDist({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})
</code></pre>
<p><code>sent_tokenize()</code> tokenizes the text into sentences. Then <code>word_tokenize</code> tokenizes each sentence into words. <a href="http://www.nltk.org/api/nltk.tokenize.html" rel="nofollow noreferrer">There are many ways to tokenize text in <code>nltk</code>.</a></p>