<p>Let's dig into the code =)</p>
<p>First, the <code>nltk.book</code> code lives at <a href="https://github.com/nltk/nltk/blob/develop/nltk/book.py" rel="nofollow noreferrer">https://github.com/nltk/nltk/blob/develop/nltk/book.py</a>.</p>
<p>If we look closely, the texts are loaded as <code>nltk.Text</code> objects. For example, <code>text6</code> is defined at <a href="https://github.com/nltk/nltk/blob/develop/nltk/book.py#L36" rel="nofollow noreferrer">https://github.com/nltk/nltk/blob/develop/nltk/book.py#L36</a>:</p>
<pre><code>text6 = Text(webtext.words('grail.txt'), name="Monty Python and the Holy Grail")
</code></pre>
<p>The <code>Text</code> object comes from <a href="https://github.com/nltk/nltk/blob/develop/nltk/text.py#L286" rel="nofollow noreferrer">https://github.com/nltk/nltk/blob/develop/nltk/text.py#L286</a>, and you can read about how to use it at <a href="http://www.nltk.org/book/ch02.html" rel="nofollow noreferrer">http://www.nltk.org/book/ch02.html</a>.</p>
<p><code>webtext</code> is a corpus from <code>nltk.corpus</code>, so to get the raw text of <code>nltk.book.text6</code> you can load <code>webtext</code> directly, e.g.</p>
^{pr2}$
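<p>A minimal sketch of that direct load, assuming the <code>webtext</code> corpus data has already been downloaded (for example with <code>nltk.download('webtext')</code>). The helper <code>raw_text</code> and the <code>main</code> wrapper are hypothetical names used for illustration; <code>fileids()</code> and <code>raw()</code> are the corpus reader methods they rely on:</p>

```python
def raw_text(corpus, fileid):
    """Return the unprocessed text of one file from a corpus reader.

    `corpus` is assumed to expose `fileids()` and `raw(fileid)`, as
    NLTK's PlaintextCorpusReader does. The helper name is hypothetical.
    """
    if fileid not in corpus.fileids():
        raise ValueError(f"{fileid!r} is not one of this corpus's fileids")
    return corpus.raw(fileid)


def main():
    # Imported here so the helper above can be used without NLTK installed.
    # Assumes the webtext data is already present locally.
    from nltk.corpus import webtext

    # 'grail.txt' is the file that nltk.book.text6 is built from.
    grail = raw_text(webtext, "grail.txt")
    print(grail[:200])  # preview the first 200 characters
```

<p>Calling <code>main()</code> prints the beginning of the raw script. Unlike <code>text6</code>, which wraps a list of tokens, the value returned here is a single string with the original whitespace and line breaks intact.</p>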
<p><code>fileids</code> is only available when you load a <code>PlaintextCorpusReader</code> object, not on a <code>Text</code> object (which is an already-processed object):</p>
<pre><code>>>> type(webtext)
&lt;class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'&gt;
>>> for filename in webtext.fileids():
...     print(filename)
...
firefox.txt
grail.txt
overheard.txt
pirates.txt
singles.txt
wine.txt
</code></pre>