<p>You need to change your data structure. Here is your current <code>train</code> list:</p>
<pre><code>>>> train = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]
</code></pre>
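<p>The problem is that the first element of each tuple should be a dictionary of features. So I will convert your list into a data structure the classifier can work with:</p>
<pre><code>>>> from nltk.tokenize import word_tokenize # or use some other tokenizer
>>> # vocabulary: every distinct lowercased token in the training passages
>>> all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
>>> # one (feature dict, label) pair per passage; x[0] is not lowercased here,
>>> # so capitalized tokens such as 'I' or 'This' do not match the lowercased vocabulary
>>> t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]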
</code></pre>
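<p>The same conversion can be wrapped in small helper functions that tokenize each passage once and lowercase it consistently. This is only a sketch: <code>simple_tokenize</code>, <code>build_vocabulary</code> and <code>extract_features</code> are hypothetical names, and the regex tokenizer is a stand-in for <code>word_tokenize</code> so that the snippet runs without NLTK (it splits contractions differently).</p>

```python
import re


def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens.

    Assumption: a regex stand-in for nltk.tokenize.word_tokenize.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocabulary(labeled_texts):
    """Collect every distinct token seen in the training passages."""
    return {
        token
        for text, _label in labeled_texts
        for token in simple_tokenize(text)
    }


def extract_features(text, vocabulary):
    """Map each vocabulary word to True/False: does the text contain it?"""
    tokens = set(simple_tokenize(text))  # tokenize once, not once per word
    return {word: (word in tokens) for word in vocabulary}


# Two passages from the train list, enough to show the resulting shape.
train_sample = [
    ("I love this sandwich.", "pos"),
    ("I do not like this restaurant", "neg"),
]
vocabulary = build_vocabulary(train_sample)
featuresets = [
    (extract_features(text, vocabulary), label) for text, label in train_sample
]
print(featuresets[0][0]["i"], featuresets[0][0]["restaurant"])
```

<p>Because both the vocabulary and the passage go through the same lowercasing tokenizer, a capitalized token such as <code>I</code> matches its vocabulary entry <code>i</code>.</p>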
<p>Your data should now be structured like this:</p>
<pre><code>>>> t
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), . . .]
</code></pre>
<p>Note that the first element of each tuple is now a dictionary. With your data in this form, you can train the classifier like so:</p>
<pre><code>>>> import nltk
>>> classifier = nltk.NaiveBayesClassifier.train(t)
>>> classifier.show_most_informative_features()
Most Informative Features
                    this = True              neg : pos    =      2.3 : 1.0
                    this = False             pos : neg    =      1.8 : 1.0
                      an = False             neg : pos    =      1.6 : 1.0
                       . = True              pos : neg    =      1.4 : 1.0
                       . = False             neg : pos    =      1.4 : 1.0
                 awesome = False             neg : pos    =      1.2 : 1.0
                      of = False             pos : neg    =      1.2 : 1.0
                    feel = False             neg : pos    =      1.2 : 1.0
                   place = False             neg : pos    =      1.2 : 1.0
                horrible = False             pos : neg    =      1.2 : 1.0
</code></pre>
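<p>The ratios in that table can be reproduced by hand. A minimal sketch, assuming the classifier uses expected likelihood estimation (adding 0.5 to every count), which is my understanding of the default in <code>NaiveBayesClassifier.train</code>. <code>ele_prob</code> is a hypothetical helper, and the counts were tallied by hand from the <code>train</code> list using the case-sensitive matching in the list comprehension above.</p>

```python
def ele_prob(count, total, bins=2, gamma=0.5):
    """Smoothed probability estimate: (count + gamma) / (total + bins * gamma).

    bins=2 because each feature takes one of two values, True or False.
    """
    return (count + gamma) / (total + bins * gamma)


# The lowercase token 'this' is present in 3 of the 5 'neg' passages
# and in 1 of the 5 'pos' passages (capitalized 'This' does not match).
p_this_given_neg = ele_prob(3, 5)
p_this_given_pos = ele_prob(1, 5)

ratio = p_this_given_neg / p_this_given_pos
print(f"this = True   neg : pos = {ratio:.1f} : 1.0")
```

<p>This gives 3.5 / 1.5, which rounds to the 2.3 : 1.0 shown in the first row.</p>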
<p>If you want to use the classifier, you can do it like this. First, start with a test sentence:</p>
<pre><code>>>> test_sentence = "This is the best band I've ever heard!"
</code></pre>
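<p>Then tokenize the sentence and work out which of the words in <code>all_words</code> it contains. These make up the sentence's features.</p>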
<pre><code>>>> test_sent_features = {word: (word in word_tokenize(test_sentence.lower())) for word in all_words}
</code></pre>
<p>Your features now look like this:</p>
<pre><code>>>> test_sent_features
{'love': False, 'deal': False, 'tired': False, 'feel': False, 'is': True, 'am': False, 'an': False, 'sandwich': False, 'ca': False, 'best': True, '!': True, 'what': False, 'i': True, '.': False, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'these': False, 'stuff': False, 'place': False, 'my': False, 'view': False}
</code></pre>
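<p>To see which tokens actually contribute, you can compare the sentence's tokens with the vocabulary. A small sketch: the token list is what I would expect <code>word_tokenize</code> to return for the lowercased test sentence (an assumption, hardcoded so the snippet runs without NLTK), and the vocabulary is copied from the keys of the feature dictionary printed above.</p>

```python
# Assumed result of word_tokenize(test_sentence.lower()); hardcoded here.
tokens = ["this", "is", "the", "best", "band", "i", "'ve", "ever", "heard", "!"]

# Vocabulary copied from the keys of test_sent_features.
all_words = {
    "love", "deal", "tired", "feel", "is", "am", "an", "sandwich", "ca",
    "best", "!", "what", "i", ".", "amazing", "horrible", "sworn", "awesome",
    "do", "good", "very", "boss", "beers", "not", "with", "he", "enemy",
    "about", "like", "restaurant", "this", "of", "work", "n't", "these",
    "stuff", "place", "my", "view",
}

known = sorted(set(tokens) & all_words)    # tokens that become True features
unknown = sorted(set(tokens) - all_words)  # tokens the classifier never saw

print(known)    # ['!', 'best', 'i', 'is', 'this']
print(unknown)  # ["'ve", 'band', 'ever', 'heard', 'the']
```

<p>Tokens that never appeared in training, such as <code>band</code>, have no entry in the feature dictionary and so play no part in the classification.</p>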
<p>Then you simply classify those features:</p>
<pre><code>>>> classifier.classify(test_sent_features)
'pos' # note 'best' == True in the sentence features above
</code></pre>
<p>This test sentence appears to be positive.</p>