<p>基本上你需要:</p>
<pre><code>def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
labelid = list(classifier.classes_).index(classlabel)
feature_names = vectorizer.get_feature_names()
topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]
for coef, feat in topn:
print classlabel, feat, coef
</code></pre>
<ul>
<li><p><strong><code>classifier.classes_</code></strong>访问分类器中的类标签的索引</p></li>
<li><p><strong><code>vectorizer.get_feature_names()</code></strong>是不言而喻的</p></li>
<li><p><strong><code>sorted(zip(classifier.coef_[labelid], feature_names))[-n:]</code></strong>检索给定类标签的分类器系数,然后按升序对其排序。</p></li>
</ul>
<hr/>
<p>我将使用<a href="https://github.com/alvations/bayesline">https://github.com/alvations/bayesline</a>中的一个简单示例</p>
<p>输入文件<code>train.txt</code>:</p>
<pre><code>$ echo """Pošto je EULEX obećao da će obaviti istragu o prošlosedmičnom izbijanju nasilja na sjeveru Kosova, taj incident predstavlja još jedan ispit kapaciteta misije da doprinese jačanju vladavine prava.
> De todas as provações que teve de suplantar ao longo da vida, qual foi a mais difícil? O início. Qualquer começo apresenta dificuldades que parecem intransponíveis. Mas tive sempre a minha mãe do meu lado. Foi ela quem me ajudou a encontrar forças para enfrentar as situações mais decepcionantes, negativas, as que me punham mesmo furiosa.
> Al parecer, Andrea Guasch pone que una relación a distancia es muy difícil de llevar como excusa. Algo con lo que, por lo visto, Alex Lequio no está nada de acuerdo. ¿O es que más bien ya ha conseguido la fama que andaba buscando?
> Vo väčšine golfových rezortov ide o veľký komplex niekoľkých ihrísk blízko pri sebe spojených s hotelmi a ďalšími možnosťami trávenia voľného času – nie vždy sú manželky či deti nadšenými golfistami, a tak potrebujú iný druh vyžitia. Zaujímavé kombinácie ponúkajú aj rakúske, švajčiarske či talianske Alpy, kde sa dá v zime lyžovať a v lete hrať golf pod vysokými alpskými končiarmi.""" > test.in
</code></pre>
<p>代码:</p>
<pre><code>import codecs, re, time
from itertools import chain
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
trainfile = 'train.txt'
# Vectorizing data.
train = []
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
tags = ['bs','pt','es','sr']
# Training NB
mnb = MultinomialNB()
mnb.fit(trainset, tags)
def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
labelid = list(classifier.classes_).index(classlabel)
feature_names = vectorizer.get_feature_names()
topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]
for coef, feat in topn:
print classlabel, feat, coef
most_informative_feature_for_class(word_vectorizer, mnb, 'bs')
print
most_informative_feature_for_class(word_vectorizer, mnb, 'pt')
</code></pre>
<p>[出局]:</p>
<pre><code>bs obećao -4.50534985071
bs pošto -4.50534985071
bs prava -4.50534985071
bs predstavlja -4.50534985071
bs prošlosedmičnom -4.50534985071
bs sjeveru -4.50534985071
bs taj -4.50534985071
bs vladavine -4.50534985071
bs će -4.50534985071
bs da -4.0998847426
pt teve -4.63472898823
pt tive -4.63472898823
pt todas -4.63472898823
pt vida -4.63472898823
pt de -4.22926388012
pt foi -4.22926388012
pt mais -4.22926388012
pt me -4.22926388012
pt as -3.94158180767
pt que -3.94158180767
</code></pre>