I have code like this:
x_train=data['TOKEN'].loc[:2]
y=data['label'].loc[:2]
x_test=data['TOKEN'].loc[3:]
This gives 3 training documents, one for each class (-1, 0, 1), and 1 test document.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

#TFIDF training
tfidf = TfidfVectorizer(smooth_idf=False, norm=None)
x_tfidf_train = tfidf.fit_transform(x_train)
tfidfframe_train = pd.DataFrame(x_tfidf_train.toarray(), columns=tfidf.get_feature_names())
#the output of tfidfframe_train
       a      b      c      d      e    f
0    0.0    0.0    0.0  1.477  1.477  1.0   -> class -1, training doc1
1    0.0    0.0  1.176    0.0    0.0  1.0   -> class 0, training doc2
2  1.477  1.477  1.176    0.0    0.0  1.0   -> class 1, training doc3
#TFIDF testing
x_tfidf_test = tfidf.transform(x_test)
tfidfframe_test = pd.DataFrame(x_tfidf_test.toarray(), columns=tfidf.get_feature_names())
     a    b     c    d    e    f
0  0.0  0.0  1.17  0.0  0.0  1.0
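The TF-IDF tables above can be reproduced by hand. This is a minimal sketch in plain Python, assuming each word occurs exactly once in the documents that contain it (the `train_counts` and `test_counts` matrices are hypothetical, inferred from the tables). It uses the unsmoothed idf 1 + log10(N / df), which is what the printed numbers are consistent with; scikit-learn's own `TfidfVectorizer` uses the natural logarithm, so its raw output would be scaled differently.

```python
import math

# Hypothetical term counts inferred from the tables: every word occurs once.
vocab = ["a", "b", "c", "d", "e", "f"]
train_counts = [
    [0, 0, 0, 1, 1, 1],  # doc1, class -1
    [0, 0, 1, 0, 0, 1],  # doc2, class 0
    [1, 1, 1, 0, 0, 1],  # doc3, class 1
]
test_counts = [[0, 0, 1, 0, 0, 1]]  # the test document contains c and f

n_docs = len(train_counts)
# Document frequency: number of training documents that contain each word.
df = [sum(1 for row in train_counts if row[j] > 0) for j in range(len(vocab))]
# Unsmoothed idf with a base-10 logarithm (assumption, chosen to match the post).
idf = [1.0 + math.log10(n_docs / d) for d in df]

def tfidf(rows):
    """Multiply each raw count by the idf weight of its column (no normalisation)."""
    return [[tf * w for tf, w in zip(row, idf)] for row in rows]

train_tfidf = tfidf(train_counts)
test_tfidf = tfidf(test_counts)

for row in train_tfidf + test_tfidf:
    print([round(v, 3) for v in row])
```

The idf is fitted on the training documents only, and the same weights are reused for the test document, which mirrors calling `fit_transform` on the training data and `transform` on the test data.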
So the test document contains the two words c and f. Next I fit the data to a multinomial naive Bayes model:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB(alpha=1.0)
classifier = model.fit(x_tfidf_train, y)
print('class log prior \n', classifier.class_log_prior_)
#output (log base 10)
class log prior    # log10(1/3) = -0.47712125, so this output is correct
[-0.47712125 -0.47712125 -0.47712125]
print('Conditional Probabilities :\n', classifier.feature_log_prob_)  # log of the conditional probabilities P(w|c)
#output: this is also correct; it is log base 10 of P(w|c), computed from the training TF-IDF values above
            a           b           c           d           e           f
[[-0.99800822 -0.99800822 -0.99800822 -0.60406095 -0.60406095 -0.69697822]   -> class -1, training doc1
 [-0.91254573 -0.91254573 -0.57486863 -0.91254573 -0.91254573 -0.61151573]   -> class 0, training doc2
 [-0.65256092 -0.65256092 -0.70883108 -1.04650819 -1.04650819 -0.74547819]]  -> class 1, training doc3
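For reference, these conditional log probabilities follow from the usual smoothed multinomial estimate, log10((N_cw + alpha) / (N_c + alpha * n_features)), where N_cw is the summed TF-IDF weight of word w in class c and N_c is the total for the class. The sketch below assumes the rounded TF-IDF values printed earlier and one training document per class, so it only agrees with the printed matrix to a few decimal places. The helper name `feature_log10_prob` is made up for this illustration.

```python
import math

alpha = 1.0  # Laplace smoothing, as passed to MultinomialNB(alpha=1.0)

# Rounded training TF-IDF rows copied from the post (one document per class).
train_tfidf = [
    [0.0, 0.0, 0.0, 1.477, 1.477, 1.0],      # class -1
    [0.0, 0.0, 1.176, 0.0, 0.0, 1.0],        # class 0
    [1.477, 1.477, 1.176, 0.0, 0.0, 1.0],    # class 1
]
n_features = len(train_tfidf[0])

def feature_log10_prob(class_weights):
    """Smoothed log10 P(w|c) for one class from its summed feature weights."""
    total = sum(class_weights) + alpha * n_features
    return [math.log10((w + alpha) / total) for w in class_weights]

feature_log_prob = [feature_log10_prob(row) for row in train_tfidf]
# Uniform prior: each of the three classes has one training document.
class_log_prior = [math.log10(1 / 3)] * 3

for row in feature_log_prob:
    print([round(v, 4) for v in row])
print([round(v, 8) for v in class_log_prior])
```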
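Now the problem: when I try to compute the class with the maximum log probability for the test data, it should be log P(c) + log P(w|c), which sklearn calls _joint_log_likelihood.
So we can compute it by hand for the words [c f] in the test document: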
          c             f     log10 P(c)
-0.99800822 + -0.69697822 + -0.47712125 = -2.17210769   -> class -1
-0.57486863 + -0.61151573 + -0.47712125 = -1.66350558   -> class 0
-0.70883108 + -0.74547819 + -0.47712125 = -1.93143052   -> class 1
But when I get the same quantity from the library, the output does not match:
jll = classifier._joint_log_likelihood(x_tfidf_test)
output, ordered left to right (-1, 0, 1)
    class -1     class 0     class 1
[[-2.34784822 -1.76473496 -2.05624949]]
What is wrong with MultinomialNB's _joint_log_likelihood? The source code for MultinomialNB in naive_bayes.py says:
def _joint_log_likelihood(self, X):
    """Calculate the posterior log probability of the samples X"""
    return (safe_sparse_dot(X, self.feature_log_prob_.T) +
            self.class_log_prior_)
Maybe you can review this and tell me what is going on. This is the data: Data. I hope you can answer.
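One thing the quoted source suggests: `safe_sparse_dot(X, self.feature_log_prob_.T)` is a dot product, so each log P(w|c) is multiplied by the test document's TF-IDF weight for that word rather than being counted once. The hand calculation adds the log probabilities of c and f with weight 1, while the library would weight c by its TF-IDF value. The sketch below is plain Python, not the library's implementation. It uses the numbers from the post and assumes a weight of 1.176 for c (the rounded training idf), so it should agree with the library output only to roughly three decimal places.

```python
# Log10 conditional probabilities and priors copied from the post.
feature_log_prob = [
    [-0.99800822, -0.99800822, -0.99800822, -0.60406095, -0.60406095, -0.69697822],  # class -1
    [-0.91254573, -0.91254573, -0.57486863, -0.91254573, -0.91254573, -0.61151573],  # class 0
    [-0.65256092, -0.65256092, -0.70883108, -1.04650819, -1.04650819, -0.74547819],  # class 1
]
class_log_prior = [-0.47712125, -0.47712125, -0.47712125]

# Test document as TF-IDF weights; 1.176 for "c" is an assumption (rounded idf).
x_test = [0.0, 0.0, 1.176, 0.0, 0.0, 1.0]

def joint_log_likelihood(x, log_prob, log_prior):
    """Dot product of x with each class row of log_prob, plus that class's prior."""
    return [
        sum(xi * lp for xi, lp in zip(x, row)) + prior
        for row, prior in zip(log_prob, log_prior)
    ]

# Weighted by TF-IDF, as the dot product in the quoted source would do.
weighted = joint_log_likelihood(x_test, feature_log_prob, class_log_prior)

# Hand calculation from the question: every present word counted with weight 1.
present = [1.0 if v > 0 else 0.0 for v in x_test]
unweighted = joint_log_likelihood(present, feature_log_prob, class_log_prior)

print("weighted  :", [round(v, 4) for v in weighted])
print("unweighted:", [round(v, 4) for v in unweighted])
```

Both versions pick class 0 as the most likely class here, so the predicted label is the same; only the scores differ.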