_joint_log_likelihood may be giving me wrong values

Posted 2024-10-01 22:40:46


I have code like this:

x_train=data['TOKEN'].loc[:2]
y=data['label'].loc[:2]
x_test=data['TOKEN'].loc[3:]

The data contains 3 training documents, one for each of the classes (-1), (0), (1), and 1 test document.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF training
tfidf = TfidfVectorizer(smooth_idf=False, norm=None)
x_tfidf_train = tfidf.fit_transform(x_train)
# get_feature_names() is get_feature_names_out() in scikit-learn >= 1.0
tfidfframe_train = pd.DataFrame(x_tfidf_train.toarray(), columns=tfidf.get_feature_names())
# the output of tfidfframe_train
#     a       b       c       d       e       f
# 0   0.0     0.0     0.0     1.477   1.477   1.0  -> class -1, training doc 1
# 1   0.0     0.0     1.176   0.0     0.0     1.0  -> class 0, training doc 2
# 2   1.477   1.477   1.176   0.0     0.0     1.0  -> class 1, training doc 3

# TF-IDF testing
x_tfidf_test = tfidf.transform(x_test)
tfidfframe_test = pd.DataFrame(x_tfidf_test.toarray(), columns=tfidf.get_feature_names())
# the output of tfidfframe_test
#     a     b    c     d    e    f
# 0   0.0  0.0  1.17  0.0  0.0  1.0
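
The weights in both tables look consistent with an unsmoothed idf of 1 + log10(n / df), where n is the number of training documents and df is the number of documents containing the term. The sketch below only checks that arithmetic. The base-10 logarithm is an assumption taken from the numbers in the post, and the document frequencies are read off the training table; released scikit-learn versions use the natural logarithm.

```python
import math

n_docs = 3  # three training documents in the post

# Document frequencies read off the training table (assumed from the post).
doc_freq = {"a": 1, "b": 1, "c": 2, "d": 1, "e": 1, "f": 3}

# Unsmoothed idf with a base-10 log (assumption matching the posted numbers).
idf = {term: 1 + math.log10(n_docs / df) for term, df in doc_freq.items()}

print({term: round(value, 3) for term, value in idf.items()})
```

Each term occurs at most once per document here, so the TF-IDF weight equals the idf itself.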

Now we know that the test document contains the two words c and f. I fit the training data with MultinomialNB:

from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB(alpha=1.0)
classifier = model.fit(x_tfidf_train, y)
print('class log prior \n', classifier.class_log_prior_)
# output (log base 10): log10(1/3) = -0.47712125, so this output is correct
# class log prior
#  [-0.47712125 -0.47712125 -0.47712125]
print('Conditional Probabilities :\n', classifier.feature_log_prob_)  # log P(w|c)
# output: this is also correct; it matches log10 P(w|c) computed by hand from the training TF-IDF values above
#       a           b           c           d           e           f
# [[-0.99800822 -0.99800822 -0.99800822 -0.60406095 -0.60406095 -0.69697822]  -> class -1, training doc 1
#  [-0.91254573 -0.91254573 -0.57486863 -0.91254573 -0.91254573 -0.61151573]  -> class 0, training doc 2
#  [-0.65256092 -0.65256092 -0.70883108 -1.04650819 -1.04650819 -0.74547819]] -> class 1, training doc 3
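
As a cross-check of the conditional probabilities, here is a minimal sketch of the hand calculation: add alpha to each TF-IDF value, divide by the smoothed row total, and take log10. It assumes one training document per class (so each row is that class's feature total) and the base-10 logarithm used in the post; the array values are the rounded ones from the training table.

```python
import numpy as np

# TF-IDF rows copied from the training table in the post (rounded to 3 decimals).
X_train = np.array([
    [0.0,   0.0,   0.0,   1.477, 1.477, 1.0],  # class -1
    [0.0,   0.0,   1.176, 0.0,   0.0,   1.0],  # class 0
    [1.477, 1.477, 1.176, 0.0,   0.0,   1.0],  # class 1
])
alpha = 1.0  # same smoothing as MultinomialNB(alpha=1.0)

# Laplace-smoothed feature totals per class, normalised within each class.
smoothed = X_train + alpha
feature_log_prob = np.log10(smoothed / smoothed.sum(axis=1, keepdims=True))

print(np.round(feature_log_prob, 3))
```

Up to the rounding of the inputs, the rows agree with the feature_log_prob_ output above.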

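Now the problem: I am trying to compute the per-class log score for the test document, the maximum of which gives the predicted class. It should be log P(c) + the sum of log P(w|c), which sklearn calls _joint_log_likelihood.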

So we can compute it by hand for the words in the test document, [c, f]:

     c             f           log10 P(c)
-0.99800822 + -0.69697822 + -0.47712125 = -2.17210769 -> class -1
-0.57486863 + -0.61151573 + -0.47712125 = -1.66350558 -> class 0
-0.70883108 + -0.74547819 + -0.47712125 = -1.93143052 -> class 1

But when I print it from the classifier, the output does not match:

jll = classifier._joint_log_likelihood(x_tfidf_test)
# output, ordered left to right as classes (-1, 0, 1)
#    class -1    class 0     class 1
# [[-2.34784822 -1.76473496 -2.05624949]]

What is wrong with MultinomialNB's _joint_log_likelihood? The MultinomialNB source in naive_bayes.py says:

def _joint_log_likelihood(self, X):
    """Calculate the posterior log probability of the samples X"""
    return (safe_sparse_dot(X, self.feature_log_prob_.T) +
            self.class_log_prior_)

Maybe you can review it and tell me. Here is the data: Data. I hope you can answer.
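
For reference, a minimal NumPy sketch of what the quoted return expression evaluates to with the numbers above, next to the hand calculation. It assumes the c weight of the test document is 1.176, as in the training table (the test table shows it truncated to 1.17), and it copies the posted feature_log_prob_ and class_log_prior_ values rather than refitting a model. The dot product multiplies each log P(w|c) by the TF-IDF weight of the word, whereas the hand calculation adds each log P(w|c) once.

```python
import numpy as np

# Values copied from the classifier output in the post.
feature_log_prob = np.array([
    [-0.99800822, -0.99800822, -0.99800822, -0.60406095, -0.60406095, -0.69697822],
    [-0.91254573, -0.91254573, -0.57486863, -0.91254573, -0.91254573, -0.61151573],
    [-0.65256092, -0.65256092, -0.70883108, -1.04650819, -1.04650819, -0.74547819],
])
class_log_prior = np.full(3, -0.47712125)

# TF-IDF row of the test document (c weight assumed to be 1.176).
x_test = np.array([[0.0, 0.0, 1.176, 0.0, 0.0, 1.0]])

# What the quoted expression computes: weights times log-probabilities, plus the prior.
weighted = x_test @ feature_log_prob.T + class_log_prior

# The hand calculation: every word present counts once, whatever its weight.
present = (x_test > 0).astype(float)
unweighted = present @ feature_log_prob.T + class_log_prior

print("weighted  :", weighted)
print("unweighted:", unweighted)
```

Under these assumptions the weighted row is close to the classifier output, and the unweighted row is close to the hand-computed sums, so the gap between them comes from the 1.176 factor on word c.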

