scikit-learn logistic regression feature importance


I am looking for a way to determine the influence of the features I am using in a classification problem. Using sklearn's logistic regression classifier (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), I understood that the .coef_ attribute gives me the information I am after (as also discussed in this thread: How to find the importance of the features for a logistic regression model?).
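
To make it concrete, this is the kind of access I have in mind (a minimal sketch with made-up data and placeholder variable names, not my actual code):

import numpy
from sklearn.linear_model import LogisticRegression

X = numpy.array([[0, 1, 2], [1, 0, 3], [2, 1, 0], [0, 2, 1]])  # toy matrix with 3 encoded feature columns
y = numpy.array([0, 1, 0, 1])                                  # toy binary labels

clf = LogisticRegression()
clf.fit(X, y)
print(clf.coef_)  # one weight per feature column; sign and magnitude indicate direction and strength of influence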

The first few rows of my matrix:

phrase_type,type,complex_np,np_form,referentiality,grammatical_role,ambiguity,anaphor_type,dir_speech,length_of_span,length_of_coref_chain,position_in_coref_chain,position_in_sentence,is_topic
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,1,-1,18,True
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,2,1,1,True
np,none,no,defnp,discourse-new,sbj,not_ambig,_unspecified_,text_level,2,1,-1,9,True

where the first row is the header and the rest is the data (which my code converts to ints using sklearn.preprocessing's LabelEncoder).
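
For example, a single categorical column goes through LabelEncoder roughly like this (a sketch, not my exact preprocessing code):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['defnp', 'defnp', 'pds', 'ne']))  # [0 0 2 1], classes are sorted alphabetically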

Now, when I do

print(classifier.coef_)

I get:

[[ 0.84768459 -0.56344453  0.00365928  0.21441586 -1.70290447 -0.18460676
   1.6167634   0.08556331  0.02152226 -0.05111953  0.07310608 -0.073653  ]]

which contains 12 columns/elements. This confuses me, since my data contains 13 columns (plus a 14th column with the labels, which I separate from the features later in my code). I was wondering whether sklearn expects/assumes the first column to be an id and does not actually use its values, but I could not find any information on this.
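
(For reference, a quick sanity check with made-up data, along the lines of the sketch below, would show whether a column is silently dropped; the names here are placeholders, not my real data.)

import numpy
from sklearn.linear_model import LogisticRegression

rng = numpy.random.RandomState(0)
X = rng.randint(0, 5, size=(20, 13))  # 13 feature columns, like my data
y = numpy.array([0, 1] * 10)          # binary labels

clf = LogisticRegression()
clf.fit(X, y)
print(clf.coef_.shape)  # (1, 13): one coefficient per input column, none skipped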

Any help would be much appreciated!


1 Answer

I am not sure how to edit my original question for future reference, so I will post a minimal example here:

import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score
from collections import defaultdict
import numpy

headers = ['phrase_type','type','complex_np','np_form','referentiality','grammatical_role','ambiguity','anaphor_type','dir_speech','length_of_span','length_of_coref_chain','position_in_coref_chain','position_in_sentence','is_topic']
matrix = [
['np','none','no','no,pds','referring','dir-obj','not_ambig','_unspecified_','text_level','1','1','-1','1','True'],
['np','none','no','pds','not_specified','sbj','not_ambig','_unspecified_','text_level','1','1','-1','21','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','1','-1','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','2','0','6','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','2','0','4','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','21','1','-1','1','True'],
['np','anaphoric','no','ne','referring','other','not_ambig','anaphor_nominal','text_level','1','9','4','2','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','3','9','5','1','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','9','7','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','2','1','1','True'],
['np','anaphoric','no','ne','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','2','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','9','1','13','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','3','0','5','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','1','-1','1','False'],
['np','none','no','ne','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','9','0','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','5','1','-1','5','False'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','1','5','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','3','3','0','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','3','1','1','True'],
['np','anaphoric','no','pds','referring','sbj','not_ambig','anaphor_nominal','text_level','1','1','-1','2','True']
]


df = pandas.DataFrame(matrix, columns=headers)
# one LabelEncoder per column, created on first access
d = defaultdict(LabelEncoder)
# fit an encoder for every column, then transform the whole frame to ints
fit = df.apply(lambda x: d[x.name].fit_transform(x))
df = df.apply(lambda x: d[x.name].transform(x))

# use the first ~10% of the rows as test data, the rest as training data
testrows = []
trainrows = []
splitIndex = len(matrix) // 10
for index, row in df.iterrows():
    if index < splitIndex:
        testrows.append(row)
    else:
        trainrows.append(row)
testdf = pandas.DataFrame(testrows)
traindf = pandas.DataFrame(trainrows)
# separate the label column (is_topic) from the feature columns
train_labels = traindf.is_topic
labels = list(set(train_labels))
train_labels = numpy.array([labels.index(x) for x in train_labels])
train_features = traindf.iloc[:,0:len(headers)-1]
train_features = numpy.array(train_features)
print('train features shape:', train_features.shape)
test_labels = testdf.is_topic
# reuse the label mapping built from the training labels
test_labels = numpy.array([labels.index(x) for x in test_labels])
test_features = testdf.iloc[:,0:len(headers)-1]
test_features = numpy.array(test_features)

classifier = LogisticRegression()
classifier.fit(train_features, train_labels)
print(classifier.coef_)  # one coefficient per feature column
results = classifier.predict(test_features)
f1 = f1_score(test_labels, results)
print(f1)
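
To relate the coefficients back to feature influence, they can be paired with the column names, for example like this (a small sketch building on the variables above, not part of my original script):

feature_names = headers[:-1]  # every column except the is_topic label
for name, weight in sorted(zip(feature_names, classifier.coef_[0]), key=lambda t: abs(t[1]), reverse=True):
    print(name, weight)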

I think I may have found the source of the error (thanks @Alexey Trofimov for pointing me in the right direction). My code originally contained:

train_features = traindf.iloc[:,1:len(headers)-1]

This was copied over from another script, where my matrix had ids as the first column, which I therefore did not want to include. The len(headers)-1, if I understand correctly, is there to leave out the actual label. Testing this on the real scenario, dropping the -1 yields a perfect f-score, which makes sense, since the classifier would then just look at the actual label and always predict correctly... So I have now changed this to

train_features = traindf.iloc[:,0:len(headers)-1]

as shown in the code snippet above, and now get 13 columns (in X_train.shape, and hence in classifier.coef_). I think this solves my problem, but I am still not 100% convinced, so if someone can point out a flaw in this reasoning / in my code, I would be glad to hear it.
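
To make the off-by-one visible, the two slice variants can be compared directly (a small check reusing df and headers from the example above):

wrong = df.iloc[:,1:len(headers)-1]  # skips the first feature column: 12 columns
right = df.iloc[:,0:len(headers)-1]  # keeps all 13 features, drops only the label
print(wrong.shape[1], right.shape[1])  # 12 13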
