<p>I am trying to build a decision tree with Python and sklearn.
The following approach works:</p>
<pre><code>import numpy as np
import pandas as pd
from sklearn import tree

# Integer-encode every string column; join numeric columns unchanged.
for col in set(train.columns):
    if train[col].dtype == np.dtype('object'):
        s = np.unique(train[col].values)
        mapping = pd.Series([x[0] for x in enumerate(s)], index=s)
        train_fea = train_fea.join(train[col].map(mapping))
    else:
        train_fea = train_fea.join(train[col])

# Note: compute_importances was removed in later scikit-learn releases.
dt = tree.DecisionTreeClassifier(min_samples_split=3,
                                 compute_importances=True, max_depth=5)
dt.fit(train_fea, labels)
</code></pre>
<p>Now I am trying to do the same thing with DictVectorizer, but my code does not work:</p>
<pre><code>from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
train_fea = vec.fit_transform([dict(enumerate(sample)) for sample in train])

dt = tree.DecisionTreeClassifier(min_samples_split=3,
                                 compute_importances=True, max_depth=5)
dt.fit(train_fea, labels)
</code></pre>
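<p>The last line raises an error: "ValueError: Number of labels=332448 does not match number of samples=55". As I understand from the documentation, DictVectorizer is designed to convert nominal features into numerical ones. What am I doing wrong?</p>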
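<p>One likely cause, sketched on a small made-up frame (the column names and values below are illustrative assumptions, not the asker's data): iterating over a pandas DataFrame yields its column labels rather than its rows. The comprehension in the snippet above would therefore build one dict per column name, which may be why the sample count (55) is so much smaller than the label count.</p>

```python
import pandas as pd

# Hypothetical toy frame: 4 rows, 3 columns.
train = pd.DataFrame({
    "a": ["a", "b", "a", "b"],
    "b": [0, 1, 1, 0],
    "color": ["red", "blue", "red", "blue"],
})

# Iterating a DataFrame yields column labels, not rows.
iterated = list(train)

# The comprehension from the question therefore produces one dict per
# column name, keyed by character position within that name.
per_column = [dict(enumerate(sample)) for sample in train]

# One dict per row, keyed by column name, is the input DictVectorizer expects.
per_row = train.to_dict(orient="records")

print(iterated)
print(per_column)
print(len(per_column), len(per_row))
```

<p>Here <code>per_column</code> has as many entries as the frame has columns, while <code>per_row</code> has one entry per sample, which is the length the label vector has to match.</p>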
<p>Correction (thanks to ogrisel for pushing me to provide a complete example):</p>
<pre><code>import pandas as pd
import numpy as np
from sklearn import tree

##################################
# working example
train = pd.DataFrame({'a': ['a', 'b', 'a'], 'd': ['e', 'e', 'f'],
                      'b': [0, 1, 1], 'c': ['b', 'c', 'b']})
columns = set(train.columns)
columns.remove('b')
train_fea = train[['b']]
# Integer-encode every string column; join numeric columns unchanged.
for col in columns:
    if train[col].dtype == np.dtype('object'):
        s = np.unique(train[col].values)
        mapping = pd.Series([x[0] for x in enumerate(s)], index=s)
        train_fea = train_fea.join(train[col].map(mapping))
    else:
        train_fea = train_fea.join(train[col])

# Note: compute_importances was removed in later scikit-learn releases.
dt = tree.DecisionTreeClassifier(min_samples_split=3,
                                 compute_importances=True, max_depth=5)
dt.fit(train_fea, train['c'])

##########################################
# example with DictVectorizer and error
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
train_fea = vec.fit_transform([dict(enumerate(sample)) for sample in train])

dt = tree.DecisionTreeClassifier(min_samples_split=3,
                                 compute_importances=True, max_depth=5)
dt.fit(train_fea, train['c'])
</code></pre>
<p>The final code, fixed with ogrisel's help:</p>
<pre><code>import pandas as pd
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer
from sklearn import preprocessing

train = pd.DataFrame({'a': ['a', 'b', 'a'], 'd': ['e', 'x', 'f'],
                      'b': [0, 1, 1], 'c': ['b', 'c', 'b']})

# encode labels (LabelEncoder expects a 1-D sequence, so select a Series)
labels = train['c']
le = preprocessing.LabelEncoder()
labels_fea = le.fit_transform(labels)

# vectorize training data: one dict per row, keyed by column name
del train['c']
train_as_dicts = train.to_dict(orient='records')
train_fea = DictVectorizer(sparse=False).fit_transform(train_as_dicts)

# use decision tree
dt = tree.DecisionTreeClassifier()
dt.fit(train_fea, labels_fea)

# transform result back to the original label strings
predictions = le.inverse_transform(dt.predict(train_fea))
predictions_as_dataframe = train.join(pd.DataFrame({"Prediction": predictions}))
print(predictions_as_dataframe)
</code></pre>
<p>Everything works now.</p>
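<p>To see what DictVectorizer does with rows like the ones in the final code, here is a minimal sketch. The <code>rows</code> list is written out by hand as an assumption about what the per-row dicts look like. String values become one-hot <code>name=value</code> columns and numeric values pass through as a single column.</p>

```python
from sklearn.feature_extraction import DictVectorizer

# Hand-written per-row dicts mirroring the feature columns a, d, b.
rows = [
    {"a": "a", "d": "e", "b": 0},
    {"a": "b", "d": "x", "b": 1},
    {"a": "a", "d": "f", "b": 1},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)

# Feature names learned during fit, sorted alphabetically by default.
feature_names = vec.feature_names_

# A category not seen during fit ("a=c" here) is silently ignored on transform.
X_new = vec.transform([{"a": "c", "d": "e", "b": 1}])

print(feature_names)
print(X)
print(X_new)
```

<p>Keeping the fitted <code>vec</code> object around, rather than calling <code>fit_transform</code> inline, makes it possible to encode new samples with the same column layout before calling <code>predict</code>.</p>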