<p>你列举样本的方式没有意义。把它们打印出来就可以了:</p>
<pre><code>>>> import pandas as pd
>>> train = pd.DataFrame({'a' : ['a', 'b', 'a'], 'd' : ['e', 'e', 'f'],
... 'b' : [0, 1, 1], 'c' : ['b', 'c', 'b']})
>>> samples = [dict(enumerate(sample)) for sample in train]
>>> samples
[{0: 'a'}, {0: 'b'}, {0: 'c'}, {0: 'd'}]
</code></pre>
<p>现在这是一个语法上的单词列表,但与你所期望的完全不同。请尝试执行以下操作:</p>
<pre><code>>>> train_as_dicts = [dict(r.iteritems()) for _, r in train.iterrows()]
>>> train_as_dicts
[{'a': 'a', 'c': 'b', 'b': 0, 'd': 'e'},
{'a': 'b', 'c': 'c', 'b': 1, 'd': 'e'},
{'a': 'a', 'c': 'b', 'b': 1, 'd': 'f'}]
</code></pre>
<p>这看起来好多了,现在让我们试着把这些指令矢量化:</p>
<pre><code>>>> from sklearn.feature_extraction import DictVectorizer
>>> vectorizer = DictVectorizer()
>>> vectorized_sparse = vectorizer.fit_transform(train_as_dicts)
>>> vectorized_sparse
<3x7 sparse matrix of type '<type 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
>>> vectorized_array = vectorized_sparse.toarray()
>>> vectorized_array
array([[ 1., 0., 0., 1., 0., 1., 0.],
[ 0., 1., 1., 0., 1., 1., 0.],
[ 1., 0., 1., 1., 0., 0., 1.]])
</code></pre>
<p>要了解每一列的含义,请询问矢量器:</p>
<pre><code>>>> vectorizer.get_feature_names()
['a=a', 'a=b', 'b', 'c=b', 'c=c', 'd=e', 'd=f']
</code></pre>