<p>I would use <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html" rel="nofollow noreferrer">sklearn.feature_extraction.text.TfidfVectorizer</a>, which is designed specifically for this kind of task:</p>
<p><strong>Demo:</strong></p>
<pre><code>In [63]: df
Out[63]:
                   Phrase Sentiment
0        is it good movie  positive
1  wooow is it very goode  positive
2               bad movie  negative
</code></pre>
<p>Solution:</p>
^{pr2}$
<p>Result:</p>
<pre><code>In [31]: r.join(df)
Out[31]:
  Sentiment  bad  good     goode     wooow
0  positive  0.0   1.0  0.000000  0.000000
1  positive  0.0   0.0  0.707107  0.707107
2  negative  1.0   0.0  0.000000  0.000000
</code></pre>
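<p>To see where values such as 0.707107 come from, here is a rough pure-Python re-computation. It assumes the vectorizer's default smoothed IDF and L2 row normalization together with <code>sublinear_tf=True</code> and <code>max_df=0.5</code>. The tiny <code>STOP</code> set is a hand-picked stand-in for the English stop-word list, and <code>tfidf_row</code> is a hypothetical helper, not part of scikit-learn:</p>

```python
import math
from collections import Counter

docs = ["is it good movie", "wooow is it very goode", "bad movie"]
STOP = {"is", "it", "very"}   # assumption: only the stop words that occur here
MAX_DF = 0.5                  # drop terms found in more than 50% of documents

tokenized = [[w for w in d.split() if w not in STOP] for d in docs]
n = len(docs)

# Document frequency: in how many phrases each term occurs
doc_freq = Counter(w for toks in tokenized for w in set(toks))
vocab = sorted(w for w, c in doc_freq.items() if c <= MAX_DF * n)

def tfidf_row(toks):
    """Sublinear TF times smoothed IDF, then L2-normalize the row."""
    counts = Counter(t for t in toks if t in vocab)
    raw = {
        w: (1 + math.log(c)) * (math.log((1 + n) / (1 + doc_freq[w])) + 1)
        for w, c in counts.items()
    }
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return [raw.get(w, 0.0) / norm for w in vocab]

rows = [tfidf_row(t) for t in tokenized]
print(vocab)
for row in rows:
    print([round(v, 6) for v in row])
```

<p>A row with a single retained term normalizes to 1.0, while the second phrase keeps two terms with equal weights, so each becomes 1/sqrt(2).</p>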
<p><strong>UPDATE:</strong> a memory-saving solution:</p>
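<pre><code>from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
# pop() removes 'Phrase' from df while vectorizing it, so the raw text is not kept around
X = vect.fit_transform(df.pop('Phrase')).toarray()
# Add one column per term in place, without building a second DataFrame.
# On scikit-learn older than 1.0, call vect.get_feature_names() instead.
for i, col in enumerate(vect.get_feature_names_out()):
    df[col] = X[:, i]
</code></pre>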
<p><strong>UPDATE 2:</strong> <a href="https://stackoverflow.com/questions/41916560/pandas-dataframe-memory-python">related question where the memory issue was finally solved</a></p>