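<p>I am learning about random forests in scikit-learn and, as an example, I would like to use a random forest classifier to classify text from my own dataset. I first vectorized the text with tf-idf and then ran the classifier:</p>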
<pre><code>from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10)
classifier.fit(X_train, y_train)
prediction = classifier.predict(X_test)
</code></pre>
<p>When I run the classification, I get this:</p>
<pre><code>TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
</code></pre>
<p>I then called <code>.toarray()</code> on <code>X_train</code> and got the following:</p>
<pre><code>TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
</code></pre>
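<p>A minimal sketch of how the dense route might look, assuming the second error came from <code>X_test</code> still being sparse when it reached <code>predict</code> (the question does not show that line). The toy texts, labels, and the names <code>X_train_dense</code> and <code>X_test_dense</code> are invented for illustration:</p>

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for illustration only.
train_texts = ["good movie", "great film", "bad movie", "awful film"]
train_labels = [1, 1, 0, 0]
test_texts = ["great movie", "awful movie"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # scipy sparse matrix
X_test = vectorizer.transform(test_texts)        # sparse, same columns as X_train

# Densify BOTH splits. This is fine for small data but can exhaust
# memory when the vocabulary is large.
X_train_dense = X_train.toarray()
X_test_dense = X_test.toarray()

classifier = RandomForestClassifier(n_estimators=10, random_state=0)
classifier.fit(X_train_dense, train_labels)
prediction = classifier.predict(X_test_dense)
```

<p>Converting only the training matrix leaves the test matrix sparse, which would be consistent with the error above appearing at prediction time.</p>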
<p>From a previous <a href="https://stackoverflow.com/questions/21689141/classifying-text-documents-with-random-forests">question</a>, I understood that I need to reduce the dimensionality of the numpy array, so I did that as well:</p>
<pre><code>from sklearn.decomposition.truncated_svd import TruncatedSVD
pca = TruncatedSVD(n_components=300)
X_reduced_train = pca.fit_transform(X_train)
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10)
classifier.fit(X_reduced_train, y_train)
prediction = classifier.predict(X_testing)
</code></pre>
<p>Then I got this exception:</p>
<pre><code> File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
n_samples = len(X)
File "/usr/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 192, in __len__
raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
</code></pre>
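<p>The traceback points at <code>predict</code>, and in the snippet above <code>X_testing</code> is never passed through the fitted <code>TruncatedSVD</code>. A hedged sketch of applying the same fitted reduction to both splits follows; the toy corpus, labels, and the names <code>svd</code>, <code>X_train_reduced</code>, and <code>X_test_reduced</code> are assumptions made for the example, not part of the original code:</p>

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for illustration only.
train_texts = [
    "good movie", "great film", "nice story",
    "bad movie", "awful film", "poor story",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great movie", "awful story"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Fit the reduction on the training split only, then reuse the fitted
# object on the test split so both live in the same reduced space.
svd = TruncatedSVD(n_components=2, random_state=0)
X_train_reduced = svd.fit_transform(X_train)  # dense numpy array
X_test_reduced = svd.transform(X_test)        # dense numpy array

classifier = RandomForestClassifier(n_estimators=10, random_state=0)
classifier.fit(X_train_reduced, train_labels)
prediction = classifier.predict(X_test_reduced)
```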
<p>I tried the following:</p>
<pre><code>prediction = classifier.predict(X_train.getnnz())
</code></pre>
<p>And got this:</p>
<pre><code> File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
n_samples = len(X)
TypeError: object of type 'int' has no len()
</code></pre>
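<p>For reference, <code>getnnz()</code> returns a single integer (the count of stored nonzero entries), which is why <code>predict</code> then complains about an <code>int</code>. A small sketch on a hand-made matrix, chosen only for illustration, showing the difference between <code>getnnz()</code>, <code>shape[0]</code>, and <code>len()</code>:</p>

```python
from scipy.sparse import csr_matrix

# A 2 x 3 sparse matrix with three stored nonzero values.
X = csr_matrix([[0.0, 1.5, 0.0], [2.0, 0.0, 3.0]])

nnz = X.getnnz()        # count of stored nonzero entries, a plain int
n_samples = X.shape[0]  # number of rows, i.e. samples

# len() is deliberately unsupported on 2-D scipy sparse matrices.
try:
    len(X)
    len_error = None
except TypeError as exc:
    len_error = exc

print(nnz, n_samples, type(len_error).__name__)
```

<p>Neither value is a substitute for the feature matrix itself; the estimator still needs the full array of samples.</p>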
<p>This raises two questions: how can I use a random forest to classify correctly, and what is going wrong with <code>X_train</code>?</p>
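<p>One possible arrangement, sketched under assumptions, is to chain the vectorizer and the classifier in a scikit-learn <code>Pipeline</code> so that training and test text always go through the same transformations. Recent scikit-learn releases accept sparse input in <code>RandomForestClassifier</code>, so no densifying step is included here; the much older version in the tracebacks may behave differently. The toy data and the name <code>model</code> are invented for the example:</p>

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
train_texts = ["good movie", "great film", "bad movie", "awful film"]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts = ["great movie", "awful movie"]

# The pipeline fits the vectorizer on the training text and reuses it,
# unchanged, whenever predict() is called on new text.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(train_texts, train_labels)
prediction = model.predict(test_texts)
```

<p>Because the pipeline takes raw strings, there is no separate <code>X_train</code> or <code>X_test</code> matrix to keep in sync by hand.</p>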
<p>Then I tried the following approach:</p>
<pre><code>import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv('/path/file.csv',
                 header=0, sep=',', names=['id', 'text', 'label'])
# tfidf_vect was not defined in the original snippet; a default TfidfVectorizer is assumed
tfidf_vect = TfidfVectorizer()
X = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values

# Reduce the sparse tf-idf matrix to a dense 2-component representation
pca = TruncatedSVD(n_components=2)
X = pca.fit_transform(X)
a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)

# Features and labels passed to score() must come from the same split
print('\nscore:', classifier.score(a_test, b_test))
print('\nprecision:', precision_score(b_test, prediction))
print('\nrecall:', recall_score(b_test, prediction))
print('\nconfusion matrix:\n', confusion_matrix(b_test, prediction))
print('\nclassification report:\n', classification_report(b_test, prediction))
</code></pre>