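<p>Hello, I'm new to scikit-learn and I'm trying to do some multiclass text classification, following <a href="http://marcobonzanini.com/2015/01/19/sentiment-analysis-with-python-and-scikit-learn/" rel="nofollow">this</a> tutorial.<br/>
My dataset has 4 classes, <code>'fipdl', 'lna','m5s','pd'</code>, so I have 4 folders (one per class), each containing 120 text files of about 25 lines of text (Facebook statuses).
I use 90% of the files for training and 10% for testing.<br/>
The 10% of the txt files whose names start with "ts" are the ones I use for testing.<br/>
So my code is:</p>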
<pre><code>import sys
import os
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer


def usage():
    print("Usage:")
    print("python %s <data_dir>" % sys.argv[0])


if __name__ == '__main__':

    if len(sys.argv) < 2:
        usage()
        sys.exit(1)

    data_dir = sys.argv[1]
    classes = ['fipdl', 'lna', 'm5s', 'pd']

    # Read the data: files whose names start with 'ts' form the test set
    train_data = []
    train_labels = []
    test_data = []
    test_labels = []
    for curr_class in classes:
        dirname = os.path.join(data_dir, curr_class)
        for fname in os.listdir(dirname):
            with open(os.path.join(dirname, fname), 'r') as f:
                content = f.read()
                if fname.startswith('ts'):
                    test_data.append(content)
                    test_labels.append(curr_class)
                else:
                    train_data.append(content)
                    train_labels.append(curr_class)

    # Create feature vectors
    vectorizer = TfidfVectorizer(min_df=5,
                                 max_df=0.8,
                                 sublinear_tf=True,
                                 use_idf=True)
    train_vectors = vectorizer.fit_transform(train_data)
    test_vectors = vectorizer.transform(test_data)

    # Perform classification with SVM, kernel=rbf
    classifier_rbf = svm.SVC()
    t0 = time.time()
    classifier_rbf.fit(train_vectors, train_labels)
    t1 = time.time()
    prediction_rbf = classifier_rbf.predict(test_vectors)
    t2 = time.time()
    time_rbf_train = t1 - t0
    time_rbf_predict = t2 - t1

    # Perform classification with SVM, kernel=linear
    classifier_linear = svm.SVC(kernel='linear')
    t0 = time.time()
    classifier_linear.fit(train_vectors, train_labels)
    t1 = time.time()
    prediction_linear = classifier_linear.predict(test_vectors)
    t2 = time.time()
    time_linear_train = t1 - t0
    time_linear_predict = t2 - t1

    # Perform classification with LinearSVC (liblinear-based linear SVM)
    classifier_liblinear = svm.LinearSVC()
    t0 = time.time()
    classifier_liblinear.fit(train_vectors, train_labels)
    t1 = time.time()
    prediction_liblinear = classifier_liblinear.predict(test_vectors)
    t2 = time.time()
    time_liblinear_train = t1 - t0
    time_liblinear_predict = t2 - t1

    # Print results in a nice table
    print("Results for SVC(kernel=rbf)")
    print("Training time: %fs; Prediction time: %fs" % (time_rbf_train, time_rbf_predict))
    print(classification_report(test_labels, prediction_rbf))
    print("Results for SVC(kernel=linear)")
    print("Training time: %fs; Prediction time: %fs" % (time_linear_train, time_linear_predict))
    print(classification_report(test_labels, prediction_linear))
    print("Results for LinearSVC()")
    print("Training time: %fs; Prediction time: %fs" % (time_liblinear_train, time_liblinear_predict))
    print(classification_report(test_labels, prediction_liblinear))
</code></pre>
<p>Output:</p>
<p>[classification reports not preserved in the source]</p>
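<p>Now, the results seem too good to be true, since every method gives me a precision of 1.<br/>
I think it would be better to try predicting a string that I pass in myself rather than the test set, so that I can do more testing, so I changed the original code to:</p>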
<pre><code>import sys
import os
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer


def usage():
    print("Usage:")
    print("python %s <data_dir>" % sys.argv[0])


if __name__ == '__main__':

    if len(sys.argv) < 2:
        usage()
        sys.exit(1)

    data_dir = sys.argv[1]
    classes = ['fipdl', 'lna', 'm5s', 'pd']

    # Read the data: files whose names start with 'ts' form the test set
    train_data = []
    train_labels = []
    test_data = []
    test_labels = []
    for curr_class in classes:
        dirname = os.path.join(data_dir, curr_class)
        for fname in os.listdir(dirname):
            with open(os.path.join(dirname, fname), 'r') as f:
                content = f.read()
                if fname.startswith('ts'):
                    test_data.append(content)
                    test_labels.append(curr_class)
                else:
                    train_data.append(content)
                    train_labels.append(curr_class)

    # Create feature vectors
    vectorizer = TfidfVectorizer(min_df=5,
                                 max_df=0.8,
                                 sublinear_tf=True,
                                 use_idf=True)
    string = ['string to predict']  # my string
    vector = vectorizer.transform(string)  # convert
    train_vectors = vectorizer.fit_transform(train_data)
    test_vectors = vectorizer.transform(test_data)

    # Perform classification with SVM, kernel=rbf
    classifier_rbf = svm.SVC()
    t0 = time.time()
    classifier_rbf.fit(train_vectors, train_labels)
    t1 = time.time()
    prediction_rbf = classifier_rbf.predict(vector)  # predict
    t2 = time.time()
    time_rbf_train = t1 - t0
    time_rbf_predict = t2 - t1

    # Perform classification with SVM, kernel=linear
    classifier_linear = svm.SVC(kernel='linear')
    t0 = time.time()
    classifier_linear.fit(train_vectors, train_labels)
    t1 = time.time()
    prediction_linear = classifier_linear.predict(test_vectors)
    t2 = time.time()
    time_linear_train = t1 - t0
    time_linear_predict = t2 - t1

    # Perform classification with LinearSVC (liblinear-based linear SVM)
    classifier_liblinear = svm.LinearSVC()
    t0 = time.time()
    classifier_liblinear.fit(train_vectors, train_labels)
    t1 = time.time()
    prediction_liblinear = classifier_liblinear.predict(test_vectors)
    t2 = time.time()
    time_liblinear_train = t1 - t0
    time_liblinear_predict = t2 - t1

    # Print results in a nice table
    print("Results for SVC(kernel=rbf)")
    print("Training time: %fs; Prediction time: %fs" % (time_rbf_train, time_rbf_predict))
    print(classification_report(test_labels, prediction_rbf))
    print("Results for SVC(kernel=linear)")
    print("Training time: %fs; Prediction time: %fs" % (time_linear_train, time_linear_predict))
    print(classification_report(test_labels, prediction_linear))
    print("Results for LinearSVC()")
    print("Training time: %fs; Prediction time: %fs" % (time_liblinear_train, time_liblinear_predict))
    print(classification_report(test_labels, prediction_liblinear))
</code></pre>
<p>But it fails with:</p>
<pre><code>ValueError: Found arrays with inconsistent numbers of samples: [18 44]
</code></pre>
<p>What am I missing? Or is this a completely wrong approach?<br/>
Any help would be appreciated,<br/>
thanks in advance, Nico.</p>
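<p>For reference, here is a minimal sketch of one way to predict a single custom string alongside the test set. It rests on two points. First, a <code>TfidfVectorizer</code> has to be fitted (here via <code>fit_transform</code>) before <code>transform</code> is called on new text. Second, <code>classification_report</code> expects as many predictions as true labels, so predictions for a custom string are printed directly rather than scored against <code>test_labels</code>. Reading the error above as such a length mismatch is an assumption, not a confirmed diagnosis. The toy documents, the two labels used, and the names <code>new_texts</code>, <code>new_vectors</code> and <code>new_predictions</code> are hypothetical, and <code>min_df</code>/<code>max_df</code> are left out only because the toy corpus is tiny.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the files read from data_dir.
train_data = [
    "taxes budget economy reform",
    "economy taxes jobs budget",
    "football match goal team",
    "team goal league football",
]
train_labels = ["pd", "pd", "m5s", "m5s"]
test_data = ["budget reform taxes", "league match goal"]
test_labels = ["pd", "m5s"]

# Fit the vectorizer on the training data first ...
vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
# ... and only then transform other text with the learned vocabulary.
test_vectors = vectorizer.transform(test_data)

classifier = LinearSVC()
classifier.fit(train_vectors, train_labels)

# Evaluation: one prediction per test label, so the lengths match.
test_predictions = classifier.predict(test_vectors)
print(classification_report(test_labels, test_predictions))

# Ad-hoc prediction: there are no true labels for this string,
# so the predicted class is printed instead of being scored.
new_texts = ["jobs and taxes"]
new_vectors = vectorizer.transform(new_texts)
new_predictions = classifier.predict(new_vectors)
print(new_predictions)
```

<p>Keeping the evaluation predictions and the ad-hoc predictions in separate variables avoids passing arrays of different lengths to <code>classification_report</code>.</p>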