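I'm a beginner in ML. The problem is that my training and test data are in different files and have different lengths, so I get the following error: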
Traceback (most recent call last):
  File "C:/Users/Ellen/Desktop/Python/ML_4.py", line 35, in <module>
    X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
  File "C:\Python\Python37\lib\site-packages\sklearn\model_selection\_split.py", line 2184, in train_test_split
    arrays = indexable(*arrays)
  File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 260, in indexable
    check_consistent_length(*result)
  File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 235, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [29675, 9574, 29675]
I don't know how to fix this error. Here is my code:
tweets_train = pd.read_csv('Final.csv')
features_train = tweets_train.iloc[:, 1].values
labels= tweets_train.iloc[:, 0].values
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_train = vectorizer.fit_transform(features_train).toarray()
tweets_test = pd.read_csv('DataF1.csv')
features_test= tweets_test.iloc[:, 1].values.astype('U')
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_test = vectorizer.fit_transform(features_test).toarray()
X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
#regr.fit(X_train, y_train)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
The line that produces the error is: X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
processed_features_train.shape outputs (29675, 28148), while processed_features_test.shape outputs (9574, 11526).
Sample data looks like this (the first column is "labels" and the second column is "text"):
neutral tap to explore the biggest change to world wars since world war
neutral tap to explore the biggest change to sliced bread.
negative apple blocked
neutral apple applesupport can i have a yawning emoji ? i think i am asking for the 3rd or 5th time
neutral apple made with 20 more child labor
negative apple is not she the one who said she hates americans ?
Both the training data file and the test data file contain only 3 labels (positive, negative, neutral).
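This is because you are passing three arrays to train_test_split instead of just X and y. Because your test set is in a separate file, you don't need to split the data at all (unless you want a validation set, or the test set is unlabeled, as in a competition).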
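To illustrate, here is a minimal sketch with made-up toy arrays (the names `X`, `X_other`, and `y` are hypothetical and not from the question). `train_test_split` splits every array it receives at the same row indices, so all of them must have the same number of rows. Passing an extra array of a different length should reproduce the `ValueError` shown in the traceback.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins (hypothetical): 10 labelled rows, plus an unrelated 4-row array.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
X_other = np.arange(8).reshape(4, 2)

# Every array is split at the same row indices, so the lengths must match.
try:
    train_test_split(X, X_other, y, random_state=0)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# With matching X and y, the call returns four arrays. This form is only
# needed if you want to hold out a validation set from the training file.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_tr.shape, X_val.shape)
```

Note that `test_size` is either a fraction between 0 and 1 or an integer row count, so the `test_size=1` in the question would hold out a single row even if the array lengths matched.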
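You also shouldn't fit a new vectorizer on the test data; doing so means the columns of the training set and the test set no longer correspond to each other. Instead, use vectorizer.transform(features_test), with the same vectorizer object that you called fit_transform on to produce the training data.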
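A minimal sketch of that approach, using a few made-up tweets in place of `Final.csv` and `DataF1.csv`. The toy data, the `labels_train` and `labels_test` names, and the small `n_estimators` value are assumptions for illustration, and the stop-word list from the question is left out to keep the example self-contained. The vectorizer is fitted once on the training text and reused for the test text, so both matrices share the same columns and no `train_test_split` call is needed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix

# Toy stand-ins for the two CSV files (hypothetical data).
features_train = ["apple blocked my account", "tap to explore the change",
                  "apple made with child labor", "tap to explore sliced bread"]
labels_train = ["negative", "neutral", "negative", "neutral"]
features_test = ["apple blocked", "tap to explore world wars"]
labels_test = ["negative", "neutral"]

# Fit the vocabulary on the training text only, then reuse it for the test text.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(features_train)
X_test = vectorizer.transform(features_test)

# Both matrices now have one column per word in the training vocabulary.
print(X_train.shape, X_test.shape)

# The sparse matrices are passed directly; RandomForestClassifier accepts them.
text_classifier = RandomForestClassifier(n_estimators=50, random_state=0)
text_classifier.fit(X_train, labels_train)
predictions = text_classifier.predict(X_test)

print(confusion_matrix(labels_test, predictions))
print(classification_report(labels_test, predictions, zero_division=0))
```

With the real files, the test labels would presumably come from `tweets_test.iloc[:, 0].values`, assuming the test file uses the same column layout as the training file. Skipping `.toarray()` also avoids building a dense 29675 x 28148 array in memory.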