ValueError: Found input variables with inconsistent numbers of samples: [29675, 9574, 29675]

Posted 2024-04-26 10:15:27


I am a beginner in ML. My training and test data are in separate files and have different lengths, so I get the following error:

   Traceback (most recent call last):
     File "C:/Users/Ellen/Desktop/Python/ML_4.py", line 35, in <module>
       X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
     File "C:\Python\Python37\lib\site-packages\sklearn\model_selection\_split.py", line 2184, in train_test_split
       arrays = indexable(*arrays)
     File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 260, in indexable
       check_consistent_length(*result)
     File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 235, in check_consistent_length
       " samples: %r" % [int(l) for l in lengths])
   ValueError: Found input variables with inconsistent numbers of samples: [29675, 9574, 29675]

I do not know how to fix this error. Here is my code:

  tweets_train = pd.read_csv('Final.csv')
  features_train = tweets_train.iloc[:, 1].values
  labels = tweets_train.iloc[:, 0].values
  vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
  processed_features_train = vectorizer.fit_transform(features_train).toarray()
  tweets_test = pd.read_csv('DataF1.csv')
  features_test = tweets_test.iloc[:, 1].values.astype('U')
  vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
  processed_features_test = vectorizer.fit_transform(features_test).toarray()

  X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
  text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
  #regr.fit(X_train, y_train)
  text_classifier.fit(X_train, y_train)
  predictions = text_classifier.predict(X_test)
  print(confusion_matrix(y_test,predictions))
  print(classification_report(y_test,predictions))

The line that raises the error is: X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)

processed_features_train.shape prints (29675, 28148), while processed_features_test.shape prints (9574, 11526).

Sample data is shown below (the first column is "labels", the second column is "text"):

  neutral tap to explore the biggest change to world wars since world war 
  neutral tap to explore the biggest change to sliced bread. 
  negative apple blocked 
  neutral apple applesupport can i have a yawning emoji ? i think i am asking for the 3rd or 5th time 
  neutral apple made with 20  more child labor 
  negative apple is not she the one who said she hates americans ? 

Both the training data file and the test data file contain only 3 labels (positive, negative, neutral).


2 Answers

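This happens because you are passing three arrays to train_test_split instead of just X, y. train_test_split splits every array it receives along the same row indices, so all of them must have the same number of samples; here the test features have 9574 rows while the training features and labels have 29675.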
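The length check can be reproduced on a small scale. The following is a minimal sketch with made-up toy arrays (X_a, X_b and y are illustrative names, not from the question), assuming NumPy and scikit-learn are installed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real matrices: 6 "training" rows, 4 "test" rows, 6 labels.
X_a = np.zeros((6, 3))
X_b = np.zeros((4, 2))
y = np.array([0, 1, 0, 1, 0, 1])

# Passing arrays of different lengths triggers the same ValueError as in the question.
try:
    train_test_split(X_a, X_b, y, test_size=1, random_state=0)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Arrays of equal length split fine; each input yields a train part and a test part.
# An integer test_size is an absolute count, so test_size=1 holds out a single row.
X_tr, X_te, y_tr, y_te = train_test_split(X_a, y, test_size=1, random_state=0)
print(X_tr.shape, X_te.shape)
```

Note that an integer test_size=1, as used in the question, would hold out only one sample even if the call succeeded.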

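  1. Because your test set lives in a separate file, you do not need to split the data at all (unless you want a validation set, or the test set is unlabeled, as in a competition).

  2. You should not fit a new vectorizer on the test data; doing so means the columns of the training and test matrices have no correspondence with each other. Instead, use vectorizer.transform(features_test), with the same vectorizer object whose fit_transform produced the training data.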
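To illustrate the second point, here is a small sketch on invented sentences (not the asker's data, and without the stop-word list, to keep it self-contained). It compares the column count produced by refitting a second CountVectorizer on the test texts with the column count produced by reusing the fitted one:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example texts, loosely modelled on the sample data in the question.
train_texts = ["apple blocked my account", "tap to explore the change"]
test_texts = ["apple made a new emoji", "explore the world"]

# Fit the vocabulary on the training texts only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Refitting a fresh vectorizer learns a different vocabulary (different columns).
refit = CountVectorizer().fit_transform(test_texts)

# Reusing the fitted vectorizer maps test texts onto the training columns;
# words not seen during fitting are simply ignored.
X_test = vectorizer.transform(test_texts)

print(X_train.shape[1], refit.shape[1], X_test.shape[1])
```

This is the same effect that produced 28148 columns for the training matrix and 11526 for the test matrix in the question: a classifier trained on the first layout cannot meaningfully read the second.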

So, try:

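import pandas as pd
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix

# Training data: column 0 holds the labels, column 1 the text.
tweets_train = pd.read_csv('Final.csv')
features_train = tweets_train.iloc[:, 1].values
labels_train = tweets_train.iloc[:, 0].values
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_train = vectorizer.fit_transform(features_train).toarray()

# Test data: reuse the fitted vectorizer so the columns match the training matrix.
tweets_test = pd.read_csv('DataF1.csv')
features_test = tweets_test.iloc[:, 1].values.astype('U')
labels_test = tweets_test.iloc[:, 0].values
processed_features_test = vectorizer.transform(features_test).toarray()

# Train on the whole training file, evaluate on the whole test file.
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(processed_features_train, labels_train)
predictions = text_classifier.predict(processed_features_test)
print(confusion_matrix(labels_test, predictions))
print(classification_report(labels_test, predictions))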
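
If you do want a validation set, as mentioned in point 1, one possible approach is to split only the training data with train_test_split(X, y) and leave the separate test file untouched. The sketch below uses a dozen invented sentences and a smaller forest purely for illustration; the variable names and the 25% split are assumptions, not part of the original answer:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Invented toy data: four texts for each of the three labels.
texts = [
    "love the new phone", "great update today", "really happy with it", "best emoji ever",
    "apple blocked me", "hate this change", "terrible battery life", "worst support call",
    "tap to explore", "asking for the third time", "made with new parts", "phone arrives monday",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

# Split texts and labels together; both have 12 entries, so the lengths agree.
X_tr_text, X_val_text, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vectorizer on the training part only, then reuse it for validation.
vectorizer = CountVectorizer()
X_tr = vectorizer.fit_transform(X_tr_text)
X_val = vectorizer.transform(X_val_text)

text_classifier = RandomForestClassifier(n_estimators=20, random_state=0)
text_classifier.fit(X_tr, y_tr)
val_predictions = text_classifier.predict(X_val)
print(accuracy_score(y_val, val_predictions))
```

The sparse matrices are passed to the classifier directly here, without .toarray(), which may be worth considering for a 29675 x 28148 matrix where a dense array would be very large.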
