Is get_dummies causing a column mismatch between the training and test sets?

Posted 2024-10-04 05:24:20

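I'm working with the Titanic dataset from Kaggle and have run into a problem while handling categorical variables and NA values. The following code produces the error shown below it: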

import pandas as pd
from sklearn.preprocessing import Imputer

nonpredictors = ["Ticket", "Cabin", "Survived"]
nonpredictors_test = ["Ticket", "Cabin"]

training_df = pd.read_csv("train.csv").set_index(["PassengerId"])
training_df_concat = training_df.drop(nonpredictors, axis = 1)
testing_df = pd.read_csv("test.csv").set_index(["PassengerId"])
testing_df_concat = testing_df.drop(nonpredictors_test, axis = 1)
chunks = [training_df_concat, testing_df_concat]
all_data = pd.concat(chunks, ignore_index = True)
catVar = ["cabin_level", "Embarked", "Sex"]
all_data = pd.get_dummies(all_data, columns = catVar)
imp = Imputer(strategy = 'median')
imp.fit(all_data)

training_df = pd.get_dummies(training_df, columns = catVar)
train_x = training_df.drop(nonpredictors, axis = 1)
train_x = imp.transform(train_x)
train_y = training_df["Survived"]

testing_df = pd.get_dummies(testing_df, columns = catVar)
test_x = testing_df.drop(nonpredictors_test, axis = 1)
test_x = imp.transform(test_x)

ValueError: X has 20 features per sample, expected 21

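I concatenated the two datasets and fit the imputer on the combined set, to avoid the situation where I fit it on a training set that contains values not observed in the test set.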
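To illustrate the idea of working on the combined data, here is a minimal sketch that encodes the categorical column once on the concatenated frame and then splits the result back apart, so both halves share the same dummy columns. The toy frames and their values are hypothetical stand-ins, not the actual Titanic data, and this is only one way the approach could be arranged.

```python
import pandas as pd

# Hypothetical toy frames standing in for the train and test data.
train = pd.DataFrame({"Age": [22.0, None, 35.0], "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Age": [30.0, None], "Embarked": ["S", "S"]})

# Encode once on the combined frame so every category gets a column.
# The keys let us split the rows back apart afterwards.
combined = pd.concat([train, test], keys=["train", "test"])
encoded = pd.get_dummies(combined, columns=["Embarked"])

# Both halves inherit the same column set from the combined encoding.
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]

print(list(train_enc.columns) == list(test_enc.columns))  # True
```

Calling get_dummies separately on each frame, as the code above does for training_df and testing_df, only creates columns for the categories that actually appear in that frame.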

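It looks like I'm getting a column mismatch between the training and test sets, but I can't tell where it comes from. Any ideas? Thanks in advance.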
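One way to locate a mismatch like this is to compare the column sets of the two encoded frames directly. The sketch below uses hypothetical toy data in which one Embarked category is absent from the test frame. Whether that is the actual cause here is an assumption that would need checking against the real data.

```python
import pandas as pd

# Hypothetical toy frames: category "C" never appears in the test frame.
train = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["S", "C"]})
test = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["S", "S"]})

# Encoding each frame separately yields different column sets.
train_enc = pd.get_dummies(train, columns=["Sex", "Embarked"])
test_enc = pd.get_dummies(test, columns=["Sex", "Embarked"])

# Set differences show exactly which columns are missing on each side.
missing_in_test = sorted(set(train_enc.columns) - set(test_enc.columns))
missing_in_train = sorted(set(test_enc.columns) - set(train_enc.columns))
print(missing_in_test)   # ['Embarked_C']
print(missing_in_train)  # []

# Align the test frame to the training columns, filling absent dummies with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns) == list(train_enc.columns))  # True
```

The same comparison could be run against the columns of all_data to see which of the 21 features the imputer was fitted on is absent from test_x.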


Tags: csv, test, df, data, training, train, all, testing