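I am working with the Titanic dataset from Kaggle and ran into a problem while handling categorical variables and NA values. The following code produces this error: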
import pandas as pd
from sklearn.preprocessing import Imputer
# Columns excluded from the feature matrix (the test file has no "Survived").
nonpredictors = ["Ticket", "Cabin", "Survived"]
nonpredictors_test = ["Ticket", "Cabin"]
# Load both files, index them by PassengerId, and drop the non-predictor columns.
training_df = pd.read_csv("train.csv").set_index(["PassengerId"])
training_df_concat = training_df.drop(nonpredictors, axis = 1)
testing_df = pd.read_csv("test.csv").set_index(["PassengerId"])
testing_df_concat = testing_df.drop(nonpredictors_test, axis = 1)
# Stack train and test so the encoder and the imputer see every row.
chunks = [training_df_concat, testing_df_concat]
all_data = pd.concat(chunks, ignore_index = True)
# One-hot encode the categorical columns ("cabin_level" is not created in the code shown).
catVar = ["cabin_level", "Embarked", "Sex"]
all_data = pd.get_dummies(all_data, columns = catVar)
# Fit a median imputer on the combined, encoded frame.
imp = Imputer(strategy = 'median')
imp.fit(all_data)
# Encode the training frame on its own, then impute.
training_df = pd.get_dummies(training_df, columns = catVar)
train_x = training_df.drop(nonpredictors, axis = 1)
train_x = imp.transform(train_x)
train_y = training_df["Survived"]
# Encode the test frame on its own, then impute.
testing_df = pd.get_dummies(testing_df, columns = catVar)
test_x = testing_df.drop(nonpredictors_test, axis = 1)
test_x = imp.transform(test_x)
ValueError: X has 20 features per sample, expected 21
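I concatenate the datasets and fit the imputer on the resulting combined set, to avoid the situation where I fit it on a training set that contains values not observed in the test set.
It looks like I am getting a column mismatch between the columns the imputer was fitted on and the test set, but I can't see where. Any ideas? Thanks in advance.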
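One possible source of the mismatch, not confirmed by anything in the post, is that `pd.get_dummies` is called separately on the combined frame and on each individual frame, so a category that appears in only one file produces a dummy column in the combined data but not in the other file. The sketch below reproduces that effect on made-up toy data (the values and the rare level "T" are assumptions for illustration) and shows one way to align the columns with `DataFrame.reindex`:

```python
import pandas as pd

# Toy stand-ins for the real frames; the values are invented for illustration.
train = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "cabin_level": ["A", "B", "T"]})
test = pd.DataFrame({"Age": [30.0, None], "cabin_level": ["A", "B"]})

# Encoding the combined frame sees every category present in either set.
combined = pd.get_dummies(
    pd.concat([train, test], ignore_index=True), columns=["cabin_level"]
)

# Encoding the test frame alone never sees the "T" level.
test_encoded = pd.get_dummies(test, columns=["cabin_level"])
missing = [c for c in combined.columns if c not in test_encoded.columns]
print(missing)

# Reindex to the combined layout; dummy columns that were absent are filled with 0.
test_aligned = test_encoded.reindex(columns=combined.columns, fill_value=0)
```

Comparing `all_data.columns` against the columns of `test_x` in the same way should show which column, if any, is missing in the real data. An alternative under the same assumption is to encode once on the combined frame and then split it back into training and test rows.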