How to do imputation correctly in Python/Sklearn

Posted 2024-06-03 02:15:12


I have the following data. Note that Age contains a NaN. My goal is to impute all columns correctly.

+----+-------------+----------+--------+------+-------+-------+---------+
| ID | PassengerId | Survived | Pclass | Age  | SibSp | Parch |  Fare   |
+----+-------------+----------+--------+------+-------+-------+---------+
|  0 |           1 |        0 |      3 | 22.0 |     1 |     0 | 7.2500  |
|  1 |           2 |        1 |      1 | 38.0 |     1 |     0 | 71.2833 |
|  2 |           3 |        1 |      3 | 26.0 |     0 |     0 | 7.9250  |
|  3 |           4 |        1 |      1 | 35.0 |     1 |     0 | 53.1000 |
|  4 |           5 |        0 |      3 | 35.0 |     0 |     0 | 8.0500  |
|  5 |           6 |        0 |      3 | NaN  |     0 |     0 | 8.4583  |
+----+-------------+----------+--------+------+-------+-------+---------+

I have working code that imputes all columns. The result is shown below, and it looks wrong.

+----+-------------+----------+--------+-----------+-------+-------+---------+
| ID | PassengerId | Survived | Pclass |    Age    | SibSp | Parch |  Fare   |
+----+-------------+----------+--------+-----------+-------+-------+---------+
|  0 | 1.0         | 0.0      | 3.0    | 22.000000 | 1.0   | 0.0   | 7.2500  |
|  1 | 2.0         | 1.0      | 1.0    | 38.000000 | 1.0   | 0.0   | 71.2833 |
|  2 | 3.0         | 1.0      | 3.0    | 26.000000 | 0.0   | 0.0   | 7.9250  |
|  3 | 4.0         | 1.0      | 1.0    | 35.000000 | 1.0   | 0.0   | 53.1000 |
|  4 | 5.0         | 0.0      | 3.0    | 35.000000 | 0.0   | 0.0   | 8.0500  |
|  5 | 6.0         | 0.0      | 3.0    | 2.909717  | 0.0   | 0.0   | 8.4583  |
+----+-------------+----------+--------+-----------+-------+-------+---------+

My code is as follows:

import pandas as pd
import numpy as np

#https://www.kaggle.com/shivamp629/traincsv/downloads/traincsv.zip/1
data = pd.read_csv("train.csv")

data2 = data[['PassengerId', 'Survived','Pclass','Age','SibSp','Parch','Fare']].copy()

from sklearn.preprocessing import Imputer

fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=1)
data2_im = pd.DataFrame(fill_NaN.fit_transform(data2), columns = data2.columns)

data2_im

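Strangely, the imputed Age is 2.909717. Is there a proper way to do simple mean imputation? I could do it one column at a time, but I'm not clear on the syntax or approach. Thanks for your help.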


3 Answers

The root of the problem is this line:

fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=1)

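With axis=1 the mean is taken across each row, so unrelated quantities such as PassengerId, Pclass and Fare get averaged together (apples and oranges).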
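To see where 2.909717 comes from, here is a small check using the values of the row with ID 5 from the question's table. `np.nanmean` is used here only as a stand-in for a row-wise mean that skips the NaN; it is an illustration, not the imputer's internal code.

```python
import numpy as np

# Row with ID 5 from the question, in column order:
# PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
row = np.array([6, 0, 3, np.nan, 0, 0, 8.4583])

# Mean of the six non-missing values in that row (what a row-wise mean fills in)
row_mean = np.nanmean(row)

print(round(row_mean, 6))  # 2.909717
```

The six known values sum to 17.4583, and 17.4583 / 6 matches the odd Age value in the question's output.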

Try changing it to:

fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=0) # axis=0

and you will get the expected behavior.

strategy='median' may be a better choice, because the median is robust to outliers:

fill_NaN = Imputer(missing_values=np.nan, strategy='median', axis=0)
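Note that `sklearn.preprocessing.Imputer` was removed in scikit-learn 0.22. On newer versions, `sklearn.impute.SimpleImputer` is the replacement; it has no `axis` argument and always imputes column by column. The sketch below assumes a recent scikit-learn and rebuilds the six rows from the question's table by hand instead of reading train.csv, so the numbers apply only to those six rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The six rows shown in the question, typed in by hand (not the full train.csv)
data2 = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Parch": [0, 0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
})

# Column-wise mean imputation; use strategy="median" for the median instead
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
data2_im = pd.DataFrame(imputer.fit_transform(data2), columns=data2.columns)

# The missing Age is now the mean of the five known ages: (22+38+26+35+35)/5
print(data2_im.loc[5, "Age"])
```

As in the question's output, `fit_transform` returns a float array, so the integer columns come back as floats.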

Try

fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=0)

or

data2.fillna(data2.mean())
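The pandas route can be sketched as follows, including the one-column-at-a-time variant the question asks about. The small frame is made up from the Age and Fare values in the question's table; note that `fillna` returns a new DataFrame, so the result has to be assigned.

```python
import numpy as np
import pandas as pd

# Age and Fare values from the six rows shown in the question
data2 = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
})

# All columns at once: data2.mean() is a Series of per-column means,
# and fillna matches each mean to its column by name
filled_all = data2.fillna(data2.mean())

# One column at a time: fill only Age with its own mean
age_only = data2.copy()
age_only["Age"] = age_only["Age"].fillna(age_only["Age"].mean())

print(filled_all.loc[5, "Age"])
```

For these six rows the filled Age is the mean of the five known ages.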

The problem is that you used the wrong axis. The correct code is:

fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=0)

Note the axis=0.
