如何根据各种条件方法插值Pandas中的缺失值

2024-09-21 03:17:41 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个如下所示的数据帧:

import pandas as pd, numpy as np

df = pd.DataFrame({'Var1':[1.1, 2.2, 3.5, np.NaN, np.NaN, np.NaN, 6.2, 7.9, 8.3, 9.6, 10.7, np.NaN, np.NaN, 11.1, 13.2, 14.7, np.NaN, 15.6, 18.3],
'Var2':[3.1, 6.2, 2.1, 3.6, np.NaN, 6.2, 13.2, 14.7, 7.9, 8.1, np.NaN, 9.2, 10.3, np.NaN, np.NaN,13.5, 14.2]})

对于NAN,我想插入丢失的数据。这种插值的策略是:

  1. 当只有一个NA时,立即使用前后值,取其平均值并用该平均值替换NA
  2. 当有两个连续的NAs时,用df['varX']替换这两个NAs。插值(方法='polymonal',顺序=2)
  3. 当有两个以上的连续NAs时,用df['varX']替换NAs。插值(方法='polymone',顺序=3/2)

我想:怎么能做到呢

我觉得这篇文章有点帮助,但它不是能帮助我的: Identifying consecutive NaNs with Pandas


Tags: 数据方法importpandasdf顺序asnp
1条回答
网友
1楼 · 发布于 2024-09-21 03:17:41

使用:

df = pd.DataFrame({
'Var1':[1.1, 2.2, 3.5, np.NaN, np.NaN, np.NaN, 6.2, 7.9, 8.3, 9.6, 10.7, np.NaN, np.NaN, 11.1, 13.2, 14.7, np.NaN],
'Var2':[3.1, 6.2, 2.1, 3.6, np.NaN, 6.2, 13.2, 14.7, 7.9, 8.1, np.NaN, 9.2, 10.3, np.NaN, np.NaN,13.5, 14.2]})

#count consecutive NaNs
f = lambda x: x.map(x.value_counts())
df1 = df.notna().cumsum().where(df.isna()).apply(f)
print (df1)
    Var1  Var2
0    NaN   NaN
1    NaN   NaN
2    NaN   NaN
3    3.0   NaN
4    3.0   1.0
5    3.0   NaN
6    NaN   NaN
7    NaN   NaN
8    NaN   NaN
9    NaN   NaN
10   NaN   1.0
11   2.0   NaN
12   2.0   NaN
13   NaN   2.0
14   NaN   2.0
15   NaN   NaN
16   1.0   NaN

#compare
m1 = df1.eq(1)
m2 = df1.eq(2)
m3 = df1.gt(2)

#interplations
df11 = df.ffill().add(df.bfill()).div(2)
#if need last/first NaNs replace by previous/last divide 2
#df11 = df.ffill().add(df.bfill(), fill_value=0).div(2)
df22 = df.interpolate(method='polynomial', order=2)

#order need int
df33 = df.interpolate(method='polynomial', order=3)

#set values by np.select
arr = np.select([m1, m2, m3], [df11, df22, df33], default=df)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)

print (df)
         Var1       Var2
0    1.100000   3.100000
1    2.200000   6.200000
2    3.500000   2.100000
3    4.175679   3.600000
4    4.492391   4.900000
5    4.987908   6.200000
6    6.200000  13.200000
7    7.900000  14.700000
8    8.300000   7.900000
9    9.600000   8.100000
10  10.700000   8.650000
11  10.857431   9.200000
12  10.323495  10.300000
13  11.100000  11.467171
14  13.200000  12.581168
15  14.700000  13.500000
16        NaN  14.200000

#if need last/first replace by previus 3 values
s1 = df.tail(3).mean().rename(df.index[-1])
s2 = df.head(3).mean().rename(df.index[0])

df = df.fillna(s1).fillna(s2)
print (df)
         Var1       Var2
0    1.100000   3.100000
1    2.200000   6.200000
2    3.500000   2.100000
3    4.175679   3.600000
4    4.492391   4.900000
5    4.987908   6.200000
6    6.200000  13.200000
7    7.900000  14.700000
8    8.300000   7.900000
9    9.600000   8.100000
10  10.700000   8.650000
11  10.857431   9.200000
12  10.323495  10.300000
13  11.100000  11.467171
14  13.200000  12.581168
15  14.700000  13.500000
16  13.950000  14.200000

相关问题 更多 >

    热门问题