Replace NaN values with other values from the same column

Posted 2024-09-29 01:30:41


My DataFrame looks like this:

id    zip     location
X2    65123   Houston
T5    65123   Houston
A1    nan     Houston
M8    89517   Berkley
X3    89518   Berkley
N2    nan     Berkley
M9    nan     nan

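For some rows I have no zip code in 'zip', but there is an entry in 'location'.
I want to fill the NaN values in 'zip' with a zip code from the same location. Sometimes there is more than one candidate, e.g. N2 could take either 89517 or 89518; which one is chosen does not matter. But I don't want to change rows where both the zip code and the location are NaN. How can I do this?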


2 Answers

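If you don't care which value gets filled in, a simple approach is to sort the table by location and zip, then call fillna with method='ffill'. Sorting places NaN last within each location, so each missing zip is filled from a row of the same location, provided that location has at least one known zip. (Recent pandas versions deprecate the method argument in favor of calling ffill() directly.)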

>>> df
       zip location
0  65123.0  Houston
1  65123.0  Houston
2      NaN  Houston
3  89517.0  Berkley
4  89518.0  Berkley
5      NaN  Berkley

>>> df.sort_values(by=['location','zip']).fillna(method='ffill')
       zip location
3  89517.0  Berkley
4  89518.0  Berkley
5  89518.0  Berkley
0  65123.0  Houston
1  65123.0  Houston
2  65123.0  Houston
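
A plain forward fill works on the whole frame, so it could carry a zip across a location boundary if some location had no known zip at all. As a hedged variation (not part of the original answer), the sketch below restricts the fill to each location group with groupby(...).ffill() and then restores the original row order. The names ordered and filled are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "zip": [65123, 65123, np.nan, 89517, 89518, np.nan],
    "location": ["Houston", "Houston", "Houston",
                 "Berkley", "Berkley", "Berkley"],
})

# Sort so that NaN zips come last within each location
# (sort_values puts NaN last by default).
ordered = df.sort_values(by=["location", "zip"])

# Forward-fill inside each location group only, so a value can
# never leak from one location into another.
ordered["zip"] = ordered.groupby("location")["zip"].ffill()

# Restore the original row order.
filled = ordered.sort_index()
print(filled)
```

Because the fill is done per group, a location whose zips are all NaN simply stays NaN.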

Update: the solution below also handles NaN in location. First group by location, then fill the NaN values within each group with that group's max.

>>> df
       zip location
0  65123.0  Houston
1  65123.0  Houston
2      NaN  Houston
3  89517.0  Berkley
4  89518.0  Berkley
5      NaN  Berkley
6      NaN      NaN

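>>> # transform returns a result aligned to df's index; in pandas 2.x, apply would prepend the group key to the index
>>> df['zip'] = df.groupby('location')['zip'].transform(lambda x: x.fillna(x.max()))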
>>> df
       zip location
0  65123.0  Houston
1  65123.0  Houston
2  65123.0  Houston
3  89517.0  Berkley
4  89518.0  Berkley
5  89518.0  Berkley
6      NaN      NaN
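
Another way to express "take any zip from the same location" is to build a location-to-zip lookup and map it onto the rows that are missing a zip. This is a minimal sketch under the same sample data, not the answerer's method. The name lookup is illustrative, and it picks the first known zip per location rather than the max.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "zip": [65123, 65123, np.nan, 89517, 89518, np.nan, np.nan],
    "location": ["Houston", "Houston", "Houston",
                 "Berkley", "Berkley", "Berkley", np.nan],
})

# One known zip per location. first() skips NaN within each group,
# and groupby drops the NaN location key by default.
lookup = df.groupby("location")["zip"].first()

# Map each row's location to its lookup zip and use that value
# only where zip is missing. A NaN location maps to NaN, so the
# last row is left untouched.
df["zip"] = df["zip"].fillna(df["location"].map(lookup))
print(df)
```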

Since you don't care which value is used, we can use the max value:

>>> df['zip'] = df.groupby('location')['zip'].transform(lambda x: x.fillna(x.max())).astype(int)
>>> df

   id    zip location
0  X2  65123  Houston
1  T5  65123  Houston
2  A1  65123  Houston
3  M8  89517  Berkley
4  X3  89518  Berkley
5  N2  89518  Berkley
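
The same idea can be written without a lambda by computing the per-location max once with transform("max") and passing it to fillna. The sketch below is an equivalent formulation under the question's sample data, offered as an assumption-labeled alternative. The name group_max is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["X2", "T5", "A1", "M8", "X3", "N2"],
    "zip": [65123, 65123, np.nan, 89517, 89518, np.nan],
    "location": ["Houston", "Houston", "Houston",
                 "Berkley", "Berkley", "Berkley"],
})

# Per-row max zip of that row's location, aligned to df's index.
group_max = df.groupby("location")["zip"].transform("max")

# Fill only the missing zips; the cast to int is safe here because
# every location has at least one known zip, so no NaN remains.
df["zip"] = df["zip"].fillna(group_max).astype(int)
print(df)
```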

If you need to handle rows where both zip and location are NaN, first filter them out into a sub-frame:

>>> sub_df = df.loc[df[['zip', 'location']].notna().any(axis=1)]
>>> df
   id      zip location
0  X2  65123.0  Houston
1  T5  65123.0  Houston
2  A1      NaN  Houston
3  M7      NaN      NaN    # <  added a line in between to show index is maintained
4  M8  89517.0  Berkley
5  X3  89518.0  Berkley
6  N2      NaN  Berkley
7  M9      NaN      NaN

>>> sub_df
   id      zip location
0  X2  65123.0  Houston
1  T5  65123.0  Houston
2  A1      NaN  Houston    # <  No index 3
4  M8  89517.0  Berkley
5  X3  89518.0  Berkley
6  N2      NaN  Berkley

Then perform the same operation (only this time you don't need to cast to int, because the frame will contain NaN anyway):

df['zip'] = sub_df.groupby('location')['zip'].transform(lambda x: x.fillna(x.max()))

Result:

   id      zip location
0  X2  65123.0  Houston
1  T5  65123.0  Houston
2  A1  65123.0  Houston
3  M7      NaN      NaN
4  M8  89517.0  Berkley
5  X3  89518.0  Berkley
6  N2  89518.0  Berkley
7  M9      NaN      NaN
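
If integer-looking zip codes are still wanted while some rows remain missing, one option (an addition here, not something the answer proposes) is pandas' nullable integer dtype "Int64", which stores missing entries as <NA> instead of forcing the column to float. The sketch starts from the result shown above.

```python
import numpy as np
import pandas as pd

# The filled result from above, with two rows still missing.
df = pd.DataFrame({
    "id": ["X2", "T5", "A1", "M7", "M8", "X3", "N2", "M9"],
    "zip": [65123, 65123, 65123, np.nan, 89517, 89518, 89518, np.nan],
    "location": ["Houston", "Houston", "Houston", np.nan,
                 "Berkley", "Berkley", "Berkley", np.nan],
})

# "Int64" (capital I) is the nullable integer dtype; plain int
# would raise here because NaN cannot be converted to an integer.
df["zip"] = df["zip"].astype("Int64")
print(df)
```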
