使用groupby将重复值替换为NaN

2024-06-25 05:48:59 发布

您现在位置:Python中文网/ 问答频道 /正文

数据集(MWE)

location    date        people_vaccinated   people_fully_vaccinated people_vaccinated_per_hundred
AL          12-01-2021  70861               7270                    1.45
AL          13-01-2021  74792               9245                    1.53
AL          14-01-2021  80480               11366                   1.64
AL          15-01-2021  86956               13488                   1.77
AL          16-01-2021  93797               14202                   1.91
AL          17-01-2021  100638              14917                   2.05
AS          22-01-2021  5627                940                     10.1
AS          23-01-2021  5881                948                     10.56
AS          24-01-2021  7096                948                     12.74
AS          25-01-2021  7096                949                     12.98
AS          26-01-2021  7230                950                     13.23
AS          27-01-2021  8133                950                     14.6

我试图在location上使用groupby()时用NaN替换{{{}列中的重复项。我在网上尝试了一些解决方案,但无法让它们为我工作,所以使用了下面的逻辑

def remove(df , a):
    df['duplicate'] = df[a].shift(1)
    df[a] = df.apply(lambda x: np.nan if x[a] == x['duplicate'] \
                            else x[a], axis=1)
    df = df.drop('duplicate', axis=1)
    return df

dfn = remove(dfn,'people_vaccinated')
dfn = remove(dfn,'people_fully_vaccinated')
dfn = remove(dfn,'people_vaccinated_per_hundred')

当您有连续的空值(超过2)时,上述逻辑将失败。我需要用NAN替换重复项(同时保留第一个实例)。最好的方法是什么?您可以从上面的代码片段中观察到people_fully_vaccinated列具有重复的值

样本输出

location    date        people_vaccinated   people_fully_vaccinated people_vaccinated_per_hundred
AL          12-01-2021  70861               7270                    1.45
AL          13-01-2021  74792               9245                    1.53
AL          14-01-2021  80480               11366                   1.64
AL          15-01-2021  86956               13488                   1.77
AL          16-01-2021  93797               14202                   1.91
AL          17-01-2021  100638              14917                   2.05
AS          22-01-2021  5627                940                     10.1
AS          23-01-2021  5881                948                     10.56
AS          24-01-2021  7096                NaN                     12.74
AS          25-01-2021  NaN                 949                     12.98
AS          26-01-2021  7230                950                     13.23
AS          27-01-2021  8133                NaN                     14.6

Tags: dfdateaslocation逻辑nanpeopleremove
2条回答

这里尝试使用^{}创建布尔掩码

import numpy as np

remove = lambda x: df[x].duplicated(keep='first')  # return boolean mask

df.loc[remove('people_vaccinated'), 'people_vaccinated'] = np.NaN

^{}+^{}

我们可以定义一个列名列表,然后为for循环中的每一列mask定义每个唯一的重复值location

cols = ['people_vaccinated', 'people_fully_vaccinated', 'people_vaccinated_per_hundred']
for c in cols:
    df[c] = df[c].mask(df.duplicated(['location', c]))

   location        date  people_vaccinated  people_fully_vaccinated  people_vaccinated_per_hundred
0        AL  12-01-2021            70861.0                   7270.0                           1.45
1        AL  13-01-2021            74792.0                   9245.0                           1.53
2        AL  14-01-2021            80480.0                  11366.0                           1.64
3        AL  15-01-2021            86956.0                  13488.0                           1.77
4        AL  16-01-2021            93797.0                  14202.0                           1.91
5        AL  17-01-2021           100638.0                  14917.0                           2.05
6        AS  22-01-2021             5627.0                    940.0                          10.10
7        AS  23-01-2021             5881.0                    948.0                          10.56
8        AS  24-01-2021             7096.0                      NaN                          12.74
9        AS  25-01-2021                NaN                    949.0                          12.98
10       AS  26-01-2021             7230.0                    950.0                          13.23
11       AS  27-01-2021             8133.0                      NaN                          14.60

相关问题 更多 >