Update a column's values within a group based on one row in the group

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'test_group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
   ...:                    'test_type': [np.nan, 'memory', np.nan, np.nan, 'visual', np.nan,
   ...:                                  np.nan, 'auditory', np.nan]})

In [4]: df
Out[4]:
   test_group test_type
0           1       NaN
1           1    memory
2           1       NaN
3           2       NaN
4           2    visual
5           2       NaN
6           3       NaN
7           3  auditory
8           3       NaN

In [15]: grp = df.groupby('test_group')

In [16]: df['test_type'] = grp['test_type'].unique().transform(lambda x: x[1])

In [17]: df
Out[17]:
   test_group test_type
0           1       NaN
1           1    memory
2           1    visual
3           2  auditory
4           2       NaN
5           2       NaN
6           3       NaN
7           3       NaN
8           3       NaN

2 Answers

User

Answer 1 · edited 2024-10-01 04:59:25

Assuming each group has exactly one unique non-NaN value, the following should do what you want:

>>> df['test_type'] = df.groupby('test_group')['test_type'].ffill().bfill() 
>>> df
   test_group test_type
0           1    memory
1           1    memory
2           1    memory
3           2    visual
4           2    visual
5           2    visual
6           3  auditory
7           3  auditory
8           3  auditory

Edit:

The original answer used

df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')

but according to schwim's timings, ffill/bfill looks much faster (for some reason).
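One thing worth noting about the chained call: only the ffill step runs per group, while the trailing bfill runs over the whole column, so the result relies on each group's rows being contiguous. As a minimal sketch of an alternative (not part of the original answer, and still assuming one non-NaN value per group), the first non-null value of each group can be broadcast back with transform:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'test_group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'test_type': [np.nan, 'memory', np.nan,
                  np.nan, 'visual', np.nan,
                  np.nan, 'auditory', np.nan],
})

# 'first' picks the first non-null value in each group; transform
# broadcasts that value to every row of the group, aligned to df's index,
# so values cannot leak across group boundaries whatever the row order.
df['test_type'] = df.groupby('test_group')['test_type'].transform('first')

print(df)
```

Because the result is aligned to the original index, it can be assigned straight back to the column without resetting anything.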

User

Answer 2 · edited 2024-10-01 04:59:25

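You can use GroupBy.size to get the size of each group. Then filter with boolean indexing using Series.isna. Finally, use Index.repeat and DataFrame.reindex: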

repeats = df.groupby('test_group').size()
out = df[~df['test_type'].isna()]
out.reindex(out.index.repeat(repeats)).reset_index(drop=True)

   test_group test_type
0           1    memory
1           1    memory
2           1    memory
3           2    visual
4           2    visual
5           2    visual
6           3  auditory
7           3  auditory
8           3  auditory
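To make the three steps easier to follow, here is the same idea with each intermediate value named. This is only a sketch: the variable idx is introduced for illustration, and repeats is converted with to_numpy() so that a plain integer array is passed to repeat. It assumes one non-NaN row per group and that those rows appear in the same order as the sorted group labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'test_group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'test_type': [np.nan, 'memory', np.nan,
                  np.nan, 'visual', np.nan,
                  np.nan, 'auditory', np.nan],
})

# Step 1: number of rows per group, indexed by the sorted group labels.
repeats = df.groupby('test_group').size()

# Step 2: keep only the rows that carry a value (index labels 1, 4, 7).
out = df[df['test_type'].notna()]

# Step 3: repeat each surviving index label once per row of its group...
idx = out.index.repeat(repeats.to_numpy())

# ...then pull the row for every repeated label and renumber from 0.
result = out.reindex(idx).reset_index(drop=True)

print(list(idx))
print(result)
```

Unlike a grouped fill, this builds a new frame with a fresh 0..n-1 index, so it only matches the original row layout when the groups are already sorted and contiguous.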

timeit analysis:

Benchmark DataFrame

df = pd.DataFrame({'test_group': [1]*10_001 + [2]*10_001 + [3]*10_001,
                   'test_type': [np.nan]*10_000 + ['memory'] +
                                [np.nan]*10_000 + ['visual'] +
                                [np.nan]*10_000 + ['auditory']})
df.shape
# (30003, 2)

Results:

# Ch3steR's answer
In [54]: %%timeit
    ...: repeats = df.groupby('test_group').size()
    ...: out = df[~df['test_type'].isna()]
    ...: out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
    ...:
2.56 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# timgeb's answer
In [55]: %%timeit
    ...: df['test_type'] = df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
    ...:
10.1 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That is almost 4x faster. I believe this is because boolean indexing is very fast, and reindex + repeat is lighter than a double fillna.
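For anyone who wants to rerun the comparison outside IPython, the following is a rough sketch using the standard timeit module. The function names are hypothetical, the ffill/bfill form stands in for the fillna(method=...) call, and no timings are claimed here. Note that the second timed cell above assigns back to df, so after its first iteration the column no longer contains NaN; the functions below return a result without modifying the frame.

```python
import timeit

import numpy as np
import pandas as pd

n = 10_000
df = pd.DataFrame({
    'test_group': [1] * (n + 1) + [2] * (n + 1) + [3] * (n + 1),
    'test_type': [np.nan] * n + ['memory'] +
                 [np.nan] * n + ['visual'] +
                 [np.nan] * n + ['auditory'],
})


def fill_ffill_bfill(frame):
    # Grouped forward fill, then a backward fill over the whole column.
    return frame.groupby('test_group')['test_type'].ffill().bfill()


def fill_reindex_repeat(frame):
    # Keep the non-NaN rows and repeat each one by its group's size.
    repeats = frame.groupby('test_group').size()
    out = frame[frame['test_type'].notna()]
    idx = out.index.repeat(repeats.to_numpy())
    return out.reindex(idx).reset_index(drop=True)


# Sanity check: both approaches should produce the same column.
same = (fill_ffill_bfill(df).tolist()
        == fill_reindex_repeat(df)['test_type'].tolist())
print('results match:', same)

for fn in (fill_ffill_bfill, fill_reindex_repeat):
    seconds = timeit.timeit(lambda: fn(df), number=20)
    print(f'{fn.__name__}: {seconds / 20 * 1e3:.2f} ms per call')
```

Absolute numbers will depend on the pandas version and the machine, so treat any ratio as indicative only.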
