Update column values in a group based on one row in the group


1 answer

Anonymous, 1 day ago


You can use <a href="https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.size.html" rel="nofollow noreferrer"><code>GroupBy.size</code></a> to get the size of each group. Then apply a <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing" rel="nofollow noreferrer">boolean index</a> built with <a href="https://pandas.pydata.org/docs/reference/api/pandas.Series.isna.html" rel="nofollow noreferrer"><code>Series.isna</code></a> to keep only the rows whose <code>test_type</code> is not null. Finally, use <a href="https://pandas.pydata.org/docs/reference/api/pandas.Index.repeat.html" rel="nofollow noreferrer"><code>Index.repeat</code></a> and <a href="https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.reindex.html" rel="nofollow noreferrer"><code>DataFrame.reindex</code></a> to repeat each of those rows once per row in its group. Note that this relies on each group having exactly one non-null row, and on those rows appearing in sorted <code>test_group</code> order, because the repeat counts are matched to the rows by position.
<pre><code>repeats = df.groupby('test_group').size()
out = df[~df['test_type'].isna()]
out.reindex(out.index.repeat(repeats)).reset_index(drop=True)

   test_group test_type
0           1    memory
1           1    memory
2           1    memory
3           2    visual
4           2    visual
5           2    visual
6           3  auditory
7           3  auditory
8           3  auditory
</code></pre>
<hr/>
<h3>timeit analysis:</h3>
Benchmark DataFrame:
<pre><code>import numpy as np
import pandas as pd

df = pd.DataFrame({'test_group': [1]*10_001 + [2]*10_001 + [3]*10_001,
                   'test_type' : [np.nan]*10_000 + ['memory'] +
                                 [np.nan]*10_000 + ['visual'] +
                                 [np.nan]*10_000 + ['auditory']})
df.shape
# (30003, 2)
</code></pre>
<hr/>
Results:
<pre><code># Ch3steR's answer
In [54]: %%timeit
    ...: repeats = df.groupby('test_group').size()
    ...: out = df[~df['test_type'].isna()]
    ...: out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
    ...:
    ...:
2.56 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# timgeb's answer
In [55]: %%timeit
    ...: df['test_type'] = df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
    ...:
    ...:
10.1 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
</code></pre>
That is almost 4x faster on this benchmark. I believe this is because boolean indexing is very fast, and <code>reindex</code> plus <code>repeat</code> is lighter than two chained <code>fillna</code> calls.
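To try the size, filter, repeat and reindex steps end to end, here is a minimal self-contained sketch. The sample frame is hypothetical (it is not the asker's data), and the names `filled` and `expected` are introduced only for illustration. It assumes every group has exactly one non-null `test_type`. The explicit `sort_values` is an addition that keeps the filled rows aligned with the sorted group sizes, and the `transform("first")` line is a swapped-in alternative used purely as a cross-check.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: three rows per group, one non-null test_type each.
df = pd.DataFrame({
    "test_group": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "test_type": [np.nan, "memory", np.nan,
                  np.nan, np.nan, "visual",
                  "auditory", np.nan, np.nan],
})

# Row count per group, ordered by sorted group key.
repeats = df.groupby("test_group").size()

# Keep the single filled row of each group; sort so the rows line up
# positionally with `repeats` (assumption: one non-null row per group).
filled = df[df["test_type"].notna()].sort_values("test_group")

# Repeat each kept index label once per row of its group, then reindex.
out = filled.reindex(filled.index.repeat(repeats.to_numpy())).reset_index(drop=True)

# Cross-check: broadcast the first non-null value of each group instead.
expected = df.groupby("test_group")["test_type"].transform("first")

print(out)
```

The reindex approach builds a new frame with rows grouped by key, whereas `transform("first")` returns a Series aligned to the original row order, which may be preferable when the original ordering or other columns must be preserved.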
