<p>您可以使用<a href="https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.size.html" rel="nofollow noreferrer"><strong>^{<cd1>}</strong></a>获取每个组的大小。然后<a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing" rel="nofollow noreferrer"><em>boolean index</em></a>使用<a href="https://pandas.pydata.org/docs/reference/api/pandas.Series.isna.html" rel="nofollow noreferrer"><strong>^{<cd2>}</strong></a>。现在,使用<a href="https://pandas.pydata.org/docs/reference/api/pandas.Index.repeat.html" rel="nofollow noreferrer"><strong>^{<cd3>}</strong></a>和<a href="https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.reindex.html" rel="nofollow noreferrer"><strong>^{<cd4>}</strong></a></p>
<pre><code>repeats = df.groupby('test_group').size()
out = df[~df['test_type'].isna()]
out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
test_group test_type
0 1 memory
1 1 memory
2 1 memory
3 2 visual
4 2 visual
5 2 visual
6 3 auditory
7 3 auditory
8 3 auditory
</code></pre>
<hr/>
<h3>timeit分析:</h3>
<p>基准数据帧</p>
<pre><code>df = pd.DataFrame({'test_group': [1]*10_001 + [2]*10_001 + [3]*10_001,
'test_type' : [np.nan]*10_000 + ['memory'] +
[np.nan]*10_000 + ['visual'] +
[np.nan]*10_000 + ['auditory']})
df.shape
# (30003, 2)
</code></pre>
<hr/>
<p>结果:</p>
<pre><code># Ch3steR's answer
In [54]: %%timeit
...: repeats = df.groupby('test_group').size()
...: out = df[~df['test_type'].isna()]
...: out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
...:
...:
2.56 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# timgeb's answer
In [55]: %%timeit
...: df['test_type'] = df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
...:
...:
10.1 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
</code></pre>
<p>几乎快了4倍。我相信这是因为布尔索引非常快。与双fillna相比,reindex+repeat更轻</p>