<p>Use <a href="http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing" rel="nofollow noreferrer"><code>boolean indexing</code></a> with <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.transform.html" rel="nofollow noreferrer"><code>transform</code></a> and <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.size.html" rel="nofollow noreferrer"><code>size</code></a>, which together return a <code>Series</code> the same size as the original <code>DataFrame</code>, to keep only the rows whose group count is greater than <code>100</code>:</p>
<pre><code>df1 = df[df.groupby('user_id')['question_id'].transform('size') > 100]
</code></pre>
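<p>To see what the mask looks like, here is a minimal sketch on a toy frame. The data and the <code>threshold</code> of 2 are made up for illustration (the answer itself uses 100); only the column names come from the question:</p>

```python
import pandas as pd

# Toy data (hypothetical): user 1 has 3 rows, user 2 has 2 rows, user 3 has 1 row.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "question_id": ["a", "b", "c", "a", "b", "a"],
})

threshold = 2  # illustrative value, stands in for the 100 used above

# transform('size') broadcasts each group's row count back to every row,
# so the result is aligned with df and can be used directly as a mask.
counts = df.groupby("user_id")["question_id"].transform("size")

# Keep only rows that belong to users with more than `threshold` rows.
df1 = df[counts > threshold]
print(df1)
```

<p>Because <code>counts</code> has one value per original row, no merge or lookup is needed before filtering.</p>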
<p><strong>Performance</strong>: this depends on the number of rows and on the group sizes, so it is best to test on your real data:</p>
<pre><code>import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list('abcde')
# 1M rows, 10,000 distinct users, skewed question distribution
df = pd.DataFrame({'question_id': np.random.choice(L, N, p=(.75,.0001,.0005,.0005,.2489)),
                   'user_id': np.random.randint(10000, size=N)})
df = df.sort_values(['user_id','question_id']).reset_index(drop=True)

In [176]: %timeit df[df.groupby('user_id')['question_id'].transform('size') > 100]
74.8 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# coldspeed's solutions
In [177]: %timeit df.groupby('user_id').filter(lambda x: len(x) > 100)
1.4 s ± 44.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [178]: %%timeit
...: m = dict(zip(*np.unique(df.user_id, return_counts=True)))
...: df[df['user_id'].map(m) > 100]
...:
89.2 ms ± 3.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
</code></pre>
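<p>Before comparing timings, it is worth confirming that the three approaches select the same rows. The following is a rough sketch on a smaller random frame; the size, the seed, and the <code>threshold</code> of 40 are arbitrary assumptions chosen so the check runs quickly:</p>

```python
import numpy as np
import pandas as pd

# Small synthetic frame (sizes and seed are arbitrary, for a quick check only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "question_id": rng.choice(list("abcde"), 2000),
    "user_id": rng.integers(0, 50, size=2000),
})
threshold = 40  # illustrative value, stands in for the 100 used above

# 1) transform('size') mask
by_transform = df[df.groupby("user_id")["question_id"].transform("size") > threshold]

# 2) groupby.filter with a Python-level lambda (called once per group)
by_filter = df.groupby("user_id").filter(lambda x: len(x) > threshold)

# 3) np.unique counts mapped back onto the user_id column
m = dict(zip(*np.unique(df["user_id"], return_counts=True)))
by_map = df[df["user_id"].map(m) > threshold]

# All three should contain the same rows, in the same order.
same = by_transform.equals(by_filter) and by_transform.equals(by_map)
print(same)
```

<p>If the results match, the choice between them comes down to speed and readability on your own data.</p>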