<p>Use <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.transform.html" rel="nofollow noreferrer"><code>transform</code></a> with <a href="http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing" rel="nofollow noreferrer">boolean indexing</a>:</p>
<pre><code>df = df[df.groupby('lid')['lid'].transform('size') <= 2]
print(df)
   entity  pnb  head#  state  lid
3  ABB002  A02      4   DOWN    B
4  ABB002  A02      4   DOWN    B
5  ABB002  A02      2   DOWN    C
6  ABB002  A02      4   DOWN    D
</code></pre>
<p>Detail:</p>
<pre><code>print (df.groupby('lid')['lid'].transform('size'))
0 3
1 3
2 3
3 2
4 2
5 1
6 1
7 3
8 3
9 3
Name: lid, dtype: int64
print (df.groupby('lid')['lid'].transform('size') <= 2)
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 False
8 False
9 False
Name: lid, dtype: bool
</code></pre>
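<p>A minimal, self-contained sketch of the same idea. The sample frame is hypothetical: only the <code>lid</code> column mirrors the group sizes shown above, and the <code>val</code> column is an assumption added for illustration.</p>

```python
import pandas as pd

# Hypothetical data: group sizes are A=3, B=2, C=1, D=1, E=3
df = pd.DataFrame({
    "lid": list("AAABBCDEEE"),
    "val": range(10),
})

# transform('size') returns a Series aligned with df's index,
# where each row holds the size of the group it belongs to
sizes = df.groupby("lid")["lid"].transform("size")

# Boolean mask: keep only rows whose group has at most 2 members
small = df[sizes <= 2]
print(small)
```

<p>Because the result of <code>transform</code> has the same index as the original frame, it can be used directly as a row mask with no merge or re-alignment step.</p>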
<p>Another, slower solution with <a href="http://pandas.pydata.org/pandas-docs/stable/groupby.html#filtration" rel="nofollow noreferrer">filter</a>:</p>
<pre><code>df = df.groupby('lid').filter(lambda x: len(x) <= 2)
print(df)
   entity  pnb  head#  state  lid
3  ABB002  A02      4   DOWN    B
4  ABB002  A02      4   DOWN    B
5  ABB002  A02      2   DOWN    C
6  ABB002  A02      4   DOWN    D
</code></pre>
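<p>To check that both approaches select the same rows, here is a small sketch on a hypothetical frame (the data and the <code>val</code> column are assumptions, not the asker's data):</p>

```python
import pandas as pd

# Hypothetical data: group sizes are A=3, B=2, C=1, D=1, E=3
df = pd.DataFrame({"lid": list("AAABBCDEEE"), "val": range(10)})

# filter calls the Python lambda once per group and keeps the groups
# for which it returns True
by_filter = df.groupby("lid").filter(lambda g: len(g) <= 2)

# transform('size') uses the built-in size aggregation instead of
# a Python-level call per group
by_transform = df[df.groupby("lid")["lid"].transform("size") <= 2]

print(by_filter.index.equals(by_transform.index))
```

<p>The per-group Python call is the likely reason <code>filter</code> falls behind once there are many groups.</p>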
<p>Timings:</p>
<pre><code>#jez1
In [34]: %timeit (df[df.groupby('lid')['lid'].transform('size') <= 2000])
10 loops, best of 3: 57.8 ms per loop
#jez2
In [35]: %timeit df.groupby('lid').filter(lambda x: len(x) <= 2000)
10 loops, best of 3: 124 ms per loop
#cᴏʟᴅsᴘᴇᴇᴅ
In [36]: %timeit (df[~df.lid.groupby(df.lid).transform('count').gt(2000)])
10 loops, best of 3: 93.6 ms per loop
#pir1
In [37]: %timeit (df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2000].index)])
10 loops, best of 3: 137 ms per loop
#pir2
In [38]: %timeit (pir(df))
10 loops, best of 3: 32.9 ms per loop
</code></pre>
<p><strong>Setup</strong>:</p>
<pre><code>import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list('abcde')
# 'lid' group sizes are deliberately skewed: two large groups, three tiny ones
df = pd.DataFrame({'lid': np.random.choice(L, N, p=(0.75,0.0001,0.0005,0.0005,0.2489)),
                   'A':np.random.randint(10000,size=N)})
df = df.sort_values(['A','lid']).reset_index(drop=True)
#print (df)
print (df[~df.lid.groupby(df.lid).transform('count').gt(2000)])
print (df[df.groupby('lid')['lid'].transform('size') <= 2000])
print (df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2000].index)])

def pir(df):
    # integer-encode the labels, count each code, map counts back to rows
    f, u = pd.factorize(df.lid)
    return df[np.bincount(f)[f] <= 2000]

print (pir(df))
</code></pre>
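<p>The <code>pir</code> function is terse, so here is a step-by-step sketch of what it does, on a small hypothetical Series. It assumes the column has no missing values: <code>pd.factorize</code> encodes NaN as -1 by default, which <code>np.bincount</code> rejects.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical labels: group sizes are A=3, B=2, C=1, D=1, E=3
lid = pd.Series(list("AAABBCDEEE"))

# factorize maps each label to an integer code, in order of first appearance
codes, uniques = pd.factorize(lid)   # codes: [0 0 0 1 1 2 3 4 4 4]

# bincount counts how often each code occurs: one count per group
counts = np.bincount(codes)          # [3 2 1 1 3]

# indexing counts by codes maps each group's count back onto its rows
per_row = counts[codes]              # [3 3 3 2 2 1 1 3 3 3]

# keep rows whose group has at most 2 members
mask = per_row <= 2
print(lid[mask].tolist())
```

<p>All three steps run on plain NumPy integer arrays, which skips the groupby machinery entirely.</p>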
<p><strong>Caveat</strong></p>
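<p>The results do not address performance as a function of the number of groups, which will affect the timings a lot for some of these solutions.</p>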
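<p>One way to probe that effect is to rerun two of the approaches with different group counts. The sketch below is only an illustration: the helper names (<code>make_df</code>, <code>by_transform</code>, <code>by_filter</code>), the row count, the group counts, and the size limit are all assumptions, and the timings it prints will vary by machine and pandas version.</p>

```python
import timeit

import numpy as np
import pandas as pd


def make_df(n_rows, n_groups, seed=123):
    # Hypothetical helper: integer group labels drawn uniformly at random
    rng = np.random.default_rng(seed)
    return pd.DataFrame({"lid": rng.integers(0, n_groups, size=n_rows)})


def by_transform(df, limit):
    return df[df.groupby("lid")["lid"].transform("size") <= limit]


def by_filter(df, limit):
    return df.groupby("lid").filter(lambda g: len(g) <= limit)


results = {}
for n_groups in (5, 2000):
    df = make_df(20_000, n_groups)
    limit = 20_000 // n_groups  # roughly the mean group size (arbitrary choice)

    # Both approaches should select the same rows
    assert by_transform(df, limit).index.equals(by_filter(df, limit).index)

    results[n_groups] = {
        "transform": timeit.timeit(lambda: by_transform(df, limit), number=3),
        "filter": timeit.timeit(lambda: by_filter(df, limit), number=3),
    }
    print(n_groups, results[n_groups])
```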