<p><strong>Option 0</strong><br/>
Use <code>value_counts</code> and <code>isin</code></p>
<pre><code>df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2].index)]
entity pnb head# state lid
3 ABB002 A02 4 DOWN B
4 ABB002 A02 4 DOWN B
5 ABB002 A02 2 DOWN C
6 ABB002 A02 4 DOWN D
</code></pre>
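<p>The original frame is not shown here, so the following is only a sketch on a hypothetical sample: rows 3 to 6 mirror the output above, while rows 0 to 2 (three rows with <code>lid</code> equal to <code>'A'</code>) are an assumption made so that one label occurs more than twice. It splits the one-liner into named steps.</p>

```python
import pandas as pd

# Hypothetical sample data: rows 0-2 are assumed, rows 3-6 match the output above.
df = pd.DataFrame({
    'entity': ['ABB002'] * 7,
    'pnb': ['A02'] * 7,
    'head#': [4, 4, 4, 4, 4, 2, 4],
    'state': ['DOWN'] * 7,
    'lid': ['A', 'A', 'A', 'B', 'B', 'C', 'D'],
})

counts = df['lid'].value_counts()         # occurrences of each lid
too_common = counts[counts > 2].index     # labels that occur more than twice
result = df[~df['lid'].isin(too_common)]  # keep rows whose lid is not one of them

print(result)
```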
<hr/>
<p><strong>Option 1</strong><br/>
A better implementation using <code>np.in1d</code> and <code>pd.factorize</code></p>
<pre><code>lids = df.lid.values
f, u = pd.factorize(df.lid.values)
df[np.in1d(lids, u[np.bincount(f) <= 2])]
entity pnb head# state lid
3 ABB002 A02 4 DOWN B
4 ABB002 A02 4 DOWN B
5 ABB002 A02 2 DOWN C
6 ABB002 A02 4 DOWN D
</code></pre>
<hr/>
<p><strong>Option 2</strong><br/>
Use <code>np.bincount</code> and <code>pd.factorize</code></p>
<pre><code>f, u = pd.factorize(df.lid)
df[np.bincount(f)[f] <= 2]
entity pnb head# state lid
3 ABB002 A02 4 DOWN B
4 ABB002 A02 4 DOWN B
5 ABB002 A02 2 DOWN C
6 ABB002 A02 4 DOWN D
</code></pre>
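<p>The expression <code>np.bincount(f)[f]</code> can look cryptic at first, so here is a small step-by-step sketch. The <code>lid</code> series below is a hypothetical stand-in for <code>df.lid</code>, not the original data.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical labels standing in for df.lid
lid = pd.Series(['A', 'A', 'A', 'B', 'B', 'C', 'D'])

f, u = pd.factorize(lid)   # f: integer code per row, u: unique labels in order of appearance
counts = np.bincount(f)    # counts[k] is how often code k occurs -> [3, 2, 1, 1]
per_row = counts[f]        # index counts by f to get each row's own label count
mask = per_row <= 2        # True for rows whose label occurs at most twice

print(per_row)  # [3 3 3 2 2 1 1]
```

<p>Indexing <code>counts</code> with <code>f</code> is what turns the per-label counts into a per-row array that can be used directly as a boolean mask on the frame.</p>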
<hr/>
<p>A fun demonstration to emphasize what @cᴏʟᴅsᴘᴇᴇᴅ and I were discussing in the comments:</p>
<blockquote>
<p>Love the bincount one. There should be a np.unique one too, somewhere. – cᴏʟᴅsᴘᴇᴇᴅ</p>
<p>Yes there is. However, I don't use np.unique because @Jeff informed me that np.unique sorts when you grab counts or index or inverse. pd.factorize does not and is O(n). I've since validated that information. – piRSquared</p>
</blockquote>
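<p>The ordering difference mentioned in that comment is easy to see on a tiny input. This is only an illustration with a hypothetical array; it shows the order of the returned uniques and says nothing about timing by itself.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical input, deliberately not in sorted order
values = np.array(['b', 'a', 'c', 'a', 'b'], dtype=object)

# pd.factorize keeps uniques in order of first appearance
codes_f, uniques_f = pd.factorize(values)

# np.unique returns sorted uniques; the inverse codes refer to that sorted array
uniques_u, codes_u = np.unique(values, return_inverse=True)

print(list(uniques_f), codes_f)  # ['b', 'a', 'c'] [0 1 2 1 0]
print(list(uniques_u), codes_u)  # ['a', 'b', 'c'] [1 0 2 0 1]
```

<p>Either pair of codes and uniques works with <code>np.bincount</code>, since the counts only depend on which rows share a code, not on how the codes are ordered.</p>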
<p><strong>Timing</strong></p>
<pre><code>from timeit import timeit

import numpy as np
import pandas as pd

def bincount_factorize(df):
    f, u = pd.factorize(df.lid.values)
    return df[np.bincount(f)[f] <= 2]

def bincount_unique(df):
    u, f = np.unique(df.lid.values, return_inverse=True)
    return df[np.bincount(f)[f] <= 2]

def in1d_factorize(df):
    lids = df.lid.values
    f, u = pd.factorize(df.lid.values)
    return df[np.in1d(lids, u[np.bincount(f) <= 2])]

def transform(df):
    return df[df.groupby('lid')['lid'].transform('size') <= 2]

# One row per repetition factor, one column per method
res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000,
           30000, 100000, 300000, 1000000],
    columns=['bincount_factorize', 'bincount_unique',
             'in1d_factorize', 'transform'],
    dtype=float
)

for i in res.index:
    # Repeat the original frame i times to scale the input
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import d, {j}'
        res.at[i, j] = timeit(stmt, setp, number=100)

# Each timing relative to the fastest method in its row
res.div(res.min(axis=1), axis=0)

         bincount_factorize  bincount_unique  in1d_factorize  transform
10                 1.421827         1.000000        1.119577   3.751167
30                 1.008412         1.037297        1.000000   3.072631
100                1.000000         1.531300        1.028267   3.304560
300                1.000000         2.666583        1.182812   3.637235
1000               1.065213         5.563098        1.000000   2.556469
3000               1.024658        10.480027        1.000000   2.238765
10000              1.073403        14.716801        1.000000   1.574780
30000              1.000000        16.387130        1.053180   1.494161
100000             1.000000        18.533078        1.003031   1.369867
300000             1.078129        20.183122        1.000000   1.530698
1000000            1.166800        24.571463        1.000000   1.670423
</code></pre>
<hr/>
<pre><code>res.plot(loglog=True)
</code></pre>
<p><a href="https://i.stack.imgur.com/o1Dny.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/o1Dny.png" alt="Log-log plot of the timings in res for each method"/></a></p>