删除出现N次以上的重复值

3条回答

网友

1楼 · 编辑于 2024-06-24 13:23:19

将^{}与^{}一起使用：

df = df[df.groupby('lid')['lid'].transform('size') <= 2]

print (df)
   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

细节：

print (df.groupby('lid')['lid'].transform('size'))
0    3
1    3
2    3
3    2
4    2
5    1
6    1
7    3
8    3
9    3
Name: lid, dtype: int64

print (df.groupby('lid')['lid'].transform('size') <= 2)
0    False
1    False
2    False
3     True
4     True
5     True
6     True
7    False
8    False
9    False
Name: lid, dtype: bool

另一个具有filter的更慢的解决方案：

df = df.groupby('lid').filter(lambda x: len(x) <= 2)
print (df)
   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

时间安排：

#jez1
In [34]: %timeit (df[df.groupby('lid')['lid'].transform('size') <= 2000])
10 loops, best of 3: 57.8 ms per loop

#jez2
In [35]: %timeit df.groupby('lid').filter(lambda x: len(x) <= 2000)
10 loops, best of 3: 124 ms per loop

#cᴏʟᴅsᴘᴇᴇᴅ
In [36]: %timeit (df[~df.lid.groupby(df.lid).transform('count').gt(2000)])
10 loops, best of 3: 93.6 ms per loop

#pir1
In [37]: %timeit (df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2000].index)])
10 loops, best of 3: 137 ms per loop

#pir2
In [38]: %timeit (pir(df))
10 loops, best of 3: 32.9 ms per loop

设置：

np.random.seed(123)
N = 1000000
L = list('abcde') 
df = pd.DataFrame({'lid': np.random.choice(L, N, p=(0.75,0.0001,0.0005,0.0005,0.2489)),
                   'A':np.random.randint(10000,size=N)})
df = df.sort_values(['A','lid']).reset_index(drop=True)
#print (df)


print (df[~df.lid.groupby(df.lid).transform('count').gt(2000)])
print (df[df.groupby('lid')['lid'].transform('size') <= 2000])
print (df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2000].index)])


def pir(df):
    f, u = pd.factorize(df.lid)
    return df[np.bincount(f)[f] <= 2000]

print (pir(df))

警告

结果并不涉及给定组数的性能，这将在很大程度上影响某些解决方案的计时。你知道吗

网友
2楼 · 编辑于 2024-06-24 13:23:19

使用groupby+transform。你知道吗
df[~df.lid.groupby(df.lid).transform('count').gt(2)]
entity pnb head# state lid 3 ABB002 A02 4 DOWN B 4 ABB002 A02 4 DOWN B 5 ABB002 A02 2 DOWN C 6 ABB002 A02 4 DOWN D
transform获取一系列相同大小的计数。你知道吗
v = df.lid.groupby(df.lid).transform('count') v 0 3 1 3 2 3 3 2 4 2 5 1 6 1 7 3 8 3 9 3 Name: lid, dtype: int
用它来找出需要去哪一行。你知道吗
~v.gt(2) 0 False 1 False 2 False 3 True 4 True 5 True 6 True 7 False 8 False 9 False Name: lid, dtype: bool
使用掩码索引df。你知道吗

网友
3楼 · 编辑于 2024-06-24 13:23:19

选项0
使用value_counts和isin

df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2].index)]

   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

选项1
更好地用np.in1d和pd.factorize实现

lids = df.lid.values
f, u = pd.factorize(df.lid.values)
df[np.in1d(lids, u[np.bincount(f) <= 2])]

   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

选项2
使用np.bincount和pd.factorize

f, u = pd.factorize(df.lid)
df[np.bincount(f)[f] <= 2]

   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

为了有趣的演示来强调@cᴏʟᴅsᴘᴇᴇᴅ和我在评论中谈论的内容。你知道吗

Love the bincount one. There should be a np.unique one too, somewhere. – cᴏʟᴅsᴘᴇᴇᴅ
Yes there is. However, I don't use np.unique because @Jeff informed me that np.unique sorts when you grab counts or index or inverse. pd.factorize does not and is O(n). I've since validated that information. – piRSquared

时间测试

def bincount_factorize(df):
    f, u = pd.factorize(df.lid.values)
    return df[np.bincount(f)[f] <= 2]

def bincount_unique(df):
    u, f = np.unique(df.lid.values, return_inverse=True)
    return df[np.bincount(f)[f] <= 2]

def in1d_factorize(df):
    lids = df.lid.values
    f, u = pd.factorize(df.lid.values)
    return df[np.in1d(lids, u[np.bincount(f) <= 2])]

def transform(df):
    return df[df.groupby('lid')['lid'].transform('size') <= 2]

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000,
           30000, 100000, 300000, 1000000],
    columns=['bincount_factorize', 'bincount_unique',
             'in1d_factorize', 'transform'],
    dtype=float
)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import d, {j}'
        res.at[i, j] = timeit(stmt, setp, number=100)

res.div(res.min(1), 0)

         bincount_factorize  bincount_unique  in1d_factorize  transform
10                 1.421827         1.000000        1.119577   3.751167
30                 1.008412         1.037297        1.000000   3.072631
100                1.000000         1.531300        1.028267   3.304560
300                1.000000         2.666583        1.182812   3.637235
1000               1.065213         5.563098        1.000000   2.556469
3000               1.024658        10.480027        1.000000   2.238765
10000              1.073403        14.716801        1.000000   1.574780
30000              1.000000        16.387130        1.053180   1.494161
100000             1.000000        18.533078        1.003031   1.369867
300000             1.078129        20.183122        1.000000   1.530698
1000000            1.166800        24.571463        1.000000   1.670423

res.plot(loglog=True)

相关问题更多 >

编程相关推荐

热门问题

热门文章