删除出现N次以上的重复值

2024-06-24 13:23:19 发布

您现在位置:Python中文网/ 问答频道 /正文

在“lid”列中有一个重复值的数据框。我想使用Pandas删除列“lid”中其值被计数超过2次的行。 这是原始表格:

entity  pnb head#   state   lid
ABB001  A03 3   DOWN    A
ABB001  A03 3   DOWN    A
ABB001  A03 3   DOWN    A
ABB002  A02 4   DOWN    B
ABB002  A02 4   DOWN    B
ABB002  A02 2   DOWN    C
ABB002  A02 4   DOWN    D
ABB002  A02 4   DOWN    E
ABB002  A02 4   DOWN    E
ABB002  A02 4   DOWN    E

结果如下:

entity  pnb head#   state   lid
ABB002  A02 4   DOWN    B
ABB002  A02 4   DOWN    B
ABB002  A02 2   DOWN    C
ABB002  A02 4   DOWN    D

Tags: 数据pandashead表格down计数entitystate
3条回答

^{}^{}一起使用:

df = df[df.groupby('lid')['lid'].transform('size') <= 2]

print (df)
   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

细节:

print (df.groupby('lid')['lid'].transform('size'))
0    3
1    3
2    3
3    2
4    2
5    1
6    1
7    3
8    3
9    3
Name: lid, dtype: int64

print (df.groupby('lid')['lid'].transform('size') <= 2)
0    False
1    False
2    False
3     True
4     True
5     True
6     True
7    False
8    False
9    False
Name: lid, dtype: bool

另一个具有filter的更慢的解决方案:

df = df.groupby('lid').filter(lambda x: len(x) <= 2)
print (df)
   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

时间安排:

#jez1
In [34]: %timeit (df[df.groupby('lid')['lid'].transform('size') <= 2000])
10 loops, best of 3: 57.8 ms per loop

#jez2
In [35]: %timeit df.groupby('lid').filter(lambda x: len(x) <= 2000)
10 loops, best of 3: 124 ms per loop

#cᴏʟᴅsᴘᴇᴇᴅ
In [36]: %timeit (df[~df.lid.groupby(df.lid).transform('count').gt(2000)])
10 loops, best of 3: 93.6 ms per loop

#pir1
In [37]: %timeit (df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2000].index)])
10 loops, best of 3: 137 ms per loop

#pir2
In [38]: %timeit (pir(df))
10 loops, best of 3: 32.9 ms per loop

设置

np.random.seed(123)
N = 1000000
L = list('abcde') 
df = pd.DataFrame({'lid': np.random.choice(L, N, p=(0.75,0.0001,0.0005,0.0005,0.2489)),
                   'A':np.random.randint(10000,size=N)})
df = df.sort_values(['A','lid']).reset_index(drop=True)
#print (df)


print (df[~df.lid.groupby(df.lid).transform('count').gt(2000)])
print (df[df.groupby('lid')['lid'].transform('size') <= 2000])
print (df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2000].index)])


def pir(df):
    f, u = pd.factorize(df.lid)
    return df[np.bincount(f)[f] <= 2000]

print (pir(df))

警告

结果并不涉及给定组数的性能,这将在很大程度上影响某些解决方案的计时。你知道吗

使用groupby+transform。你知道吗

df[~df.lid.groupby(df.lid).transform('count').gt(2)]

   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

transform获取一系列相同大小的计数。你知道吗

v = df.lid.groupby(df.lid).transform('count')
v

0    3
1    3
2    3
3    2
4    2
5    1
6    1
7    3
8    3
9    3
Name: lid, dtype: int

用它来找出需要去哪一行。你知道吗

~v.gt(2)

0    False
1    False
2    False
3     True
4     True
5     True
6     True
7    False
8    False
9    False
Name: lid, dtype: bool

使用掩码索引df。你知道吗

选项0
使用value_countsisin

df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2].index)]

   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

选项1
更好地用np.in1dpd.factorize实现

lids = df.lid.values
f, u = pd.factorize(df.lid.values)
df[np.in1d(lids, u[np.bincount(f) <= 2])]

   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

选项2
使用np.bincountpd.factorize

f, u = pd.factorize(df.lid)
df[np.bincount(f)[f] <= 2]

   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

为了有趣的演示来强调@cᴏʟᴅsᴘᴇᴇᴅ和我在评论中谈论的内容。你知道吗

Love the bincount one. There should be a np.unique one too, somewhere. – cᴏʟᴅsᴘᴇᴇᴅ

Yes there is. However, I don't use np.unique because @Jeff informed me that np.unique sorts when you grab counts or index or inverse. pd.factorize does not and is O(n). I've since validated that information. – piRSquared

时间测试

def bincount_factorize(df):
    f, u = pd.factorize(df.lid.values)
    return df[np.bincount(f)[f] <= 2]

def bincount_unique(df):
    u, f = np.unique(df.lid.values, return_inverse=True)
    return df[np.bincount(f)[f] <= 2]

def in1d_factorize(df):
    lids = df.lid.values
    f, u = pd.factorize(df.lid.values)
    return df[np.in1d(lids, u[np.bincount(f) <= 2])]

def transform(df):
    return df[df.groupby('lid')['lid'].transform('size') <= 2]

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000,
           30000, 100000, 300000, 1000000],
    columns=['bincount_factorize', 'bincount_unique',
             'in1d_factorize', 'transform'],
    dtype=float
)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import d, {j}'
        res.at[i, j] = timeit(stmt, setp, number=100)

res.div(res.min(1), 0)

         bincount_factorize  bincount_unique  in1d_factorize  transform
10                 1.421827         1.000000        1.119577   3.751167
30                 1.008412         1.037297        1.000000   3.072631
100                1.000000         1.531300        1.028267   3.304560
300                1.000000         2.666583        1.182812   3.637235
1000               1.065213         5.563098        1.000000   2.556469
3000               1.024658        10.480027        1.000000   2.238765
10000              1.073403        14.716801        1.000000   1.574780
30000              1.000000        16.387130        1.053180   1.494161
100000             1.000000        18.533078        1.003031   1.369867
300000             1.078129        20.183122        1.000000   1.530698
1000000            1.166800        24.571463        1.000000   1.670423

res.plot(loglog=True)

enter image description here

相关问题 更多 >