如果任何列值不在列表中,请删除组

2024-06-01 21:16:08 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个数据帧,例如:

Groups VAL1 VAL2
G1     DOG  3
G1     DOG  3
G1     DOG  3
G1     CAT  3
G2     CAT  3
G2     CAT  3
G2     CAT  3
G3     DOG  3
G3     BIRD 3
G4     CAT  3
G5     DOG  48
G5     DOG  48
G6     DOG  3
G6     DOG  3
G7     CAT  34
G7     CAT  34
G8     DOG  78
G8     CAT  2
G9     DOG  30
G9     DOG  30
G9     DOG  30
G9     CAT  30

我想删除列表(the_list=['BIRD','CAT'])中没有任何VAL1的所有Groups和其中的VAL2 > 20

因此,我应该得到:

Groups VAL1 VAL2
G1     DOG  3
G1     DOG  3
G1     DOG  3
G1     CAT  3
G2     CAT  3
G2     CAT  3
G2     CAT  3
G3     DOG  3
G3     BIRD 3
G4     CAT  3
G6     DOG  3
G6     DOG  3
G7     CAT  34
G7     CAT  34
G8     DOG  78
G8     CAT  2
G9     DOG  30
G9     DOG  30
G9     DOG  30
G9     CAT  30

请问有人有使用熊猫的想法吗

到目前为止,我试过:

 tab.groupby('Groups').filter(lambda x: x['VAL2']>20 & x['VAL1'].isnotin(['BIRD','CAT']))

Tags: catgroupsdogg4g1g2val1bird
3条回答

如果我理解正确,组筛选器应该检查“列表中是否有任何VAL1或所有VAL2小于20”:

the_list = ['BIRD', 'CAT']

df.groupby('Groups').filter(
    lambda g: any(g.VAL1.isin(the_list)) | all(g.VAL2.lt(20)))
   Groups  VAL1  VAL2
0      G1   DOG     3
1      G1   DOG     3
2      G1   DOG     3
3      G1   CAT     3
4      G2   CAT     3
5      G2   CAT     3
6      G2   CAT     3
7      G3   DOG     3
8      G3  BIRD     3
9      G4   CAT     3
12     G6   DOG     3
13     G6   DOG     3
14     G7   CAT    34
15     G7   CAT    34
16     G8   DOG    78
17     G8   CAT     2
18     G9   DOG    30
19     G9   DOG    30
20     G9   DOG    30
21     G9   CAT    30

这些是220万行当前答案的计时。groupby筛选器可读性最好,但在大型数据帧上稍慢一些:

>>> df = pd.concat([df]*100000) # 2200000 rows

>>> %timeit g_transform(df)
441 ms ± 4.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit g_agg(df)
528 ms ± 23.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit g_filter(df)
593 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • 建立你的条件:
    1. VAL1includeVAL2小于20
    2. 使用.any().all()向整个小组广播您的病情
    3. 使用|(or)操作符将这两个操作符组合在一起
  • 将groupby与您的条件结合使用,以获得相关的组
  • 使用.isin()从前面步骤中选择的组筛选数据帧
# Group should match if VAL1 contains at least one element of the list:
cond1 = lambda g: g.VAL1.isin(['BIRD','CAT']).any()

# Group should match is all values in VAL2 are smaller than 20
cond2 = lambda g: (g.VAL2 < 20).all()

# Relevant groups:
selection = df.groupby("Groups").agg(lambda g: cond1(g) | cond2(g)).index

# Final dataframe
df = df[df.Groups.isin(selection)]

# One-liner
df = df[df.Groups.isin(df.groupby("Groups").agg(lambda g: (g.VAL1.isin(['BIRD','CAT']).any() |  (g.VAL2 < 20).all())).index)]

如果性能很重要,不要使用filter

更快的解决方案是将^{}anyall一起使用,并在^{}中过滤:

L = ['BIRD', 'CAT']

m1 = df.VAL1.isin(L)
m2 = df.VAL2.lt(20)


df1=df[m1.groupby(df['Groups']).transform('any')|m2.groupby(df['Groups']).transform('all')]
print (df1)
   Groups  VAL1  VAL2
0      G1   DOG     3
1      G1   DOG     3
2      G1   DOG     3
3      G1   CAT     3
4      G2   CAT     3
5      G2   CAT     3
6      G2   CAT     3
7      G3   DOG     3
8      G3  BIRD     3
9      G4   CAT     3
12     G6   DOG     3
13     G6   DOG     3
14     G7   CAT    34
15     G7   CAT    34
16     G8   DOG    78
17     G8   CAT     2
18     G9   DOG    30
19     G9   DOG    30
20     G9   DOG    30
21     G9   CAT    30

相关问题 更多 >