Use groupby to drop groups in which no row contains a pattern

Posted 2024-10-05 13:46:43


Hello, I have a DataFrame such as

col1 col2
G1 OP2
G1 OP0
G1 OPP
G1 OPL_Lh
G2 OII
G2 OIP
G2 IOP
G3 TYU
G4 TUI
G4 TYUI
G4 TR_Lh

I would like to group by col1 and drop from df every group that does not have at least one row whose col2 contains

'_Lh'

Here I should keep only G1 and G4 and get:

col1 col2
G1 OP2
G1 OP0
G1 OPP
G1 OPL_Lh
G4 TUI
G4 TYUI
G4 TR_Lh

Does anyone have an idea? Thanks a lot.


3 Answers

Here is the long way around solving this, to illustrate how groupby works.

First, create a function that tests for the desired string:

def contains_str(x, string='_Lh'):
    # True when `string` occurs anywhere inside x
    return string in x

Next, iterate over your groups and apply this function:

keep_dict = {}

for label, group_df in df.groupby('col1'):
    keep = group_df['col2'].apply(contains_str).any()
    keep_dict[label] = keep

print(keep_dict)
# {'G1': True, 'G2': False, 'G3': False, 'G4': True}

Feel free to print individual items in the operation to understand their role.
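
For instance, the loop above can be instrumented roughly as follows. This is only a sketch: the DataFrame is rebuilt by hand from the question, and `str.contains(..., regex=False)` stands in for the `contains_str` helper, so treat both as my assumptions rather than part of the steps above.

```python
import pandas as pd

# Sample data copied by hand from the question
df = pd.DataFrame({
    "col1": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G3", "G4", "G4", "G4"],
    "col2": ["OP2", "OP0", "OPP", "OPL_Lh", "OII", "OIP", "IOP", "TYU",
             "TUI", "TYUI", "TR_Lh"],
})

keep_dict = {}
for label, group_df in df.groupby("col1"):
    # One boolean per row: does col2 contain the literal substring "_Lh"?
    row_hits = group_df["col2"].str.contains("_Lh", regex=False)
    # A group is kept as soon as at least one of its rows matched
    keep_dict[label] = bool(row_hits.any())
    print(label, row_hits.tolist(), keep_dict[label])
```

Each printed line shows the group label, the row-by-row matches, and the resulting keep/drop decision for that group.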

Finally, map that dictionary onto your current df:

df_final = df[df['col1'].map(keep_dict)].reset_index(drop=True)

    col1    col2
0   G1      OP2
1   G1      OP0
2   G1      OPP
3   G1      OPL_Lh
4   G4      TUI
5   G4      TYUI
6   G4      TR_Lh

You can condense these steps with the following code:

keep_dict = df.groupby('col1', as_index=True)['col2'].apply(lambda arr: any([contains_str(x) for x in arr])).to_dict()

print(keep_dict)
# {'G1': True, 'G2': False, 'G3': False, 'G4': True}

I hope this both answers your Q and explains what's taking place "behind the scenes" in groupby operations.
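
If you would rather skip the dictionary altogether, pandas also has `DataFrameGroupBy.filter`, which keeps or drops whole groups based on a function that returns one boolean per group. That is a different technique from the steps above, so take this as a minimal sketch; the sample DataFrame is rebuilt by hand from the question.

```python
import pandas as pd

# Sample data copied by hand from the question
df = pd.DataFrame({
    "col1": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G3", "G4", "G4", "G4"],
    "col2": ["OP2", "OP0", "OPP", "OPL_Lh", "OII", "OIP", "IOP", "TYU",
             "TUI", "TYUI", "TR_Lh"],
})

# filter() calls the lambda once per group and keeps every row of the
# groups for which it returns True
df_final = df.groupby("col1").filter(
    lambda g: g["col2"].str.contains("_Lh", regex=False).any()
)

print(df_final)
```

Unlike the `reset_index(drop=True)` version above, this keeps the original row labels, so add `.reset_index(drop=True)` if you want a fresh 0..n-1 index.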

You can do:

filter_=df.loc[df["col2"].str.contains("_Lh"), "col1"].drop_duplicates()

df=df.merge(filter_, on="col1")

Output:

  col1    col2
0   G1     OP2
1   G1     OP0
2   G1     OPP
3   G1  OPL_Lh
4   G4     TUI
5   G4    TYUI
6   G4   TR_Lh
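
To see why the merge works, it may help to look at the intermediate object. The sketch below rebuilds the question's DataFrame by hand (my assumption about how `df` was created) and stores the result under a new name, `result`, instead of overwriting `df`; `regex=False` is added so that the pattern is treated as a literal substring.

```python
import pandas as pd

# Sample data copied by hand from the question
df = pd.DataFrame({
    "col1": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G3", "G4", "G4", "G4"],
    "col2": ["OP2", "OP0", "OPP", "OPL_Lh", "OII", "OIP", "IOP", "TYU",
             "TUI", "TYUI", "TR_Lh"],
})

# One entry per group that has at least one matching row
filter_ = df.loc[df["col2"].str.contains("_Lh", regex=False), "col1"].drop_duplicates()
print(filter_.tolist())

# merge() defaults to an inner join, so rows whose col1 value is not
# listed in filter_ are dropped
result = df.merge(filter_, on="col1")
print(result)
```

The merge builds a new index, which is why the output above is numbered 0 to 6 rather than carrying the original row labels.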

IIUC

You can use a boolean test and isin to filter down to the groups that contain _Lh

m = df[df['col2'].str.contains('_Lh')]['col1']

df[df['col1'].isin(m)].groupby('col1')...

print(df[df['col1'].isin(m)])

   col1    col2
0    G1     OP2
1    G1     OP0
2    G1     OPP
3    G1  OPL_Lh
8    G4     TUI
9    G4    TYUI
10   G4   TR_Lh
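
A close cousin of the `isin` mask, if you want the grouping to be explicit, is to broadcast a per-group `any` back onto every row with `transform`. This is a variation rather than the code above, sketched on a hand-built copy of the question's DataFrame.

```python
import pandas as pd

# Sample data copied by hand from the question
df = pd.DataFrame({
    "col1": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G3", "G4", "G4", "G4"],
    "col2": ["OP2", "OP0", "OPP", "OPL_Lh", "OII", "OIP", "IOP", "TYU",
             "TUI", "TYUI", "TR_Lh"],
})

# Row-level test: does col2 contain the literal substring "_Lh"?
row_hits = df["col2"].str.contains("_Lh", regex=False)

# Group-level test broadcast back to every row of the group
mask = row_hits.groupby(df["col1"]).transform("any")

kept = df[mask]
print(kept)
```

Like the `isin` version, this keeps the original row labels (0 to 3 and 8 to 10 here).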
