Aggregating a DataFrame while collapsing duplicate cell values

Posted 2024-07-03 06:47:27

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'query': ['prefix_v1_0001',   'prefix_v1_0002',   'prefix_v1_0003',   'prefix_v1_0004',   'prefix_v1_0004',   'prefix_v1_0004',   'prefix_v1_0004',   'prefix_v1_0004',   'prefix_v1_0004',   'prefix_v1_0005'],
                   'knum': ['-',   '-',   'K03643',   'K02340',   'K02340',   'K02340',   'K02340',   'K02340',   'K03643',   '-'],
                   'definition': ['-',   '-',   'LPS-assembly lipoprotein',   'DNA polymerase III subunit delta [EC:2.7.7.7]',   'DNA polymerase III subunit delta [EC:2.7.7.7]',   'DNA polymerase III subunit delta [EC:2.7.7.7]',   'DNA polymerase III subunit delta [EC:2.7.7.7]',   'DNA polymerase III subunit delta [EC:2.7.7.7]',   'LPS-assembly lipoprotein',   '-'],
                   'A': ['-',   '-',   'Brite Hierarchies (09180)',   'Genetic Information Processing (09120)',   'Genetic Information Processing (09120)',   'Genetic Information Processing (09120)',   'Brite Hierarchies (09180)',   'Brite Hierarchies (09180)',   'Brite Hierarchies (09180)',   '-'],
                   'B': ['-',   '-',   'Protein families: signaling and cellular processes (09183)',   'Replication and repair (09124)',   'Replication and repair (09124)',   'Replication and repair (09124)',   'Protein families: genetic information processing (09182)',   'Protein families: genetic information processing (09182)',   'Protein families: signaling and cellular processes (09183)',   '-'],
                   'C': ['-',   '-',   'Transporters  (02000) [BR:ko0200]',   'DNA replication  (03030) [PATH:ko0303]',   'Mismatch repair  (03430) [PATH:ko0343]',   'Homologous recombination  (03440) [PATH:ko0344]',   'DNA replication proteins  (03032) [BR:ko0303]',   'DNA repair and recombination proteins  (03400) [BR:ko0340]',   'Transporters  (02000) [BR:ko0200]',   '-']})

I want to group by query and join the values of the other columns within each group using the "|" character.

This is my current code:

df.groupby('query').agg({'knum': lambda x: ' | '.join(x.tolist()),
                         'definition': lambda x: ' | '.join(x.tolist()),
                         'A': lambda x: ' | '.join(x.tolist()),
                         'B': lambda x: ' | '.join(x.tolist()),
                         'C': lambda x: ' | '.join(x.tolist()),
                         })

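However, because so many cells repeat the same content, the resulting table is cluttered with duplicates (the screenshot from the original post is not available). For the query prefix_v1_0004, for example, knum actually has only two unique values. I would like to drop all the duplicate values. Is there a way to do this with aggregate()?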

The desired output was shown as an image in the original post, which is not available.


2 Answers

If I understand correctly:

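Try groupby() with agg(), using set instead of list so that each group keeps only its unique values. Note that a set does not preserve the original order: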

df = df.groupby('query').agg(lambda x: ' | '.join(set(x)))

If the order matters, use pd.unique() to get the unique values, which keeps them in order of first appearance:

df = df.groupby('query').agg(lambda x: ' | '.join(pd.unique(x)))
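
As a minimal sketch of the order-preserving variant, the snippet below applies pd.unique() to a small frame. The frame df_small, its shortened values, and the helper name join_unique are made up for illustration and are not part of the original question; only the groupby() + agg() + pd.unique() pattern comes from the answer above.

```python
import pandas as pd

# Hypothetical miniature of the question's data (values shortened for readability).
df_small = pd.DataFrame({
    'query': ['q1', 'q2', 'q2', 'q2', 'q2'],
    'knum': ['-', 'K02340', 'K02340', 'K03643', 'K02340'],
    'A': ['-', 'Genetic', 'Brite', 'Brite', 'Genetic'],
})


def join_unique(values):
    """Join the distinct values of one group, keeping first-appearance order."""
    return ' | '.join(pd.unique(values))


# agg() calls join_unique once per column within each 'query' group.
result = df_small.groupby('query').agg(join_unique)
print(result)
```

Here the q2 row should keep K02340 before K03643 because that is the order in which the values first appear in the group.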

If you only want to do this for selected columns, put those columns in a list and aggregate just them:

cols = ['knum', 'definition', 'A', 'B', 'C']
df = df.groupby('query')[cols].agg(lambda x: ' | '.join(set(x)))

Use a set, which keeps only unique values, instead of a list:

df.groupby('query').agg({'knum': lambda x: ' | '.join(set(x)),
                         'definition': lambda x: ' | '.join(set(x)),
                         'A': lambda x: ' | '.join(set(x)),
                         'B': lambda x: ' | '.join(set(x)),
                         'C': lambda x: ' | '.join(set(x)),
                         })
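
One caveat with the set-based versions is that a set has no defined order, so the joined strings may come out in a different order from one run to the next. If a reproducible result is wanted and first-appearance order is not required, sorting the set is one possible option. This is a sketch on a made-up frame (df_small and its values are illustrative assumptions), not something the answers above propose:

```python
import pandas as pd

# Hypothetical miniature of the question's data (values shortened for readability).
df_small = pd.DataFrame({
    'query': ['q1', 'q2', 'q2', 'q2', 'q2'],
    'knum': ['-', 'K02340', 'K02340', 'K03643', 'K02340'],
    'A': ['-', 'Genetic', 'Brite', 'Brite', 'Genetic'],
})

# set() removes the duplicates; sorted() fixes the order alphabetically,
# so the joined string is the same on every run.
result_sorted = df_small.groupby('query').agg(lambda x: ' | '.join(sorted(set(x))))
print(result_sorted)
```

Unlike pd.unique(), this orders the values alphabetically rather than by first appearance.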
