I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'query': ['prefix_v1_0001', 'prefix_v1_0002', 'prefix_v1_0003', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0005'],
'knum': ['-', '-', 'K03643', 'K02340', 'K02340', 'K02340', 'K02340', 'K02340', 'K03643', '-'],
'definition': ['-', '-', 'LPS-assembly lipoprotein', 'DNA polymerase III subunit delta [EC:2.7.7.7]', 'DNA polymerase III subunit delta [EC:2.7.7.7]', 'DNA polymerase III subunit delta [EC:2.7.7.7]', 'DNA polymerase III subunit delta [EC:2.7.7.7]', 'DNA polymerase III subunit delta [EC:2.7.7.7]', 'LPS-assembly lipoprotein', '-'],
'A': ['-', '-', 'Brite Hierarchies (09180)', 'Genetic Information Processing (09120)', 'Genetic Information Processing (09120)', 'Genetic Information Processing (09120)', 'Brite Hierarchies (09180)', 'Brite Hierarchies (09180)', 'Brite Hierarchies (09180)', '-'],
'B': ['-', '-', 'Protein families: signaling and cellular processes (09183)', 'Replication and repair (09124)', 'Replication and repair (09124)', 'Replication and repair (09124)', 'Protein families: genetic information processing (09182)', 'Protein families: genetic information processing (09182)', 'Protein families: signaling and cellular processes (09183)', '-'],
'C': ['-', '-', 'Transporters (02000) [BR:ko0200]', 'DNA replication (03030) [PATH:ko0303]', 'Mismatch repair (03430) [PATH:ko0343]', 'Homologous recombination (03440) [PATH:ko0344]', 'DNA replication proteins (03032) [BR:ko0303]', 'DNA repair and recombination proteins (03400) [BR:ko0340]', 'Transporters (02000) [BR:ko0200]', '-']})
I want to group by query and aggregate the other cells, joining their values with the "|" character.
This is my current code:
df.groupby('query').agg({'knum': lambda x: ' | '.join(x.tolist()),
'definition': lambda x: ' | '.join(x.tolist()),
'A': lambda x: ' | '.join(x.tolist()),
'B': lambda x: ' | '.join(x.tolist()),
'C': lambda x: ' | '.join(x.tolist()),
})
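However, because so many cell values are duplicated, my table looks like this:
In reality, for query prefix_v1_0004, knum has only two unique values.
I want to drop all the duplicate values. Is there a way to do that with aggregate()?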
If I understand correctly:
Try groupby() + agg(), collecting the unique values with a set instead of a list:
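The code for this step did not survive on the page, so here is a minimal sketch of what it might look like. The DataFrame is a reduced sample that reuses the question's column names, and the variable name out is only illustrative. Note that a set has no guaranteed order, so the joined values may come out in any order.

```python
import pandas as pd

# Reduced sample reusing the question's column names (not the full data)
df = pd.DataFrame({
    'query': ['prefix_v1_0003', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0004'],
    'knum': ['K03643', 'K02340', 'K02340', 'K03643'],
    'A': ['Brite Hierarchies (09180)',
          'Genetic Information Processing (09120)',
          'Brite Hierarchies (09180)',
          'Brite Hierarchies (09180)'],
})

# set() removes duplicates within each group before joining;
# the order of the joined values is not guaranteed
out = df.groupby('query').agg(lambda x: ' | '.join(set(x)))
print(out)
```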
Or, if order matters, use pd.unique() to get the unique values:
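A sketch of the order-preserving variant, again on a reduced sample with the question's column names (the name out is illustrative). pd.unique() returns values in order of first appearance, so the joined string follows the original row order within each group.

```python
import pandas as pd

# Reduced sample reusing the question's column names (not the full data)
df = pd.DataFrame({
    'query': ['prefix_v1_0003', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0004'],
    'knum': ['K03643', 'K02340', 'K02340', 'K03643'],
    'A': ['Brite Hierarchies (09180)',
          'Genetic Information Processing (09120)',
          'Brite Hierarchies (09180)',
          'Brite Hierarchies (09180)'],
})

# pd.unique() keeps the first occurrence of each value, in order
out = df.groupby('query').agg(lambda x: ' | '.join(pd.unique(x)))
print(out)
```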
Or, if you only want to do this for selected columns, create a list of those columns and aggregate only those:
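A sketch of restricting the aggregation to chosen columns. The list name cols and the particular columns picked are assumptions for illustration. Columns left out of the list simply do not appear in the result.

```python
import pandas as pd

# Reduced sample reusing the question's column names (not the full data)
df = pd.DataFrame({
    'query': ['prefix_v1_0003', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0004'],
    'knum': ['K03643', 'K02340', 'K02340', 'K03643'],
    'definition': ['LPS-assembly lipoprotein',
                   'DNA polymerase III subunit delta [EC:2.7.7.7]',
                   'DNA polymerase III subunit delta [EC:2.7.7.7]',
                   'LPS-assembly lipoprotein'],
    'A': ['Brite Hierarchies (09180)',
          'Genetic Information Processing (09120)',
          'Brite Hierarchies (09180)',
          'Brite Hierarchies (09180)'],
})

# Hypothetical selection: only these columns are aggregated
cols = ['knum', 'definition']

# Select the columns on the grouped object, then join unique values in order
out = df.groupby('query')[cols].agg(lambda x: ' | '.join(pd.unique(x)))
print(out)
```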
Use a set, which holds only unique values, instead of a list:
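The original snippet is missing here too. One way to apply the idea to the per-column dict from the question is sketched below: x.tolist() becomes set(x). Wrapping the set in sorted() is my own addition, used only to make the output order deterministic, and the helper name join_unique is hypothetical.

```python
import pandas as pd

# Reduced sample reusing the question's column names (not the full data)
df = pd.DataFrame({
    'query': ['prefix_v1_0003', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0004'],
    'knum': ['K03643', 'K02340', 'K02340', 'K03643'],
    'A': ['Brite Hierarchies (09180)',
          'Genetic Information Processing (09120)',
          'Brite Hierarchies (09180)',
          'Brite Hierarchies (09180)'],
})

def join_unique(x):
    """Join the distinct values of a group, sorted alphabetically."""
    # set() removes duplicates; sorted() is an added assumption for stable output
    return ' | '.join(sorted(set(x)))

# Same dict-style agg() as in the question, with the deduplicating helper
out = df.groupby('query').agg({'knum': join_unique, 'A': join_unique})
print(out)
```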