<p>I have the following DataFrame:</p>
<pre><code>import pandas as pd

df = pd.DataFrame({'query': ['prefix_v1_0001', 'prefix_v1_0002', 'prefix_v1_0003', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0004', 'prefix_v1_0005'],
'knum': ['-', '-', 'K03643', 'K02340', 'K02340', 'K02340', 'K02340', 'K02340', 'K03643', '-'],
'definition': ['-', '-', 'LPS-assembly lipoprotein', 'DNA polymerase III subunit delta [EC:2.7.7.7]', 'DNA polymerase III subunit delta [EC:2.7.7.7]', 'DNA polymerase III subunit delta [EC:2.7.7.7]', 'DNA polymerase III subunit delta [EC:2.7.7.7]', 'DNA polymerase III subunit delta [EC:2.7.7.7]', 'LPS-assembly lipoprotein', '-'],
'A': ['-', '-', 'Brite Hierarchies (09180)', 'Genetic Information Processing (09120)', 'Genetic Information Processing (09120)', 'Genetic Information Processing (09120)', 'Brite Hierarchies (09180)', 'Brite Hierarchies (09180)', 'Brite Hierarchies (09180)', '-'],
'B': ['-', '-', 'Protein families: signaling and cellular processes (09183)', 'Replication and repair (09124)', 'Replication and repair (09124)', 'Replication and repair (09124)', 'Protein families: genetic information processing (09182)', 'Protein families: genetic information processing (09182)', 'Protein families: signaling and cellular processes (09183)', '-'],
'C': ['-', '-', 'Transporters (02000) [BR:ko0200]', 'DNA replication (03030) [PATH:ko0303]', 'Mismatch repair (03430) [PATH:ko0343]', 'Homologous recombination (03440) [PATH:ko0344]', 'DNA replication proteins (03032) [BR:ko0303]', 'DNA repair and recombination proteins (03400) [BR:ko0340]', 'Transporters (02000) [BR:ko0200]', '-']})
</code></pre>
<p>I want to group by <code>query</code> and join the values of the other columns within each group using the "|" character.</p>
<p>This is my current code:</p>
<pre><code>df.groupby('query').agg({'knum': lambda x: ' | '.join(x.tolist()),
'definition': lambda x: ' | '.join(x.tolist()),
'A': lambda x: ' | '.join(x.tolist()),
'B': lambda x: ' | '.join(x.tolist()),
'C': lambda x: ' | '.join(x.tolist()),
})
</code></pre>
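<p>However, because many of the cell values are repeated, my table looks like this:
<a href="https://i.stack.imgur.com/Ou5IP.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/Ou5IP.png" alt="enter image description here"/></a>
But for the <code>query</code> value <code>prefix_v1_0004</code>, <code>knum</code> actually has only two unique values.
I want to remove all the duplicate values within each group. Alternatively, is there a way to do this with <code>aggregate()</code>?</p>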
<p>This is my desired output:
<a href="https://i.stack.imgur.com/nJaRY.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/nJaRY.png" alt="enter image description here"/></a></p>
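<p>To make the goal concrete, here is a minimal sketch of the kind of result I am after. It assumes that each distinct value should appear once per group, in order of first appearance, and that the columns contain only strings (no missing values). The helper name <code>join_unique</code> and the small frame <code>small</code> are made up for illustration and are not part of my real data.</p>

```python
import pandas as pd


def join_unique(values: pd.Series) -> str:
    # Series.unique() returns the distinct values in order of first appearance,
    # so repeated entries within a group are collapsed before joining.
    return ' | '.join(values.unique())


# Hypothetical, shortened stand-in for the real DataFrame.
small = pd.DataFrame({
    'query': ['q1', 'q2', 'q2', 'q2'],
    'knum': ['-', 'K02340', 'K02340', 'K03643'],
    'A': ['-', 'x', 'y', 'x'],
})

# Passing a single callable to agg() applies it to every non-grouping column.
result = small.groupby('query').agg(join_unique)
print(result)
```

<p>Passing one function to <code>agg()</code> would also avoid repeating the same lambda for every column, provided every column should be handled the same way.</p>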