Matching rows between two files

1 answer

User

Reply #1 · Posted 2024-10-05 17:36:54

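The key is to split each value in mydf's code column into a list of single-character codes, then use explode to expand the dataframe into one row per code. After that you can merge the two dataframes and aggregate the result.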

import pandas as pd

masterdf = pd.DataFrame({'code': ['K', 'L', 'M', 'S', '-'],
                         'name': ['Transcription', 'Replication, recombination and repair',
                                  'Cell wall/membrane/envelope biosynthesis',
                                  'Function unknown', '-']})
print(masterdf)

mydf = pd.DataFrame({'query': [1, 2, 3, 4, 5],
                     'code': ['S', 'K', 'MK', 'LS', '-']})
print(mydf)

Note that I added a row for - to masterdf. If you load the dataframes from files, you should be able to add this row after loading.
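If the lookup table comes from a file, the placeholder row can be appended right after loading. A minimal sketch, assuming a tab-separated file with code and name columns; the in-memory master_file below is a hypothetical stand-in for the real file, whose format the question does not show:

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file (assumed tab-separated with a header row)
master_file = io.StringIO(
    'code\tname\n'
    'K\tTranscription\n'
    'L\tReplication, recombination and repair\n'
    'M\tCell wall/membrane/envelope biosynthesis\n'
    'S\tFunction unknown\n'
)
masterdf = pd.read_csv(master_file, sep='\t')

# Append the '-' placeholder so rows without a real code still match in the merge
placeholder = pd.DataFrame({'code': ['-'], 'name': ['-']})
masterdf = pd.concat([masterdf, placeholder], ignore_index=True)
print(masterdf)
```

Passing ignore_index=True keeps the index running from 0 to 4 instead of repeating 0 for the appended row.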

The first step splits 'MK' into [M, K] and 'LS' into [L, S]. Every other code becomes a single-element list.

# Expand the code column into a list of characters
mydf['code'] = mydf['code'].apply(list)
print(mydf)

Output:

   query    code
0      1     [S]
1      2     [K]
2      3  [M, K]
3      4  [L, S]
4      5     [-]

The next step turns each row that holds several codes into several rows, which makes the merge in the following step possible.

# Explode the code list into multiple rows
mydf = mydf.explode('code')
print(mydf)

Output:

   query code
0      1    S
1      2    K
2      3    M
2      3    K
3      4    L
3      4    S
4      5    -

The merge brings in the name column from masterdf.

# Merge the two dataframes. how='left' keeps mydf's row order; the stable sort
# (kind='mergesort') keeps the codes in their original order within each query.
merged_df = mydf.merge(masterdf, on='code', how='left').sort_values('query', kind='mergesort')
print(merged_df)

Output:

   query code                                      name
0      1    S                          Function unknown
1      2    K                             Transcription
2      3    M  Cell wall/membrane/envelope biosynthesis
3      3    K                             Transcription
4      4    L     Replication, recombination and repair
5      4    S                          Function unknown
6      5    -                                         -
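The aggregation step joins the names as strings, so a code that has no entry in masterdf (its name is NaN after the left merge) would make that join fail with a TypeError. A minimal sketch of a check that could run right after the merge; the code 'X' and the 'Unknown code' label are hypothetical and not part of the original data:

```python
import pandas as pd

masterdf = pd.DataFrame({'code': ['K', 'S'],
                         'name': ['Transcription', 'Function unknown']})
# 'X' is a hypothetical code with no entry in masterdf
exploded = pd.DataFrame({'query': [1, 2, 2], 'code': ['S', 'K', 'X']})

merged = exploded.merge(masterdf, on='code', how='left')

# Rows whose code found no match carry NaN in the name column
unmatched = merged[merged['name'].isna()]
print(unmatched)

# One option: substitute a placeholder name so the later string join still works
merged['name'] = merged['name'].fillna('Unknown code')
print(merged)
```

Whether to fill in a placeholder or to stop and fix the lookup table depends on the data; the check itself only reports which codes are missing.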

The final step aggregates the expanded rows back into the desired result.

# Aggregate the rows back together, grouped by query.
# Join individual codes back to their original values.
# Join corresponding names with &.
df = merged_df.groupby('query').agg({'code': lambda x: ''.join(x.tolist()),
                                     'name': lambda x: ' & '.join(x.tolist())})
print(df)

Output:

      code                                               name
query                                                        
1        S                                   Function unknown
2        K                                      Transcription
3       MK  Cell wall/membrane/envelope biosynthesis & Tra...
4       LS  Replication, recombination and repair & Functi...
5        -                                                  -
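The name column in the printout is cut off at the default pandas column width; the stored values themselves are complete. A minimal sketch of two ways to see the full text, using a small hand-built frame in place of the df above:

```python
import pandas as pd

# Hand-built stand-in for one row of the aggregated result
df = pd.DataFrame(
    {'code': ['MK'],
     'name': ['Cell wall/membrane/envelope biosynthesis & Transcription']},
    index=pd.Index([3], name='query'))

# Option 1: read a single value directly; scalar access is never truncated
full_name = df.loc[3, 'name']
print(full_name)

# Option 2: lift the column-width limit only while printing
with pd.option_context('display.max_colwidth', None):
    print(df)
```

Using option_context rather than set_option restores the display setting once the block exits.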
