How do I replace a dataframe column according to a priority order?

Posted 2024-06-01 08:27:44


I have a dataframe column df["Annotations"] that looks like this:

missense_variant&splice_region_variant
stop_gained&splice_region_variant
splice_acceptor_variant&coding_sequence_variant&intron_variant
splice_donor_variant&splice_acceptor_variant&coding_sequence_variant&5_prime_UTR_variant&intron_variant
missense_variant&NMD_transcript_variant
frameshift_variant&splice_region_variant
splice_acceptor_variant&intron_variant
splice_acceptor_variant&coding_sequence_variant
stop_lost&3_prime_UTR_variant
missense_variant
splice_region_variant

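I want to replace this column, or add a new one, based on a priority order. The priorities are as follows (a lower Rank means a higher priority):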

Type                 Rank
frameshift_variant      1
stop_gained             2
splice_region_variant   3
splice_acceptor_variant 4
splice_donor_variant    5
missense_variant        6
coding_sequence_variant 7

I want to either replace df['Annotations'] or add a new column df['Anno_prio'] containing:

splice_region_variant
stop_gained
splice_acceptor_variant
splice_acceptor_variant
missense_variant
frameshift_variant
splice_acceptor_variant
splice_acceptor_variant
stop_lost
missense_variant
splice_region_variant

What I have tried so far is replacing one term combination at a time:

df['Annotations'] = df['Annotations'].str.replace('missense_variant&splice_region_variant', 'splice_region_variant')

Is there another way to do this with pandas?


2 answers

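The idea is to split each value on '&' and, in a dict comprehension, map every term to its rank with d.get, using one more than the maximum Rank as the default for terms that have no rank. Then take the key with the minimum value in that dict: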

d = df1.set_index('Type')['Rank'].to_dict()
max1 = df1['Rank'].max()+1    

def f(x):
    # map each term to its rank; unranked terms get max1 so they sort last
    d1 = {y: d.get(y, max1) for y in x.split('&')}
    #https://stackoverflow.com/a/280156/2901002
    return min(d1, key=d1.get)

df['Anno_prio'] = df['Annotations'].apply(f)
print (df)
                                          Annotations                Anno_prio
0              missense_variant&splice_region_variant    splice_region_variant
1                   stop_gained&splice_region_variant              stop_gained
2   splice_acceptor_variant&coding_sequence_varian...  splice_acceptor_variant
3   splice_donor_variant&splice_acceptor_variant&c...  splice_acceptor_variant
4             missense_variant&NMD_transcript_variant         missense_variant
5            frameshift_variant&splice_region_variant       frameshift_variant
6              splice_acceptor_variant&intron_variant  splice_acceptor_variant
7     splice_acceptor_variant&coding_sequence_variant  splice_acceptor_variant
8                       stop_lost&3_prime_UTR_variant                stop_lost
9                                    missense_variant         missense_variant
10                              splice_region_variant    splice_region_variant
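
Neither df nor df1 is constructed in the answer, so here is a minimal self-contained sketch of the same idea. The data is copied from the question; the helper name pick_top is made up for illustration. It calls min directly on the split list instead of building an intermediate dict, which gives the same result because min returns the first of several equal candidates.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_gained&splice_region_variant",
    "splice_acceptor_variant&coding_sequence_variant&intron_variant",
    "splice_donor_variant&splice_acceptor_variant&coding_sequence_variant"
    "&5_prime_UTR_variant&intron_variant",
    "missense_variant&NMD_transcript_variant",
    "frameshift_variant&splice_region_variant",
    "splice_acceptor_variant&intron_variant",
    "splice_acceptor_variant&coding_sequence_variant",
    "stop_lost&3_prime_UTR_variant",
    "missense_variant",
    "splice_region_variant",
]})

# Priority table from the question (lower Rank = higher priority).
df1 = pd.DataFrame({
    "Type": [
        "frameshift_variant", "stop_gained", "splice_region_variant",
        "splice_acceptor_variant", "splice_donor_variant",
        "missense_variant", "coding_sequence_variant",
    ],
    "Rank": [1, 2, 3, 4, 5, 6, 7],
})

d = df1.set_index("Type")["Rank"].to_dict()
max1 = df1["Rank"].max() + 1  # default rank for terms missing from df1


def pick_top(annotation):
    """Return the term with the lowest rank (hypothetical helper name)."""
    terms = annotation.split("&")
    # Ties (e.g. two unranked terms) resolve to the first term in the string.
    return min(terms, key=lambda t: d.get(t, max1))


df["Anno_prio"] = df["Annotations"].apply(pick_top)
print(df["Anno_prio"].tolist())
```

If no term in a row has a rank (row 8 here), the first term in the string is returned, which matches the stop_lost result above.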

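A pandas-only solution is to use DataFrame.explode together with Series.map, sort by the mapped rank, and finally drop the duplicated index values and sort the index: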

d = df1.set_index('Type')['Rank'].to_dict()

df = (df.assign(Anno_prio = df['Annotations'].str.split('&'))
        .explode('Anno_prio')
        .assign(new = lambda x: x['Anno_prio'].map(d))
        .sort_values('new')
        )
df = df[~df.index.duplicated()].sort_index()

print (df)
                                          Annotations  \
0              missense_variant&splice_region_variant   
1                   stop_gained&splice_region_variant   
2   splice_acceptor_variant&coding_sequence_varian...   
3   splice_donor_variant&splice_acceptor_variant&c...   
4             missense_variant&NMD_transcript_variant   
5            frameshift_variant&splice_region_variant   
6              splice_acceptor_variant&intron_variant   
7     splice_acceptor_variant&coding_sequence_variant   
8                       stop_lost&3_prime_UTR_variant   
9                                    missense_variant   
10                              splice_region_variant   

                  Anno_prio  new  
0     splice_region_variant  3.0  
1               stop_gained  2.0  
2   splice_acceptor_variant  4.0  
3   splice_acceptor_variant  4.0  
4          missense_variant  6.0  
5        frameshift_variant  1.0  
6   splice_acceptor_variant  4.0  
7   splice_acceptor_variant  4.0  
8                 stop_lost  NaN  
9          missense_variant  6.0  
10    splice_region_variant  3.0  
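
In the output above, row 8 keeps stop_lost only because rows with a NaN rank sort last. As a variation (not part of the original answer), the sketch below makes that fallback explicit: it fills missing ranks with max + 1 and uses a stable sort so equal ranks keep their original order. The name fallback and the shortened three-row sample are assumptions made for illustration.

```python
import pandas as pd

# Shortened sample, using terms from the question.
df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_lost&3_prime_UTR_variant",
    "splice_donor_variant&splice_acceptor_variant&coding_sequence_variant",
]})
df1 = pd.DataFrame({
    "Type": [
        "frameshift_variant", "stop_gained", "splice_region_variant",
        "splice_acceptor_variant", "splice_donor_variant",
        "missense_variant", "coding_sequence_variant",
    ],
    "Rank": [1, 2, 3, 4, 5, 6, 7],
})

d = df1.set_index("Type")["Rank"].to_dict()
fallback = df1["Rank"].max() + 1  # explicit rank for unranked terms

result = (
    df.assign(Anno_prio=df["Annotations"].str.split("&"))
      .explode("Anno_prio")
      # unranked terms get the fallback rank instead of NaN
      .assign(new=lambda x: x["Anno_prio"].map(d).fillna(fallback))
      # mergesort is stable, so ties keep their original left-to-right order
      .sort_values("new", kind="mergesort")
)
# keep the best-ranked row per original index, then restore the row order
result = result[~result.index.duplicated()].sort_index()
print(result[["Anno_prio", "new"]])
```

With this variation the new column holds 8.0 rather than NaN for rows where no term is ranked.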

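Procedure:

  1. Split on "&" and use explode to turn each element of the list into its own row
  2. Use map with a Series to convert each Type to its Rank
  3. Sort by rank, then drop duplicates on the original index
  4. Fill any remaining NA with the first type listed in Annotations

anno_map = df_rank.set_index('Type')['Rank']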
obj_anno_split = df['Annotations'].str.split('&')
df_anno_map = obj_anno_split.explode().reset_index()
# create a new 'rank' column by mapping each type to its rank
df_anno_map['rank'] = df_anno_map['Annotations'].map(anno_map)

# keep the best-ranked row for every original index: drop unranked rows, sort, then drop_duplicates
df_anno_map = (df_anno_map.dropna()
                  .sort_values('rank')
                  .drop_duplicates('index', keep='first')
                  .set_index('index')
                  .sort_index())

# assign Anno_prio; values are aligned on the index
df['Anno_prio'] = df_anno_map['Annotations']

# fill NA with the first item of the split list
df['Anno_prio'] = df['Anno_prio'].combine_first(obj_anno_split.str[0])

# print(df_anno_map)
# print(df)

Result:

print(df_anno_map)

                  Annotations  rank
index                               
0        splice_region_variant   3.0
1                  stop_gained   2.0
2      splice_acceptor_variant   4.0
3      splice_acceptor_variant   4.0
4             missense_variant   6.0
5           frameshift_variant   1.0
6      splice_acceptor_variant   4.0
7      splice_acceptor_variant   4.0
9             missense_variant   6.0
10       splice_region_variant   3.0

print(df)
                                         Annotations                Anno_prio
0              missense_variant&splice_region_variant    splice_region_variant
1                   stop_gained&splice_region_variant              stop_gained
2   splice_acceptor_variant&coding_sequence_varian...  splice_acceptor_variant
3   splice_donor_variant&splice_acceptor_variant&c...  splice_acceptor_variant
4             missense_variant&NMD_transcript_variant         missense_variant
5            frameshift_variant&splice_region_variant       frameshift_variant
6              splice_acceptor_variant&intron_variant  splice_acceptor_variant
7     splice_acceptor_variant&coding_sequence_variant  splice_acceptor_variant
8                       stop_lost&3_prime_UTR_variant                stop_lost
9                                    missense_variant         missense_variant
10                              splice_region_variant    splice_region_variant
