Hello, I have a DataFrame df1, for example:
Acc_number
ACC1.1_CP_Sp1_1
ACC2.1_CP_Sp1_1
ACC3.1_CP_Sp1_1
ACC4.1_CP_Sp1_1
and another DataFrame, df2, for example:
Cluster_nb SeqName
Cluster1 YP_009216714
Cluster1 YP_002051918
Cluster1 JZSA01005235.1:37071-37973(-):Sp1_1
Cluster1 NW_014464344.1:68901-69716(-):Sp2_3
Cluster1 YP_001956729
Cluster1 ACC1.1_CP_Sp1_1
Cluster1 YP_009213712
Cluster2 ACC2.1_CP_Sp1_1
Cluster2 NR_014464231.1:35866-36717(-):Sp1_1
Cluster2 NR_014464232.1:35889-36788(-):Sp1_1
Cluster2 YP_009213728
Cluster3 ACC3.1_CP_Sp1_1
Cluster3 NK_014464231.1:35772-38898(-):Sp1_2
Cluster3 NZ_014464232.1:3533-78787(+):Sp1_2
Cluster3 YP_009213723
Cluster3 YP_009213739
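I want to check, for each Acc_number in df1, whether the cluster (the Cluster_nb group in df2) that contains Acc_number[i] also contains another sequence that has a (+ or -): part in its name and the same ending (the part after _CP_ in Acc_number).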
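The comparison rests on pulling out two endings: the part after _CP_ in an Acc_number, and the part after (+): or (-): in a SeqName. A minimal sketch of that extraction follows. The helper names acc_suffix and stranded_suffix are hypothetical, and the patterns are assumptions based only on the sample names shown here.

```python
import re

def acc_suffix(acc_number):
    """Return the part after '_CP_', or None if the marker is absent."""
    m = re.search(r"_CP_(.+)$", acc_number)
    return m.group(1) if m else None

def stranded_suffix(seq_name):
    """Return the part after '(+):' or '(-):', or None if there is no strand marker."""
    m = re.search(r"\([+-]\):(.+)$", seq_name)
    return m.group(1) if m else None

print(acc_suffix("ACC1.1_CP_Sp1_1"))
print(stranded_suffix("JZSA01005235.1:37071-37973(-):Sp1_1"))
print(stranded_suffix("YP_009216714"))
```

Comparing the two extracted endings for equality avoids substring accidents, such as Sp1_1 matching inside a longer ending like Sp1_10.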
For example, for ACC1.1_CP_Sp1_1 as i, I look up its cluster by doing:
df=df2.loc[df2['SeqName']==i]
Cluster_number=df['Cluster_nb'].iloc[0]
df3=df2.loc[df2['Cluster_nb']==Cluster_number]
print(df3)
Cluster_nb SeqName
Cluster1 YP_009216714
Cluster1 YP_002051918
Cluster1 JZSA01005235.1:37071-37973(-):Sp1_1
Cluster1 NW_014464344.1:68901-69716(-):Sp2_3
Cluster1 YP_001956729
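The sequence in row 3, JZSA01005235.1:37071-37973(-):Sp1_1, has the same Sp1_1 pattern at its end, so the answer here is yes: ACC1.1_CP_Sp1_1 is in the same cluster as another sequence that has the same ending (and has (- or +): in its name).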
For ACC3.1_CP_Sp1_1 as i, doing the same:
df=df2.loc[df2['SeqName']==i]
Cluster_number=df['Cluster_nb'].iloc[0]
df3=df2.loc[df2['Cluster_nb']==Cluster_number]
print(df3)
Cluster3 ACC3.1_CP_Sp1_1
Cluster3 NK_014464231.1:35772-38898(-):Sp1_2
Cluster3 NZ_014464232.1:3533-78787(+):Sp1_2
Cluster3 YP_009213723
Cluster3 YP_009213739
I see that no other sequence in the cluster has the same ending as ACC3.1_CP_Sp1_1, so the answer is no.
The result should be summarized in df3:
Acc_number present cluster
ACC1.1_CP_Sp1_1 Yes Cluster1
ACC2.1_CP_Sp1_1 Yes Cluster2
ACC3.1_CP_Sp1_1 No NaN
ACC4.1_CP_Sp1_1 No NaN
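The two manual checks above can be wrapped in a small per-accession helper. This is only a sketch under stated assumptions: check_accession is a hypothetical name, every Acc_number is assumed to contain _CP_, and each accession is assumed to appear in at most one cluster. The data are the sample rows from the question.

```python
import pandas as pd

df1 = pd.DataFrame({"Acc_number": ["ACC1.1_CP_Sp1_1", "ACC2.1_CP_Sp1_1",
                                   "ACC3.1_CP_Sp1_1", "ACC4.1_CP_Sp1_1"]})
df2 = pd.DataFrame({
    "Cluster_nb": ["Cluster1"] * 7 + ["Cluster2"] * 4 + ["Cluster3"] * 5,
    "SeqName": [
        "YP_009216714", "YP_002051918",
        "JZSA01005235.1:37071-37973(-):Sp1_1",
        "NW_014464344.1:68901-69716(-):Sp2_3",
        "YP_001956729", "ACC1.1_CP_Sp1_1", "YP_009213712",
        "ACC2.1_CP_Sp1_1",
        "NR_014464231.1:35866-36717(-):Sp1_1",
        "NR_014464232.1:35889-36788(-):Sp1_1",
        "YP_009213728",
        "ACC3.1_CP_Sp1_1",
        "NK_014464231.1:35772-38898(-):Sp1_2",
        "NZ_014464232.1:3533-78787(+):Sp1_2",
        "YP_009213723", "YP_009213739",
    ],
})

MISSING = float("nan")  # shown as NaN in the summary table

def check_accession(acc, clusters):
    """Return ('Yes', cluster) if acc shares its cluster with a stranded name ending in its suffix."""
    hit = clusters.loc[clusters["SeqName"] == acc, "Cluster_nb"]
    if hit.empty:
        return "No", MISSING  # accession is not in any cluster
    cluster = hit.iloc[0]
    suffix = acc.split("_CP_", 1)[1]  # assumes '_CP_' is present
    members = clusters.loc[clusters["Cluster_nb"] == cluster, "SeqName"]
    # keep only names carrying a (+): or (-): strand marker
    stranded = members[members.str.contains(r"\([+-]\):", regex=True)]
    found = stranded.str.endswith(":" + suffix).any()
    return ("Yes", cluster) if found else ("No", MISSING)

result = pd.DataFrame(
    [(acc, *check_accession(acc, df2)) for acc in df1["Acc_number"]],
    columns=["Acc_number", "present", "cluster"],
)
print(result)
```

On the sample data this gives Yes for ACC1.1 and ACC2.1 and No for ACC3.1 and ACC4.1, matching the table above.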
Thank you very much for your help.
I tried:
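import re
import pandas as pd

rows = []  # collect matches here; DataFrame.append was removed in pandas 2.0
for CP in df1['Acc_number']:
    df = df2.loc[df2['SeqName'] == CP]
    if df.empty:
        continue  # this accession is not in any cluster
    Cluster_number = df['Cluster_nb'].iloc[0]
    df3 = df2.loc[df2['Cluster_nb'] == Cluster_number]
    suffix = re.sub('.*_CP_', '', CP)  # part after _CP_, e.g. Sp1_1
    for a in df3['SeqName']:
        # only names with a (+) or (-) strand marker are candidates
        if ('(+)' in a or '(-)' in a) and suffix in a:
            rows.append({"Cluster": Cluster_number, "Acc_nb": CP, "present": 'yes'})
            print(CP, 'yes')
new_df = pd.DataFrame(rows)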
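I commented the code itself. In outline: get a unique identifier for each row, merge the dataframes, and keep only the columns you are interested in.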
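A minimal sketch of that outline, not necessarily the answerer's exact code. It assumes the unique identifier is the ending after _CP_ (df1) or after (+): / (-): (df2), and that each accession appears in at most one cluster. The column name key and the intermediate names are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"Acc_number": ["ACC1.1_CP_Sp1_1", "ACC2.1_CP_Sp1_1",
                                   "ACC3.1_CP_Sp1_1", "ACC4.1_CP_Sp1_1"]})
df2 = pd.DataFrame({
    "Cluster_nb": ["Cluster1"] * 7 + ["Cluster2"] * 4 + ["Cluster3"] * 5,
    "SeqName": [
        "YP_009216714", "YP_002051918",
        "JZSA01005235.1:37071-37973(-):Sp1_1",
        "NW_014464344.1:68901-69716(-):Sp2_3",
        "YP_001956729", "ACC1.1_CP_Sp1_1", "YP_009213712",
        "ACC2.1_CP_Sp1_1",
        "NR_014464231.1:35866-36717(-):Sp1_1",
        "NR_014464232.1:35889-36788(-):Sp1_1",
        "YP_009213728",
        "ACC3.1_CP_Sp1_1",
        "NK_014464231.1:35772-38898(-):Sp1_2",
        "NZ_014464232.1:3533-78787(+):Sp1_2",
        "YP_009213723", "YP_009213739",
    ],
})

# Step 1: identifier per row. For df1 it is the ending after _CP_;
# for df2 it is the ending after (+): or (-):, and rows without one are dropped.
left = df1.assign(key=df1["Acc_number"].str.extract(r"_CP_(.+)$", expand=False))
stranded = df2.assign(
    key=df2["SeqName"].str.extract(r"\([+-]\):(.+)$", expand=False)
).dropna(subset=["key"])

# Step 2: attach the cluster of each accession (NaN if it is in no cluster).
acc_cluster = left.merge(df2, left_on="Acc_number", right_on="SeqName", how="left")

# Step 3: look for a stranded sequence with the same (cluster, ending) pair.
partners = stranded[["Cluster_nb", "key"]].drop_duplicates()
merged = acc_cluster.merge(partners, on=["Cluster_nb", "key"],
                           how="left", indicator=True)

# Step 4: keep only the columns of interest.
merged["present"] = merged["_merge"].eq("both").map({True: "Yes", False: "No"})
merged["cluster"] = merged["Cluster_nb"].where(merged["present"] == "Yes")
df3 = merged[["Acc_number", "present", "cluster"]]
print(df3)
```

Deduplicating the (cluster, ending) pairs before the second merge keeps one output row per accession even when a cluster holds several matching sequences, as Cluster2 does.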