Merge my dataframes following the order of the FASTA SeqIDs

Posted 2024-10-02 12:33:20


I have two FASTA files of candidate sequences,

and two dataframes, Best_blast_candidates_hit_0042.csv and Best_blast_candidates_hit_0035.csv.

Here is a sample of each:

qseqid  sseqid  pident  length  mismatch    gapopen qstart  qend    sstart  send    evalue  bitscore    salltitles  staxids scientific_name scomnames   sskingdoms  Order
g44459.t1_0035_0035 XP_011687429.1  39.5    157 95  0   7   163 2   158 8.1e-27 129.8   uncharacterized protein LOC105449744 [Wasmannia auropunctata]   64793   Wasmannia auropunctata      Eukaryota   Hymenoptera
g17612.t1_0035_0042 XP_011699787.1  59.3    349 142 0   99  447 336 684 1.5e-120    442.6   uncharacterized protein LOC105457055 [Wasmannia auropunctata]   64793   Wasmannia auropunctata      Eukaryota   Hymenoptera
g29924.t1_0035_0042 XP_011871948.1  67.0    261 85  1   1   260 18  278 1.3e-100    375.6   uncharacterized protein LOC105564266, partial [Vollenhovia emeryi]  411798  Vollenhovia emeryi      Eukaryota   Hymenoptera
g47960.t1_0035_0035 XP_011860868.1  68.8    298 93  0   1   298 142 439 3.3e-116    427.6   uncharacterized protein LOC105558006 [Vollenhovia emeryi]   411798  Vollenhovia emeryi      Eukaryota   Hymenoptera
g28580.t1_0035_0042 XP_011883624.1  70.0    240 69  3   1   239 41  278 1.3e-86 328.9   uncharacterized protein LOC105570787 [Vollenhovia emeryi]   411798  Vollenhovia emeryi      Eukaryota   Hymenoptera

qseqid  sseqid  pident  length  mismatch    gapopen qstart  qend    sstart  send    evalue  bitscore    salltitles  staxids scientific_name scomnames   sskingdoms  Order
g34354.t1_0042_0035 XP_011699801.1  43.7    135 63  4   7   128 625 759 9.3e-17 96.3    LOW QUALITY PROTEIN 64793   Wasmannia auropunctata      Eukaryota   Hymenoptera
g34606.t1_0042_0035 XP_011871948.1  59.8    249 79  2   1   228 51  299 3.4e-81 310.8   uncharacterized protein LOC105564266, partial [Vollenhovia emeryi]  411798  Vollenhovia emeryi      Eukaryota   Hymenoptera
g13215.t1_0042_0042 XP_011883625.1  62.0    242 92  0   46  287 160 401 5.4e-82 313.9   uncharacterized protein LOC105570788, partial [Vollenhovia emeryi]  411798  Vollenhovia emeryi      Eukaryota   Hymenoptera
g35379.t1_0042_0035 XP_011858260.1  73.3    191 51  0   4   194 690 880 6.3e-76 293.1   uncharacterized protein LOC105555830 [Vollenhovia emeryi]   411798  Vollenhovia emeryi      Eukaryota   Hymenoptera
g13770.t1_0042_0042 XP_011883624.1  66.5    203 65  3   10  211 33  233 1.9e-65 258.5   uncharacterized protein LOC105570787 [Vollenhovia emeryi]   411798  Vollenhovia emeryi      Eukaryota   Hymenoptera

I need to merge them, but in the same order as the SeqIDs appear in the FASTA files.

For example, if FASTA file 1 contains:

>seq1_0035_0042
ATGGAGAGATAG
>seq6_0035_0035
ATGGATAGAGA

and FASTA file 2 contains:

>seq8_0042_0042
ATGGAGAGATAG
>seq3_0042_0035
ATGGATAGAGA

then I would like my dataframes merged in that order.

Example:

qseqid_1       sseqid_1       qseqid_2       sseqid_2       pident_1 pident_2 etc...
seq1_0035_0042 XP_011883678.1 seq8_0042_0042 XP_011883789.1 78.9     45.9     etc...
seq6_0035_0035 XP_011566754.1 seq3_0042_0035 XP_011566754.1 67.9     78.0     etc...
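
The row order in this example is simply the record order of each FASTA file, so the first step is to collect the IDs in file order. Here is a minimal sketch in plain Python, assuming every header line starts with `>` and the ID is the first whitespace-delimited token. `fasta_ids` is a hypothetical helper name, and the in-memory strings stand in for the real files.

```python
import io

def fasta_ids(handle):
    """Return record IDs in the order they appear in a FASTA stream."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after '>'.
            parts = line[1:].split()
            if parts:
                ids.append(parts[0])
    return ids

# In-memory stand-ins for the two FASTA files from the example.
fasta1 = io.StringIO(">seq1_0035_0042\nATGGAGAGATAG\n>seq6_0035_0035\nATGGATAGAGA\n")
fasta2 = io.StringIO(">seq8_0042_0042\nATGGAGAGATAG\n>seq3_0042_0035\nATGGATAGAGA\n")

ids_1 = fasta_ids(fasta1)
ids_2 = fasta_ids(fasta2)

# Pair records by position: first of file 1 with first of file 2, and so on.
pairs = list(zip(ids_1, ids_2))
```

With real files, the same function can be called on `open(path)` instead of the `io.StringIO` objects.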

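PS: not every SeqID in the FASTA files is present in the dataframes. If a sequence has no pair, can it still be added to the dataframe, with NaN in the _2 columns? Thanks for your help :)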
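
One possible way to get this behaviour with pandas is to reindex each table on the FASTA order and then concatenate side by side. This is a sketch under several assumptions: records are paired by position in the two FASTA files, each `qseqid` occurs at most once per table, and the toy tables and `XP_A`-style accessions below are made up for illustration. `align` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-ins for the two BLAST tables (values are made up).
df1 = pd.DataFrame({
    "qseqid": ["seq6_0035_0035", "seq1_0035_0042"],
    "sseqid": ["XP_A", "XP_B"],
    "pident": [67.9, 78.9],
})
df2 = pd.DataFrame({
    "qseqid": ["seq8_0042_0042"],  # seq3_0042_0035 has no hit in this table
    "sseqid": ["XP_C"],
    "pident": [45.9],
})

# Record IDs in FASTA order, e.g. collected from the '>' header lines.
ids_1 = ["seq1_0035_0042", "seq6_0035_0035"]
ids_2 = ["seq8_0042_0042", "seq3_0042_0035"]

def align(df, ids, suffix):
    """Reorder df to follow ids; IDs absent from df become all-NaN rows."""
    out = df.set_index("qseqid").reindex(ids)
    out.index.name = "qseqid"  # keep the column name after reset_index
    return out.reset_index().add_suffix(suffix)

left = align(df1, ids_1, "_1")
right = align(df2, ids_2, "_2")

# Both frames now share a 0..n-1 index, so axis=1 pairs rows by position.
merged = pd.concat([left, right], axis=1)
```

Here `reindex` does two jobs at once: it imposes the FASTA order and creates NaN rows for IDs with no BLAST hit, while the ID itself is kept in the `qseqid_1` or `qseqid_2` column. If the two FASTA files have different numbers of records, `pd.concat` pads the shorter side with NaN as well.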

