My Python code for extracting nucleotide sequences based on tblastn hits is too slow

from Bio import SeqIO
from Bio.Blast import NCBIXML

infile_path = '/home/edson/ungulate/ungulate.fa'  # this is a file which contain unaligned nucleotide sequences
outfile_path = '/home/edson/ungulate/tblastn_result.fa'

for seq_record in SeqIO.parse(infile_path, 'fasta'):
    flag = seq_record.description  # a flag is sequence identifier in a fasta file format
    with open(outfile_path, 'a') as outfile:
        with open('/home/edson/ungulate/tblastn_result.xml') as tblastn_file:
            tblastn_records = NCBIXML.parse(tblastn_file)
            for tblastn_record in tblastn_records:
                for alignment in tblastn_record.alignments[:4]:
                    for hsp in alignment.hsps:
                        if flag in alignment.title:  # this cross check if sequence identifier is present in an XML file
                            sub_record = seq_record.seq[hsp.sbjct_start:hsp.sbjct_end]  # this takes sequences in an infile path and slice them based on tblastn output
                            outfile.write('>' + seq_record.description + '\n')
                            outfile.write(str(sub_record + '\n'))

1 answer

User

#1 · Posted 2024-09-29 22:04:11

There are at least two obvious bottlenecks: on every iteration of the outer loop, you

reopen outfile
reopen and re-parse tblastn_file

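Simply moving these operations out of the outer loop should improve performance significantly (provided, of course, that the outer loop runs more than once).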
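
A minimal sketch of that change, under some assumptions: `extract_hits` and its parameter names are hypothetical, and the function takes already-parsed records and an open output handle so that all opening and parsing happens once, outside the per-sequence loop. Because `NCBIXML.parse` returns a one-shot iterator, the sketch copies the BLAST records into a list first; this trades memory for speed and may not suit a very large XML file. The slicing is kept exactly as in the question and is not checked here.

```python
def extract_hits(seq_records, blast_records, out_handle, max_alignments=4):
    """Write the sliced hit region of every matching sequence to out_handle.

    seq_records   : iterable of records with .description and .seq
    blast_records : iterable of records with .alignments (each alignment
                    has .title and .hsps)
    Returns the number of FASTA entries written.
    """
    # Parse/consume the BLAST results once, not once per sequence.
    blast_records = list(blast_records)
    written = 0
    for seq_record in seq_records:
        flag = seq_record.description
        for blast_record in blast_records:
            for alignment in blast_record.alignments[:max_alignments]:
                for hsp in alignment.hsps:
                    if flag in alignment.title:
                        # Same slice as in the question's code.
                        sub = seq_record.seq[hsp.sbjct_start:hsp.sbjct_end]
                        out_handle.write('>' + seq_record.description + '\n')
                        out_handle.write(str(sub) + '\n')
                        written += 1
    return written
```

With Biopython this would be called inside a single `with open(outfile_path, 'a') as outfile, open('/home/edson/ungulate/tblastn_result.xml') as tblastn_file:` block, passing `SeqIO.parse(infile_path, 'fasta')`, `NCBIXML.parse(tblastn_file)` and `outfile` as the three arguments.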

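Another possible improvement: you test `flag in alignment.title` on every iteration over `alignment.hsps`. That test gives the same result for all hsps of a given alignment, so it is better to do it first, i.e.: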

for alignment in tblastn_record.alignments[:4]:
    if flag in alignment.title:  
        for hsp in alignment.hsps:
            # etc...
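
The reordered loop can also be wrapped in a small generator, which keeps the title check in one place. This is only a sketch: `matching_hsps` is a hypothetical helper name, and it relies on nothing beyond the `.alignments`, `.title` and `.hsps` attributes already used in the question.

```python
def matching_hsps(blast_records, flag, max_alignments=4):
    """Yield (alignment, hsp) pairs for alignments whose title contains flag."""
    for blast_record in blast_records:
        for alignment in blast_record.alignments[:max_alignments]:
            # One membership test per alignment instead of one per HSP.
            if flag not in alignment.title:
                continue
            for hsp in alignment.hsps:
                yield alignment, hsp
```

The caller then iterates with `for alignment, hsp in matching_hsps(tblastn_records, flag):` and does the slicing and writing inside that loop.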
