从utra序列中提取RNA

2024-06-25 23:37:20 发布

您现在位置:Python中文网/ 问答频道 /正文

我试图从数据库中检索特定的RNA UTR序列。 从数据库中,我得到了一个.dat文件,其中每个RNA UTR都由如下条目表示:

ID   5MMUR018955; SV 1; linear; mRNA; STD; MUS; 54 BP.
XX
AC   BR058092;
XX
DT   01-JUL-2009 (Rel. 9, Created)
DT   01-JUL-2009 (Rel. 9, Last updated, Version 1)
XX
DE   5'UTR in Mus musculus neutrophilic granule protein (Ngp), mRNA.
XX
DR   REFSEQ; NM_008694;
DR   UTRef; CR062409;
DR   GeneID; 18054;
XX
OS   Mus musculus (house mouse)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muroidea;
OC   Muridae; Murinae; Mus; Mus.
XX
UT   5'UTR; 1 exon(s)
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..54
FT                   /organism="Mus musculus"
FT                   /mol_type="mRNA"
FT                   /strain="C57BL/6"
FT                   /db_xref="taxon:10090"
FT   5'UTR           1..54
FT                   /source="REFSEQ::NM_008694:1..54"
FT                   /gene="Ngp"
FT                   /product="neutrophilic granule protein"
FT                   /gene_synonym="bectenecin"
FT                   /genome="chr9:110322312-110322365:+"
XX
SQ   Sequence 54 BP; 19 A; 9 C; 14 G; 12 T; 0 other;
     agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtcaga gaca              54
//

我有一个基因名列表(存储在FT /gene="Ngp"行中),我想用它来访问存储在SQ Sequence 54 BP; 19 A; 9 C; 14 G; 12 T; 0 other; agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtcaga gaca 54行中的序列

在检索之后,我想把这两个文件都写进一个新的文件中,即

^{pr2}$

在python中有什么简单的方法可以做到这一点吗?我已经和它战斗了一整天,但还没有真正取得任何进展,我将非常感谢你的帮助。在


Tags: 文件数据库序列rnabpxxftoc
3条回答

这里有一个健壮的解决方案,它在数据库文件中搜索基因列表,以fasta格式打印结果,并列出未找到的基因。在

请注意,数据库中可能存在同一基因名的多个序列记录,因此您可能需要额外的筛选来获得您希望获得的序列。在

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

data = "embl.dat"  #Path to EMBL database file
search = "gene_names.txt" #Path to file with search terms

#Load the search terms from file and strip linefeed characters
search_genes = open(search, 'r').read().splitlines() 
found_genes = []

#Search the EMBL database file
for record in SeqIO.parse(open(data, 'r'), 'embl'):
    UTR5 = [feature for feature in record.features if feature.type=="5'UTR"]
    for utr5feature in UTR5:
        for s in search_genes:
            genes = utr5feature.qualifiers['gene']
            if s in genes:
                found_genes.append(s)
                #Gene found. Print a modified copy of the record in the desired format
                print SeqRecord(record.seq, id="_".join(genes), name=record.name,
                                description=record.description).format('fasta')

#List any search terms that were not found in the database
for s in search_genes:
    if s not in found_genes:
        print s+" NOT FOUND IN DATABASE!"

我用biopython解析embl文件并提取信息

from Bio import SeqIO

input = "test.embl"  #change your input, here

#next if you had one sequence in the input file
seq = SeqIO.parse(open(input), "embl").next()

UTR5 = [feature for feature in seq.features if feature.type=="5'UTR"]

#you have only one 5'utr
genes = UTR5[0].qualifiers['gene']
#you get ['Ngp']

#Create SeqRecord
from Bio.SeqRecord import SeqRecord

#you may remove description, if not required
new_record = SeqRecord(seq.seq, id= "_".join(genes), 
             name=seq.name, description=seq.description)

print new_record.format("fasta")

你会得到:

^{pr2}$

这是EMBL格式:

ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt

您可以使用BioPython SeqIO从中提取信息:

http://biopython.org/wiki/SeqIO

相关问题 更多 >