Extracting data from a gb (GenBank) file with Biopython

Posted 2024-10-01 13:46:17


I have a gb (GenBank) file and need to extract some specific features from it: the names and sizes of the protein-coding genes.

LOCUS       NC_008137              15318 bp    DNA     linear   MAM 15-APR-2009
DEFINITION  Phalanger interpositus mitochondrion, complete genome.
ACCESSION   NC_008137
VERSION     NC_008137.1  GI:108793518
DBLINK      Project: 17043
KEYWORDS    .
SOURCE      mitochondrion Phalanger interpositus (Stein's cuscus)
  ORGANISM  Phalanger interpositus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Metatheria; Diprotodontia; Phalangeridae; Phalanger.
REFERENCE   1  (bases 1 to 15318)
  AUTHORS   Munemasa,M., Nikaido,M., Donnellan,S., Austin,C.C., Okada,N. and
            Hasegawa,M.
  TITLE     Phylogenetic analysis of diprotodontian marsupials based on
            complete mitochondrial genomes
  JOURNAL   Genes Genet. Syst. 81 (3), 181-191 (2006)
   PUBMED   16905872
REFERENCE   2  (bases 1 to 15318)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (12-JUN-2006) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 15318)
  AUTHORS   Munemasa,M., Nikaido,M., Donnellan,S., Austin,C.C., Okada,N. and
            Hasegawa,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-NOV-2005) Tokyo Institute of Technology, Graduate
            School of Bioscience and Biotechnology; Nagatsuta-cho 4259-B-21,
            Midori-ku, Kanagawa 226-8501, Japan
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from AB241057.
            Genome sequence lacks part of non-coding region.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..15318
                     /organism="Phalanger interpositus"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:356347"
                     /tissue_type="liver"
                     /common="Stein's cuscus"
     tRNA            1..69
                     /product="tRNA-Phe"
     rRNA            72..1018
                     /product="s-rRNA"
                     /note="12S ribosomal RNA"
     tRNA            1020..1088
                     /product="tRNA-Val"
     rRNA            1089..2653
                     /product="l-rRNA"
                     /note="16S ribosomal RNA"
     tRNA            2654..2727
                     /product="tRNA-Leu"
                     /codon_recognized="UUR"
     gene            2729..3685
                     /gene="ND1"
                     /db_xref="GeneID:4117948"
     CDS             2729..3685
                     /gene="ND1"
                     /codon_start=1
                     /transl_table=2
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="YP_637062.1"
                     /db_xref="GI:108793519"
                     /db_xref="GeneID:4117948"
                     /translation="MFIINLLMYIIPILLAIAFLTLVERKALGYMQFRKGPNVVGPYG
                     LLQPIADGMKLFSKEPLQPVTSSTTMFIIAPTLALTLSLTMWTPLPMPHSLIDLNLGL
                     LFILALSGLSVYSILWSGWASNSKYALMGALRAVAQTISYEVTLAIILLSIMLINGSF
                     TLKNLITTQENMWLIITTWPLVMMWYVSTLAETNRAPLDLTEGESELVSGFNVEYAAG
                     PFAMFFLAEYANIMLMNAMTTILFLGSSINHNFTHLNTLSFMTKTIALTFLFLWVRAS
                     YPRFRYDQLMHLLWKNFLPMTLAMCLWFISIPIALSCIPPQI"
     misc_feature    2729..3682
                     /gene="ND1"
                     /note="NADH dehydrogenase; Region: NADHdh; cl00469"
                     /db_xref="CDD:186018"
     tRNA            3686..3751
                     /product="tRNA-Ile"
     tRNA            complement(3750..3821)
                     /product="tRNA-Gln"
     tRNA            3821..3878
                     /product="tRNA-Met"
     gene            3889..4932
                     /gene="ND2"
                     /db_xref="GeneID:4117949"
     CDS             3889..4932
                     /gene="ND2"
                     /codon_start=1
                     /transl_table=2
                     /product="NADH dehydrogenase subunit 2"
                     /protein_id="YP_637063.1"
                     /db_xref="GI:108793520"
                     /db_xref="GeneID:4117949"
                     /translation="MSPYILLIMLTSLLLGTSLTLFSNHWLTAWMGLEINTLAIIPMM
                     TYPNHPRATESAIKYFLTQSTASMMLMFAIINNAWMTNQWTLLQTSDQTSSTIMTLAL
                     AMKLGLAPFHFWVPEVTQGIPLTSGMILLTWQKIAPTSLMYQISPSLNMKILVMLALL
                     STILGGWGGLNQTHMRKILAYSSIAHMGWMTIIILINPTLTLLNLAIYITTTLTLFLA
                     LNHSSITKIKSLANLWNKSSSMTIVIALTLLSLGGLPPLTGFMPKWLILQELITYNNI
                     ATATMMAMSALLNLFFYMRIIYTTTLTMPPSINNSKLQWPHPQTKTTNIIPLLTIISS
                     FLLPLTPLSITLS"

I tried using SeqFeature and sub-features, but it did not work.

From this file I should get ND1 with 2729..3685, ND2 with 3889..4932, and so on for any further genes.

I am new to Biopython and would appreciate help with how to do this.


1 answer
User
#1 · Posted 2024-10-01 13:46:17

The GenBank file you posted is incomplete: parts of it are missing and it has no // terminator line, so the parser chokes when it tries to read it.
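
A quick way to spot this kind of truncation before parsing is to look at the last non-blank line of the file, which should be the // terminator in a complete GenBank record. The sketch below is only an illustration: the helper name has_genbank_terminator is made up for this example, and it checks the terminator only, not whether the sequence (ORIGIN) section is present.

```python
def has_genbank_terminator(text):
    """Return True if the last non-blank line of `text` is the '//' terminator."""
    # Ignore trailing blank lines so a final newline does not hide the terminator.
    lines = [line for line in text.splitlines() if line.strip()]
    return bool(lines) and lines[-1].strip() == "//"


# Example use with a file on disk (the path is an assumption):
# with open("NC_008137.gbk") as handle:
#     print(has_genbank_terminator(handle.read()))
```

For the excerpt in the question, whose last line is the end of a /translation qualifier, this check returns False.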

I got the correct, complete file for the Phalanger interpositus mitochondrion from here.
Then (Python 3 code):

>>> from Bio import SeqIO
>>> arch = "C:/code/NC_008137.gbk"
>>> record = SeqIO.parse(arch, "genbank")
>>> rec = next(record)                       # there is only one record
>>> for f in rec.features:
...     if f.type == 'gene':
...         print(f.qualifiers['gene'], f.location)
...
['ND1'] [2728:3685]
['ND2'] [3888:4932]
['COX1'] [5365:6919]
['COX2'] [7052:7737]
['ATP8'] [7798:8005]
['ATP6'] [7959:8640]
['COX3'] [8639:9423]
['ND3'] [9488:9837]
['ND4L'] [9906:10203]
['ND4'] [10196:11574]
['ND5'] [11773:13582]
['ND6'] [13578:14082]
['CYTB'] [14155:15301]
>>> 
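
Biopython locations are 0-based and end-exclusive, so [2728:3685] in the output above corresponds to 2729..3685 in the GenBank file. Since the question asks for the protein-coding genes and their sizes, one possible follow-up is to filter on CDS features and report 1-based coordinates together with the length. This is a sketch, not part of the original answer: the helper names coding_genes and print_coding_genes are hypothetical, the file path is whatever you saved the record as, and a placeholder name is used for any CDS without a /gene qualifier.

```python
def coding_genes(features):
    """Yield (gene name, start, end, size in bp) for every CDS feature.

    start and end are 1-based and inclusive, as written in the GenBank file.
    """
    for feature in features:
        if feature.type != "CDS":  # skip tRNA, rRNA, gene, misc_feature, ...
            continue
        # Qualifier values are lists; fall back to a placeholder if /gene is absent.
        name = feature.qualifiers.get("gene", ["unknown"])[0]
        start = int(feature.location.start) + 1  # Biopython start is 0-based
        end = int(feature.location.end)          # Biopython end is exclusive
        yield name, start, end, len(feature.location)


def print_coding_genes(path):
    """Read a single-record GenBank file and print its protein-coding genes."""
    from Bio import SeqIO  # imported here so coding_genes() works without Biopython

    record = SeqIO.read(path, "genbank")  # raises ValueError unless exactly one record
    for name, start, end, size in coding_genes(record.features):
        print(f"{name}\t{start}..{end}\t{size} bp")
```

Calling print_coding_genes("C:/code/NC_008137.gbk") with the path from the answer should print one line per CDS. For ND1 (2729..3685) the size works out to 3685 - 2729 + 1 = 957 bp.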
