From one file to multiple dictionaries in Python

Published 2024-09-28 18:56:45

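I am trying to write a Python script that takes a special type of file as input.
The file contains information about multiple genes. The information for one gene is spread over several lines, and the number of lines differs from gene to gene. For example: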

 gene            join(373616..374161,1..174)
                 /locus_tag="AM1_A0001"
                 /db_xref="GeneID:5685236"
 CDS             join(373616..374161,1..174)
                 /locus_tag="AM1_A0001"
                 /codon_start=1
                 /transl_table=11
                 /product="glutathione S-transferase, putative"
                 /protein_id="YP_001520660.1"
                 /db_xref="GI:158339653"
                 /db_xref="GeneID:5685236"
                 /translation="MKIVSFKICPFVQRVTALLEAKGIDYDIEYIDLSHKPQWFLDLS
                 PNAQVPILITDDDDVLFESDAIVEFLDEVVGTPLSSDNAVKKAQDRAWSYLATKHYLV
                 QCSAQRSPDAKTLEERSKKLSKAFGKIKVQLGESRYINGDDLSMVDIAWLPLLHRAAI
                 IEQYSGYDFLEEFPKVKQWQQHLLSTGIAEKSVPEDFEERFTAFYLAESTCLGQLAKS
                 KNGEACCGTAECTVDDLGCCA"
 gene            241..381
                 /locus_tag="AM1_A0002"
                 /db_xref="GeneID:5685411"
 CDS             241..381
                 /locus_tag="AM1_A0002"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="YP_001520661.1"
                 /db_xref="GI:158339654"
                 /db_xref="GeneID:5685411"
                 /translation="MLINPEDKQVEIYRPGQDVELLQSPSTISGADVLPEFSLNLEWI
                 WR"
 gene            388..525
                 /locus_tag="AM1_A0003"
                 /db_xref="GeneID:5685412"
 CDS             388..525
                 /locus_tag="AM1_A0003"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="YP_001520662.1"
                 /db_xref="GI:158339655"
                 /db_xref="GeneID:5685412"
                 /translation="MKEAGFSENSRSREGQPKLAKDAAIAKPYLVAMTAELQIMATET
                 L"

What I want is a list of dictionaries, where each dictionary holds the information for one gene, for example:

^{pr2}$

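I have no idea how to make Python recognise when one gene (and its dictionary) is finished and the next one should begin.
Can anyone help me? Is there a way to do this?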

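To clarify: I know how to extract the information I want, store it in variables, and put it into a dictionary. I just don't know how to tell Python to create a new dictionary for each gene.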
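One possible way to handle the boundary, sketched under the assumption that the file looks exactly like the sample above (feature key in the first column, qualifiers on indented lines starting with `/`): treat every line whose first field is `gene` as the start of a new record, open a fresh dictionary at that point, and keep filling that dictionary until the next `gene` line appears. The function name `split_into_genes` and the keys `location`, `locus`, and `product` are made up for this illustration.

```python
def split_into_genes(lines):
    """Group feature-table lines into one dictionary per gene (illustrative sketch)."""
    genes = []
    current = None  # dictionary of the gene currently being read

    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines

        if fields[0] == "gene":
            # Boundary: a "gene" line closes the previous record and opens a new one.
            current = {"location": fields[1]}
            genes.append(current)
        elif current is not None and fields[0].startswith("/locus_tag="):
            current["locus"] = line.split('"')[1]
        elif current is not None and fields[0].startswith("/product="):
            # Assumes the product fits on a single line, as in the sample.
            current["product"] = line.split('"')[1]

    return genes
```

Because `current` always refers to the most recently appended dictionary, the qualifiers of the `CDS` block that follows a `gene` line end up in the same dictionary. Multi-line values and other qualifiers are deliberately left out of this sketch.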


2 answers

In case anyone is interested, here is the beginner's solution I arrived at with the help of the comments I received:

# Read the whole annotation file into one string.
with open("example.embl", "r") as annot:
    embl = annot.read()

annotation = []

# Splitting on "FT   gen" leaves every gene block starting with "e" plus 12 spaces.
embl_list = embl.split("FT   gen")

for item in embl_list:
    if "e            " in item:
        # Reset per gene so no value leaks over from the previous block.
        locus = product = genestart = geneend = strand = None
        for l in item.split("\n"):
            if "e            " in l:
                # Note: join(...) locations are not handled here.
                if "complement" not in l:
                    # Forward strand: coordinates follow "e" plus 12 spaces.
                    coordinates = l[13:]
                    strand = "+"
                else:
                    # Reverse strand: skip "complement(" and drop the closing ")".
                    coordinates = l[24:-1]
                    strand = "-"
                C = coordinates.split("..")
                genestart = C[0]
                geneend = C[1]

            if "/locus_tag" in l:
                locus = l.split('"')[1]

            if "/product" in l:
                product = l.split('"')[1]

        annotation.append({
            "locus": locus,
            "genestart": genestart,
            "geneend": geneend,
            "product": product,
        })

print("Finished!")


I built a pure-Python parser for this. It is probably not very elegant, but it works, and it may at least serve as a starting point:

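import re
import pprint

printer = pprint.PrettyPrinter(indent=4)

with open("entities.txt", "r") as file_obj:
    entities = []
    last_key = None  # attribute whose value continues on the following lines

    for line in file_obj:
        line = line.rstrip('\n')
        if not line.strip():
            continue  # skip blank lines

        if re.match(r'\s*(gene|CDS)\s+[\w(.,)]+', line):
            # A feature line (gene or CDS) starts a new dictionary.
            parts = line.split()
            entity = {parts[0]: parts[1]}
            entities.append(entity)
        else:
            try:
                (attr_name,) = re.findall(r'/\w+=', line)
                attr_name = attr_name.strip('/=')
            except ValueError:
                # No attribute name: this line continues a multi-line value.
                addition = line.strip().rstrip('"')
                entity[last_key] = entity[last_key] + addition
            else:
                try:
                    # First line of a multi-line quoted value (no closing quote).
                    (attr_value,) = re.findall(r'="\w+$', line)
                    last_key = attr_name
                except ValueError:
                    try:
                        # Quoted value that fits on one line.
                        (attr_value,) = re.findall(r'="[\w\s\.:,-]+"', line)
                    except ValueError:
                        # Unquoted numeric value such as /codon_start=1.
                        (attr_value,) = re.findall(r'=\d+$', line)

                # Strip the leading '=' and any quotes in every case.
                attr_value = attr_value.strip('"=')

                if attr_name in entity:
                    # Repeated attribute (e.g. db_xref): collect values in a flat list.
                    if not isinstance(entity[attr_name], list):
                        entity[attr_name] = [entity[attr_name]]
                    entity[attr_name].append(attr_value)
                else:
                    entity[attr_name] = attr_value

printer.pprint(entities)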
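
The parser above produces one dictionary per `gene` line and another per `CDS` line, whereas the question asks for a single dictionary per gene. A possible follow-up step, shown here only as a sketch, is to merge entries that share the same `locus_tag`. The helper name `merge_by_locus` is hypothetical, and the sketch assumes every entry carries a `locus_tag` key and that keeping the first value seen for each key is acceptable.

```python
def merge_by_locus(entities):
    """Combine feature dictionaries that share a locus_tag (illustrative sketch)."""
    merged = {}  # locus_tag -> combined dictionary, in order of first appearance

    for entity in entities:
        locus = entity.get("locus_tag")
        if locus is None:
            continue  # assumption: entries without a locus_tag are ignored

        gene = merged.setdefault(locus, {})
        for key, value in entity.items():
            # Keep the first value seen for each key; later duplicates are dropped.
            gene.setdefault(key, value)

    return list(merged.values())
```

Called as `merge_by_locus(entities)` on the parser's result, this would return one dictionary per locus that contains both the `gene` and the `CDS` location along with the CDS qualifiers.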
