How can I repeatedly update a Python dictionary without losing the original data for keys taken from another dictionary?

Posted 2024-10-16 17:19:15


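I am trying to update a dictionary built from one set of txt files with information from a second dictionary built from another set of files. My problem is that every time I try to update it, my output is cut down to "single dictionary output" : {my updated output}, rather than the expected {my updated output},{my updated output}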

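I first tried merging the dictionaries, which basically left me with two dictionaries side by side. I then tried updating with dictionary1.update(dictionary2[key]), which gave me the "single dictionary output".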
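For reference, here is a minimal sketch of what dict.update does when it is handed one sub-dictionary, as in dictionary1.update(dictionary2[key]). The names fasta_dict and gb_dict and all sample values (gene ID, accession, host, isolate) are invented for illustration and do not come from the real files.

```python
# Hypothetical sample data; the accession, host, and isolate values are made up.
fasta_dict = {"001": {"accn": "AB000001", "seq": "ATGC"}}
gb_dict = {"AB000001": {"host": "human", "isolate": "iso-1"}}

record = fasta_dict["001"]
# update() copies the key/value pairs of its argument into the existing dict, in place.
# Only this one record changes; other entries of fasta_dict are not touched.
record.update(gb_dict[record["accn"]])

print(fasta_dict)
```

Since update only changes the one dictionary it is called on, a single call can only ever produce a single merged record.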

import re
import os
import glob
asps = []

gbFileNames = list(glob.glob(os.path.join('/Users/schneider/Downloads/Reilly/*.gb')))

gbDict = {}

for myfile in gbFileNames:
    currentfile = open(myfile, 'r')
    for line in currentfile:
        if 'ACCESSION' in line: 
            accn = line.split(' ')[-1].rstrip()
            gbDict[accn] = {'host':'','isolate':''}
        elif 'host=' in line: 
            gbDict[accn]['host'] += line.split('"')[1]
        elif 'isolate=' in line: 
            gbDict[accn]['isolate'] += line.split('"')[1]

seqFileNames = list(glob.glob(os.path.join('/Users/schneider/Downloads/Reilly/*.txt')))

fastaDict = {}

for myfile in seqFileNames:
    currentfile = open(myfile, 'r')
    for line in currentfile:
        if '>' in line:
        # DEFINE GENE ID
            pseudoGeneID = re.search('(?<=gene)\w{1,}', line)
            GeneID = pseudoGeneID.group(0)
        #   fastaDict[GeneID] = {'accn':'','host':'','isolate':'','seq':''} #initiate subdictionary after introducing GeneID variable
            fastaDict[GeneID] = {'accn':'','seq':''} #initiate subdictionary after introducing GeneID variable
            # DEFINE TAXON by accession number
            accn = line.split('|')[1].split('.')[0]
            fastaDict[GeneID]['accn'] += accn.rstrip() # assign accession ID to dictionary using += refer to rstrip down below :)
        else:
            seq = line # here we basically say that if it doesnt start with > we assume it must be a sequence, thus we call the line a seq to make more sense :) 
            fastaDict[GeneID]['seq'] += seq.rstrip()  # rstrip is used here to guarantee that any crap will not come along with your nice sequence data


    fastaDict[GeneID].update(gbDict[accn])  
print fastaDict[GeneID]

fastaDict output = GeneID{accn;seq}
gbDict output = accn{host;isolate}

Expected result:

updatedDict output = GENEID{accn;seq;host;isolate}
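One possible way to reach this shape, sketched under the assumption that both dictionaries are already filled as described above, is to loop over every record and merge the matching accession entry into each one, instead of updating a single GeneID. The helper name merge_annotations and the sample values are hypothetical.

```python
def merge_annotations(fasta_dict, gb_dict):
    """Return a new dict in which each record also carries the host/isolate of its accession."""
    merged = {}
    for gene_id, record in fasta_dict.items():
        combined = dict(record)  # shallow copy, so the input record is not modified
        # .get() with an empty default leaves records without a GenBank match unchanged
        combined.update(gb_dict.get(record["accn"], {}))
        merged[gene_id] = combined
    return merged


# Hypothetical sample data standing in for the parsed files
fasta_dict = {
    "001": {"accn": "AB000001", "seq": "ATGC"},
    "002": {"accn": "AB000001", "seq": "GGTA"},
}
gb_dict = {"AB000001": {"host": "human", "isolate": "iso-1"}}

updated = merge_annotations(fasta_dict, gb_dict)
for gene_id, record in updated.items():
    print(gene_id, record)
```

Looking up the accession stored inside each record, rather than reusing a loop variable left over from parsing, keeps every record paired with its own accession.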

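Note: GeneID is not unique, because multiple files will contain the same GeneID; 'accn' combined with GeneID is unique. Ultimately, we want to output one fasta file per gene, containing multiple accession numbers. 'accn' is repeated many times, because a single accn (equivalent to a single genome) has multiple GeneIDs. Host and isolate are the accompanying identifying data that we want to use in the output line, along with the unique GeneID+accn combination.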
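Because GeneID alone repeats across files, a dictionary keyed only by GeneID keeps just the last record assigned to each key. The sketch below assumes a (GeneID, accn) tuple as the key and then groups the records by gene for output. The sample records and the header layout are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records keyed by (GeneID, accn), so the same GeneID
# coming from different accessions can coexist in one dictionary.
records = {
    ("001", "AB000001"): {"seq": "ATGC", "host": "human", "isolate": "iso-1"},
    ("001", "AB000002"): {"seq": "ATGA", "host": "bat", "isolate": "iso-2"},
    ("002", "AB000001"): {"seq": "GGTA", "host": "human", "isolate": "iso-1"},
}

# Group the FASTA-style entries by gene, one list of entries per GeneID
by_gene = defaultdict(list)
for (gene_id, accn), info in records.items():
    # The header layout is an assumption, not a required FASTA convention
    header = ">{}|{}|{}|{}".format(gene_id, accn, info["host"], info["isolate"])
    by_gene[gene_id].append(header + "\n" + info["seq"])

for gene_id, entries in by_gene.items():
    print("# gene", gene_id)
    print("\n".join(entries))
```

Each list in by_gene could then be written to its own file, one per gene.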

Data structure: 1 accn has multiple sequences, and each sequence has 1 gene ID, host, and isolate.
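The structure described here could also be held as a nested dictionary with one entry per accession. This is only a sketch of that layout: the helper add_sequence, the "genes" key, and the sample values are hypothetical.

```python
# Hypothetical nested layout: one entry per accession, holding its annotations and its genes
genomes = {}


def add_sequence(genomes, accn, gene_id, seq, host="", isolate=""):
    """Store one sequence under its accession, creating the accession entry on first use."""
    # setdefault() returns the existing entry if there is one, so host and isolate
    # given on the first call are kept on later calls for the same accession
    entry = genomes.setdefault(accn, {"host": host, "isolate": isolate, "genes": {}})
    entry["genes"][gene_id] = seq
    return entry


add_sequence(genomes, "AB000001", "001", "ATGC", host="human", isolate="iso-1")
add_sequence(genomes, "AB000001", "002", "GGTA")
print(genomes)
```

With this layout host and isolate are stored once per accession instead of being copied into every sequence record.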

