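I am trying to update a dictionary built from one txt file with information from a second dictionary built from another file. My problem is that every time I try to update it, my output is cut down to
"single dictionary output" : {my updated output}
instead of the expected {my updated output},{my updated output}
First I tried merging the dictionaries, which basically gave me two dictionaries side by side. Then I tried updating with dictionary1.update(dictionary2[key]),
which gave me the "single dictionary output".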
import re
import os
import glob

asps = []

# Parse the GenBank files: accession -> {host, isolate}
gbFileNames = list(glob.glob(os.path.join('/Users/schneider/Downloads/Reilly/*.gb')))
gbDict = {}
for myfile in gbFileNames:
    with open(myfile, 'r') as currentfile:
        for line in currentfile:
            if 'ACCESSION' in line:
                accn = line.split(' ')[-1].rstrip()
                gbDict[accn] = {'host':'','isolate':''}
            elif 'host=' in line:
                gbDict[accn]['host'] += line.split('"')[1]
            elif 'isolate=' in line:
                gbDict[accn]['isolate'] += line.split('"')[1]

# Parse the sequence files: GeneID -> {accn, seq}
seqFileNames = list(glob.glob(os.path.join('/Users/schneider/Downloads/Reilly/*.txt')))
fastaDict = {}
for myfile in seqFileNames:
    with open(myfile, 'r') as currentfile:
        for line in currentfile:
            if '>' in line:
                # DEFINE GENE ID
                pseudoGeneID = re.search(r'(?<=gene)\w{1,}', line)
                GeneID = pseudoGeneID.group(0)
                # fastaDict[GeneID] = {'accn':'','host':'','isolate':'','seq':''} #initiate subdictionary after introducing GeneID variable
                fastaDict[GeneID] = {'accn':'','seq':''} #initiate subdictionary after introducing GeneID variable
                # DEFINE TAXON by accession number
                accn = line.split('|')[1].split('.')[0]
                fastaDict[GeneID]['accn'] += accn.rstrip() # assign accession ID to dictionary using += refer to rstrip down below :)
            else:
                seq = line # if the line does not contain >, assume it is sequence data
                fastaDict[GeneID]['seq'] += seq.rstrip() # rstrip drops the trailing newline so only sequence characters are kept

# Merge host/isolate from gbDict into the fasta record
fastaDict[GeneID].update(gbDict[accn])
print(fastaDict[GeneID])
fastaDict output = GeneID{accn;seq}
gbDict output = accn{host;isolate}
Expected result:
updatedDict output = GENEID{accn;seq;host;isolate}
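For illustration, here is a minimal sketch of one way to reach that shape: apply the update once per record rather than once overall. The function name `merge_annotations` and the toy accession, host, and isolate values are hypothetical and not taken from the real files, and the sketch assumes a missing accession should simply leave the record without host/isolate.

```python
# Toy stand-ins for the two parsed dictionaries (all values are made up).
gbDict = {
    "AB000001": {"host": "Homo sapiens", "isolate": "iso1"},
    "AB000002": {"host": "Mus musculus", "isolate": "iso2"},
}
fastaDict = {
    "A": {"accn": "AB000001", "seq": "ATGC"},
    "B": {"accn": "AB000002", "seq": "GGCC"},
}


def merge_annotations(fasta_records, gb_records):
    """Return a new dict where every fasta record also carries host/isolate."""
    merged = {}
    for gene_id, record in fasta_records.items():
        combined = dict(record)  # copy so the input record is left untouched
        # Look up the annotation by this record's own accession number.
        combined.update(gb_records.get(record["accn"], {}))
        merged[gene_id] = combined
    return merged


updatedDict = merge_annotations(fastaDict, gbDict)
for gene_id in sorted(updatedDict):
    print(gene_id, updatedDict[gene_id])
```

The key point is that `update` runs inside the loop, so each GeneID entry receives the annotation for its own `accn` instead of only the last record seen.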
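Note: GeneID is not unique, because multiple files will have the same GeneID; 'accn' in combination with GeneID is unique. Ultimately, we want to output one fasta file per gene containing multiple accession numbers. 'accn' is repeated many times, since multiple GeneIDs come from a single accn, which is equivalent to a single genome. Host and isolate are accompanying identifying data that we want to use in the output line, along with the unique GeneID+accn combination.
Data structure: 1 accn has multiple sequences, and each sequence has 1 gene ID, host, and isolate.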
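Since only the GeneID+accn combination is unique, one possible layout is a dictionary keyed by the `(GeneID, accn)` tuple, grouped by gene afterwards. The sketch below is hedged: the names `gb_records`, `fasta_rows`, `records`, and `by_gene`, the toy values, and the `>gene|accn|host|isolate` header layout are all assumptions made for illustration, not something taken from the real data.

```python
from collections import defaultdict

# Toy annotation table: accn -> host/isolate (made-up values).
gb_records = {
    "ACC1": {"host": "hostA", "isolate": "iso1"},
    "ACC2": {"host": "hostB", "isolate": "iso2"},
}
# Toy parsed sequences as (GeneID, accn, seq); note geneX appears twice.
fasta_rows = [
    ("geneX", "ACC1", "ATGC"),
    ("geneX", "ACC2", "ATGG"),
    ("geneY", "ACC1", "TTAA"),
]

# Key by (GeneID, accn) so a repeated GeneID does not overwrite earlier entries.
records = {}
for gene_id, accn, seq in fasta_rows:
    entry = {"seq": seq}
    entry.update(gb_records.get(accn, {"host": "", "isolate": ""}))
    records[(gene_id, accn)] = entry

# Group the merged records per gene, ready to be written to one file per gene.
by_gene = defaultdict(list)
for (gene_id, accn), entry in sorted(records.items()):
    header = ">{0}|{1}|{2}|{3}".format(gene_id, accn, entry["host"], entry["isolate"])
    by_gene[gene_id].append(header + "\n" + entry["seq"])

for gene_id in sorted(by_gene):
    print("\n".join(by_gene[gene_id]))
```

With plain `fastaDict[GeneID] = {...}`, the second file containing the same GeneID would replace the first entry; the tuple key keeps both.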