How to improve the efficiency of parsing thousands of lines across hundreds of files


I wrote a script, but it is too slow, and I would like suggestions on how to speed it up. The part of the script that I think is too slow works like this:
<ol>
<li>I have a list of 1000 human gene names (each gene name is a number), read into a list called "ListOfHumanGenes"; for example, the start of the list looks like this: <code>[2314,2395,10672,8683,5075]</code></li>
<li>I have 100 files like the following, all with the extension ".HumanHomologs":
<pre><code>HumanGene  OriginalGene  Intercept  age    pval
2314       14248         5.3e-15    0.99   3.5e-33
2395       14297         15.76      -0.05  0.59
10672      14674         7.25       0.19   0.58
8683       108014        21.63      -1.74  0.43
5075       18503         -6.34      1.58   0.19
</code></pre></li>
<li>The algorithm for this part of the script is (in English, not code):</li>
</ol>
<blockquote>
<pre><code>for each gene in ListOfHumanGenes:
    open each of the 100 files labelled ".HumanHomologs"
    if the gene name is present:
        NumberOfTrials +=1
        if the p-val is &lt;0.05:
            if the "Age" column &lt; 0:
                UnderexpressedSuccess +=1
            elif "Age" column &gt; 0:
                OverexpressedSuccess +=1
    print each_gene + "\t" + NumberOfTrials + "\t" + UnderexpressedSuccess
    print each_gene + "\t" + NumberOfTrials + "\t" + OverexpressedSuccess
</code></pre>
</blockquote>
The code for this section is:
<pre><code>for each_item in ListOfHumanGenes:
    OverexpressedSuccess = 0
    UnderexpressedSuccess = 0
    NumberOfTrials = 0
    for each_file in glob.glob("*.HumanHomologs"):
        open_each_file = open(each_file).readlines()[1:]
        for line in open_each_file:
            line = line.strip().split()
            if each_item == line[0]:
                NumberOfTrials +=1 #i.e if the gene is in the file, NumberOfTrials +=1. Not every gene is guaranteed to be in every file
                if line[-1] != "NA":
                    if float(line[-1]) &lt; float(0.05):
                        if float(line[-2]) &lt; float(0):
                            UnderexpressedSuccess +=1
                        elif float(line[-2]) &gt; float(0):
                            OverexpressedSuccess +=1
    underexpr_output_file.write(each_item + "\t" + str(UnderexpressedSuccess) + "\t" + str(NumberOfTrials) + "\t" + str(UnderProbability) +"\n") #Note: the "Underprobabilty" float is obtained earlier in the script
    overexpr_output_file.write(each_item + "\t" + str(OverexpressedSuccess) + "\t" + str(NumberOfTrials) + "\t" + str(OverProbability) +"\n") #Note: the "Overprobability" float is obtained earlier in the script
overexpr_output_file.close()
underexpr_output_file.close()
</code></pre>
This produces two output files (one for overexpressed genes, one for underexpressed genes) like the one below. The columns are GeneName, #Overexpressed or #Underexpressed, and #NumberTrials; the last column can be ignored:
<pre><code>2314   8   100  0.100381689982
2395   14  90   0.100381689982
10672  10  90   0.100381689982
8683   8   98   0.100381689982
5075   5   88   0.100381689982
</code></pre>
Each ".HumanHomologs" file has more than 8,000 lines, and the gene list is about 20,000 genes long. So I understand why this is slow: for each of the 20,000 genes, it opens all 100 files and searches the more than 8,000 lines in each file for that gene. Can anyone suggest edits that would make this script faster or more efficient?
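One possible way to cut the work described above is to invert the loops: read each of the 100 files exactly once and keep a running tally per gene in a dictionary, instead of re-reading every file for every gene. The following is only a minimal sketch under stated assumptions, not the asker's actual script. The function name `count_expression` and the `alpha` parameter are hypothetical, and the sketch assumes the files are whitespace-separated with `age` and `pval` as the last two columns, as in the sample above. Gene IDs are converted to strings because fields read from a file are strings.

```python
import glob


def count_expression(paths, genes, alpha=0.05):
    """Read each file once and tally per-gene counts.

    Returns {gene_as_str: [trials, under, over]} for every gene in `genes`.
    (Hypothetical helper; names are illustrative, not from the original script.)
    """
    wanted = {str(g) for g in genes}          # set membership is O(1) on average
    counts = {g: [0, 0, 0] for g in wanted}   # [trials, under, over]
    for path in paths:
        with open(path) as handle:
            next(handle, None)                # skip the header line
            for line in handle:
                fields = line.split()
                if not fields or fields[0] not in wanted:
                    continue
                tally = counts[fields[0]]
                tally[0] += 1                 # gene is present in this file
                if fields[-1] == "NA":        # no p-value: count the trial only
                    continue
                if float(fields[-1]) < alpha:
                    age = float(fields[-2])
                    if age < 0:
                        tally[1] += 1         # significantly underexpressed
                    elif age > 0:
                        tally[2] += 1         # significantly overexpressed
    return counts


if __name__ == "__main__":
    ListOfHumanGenes = [2314, 2395, 10672, 8683, 5075]
    results = count_expression(glob.glob("*.HumanHomologs"), ListOfHumanGenes)
    for gene in ListOfHumanGenes:
        trials, under, over = results[str(gene)]
        print(gene, under, over, trials, sep="\t")
```

With this layout each line of each file is split and checked once, so the total work is roughly 100 files times 8,000 lines, rather than that figure multiplied again by the 20,000 genes. The two output files could then be written in one final loop over the gene list.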
