How to parse thousands of lines across hundreds of files more efficiently

HumanGene  OriginalGene  Intercept  age    pval
2314       14248         5.3e-15    0.99   3.5e-33
2395       14297         15.76      -0.05  0.59
10672      14674         7.25       0.19   0.58
8683       108014        21.63      -1.74  0.43
5075       18503         -6.34      1.58   0.19

for each gene in ListOfHumanGenes:
    open each of the 100 files labelled ".HumanHomologs"
    if the gene name is present:
        NumberOfTrials += 1
        if the p-val is < 0.05:
            if the "Age" column < 0:
                UnderexpressedSuccess += 1
            elif the "Age" column > 0:
                OverexpressedSuccess += 1
    print each_gene + "\t" + NumberOfTrials + "\t" + UnderexpressedSuccess
    print each_gene + "\t" + NumberOfTrials + "\t" + OverexpressedSuccess

for each_item in ListOfHumanGenes:
    OverexpressedSuccess = 0
    UnderexpressedSuccess = 0
    NumberOfTrials = 0
    for each_file in glob.glob("*.HumanHomologs"):
        open_each_file = open(each_file).readlines()[1:]
        for line in open_each_file:
            line = line.strip().split()
            if each_item == line[0]:
                # The gene is in this file. Not every gene is guaranteed to be in every file.
                NumberOfTrials += 1
                if line[-1] != "NA":
                    if float(line[-1]) < float(0.05):
                        if float(line[-2]) < float(0):
                            UnderexpressedSuccess += 1
                        elif float(line[-2]) > float(0):
                            OverexpressedSuccess += 1
    # Note: the UnderProbability and OverProbability floats are obtained earlier in the script
    underexpr_output_file.write(each_item + "\t" + str(UnderexpressedSuccess) + "\t" + str(NumberOfTrials) + "\t" + str(UnderProbability) + "\n")
    overexpr_output_file.write(each_item + "\t" + str(OverexpressedSuccess) + "\t" + str(NumberOfTrials) + "\t" + str(OverProbability) + "\n")
overexpr_output_file.close()
underexpr_output_file.close()

2 Answers

User

#1 · Edited 2024-07-05 15:16:53

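Thanks for your help; the insight about swapping the loops was invaluable. The improved, more efficient script is shown below. (Note: instead of the list of human genes described above, I now have a dictionary, DictOfHumanGenes, in which each key is a human gene and each value is a list of (1) NumberOfTrials, (2) UnderexpressedSuccess, and (3) OverexpressedSuccess; this also sped up other parts of the code.)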

import glob

# DictOfHumanGenes maps gene -> [NumberOfTrials, UnderexpressedSuccess, OverexpressedSuccess]
for each_file in glob.glob("*.HumanHomologs"):
    with open(each_file) as handle:
        next(handle, None)  # skip the header row
        for line in handle:
            fields = line.split()
            if not fields or fields[0] not in DictOfHumanGenes:
                continue
            counts = DictOfHumanGenes[fields[0]]
            counts[0] += 1  # NumberOfTrials
            if fields[-1] != "NA" and float(fields[-1]) < 0.05:
                age = float(fields[-2])
                if age < 0:
                    counts[1] += 1  # UnderexpressedSuccess
                elif age > 0:
                    counts[2] += 1  # OverexpressedSuccess

I'm now looking into pandas to see how to integrate it; if I can make the code more efficient with pandas, I'll post the answer here.

User

#2 · Edited 2024-07-05 15:16:53

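Your algorithm opens all 100 files 1,000 times, because every file is reopened and reread for each gene in the list. The optimization that comes immediately to mind is to make the files the outermost loop, which ensures that each file is opened only once. Then, for each line, check whether the gene is one you care about and keep whatever other records you want.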

Also, the pandas module would be very handy for processing this kind of CSV-like file. Take a look at pandas.
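
As a rough sketch of what the pandas route might look like: this assumes the files are whitespace-delimited with the header row shown in the question (HumanGene, OriginalGene, Intercept, age, pval) and that missing p-values are written as NA. The function name tally_genes and its arguments are made up for illustration and are not from the original script.

```python
import pandas as pd


def tally_genes(paths, genes):
    """Count trials and significant under/over-expression per gene.

    paths: iterable of file paths; genes: iterable of gene IDs as strings.
    Both names are hypothetical, chosen for this sketch.
    """
    genes = sorted(set(genes))
    columns = ["NumberOfTrials", "UnderexpressedSuccess", "OverexpressedSuccess"]
    frames = [
        pd.read_csv(
            path,
            sep=r"\s+",                       # assumes whitespace-delimited files
            usecols=["HumanGene", "age", "pval"],
            dtype={"HumanGene": str},
            na_values=["NA"],
        )
        for path in paths
    ]
    if not frames:
        # No input files: every requested gene gets zero counts.
        return pd.DataFrame(0, index=genes, columns=columns)

    data = pd.concat(frames, ignore_index=True)
    data = data[data["HumanGene"].isin(genes)]

    # A missing p-value compares as False, so such rows still count as
    # trials but never as successes.
    significant = data["pval"] < 0.05
    data = data.assign(
        under=(significant & (data["age"] < 0)).astype(int),
        over=(significant & (data["age"] > 0)).astype(int),
    )
    grouped = data.groupby("HumanGene")
    result = pd.DataFrame({
        "NumberOfTrials": grouped.size(),
        "UnderexpressedSuccess": grouped["under"].sum(),
        "OverexpressedSuccess": grouped["over"].sum(),
    })
    # Genes that never appear in any file get zero counts.
    return result.reindex(genes, fill_value=0)
```

It could be called as tally_genes(glob.glob("*.HumanHomologs"), DictOfHumanGenes.keys()), and the result written out with to_csv(sep="\t"). Whether this beats the plain dictionary loop depends on file sizes, so it is worth timing both on real data.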
