Reducing the processing time for merging tables and dictionaries

Posted 2024-10-01 17:27:34


These are my dictionaries:

dict_assembly = {'ind1gene1':'individual1', 'ind1gene2':'individual1','ind1gene3':'individual1', 'ind2gene1':'individual2', 'ind2gene2':'individual2','ind2gene3':'individual2', 'ind3gene1':'individual3', 'ind3gene2':'individual3','ind3gene3':'individual3','ind4gene1':'individual4','ind4gene2':'individual4','ind4gene3':'individual4','ind4gene4':'individual4'} 

dict_bhit = {'ind1gene1':'AAAAA', 'ind1gene2':'BBBBB','ind1gene3':'CCCCC', 'ind2gene1':'AAAAA', 'ind2gene2':'BBBBB','ind2gene3':'BBBBB', 'ind3gene1':'AAAAA', 'ind3gene2':'BBBBB','ind3gene3':'CCCCC','ind4gene1':'AAAAA','ind4gene2':'BBBBB','ind4gene3':'CCCCC','ind4gene4':'DDDDD'}

dict_identity = {'ind1gene1':'98','ind2gene1':'96','ind3gene1':'95','ind4gene1':'96','indi5gene1':'94','ind1gene2':'67','ind2gene2':'76','ind3gene2':'80','ind4gene2':'77','ind5gene2':'76','ind1gene3':'98','ind2gene3':'97','ind3gene3':'96','ind4gene3':'96','ind4gene4':'40'}

data = {} # temporary dictionary

The code used in this example is split into two blocks.

Part 1:

    import pandas as pd
    import time
    start = time.time()
    matrix_file = open("concatenated.matrix", "w" )
    col_subject = ['query', 'subject']
    df_accession = pd.DataFrame(dict_bhit.items(), columns=col_subject)
    col_genome = ['query', 'genome']
    df_assembly = pd.DataFrame(dict_assembly.items(), columns=col_genome)
    df_assembly['subject'] = df_assembly['query'].map(df_accession.set_index('query')['subject'])
    matrix = pd.get_dummies(df_assembly.set_index('genome')['subject']).max(level=0).max(level=0, axis=1)
    matrix.to_csv(matrix_file, sep='\t', header=True, index=True)
    print matrix
    end = time.time()
    print 'This step spent',round(end - start, 4), 'seconds\n'
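
A note for readers on a recent pandas: the "level" argument of DataFrame.max was deprecated and then removed in pandas 2.0, so the chained .max(level=0) calls above are expected to fail there. The following is a minimal sketch of an equivalent presence/absence table, assuming Python 3 and a recent pandas. It uses groupby(level=0).max() instead, builds the per-gene table from pd.Series objects rather than map(), and works on a subset of the dictionaries above to stay short.

```python
import pandas as pd

# Subset of the dictionaries from the question, kept short for illustration.
dict_assembly = {
    "ind1gene1": "individual1", "ind1gene2": "individual1", "ind1gene3": "individual1",
    "ind2gene1": "individual2", "ind2gene2": "individual2", "ind2gene3": "individual2",
}
dict_bhit = {
    "ind1gene1": "AAAAA", "ind1gene2": "BBBBB", "ind1gene3": "CCCCC",
    "ind2gene1": "AAAAA", "ind2gene2": "BBBBB", "ind2gene3": "BBBBB",
}

# One row per gene; pd.Series aligns the two dictionaries on their shared keys.
df_assembly = pd.DataFrame({
    "genome": pd.Series(dict_assembly),
    "subject": pd.Series(dict_bhit),
})

# One 0/1 indicator column per subject, one row per gene.
dummies = pd.get_dummies(df_assembly.set_index("genome")["subject"], dtype=int)

# Collapse the rows belonging to the same genome: 1 if any gene hit the subject.
matrix = dummies.groupby(level=0).max()
print(matrix)
```

The second .max(level=0, axis=1) in the original only matters if column labels repeat, which get_dummies does not produce for a single Series, so it is left out of this sketch.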

Part 2:

start = time.time()
matrix_file = open("identity.matrix", "w" )
col_bhit = ['gene', 'subject']
df_bmatch =  pd.DataFrame(dict_bhit.items(), columns=col_bhit)  # convert "dict_bhit" into a dataframe
col_file = ['gene', 'assembly']
df_origin = pd.DataFrame(dict_assembly.items(), columns=col_file)   # convert "dict_assembly" into a dataframe
col_percent = ['gene', 'percent']
df_percent = pd.DataFrame(dict_identity.items(), columns=col_percent)   # convert "dict_identity" into a dataframe

for k, col in dict_assembly.items():
    if k in dict_bhit and k in dict_identity:
        data.setdefault(dict_bhit[k], {})[col] = dict_identity[k]
    elif k in dict_bhit and k not in dict_identity:
        data.setdefault(dict_bhit[k], {})[col] = "NA"
    df = pd.DataFrame(data)
df.to_csv(matrix_file, sep='\t', header=True, index=True)
print df

end = time.time()
print 'This step spent',round(end - start, 4), 'seconds\n'

Any suggestions on how to reduce the processing time for generating the second table? As you can see, the two steps report different times:

Saving presence/absence table ...
             AAAAA  BBBBB  CCCCC  DDDDD
genome                                 
individual1      1      1      1      0
individual2      1      1      0      0
individual3      1      1      1      0
individual4      1      1      1      1
This step spents 0.0084 seconds

Saving identity table...
            AAAAA BBBBB CCCCC DDDDD
individual1    98    67    98   NaN
individual2    96    76   NaN   NaN
individual3    95    80    96   NaN
individual4    96    77    96    40
This step spents 0.0106 seconds

1 answer

Answer #1 · posted 2024-10-01 17:27:34

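To solve this problem and keep the run time down to a few seconds on a large dataset, I commented out the two lines of the "elif" branch (Option 1).

Option 1: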

for k, col in dict_assembly.items():
    if k in dict_bhit and k in dict_identity:
        data.setdefault(dict_bhit[k], {})[col] = dict_identity[k]

    #elif k in dict_bhit and k not in dict_identity:
        #data.setdefault(dict_bhit[k], {})[col] = "NA"

    df = pd.DataFrame(data)
df.to_csv(matrix_file, sep='\t', header=True, index=True)
print df
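
With the "elif" branch commented out, combinations that have no identity value are simply absent from "data", and pandas fills those cells with NaN. If the literal text NA is still wanted in the written file, one possibility is the na_rep argument of to_csv. This is an assumption about the intent, not something the answer states. A minimal Python 3 sketch, using three values from the question and an in-memory buffer in place of the output file:

```python
import io

import pandas as pd

# A few values taken from the question's identity table.
data = {
    "AAAAA": {"individual1": "98", "individual2": "96"},
    "CCCCC": {"individual1": "98"},  # individual2 has no CCCCC hit
}
df = pd.DataFrame(data)

# io.StringIO stands in for the open file object used in the original code.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", header=True, index=True, na_rep="NA")
text = buffer.getvalue()
print(text)
```

Note that na_rep writes NA for every empty cell, including genome/subject pairs that have no hit at all, which is broader than what the original "elif" branch covered.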

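For small datasets, Option 2 can be used, which removes the "if" condition entirely. Note that this raises a KeyError if any key of dict_assembly is missing from dict_bhit or dict_identity.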

Option 2:
for k, col in dict_assembly.items():

    data.setdefault(dict_bhit[k], {})[col] = dict_identity[k]

    df = pd.DataFrame(data)
df.to_csv(matrix_file, sep='\t', header=True, index=True)
print df
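
In the loops as printed above, the line df = pd.DataFrame(data) is indented under the "for", so the DataFrame is rebuilt on every iteration. If that indentation reflects the real script, moving the construction out of the loop is likely to help on large inputs. The following is a minimal Python 3 sketch under that assumption. The function name build_identity_table and its "missing" parameter are hypothetical, introduced only for illustration.

```python
import pandas as pd


def build_identity_table(assembly, bhit, identity, missing="NA"):
    """Collect identity values per (subject, genome) and build the table once."""
    data = {}
    for gene, genome in assembly.items():
        if gene not in bhit:
            continue  # no best hit for this gene: skipped, as in the original loop
        # dict.get covers both original branches: the value if present, else "NA".
        # If a genome has several genes with the same subject, the last one seen wins.
        data.setdefault(bhit[gene], {})[genome] = identity.get(gene, missing)
    # Build the DataFrame a single time, after the loop has finished.
    return pd.DataFrame(data)


# Subset of the dictionaries from the question.
dict_assembly = {"ind1gene1": "individual1", "ind1gene3": "individual1",
                 "ind2gene1": "individual2"}
dict_bhit = {"ind1gene1": "AAAAA", "ind1gene3": "CCCCC", "ind2gene1": "AAAAA"}
dict_identity = {"ind1gene1": "98", "ind1gene3": "98", "ind2gene1": "96"}

df = build_identity_table(dict_assembly, dict_bhit, dict_identity)
print(df)
```

The same three dictionaries and the same nested "data" layout are used, so the result can be passed to to_csv exactly as in the options above.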
