How can I compare files faster in Python?

import csv

output = []

a = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf', 'r')
list1 = a.readlines()
reader1 = a.read()

b = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf', 'r')
list2 = b.readlines()
reader2 = b.read()

f3 = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf', 'w')

for line1 in list1:
    separar = line1.split("\t")
    gene = separar[2]
    for line2 in list2:
        separar2 = line2.split("\t")
        gene2 = separar2[2]
        if gene == gene2:
            print line1
            f3.write(line1)

1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout

1 Answer

User

#1 · Posted on 2024-10-03 21:33:25

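If you store the lines in dictionaries keyed by the column of interest, you can easily use Python's built-in set operations (which run at C speed) to find the matching lines. I tested a slightly modified version (the file names were changed, and split('\t') was changed to split() because of Stack Overflow formatting), and it seems to work well: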

import collections

# Files are opened in binary mode ('rb' / 'wb'), so lines and keys are bytes

infn1 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf'
infn2 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf'
outfn = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf'

def readfile(fname):
    '''
    Read in a file and return a dictionary of lines, keyed by the item in the third column (index 2)
    '''
    results = collections.defaultdict(list)
    # Read in binary mode; it's quicker
    with open(fname, 'rb') as f:
        for line in f:
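            # Split on a bytes separator, since the file is read in binary mode
            parts = line.split(b"\t")
            # Skip blank or short lines that have no third column
            if len(parts) < 3:
                continue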
            gene = parts[2]
            results[gene].append(line)
    return results

dict1 = readfile(infn1)
dict2 = readfile(infn2)

with open(outfn, 'wb') as outf:
    # Find keys that appear in both files
    for key in set(dict1) & set(dict2):
        # For these keys, print all the matching
        # lines in the first file
        for line in dict1[key]:
            # Decode for display only; the bytes line is written unchanged
            print(line.rstrip().decode())
            outf.write(line)
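
One thing to be aware of: iterating over a set intersection gives the keys in arbitrary order, so the output lines will not necessarily follow the order of the first file. If that matters, a variation (not what I tested above, just a minimal sketch) is to build a set of keys from the second file only and then stream through the first file in order. The helper name `matching_lines`, its `column` and `sep` parameters, and the tiny in-memory rows are made up for illustration, and this version uses text mode rather than binary:

```python
import io

def matching_lines(lines1, lines2, column=2, sep="\t"):
    """Yield lines from lines1 whose key column also occurs in lines2.

    Lines are yielded in the order they appear in lines1.
    """
    # Collect the keys of the second input once; set lookups are O(1) on average
    keys2 = set()
    for line in lines2:
        parts = line.rstrip("\n").split(sep)
        if len(parts) > column:
            keys2.add(parts[column])

    # Stream the first input and keep only lines whose key was seen
    for line in lines1:
        parts = line.rstrip("\n").split(sep)
        if len(parts) > column and parts[column] in keys2:
            yield line

# Hypothetical stand-ins for the two VCF files (invented rows, tab-separated)
file1 = io.StringIO("1\t100\trsA\tC\tT\n1\t200\trsB\tG\tA\n1\t300\trsC\tT\tC\n")
file2 = io.StringIO("1\t300\trsC\tT\tC\n1\t100\trsA\tC\tT\n")

result = list(matching_lines(file1, file2))
```

With real files you would pass the two open file objects instead of the `io.StringIO` stand-ins and write each yielded line to the output file. Only the keys of the second file are held in memory, which may help if the first file is large.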
