Comparing two files and reordering columns in Python

abc6917833 abc3756968 0.817637
abc6920271 abc3756968 0.984551
abc4275081 abc3756968 0.834717
abc2209154 abc3756968 0.8642
abc10457594 abc3756968 0.763052
abc3756968 abc9493450 1
abc3756968 abc9483504 1
abc3756968 abc9389034 0.775731
abc3756968 abc2236381 0.775731
abc3756968 abc2236382 0.775731
abc3756968 abc9399043 0.994849
abc3756968 abc9321374 0.764408
abc3756968 abc9399044 0.775731
abc3756968 abc7452703 1
abc3756968 abc4599669 0.775731
abc6917833 abc9483504 0.817637
abc6920271 abc9483504 0.984551
abc4275081 abc9483504 0.834717
abc2209154 abc9483504 0.8642
abc10457594 abc9483504 0.763052
abc3756968 abc9483504 1
abc9493450 abc9483504 1
abc9483504 abc9389034 0.775731
abc9483504 abc2236381 0.775731
abc9483504 abc2236382 0.775731
abc9483504 abc9399043 0.994849
abc9483504 abc9321374 0.764408
abc9483504 abc9399044 0.775731
abc9483504 abc7452703 1

abc3756968 abc6917833 0.817637
abc3756968 abc6920271 0.984551
abc3756968 abc4275081 0.834717
abc3756968 abc2209154 0.8642
abc3756968 abc10457594 0.763052
abc3756968 abc9493450 1
abc3756968 abc9483504 1
abc3756968 abc9389034 0.775731
abc3756968 abc2236381 0.775731
abc3756968 abc2236382 0.775731
abc3756968 abc9399043 0.994849
abc3756968 abc9321374 0.764408
abc3756968 abc9399044 0.775731
abc3756968 abc7452703 1
abc3756968 abc4599669 0.775731
abc3756968 abc9483504 1
abc9483504 abc3756968 1
abc9483504 abc6917833 0.817637
abc9483504 abc6920271 0.984551
abc9483504 abc4275081 0.834717
abc9483504 abc2209154 0.8642
abc9483504 abc10457594 0.763052
abc9483504 abc3756968 1
abc9483504 abc9493450 1
abc9483504 abc9389034 0.775731
abc9483504 abc2236381 0.775731
abc9483504 abc2236382 0.775731
abc9483504 abc9399043 0.994849
abc9483504 abc9321374 0.764408
abc9483504 abc9399044 0.775731
abc9483504 abc7452703 1

rs_dict = {}
with open("file1") as rs:
    for line in rs:
        rs_dict[line.strip()] = 1

for rs in rs_dict.keys():
    with open("file2") as ld:
        for line in ld:
            if rs in line.strip().split():
                if rs == line.strip().split()[0]:
                    print(line.strip())
                else:
                    print(line.strip().split()[1] + "\t" + line.strip().split()[0] + "\t" + line.strip().split()[2])

2 Answers

User

#1 · Edited 2024-10-02 20:37:16

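Your algorithm is slow because it loops through all of file2 for every ID in file1, i.e. O(n*m).

Instead, you should loop over file2 once, storing the data as you go, and then iterate over file1 and print the matching entries, i.e. O(n+m).

Note that you can also use defaultdict and EAFP to avoid checking whether a key already exists in the dictionary.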

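from collections import defaultdict

# Map every ID in file2 to the (other ID, value) pairs it appears with.
data = defaultdict(list)

with open("file2") as f2:
    for line in f2:
        id1, id2, val = line.strip().split()
        data[id1].append((id2, val))
        data[id2].append((id1, val))

# Convert to a plain dict so that a missing ID raises KeyError
# (a defaultdict would silently create an empty list instead).
data = dict(data)

with open("file1") as f1:
    for line in f1:
        key = line.strip()
        try:
            for other, val in data[key]:
                print("%s %s %s" % (key, other, val))
        except KeyError:
            pass  # this ID does not occur in file2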

User

#2 · Edited 2024-10-02 20:37:16

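In the second loop, you open and read file2 once for every key read from file1. The slowness probably depends on whether the underlying operating system is (or is not) caching the contents of file2.

How big is file2? If it is smaller than what can reasonably be held in your computer's RAM (typically a few hundred MB), try caching it yourself: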

with open("file2", "r") as f:
    cache = f.readlines()
# you now have cache, a list of lines from file2

Then remove the with statement from the second block and replace the second for loop with "for line in cache".
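Put together, the modified loop might look like the sketch below. The function name reorder_with_list_cache and the three in-memory sample lines are illustrative assumptions, not part of the original post; in practice the two arguments would come from file1 and file2.

```python
def reorder_with_list_cache(ids, file2_lines):
    """Return the file2 rows that mention an ID, with that ID moved to column 1.

    Hypothetical helper: `ids` stands in for the stripped lines of file1,
    `file2_lines` for the lines of file2.
    """
    cache = list(file2_lines)  # file2 is read only once
    rows = []
    for rs in ids:
        for line in cache:  # still scans every cached line for every ID
            fields = line.strip().split()
            if rs in fields:
                if rs == fields[0]:
                    rows.append(line.strip())
                else:
                    # Swap the first two columns so the ID comes first.
                    rows.append("\t".join((fields[1], fields[0], fields[2])))
    return rows


# Small sample taken from the data in the question.
sample_ids = ["abc3756968"]
sample_file2 = [
    "abc6917833 abc3756968 0.817637\n",
    "abc3756968 abc9493450 1\n",
    "abc9483504 abc7452703 1\n",
]
rows = reorder_with_list_cache(sample_ids, sample_file2)
for row in rows:
    print(row)
```

This removes the repeated disk reads, but the work is still proportional to the number of IDs times the number of lines.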

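This is still seriously suboptimal. It is better to build a Python dict from the contents of file2, so that you can access only the lines you need instead of scanning all of them. Something like this: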

cache = {}
with open("file2", "r") as f:
    for line in f:
        t = line.strip().split()
        key1 = t[0]
        if key1 not in cache:
            cache[key1] = []
        cache[key1].append(line)
        key2 = t[1]
        if key2 not in cache:
            cache[key2] = []
        cache[key2].append(line)

The code is nearly duplicated to make it easier to follow. In general, you would run an inner loop over the words of the line produced by split().
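A minimal sketch of that inner-loop version, assuming the first two words of each line are the IDs and the third is the value. The function name build_cache and the sample IDs abc1, abc2 and abc3 are made up for illustration.

```python
def build_cache(lines):
    """Map each ID (first or second word) to the list of lines containing it."""
    cache = {}
    for line in lines:
        words = line.split()
        # Only the first two words are IDs; the third is the numeric value.
        for key in words[:2]:
            cache.setdefault(key, []).append(line)
    return cache


# Hypothetical sample lines in the same three-column layout as file2.
sample = ["abc1 abc2 0.5\n", "abc2 abc3 1\n"]
cache = build_cache(sample)
print(sorted(cache))
print(cache["abc2"])
```

dict.setdefault plays the same role as the two "if ... not in cache" checks: it creates the empty list the first time a key is seen.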

Now the second piece of code becomes much simpler. Outline:

for rs in rs_dict.keys():
    if rs in cache:
        cached_lines = cache[rs]
        # cached_lines is a list of one or more lines containing rs
        # as the first or second word
    else:
        # rs wasn't in file2 at all
        pass

Because a Python dict uses a data structure that locates entries by key (a hash table), this is much faster than checking every item in a list.
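The difference is easy to measure with a rough timing sketch like the one below. The collection size, the generated keys and the repeat count are arbitrary choices for illustration, and the absolute numbers will vary by machine.

```python
import timeit

n = 100_000
keys = ["abc%d" % i for i in range(n)]  # synthetic IDs, not real data
as_list = keys
as_dict = dict.fromkeys(keys)
needle = keys[-1]  # worst case for the list: the last element

# Time 100 membership tests against each container.
list_time = timeit.timeit(lambda: needle in as_list, number=100)
dict_time = timeit.timeit(lambda: needle in as_dict, number=100)
print("list: %.6f s, dict: %.6f s" % (list_time, dict_time))
```

The list test has to walk the elements one by one, while the dict test hashes the key and jumps to its slot, so the gap grows with the size of the collection.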

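For completeness: if both file1 and file2 are huge (gigabytes and up), you should load their contents into a database such as sqlite. A database does on disk what a dict does in RAM: accessing selected elements by key is far more efficient than naively searching through all records.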
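A rough sketch of the database route using the standard-library sqlite3 module. The table name pairs, the column names, the index names and the helper functions are assumptions made for this example. An in-memory database keeps the demo self-contained; for data that does not fit in RAM you would pass a file path to sqlite3.connect instead.

```python
import sqlite3


def load_pairs(conn, lines):
    """Load three-column lines into an indexed table (hypothetical schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS pairs (id1 TEXT, id2 TEXT, val TEXT)")
    conn.executemany(
        "INSERT INTO pairs VALUES (?, ?, ?)",
        (line.split() for line in lines if line.strip()),
    )
    # Indexes on both ID columns let lookups avoid a full table scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_pairs_id1 ON pairs (id1)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_pairs_id2 ON pairs (id2)")
    conn.commit()


def rows_for(conn, rs):
    """Return (rs, other_id, val) tuples for every row mentioning rs."""
    query = (
        "SELECT id1, id2, val FROM pairs WHERE id1 = ? "
        "UNION ALL "
        "SELECT id2, id1, val FROM pairs WHERE id2 = ?"
    )
    return conn.execute(query, (rs, rs)).fetchall()


conn = sqlite3.connect(":memory:")
load_pairs(conn, [
    "abc6917833 abc3756968 0.817637\n",
    "abc3756968 abc9493450 1\n",
    "abc9483504 abc7452703 1\n",
])
result = rows_for(conn, "abc3756968")
for row in result:
    print("\t".join(row))
```

The value column is stored as text so that it is printed exactly as it appeared in the input.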
