Python: comparing huge text files based on multiple keys

    matched = open('matchedrecords.txt', 'w')
    with open('srcone.txt') as b:
        blines = set(b)
    with open('srctwo.txt') as a:
        alines = set(a)
    with open('notInfirstSource.txt', 'w') as result:
        for line in alines:
            if line not in blines:
                result.write(line)
            else:
                matched.write(line)
    with open('notInsecondSource.txt', 'w') as non:
        for lin in blines:
            if lin not in alines:
                non.write(lin)
    matched.close()

3 Answers

User

#1 · edited 2024-09-24 00:33:57

Here is one way to compare lines based on keys/columns, though I am not sure how efficient it is.

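    # List of columns (keys) to compare, as zero-based indexes
    list_of_columns_to_compare = [7, 8, 9]

    with open('srcone.txt') as b:
        blines = set(b)
    with open('srctwo.txt') as a:
        alines = set(a)

    with open('matchedrecords.txt', 'w') as matched:
        for blin in blines:
            b_fields = blin.rstrip('\n').split('|')
            # Collect the key columns of this line once, before the inner loop
            b_columns = [b_fields[column_no] for column_no in list_of_columns_to_compare]
            for alin in alines:
                a_fields = alin.rstrip('\n').split('|')
                # Rebuild the list for every line so values do not accumulate
                a_columns = [a_fields[column_no] for column_no in list_of_columns_to_compare]
                # Compare only after all key columns have been collected
                if a_columns == b_columns:
                    matched.write(blin.rstrip('\n') + " = " + alin)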
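The nested loop above compares every line of one file with every line of the other, so the work grows with the product of the two line counts. A minimal sketch of one way around that, assuming `|`-delimited lines and the zero-based key columns 7, 8 and 9 from the code above: index one side by its key tuple first, then look each line of the other side up. The names `key_of` and `match_lines` are hypothetical, introduced only for this illustration.

```python
from collections import defaultdict

KEY_COLUMNS = (7, 8, 9)  # assumed zero-based key columns, as in the answer above


def key_of(line, columns=KEY_COLUMNS, sep='|'):
    """Return the tuple of key fields for one delimited line."""
    fields = line.rstrip('\n').split(sep)
    return tuple(fields[i] for i in columns)


def match_lines(a_lines, b_lines):
    """Yield (b_line, a_line) pairs whose key columns are equal."""
    # Group the lines of one side by key so each lookup is a dictionary access
    index = defaultdict(list)
    for alin in a_lines:
        index[key_of(alin)].append(alin)
    # Each line of the other side is then matched without rescanning a_lines
    for blin in b_lines:
        for alin in index.get(key_of(blin), ()):
            yield blin, alin
```

Both arguments can be open file objects, so the second file is read line by line instead of being loaded into a set.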

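User

#2 · edited 2024-09-24 00:33:57

In the end, I was able to do this in very little time using dictionaries: comparing a 370 MB data file against a 270 MB data file takes at most 50 seconds (using tuples as keys). The script is below: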

    TmpDict = {}
    TmpDict2 = {}
    with open("fileA", 'r') as reader:
        for line in reader:
            line = line.strip()
            TmpArr = line.split('|')
            # Form a dictionary keyed by a tuple of the columns below
            TmpDict[TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14]] = line
    with open("fileB", 'r') as reader2:
        for line in reader2:
            line = line.strip()
            TmpArr = line.split('|')
            TmpDict2[TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14]] = line
    # Lines of fileA: either matched by key, or missing from fileB
    with open('MatchedRecords.txt', 'w') as outfile, open('notInB', 'w') as outfileNonMatchedB:
        for k, v in TmpDict.items():  # was iteritems(), which exists only in Python 2
            if k in TmpDict2:
                outfile.write(v + '\n')
            else:
                outfileNonMatchedB.write(v + '\n')
    # Lines of fileB whose key never appears in fileA
    with open('notInA', 'w') as outfileNonMatchedA:
        for k, v in TmpDict2.items():
            if k not in TmpDict:
                outfileNonMatchedA.write(v + '\n')

Is there anything that could be improved? Suggestions are welcome. Thanks!
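One possible refinement, offered as a sketch rather than a measured improvement: only one file needs to live in a dictionary, because the other can be streamed line by line while a set records which keys were seen. The function name `compare_streaming`, its parameters, and the demo data are hypothetical. Note one behavioral difference: when a key repeats in the streamed file, every such line is written, whereas the dictionary version keeps only the last line per key.

```python
import io


def key_of(line, columns, sep='|'):
    """Return the tuple of key fields for one delimited line."""
    fields = line.split(sep)
    return tuple(fields[i] for i in columns)


def compare_streaming(file_a, file_b, matched_out, only_a_out, only_b_out,
                      columns=(2, 3, 11, 12, 13, 14)):
    """Hold file_b in memory, stream file_a, and write three result files."""
    b_by_key = {}
    for line in file_b:
        line = line.strip()
        b_by_key[key_of(line, columns)] = line

    seen = set()  # keys of file_b that also occur in file_a
    for line in file_a:
        line = line.strip()
        k = key_of(line, columns)
        if k in b_by_key:
            matched_out.write(line + '\n')
            seen.add(k)
        else:
            only_a_out.write(line + '\n')

    # Whatever was never seen while streaming file_a exists only in file_b
    for k, line in b_by_key.items():
        if k not in seen:
            only_b_out.write(line + '\n')


# Tiny in-memory demo with two key columns instead of six
src_a = io.StringIO('k1|x|a1\nk2|y|a2\n')
src_b = io.StringIO('k2|y|b2\nk3|z|b3\n')
matched, only_a, only_b = io.StringIO(), io.StringIO(), io.StringIO()
compare_streaming(src_a, src_b, matched, only_a, only_b, columns=(0, 1))
```

With real data the five arguments would be open file objects, and it probably makes sense to pass the smaller file as `file_b` since that is the one kept in memory.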

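User

#3 · edited 2024-09-24 00:33:57

Taking a hint from the recipe for KeyedSets on ActiveState, you can build a set-like class keyed on the columns of interest and then simply use set intersection and set difference to produce the results: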

import collections.abc

# collections.Set was an alias that is gone in Python 3.10+; use collections.abc.Set
class Set(collections.abc.Set):
    @staticmethod
    def key(s):
        # Lines are considered equal when columns 6 to 9 are equal
        return tuple(s.split('|')[6:10])
    def __init__(self, it): self._dict = {self.key(s): s for s in it}
    def __len__(self): return len(self._dict)
    def __iter__(self): return iter(self._dict.values())
    def __contains__(self, value): return self.key(value) in self._dict

data = {}
for filename in 'srcone.txt', 'srctwo.txt':
    with open(filename) as f:
        data[filename] = Set(f)

with open('notInFirstSource.txt', 'w') as f:
    for line in data['srctwo.txt'] - data['srcone.txt']:
        f.write(line)

with open('notInSecondSource.txt', 'w') as f:
    for line in data['srcone.txt'] - data['srctwo.txt']:
        f.write(line)

with open('matchedrecords.txt', 'w') as f:
    for line in data['srcone.txt'] & data['srctwo.txt']:
        f.write(line)
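To see how such a keyed set behaves, here is a small in-memory sketch. The class name `KeyedLines`, the `columns` attribute, and the sample lines are assumptions made for the demo, and the key uses the first two columns rather than columns 6 to 9 so the data stays short.

```python
from collections.abc import Set


class KeyedLines(Set):
    """Set of lines where membership is decided by selected columns only."""

    columns = slice(0, 2)  # assumed key columns for this demo

    @classmethod
    def key(cls, line):
        return tuple(line.rstrip('\n').split('|')[cls.columns])

    def __init__(self, lines):
        # Later lines with the same key replace earlier ones
        self._dict = {self.key(line): line for line in lines}

    def __len__(self):
        return len(self._dict)

    def __iter__(self):
        return iter(self._dict.values())

    def __contains__(self, line):
        return self.key(line) in self._dict


first = KeyedLines(['a|1|from first\n', 'b|2|from first\n'])
second = KeyedLines(['b|2|from second\n', 'c|3|from second\n'])

only_first = sorted(first - second)    # lines whose key is missing from second
only_second = sorted(second - first)   # lines whose key is missing from first
common_keys = sorted(KeyedLines.key(line) for line in first & second)
```

Because two lines with the same key count as equal even when their other columns differ, it is safer not to rely on which file's copy of a matched line comes out of `&`. If that matters, iterate over the chosen file explicitly and test each line with `in`.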
