Comparing huge text files in Python based on multiple keys

Published 2024-09-24 00:33:57


I have two text files of about 1 GB each, with 60 columns per line. In each file, 6 of the columns are the keys to compare on.

Example:

file1: 4|null|null|null|null|null|3590740374739|20077|7739662|75414741|

file2: 4|null|11|333|asdsd|null|3590740374739|20077|7739662|75414741|

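Here the two lines are equal, because columns 7, 8, 9 and 10 (the keys) are the same in both files. I tried a file comparison example that does not take keys into account, and it works well, but I need to compare based on the keys rather than character by character across each line.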
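To make the matching rule concrete, here is a minimal sketch that applies it to the two sample lines above. The helper name `key_of` and the zero-based indices `(6, 7, 8, 9)` are assumptions chosen to match columns 7 to 10 of the example, not part of any existing code.

```python
def key_of(line, key_columns=(6, 7, 8, 9)):
    """Return the key columns of a '|'-separated line as a tuple."""
    fields = line.rstrip("\n").split("|")
    return tuple(fields[i] for i in key_columns)

line1 = "4|null|null|null|null|null|3590740374739|20077|7739662|75414741|"
line2 = "4|null|11|333|asdsd|null|3590740374739|20077|7739662|75414741|"

# The lines differ as strings but share the same key tuple
same_text = line1 == line2
same_key = key_of(line1) == key_of(line2)
print(same_text, same_key)
```

Tuples are hashable, so a key built this way can be stored in a set or used as a dictionary key.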

Below is the code I use to compare the files without taking the keys into account.

with open('srcone.txt') as b:
    blines = set(b)

with open('srctwo.txt') as a:
    alines = set(a)

# Lines of srctwo.txt missing from srcone.txt go to notInfirstSource.txt;
# lines present in both files go to matchedrecords.txt
with open('notInfirstSource.txt', 'w') as result, \
        open('matchedrecords.txt', 'w') as matched:
    for line in alines:
        if line not in blines:
            result.write(line)
        else:
            matched.write(line)

# Lines of srcone.txt missing from srctwo.txt
with open('notInsecondSource.txt', 'w') as non:
    for lin in blines:
        if lin not in alines:
            non.write(lin)

3 answers

Here is one way to compare lines based on keys/columns, though I am not sure how efficient it is.

# Zero-based indices of the columns (keys) to compare,
# i.e. columns 7-10 in the question's example
list_of_columns_to_compare = [6, 7, 8, 9]

def key_columns(line):
    # Collect the key columns of one line into a list
    fields = line.rstrip('\n').split('|')
    return [fields[column_no] for column_no in list_of_columns_to_compare]

with open('srcone.txt') as b:
    blines = set(b)
with open('srctwo.txt') as a:
    alines = set(a)

with open('matchedrecords.txt', 'w') as matched:
    # Every line of one file is compared with every line of the other,
    # so the running time grows with len(blines) * len(alines)
    for blin in blines:
        b_columns = key_columns(blin)
        for alin in alines:
            if key_columns(alin) == b_columns:
                matched.write(blin.rstrip('\n') + " = " + alin)

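In the end, I was able to do this in very little time using dictionaries: comparing a 370 MB data file against a 270 MB one takes at most 50 seconds (using tuples as keys). The script is as follows: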

# Zero-based indices of the six key columns
KEY_COLUMNS = (2, 3, 11, 12, 13, 14)

def load(path):
    # Map each key tuple to its line; a repeated key keeps only the last line
    table = {}
    with open(path) as reader:
        for line in reader:
            line = line.strip()
            TmpArr = line.split('|')
            table[tuple(TmpArr[i] for i in KEY_COLUMNS)] = line
    return table

TmpDict = load("fileA")
TmpDict2 = load("fileB")

# Lines of fileA whose key also occurs in fileB, and those whose key does not
with open('MatchedRecords.txt', 'w') as outfile, \
        open('notInB', 'w') as outfileNonMatchedB:
    for k, v in TmpDict.items():
        if k in TmpDict2:
            outfile.write(v + '\n')
        else:
            outfileNonMatchedB.write(v + '\n')

# Lines of fileB whose key does not occur in fileA
with open('notInA', 'w') as outfileNonMatchedA:
    for k, v in TmpDict2.items():
        if k not in TmpDict:
            outfileNonMatchedA.write(v + '\n')

Is there anything that could be improved? Suggestions are welcome. Thanks!
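One possible refinement, offered only as a sketch: index just the first file by key and stream the second one, so that only one file's lines are held in memory at a time. The function name `compare_by_key` and its parameters are hypothetical, and the sketch assumes that keeping a single line per key is acceptable, as in the dictionary script above.

```python
import io

def compare_by_key(src_a, src_b, matched, only_a, only_b, key_columns):
    """Split lines into matched / only-in-A / only-in-B by their key columns."""
    def key_of(line):
        fields = line.rstrip("\n").split("|")
        return tuple(fields[i] for i in key_columns)

    # Index the first source only; a repeated key keeps the last line
    index_a = {}
    for line in src_a:
        index_a[key_of(line)] = line

    seen = set()  # keys of A that also occur in B
    for line in src_b:  # the second source is streamed, never stored
        k = key_of(line)
        if k in index_a:
            if k not in seen:
                seen.add(k)
                matched.write(index_a[k])
        else:
            only_b.write(line)

    for k, line in index_a.items():
        if k not in seen:
            only_a.write(line)

# Tiny in-memory demonstration with the first column as the key
a = io.StringIO("k1|x\nk2|y\n")
b = io.StringIO("k1|z\nk3|w\n")
matched, only_a, only_b = io.StringIO(), io.StringIO(), io.StringIO()
compare_by_key(a, b, matched, only_a, only_b, key_columns=(0,))
print(matched.getvalue(), only_a.getvalue(), only_b.getvalue())
```

With real data, the five file arguments would be objects returned by `open(...)`, and `key_columns` would be the six zero-based indices used in the script above.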

Taking a hint from the recipe for KeyedSets on ActiveState, you can build a set and then simply use set intersection and set difference to produce the results:

from collections.abc import Set as AbstractSet

class Set(AbstractSet):
    # A set of lines in which two lines are equal when their key columns match
    @staticmethod
    def key(s): return tuple(s.split('|')[6:10])
    def __init__(self, it): self._dict = {self.key(s): s for s in it}
    def __len__(self): return len(self._dict)
    def __iter__(self): return iter(self._dict.values())
    def __contains__(self, value): return self.key(value) in self._dict

data = {}
for filename in 'srcone.txt', 'srctwo.txt':
    with open(filename) as f:
        data[filename] = Set(f)

# Set difference: lines whose key occurs only in srctwo.txt
with open('notInFirstSource.txt', 'w') as f:
    for line in data['srctwo.txt'] - data['srcone.txt']:
        f.write(line)

# Set difference: lines whose key occurs only in srcone.txt
with open('notInSecondSource.txt', 'w') as f:
    for line in data['srcone.txt'] - data['srctwo.txt']:
        f.write(line)

# Set intersection: lines whose key occurs in both files
with open('matchedrecords.txt', 'w') as f:
    for line in data['srcone.txt'] & data['srctwo.txt']:
        f.write(line)
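The same intersection and difference can also be sketched without a custom class, since dictionary key views support set operators. The helper names `key_of` and `index_lines` and the sample lines below are made up for illustration; the key slice `[6:10]` is taken from the class above.

```python
def key_of(line):
    """Key columns 7-10 of a '|'-separated line, as a tuple."""
    return tuple(line.split("|")[6:10])

def index_lines(lines):
    """Map each key to its line; a repeated key keeps the last line."""
    return {key_of(line): line for line in lines}

# Made-up sample lines standing in for the two files
first = index_lines([
    "a|b|c|d|e|f|1|2|3|4|\n",
    "a|b|c|d|e|f|5|6|7|8|\n",
])
second = index_lines([
    "x|y|z|d|e|f|1|2|3|4|\n",
    "x|y|z|d|e|f|9|9|9|9|\n",
])

# dict.keys() returns a set-like view, so & and - work on it directly
matched = [first[k] for k in first.keys() & second.keys()]
only_first = [first[k] for k in first.keys() - second.keys()]
only_second = [second[k] for k in second.keys() - first.keys()]
print(matched, only_first, only_second)
```

For the real files, `index_lines` could be given an open file object, because iterating over a file yields its lines.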
