Getting the unique lines in two text files

1 answer

Answer 1 · posted 2024-10-03 04:35:21

I am able to do it till the sorting part but regular expressions doesn't seems to be working in multi-lines.

Your regular expression is fine. You just don't have multi-line text to match against; you have individual lines:

for line in s.readlines():

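file.readlines() reads the whole file into memory as a list of lines. You then iterate over those individual lines, so line will always be a single line such as 'asd\n', and never 'qwe\nqwe\n'.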
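
To make that concrete, here is a small sketch that runs one pattern against each line separately and then against the whole text. The sample text, the pattern, and the use of io.StringIO in place of a real file are all my own illustrative assumptions, not taken from the question.

```python
import io
import re

# Hypothetical sample data; io.StringIO stands in for an open text file.
text = "asd\nqwe\nqwe\n"

# Illustrative pattern: a line immediately followed by an identical line.
pattern = re.compile(r"^(.*)\n\1\n", re.MULTILINE)

# Line by line: each string holds exactly one line, so a pattern that
# needs two lines can never match.
per_line_matches = [
    bool(pattern.search(line)) for line in io.StringIO(text).readlines()
]

# Whole text at once: now the pattern can span the line boundary.
whole_text_match = pattern.search(text)

print(per_line_matches)           # [False, False, False]
print(whole_text_match.group(1))  # qwe
```

So the flag is not the problem; the input handed to the regular expression is.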

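Given that you are reading the whole merged file into memory, I'll assume your files are not that large. In that case, just read one of the files into a set object, then test each line of the other file to find the differences: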

[code block not preserved in the source]
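
A minimal sketch of the set-based approach just described. The names lines and new_in_b are taken from the writelines() snippet that follows; the function name diff_lines, its parameters, and the seen_in_both helper set are hypothetical choices of mine, so treat this as one possible way to write it rather than the original code.

```python
def diff_lines(path_a, path_b):
    """Return (lines only in a, lines only in b); lines keep their newlines.

    Hypothetical helper: the name and signature are illustrative only.
    """
    with open(path_a, 'r') as file_a:
        lines = set(file_a)  # every line of a, as a set

    new_in_b = []
    seen_in_both = set()
    with open(path_b, 'r') as file_b:
        for line in file_b:
            if line in lines:
                # present in both files
                seen_in_both.add(line)
            else:
                # extra line in b
                new_in_b.append(line)

    # whatever is left in the set exists only in a
    lines -= seen_in_both
    return lines, new_in_b
```

Calling lines, new_in_b = diff_lines('a.txt', 'b.txt') would give you a set of the lines only in a and a list of the lines only in b. Only the first file is held in memory as a set; the second is streamed line by line.
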
If you want to write all of these to one file, you can combine the two sequences and write out the sorted list:

with open('c.txt', 'w') as file_c:
    file_c.writelines(sorted(list(lines) + new_in_b))

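Your approach, sorting the lines first, putting them all in one file, and then matching pairs of lines, is possible too. You only need to remember the preceding line; together with the current line, that gives you a pair. Note that you don't need a regular expression for this, just an equality test: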

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    preceding = None
    for line in file_c:
        if preceding is not None and preceding == line:
            # this line pairs with the preceding one: write neither, and
            # clear 'preceding' so we don't check the next line against it
            preceding = None
        else:
            # the preceding line (if there is one) had no partner
            if preceding is not None:
                outfile.write(preceding)
            preceding = line
    # write out the last line if it was not part of a pair
    if preceding is not None:
        outfile.write(preceding)

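Note that this does not read the whole file into memory! Iterating directly over the file gives you individual lines, with the file read into a buffer in chunks. This is a very efficient way to process lines.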
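
If you want to try the pairing logic without touching any files, it can also be written as a generator over any iterable of lines, which keeps the one-line-at-a-time behaviour. This is only a sketch; the name drop_pairs and the sample data are my own.

```python
def drop_pairs(lines):
    """Yield lines from a sorted iterable, dropping adjacent equal pairs.

    Hypothetical helper; works on an open file or any iterable of strings.
    """
    preceding = None
    for line in lines:
        if preceding is not None and preceding == line:
            # pair found: emit neither line
            preceding = None
        else:
            if preceding is not None:
                yield preceding
            preceding = line
    # the final line is emitted only if it was not part of a pair
    if preceding is not None:
        yield preceding


print(list(drop_pairs(["a\n", "b\n", "b\n", "c\n"])))  # ['a\n', 'c\n']
```

Because lines are dropped in pairs, a line that occurs three times leaves one copy behind: drop_pairs(["b\n", "b\n", "b\n"]) yields a single 'b\n'.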

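You can also use tee() from the itertools library to split the file object iterator in two, and iterate over the file two lines at a time: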

from itertools import tee

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    iter1, iter2 = tee(file_c)  # two iterators with shared source
    last = next(iter2, None)  # move second iterator ahead a line
    matched_previous = False  # True if line1 equalled the line before it
    # iterate over each line together with the line that follows it
    for line1, line2 in zip(iter1, iter2):
        if line1 != line2 and not matched_previous:
            # line1 differs from both of its neighbours, so it is unique
            outfile.write(line1)
        matched_previous = line1 == line2
        last = line2
    # write out the last line if it didn't match the preceding one
    if last is not None and not matched_previous:
        outfile.write(last)
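
To see what the tee() and zip() combination actually produces, here is a tiny sketch on an in-memory list. The helper name adjacent_pairs and the sample data are hypothetical.

```python
from itertools import tee


def adjacent_pairs(iterable):
    """Return an iterator of (item, next item) tuples (illustrative helper)."""
    first, second = tee(iterable)
    next(second, None)  # advance the second iterator by one item
    return zip(first, second)


print(list(adjacent_pairs(["a\n", "b\n", "b\n"])))
# [('a\n', 'b\n'), ('b\n', 'b\n')]
```

Note that the final line only ever shows up as the second element of a pair, which is why it has to be handled separately after the loop.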

A third approach is to use itertools.groupby() to group equal lines together. You can then decide what to do with those groups:

from itertools import groupby

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    for line, group in groupby(file_c):
        # group is an iterator of all the lines in c that are equal
        # the same value is already in line, so all we need to do is
        # *count* how many such lines there are:
        count = sum(1 for _ in group)  # get an efficient count
        if count == 1:
            # line is unique, write it out
            outfile.write(line)

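I'm assuming that it doesn't matter if there are two or more copies of the same line. In other words, you don't want pairing, you just want to find the unique lines (those that exist only in a or only in b).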
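
A quick in-memory illustration of that assumption, using the same counting idea. The function name unique_sorted_lines and the sample list are made up for the example.

```python
from itertools import groupby


def unique_sorted_lines(lines):
    """Yield lines that occur exactly once in a sorted iterable.

    Hypothetical helper; groupby() only groups *adjacent* equal items,
    so the input must already be sorted.
    """
    for line, group in groupby(lines):
        if sum(1 for _ in group) == 1:
            yield line


sample = ["a\n", "b\n", "b\n", "b\n", "c\n"]
print(list(unique_sorted_lines(sample)))  # ['a\n', 'c\n']
```

All three copies of 'b\n' are dropped here, not just one pair.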

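If your files are very large but already sorted, you can use a merge-sort approach, without having to merge the two files into one by hand. The heapq.merge() function gives you the lines from multiple files in sorted order, provided each input is sorted individually. Use it together with groupby():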

import heapq
from itertools import groupby

# files a.txt and b.txt are assumed to be sorted already
with open('a.txt', 'r') as file_a, open('b.txt', 'r') as file_b,\
        open('output.txt', 'w') as outfile:
    for line, group in groupby(heapq.merge(file_a, file_b)):
        count = sum(1 for _ in group)
        if count == 1:
            outfile.write(line)

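Again, these approaches only read enough data from each file to fill a buffer. The heapq.merge() iterator only holds two lines in memory at any one time, and the same goes for groupby(). This lets you process files of any size, regardless of memory constraints.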
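
Here is the merge step on its own, with two small sorted lists standing in for the files. The sample data is invented for the illustration.

```python
import heapq
from itertools import groupby

# Hypothetical, already-sorted inputs standing in for a.txt and b.txt.
a_lines = ["apple\n", "cherry\n", "plum\n"]
b_lines = ["banana\n", "cherry\n", "plum\n"]

# heapq.merge() lazily interleaves the inputs into one sorted stream.
merged = list(heapq.merge(a_lines, b_lines))
print(merged)
# ['apple\n', 'banana\n', 'cherry\n', 'cherry\n', 'plum\n', 'plum\n']

# Lines present in both inputs end up adjacent, so groupby() can count them.
only_once = [
    line
    for line, group in groupby(heapq.merge(a_lines, b_lines))
    if sum(1 for _ in group) == 1
]
print(only_once)  # ['apple\n', 'banana\n']
```

One thing to watch: heapq.merge() compares lines with Python's ordinary string ordering, so the files need to be sorted in that same order. A file sorted by a locale-aware tool may not qualify.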
