How to filter out overlapping line pairs in a large file in Python

for i=1:(n-1)   # n is half the number of rows in the big file
    for j=(i+1):n
        if the overlap degree of the ith two rows and the jth two rows is more than 0.25
            delete the jth two rows from the big file
        end
    end
end

with open("iuputfile.txt") as fileobj:
    sets = [set(line.split()) for line in fileobj]

for first_index in range(len(sets) - 4, -2, -2):
    c = len(sets[first_index]) * len(sets[first_index + 1])
    for second_index in range(len(sets) - 2, first_index, -2):
        d = len(sets[second_index]) * len(sets[second_index + 1])
        ab = len(sets[first_index] | sets[second_index]) * len(sets[first_index + 1] | sets[second_index + 1])
        if (ab / (c + d - ab)) > 0.25:
            del sets[second_index]
            del sets[second_index + 1]

with open("outputfile.txt", "w") as fileobj:
    for set_ in sets:
        # order of the set is undefined, so we need to sort each set
        output = " ".join(set_)
        fileobj.write("{0}\n".format(output))

2 Answers

User

#1 · Edited 2024-09-26 21:53:20

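I have been thinking about how to solve this in a better way, without all the reversing and indexing, and I came up with a solution that is longer and more complex, but easier to read, prettier, and easier to maintain and extend, IMHO.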

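First, we need a special kind of list that we can iterate over "correctly" even if one of its items is deleted. Here is a blog post that goes into more detail about how lists and iterators work; reading it will help you understand what is going on here: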

class SmartList(list):
    def __init__(self, *args, **kwargs):
        super(SmartList, self).__init__(*args, **kwargs)
        self.iterators = []

    def __iter__(self):
        return SmartListIter(self)

    def __delitem__(self, index):
        super(SmartList, self).__delitem__(index)
        for iterator in self.iterators:
            iterator.item_deleted(index)

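We extend the built-in list and make it return a custom iterator instead of the default one. Whenever an item in the list is deleted, we call the item_deleted method of every iterator in the self.iterators list. Here is the code for SmartListIter: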

[SmartListIter code listing not preserved in this copy]

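So the iterator adds itself to the list of iterators and removes itself from that list when it is done. If an item with an index lower than the current index is deleted, we decrease the current index by one, so that we do not skip an item the way a normal list iterator would.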

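The next method returns a tuple (index, item) rather than just the item, because that way we no longer have to bother with enumerate when using these classes, which makes things simpler.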
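The SmartListIter listing itself is missing from this copy. The following is a minimal sketch reconstructed only from the description above, written for Python 3; it is not necessarily the answer author's exact code. SmartList is repeated so the block runs on its own, and the `next = __next__` alias is an assumption added so the Python 2 spelling keeps working.

```python
class SmartList(list):
    """List that tells its live iterators when an item is deleted."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.iterators = []

    def __iter__(self):
        return SmartListIter(self)

    def __delitem__(self, index):
        super().__delitem__(index)
        for iterator in self.iterators:
            iterator.item_deleted(index)


class SmartListIter:
    """Reconstructed sketch: yields (index, item) and tolerates deletions.

    Assumes deletions use non-negative integer indexes (no slices).
    """

    def __init__(self, smartlist, index=0):
        self.smartlist = smartlist
        self.index = index
        smartlist.iterators.append(self)  # register for deletion notices

    def __iter__(self):
        return self

    def __next__(self):
        try:
            item = self.smartlist[self.index]
        except IndexError:
            # Exhausted: unregister so the list stops notifying us.
            if self in self.smartlist.iterators:
                self.smartlist.iterators.remove(self)
            raise StopIteration
        index = self.index
        self.index += 1
        return index, item

    next = __next__  # Python 2 spelling of the same method

    def item_deleted(self, index):
        # Only a deletion before the cursor shifts the items not yet visited.
        if index < self.index:
            self.index -= 1


items = SmartList(["a", "b", "c", "d"])
seen = []
for i, item in items:
    seen.append(item)
    if item == "b":
        del items[i]  # deleting the current item does not skip "c"
```

With a plain list, the same loop would jump from "b" straight to "d", because the built-in iterator does not adjust its position after a deletion.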

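So that takes care of having to go backwards, but we would still need a lot of indexing to juggle the four different lines in each loop. Since the lines belong together two by two, let's make a class for that: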

class LinePair(object):
    def __init__(self, pair):
        self.pair = pair
        self.sets = [set(line.split()) for line in pair]
        self.c = len(self.sets[0]) * len(self.sets[1])

    def overlap(self, other):
        ab = float(len(self.sets[0] & other.sets[0]) * \
            len(self.sets[1] & other.sets[1]))
        overlap = ab / (self.c + other.c - ab)
        return overlap

    def __str__(self):
        return "".join(self.pair)

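The pair attribute is a tuple of two lines, read directly from the input file, complete with their newlines. We use it later to write the pair back to a file. We also convert the two lines to sets and compute the c attribute, which is a property of each line pair. Finally, we add a method that computes the overlap between one line pair and another. Note that d is gone, because it is simply the c attribute of the other pair.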
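To make the overlap formula concrete, here is a small worked example. It is a sketch that repeats the same arithmetic as LinePair.overlap in a standalone function; `pair_overlap` and the sample lines are hypothetical names and data chosen for illustration.

```python
def pair_overlap(pair_a, pair_b):
    """Overlap of two line pairs: ab / (c + d - ab).

    Assumes no line is empty, otherwise the denominator can be zero.
    """
    sets_a = [set(line.split()) for line in pair_a]
    sets_b = [set(line.split()) for line in pair_b]
    c = len(sets_a[0]) * len(sets_a[1])  # size product of the first pair
    d = len(sets_b[0]) * len(sets_b[1])  # size product of the second pair
    # Shared tokens on the first lines times shared tokens on the second lines.
    ab = len(sets_a[0] & sets_b[0]) * len(sets_a[1] & sets_b[1])
    return float(ab) / (c + d - ab)


first = ("a b c\n", "x y\n")
second = ("b c d\n", "y z\n")
# c = 3 * 2 = 6, d = 3 * 2 = 6, ab = 2 * 1 = 2, so 2 / (6 + 6 - 2) = 0.2
score = pair_overlap(first, second)
```

A score of 0.2 is below the 0.25 threshold, so this second pair would be kept. Two identical pairs score 1.0.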

And now for the grand finale:

try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

with open("iuputfile.txt") as fileobj:
    pairs = SmartList([LinePair(pair) for pair in izip(fileobj, fileobj)])

for first_index, first_pair in pairs:
    for second_index, second_pair in SmartListIter(pairs, first_index + 1):
        if first_pair.overlap(second_pair) > 0.25:
            del pairs[second_index]

with open("outputfile.txt", "w") as fileobj:
    for index, pair in pairs:
        fileobj.write(str(pair))

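Note how easy the central loop is to read here, and how short it is. If this algorithm needs to change in the future, that will probably be easier with this code than with the other code. izip is used to group the lines of the input file two by two, as explained here.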
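The two-by-two grouping works because both arguments are the same iterator, so each tuple consumes two consecutive lines. A minimal Python 3 sketch, using an in-memory io.StringIO in place of the real input file (`read_line_pairs` is a hypothetical helper name):

```python
import io


def read_line_pairs(fileobj):
    """Group the lines of an open text file into consecutive pairs."""
    # zip pulls from the same iterator twice per tuple: lines 1+2, 3+4, ...
    return list(zip(fileobj, fileobj))


sample = io.StringIO("a b\nx y\nc d\nz w\n")
pairs = read_line_pairs(sample)
```

If the file has an odd number of lines, the last unpaired line is silently dropped.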

User

#2 · Edited 2024-09-26 21:53:20

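Stack Overflow is not here to write your programs for you or to solve general debugging tasks. It is for specific problems that you have tried to solve yourself and could not. You are asking questions that you, as a programmer, should be able to figure out yourself. Start your program like this: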

python -m pdb my_script.py

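You can now step through the script line by line with the n command. To look inside a variable, just type its name. Using this method you will find out why things do not work. There are many other clever things you can do with pdb (the Python debugger), but for this case the n command is enough.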

Please put more effort into solving your problem yourself before asking another question here.

That said, here is what is wrong with your modified script:

[annotated script listing not preserved in this copy]

The errors are:

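Error 1: Intersection is &. Union is |.
Error 2: Since all the variables are integers, the result will be an integer too, unless you are using Python 3. If you are, this is not an error. If you are not, you need to make sure one of the variables is a float to force the result to be a float as well. Hence float(ab).
Error 3: Remember to always work from the back to the front. When you delete sets[second_index], whatever was at sets[second_index + 1] takes its place, so deleting sets[second_index + 1] afterwards deletes what used to be at sets[second_index + 2], which is not what you want. So we delete the largest index first.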
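Putting the three fixes together, the loop from the question could look roughly like this. It is a sketch, not the answer author's listing: the function name `filter_overlapping_pairs` and the `threshold` parameter are hypothetical, the loop order is kept exactly as in the question, and the input is assumed to have an even number of non-empty lines.

```python
def filter_overlapping_pairs(sets, threshold=0.25):
    """Apply the three fixes to the question's loop.

    `sets` holds one token set per line; lines 2k and 2k+1 form a pair.
    """
    sets = list(sets)  # copy so the caller's list is left untouched
    for first_index in range(len(sets) - 4, -2, -2):
        c = len(sets[first_index]) * len(sets[first_index + 1])
        for second_index in range(len(sets) - 2, first_index, -2):
            d = len(sets[second_index]) * len(sets[second_index + 1])
            # Fix 1: intersection (&), not union (|).
            ab = (len(sets[first_index] & sets[second_index])
                  * len(sets[first_index + 1] & sets[second_index + 1]))
            # Fix 2: float division, so Python 2 does not truncate to 0.
            if float(ab) / (c + d - ab) > threshold:
                # Fix 3: delete the higher index first.
                del sets[second_index + 1]
                del sets[second_index]
    return sets


lines = ["a b c", "x y", "a b c", "x y", "q r", "s t"]
kept = filter_overlapping_pairs([set(line.split()) for line in lines])
```

Here the second pair duplicates the first (overlap 1.0) and is removed, while the unrelated third pair stays. Because the outer loop still runs from the back, the result may differ from the forward pseudocode when overlaps chain, for example when pair 1 overlaps pair 2 and pair 2 overlaps pair 3.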
