Limits of hash comparison

User

#1 · edited 2024-10-03 17:21:02

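You can use a set for fast membership lookups. Because a file object can be iterated line by line, passing the open file handle to the set constructor reads the entries into the set one line at a time, without first building an intermediate list in memory.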

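After that, you can use the set difference operator - to efficiently check which hashes are new, and the union operator | to add the newly found elements to the set of known hashes: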

# at program start, init the set of known hashes
# open hashes.txt, read it line by line and add each line to the set
# a set drops duplicate elements automatically
with open("hashes.txt", "r") as f:
    hashes = set(f)

# as new hashes are encountered, use this to check if they have been seen before
def compare_hashes(search_hashes, hashes):
    search_hashes = set(search_hashes)

    # find new hashes
    new_hashes = search_hashes - hashes

    # update list of known hashes
    hashes |= new_hashes

    # write out new hashes
    with open("hashes.txt", "a") as f:
        for h in new_hashes:
            f.write(h)

    return new_hashes, hashes



with open("hashes2.txt", "r") as f:
    new_hashes, hashes = compare_hashes(f, hashes)
    print(new_hashes)

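This answer assumes that both your known entries and your search entries come from files, so the trailing newline is part of what gets matched. If that is not what you want, you can strip the newlines at a small performance cost: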

def strip_newlines(hashes):
    return (h.strip() for h in hashes)

Use it like this:

hashes = set(strip_newlines(f))
new_hashes, hashes = compare_hashes(strip_newlines(f), hashes)
# note: with newlines stripped, compare_hashes must write h + "\n" when appending to the file
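
Once the newlines are stripped, the append step has to put them back, otherwise the new hashes run together on one line in the file. The following is only a sketch of how that could look, not the answer's own code: load_hashes and compare_hashes_stripped are hypothetical names, and the temporary file with made-up entries stands in for hashes.txt.

```python
import os
import tempfile


def load_hashes(path):
    """Read known hashes from a file, one per line, without newlines."""
    with open(path, "r") as f:
        return {line.strip() for line in f if line.strip()}


def compare_hashes_stripped(search_hashes, hashes, path):
    """Return the hashes not seen before; update the set and the file."""
    search = {h.strip() for h in search_hashes if h.strip()}
    new_hashes = search - hashes      # set difference: unseen hashes only
    hashes |= new_hashes              # in-place union: remember them
    with open(path, "a") as f:
        for h in sorted(new_hashes):
            f.write(h + "\n")         # add back the newline that was stripped
    return new_hashes


# Demo against a temporary file (made-up data, stands in for hashes.txt)
tmp_dir = tempfile.mkdtemp()
known_path = os.path.join(tmp_dir, "hashes.txt")
with open(known_path, "w") as f:
    f.write("aaaa\nbbbb\n")

known = load_hashes(known_path)
found = compare_hashes_stripped(["bbbb\n", "cccc\n", "cccc"], known, known_path)
print(found)
```

Because both sides are stripped, an entry matches whether or not the source line ended with a newline.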

User

#2 · edited 2024-10-03 17:21:02

I would do it like this:

class Hashes(object):
    def __init__(self, filename):
        self.filename = filename
        with open(filename, 'rt') as f:             # read the file only once
            self.hashes = set(line.strip() for line in f)

    def add_hash(self, hash):
        if hash not in self.hashes:                 # this is very fast with sets
            self.hashes.add(hash)
            with open(self.filename, 'at') as f:
                print(hash, file=f)                 # write only one hash

hashes = Hashes("hashes.txt")
hashes.add_hash("ff071fdf1e060400")

Because:

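The file is read only once
Checking whether a hash exists is very fast with a set (no need to scan all entries)
Adding a hash writes only the new hash to the file
The class makes it easy to work with several hash files and to clear the cached hashes, and keeping the code organized makes it easier to maintain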

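The downside is that all hashes are held in memory. With many millions of hashes that could cause problems, but until then it is fine, and more than fine if speed matters.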
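
To get a feel for when memory might become a concern, one could measure a sample set. This is a rough sketch under assumptions: approx_set_memory is a hypothetical helper, the sample hashes are made up to have the same shape as "ff071fdf1e060400", and sys.getsizeof only gives an approximate figure that varies between Python versions.

```python
import sys


def approx_set_memory(hashes):
    """Rough estimate: the set's own table plus each string object."""
    return sys.getsizeof(hashes) + sum(sys.getsizeof(h) for h in hashes)


# 100,000 made-up 16-character hex strings
sample = {format(i, "016x") for i in range(100_000)}
total = approx_set_memory(sample)
print(f"{len(sample)} hashes: about {total / 1e6:.1f} MB")
```

Scaling the printed figure to your expected number of hashes gives a ballpark for whether the in-memory set is still comfortable.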

User

#3 · edited 2024-10-03 17:21:02

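I agree with Bert. If you expect a lot of hashes, you are better off using a database. If this only happens locally, a sqlite database is fine. There is a nice ORM library that works with sqlite: Peewee. It has good documentation to get you started.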
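
As a rough illustration of the database approach, here is a minimal sketch that uses Python's built-in sqlite3 module rather than Peewee, so that it runs without extra installs. The table name hashes, the helpers open_store and add_hash, and the in-memory database are all assumptions made for the example.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a database with a single table of unique hashes."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS hashes (hash TEXT PRIMARY KEY)")
    return conn


def add_hash(conn, h):
    """Insert h; return True if it was new, False if already known."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO hashes (hash) VALUES (?)", (h,)
        )
    return cur.rowcount == 1  # 0 rows changed means the hash already existed


conn = open_store()  # pass a file path instead to persist between runs
first = add_hash(conn, "ff071fdf1e060400")
second = add_hash(conn, "ff071fdf1e060400")
print(first, second)
```

The PRIMARY KEY constraint lets the database do the duplicate check through its index, so the hashes do not all have to be loaded into memory.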
