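<p>You can use a <a href="https://docs.python.org/3.6/library/stdtypes.html#set" rel="nofollow">set</a> for fast membership lookups. Because a file object is an iterator over its lines, passing an open file handle to the set constructor reads the entries into the set line by line, without first building an intermediate list in memory.</p>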
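<p>After that, you can use the set difference operator <code>-</code> to efficiently check which hashes are new, and the union operator <code>|</code> (here in its in-place form, <code>|=</code>) to add the newly found elements to the set of known hashes:</p>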
<pre><code># At program start, build the set of known hashes.
# Iterating over the file yields one line per entry; the set drops duplicates.
with open("hashes.txt", "r") as f:
    hashes = set(f)


# As new hashes are encountered, use this to check whether they have been seen before.
def compare_hashes(search_hashes, hashes):
    search_hashes = set(search_hashes)
    # find the hashes that are not yet known
    new_hashes = search_hashes - hashes
    # update the set of known hashes (in place)
    hashes |= new_hashes
    # append the new hashes to the file of known hashes
    with open("hashes.txt", "a") as f:
        for h in new_hashes:
            f.write(h)
    return new_hashes, hashes


with open("hashes2.txt", "r") as f:
    new_hashes, hashes = compare_hashes(f, hashes)
print(new_hashes)
</code></pre>
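<p>To see the two set operations in isolation, here is a minimal sketch that runs without any files on disk. The <code>io.StringIO</code> objects and the sample values are assumptions made for illustration only; they stand in for the open file handles used above.</p>

```python
import io

# io.StringIO objects stand in for open file handles (illustration only).
known_file = io.StringIO("aaa\nbbb\nccc\n")
incoming_file = io.StringIO("bbb\nddd\naaa\neee\n")

# Iterating a file-like object yields lines, trailing newline included.
known = set(known_file)
incoming = set(incoming_file)

new = incoming - known   # set difference: entries not seen before
known |= new             # in-place union: remember them for next time

print(sorted(new))
```

<p>Both operations work on hashed lookups, so their cost grows with the number of entries rather than with the product of the two collection sizes, as it would with nested loops over lists.</p>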
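<p>This answer assumes that both the list of known entries and the entries to search for come from files, so every entry carries a trailing newline that takes part in the comparison. If that is not what you want, you can strip the newlines at a small performance cost:</p>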
<pre><code>strip_newlines = lambda hashes: (h.strip() for h in hashes)
</code></pre>
<p>Use it like this:</p>
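<pre><code># when loading the known hashes
hashes = set(strip_newlines(f))
# when checking a file of candidate hashes
new_hashes, hashes = compare_hashes(strip_newlines(f), hashes)
# note: with stripped entries, compare_hashes should write h + "\n"
# so that the output file keeps one hash per line
</code></pre>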
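<p>The trailing newline matters most when one of the files does not end with a newline. The sketch below shows the mismatch and how stripping avoids it; the sample data and the <code>io.StringIO</code> stand-ins for files are assumptions for illustration, and <code>strip_newlines</code> is redefined here as a regular function so the example is self-contained.</p>

```python
import io


def strip_newlines(lines):
    """Lazily strip surrounding whitespace (including newlines) from each line."""
    return (line.strip() for line in lines)


# The last line of the "known" data has no trailing newline.
known_file = io.StringIO("aaa\nbbb")
search_file = io.StringIO("bbb\nccc\n")

# Without stripping, "bbb" and "bbb\n" are different strings,
# so "bbb\n" is wrongly reported as new.
raw_known = set(known_file)
raw_new = set(search_file) - raw_known

# Rewind and compare again with the newlines stripped.
known_file.seek(0)
search_file.seek(0)
clean_known = set(strip_newlines(known_file))
clean_new = set(strip_newlines(search_file)) - clean_known

print(sorted(raw_new))
print(sorted(clean_new))
```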