<p>Below is an implementation of the greedy algorithm for set cover. On my machine it runs in about a second on the roughly 50,000 words of an English dictionary. The output is not always optimal, but in practice it is often close. You could probably solve your instances to optimality with an external integer programming library, but I don't know whether you want to go in that direction.</p>
<p>The code below dynamically maintains the bipartite graph between ngrams and the texts not yet covered. The subtle part is that, since Python's standard library lacks an intrusive heap, I exploit the fact that keys only ever increase to fake one. Every ngram sits in the heap with a score less than or equal to what it should be. When we pull the minimum, if its score is smaller than it should be, we push it back with the updated value. Otherwise, we know it is the true minimum.</p>
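<p>In isolation, the trick looks like the following minimal sketch. The <code>current_score</code> dict here is a hypothetical stand-in for the live degree counts in the real code; every heap entry is allowed to be stale as long as its key is no larger than the true one:</p>
<pre><code>import heapq

# Hypothetical stand-in for the live, only-increasing keys.
current_score = {'a': 3, 'b': 1, 'c': 2}
# The heap may hold stale entries with smaller keys ('a' is really 3).
heap = [(1, 'a'), (1, 'b'), (2, 'c')]
heapq.heapify(heap)


def pop_true_min(heap, current_score):
    while True:
        key, item = heapq.heappop(heap)
        if key == current_score[item]:
            # The key is up to date, so this is the true minimum.
            return item
        # Stale entry: re-insert with the current key and keep looking.
        heapq.heappush(heap, (current_score[item], item))


print(pop_true_min(heap, current_score))  # prints 'b'
</code></pre>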
<p>This code should produce deterministic output. At each step it chooses the lexicographically smallest ngram among those that cover the maximum number of uncovered texts.</p>
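<p>The determinism comes from comparing heap entries as tuples: the negated coverage count first, then the ngram string itself, so ties go to the lexicographically smallest ngram. A quick illustration:</p>
<pre><code>import heapq

heap = [(-2, 'ter'), (-2, 'dam'), (-1, 'aar')]
heapq.heapify(heap)
print(heapq.heappop(heap))  # (-2, 'dam'): among the -2 ties, 'dam' sorts first
</code></pre>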
<pre><code>import collections
import heapq


def ngrams_from_text(text, n):
    """Return the set of distinct length-n substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def greedy_ngram_cover(texts, n):
    # Bipartite graph: each text knows its ngrams, and each ngram knows
    # the still-uncovered texts it appears in.
    neighbors_of_text = {text: ngrams_from_text(text, n) for text in texts}
    neighbors_of_ngram = collections.defaultdict(set)
    for text, ngrams in neighbors_of_text.items():
        for ngram in ngrams:
            neighbors_of_ngram[ngram].add(text)
    # Min-heap keyed on the negated coverage count; ties break
    # lexicographically on the ngram itself.
    heap = [(-len(neighbors), ngram)
            for (ngram, neighbors) in neighbors_of_ngram.items()]
    heapq.heapify(heap)
    cover = []
    while neighbors_of_text:
        score, ngram = heapq.heappop(heap)
        neighbors = neighbors_of_ngram[ngram]
        if score != -len(neighbors):
            # Stale entry: this ngram lost neighbors since it was pushed.
            # Re-insert it with its current score and try again.
            heapq.heappush(heap, (-len(neighbors), ngram))
            continue
        cover.append(ngram)
        # Remove the newly covered texts from both sides of the graph.
        for neighbor in list(neighbors):
            for neighbor_of_neighbor in neighbors_of_text[neighbor]:
                neighbors_of_ngram[neighbor_of_neighbor].remove(neighbor)
            del neighbors_of_text[neighbor]
    return cover


print(
    greedy_ngram_cover(
        ['amsterdam', 'rotterdam', 'haarlem', 'utrecht', 'groningen'], 3))
</code></pre>
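<p>On the sample input, the first pick should be <code>dam</code>, which covers both <code>amsterdam</code> and <code>rotterdam</code>; the remaining single-coverage ties are broken lexicographically, so the printed cover should be <code>['dam', 'aar', 'cht', 'gen']</code>.</p>
<p>If you do want provably optimal solutions, the integer programming route mentioned above could look like the sketch below. This is only a sketch under the assumption that the open-source PuLP library (with its bundled CBC solver) is installed; it reuses <code>ngrams_from_text</code> from the code above and is not part of the original implementation:</p>
<pre><code>import pulp


def ilp_ngram_cover(texts, n):
    all_ngrams = sorted({g for t in texts for g in ngrams_from_text(t, n)})
    prob = pulp.LpProblem("ngram_cover", pulp.LpMinimize)
    # One binary decision variable per candidate ngram.
    use = {g: pulp.LpVariable(f"use_{i}", cat="Binary")
           for i, g in enumerate(all_ngrams)}
    # Objective: minimize the number of ngrams chosen.
    prob += pulp.lpSum(use.values())
    # Constraint: every text must contain at least one chosen ngram.
    for t in texts:
        prob += pulp.lpSum(use[g] for g in ngrams_from_text(t, n)) >= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [g for g in all_ngrams if use[g].value() > 0.5]
</code></pre>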