<p>Here is a theoretical analysis of @mujjiga's answer:</p>
<p>You can create classes of words that share the same ngram. You have to choose the smallest number of those classes (that is, the smallest number of ngrams) that covers the whole set of words. This is the <a href="https://en.wikipedia.org/wiki/Set_cover_problem" rel="nofollow noreferrer">set cover problem</a>. Unfortunately, this problem is NP-hard (<em>not</em> NP-complete, <em>thanks @mujjiga</em>). (<em>Edit: hence there is no known solution that gives you the expected result in a reasonable time; the greedy algorithm is nearly the best solution, see <a href="https://cs.stackexchange.com/questions/49777/is-greedy-algorithm-the-best-algorithm-for-set-cover-problem">https://cs.stackexchange.com/questions/49777/is-greedy-algorithm-the-best-algorithm-for-set-cover-problem</a>.</em>)</p>
<p>Note that even the greedy algorithm may give weird results. Take the sets <code>{a, b}, {b, c}, {c, d}</code> and the superset <code>{a, b, c, d}</code>. The three subsets are maximal. If you take <code>{b, c}</code> first, you need the two other subsets to cover the superset. If you take <code>{a, b}</code> or <code>{c, d}</code> first, two subsets are enough.</p>
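<p>To make this pitfall concrete, here is a small sketch (the helper <code>greedy_cover</code> and its <code>first</code> parameter are mine, purely illustrative, to force the unlucky first pick):</p>
<pre><code>subsets = [{'a', 'b'}, {'b', 'c'}, {'c', 'd'}]
universe = {'a', 'b', 'c', 'd'}

def greedy_cover(subsets, universe, first=None):
    """Greedy set cover; `first` forces the initial pick to expose the pitfall."""
    remaining = set(universe)
    chosen = []
    while remaining:
        if first is not None:
            best, first = first, None  # forced (unlucky) first pick
        else:
            # standard greedy rule: the set covering the most remaining elements
            best = max(subsets, key=lambda s: len(s &amp; remaining))
        chosen.append(best)
        remaining -= best
    return chosen

print(len(greedy_cover(subsets, universe, first={'b', 'c'})))  # 3 subsets needed
print(len(greedy_cover(subsets, universe, first={'a', 'b'})))  # 2 subsets suffice
</code></pre>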
<p>Let's use the greedy algorithm, and consider the implementation. The code to create the dictionary that maps ngrams to words is straightforward:</p>
<pre><code>all_words = ['amsterdam', 'schiedam', 'werkendam', 'amstelveen', 'schiebroek',
             'werkstad', 'den haag', 'rotjeknor', 'gouda']
n = 3
words_by_ngram = {}
for word in all_words:
    for ngram in (word[i:i+n] for i in range(0, len(word)-n+1)):
        words_by_ngram.setdefault(ngram, set()).add(word)
</code></pre>
<p>The <code>setdefault</code> is equivalent to <code>words_by_ngram[ngram]</code> if the key <code>ngram</code> exists, and creates an empty set for that key otherwise. The complexity is <code>O(|all_words| * |max word length|)</code>.</p>
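<p>For reference, the <code>setdefault</code> call above behaves like this explicit form:</p>
<pre><code># equivalent to words_by_ngram.setdefault(ngram, set()).add(word)
if ngram not in words_by_ngram:
    words_by_ngram[ngram] = set()   # create an empty set for a new key
words_by_ngram[ngram].add(word)     # add the word to the (possibly new) set
</code></pre>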
<p>Now we want to take the ngram with the most words and remove those words from the dictionary. Repeat until you have covered the words you want.</p>
<p>Here is a naive version (a minimal sketch; at each step it picks the ngram that covers the most words not yet covered):</p>
<pre><code>gs = set()                # the chosen ngrams
s = set(all_words)        # the words still to cover
d = dict(words_by_ngram)  # a working copy
while s:
    # take the ngram whose set covers the most remaining words
    ngram, words = max(d.items(), key=lambda item: len(item[1] &amp; s))
    gs.add(ngram)  # add the ngram to the result
    s -= words     # remove the covered words from the target
    del d[ngram]   # this ngram is used up

for ngram in gs:
    print(ngram, '->', words_by_ngram[ngram])
</code></pre>
<p>Output:</p>
<pre><code>ams -> {'amstelveen', 'amsterdam'}
gou -> {'gouda'}
wer -> {'werkstad', 'werkendam'}
rot -> {'rotjeknor'}
dam -> {'amsterdam', 'werkendam', 'schiedam'}
sch -> {'schiebroek', 'schiedam'}
den -> {'den haag'}
</code></pre>
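<p>As a quick sanity check, you can verify that the chosen ngrams really cover every word:</p>
<pre><code># every word must contain at least one of the selected ngrams
assert all(any(ngram in word for ngram in gs) for word in all_words)
</code></pre>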
<p>This second step has a complexity of <code>O(|all_words| * |ngrams|)</code>, because of the loop to find the max and the updates of the dictionary. Hence the overall complexity is <code>O(|all_words| * |ngrams|)</code>.</p>
<p>Using a priority queue can reduce this complexity. Retrieving the best ngram has a cost of <code>O(1)</code>, but updating the priority of an ngram when the <code>len</code> of its word set changes costs <code>O(lg |ngrams|)</code>:</p>
<pre><code>import heapq

class PriorityQueue:
    """Adapted from https://docs.python.org/3/library/heapq.html#priority-queue-implementation-notes
    A priority of 1 invalidates an entry.
    """
    def __init__(self, words_by_ngram):
        self._d = {ngram: [-len(words), (ngram, words)]
                   for ngram, words in words_by_ngram.items()}
        self._pq = list(self._d.values())
        heapq.heapify(self._pq)

    def pop(self):
        """Get the (ngram, words) tuple with the max word count."""
        minus_len, (ngram, words) = heapq.heappop(self._pq)
        while minus_len == 1:  # entry is not valid
            minus_len, (ngram, words) = heapq.heappop(self._pq)
        return ngram, words

    def update(self, ngram, words_to_remove):
        """Remove the words from the sets and update the priorities."""
        del self._d[ngram]
        # collect every ngram of every removed word
        ngrams_to_inspect = set(word[i:i+n]
                                for word in words_to_remove
                                for i in range(0, len(word)-n+1))
        for ngram in ngrams_to_inspect:
            if ngram not in self._d:
                continue
            self._d[ngram][0] = 1  # use the reference to invalidate the entry
            _, (ngram, words) = self._d[ngram]
            words -= words_to_remove
            if words:
                self._d[ngram] = [-len(words), (ngram, words)]  # new entry
                heapq.heappush(self._pq, self._d[ngram])  # add to the pq (O(lg |ngrams|))
            else:  # nothing left: remove it from the dict
                del self._d[ngram]

pq = PriorityQueue(words_by_ngram)
gs = set()
s = set(all_words)  # the target
while s:
    # take the best ngram
    ngram, words = pq.pop()
    gs.add(ngram)  # add the ngram to the result
    s -= words     # remove the words from the target
    # remove the words from the dictionary and update the priorities
    pq.update(ngram, words)
</code></pre>
<p>With this code, the overall complexity comes down to <code>O(|all_words| * lg |ngrams|)</code>. That said, I would be curious to know whether this is faster than the naive version for your 50k items.</p>
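<p>If you want to measure that yourself, a rough harness could look like the sketch below (<code>make_words</code> is a hypothetical generator standing in for your real 50k-item list; it assumes the <code>PriorityQueue</code> class above is in scope):</p>
<pre><code>import random
import string
import time

n = 3

def make_words(count, length=9):
    """Hypothetical stand-in for the real 50k-item word list."""
    return [''.join(random.choices(string.ascii_lowercase, k=length))
            for _ in range(count)]

all_words = make_words(50_000)
words_by_ngram = {}
for word in all_words:
    for ngram in (word[i:i+n] for i in range(len(word) - n + 1)):
        words_by_ngram.setdefault(ngram, set()).add(word)

start = time.perf_counter()
pq = PriorityQueue(words_by_ngram)  # the class defined above
gs, s = set(), set(all_words)
while s:
    ngram, words = pq.pop()
    gs.add(ngram)
    s -= words
    pq.update(ngram, words)
print(f"{len(gs)} ngrams, elapsed: {time.perf_counter() - start:.2f} s")
</code></pre>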