<p>Here is a theoretical analysis of @mujjiga's answer:</p>
<p>You can create classes of words that share the same ngram. You have to choose the smallest number of those classes (that is, the smallest number of ngrams) that covers the whole set of words. This is the <a href="https://en.wikipedia.org/wiki/Set_cover_problem" rel="nofollow noreferrer">set cover problem</a>. Unfortunately, this problem is NP-hard (<em>not</em> NP-complete, <em>thanks @mujjiga</em>). (<em>Edit: hence there is no known solution that gives you the expected result in a reasonable time; the greedy algorithm is nearly the best solution, see <a href="https://cs.stackexchange.com/questions/49777/is-greedy-algorithm-the-best-algorithm-for-set-cover-problem">https://cs.stackexchange.com/questions/49777/is-greedy-algorithm-the-best-algorithm-for-set-cover-problem</a>.</em>)</p>
<p>Note that even the greedy algorithm may give weird results. Take the sets <code>{a, b}, {b, c}, {c, d}</code> and the superset <code>{a, b, c, d}</code>. The three subsets are maximal. If you take <code>{b, c}</code> first, you need the two other subsets to cover the superset. If you take <code>{a, b}</code> or <code>{c, d}</code> first, two subsets are enough.</p>
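<p>To make this pitfall concrete, here is a small sketch (the helper <code>greedy_cover</code> and its <code>first</code> parameter are mine, purely illustrative, to force the unlucky first pick):</p>
<pre><code>subsets = [{'a', 'b'}, {'b', 'c'}, {'c', 'd'}]
universe = {'a', 'b', 'c', 'd'}

def greedy_cover(subsets, universe, first=None):
    """Greedy set cover; `first` forces the initial pick to expose the pitfall."""
    remaining = set(universe)
    chosen = []
    while remaining:
        if first is not None:
            best, first = first, None  # forced (unlucky) first pick
        else:
            # standard greedy rule: the set covering the most remaining elements
            best = max(subsets, key=lambda s: len(s &amp; remaining))
        chosen.append(best)
        remaining -= best
    return chosen

print(len(greedy_cover(subsets, universe, first={'b', 'c'})))  # 3 subsets needed
print(len(greedy_cover(subsets, universe, first={'a', 'b'})))  # 2 subsets suffice
</code></pre>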
<p>Let's use the greedy algorithm, and consider the implementation. The code to create the dictionary that maps ngrams to words is straightforward:</p>
<pre><code>all_words = ['amsterdam', 'schiedam', 'werkendam', 'amstelveen', 'schiebroek',
             'werkstad', 'den haag', 'rotjeknor', 'gouda']
n = 3
words_by_ngram = {}
for word in all_words:
    for ngram in (word[i:i+n] for i in range(0, len(word)-n+1)):
        words_by_ngram.setdefault(ngram, set()).add(word)
</code></pre>
<p>The <code>setdefault</code> is equivalent to <code>words_by_ngram[ngram]</code> if the key <code>ngram</code> exists, and creates an empty set for that key otherwise. The complexity is <code>O(|all_words| * |max word length|)</code>.</p>
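<p>For reference, the <code>setdefault</code> call above behaves like this explicit form:</p>
<pre><code># equivalent to words_by_ngram.setdefault(ngram, set()).add(word)
if ngram not in words_by_ngram:
    words_by_ngram[ngram] = set()   # create an empty set for a new key
words_by_ngram[ngram].add(word)     # add the word to the (possibly new) set
</code></pre>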
<p>Now we want to take the ngram with the most words and remove those words from the dictionary. Repeat until you have covered the words you want.</p>
<p>Here is a naive version (a minimal sketch; at each step it picks the ngram that covers the most words not yet covered):</p>
<pre><code>gs = set()                # the chosen ngrams
s = set(all_words)        # the words still to cover
d = dict(words_by_ngram)  # a working copy
while s:
    # take the ngram whose set covers the most remaining words
    ngram, words = max(d.items(), key=lambda item: len(item[1] &amp; s))
    gs.add(ngram)  # add the ngram to the result
    s -= words     # remove the covered words from the target
    del d[ngram]   # this ngram is used up

for ngram in gs:
    print(ngram, '->', words_by_ngram[ngram])
</code></pre>
<p>Output:</p>
<pre><code>ams -> {'amstelveen', 'amsterdam'}
gou -> {'gouda'}
wer -> {'werkstad', 'werkendam'}
rot -> {'rotjeknor'}
dam -> {'amsterdam', 'werkendam', 'schiedam'}
sch -> {'schiebroek', 'schiedam'}
den -> {'den haag'}
</code></pre>
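<p>As a quick sanity check, you can verify that the chosen ngrams really cover every word:</p>
<pre><code># every word must contain at least one of the selected ngrams
assert all(any(ngram in word for ngram in gs) for word in all_words)
</code></pre>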
<p>This second step has a complexity of <code>O(|all_words| * |ngrams|)</code>, because of the loop to find the max and the updates of the dictionary. Hence the overall complexity is <code>O(|all_words| * |ngrams|)</code>.</p>
<p>Using a priority queue can reduce this complexity. Retrieving the best ngram has a cost of <code>O(1)</code>, but updating the priority of an ngram when the <code>len</code> of its word set changes costs <code>O(lg |ngrams|)</code>:</p>
<pre><code>import heapq

class PriorityQueue:
    """Adapted from https://docs.python.org/3/library/heapq.html#priority-queue-implementation-notes
    A priority of 1 invalidates an entry.
    """
    def __init__(self, words_by_ngram):
        self._d = {ngram: [-len(words), (ngram, words)]
                   for ngram, words in words_by_ngram.items()}
        self._pq = list(self._d.values())
        heapq.heapify(self._pq)

    def pop(self):
        """Get the (ngram, words) tuple with the max word count."""
        minus_len, (ngram, words) = heapq.heappop(self._pq)
        while minus_len == 1:  # entry is not valid
            minus_len, (ngram, words) = heapq.heappop(self._pq)
        return ngram, words

    def update(self, ngram, words_to_remove):
        """Remove the words from the sets and update the priorities."""
        del self._d[ngram]
        # collect every ngram of every removed word
        ngrams_to_inspect = set(word[i:i+n]
                                for word in words_to_remove
                                for i in range(0, len(word)-n+1))
        for ngram in ngrams_to_inspect:
            if ngram not in self._d:
                continue
            self._d[ngram][0] = 1  # use the reference to invalidate the entry
            _, (ngram, words) = self._d[ngram]
            words -= words_to_remove
            if words:
                self._d[ngram] = [-len(words), (ngram, words)]  # new entry
                heapq.heappush(self._pq, self._d[ngram])  # add to the pq (O(lg |ngrams|))
            else:  # nothing left: remove it from the dict
                del self._d[ngram]

pq = PriorityQueue(words_by_ngram)
gs = set()
s = set(all_words)  # the target
while s:
    # take the best ngram
    ngram, words = pq.pop()
    gs.add(ngram)  # add the ngram to the result
    s -= words     # remove the words from the target
    # remove the words from the dictionary and update the priorities
    pq.update(ngram, words)
</code></pre>
<p>With this code, the overall complexity comes down to <code>O(|all_words| * lg |ngrams|)</code>. That said, I would be curious to know whether this is faster than the naive version for your 50k items.</p>
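<p>If you want to measure that yourself, a rough harness could look like the sketch below (<code>make_words</code> is a hypothetical generator standing in for your real 50k-item list; it assumes the <code>PriorityQueue</code> class above is in scope):</p>
<pre><code>import random
import string
import time

n = 3

def make_words(count, length=9):
    """Hypothetical stand-in for the real 50k-item word list."""
    return [''.join(random.choices(string.ascii_lowercase, k=length))
            for _ in range(count)]

all_words = make_words(50_000)
words_by_ngram = {}
for word in all_words:
    for ngram in (word[i:i+n] for i in range(len(word) - n + 1)):
        words_by_ngram.setdefault(ngram, set()).add(word)

start = time.perf_counter()
pq = PriorityQueue(words_by_ngram)  # the class defined above
gs, s = set(), set(all_words)
while s:
    ngram, words = pq.pop()
    gs.add(ngram)
    s -= words
    pq.update(ngram, words)
print(f"{len(gs)} ngrams, elapsed: {time.perf_counter() - start:.2f} s")
</code></pre>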