在字符串列表中查找唯一ngram的最小列表

网友

1楼 · 编辑于 2024-10-03 02:36:06

以下是@mujjiga答案的理论分析：

您可以创建共享同一ngram的单词类。你需要从这些类中选择最小数量的类（即ngram的最小数量）来覆盖整个单词集。这是set cover problem。不幸的是，这个问题是NP难的（不是NP complete，谢谢@mujjiga）。（编辑：因此，没有已知的解决方案可以在合理的时间内为您提供预期的结果。贪婪算法几乎是最好的解决方案（见https://cs.stackexchange.com/questions/49777/is-greedy-algorithm-the-best-algorithm-for-set-cover-problem）。在

请注意，即使贪婪算法也可能给出奇怪的结果。取集合{a, b}, {b, c}, {c, d}和超集{a, b, c, d}。这三个子集是最大的。如果首先使用{b, c}，则需要另外两个子集来覆盖超集。如果使用{a, b}或{}，两个子集就足够了。在

让我们使用贪婪算法，并考虑实现。创建将ngrams映射到单词的字典的代码非常简单：

all_words= ['amsterdam','schiedam','werkendam','amstelveen','schiebroek','werkstad','den haag','rotjeknor','gouda']
n=3
words_by_ngram = {}
for word in all_words:
    for ngram in (word[i:i+n] for i in range(0, len(word)-n+1)):
        words_by_ngram.setdefault(ngram, set()).add(word)

如果键setdefault存在，setdefault等同于{}，否则创建一个空集。这就是O(|all_words|*|len max word|)的复杂性。在

现在，我们想把单词最多的ngram从字典中删除。重复，直到你得到你想要的单词。在

下面是一个简单的版本：

^{pr2}$
输出：
ams -> {'amstelveen', 'amsterdam'} gou -> {'gouda'} wer -> {'werkstad', 'werkendam'} rot -> {'rotjeknor'} dam -> {'amsterdam', 'werkendam', 'schiedam'} sch -> {'schiebroek', 'schiedam'} den -> {'den haag'}
第二步有一个O(|all_words|*|ngrams|)的复杂性，因为要循环查找max和字典的更新。因此，总体复杂性是O(|all_words|*|ngrams|)
使用优先级队列可以降低复杂性。检索最佳ngram的代价是O(1)，但是更新映射到ngram的单词的len具有优先权O(lg |ngrams|)：
import heapq class PriorityQueue: """Adapted from https://docs.python.org/3/library/heapq.html#priority-queue-implementation-notes A prority of 1 invalidates the entries """ def __init__(self, words_by_ngram): self._d = {ngram:[-len(words), (ngram, words)] for ngram, words in words_by_ngram.items()} self._pq = list(self._d.values()) heapq.heapify(self._pq) def pop(self): """get the ngram, words tuple with the max word count""" minus_len, (ngram, words) = heapq.heappop(self._pq) while minus_len == 1: # entry is not valid minus_len, (ngram, words) = heapq.heappop(self._pq) return ngram, words def update(self, ngram, words_to_remove): """remove the words from the sets and update priorities""" del self._d[ngram] ngrams_to_inspect = set(word[i:i+n] for i in range(0, len(word)-n+1) for word in words_to_remove) for ngram in ngrams_to_inspect: if ngram not in self._d: continue self._d[ngram][0] = 1 # use the reference to invalidate the entry [L, (ngram, words)] = self._d[ngram] words -= words_to_remove if words: self._d[ngram] = [-len(words), (ngram, words)] # new entry heapq.heappush(self._pq, self._d[ngram]) # add to the pq (O(lg ngrams)) else: # nothing left: remove it from dict del self._d[ngram] pq = PriorityQueue(words_by_ngram) gs = set() s = set(all_words) # the target while s: # take the the best ngram ngram, words = pq.pop() gs.add(ngram) # add the ngram to the result s -= words # remove the words from the target # remove the words from the dictionary and update priorities pq.update(ngram, words)
使用此代码，总体优先级降为O(|all_words|*|lg ngrams|)。也就是说，我很想知道这是否比你5万件物品的原始版本快。在

网友
2楼 · 编辑于 2024-10-03 02:36:06

上述解决方案失败的原因是
Counter以无序的方式返回三元组，因此，如果您多次运行解决方案，您也将随机获得所需的解决方案
当你找到这个组合的时候，你就返回它，你既不能按长度的顺序去做，也不能在所有的组合中找到满足问题的最佳组合
在这里，我将按照包含三元组列表的最小到最高的顺序进行，然后在找到解决方案后立即返回。在
from itertools import permutations def checkTrigramsPresentInList(trigrams_list,input_list): for input_string in input_list: flag = False for trigram in trigrams_list: if trigram in input_string: flag = True if not flag: return False return True def ngrams(text, n=3): return [text[i:i + n] for i in range(len(text) - n + 1)] def identifying_grams(input_list, n=3): trigrams = [] for item in input_list: trigrams += ngrams(item) len_of_trigrams = len(trigrams) trigrams_unique = list(set(trigrams)) idx =1 correct_tri_lists = [] unique_trigrams_list = [] while idx <= len_of_trigrams: trigrams_lists = permutations(trigrams_unique,idx) for trigrams_list in trigrams_lists: print(trigrams_list) if not trigrams_list in unique_trigrams_list: if checkTrigramsPresentInList(list(trigrams_list),input_list): correct_tri_lists.append(list(trigrams_list)) ##Uncomment below lines if only one combination is needed if correct_tri_lists: return correct_tri_lists unique_trigrams_list.append(trigrams_list) idx = idx+1 list1 = ['amsterdam','rotterdam','haarlem','utrecht','groningen'] print(identifying_grams(list1)) # # Good, we get: ['ter', 'haa', 'utr', 'gro'] list2 = ['amsterdam','schiedam'] print(identifying_grams(list2)) # # Good, we get: ['dam'] list3 = ['amsterdam','schiedam','terwolde','wolstad'] print(identifying_grams(list3))

网友
3楼 · 编辑于 2024-10-03 02:36:06

下面是集合覆盖贪婪算法的一个实现。在我的第二本英语词典里，大约有5万个单词在我的机器上运行。输出并不总是最优的，但在实践中往往是接近的。你也许可以用一个外部的整数编程库来解决你的实例的最优化问题，但我不知道你是否想朝这个方向发展。在

下面的代码动态维护ngram和未覆盖文本的二分图。一个微妙的地方是，由于Python的标准库中缺少一个侵入式堆，所以我利用了这样一个事实：键只会增加到伪堆。每一个ngram都在堆中，分数小于或等于它应该的值。当我们拉取最小值时，如果它小于应该的值，我们就用更新后的值把它放回去。否则，我们知道这是真正的最小值。在

此代码应产生确定的输出。在每一个步骤中，它选择词典编纂最小的ngram来覆盖最大数量的未覆盖文本。在

import collections
import heapq


def ngrams_from_text(text, n):
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def greedy_ngram_cover(texts, n):
    neighbors_of_text = {text: ngrams_from_text(text, n) for text in texts}
    neighbors_of_ngram = collections.defaultdict(set)
    for text, ngrams in neighbors_of_text.items():
        for ngram in ngrams:
            neighbors_of_ngram[ngram].add(text)
    heap = [(-len(neighbors), ngram)
            for (ngram, neighbors) in neighbors_of_ngram.items()]
    heapq.heapify(heap)
    cover = []
    while neighbors_of_text:
        score, ngram = heapq.heappop(heap)
        neighbors = neighbors_of_ngram[ngram]
        if score != -len(neighbors):
            heapq.heappush(heap, (-len(neighbors), ngram))
            continue
        cover.append(ngram)
        for neighbor in list(neighbors):
            for neighbor_of_neighbor in neighbors_of_text[neighbor]:
                neighbors_of_ngram[neighbor_of_neighbor].remove(neighbor)
            del neighbors_of_text[neighbor]
    return cover


print(
    greedy_ngram_cover(
        ['amsterdam', 'rotterdam', 'haarlem', 'utrecht', 'groningen'], 3))

相关问题更多 >

编程相关推荐

热门问题

热门文章