c#算法，用于过滤一组包含在其他短语中的所有短语

2 月，3 周 Questions & Answers 705

给定一组短语，我想筛选包含任何其他短语的所有短语。这里包含的意思是，如果一个短语包含另一个短语的所有单词，那么它应该被过滤掉。短语中单词的顺序无关紧要

到目前为止，我得到的是：

根据每个短语中的单词数对集合进行排序
对于集合中的每个短语X：
1. 对于集合其余部分中的每个短语Y：
  1. 如果X中的所有单词都在Y中，那么X包含在Y中，丢弃Y

如果给出一个大约一万个短语的列表，这是很慢的。还有更好的选择吗

共 (4) 个答案

# 1 楼答案

按内容对短语进行排序，即“Z A”->；'“Z”，那么从最短的短语到较长的短语很容易消除

# 2 楼答案

你的算法在词组的数量上是二次的，这可能是减慢速度的原因。在这里，我用单词索引短语，以在常见情况下得到二次型

# build index
foreach phrase: foreach word: phrases[word] += phrase

# use index to filter out phrases that contain all the words
# from another phrase
foreach phrase:
  foreach word: 
     if first word:
        siblings = phrases[word]
     else
        siblings = siblings intersection phrases[word]
  # siblings now contains any phrase that has at least all our words
  remove each sibling from the output set of phrases  

# done!

# 3 楼答案
您可以建立一个索引，将单词映射到短语，并执行以下操作：
```
let matched = set of all phrases
for each word in the searched phrase
    let wordMatch = all phrases containing the current word
    let matched = intersection of matched and wordMatch
```
在此之后，matched将包含与目标短语中的所有单词匹配的所有短语。通过将matched初始化为只包含words[0]的所有短语集，然后只在words[1..words.length]上迭代，可以很好地优化它。过滤太短而与目标短语不匹配的短语也可以提高性能

除非我弄错了，否则一个简单的实现的最坏情况复杂性（当搜索短语匹配所有短语时）是O(n·m)，其中n是搜索短语中的单词数，m是短语数

# 4 楼答案

这就是寻找一组集合的极小值的问题。朴素的算法和问题定义如下：

set(s for s in sets if not any(other < s for other in sets))

有一些次二次算法可以做到这一点（比如this），但考虑到N是10000，实现的效率可能更重要。最佳方法在很大程度上取决于输入数据的分布。考虑到输入集是主要不同的自然语言短语，redtuna建议的方法应该可以很好地工作。下面是该算法的python实现

from collections import defaultdict

def find_minimal_phrases(phrases):
    # Make the phrases hashable
    phrases = map(frozenset, phrases)

    # Create a map to find all phrases containing a word
    phrases_containing = defaultdict(set)
    for phrase in phrases:
        for word in phrase:
            phrases_containing[word].add(phrase)

    minimal_phrases = []
    found_superphrases = set()
    # in sorted by length order to find minimal sets first thanks to the
    # fact that a.superset(b) implies len(a) > len(b)
    for phrase in sorted(phrases, key=len):
        if phrase not in found_superphrases:
            connected_phrases = [phrases_containing[word] for word in phrase]
            connected_phrases.sort(key=len)
            superphrases = reduce(set.intersection, connected_phrases)
            found_superphrases.update(superphrases)
            minimal_phrases.append(phrase)
    return minimal_phrases

这仍然是二次的，但在我的机器上，它在350ms内运行一组10k短语，其中包含50%的最小值，单词来自指数分布

Python中文网

有 Java 编程相关的问题?

c#算法，用于过滤一组包含在其他短语中的所有短语

共 (4) 个答案

# 1 楼答案

# 2 楼答案

# 3 楼答案

# 4 楼答案