过滤包含在其他短语中的所有短语的算法

网友

1楼 · 编辑于 2024-09-27 07:34:07

您可以建立一个索引，将单词映射到短语，并执行以下操作：

let matched = set of all phrases
for each word in the searched phrase
    let wordMatch = all phrases containing the current word
    let matched = intersection of matched and wordMatch

在此之后，matched将包含与目标短语中所有单词匹配的所有短语。通过将matched初始化为只包含words[0]的所有短语集，然后只迭代words[1..words.length]，可以很好地优化它。过滤短到与目标短语不匹配的短语也可以提高性能。在

除非我弄错了，否则一个简单实现的最坏情况复杂度（当搜索短语匹配所有短语时），其中O(n·m)是搜索短语中的单词数，m是短语的数量。在

网友

2楼 · 编辑于 2024-09-27 07:34:07

你的算法在词组数量上是二次方的，这可能正是它慢下来的原因。在这里，我用词来索引短语，使其在常见情况下低于二次方。在

# build index
foreach phrase: foreach word: phrases[word] += phrase

# use index to filter out phrases that contain all the words
# from another phrase
foreach phrase:
  foreach word: 
     if first word:
        siblings = phrases[word]
     else
        siblings = siblings intersection phrases[word]
  # siblings now contains any phrase that has at least all our words
  remove each sibling from the output set of phrases  

# done!

网友

3楼 · 编辑于 2024-09-27 07:34:07

这就是求一组集合的最小值的问题。这个问题的定义和算法看起来很幼稚：

set(s for s in sets if not any(other < s for other in sets))

有一些次二次算法可以做到这一点（例如this），但是考虑到N是10000，实现的效率可能更重要。最佳方法在很大程度上取决于输入数据的分布。考虑到输入集是自然语言短语，它们大多不同，redtuna建议的方法应该可以很好地工作。下面是该算法的python实现。在

^{pr2}$

这仍然是二次方的，但是在我的机器上，对于一组包含50%最小值的10k短语，它的运行时间为350ms，单词来自指数分布。在

相关问题更多 >

编程相关推荐

热门问题

热门文章