A fast, efficient way to fuzzy-match substrings in Python

import difflib

s = "Difference between a crocodile and an alligator is......."  # Long paragraph, >10000 words
to_search = ["crocodile", "insect", "alligator"]
for i in range(len(to_search)):
    for j in range(len(s)):
        # Window of the same length as the search term, starting at position j
        a = s[j:j+len(to_search[i])]
        match = difflib.SequenceMatcher(None, a, to_search[i]).ratio()
        if match > 0.9:  # 90% similarity
            print(a)

1 answer

User

#1 · Posted 2024-06-23 18:52:25

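One reason this takes so long on a large text is that you slide a window across the entire text once for every word you search for. Most of that work goes into comparing your word against same-length chunks that may contain fragments of several different words.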
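To make the cost concrete, here is a rough sketch that only counts how many `SequenceMatcher` comparisons each approach would perform, without running them. The helper names `count_window_comparisons` and `count_word_comparisons` are made up for this illustration, and splitting on whitespace is an assumption about what counts as a word.

```python
def count_window_comparisons(text, terms):
    # The sliding-window loop starts one window at every character
    # position, and repeats that for every search term.
    return len(text) * len(terms)


def count_word_comparisons(text, terms):
    # Comparing whole words needs one comparison per word per search term.
    return len(text.split()) * len(terms)


if __name__ == "__main__":
    text = "Difference between a crocodile and an alligator is"
    terms = ["crocodile", "insect", "alligator"]
    print("window comparisons:", count_window_comparisons(text, terms))
    print("word comparisons:  ", count_word_comparisons(text, terms))
```

The gap grows with the text: the window count scales with the number of characters, while the word count scales with the number of words.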

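If you are willing to assume that you always want to match single words, you can split the text into words and compare against those instead. That means far fewer comparisons (the number of words, rather than the number of windows starting at every position in the text), and the split only has to be done once, not once per search term. Here is an example: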

import difflib

to_search = ["crocodile", "insect", "alligator"]
s = "Difference between a crocodile and an alligator is"  # Long paragraph, >10000 words
s_words = s.replace(".", " ").split()  # Split on whitespace, with periods removed
for search_for in to_search:
    for s_word in s_words:
        match = difflib.SequenceMatcher(None, s_word, search_for).ratio()
        if match > 0.9:  # 90% similarity
            print(s_word)
            break  # no longer need to continue the search for this word!

This should give you a significant speedup, and hopefully it covers what you need.

Happy coding
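As a possible variation on the same idea (not part of the approach above), the standard library's `difflib.get_close_matches` can do the per-word comparison and thresholding for you, and deduplicating the words first avoids re-comparing repeated words in a long text. This is a minimal sketch: the function name `find_close_words` is hypothetical, and it assumes that stripping periods and splitting on whitespace is good enough tokenization for your text.

```python
import difflib


def find_close_words(text, terms, cutoff=0.9):
    # Deduplicate words so each distinct word is compared only once per term.
    # sorted() just makes the result order deterministic.
    words = sorted(set(text.replace(".", " ").split()))
    results = {}
    for term in terms:
        # get_close_matches returns the words whose similarity ratio is at
        # least `cutoff`, best match first. n must be positive, hence max().
        results[term] = difflib.get_close_matches(
            term, words, n=max(len(words), 1), cutoff=cutoff
        )
    return results


if __name__ == "__main__":
    s = "Difference between a crocodile and an aligator is."
    for term, found in find_close_words(s, ["crocodile", "insect", "alligator"]).items():
        print(term, "->", found)
```

Unlike the loop with `break`, this collects every distinct word above the cutoff for each search term, which may or may not be what you want.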
