Answers to the question: Match all related strings

1 answer

Anonymous, 1 day ago

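I don't think regular expressions are the right approach here. You can, however, do this with a combination of <a href="http://en.wikipedia.org/wiki/Disjoint-set_data_structure" rel="nofollow noreferrer">Union Find</a> and <a href="http://en.wikipedia.org/wiki/Levenshtein_distance" rel="nofollow noreferrer">Minimum Edit Distance</a>. For each pair of words, compute the <code>min_edit_dist</code>, and if the distance is below some threshold, <code>union</code> the two words into the same group. The right threshold will likely depend on the words you are working with. For your words, <code>3</code> or <code>4</code> seems to work well.

<pre><code>import collections, itertools

# 'words' is the list of strings from the question
# initialize 'leaders' dictionary, used in union and find
leaders = {word: None for word in words}

# union similar words together
for u, v in itertools.combinations(words, 2):
    if find(u) != find(v) and min_edit_dist(u, v) < 3:
        union(u, v)

# determine groups of similar words by their leaders
groups = collections.defaultdict(set)
for x in leaders:
    groups[find(x)].add(x)
print(list(groups.values()))
</code></pre>

Output (as printed by Python 2), given suitable implementations of <code>union</code>, <code>find</code>, and <code>min_edit_dist</code>:

<pre><code>[set(['laptop bag']), set(['gruop', 'grop', 'group']), set(['buk', 'book', 'bok']), set(['laftop', 'laptop', 'leptop']), set(['pencil', 'pancil', 'pensil'])]
</code></pre>

For the <code>union</code> and <code>find</code> functions, see <a href="https://stackoverflow.com/a/27850318/1639625">this answer</a>. The implementation of the <code>min_edit_dist</code> function is left as an exercise for the reader.

One possible problem with this approach is that it may end up merging all the groups if there are close-enough variations between them.

<hr/>

Regarding your own approach using <code>difflib.get_close_matches</code>: you can use the <code>cutoff</code> parameter to fine-tune how "close" the matches must be. However, I did not find a value that works for all of your examples, let alone for all the other examples that might exist. <code>0.8</code> works for <code>laptop</code> but is too strict for <code>book</code>. Also note that with this approach you need to know which words are the "root words", which may be a problem in practice.

My approach, on the other hand, does not need to know a priori which words are the "leaders" of a group, but finds them by itself. For similar techniques, you might also want to look at <a href="http://en.wikipedia.org/wiki/Cluster_analysis" rel="nofollow noreferrer">cluster analysis algorithms</a>.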
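For readers who want to try the grouping approach from the answer end to end, here is a minimal, self-contained sketch. It is not the answer author's code: <code>min_edit_dist</code> is one possible textbook Levenshtein implementation, <code>find</code> and <code>union</code> are simple stand-ins for the linked versions, the wrapper name <code>group_similar</code> and its <code>threshold</code> parameter are hypothetical, and the word list is an assumption reconstructed from the output shown in the answer.

```python
import collections
import itertools


def min_edit_dist(a, b):
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def group_similar(words, threshold=3):
    """Group words whose edit distance is below threshold (hypothetical helper)."""
    # Each word maps to its parent; None marks the leader of a group.
    leaders = {word: None for word in words}

    def find(x):
        # Walk up to the leader, then point every visited word at it.
        root = x
        while leaders[root] is not None:
            root = leaders[root]
        while x != root:
            leaders[x], x = root, leaders[x]
        return root

    def union(x, y):
        # Merge the two groups by attaching one leader to the other.
        rx, ry = find(x), find(y)
        if rx != ry:
            leaders[rx] = ry

    for u, v in itertools.combinations(words, 2):
        if find(u) != find(v) and min_edit_dist(u, v) < threshold:
            union(u, v)

    groups = collections.defaultdict(set)
    for x in leaders:
        groups[find(x)].add(x)
    return list(groups.values())


# Assumed input, reconstructed from the output listed in the answer.
WORDS = ["laptop", "laftop", "leptop", "laptop bag", "book", "bok", "buk",
         "group", "grop", "gruop", "pencil", "pancil", "pensil"]

if __name__ == "__main__":
    for group in group_similar(WORDS, threshold=3):
        print(sorted(group))
```

Because <code>union</code> is transitive, two words can land in the same group even when their own distance is at or above the threshold, as long as a chain of close intermediate words connects them. That is the merging risk the answer mentions, so it is worth rerunning the sketch with a few threshold values on your own data before settling on one.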
