Fuzzy string search with Whoosh in Python

from whoosh.index import create_in
from whoosh.fields import *
from whoosh.qparser import QueryParser
from whoosh.query import FuzzyTerm

schema = Schema(name=TEXT(stored=True))
ix = create_in("indexdir", schema)

writer = ix.writer()
test_items = [u"Eagle Bank and Trust Company of Missouri"]
# Index every test item (the original referenced an undefined name `item`)
for item in test_items:
    writer.add_document(name=item)
writer.commit()

with ix.searcher() as s:
    qp = QueryParser("name", schema=ix.schema, termclass=FuzzyTerm)
    q = qp.parse(u"Eagle Bank & Trust Co of Missouri")
    results = s.search(q)
    print(results)

3 Answers

User

#1 · Edited 2024-09-28 23:24:42

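You could match Co with Company using fuzzy search in Whoosh, but you shouldn't, because the difference between Co and Company is large. By the same measure, short fragments such as Be or ny would also count as similar to much longer words, so you can imagine how bad, and how large, the search results would be.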

However, if you want to match near misses such as Compan or compani with Company, you can use a custom subclass of FuzzyTerm whose default maxdist is 2 or more:

maxdist – The maximum edit distance from the given text.

from whoosh.query import FuzzyTerm

class MyFuzzyTerm(FuzzyTerm):
    # Same as FuzzyTerm, but raises the default maxdist to 2
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

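Then pass it to the query parser as the term class:

qp = QueryParser("name", schema=ix.schema, termclass=MyFuzzyTerm)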

You could match Co with Company by setting maxdist to 5, but as I said, this gives bad search results. I suggest keeping maxdist small, starting at 1.
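To see where the figure of 5 comes from, it can help to compute the edit distances directly. The following is a minimal pure-Python sketch of plain Levenshtein distance; the helper name levenshtein is illustrative and not part of Whoosh, and Whoosh's internal matching may differ in detail. Terms are lowercased on the assumption that the field's analyzer lowercases tokens.

```python
def levenshtein(a, b):
    """Return the plain Levenshtein edit distance between strings a and b."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


# Assumption: terms are compared in lowercase, as an analyzer would store them.
pairs = [("co", "company"), ("compan", "company"), ("compani", "company")]
for query, target in pairs:
    print("%s -> %s: %d" % (query, target, levenshtein(query, target)))
```

Under these assumptions, co is 5 edits away from company, while compan and compani are each 1 edit away, which is why a small maxdist catches the near misses but not the abbreviation.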

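If you are looking to match linguistic variations of a word, you are better off with a query type built for that, such as Whoosh's whoosh.query.Variations.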

Note: older versions of Whoosh have minsimilarity instead of maxdist.

User

#2 · Edited 2024-09-28 23:24:42

Perhaps something in here could help (a string matching library open-sourced by the folks at SeatGeek):

https://github.com/seatgeek/fuzzywuzzy
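fuzzywuzzy scores whole strings against each other instead of matching indexed terms. As a rough stand-in that needs no installation, the sketch below uses the standard library's difflib.SequenceMatcher. It is not fuzzywuzzy's API, and the helper similarity and the candidate list are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


query = "Eagle Bank & Trust Co of Missouri"
# Hypothetical candidate names to rank against the query.
candidates = [
    "Eagle Bank and Trust Company of Missouri",
    "Rockies are mountains",
]

# Rank candidates from most to least similar to the query.
ranked = sorted(candidates, key=lambda name: similarity(query, name), reverse=True)
best = ranked[0]
print(best)
```

Whole-string scoring like this tolerates abbreviations such as Co for Company better than a per-term edit distance does, at the cost of comparing the query against every candidate.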

User

#3 · Edited 2024-09-28 23:24:42

For future reference: there is surely a better way to do this, but here is my attempt.

# -*- coding: utf-8 -*-
import whoosh
from whoosh.index import create_in
from whoosh.fields import *
from whoosh.query import *
from whoosh.qparser import QueryParser

schema = Schema(name=TEXT(stored=True))
idx = create_in("C:\\idx_name\\", schema, "idx_name")

writer = idx.writer()

writer.add_document(name=u"This is craaazy shit")
writer.add_document(name=u"This is craaazy beer")
writer.add_document(name=u"Raphaël rocks")
writer.add_document(name=u"Rockies are mountains")

writer.commit()

s = idx.searcher()
print("Fields: %s" % list(s.lexicon("name")))
qp = QueryParser("name", schema=schema, termclass=FuzzyTerm)

for i in range(1,40):
    res = s.search(FuzzyTerm("name", "just rocks", maxdist=i, prefixlength=0))
    if len(res) > 0:
        for r in res:
            print("Potential match ( %s ): [  %s  ]" % (i, r["name"]))
        break
    else:
        print("Pass: %s" % i)

s.close()
