Sqlite with real "Full Text Search" and spelling mistakes (FTS + spellfix together)

Let's say we have 1 million rows like this:
<pre><code>import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'Riemann')")
c.execute("INSERT INTO mytable VALUES (2, 'All the Carmichael numbers')")
</code></pre>
<h2>Background:</h2>
I know how to do the following with Sqlite:
<ul>
<li>Find a row with a single-word query, up to a few spelling mistakes, with the <a href="https://www.sqlite.org/spellfix1.html" rel="noreferrer"><code>spellfix</code></a> module and the Levenshtein distance (I posted a <a href="https://stackoverflow.com/questions/49779281/string-similarity-with-python-sqlite-levenshtein-distance-edit-distance">detailed answer here</a> about how to compile it, how to use it, …). With 1 million rows, computing the distance for every row would be very slow! As <a href="https://dba.stackexchange.com/questions/203679/does-a-levenshtein-distance-involve-some-computation-for-each-single-row">detailed here</a>, <code>postgresql</code> can optimize this with <code>trigrams</code>. A fast solution offered by Sqlite is to use a <code>VIRTUAL TABLE USING spellfix</code>:
<pre><code>c.execute('CREATE VIRTUAL TABLE mytable3 USING spellfix1')
c.execute("INSERT INTO mytable3(word) VALUES ('Riemann')")
c.execute("SELECT * FROM mytable3 WHERE word MATCH 'Riehmand'"); print(c.fetchall())
#Query: 'Riehmand'
#Answer: [('Riemann', 1, 76, 0, 107, 7)], working!
</code></pre></li>
<li>Find a row with a query matching one or more words, using FTS ("Full Text Search"):
<pre><code>c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(id integer, description text)')
c.execute("INSERT INTO mytable2 VALUES (2, 'All the Carmichael numbers')")
c.execute("SELECT * FROM mytable2 WHERE description MATCH 'NUMBERS carmichael'"); print(c.fetchall())
#Query: 'NUMBERS carmichael'
#Answer: [(2, 'All the Carmichael numbers')]
</code></pre>
It is case insensitive, and you can even use a query with two words in the wrong order, etc.: FTS is quite powerful indeed. The drawback is that every query keyword must be spelled correctly, i.e. FTS alone does not tolerate spelling mistakes.</li>
</ul>
<h2>Question:</h2>
How can I do a Full Text Search (FTS) with Sqlite and also allow spelling mistakes, i.e. use "FTS + spellfix" together?
Example:
<ul>
<li>row in the DB: <code>"All the Carmichael numbers"</code></li>
<li>query: <code>"NUMMBER carmickaeel"</code> should match it!</li>
</ul>
How can this be done with Sqlite? It is probably possible, since <a href="https://www.sqlite.org/spellfix1.html" rel="noreferrer">this page</a> states:
<blockquote> Or, it [spellfix] could be used with FTS4 to do full-text search using potentially misspelled words. </blockquote>
Linked question: <a href="https://stackoverflow.com/questions/49779281/string-similarity-with-python-sqlite-levenshtein-distance-edit-distance">String similarity with Python + Sqlite (Levenshtein distance / edit distance)</a>
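The FTS behaviour described in the question (case-insensitive, word order ignored, but no tolerance for typos) can be checked with a minimal Python 3 sketch. It assumes the interpreter's SQLite build has FTS4 enabled; the table name `docs` and the helper `search` are made up for illustration and are not part of the original post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(description)")
with conn:  # commit the inserts when the block exits
    conn.execute("INSERT INTO docs VALUES ('Riemann')")
    conn.execute("INSERT INTO docs VALUES ('All the Carmichael numbers')")


def search(query):
    """Run an FTS4 MATCH query and return the matching descriptions."""
    rows = conn.execute(
        "SELECT description FROM docs WHERE description MATCH ?", (query,)
    )
    return [description for (description,) in rows]


# Correct spelling: case and word order do not matter.
exact = search("NUMBERS carmichael")
# Misspelled keywords: FTS alone finds nothing.
misspelled = search("NUMMBER carmickaeel")
print(exact)
print(misspelled)
```

The first query finds the row while the misspelled one comes back empty, which is the gap the question asks about.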

1 Answer

Anonymous, 1 day ago

The <code>spellfix1</code> documentation actually tells you how to do this. From the <a href="https://www.sqlite.org/spellfix1.html#overview" rel="nofollow noreferrer">Overview section</a>:
<blockquote> If you intend to use this virtual table in cooperation with an FTS4 table (for spelling correction of search terms) then you might extract the vocabulary using an <a href="https://www.sqlite.org/fts3.html#fts4aux" rel="nofollow noreferrer">fts4aux</a> table:
<pre><code>INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';
</code></pre> </blockquote>
The <code>SELECT term from search_aux WHERE col='*'</code> statement <a href="https://stackoverflow.com/questions/29997385/how-do-you-extract-all-the-tokens-in-a-sqlite-fts-table">extracts all the indexed tokens</a>.

Connecting this with your examples, where <code>mytable2</code> is your fts4 virtual table, you can create an <code>fts4aux</code> table and insert those tokens into your <code>mytable3</code> spellfix1 table with:
<pre><code>CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2);
INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*';
</code></pre>
You probably want to further qualify that query to skip any terms already inserted into spellfix1, otherwise you end up with duplicate entries; the full example further down does so with an extra <code>term not in (SELECT word from spellfix1data_vocab)</code> condition.

Now you can use <code>mytable3</code> to map misspelled words to corrected tokens, then use those corrected tokens in a <code>MATCH</code> query against <code>mytable2</code>.

Depending on your needs, this may mean you have to do your own token handling and query building; there is no exposed fts4 query syntax parser. So your two-token search string would need to be split, each token run through the <code>spellfix1</code> table to map it to an existing token, and those tokens then fed to the fts4 query.

Leaving query-syntax handling aside, doing the splitting in Python is simple enough:
<pre><code>def spellcheck_terms(conn, terms):
    cursor = conn.cursor()
    base_spellfix = """
        SELECT :term{0} as term, word FROM mytable3
        WHERE word MATCH :term{0} and top=1
    """
    terms = terms.split()
    params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
    query = " UNION ".join([
        base_spellfix.format(i + 1) for i in range(len(params))])
    cursor.execute(query, params)
    correction_map = dict(cursor)
    return " ".join([correction_map.get(t, t) for t in terms])

def spellchecked_search(conn, terms):
    corrected_terms = spellcheck_terms(conn, terms)
    cursor = conn.cursor()
    fts_query = 'SELECT * FROM mytable2 WHERE mytable2 MATCH ?'
    cursor.execute(fts_query, (corrected_terms,))
    return cursor.fetchall()
</code></pre>
This then returns <code>[('All the Carmichael numbers',)]</code> for <code>spellchecked_search(db, "NUMMBER carmickaeel")</code>.

Keeping the spellcheck handling in Python then allows you to support more complex FTS queries as needed; you may have to <a href="https://github.com/mackyle/sqlite/blob/master/ext/fts3/fts3_expr.c#L49-L63" rel="nofollow noreferrer">reimplement the expression parser</a> to do so, but at least Python gives you the tools to do just that.

A full example, packaging up the above approach in a class, which simply extracts terms as alphanumeric character sequences (which, by my reading of the expression syntax specs, suffices):
<pre><code>import re
import sqlite3
import sys

class FTS4SpellfixSearch(object):
    def __init__(self, conn, spellfix1_path):
        self.conn = conn
        self.conn.enable_load_extension(True)
        self.conn.load_extension(spellfix1_path)

    def create_schema(self):
        self.conn.executescript(
            """
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data
                USING fts4(description text);
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data_terms
                USING fts4aux(fts4data);
            CREATE VIRTUAL TABLE IF NOT EXISTS spellfix1data
                USING spellfix1;
            """
        )

    def index_text(self, *text):
        cursor = self.conn.cursor()
        with self.conn:
            params = ((t,) for t in text)
            cursor.executemany("INSERT INTO fts4data VALUES (?)", params)
            cursor.execute(
                """
                INSERT INTO spellfix1data(word)
                SELECT term FROM fts4data_terms
                WHERE col='*' AND
                    term not in (SELECT word from spellfix1data_vocab)
                """
            )

    # fts3 / 4 search expression tokenizer
    # no attempt is made to validate the expression, only
    # to identify valid search terms and extract them.
    # the fts3/4 tokenizer considers any alphanumeric ASCII character
    # and character in the range U+0080 and over to be terms.
    if sys.maxunicode == 0xFFFF:
        # UCS2 build, keep it simple, match any UTF-16 codepoint 0080 and over
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\uffff]+")
    else:
        # UCS4
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\U0010FFFF]+")

    def _terms_from_query(self, search_query):
        """Extract search terms from a fts3/4 query

        Returns a list of terms and a template such that
        template.format(*terms) reconstructs the original query.

        terms using partial* syntax are ignored, as you can't distinguish
        between a misspelled prefix search that happens to match existing
        tokens and a valid spelling that happens to have 'near' tokens in
        the spellfix1 database that would not otherwise be matched by fts4

        """
        template, terms, lastpos = [], [], 0
        for match in self._fts4_expr_terms.finditer(search_query):
            token, (start, end) = match.group(), match.span()
            # skip columnname: and partial* terms by checking next character
            ismeta = search_query[end:end + 1] in {":", "*"}
            # skip digits if preceded by "NEAR/"
            ismeta = ismeta or (
                token.isdigit() and template and template[-1] == "NEAR"
                and "/" in search_query[lastpos:start])
            if token not in {"AND", "OR", "NOT", "NEAR"} and not ismeta:
                # full search term, not a keyword, column name or partial*
                terms.append(token)
                token = "{}"
            template += search_query[lastpos:start], token
            lastpos = end
        template.append(search_query[lastpos:])
        return terms, "".join(template)

    def spellcheck_terms(self, search_query):
        cursor = self.conn.cursor()
        base_spellfix = """
            SELECT :term{0} as term, word FROM spellfix1data
            WHERE word MATCH :term{0} and top=1
        """
        terms, template = self._terms_from_query(search_query)
        params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
        query = " UNION ".join(
            [base_spellfix.format(i + 1) for i in range(len(params))]
        )
        cursor.execute(query, params)
        correction_map = dict(cursor)
        return template.format(*(correction_map.get(t, t) for t in terms))

    def search(self, search_query):
        corrected_query = self.spellcheck_terms(search_query)
        cursor = self.conn.cursor()
        fts_query = "SELECT * FROM fts4data WHERE fts4data MATCH ?"
        cursor.execute(fts_query, (corrected_query,))
        return {
            "terms": search_query,
            "corrected": corrected_query,
            "results": cursor.fetchall(),
        }
</code></pre>
and an interactive demo using the class:
<pre><code>>>> db = sqlite3.connect(":memory:")
>>> fts = FTS4SpellfixSearch(db, './spellfix')
>>> fts.create_schema()
>>> fts.index_text("All the Carmichael numbers")  # your example
>>> from pprint import pprint
>>> pprint(fts.search('NUMMBER carmickaeel'))
{'corrected': 'numbers carmichael',
 'results': [('All the Carmichael numbers',)],
 'terms': 'NUMMBER carmickaeel'}
>>> fts.index_text(
...     "They are great",
...     "Here some other numbers",
... )
>>> pprint(fts.search('here some'))  # edgecase, multiple spellfix matches
{'corrected': 'here some',
 'results': [('Here some other numbers',)],
 'terms': 'here some'}
>>> pprint(fts.search('NUMMBER NOT carmickaeel'))  # using fts4 query syntax
{'corrected': 'numbers NOT carmichael',
 'results': [('Here some other numbers',)],
 'terms': 'NUMMBER NOT carmickaeel'}
</code></pre>
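If the spellfix1 extension cannot be compiled or loaded, the same "correct each term against the fts4aux vocabulary, then MATCH" flow can be tried out with a stand-in corrector. The sketch below is not what the answer above uses: it swaps spellfix1 for `difflib.get_close_matches` from the Python standard library, whose similarity ratio differs from spellfix1's edit-distance scoring and which scans the whole vocabulary in Python, so it is unlikely to scale to large vocabularies. It assumes an SQLite build with FTS4; the names `docs`, `docs_terms`, `load_vocabulary`, `correct_terms` and the `cutoff` value are illustrative assumptions.

```python
import difflib
import sqlite3


def load_vocabulary(conn, aux_table):
    """Return every token indexed by the FTS4 table behind `aux_table`."""
    # aux_table is an identifier and cannot be bound as a parameter,
    # so only pass trusted names here.
    rows = conn.execute(
        "SELECT term FROM {} WHERE col = '*'".format(aux_table)
    )
    return [term for (term,) in rows]


def correct_terms(vocabulary, query, cutoff=0.6):
    """Replace each whitespace-separated term by its closest known token."""
    corrected = []
    for term in query.lower().split():
        matches = difflib.get_close_matches(
            term, vocabulary, n=1, cutoff=cutoff
        )
        # Keep the term unchanged when nothing is similar enough.
        corrected.append(matches[0] if matches else term)
    return " ".join(corrected)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(description)")
conn.execute("CREATE VIRTUAL TABLE docs_terms USING fts4aux(docs)")
with conn:  # commit so the indexed terms are flushed before reading them
    conn.execute("INSERT INTO docs VALUES ('All the Carmichael numbers')")

vocabulary = load_vocabulary(conn, "docs_terms")
corrected = correct_terms(vocabulary, "NUMMBER carmickaeel")
rows = conn.execute(
    "SELECT description FROM docs WHERE docs MATCH ?", (corrected,)
).fetchall()
print(corrected, rows)
```

Like the first `spellcheck_terms` function in the answer, this splits on whitespace only, so FTS operators such as `NOT` or `NEAR` would need the template-based term extraction from the class above.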
