Answers to: Efficient fuzzy string comparison across thousands of text files



<p>I need to search for a set of names in a few thousand plain-text files. I generate trigrams to preserve context. I need to account for minor misspellings, so I use a Levenshtein distance calculation, the function lev(). The end result is a list of hits on the names. My Python program works as expected, but it is very slow. I am looking for a faster way to perform this search, ideally in Python, but my Google-fu has failed me. A generic version of the program follows:</p>
<pre><code>from sklearn.feature_extraction.text import CountVectorizer
import os

# lev(a, b) is the Levenshtein distance function mentioned above (definition not shown)

textfiles = []
ngrams = []
hitlist = []
path = 'path of folder of textfiles'
names = ['john james doe', 'jane jill doe']

vectorizer = CountVectorizer(input='filename', ngram_range=(3, 3),
                             strip_accents='unicode', stop_words='english',
                             token_pattern=r'[a-zA-Z\-]\w*', encoding='utf-8',
                             decode_error='replace', lowercase=True)
ngramer = vectorizer.build_analyzer()

# Store full paths so that files found in subfolders can be opened
for dirpath, dirnames, filenames in os.walk(path):
    for filename in filenames:
        if filename.endswith('.txt'):
            textfiles.append(os.path.join(dirpath, filename))

# Build the set of word trigrams for each file
for textfile in textfiles:
    ngrams.append(set(ngramer(textfile)))

for name in names:
    splitname = name.split()
    for textfile, filegrams in zip(textfiles, ngrams):
        tempset = set()
        for k, part in enumerate(splitname):
            if k == 0:
                ## subset only the trigrams that "match" first name
                for trigram in filegrams:
                    for word in trigram.split():
                        if lev(part, word) < 2:
                            tempset.add(trigram)
            else:
                ## search that subset for middle/last name
                if tempset:
                    for trigram in tempset:
                        for word in trigram.split():
                            if lev(part, word) < 2:
                                hitlist.append([name, textfile, trigram])

print(hitlist)  ## eventually save to CSV
</code></pre>
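The program calls lev() without showing its definition. As a point of reference, here is a minimal sketch of what such a function might look like, assuming it is the standard unit-cost Levenshtein distance (insertions, deletions, and substitutions each cost 1). The body is an assumption for illustration, not the asker's actual code.

```python
def lev(a, b):
    """Return the Levenshtein (edit) distance between strings a and b.

    Hypothetical stand-in for the lev() used in the question.
    """
    if a == b:
        return 0
    if len(a) < len(b):
        a, b = b, a  # keep b as the shorter string so each row stays small
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]
```

With this definition, the test `lev(part, word) < 2` in the program accepts exact matches and words that are one edit away, such as "jane" and "jan". A swapped pair of letters, such as "john" and "jhon", counts as two edits and is rejected.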


1 answer

Anonymous, 1 day ago