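<p>I need to search a few thousand plain-text files for a list of names. I generate word trigrams to preserve context, and to allow for minor misspellings I compare words using a Levenshtein distance function, lev(). Each hit ultimately corresponds to one name. My Python program works as intended, but it is very slow. I am looking for a faster way to do this search, preferably in Python, but my Google-fu has failed me. A generic, vetted version of the program is below:</p>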
<pre><code>import os
from sklearn.feature_extraction.text import CountVectorizer

# NOTE: lev(a, b) is assumed to be defined elsewhere and to return the
# Levenshtein distance between two strings.

textfiles = []
ngrams = []
hitlist = []
path = 'path of folder of textfiles'
names = ['john james doe', 'jane jill doe']

# Word-level trigram analyzer that reads each document from a file name
vectorizer = CountVectorizer(input='filename', ngram_range=(3, 3),
                             strip_accents='unicode', stop_words='english',
                             token_pattern=r'[a-zA-Z\-]\w*',
                             encoding='utf-8', decode_error='replace',
                             lowercase=True)
ngramer = vectorizer.build_analyzer()

# Collect the full path of every .txt file under path (including subfolders)
for dirpath, dirnames, filenames in os.walk(path):
    for filename in filenames:
        if filename.endswith('.txt'):
            textfiles.append(os.path.join(dirpath, filename))

# Build the set of unique trigrams for each file
for textfile in textfiles:
    ngrams.append(set(ngramer(textfile)))

for name in names:
    splitname = name.split()
    for textfile, filegrams in zip(textfiles, ngrams):
        tempset = set()
        for k, part in enumerate(splitname):
            if k == 0:
                ## subset only the trigrams that "match" first name
                for trigram in filegrams:
                    for word in trigram.split():
                        if lev(part, word) < 2:
                            tempset.add(trigram)
            else:
                ## search that subset for middle/last name
                for trigram in tempset:
                    for word in trigram.split():
                        if lev(part, word) < 2:
                            hitlist.append([name, textfile, trigram])
print(hitlist)  ## eventually save to CSV
</code></pre>
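<p>The program above calls lev() without showing it. For anyone who wants to run the code, here is a minimal sketch of such a function, assuming it is a plain Levenshtein edit distance over characters. This is a hypothetical stand-in, not necessarily the implementation used in the original program.</p>

```python
def lev(a: str, b: str) -> int:
    """Levenshtein edit distance between a and b.

    Hypothetical stand-in for the lev() helper the post refers to.
    """
    if a == b:
        return 0
    if len(a) < len(b):
        a, b = b, a  # keep b as the shorter string so each row stays small

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

<p>With this definition, a test such as lev("john", "jon") &lt; 2 passes, which matches the "at most one edit" threshold used in the loops above.</p>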