数千个文本文件的高效模糊字符串比较

from sklearn.feature_extraction.text import CountVectorizer import os textfiles = [] newgrams = set() ngrams = [] hitlist = [] path = 'path of folder of textfiles' names = ['john james doe', 'jane jill doe'] vectorizer = CountVectorizer(input = 'filename', ngram_range = (3,3), strip_accents='unicode', stop_words='english', token_pattern='[a-zA-Z\-]\\w*', encoding='utf-8', decode_error = 'replace', lowercase = True) ngramer = vectorizer.build_analyzer() for dirpath, dirnames, filenames in os.walk(path): for files in filenames: if files.endswith('.txt'): textfiles.append(files) ctFiles = len(textfiles) ctNames = len(names) for i in range(ctFiles): newgrams = set(ngramer(path+'/'+textfiles[i])) ngrams.append(newgrams) for i in range(ctNames): splitname = names[i].split() for j in range(ctFiles): tempset = set() for k in range(len(splitname)): if k == 0: ## subset only the trigrams that "match" first name for trigram in ngrams[j]: for word in trigram.split(): if lev(splitname[k], word) < 2: tempset.add(trigram) else: ## search that subset for middle/last name if len(tempset) > 0: for trigram in tempset: for word in trigram.split(): if lev(splitname[k], word) < 2: hitlist.append([names[i], textfiles[j], trigram]) print(hitlist) ## eventually save to CSV

2条回答

网友

1楼 · 编辑于 2024-10-03 02:39:24

我正在使用fuzzyfuzzy，它在我的数据集（100K个句子）https://github.com/seatgeek/fuzzywuzzy上相当快

网友

2楼 · 编辑于 2024-10-03 02:39:24

Levenshtein非常昂贵，我不建议在这么多文档的模糊匹配中使用它（除非您想构建一个Levenshtein自动机来生成一个距离文件中每个单词n步远的标记索引）。在

对于一定长度的单词，三元组索引本身应该是快速和准确的，尽管你提到名字，如果这意味着多个单词，那么块也需要被索引，然后就需要实现。在

如果您尝试单独使用trigram索引，但对准确性不满意，您可以尝试添加一个trigram chunk index aka（Ban，ana，nan）作为元组，将Ban、nan和ana作为单独的三元组，但添加到一个单独的索引中。随着字符长度的减少，精度会有更大的下降，因此应该考虑到这一点。在

这里的关键是levenshtein在O（查询长度^文件中单词的长度*文件中的字数）执行，而token/trigram/chunk索引在O（log（文件中的字数）*查询中的令牌/块/三元组数）执行。在

相关问题更多 >

编程相关推荐

热门问题

热门文章