sklearn match results become misaligned as the dataset grows

nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)
unique_org = set(names['VariationName'].values)  # set used for increased performance

# matching query:
def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF_)
    return distances, indices

print('Getting nearest n...')
distances, indices = getNearestN(unique_org)

unique_org = list(unique_org)  # need to convert back to a list
print('Finding matches...')
matches = []
for i, j in enumerate(indices):
    temp = [round(distances[i][0], 2), clean_org_names.values[j][0][0], unique_org[i]]
    matches.append(temp)

print('Building data frame...')
matches = pd.DataFrame(matches, columns=['Match confidence (lower is better)', 'Matched name', 'Original name'])
print('Data frame built')

1 answer

Answer 1 · posted 2024-09-29 23:31:00

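Sets do not, in general, guarantee that order is preserved. So the order in which getNearestN iterates over unique_org may differ from the order produced by the list constructor: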

distances, indices = getNearestN(unique_org)  # computed distances with respect to an unordered set

unique_org = list(unique_org) # `unique_org` was potentially shuffled here

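Instead, try passing a list and see whether that works. If the list version is much slower, I suspect the culprit is duplicate names rather than a set being inherently better suited to the job. You can remove duplicates in pandas (names['VariationName'].unique()) or in vanilla Python (list(set(names['VariationName']))). Either way, build the list once, before the query, and reuse that same list when labelling the results.

So, to sum up: I would make sure there are no duplicates (probably with pandas), then use the list throughout and see whether it works.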
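As a rough illustration of that advice, here is a minimal sketch that uses only the standard library. The names dedupe_keep_order, toy_nearest and build_matches, and the sample data, are hypothetical; toy_nearest is only a stand-in for vectorizer.transform plus nbrs.kneighbors, not the scikit-learn API. The point it shows is that the unique names are turned into a list once, and that same list is used both for the query and for labelling each result row.

```python
from difflib import SequenceMatcher


def dedupe_keep_order(values):
    """Drop duplicates, keeping first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(values))


def toy_nearest(queries, candidates):
    """Hypothetical stand-in for nbrs.kneighbors.

    Returns one (distance, candidate_index) pair per query, in query order.
    """
    results = []
    for query in queries:
        scored = [
            (1.0 - SequenceMatcher(None, query, cand).ratio(), idx)
            for idx, cand in enumerate(candidates)
        ]
        results.append(min(scored))  # smallest distance wins
    return results


def build_matches(raw_names, candidates):
    # Materialize the unique names as a list once, before querying.
    unique_names = dedupe_keep_order(raw_names)
    nearest = toy_nearest(unique_names, candidates)
    # Row i of `nearest` belongs to unique_names[i], because the same list
    # was used for both the query and the labels.
    return [
        (round(distance, 2), candidates[idx], name)
        for name, (distance, idx) in zip(unique_names, nearest)
    ]


# Made-up sample data, for illustration only.
clean_names = ["Acme Corporation", "Globex Inc", "Initech LLC"]
variations = ["Globex Incorporated", "Acme Corp", "Globex Incorporated", "Initech"]
matches = build_matches(variations, clean_names)
```

In the original code the equivalent change would be to replace the set with a de-duplicated list before calling getNearestN, and to drop the later list(unique_org) conversion.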

Source:

A set object is an unordered collection of distinct hashable objects.

Python docs
