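We implemented a recommender system based on locality-sensitive hashing (LSH) using Datasketch's MinHash LSH Forest. After analyzing the results, we found that MinHash LSH Forest (Datasketch) does not recommend the closest match for a given query.
Using Datasketch, we compared the predicted matches against the actual closest matches (note: MinHash LSH performs well, whereas the results produced by MinHash LSH Forest are incorrect).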
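    from datasketch import MinHash, MinHashLSHForest
    from nltk.stem import WordNetLemmatizer

    def get_forest(df, permutations, p_trees):
        # process_text is our own tokenizer, defined elsewhere (not shown here).
        lemmatizer = WordNetLemmatizer()
        minhash = []
        for i in range(len(df)):
            text = df.iloc[i, 8]
            tokens = process_text(text, lemmatizer)
            if tokens is None:
                continue  # skip rows that yield no tokens
            m = MinHash(num_perm=permutations)
            for s in tokens:
                m.update(s.encode('utf8'))
            # keep the row index so forest keys still match df rows after skips
            minhash.append((i, m))
        forest = MinHashLSHForest(num_perm=permutations, l=p_trees)
        for i, m in minhash:
            forest.add(i, m)
        forest.index()  # must be called before the forest can be queried
        return forest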
For example, suppose there are three sentences:
Sentence A: Query Text.
Sentence B: Closest Match.
Sentence C: Predicted Match.
with the following notation:
J_AB: actual Jaccard similarity between Sentence A and Sentence B.
J_AC: actual Jaccard similarity between Sentence A and Sentence C.
J_MinHash_AB: Jaccard similarity between Sentence A and Sentence B as calculated by MinHash LSH.
J_MinHash_AC: Jaccard similarity between Sentence A and Sentence C as calculated by MinHash LSH.
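To make the difference between the actual and the MinHash-estimated Jaccard similarity concrete, here is a minimal standard-library sketch. It is not Datasketch's implementation: the helpers `jaccard`, `signature`, and `estimated_jaccard` are hypothetical names, and the salted SHA-1 hashes merely stand in for random permutations.

```python
import hashlib

def jaccard(a, b):
    """Exact Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def _hash(token, seed):
    # Salting the token with a seed simulates one "permutation" per seed.
    data = f"{seed}:{token}".encode("utf8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def signature(tokens, num_perm=128):
    """MinHash signature: the minimum hash per seed (tokens must be non-empty)."""
    tokens = set(tokens)
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_perm)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; an estimate of Jaccard."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

The estimate is a random quantity whose error shrinks as `num_perm` grows, so two candidates whose actual similarities are close can swap order in the estimated ranking.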
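We have two questions:
1. Why does Datasketch MinHash LSH Forest suggest Sentence C, rather than Sentence B, as the nearest neighbor of Sentence A?
2. Assuming MinHash LSH Forest produces multiple Jaccard similarities, i.e. J_MinHash_AB, J_MinHash_AC, J_MinHash_AD, ..., J_MinHash_AN, is it possible to access all of these similarities?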