Datasketch MinHash LSH Forest is producing incorrect results

Posted 2024-09-21 05:32:37

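We implemented a recommender system based on locality-sensitive hashing (LSH) using Datasketch's MinHash LSH Forest.

After analysing the results, we found that MinHash LSH Forest (Datasketch) does not recommend the closest match to a given query.

We compared the predicted match against the actual closest match using Datasketch (note: MinHash LSH is different from MinHash LSH Forest) and found that even when MinHash LSH performs well, MinHash LSH Forest produces incorrect results.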

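from datasketch import MinHash, MinHashLSHForest
from nltk.stem import WordNetLemmatizer

def get_forest(df, permutations, p_trees):
    # process_text is our own tokenizer (not shown here)
    lemmatizer = WordNetLemmatizer()
    minhash = []
    for i in range(len(df)):
        text = df.iloc[i, 8]  # column 8 holds the text
        tokens = process_text(text, lemmatizer)
        if tokens is None:
            continue  # skip rows without usable tokens
        m = MinHash(num_perm=permutations)
        for s in tokens:
            m.update(s.encode('utf8'))
        # keep the row index so forest keys still match DataFrame rows
        minhash.append((i, m))
    forest = MinHashLSHForest(num_perm=permutations, l=p_trees)
    for i, m in minhash:
        forest.add(i, m)
    forest.index()  # required before the forest can be queried
    return forest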

For example, suppose there are three sentences:

Sentence A: Query Text.

Sentence B: Closest Match.

Sentence C: Predicted Match.

The quantities involved are defined as follows:

J_AB: Actual Jaccard Similarity between Sentence A and Sentence B.

J_AC: Actual Jaccard Similarity between Sentence A and Sentence C.

J_MinHash_AB: Jaccard Similarity between Sentence A and Sentence B calculated by the MinHash LSH.

J_MinHash_AC: Jaccard Similarity between Sentence A and Sentence C calculated by the MinHash LSH.
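To make the difference between the actual and the MinHash-estimated similarity concrete, here is a minimal pure-Python sketch. It is not Datasketch's implementation: the seeded SHA-1 hash, the helper names and the sample sentences are all assumptions made for illustration.

```python
import hashlib

def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def _seeded_hash(token, seed):
    # Hypothetical stand-in for one random permutation: a hash salted with the seed.
    data = f"{seed}:{token}".encode("utf8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(tokens, num_perm=128):
    """For each seeded hash, keep the minimum value over the (non-empty) token set."""
    tokens = set(tokens)
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(num_perm)]

def minhash_jaccard(sig_a, sig_b):
    """Estimate Jaccard as the fraction of signature positions that agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Illustrative sentences only; they are not taken from our data.
a = "the quick brown fox jumps".split()
b = "the quick brown dog jumps".split()
c = "a slow green turtle crawls".split()

j_ab = jaccard(a, b)  # 4 shared tokens out of 6 distinct ones
j_ac = jaccard(a, c)  # no shared tokens
j_minhash_ab = minhash_jaccard(minhash_signature(a), minhash_signature(b))
j_minhash_ac = minhash_jaccard(minhash_signature(a), minhash_signature(c))

print(j_ab, j_minhash_ab)
print(j_ac, j_minhash_ac)
```

The estimate should approach the exact value as `num_perm` grows, which is why J_MinHash_AB and J_AB can differ slightly without changing which sentence ranks higher.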

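We have two questions:

  1. Why does Datasketch MinHash LSH Forest suggest Sentence C, rather than Sentence B, as the nearest neighbour of Sentence A?

  2. Assuming MinHash LSH Forest produces multiple Jaccard similarities, i.e. J_MinHash_AB, J_MinHash_AC, J_MinHash_AD, ..., J_MinHash_AN, is it possible to access all of these similarities?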

