Sklearn cosine similarity for strings, Python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(example_1)
result_cos = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(result_cos[0][1])

1 Answer

User

#1 · Posted on 2024-09-30 18:28:08

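For short strings, Levenshtein distance will probably yield better results than word-based cosine similarity. The algorithm below is adapted from Wikibooks. Because this is a distance metric, a smaller score is better.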

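def levenshtein(s1, s2):
    """Return the Levenshtein distance divided by the longer string's length."""
    # Make s1 the longer string so each row has len(s2) + 1 entries.
    if len(s1) < len(s2):
        s1, s2 = s2, s1

    # Two empty strings are identical; this also avoids dividing by zero.
    if len(s1) == 0:
        return 0.0

    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    # Normalize: 0.0 means identical, 1.0 means nothing in common.
    return previous_row[-1] / float(len(s1))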

example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")

print(levenshtein(*example_1))
print(levenshtein(*example_2))
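
To illustrate why a character-level measure may suit short strings better than a word-level one, here is a rough sketch that uses only the standard library. The helper names `word_cosine`, `edit_distance`, and `edit_similarity` are illustrative and not part of the original answer. Note that `word_cosine` works on raw word counts rather than TF-IDF weights, so its numbers will not match what scikit-learn returns; it is only meant to show the general behaviour of word-based vectors.

```python
import math
from collections import Counter


def word_cosine(a, b):
    """Cosine similarity over raw word counts (an assumption: no TF-IDF weighting)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def edit_distance(s1, s2):
    """Plain (unnormalized) Levenshtein distance, computed row by row."""
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            current_row.append(min(previous_row[j + 1] + 1,        # insertion
                                   current_row[j] + 1,             # deletion
                                   previous_row[j] + (c1 != c2)))  # substitution
        previous_row = current_row
    return previous_row[-1]


def edit_similarity(s1, s2):
    """Map the distance to a similarity in [0, 1], where 1.0 means identical."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s1, s2) / longest


pairs = [("I am okey", "I am okeu"), ("I am okey", "I am crazy")]
for a, b in pairs:
    print(repr(a), repr(b),
          "word cosine:", round(word_cosine(a, b), 3),
          "edit similarity:", round(edit_similarity(a, b), 3))
```

In this sketch both pairs get the same word-level cosine score, because a one-letter typo ("okeu") and an entirely different word ("crazy") each count as one mismatched word. The character-level score does separate them, which is the point the answer makes about short strings.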
