How can I sort the similarity scores of sentence embeddings and return them with their respective input sentences?

Posted 2024-10-06 12:31:52


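I managed to generate a vector for every sentence in my two corpora and to compute the cosine similarity (the dot product of the normalized vectors) between every possible pair: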

import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

embeddings1 = ["I'd like an apple juice",
               "An apple a day keeps the doctor away",
               "Eat apple every day",
               "We buy apples every week",
               "We use machine learning for text classification",
               "Text classification is subfield of machine learning"]
embeddings1 = embed(embeddings1)

embeddings2 = ["I'd like an orange juice",
               "An orange a day keeps the doctor away",
               "Eat orange every day",
               "We buy orange every week",
               "We use machine learning for document classification",
               "Text classification is some subfield of machine learning"]
embeddings2 = embed(embeddings2)

print(cosine_similarity(embeddings1, embeddings2))

array([[ 0.7882168 ,  0.3366559 ,  0.22973989,  0.15428472, -0.10180502, -0.04344492],
       [ 0.256085  ,  0.7713026 ,  0.32120776,  0.17834462, -0.10769081, -0.09398925],
       [ 0.23850328,  0.446203  ,  0.62606746,  0.25242645, -0.03946173, -0.00908459],
       [ 0.24337521,  0.35571027,  0.32963073,  0.6373588 ,  0.08571904, -0.01240187],
       [-0.07001016, -0.12002315, -0.02002328,  0.09045915,  0.9141338 ,  0.8373743 ],
       [-0.04525191, -0.09421931, -0.00631144, -0.00199519,  0.75919366,  0.9686416 ]])
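
A note on the "dot product" remark in the question: cosine similarity equals the plain dot product only when the vectors have unit length. The sketch below uses NumPy only and made-up toy vectors (not real sentence embeddings); `cosine_similarity_matrix` is a hypothetical helper name, written to illustrate the idea rather than to reproduce scikit-learn's implementation.

```python
import numpy as np

def cosine_similarity_matrix(x, y):
    """Cosine similarity between every row of x and every row of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Scale each row to unit length; the dot product of unit vectors is the cosine.
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x_unit @ y_unit.T

# Toy vectors chosen for illustration only.
x = np.array([[1.0, 0.0], [3.0, 4.0]])
y = np.array([[2.0, 0.0], [0.0, 5.0]])

sim = cosine_similarity_matrix(x, y)  # values lie in [-1, 1]
plain_dot = x @ y.T                   # depends on vector lengths as well

print(sim)
print(plain_dot)
```

Entry `[i, j]` of the result compares row `i` of the first input with row `j` of the second, which is also how the 6x6 matrix above should be read.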

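To get a meaningful output, I need to sort these scores and return them together with the corresponding input sentences. Does anyone know how to do this? I haven't found any tutorial covering this task.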


2 answers

I was passing a single string instead of a list of strings. Problem solved.

You can sort with np.argsort(...):

import numpy as np
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

seq1 = ["I'd like an apple juice",
        "An apple a day keeps the doctor away",
        "Eat apple every day",
        "We buy apples every week",
        "We use machine learning for text classification",
        "Text classification is subfield of machine learning"]
embeddings1 = embed(seq1)

seq2 = ["I'd like an orange juice",
        "An orange a day keeps the doctor away",
        "Eat orange every day",
        "We buy orange every week",
        "We use machine learning for document classification",
        "Text classification is some subfield of machine learning"]
embeddings2 = embed(seq2)

# a[i, j] is the similarity between seq1[i] and seq2[j]
a = cosine_similarity(embeddings1, embeddings2)

def get_pairs(a, b):
    """Return an array where pairs[i, j] == [a[i], b[j]]."""
    a = np.array(a)
    b = np.array(b)

    c = np.array(np.meshgrid(a, b))
    c = c.T.reshape(len(a), -1, 2)

    return c

pairs = get_pairs(seq1, seq2)

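# Order each row of `a` from most to least similar, then reorder the pairs to match
sorted_idx = np.argsort(-a, axis=1)

sorted_pairs = np.take_along_axis(pairs, sorted_idx[..., None], axis=1)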


print(pairs[0, 0])
print(pairs[0, 1])
print(pairs[0, 2])

["I'd like an apple juice" "I'd like an orange juice"]
["I'd like an apple juice" 'An orange a day keeps the doctor away']
["I'd like an apple juice" 'Eat orange every day']
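
To return every score together with its two input sentences, one possible approach is to rank all pairs globally rather than row by row. The sketch below hard-codes the sentences and the similarity matrix printed in the question so that it runs without TensorFlow; `rank_pairs` and its `top_k` parameter are hypothetical names introduced here for illustration, not part of any library.

```python
import numpy as np

seq1 = ["I'd like an apple juice",
        "An apple a day keeps the doctor away",
        "Eat apple every day",
        "We buy apples every week",
        "We use machine learning for text classification",
        "Text classification is subfield of machine learning"]
seq2 = ["I'd like an orange juice",
        "An orange a day keeps the doctor away",
        "Eat orange every day",
        "We buy orange every week",
        "We use machine learning for document classification",
        "Text classification is some subfield of machine learning"]

# Similarity matrix copied from the question's output; sim[i, j] compares seq1[i] with seq2[j].
sim = np.array([
    [0.7882168, 0.3366559, 0.22973989, 0.15428472, -0.10180502, -0.04344492],
    [0.256085, 0.7713026, 0.32120776, 0.17834462, -0.10769081, -0.09398925],
    [0.23850328, 0.446203, 0.62606746, 0.25242645, -0.03946173, -0.00908459],
    [0.24337521, 0.35571027, 0.32963073, 0.6373588, 0.08571904, -0.01240187],
    [-0.07001016, -0.12002315, -0.02002328, 0.09045915, 0.9141338, 0.8373743],
    [-0.04525191, -0.09421931, -0.00631144, -0.00199519, 0.75919366, 0.9686416],
])

def rank_pairs(sim, left, right, top_k=None):
    """Return (score, left sentence, right sentence) tuples, highest score first."""
    # argsort over the flattened matrix, reversed to get descending order
    order = np.argsort(sim, axis=None)[::-1]
    if top_k is not None:
        order = order[:top_k]
    # Convert flat positions back to (row, column) indices
    rows, cols = np.unravel_index(order, sim.shape)
    return [(float(sim[r, c]), left[r], right[c]) for r, c in zip(rows, cols)]

ranked = rank_pairs(sim, seq1, seq2, top_k=3)
for score, s1, s2 in ranked:
    print(f"{score:.3f}  {s1!r}  <->  {s2!r}")
```

With real embeddings, `sim` would instead be the result of `cosine_similarity(embeddings1, embeddings2)`. Leaving `top_k` as `None` returns all 36 pairs in ranked order.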
