I am building an NLP project that compares sentence similarity between two different dataframes. Here are examples of the dataframes:
df = pd.DataFrame({'Element Detail': ['Too many competitors in market', 'Highly skilled employees']})
df1 = pd.DataFrame({'Element Details': ['Our workers have a lot of talent',
                                        'this too is a sentence',
                                        'this is very different',
                                        'another sentence is this',
                                        'not much of anything']})
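My code currently compares the first cell in df with every cell in df1. It then picks the highest cosine similarity score and puts it into a separate dataframe, using the following code: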
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model_name = 'bert-base-nli-mean-tokens'
model = SentenceTransformer(model_name)

# Encode every sentence in both dataframes
sentence_vecs = model.encode(df['Element Detail'].tolist())
sentence_vecs1 = model.encode(df1['Element Details'].tolist())

# Compare the first sentence of df with every sentence of df1
new = cosine_similarity(
    [sentence_vecs[0]],
    sentence_vecs1
)

d = pd.DataFrame(new)
T = d.transpose()
# insert() modifies T in place and returns None, so its result is not assigned
T.insert(0, 'New_ID', range(1, 1 + len(T)))
Tnew = T.add_prefix('X')
# Keep only the row with the highest similarity score
Final = Tnew[Tnew.X0 == Tnew.X0.max()]
The end result is this dataframe:
XNew_ID X0
0 1 0.615005
How can I write code that loops through the remaining elements in df and writes the results to the "Final" dataframe in the same way?
Cosine similarity works fine on two lists, so you can pass the full lists of embeddings as the arguments and then extract the maximum similarity.
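The answer's original code did not survive on this page, so what follows is only a minimal sketch of the idea, not the answerer's exact code. It assumes a hypothetical helper named best_matches and uses small made-up 2-D vectors in place of real sentence embeddings so that it runs without downloading a model. The column names New_ID and score are also assumptions, chosen to mirror the question's output.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def best_matches(query_vecs, candidate_vecs):
    """For each query vector, find the most similar candidate vector.

    Hypothetical helper for illustration; not part of any library.
    """
    # Shape (n_queries, n_candidates): entry [i, j] compares
    # query i with candidate j.
    sims = cosine_similarity(query_vecs, candidate_vecs)
    best_idx = sims.argmax(axis=1)    # position of the best candidate per query
    best_score = sims.max(axis=1)     # the corresponding similarity score
    return pd.DataFrame({
        'New_ID': best_idx + 1,       # 1-based, like New_ID in the question
        'score': best_score,
    })


# Toy vectors standing in for sentence embeddings (made up for illustration).
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = np.array([[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]])

Final = best_matches(queries, candidates)
print(Final)
```

With the variables from the question, the call would presumably be best_matches(sentence_vecs, sentence_vecs1), which yields one row per sentence in df without an explicit loop.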