Looping a cosine similarity calculation from one dataframe over another with pandas and BERT

Posted 2024-05-03 07:33:51

I am building an NLP project that compares sentence similarity between two different dataframes. Here is an example of the dataframes:

df = pd.DataFrame({'Element Detail':['Too many competitors in market', 'Highly skilled employees']})
df1 = pd.DataFrame({'Element Details':['Our workers have a lot of talent', 
                                      'this too is a sentence',
                                      'this is very different',
                                      'another sentence is this',
                                      'not much of anything']
                    })

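My code is currently set up to compare the first cell of df against every cell of df1. It then picks the highest cosine similarity score and puts it into a separate dataframe, using the following code: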

import pandas as pd
import numpy as np

model_name = 'bert-base-nli-mean-tokens'
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(model_name)
sentence_vecs = model.encode(df['Element Detail'])
sentence_vecs1 = model.encode(df1['Element Details'])

from sklearn.metrics.pairwise import cosine_similarity

new = cosine_similarity(
    [sentence_vecs[0]],
    sentence_vecs1[0:]
)

d = pd.DataFrame(new)
T = d.transpose()
# insert() modifies T in place and returns None, so its result is not assigned
T.insert(0, 'New_ID', range(1, 1 + len(T)))
Tnew = T.add_prefix('X')
Final = Tnew[Tnew.X0 == Tnew.X0.max()]

The end result is this dataframe:

    XNew_ID     X0  
0   1           0.615005 

How can I write code that loops through the remaining elements of df and writes the results to the "Final" dataframe in the same way?


1 answer
User
#1 · Posted 2024-05-03 07:33:51

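Cosine similarity works fine on two lists of vectors, so you can pass both full sets of embeddings as arguments in a single call and then extract the maximum similarity for each row: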

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model_name = 'bert-base-nli-mean-tokens'
model = SentenceTransformer(model_name)

# df and df1 are the two dataframes defined in the question
sentence_vecs = model.encode(df['Element Detail'])
sentence_vecs1 = model.encode(df1['Element Details'])

# One row per sentence in df, one column per sentence in df1
new = cosine_similarity(
    sentence_vecs,
    sentence_vecs1
)
# Highest similarity score for each sentence in df
max_similarities = np.amax(new, axis=1)

d = pd.DataFrame(new)
T = d.transpose()
# insert() modifies T in place and returns None, so its result is not assigned
T.insert(0, 'New_ID', range(1, 1 + len(T)))
Tnew = T.add_prefix('X')
# Keeps the df1 row that best matches the first sentence of df (column X0)
Final = Tnew[Tnew.X0 == Tnew.X0.max()]
Final

Output:

    XNew_ID     X0          X1
0   1           0.615005    0.868932
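
The filter above keys only on column X0, so it returns the best df1 row for the first sentence of df. If you want one result row per sentence in df, taking the argmax along each row of the similarity matrix is one way to do it. Below is a minimal sketch of that idea, not a drop-in replacement. The helpers `cosine_similarity_matrix` and `best_matches` and the output column names `Best Match` and `Similarity` are hypothetical names made up for this illustration, and small toy vectors stand in for the output of `model.encode` so the snippet runs without downloading a model.

```python
import numpy as np
import pandas as pd


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale each row to unit length; the dot product is then the cosine
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def best_matches(queries, candidates, query_vecs, candidate_vecs):
    """Hypothetical helper: best candidate for each query sentence."""
    sims = cosine_similarity_matrix(query_vecs, candidate_vecs)
    best_idx = sims.argmax(axis=1)  # best column index for each row
    return pd.DataFrame({
        'Element Detail': list(queries),
        'Best Match': [candidates[i] for i in best_idx],
        'New_ID': best_idx + 1,  # 1-based, like New_ID in the question
        'Similarity': sims.max(axis=1),
    })


# Toy 2-D vectors standing in for model.encode(...) output (assumption)
toy_query_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_candidate_vecs = np.array([[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]])

result = best_matches(['q0', 'q1'], ['c0', 'c1', 'c2'],
                      toy_query_vecs, toy_candidate_vecs)
print(result)
```

With the real data, the call would presumably look like `best_matches(df['Element Detail'].tolist(), df1['Element Details'].tolist(), sentence_vecs, sentence_vecs1)`. The similarity matrix is computed once, so no explicit Python loop over the rows of df is needed.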
