Document similarity with spaCy (Python)

2 answers

User

#1 · edited 2024-10-05 14:27:29

This code computes the pairwise similarity between two or more text files:

import glob
from itertools import combinations

import spacy

spacy.prefer_gpu()  # use the GPU if one is available, otherwise stay on the CPU

# The large models ship with word vectors, which Doc.similarity relies on.
nlp = spacy.load('pt_core_news_lg') # or nlp = spacy.load('en_core_web_lg')

def get_file_contents(filename):
    """Return the text of a file, or None if it cannot be read."""
    try:
        # Assumes the files are UTF-8; change the encoding if yours differ.
        with open(filename, 'r', encoding='utf-8') as filehandle:
            return filehandle.read()
    except (OSError, UnicodeDecodeError) as e:
        print(e)
        return None

# Parse each file once instead of re-parsing it for every pair.
docs = {}
for arquivo in glob.glob("F:\\summary\\RESUMO\\*.txt"):
    conteudo = get_file_contents(arquivo)
    if conteudo is not None:  # skip files that could not be read
        docs[arquivo] = nlp(conteudo)

# Compare every unordered pair of files exactly once.
for (arquivo1, doc1), (arquivo2, doc2) in combinations(docs.items(), 2):
    print(arquivo1 + " vs " + arquivo2)
    print("similarity = %.2f%%\n" % (doc1.similarity(doc2) * 100))

User

#2 · edited 2024-10-05 14:27:29

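There is nothing wrong with your code. Sentence similarity in spaCy is based on word embeddings, and a well-known weakness of word embeddings is that they have trouble telling synonyms (happy-joyous) apart from antonyms (happy-sad).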
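To make that concrete: by default, spaCy's Doc.similarity is the cosine similarity between the averaged word vectors of the two documents. Below is a minimal pure-Python sketch of that idea. The three-dimensional vectors in TOY_VECTORS are invented for illustration (real vectors have hundreds of dimensions and different values), and "sad" is deliberately placed close to "happy" to mimic how antonyms, which appear in similar contexts, tend to end up with similar embeddings.

```python
import math

# Hypothetical 3-dimensional "word vectors", made up for this example only.
TOY_VECTORS = {
    "i":      [0.1, 0.9, 0.0],
    "feel":   [0.2, 0.8, 0.1],
    "happy":  [0.9, 0.1, 0.3],
    "joyous": [0.8, 0.2, 0.3],
    "sad":    [0.8, 0.1, 0.4],  # deliberately close to "happy"
}

def average_vector(words):
    """Average the word vectors component by component."""
    vectors = [TOY_VECTORS[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def text_similarity(text_a, text_b):
    """Similarity of two whitespace-tokenized texts via averaged vectors."""
    return cosine(average_vector(text_a.split()), average_vector(text_b.split()))

print("synonym pair:", text_similarity("i feel happy", "i feel joyous"))
print("antonym pair:", text_similarity("i feel happy", "i feel sad"))
```

With these toy numbers both pairs score close to 1, so the averaged-vector score alone cannot tell the synonym sentence from the antonym one.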

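Judging by your numbers you are probably doing this already, but make sure you are using spaCy's large English model, en_core_web_lg, to get the best word embeddings.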
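One way to check whether the loaded pipeline actually carries word vectors is to look at the shape of its vector table. This is only a sketch: has_word_vectors is a hypothetical helper name, and it assumes the pipeline exposes nlp.vocab.vectors.shape as a (rows, dimensions) pair, as spaCy pipelines do.

```python
def has_word_vectors(nlp):
    """Return True if the pipeline's vocabulary has a non-empty vector table.

    `nlp` is expected to be a loaded spaCy pipeline; only the attribute
    nlp.vocab.vectors.shape, a (rows, dimensions) tuple, is used.
    """
    n_rows, n_dims = nlp.vocab.vectors.shape
    return n_rows > 0 and n_dims > 0
```

For example, has_word_vectors(spacy.load('en_core_web_lg')) should be True, whereas the small models such as en_core_web_sm ship without a vector table, so their similarity scores are much less reliable.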

For more accurate embeddings of full sentences, Google's Universal Sentence Encoder may be worth a try. See: https://tfhub.dev/google/universal-sentence-encoder/4
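A rough sketch of how that could look, assuming tensorflow and tensorflow_hub are installed and the model can be downloaded from the URL above. The helper names load_encoder and sentence_similarity are hypothetical; hub.load and calling the loaded model on a list of strings are the usage shown on the model page.

```python
import math

USE_URL = "https://tfhub.dev/google/universal-sentence-encoder/4"

def load_encoder(url=USE_URL):
    """Load the encoder from TF Hub (downloads the model on first use)."""
    import tensorflow_hub as hub  # imported lazily; requires tensorflow
    return hub.load(url)

def sentence_similarity(embed, text_a, text_b):
    """Cosine similarity between the embeddings of two texts.

    `embed` is any callable that maps a list of strings to one vector per
    string, such as the object returned by load_encoder().
    """
    rows = embed([text_a, text_b])
    if hasattr(rows, "numpy"):  # TensorFlow tensors -> plain Python lists
        rows = rows.numpy().tolist()
    vec_a, vec_b = rows
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    return dot / (norm_a * norm_b)
```

Usage would be along the lines of embed = load_encoder() followed by sentence_similarity(embed, text1, text2), where text1 and text2 are the contents of two files. Loading the model once and reusing it for every pair avoids repeated downloads and start-up cost.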
