How do I display document IDs after computing TF-IDF and cosine_similarity? (Python)

Posted 2024-10-01 07:23:29


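I compute the TF-IDF of a query string and of a set of documents. I want to compute the cosine similarity and display a list of document IDs, ordered from the most relevant to the query to the least relevant.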

import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the documents (around 200 .txt files) from path
cranInp = []
path = "D:\\Desktop\\try\\web"
for file in os.listdir(path):
    textdir = os.path.join(path, file)
    with open(textdir) as fh:
        cranInp.append(fh.read())


Vcount = TfidfVectorizer(analyzer='word', ngram_range=(1,1), stop_words = 'english')
countMatrix = Vcount.fit_transform(cranInp)


Query = "in summarizing theoretical and experimental work on the behaviour of a typical aircraft structure in a noise environment is it possible to develop a design procedure ."
# transform() expects an iterable of documents, so wrap the single query string in a list
queryVects = Vcount.transform([Query])

k = 50
cosMattf = cosine_similarity(queryVects,countMatrix)

How can I get the list of the top K (K=50) documents, e.g. [12.txt, 34.txt, 89.txt, 90.txt, …, 45.txt], where the list has 50 entries?

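The list should run from most relevant to least relevant. For example, if 12.txt has the smallest cosine distance (that is, the highest cosine similarity), it is the document most relevant to the query.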
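One possible way to approach the ranking step is sketched below. It is a minimal illustration, not the only way to do it: it assumes you have one similarity score per document and a list of file names in the same order. The helper name `top_k_documents`, the example file names, and the example scores are all made up for illustration and are not from the question.

```python
def top_k_documents(file_names, scores, k=50):
    """Return up to k (file_name, score) pairs, highest score first.

    Hypothetical helper: file_names[i] must belong to scores[i].
    """
    if len(file_names) != len(scores):
        raise ValueError("file_names and scores must have the same length")
    # Pair each name with its score, then sort by score in descending order.
    ranked = sorted(zip(file_names, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical file names and similarity scores, for illustration only.
example_names = ["12.txt", "34.txt", "89.txt", "90.txt"]
example_scores = [0.82, 0.10, 0.47, 0.00]
top = top_k_documents(example_names, example_scores, k=3)
print([name for name, _ in top])  # ['12.txt', '89.txt', '34.txt']
```

To use this with the code in the question, you would need to record each file name in the loading loop at the same time its text is appended to cranInp, so that the two lists stay aligned. Since cosine_similarity returns an array of shape (number of queries, number of documents), the scores for the single query would be the first row, cosMattf[0].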

