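I have computed TF-IDF for a query string and a set of documents. I want to compute the cosine similarity and display a list of document IDs, ordered from most relevant to the query to least relevant.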
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# load the documents (around 200 .txt files) from path
cranInp = []
path = "D:\\Desktop\\try\\web"
for file in os.listdir(path):
    textdir = os.path.join(path, file)
    with open(textdir) as f:
        cranInp.append(f.read())

Vcount = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), stop_words='english')
countMatrix = Vcount.fit_transform(cranInp)
Query = "in summarizing theoretical and experimental work on the behaviour of a typical aircraft structure in a noise environment is it possible to develop a design procedure ."
# transform() expects an iterable of documents, so the single query string is wrapped in a list
queryVects = Vcount.transform([Query])
k = 50
# shape (1, n_documents): similarity of the query to every document
cosMattf = cosine_similarity(queryVects, countMatrix)
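How can I get a list of the top K (K=50) documents, such as [12.txt, 34.txt, 89.txt, 90.txt, …, 45.txt], where the list has 50 entries?
The list should run from most relevant to least relevant: for example, 12.txt has the smallest cosine distance (that is, the highest cosine similarity), so it is the document most relevant to the query.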
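One possible way to do this, as a minimal sketch rather than a definitive answer: sort the similarity row in descending order with `numpy.argsort` and map the resulting indices back to file names. The helper name `top_k_files`, the `filenames` list, and the three tiny in-memory documents are assumptions made up for illustration; in the question's code the file names would have to be collected in the same order as the texts (for example by storing the result of `os.listdir(path)` once and iterating over that list).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_k_files(similarities, filenames, k):
    """Return up to k file names, ordered from highest to lowest similarity.

    similarities: array of shape (1, n_docs) or (n_docs,), as returned by
                  cosine_similarity(query_vector, document_matrix).
    filenames:    list of n_docs names, in the same order as the documents.
    """
    scores = np.asarray(similarities).ravel()
    # Negate so that argsort (ascending) yields the largest similarity first;
    # a stable sort keeps the original order among tied scores.
    order = np.argsort(-scores, kind="stable")[:k]
    return [filenames[i] for i in order]


# Hypothetical stand-in corpus; in practice these come from the files on disk.
filenames = ["12.txt", "34.txt", "89.txt"]
texts = [
    "aircraft structure noise design procedure",
    "aircraft wing lift",
    "boundary layer heat transfer",
]

vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 1), stop_words="english")
doc_matrix = vectorizer.fit_transform(texts)
query_vec = vectorizer.transform(["aircraft structure noise environment design procedure"])

sims = cosine_similarity(query_vec, doc_matrix)  # shape (1, 3)
ranked = top_k_files(sims, filenames, k=2)
print(ranked)
```

With the variables from the question, the call would be `top_k_files(cosMattf, filenames, k)`. Note that a higher cosine similarity means a smaller cosine distance, so sorting similarities in descending order puts the most relevant document first. If `k` is larger than the number of documents, the slice simply returns all of them.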