I have looked at the gensim documentation, but I ran into the following problems:
- Once I get an m x j similarity matrix, where m is the number of documents and j is the total number of unique words, I don't know how to extract the N most similar documents.
- The long-term goal is to store and save the results in xlsx or csv format, but that is a separate question.
Here, the example from the documentation uses Similarity:
from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
from gensim.similarities import Similarity  # missing in the original snippet

index_tmpfile = get_tmpfile("index")
query = [(1, 2), (6, 1), (7, 2)]
index = Similarity(index_tmpfile, common_corpus, num_features=len(common_dictionary))  # build the index
similarities = index[query]  # get similarities between the query and all index documents
After that, how do I get the 10 most similar documents?
Expected output:
arr = [('This is the most similar', 0.99), ('This is one of the most similar', 0.98), ('This is another very similar doc', 0.98)]
or
arr = [['This is the most similar', 0.99], ['This is one of the most similar', 0.98], ['This is another very similar doc', 0.98]]
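One common way to produce a list in exactly this shape is to sort the similarity scores with numpy's argsort and pair each index with its original document text. This is a minimal sketch; the `lines` list and the score values are illustrative placeholders, not real data:

```python
import numpy as np

# Stand-ins: `lines` holds the original document strings and `sim`
# holds one similarity score per document, as returned by a gensim index.
lines = ["This is the most similar", "An unrelated document", "Also quite similar"]
sim = np.array([0.99, 0.11, 0.98])

N = 2
top_idx = np.argsort(sim)[::-1][:N]                 # indices of the N highest scores
arr = [(lines[i], float(sim[i])) for i in top_idx]  # (document, score) pairs, best first
print(arr)
```

Using `[::-1]` reverses argsort's ascending order so the highest-scoring documents come first.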
In my particular case, I use similarities.SparseMatrixSimilarity with the following code:
from gensim import corpora, similarities
from gensim.models import TfidfModel
from nltk.tokenize import regexp_tokenize

# Type the query here
query_string = 'house is clean'
tokenized_query = regexp_tokenize(query_string, r"\w+")

# tokenized_lines holds the tokenized corpus documents (built elsewhere)
# Create a corpora.Dictionary() object
dictionary = corpora.Dictionary()
# Pass each document to dictionary.doc2bow()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in tokenized_lines]
feature_cnt = len(dictionary.token2id)

# smartirs='ntc': n = raw term frequency, t = zero-corrected idf, c = cosine normalization
tfidf = TfidfModel(BoW_corpus, smartirs='ntc')
query_vector = dictionary.doc2bow(tokenized_query)
index = similarities.SparseMatrixSimilarity(tfidf[BoW_corpus], num_features=feature_cnt)
# This gives the similarity between the query and every indexed document
sim = index[tfidf[query_vector]]