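I am building a similarity index for my gensim LSI model.
I have two indexes, each built on a different corpus. The index for the large corpus (400k texts) works fine, but the index for the small corpus (10k texts) behaves strangely.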
import numpy as np
from gensim.models import LsiModel
from gensim.similarities import MatrixSimilarity
# corpus_bow[0] = [(0, 1), (1, 1), (2, 1)......]
# len(corpus_bow): 10k
model = LsiModel(corpus_bow, num_topics=400)
transformed = model[corpus_bow]
index = MatrixSimilarity(transformed, num_best=40)
# query the index with a random 400-dimensional vector
print(index[np.random.rand(400)])
[(486, -0.20494145154953003),
(458, -0.20494145154953003),
(503, -0.20494145154953003),
(1732, -0.17177608609199524),
(4432, -0.1629350632429123),
(4714, -0.1629350632429123),
(4537, 0.16059553623199463),
(1103, -0.15242460370063782),
(1014, -0.1514768898487091),
(8915, 0.15007290244102478),
(8901, 0.15007290244102478),
(5387, 0.14965689182281494),
(6508, -0.14788493514060974),
(1765, -0.14788493514060974),
(1744, -0.14788493514060974),
(3650, 0.14648930728435516),
(1120, 0.14613059163093567),
(6345, -0.14582915604114532),
(2234, -0.14568257331848145),
(5361, 0.14548826217651367),
(2558, 0.1453399807214737),
(4711, -0.1450151801109314),
(4445, -0.1450151801109314),
(8870, 0.14448410272598267),
(8862, 0.14448410272598267),
(8851, -0.14395320415496826),
(8824, -0.14391806721687317),
(9351, -0.1435808539390564),
(4078, 0.14126691222190857),
(8006, 0.14126691222190857),
(8324, -0.14036694169044495),
(4186, 0.14015284180641174),
(4934, 0.14015284180641174),
(3941, -0.1399686485528946),
(3929, -0.1399686485528946),
(5173, -0.13947832584381104),
(3774, -0.13858705759048462),
(4410, -0.13752153515815735),
(4442, -0.13752153515815735),
(4705, -0.13752153515815735)]
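The results are sorted by absolute value, which is strange: the list mixes the most dissimilar documents (large negative scores) with the most similar ones (large positive scores).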
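For illustration, here is a minimal sketch of ranking by signed similarity instead of by magnitude. It assumes the similarities are available as a dense NumPy array, which, as far as I understand, is what `index[query]` returns when the index is built without `num_best`; that is an assumption, not something confirmed in this question. `top_n_signed` is a hypothetical helper, not part of gensim, and the toy scores are made up for the comparison.

```python
import numpy as np

def top_n_signed(sims, n):
    """Return up to n (doc_id, similarity) pairs, highest signed similarity first.

    Hypothetical helper, not a gensim function.
    """
    sims = np.asarray(sims, dtype=float)
    n = min(n, sims.size)
    if n <= 0:
        return []
    # Pick the n largest signed values (unordered), then sort just those.
    top = np.argpartition(-sims, n - 1)[:n]
    top = top[np.argsort(-sims[top])]
    return [(int(i), float(sims[i])) for i in top]

# Toy similarity vector (made-up values) to contrast the two orderings.
sims = np.array([0.15, -0.20, 0.05, -0.17, 0.16])

# Ranking by magnitude puts strongly negative scores first.
by_abs = sorted(enumerate(sims), key=lambda p: -abs(p[1]))[:3]

# Ranking by signed value keeps only the most similar documents.
by_signed = top_n_signed(sims, 3)

print("by |sim|:", [(i, float(s)) for i, s in by_abs])
print("by sim:  ", by_signed)
```

With a real index, the idea would be to create it without `num_best` (for example `MatrixSimilarity(transformed)`) and pass the resulting array to the helper, e.g. `top_n_signed(index[query], 40)`.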