Gensim MatrixSimilarity with num_best returns negative similarities

Posted 2024-09-28 22:43:55


I am building a similarity index for my gensim LSI model.

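I built two indexes on different corpora. The index for the large corpus (400k texts) works fine, but the index for the small corpus (10k texts) behaves strangely.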


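import numpy as np
from gensim.models import LsiModel
from gensim.similarities import MatrixSimilarity

# corpus_bow[0] = [(0, 1), (1, 1), (2, 1)......]
# len(corpus_bow): 10k

# Train a 400-topic LSI model and project the corpus into LSI space.
model = LsiModel(corpus_bow, num_topics=400)
transformed = model[corpus_bow]

# Index the projected corpus; each query returns the 40 best matches.
index = MatrixSimilarity(transformed, num_best=40)

# Query the index with a random dense 400-dimensional vector.
print(index[np.random.rand(400)])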
[(486, -0.20494145154953003),
 (458, -0.20494145154953003),
 (503, -0.20494145154953003),
 (1732, -0.17177608609199524),
 (4432, -0.1629350632429123),
 (4714, -0.1629350632429123),
 (4537, 0.16059553623199463),
 (1103, -0.15242460370063782),
 (1014, -0.1514768898487091),
 (8915, 0.15007290244102478),
 (8901, 0.15007290244102478),
 (5387, 0.14965689182281494),
 (6508, -0.14788493514060974),
 (1765, -0.14788493514060974),
 (1744, -0.14788493514060974),
 (3650, 0.14648930728435516),
 (1120, 0.14613059163093567),
 (6345, -0.14582915604114532),
 (2234, -0.14568257331848145),
 (5361, 0.14548826217651367),
 (2558, 0.1453399807214737),
 (4711, -0.1450151801109314),
 (4445, -0.1450151801109314),
 (8870, 0.14448410272598267),
 (8862, 0.14448410272598267),
 (8851, -0.14395320415496826),
 (8824, -0.14391806721687317),
 (9351, -0.1435808539390564),
 (4078, 0.14126691222190857),
 (8006, 0.14126691222190857),
 (8324, -0.14036694169044495),
 (4186, 0.14015284180641174),
 (4934, 0.14015284180641174),
 (3941, -0.1399686485528946),
 (3929, -0.1399686485528946),
 (5173, -0.13947832584381104),
 (3774, -0.13858705759048462),
 (4410, -0.13752153515815735),
 (4442, -0.13752153515815735),
 (4705, -0.13752153515815735)]

The results are sorted by absolute value, which is strange: I get the most dissimilar documents mixed in with the most similar ones.
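To make the difference concrete, here is a minimal NumPy-only sketch of the two orderings. It does not reproduce gensim's internals; the helper names `top_n_by_magnitude` and `top_n_by_value` and the toy scores are hypothetical and exist only for illustration.

```python
import numpy as np


def top_n_by_magnitude(sims, n):
    """Return (index, score) pairs for the n scores with the largest absolute value."""
    order = np.argsort(-np.abs(sims), kind="stable")[:n]
    return [(int(i), float(sims[i])) for i in order]


def top_n_by_value(sims, n):
    """Return (index, score) pairs for the n largest signed scores."""
    order = np.argsort(-sims, kind="stable")[:n]
    return [(int(i), float(sims[i])) for i in order]


# Toy cosine similarities (made-up values, not taken from the index above).
sims = np.array([0.15, -0.20, 0.05, -0.17, 0.16])

by_magnitude = top_n_by_magnitude(sims, 3)  # ordering that matches the output I see
by_value = top_n_by_value(sims, 3)          # ordering I expected

print(by_magnitude)
print(by_value)
```

The magnitude ordering puts the negative scores first, which is the pattern in the output above. If I understand the API correctly, leaving `num_best` at its default of `None` makes `index[query]` return the full array of scores for a single query, so a signed sort like `top_n_by_value` could be applied to it as a workaround.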

