How can I cluster and plot sentences based on sentence similarity?

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import DBSCAN
    from sklearn.manifold import TSNE

    # get_elmo_embeddings(sentence) is defined elsewhere (not shown here).
    # Judging by the array shape below, it returns one 512-dimensional vector.


    def tsnescatterplot(sentences):
        arr = np.empty((0, 512), dtype='f')
        word_labels = []
        for sentence in sentences:
            wrd_vector = get_elmo_embeddings(sentence)
            print(sentence)
            word_labels.append(sentence)
            arr = np.append(arr, np.array([wrd_vector]), axis=0)
        print('Printing array')
        print(arr)

        # find tsne coords for 2 dimensions
        tsne = TSNE(n_components=2, random_state=0)
        np.set_printoptions(suppress=True)
        Y = tsne.fit_transform(arr)
        x_coords = Y[:, 0]
        y_coords = Y[:, 1]

        # display scatter plot, labelling each point with its sentence
        plt.scatter(x_coords, y_coords)
        for label, x, y in zip(word_labels, x_coords, y_coords):
            plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
        # pad the limits on both sides so points at the edges are not clipped
        plt.xlim(x_coords.min() - 0.5, x_coords.max() + 0.5)
        plt.ylim(y_coords.min() - 0.5, y_coords.max() + 0.5)
        plt.show()


    def dbscan_scatterplot(sentences):
        arr = np.empty((0, 512), dtype='f')
        for sentence in sentences:
            wrd_vector = get_elmo_embeddings(sentence)
            arr = np.append(arr, np.array([wrd_vector]), axis=0)
        dbscan = DBSCAN()  # default parameters
        np.set_printoptions(suppress=True)
        Y = dbscan.fit(arr)

1 Answer

User

#1 · Posted 2024-09-27 18:56:55

The choice of DBSCAN parameters matters a great deal.

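It is silly that the library provides default values at all, because those values only work for low-dimensional toy data. It should require the user to specify them instead.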

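You need to choose the parameters carefully, epsilon in particular. With high-dimensional data, though, choosing it well is hard. You will find that the result flips abruptly from all -1 (nothing clustered, every point labelled noise) to all 0 (everything connected into a single cluster), with only a narrow range of useful values in between. The literature offers some heuristics for this, and you will need to explore them.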
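One heuristic from the literature is the sorted k-distance plot proposed in the original DBSCAN paper: compute each point's distance to its k-th nearest neighbour, sort the values, and look for a "knee" as a candidate epsilon. The following is a minimal sketch of that idea, not a recipe. The helper names `k_distances` and `cluster_with_eps`, the choice of cosine distance, `k=4`, and the synthetic two-blob data standing in for real sentence embeddings are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def k_distances(vectors, k=4, metric="cosine"):
    """Return each point's distance to its k-th nearest neighbour, sorted ascending."""
    # Ask for k + 1 neighbours because each point is returned as its own
    # nearest neighbour (distance 0) when querying the training data.
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(vectors)
    distances, _ = nn.kneighbors(vectors)
    return np.sort(distances[:, -1])


def cluster_with_eps(vectors, eps, min_samples=4, metric="cosine"):
    """Run DBSCAN with explicitly chosen parameters; label -1 marks noise."""
    model = DBSCAN(eps=eps, min_samples=min_samples, metric=metric)
    return model.fit_predict(vectors)


if __name__ == "__main__":
    # Synthetic stand-in for sentence embeddings: two tight blobs in 512 dimensions.
    rng = np.random.default_rng(0)
    centers = rng.normal(size=(2, 512))
    vectors = np.vstack([c + 0.1 * rng.normal(size=(20, 512)) for c in centers])

    kd = k_distances(vectors, k=4)
    print(kd)  # plot this curve and look for the knee

    # Hypothetical pick for the demo: slightly above the largest k-distance.
    labels = cluster_with_eps(vectors, eps=float(kd[-1]) * 1.1)
    print(sorted(set(labels.tolist())))
```

On real embeddings the curve is often much flatter than on toy data, which is exactly the difficulty described above, so treat the knee as a starting point and check how the number of noise points changes as you vary epsilon around it.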

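Last but not least, averaged word vectors tend to give fairly bad results, because they all drift toward the mean. The longer a document is, the closer its average lies to the overall mean; the shorter it is, the farther away. That is not the kind of cluster structure you want. This extra distortion may well be enough to destroy whatever signal you had before.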
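The pull toward the mean is easy to see on synthetic data. The sketch below is only an illustration under strong assumptions: the "word vectors" are random draws around a shared centre rather than real ELMo output, and `mean_distance_to_center` is a hypothetical helper. With independent noise like this, the distance between a document's averaged vector and the centre shrinks roughly in proportion to one over the square root of the number of words.

```python
import numpy as np


def averaged_embedding(center, n_words, rng):
    """Average n_words synthetic word vectors scattered around a shared centre."""
    words = center + rng.normal(size=(n_words, center.shape[0]))
    return words.mean(axis=0)


def mean_distance_to_center(center, n_words, rng, trials=20):
    """Average distance between a document's mean vector and the shared centre."""
    dists = [
        np.linalg.norm(averaged_embedding(center, n_words, rng) - center)
        for _ in range(trials)
    ]
    return float(np.mean(dists))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    center = rng.normal(size=512)  # stand-in for the corpus-wide mean vector
    for n_words in (5, 50, 500):
        # Longer "documents" land closer to the centre.
        print(n_words, mean_distance_to_center(center, n_words, rng))
```

A density-based method such as DBSCAN then sees a dense core of long documents surrounded by a sparse halo of short ones, regardless of what the sentences are about.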
