Document clustering in Python using scikit-learn

User

#1 · edited 2024-09-28 17:15:45

My data has huge summary descriptions, which end up becoming tens of thousands of words when I apply TF/IDF. Is there a proper way to handle this high-dimensional data?

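My first suggestion is to leave it as it is unless you absolutely have to reduce it because of memory or execution-time problems.

If you do have to handle it, use dimensionality reduction (PCA, for example) or feature selection (probably better in your case; see chi2, for example).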
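As a rough illustration of the reduction step, here is a minimal sketch. It makes two substitutions that are assumptions of this example, not part of the answer: `chi2` in scikit-learn scores features against class labels, which an unlabeled clustering problem does not have, so the sketch caps the vocabulary with `max_features` instead; and it uses `TruncatedSVD`, which works directly on the sparse TF-IDF matrix, in place of PCA. The toy `docs` list is made up.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real summary descriptions.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten played on the mat",
    "stock prices fell sharply on monday",
    "the stock market rallied after the report",
    "investors sold shares as prices dropped",
    "the kitten chased the cat",
]

# max_features keeps only the most frequent terms, a simple
# label-free way to limit the number of columns.
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X = vectorizer.fit_transform(docs)

# Project the sparse TF-IDF matrix onto a few dense components.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
```

On a real corpus, `n_components` would be much larger than 2; the value here only keeps the example small.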

K-means and other algorithms require that I specify the number of clusters (centroids); in my case I do not know the number of clusters upfront. I believe this is completely unsupervised learning. Are there algorithms which can determine the number of clusters themselves?

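If you look at the clustering algorithms available in scikit-learn, you will see that not all of them require you to specify the number of clusters.

Another one that does not is hierarchical clustering, implemented in scipy. See also this answer.

I would also suggest that you use KMeans and try to manually tweak the number of clusters until you are satisfied with the results.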
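A minimal sketch of hierarchical clustering with scipy, assuming a toy corpus, average linkage, cosine distance and a cut-off of `t=0.9` (all choices of this example). Note that the number of clusters is not given, but a distance threshold is, so one parameter is traded for another.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; replace with the real documents.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten played on the mat",
    "stock prices fell sharply on monday",
    "the stock market rallied after the report",
    "investors sold shares as prices dropped",
    "the kitten chased the cat",
]

# scipy's linkage expects a dense array, so this only suits
# corpora small enough to fit in memory as a dense matrix.
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Build the merge tree, then cut it at a cosine distance of 0.9.
Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=0.9, criterion="distance")

print(labels)  # one cluster id (starting at 1) per document
```

Raising or lowering `t` changes how many clusters come out of the cut.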
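Trying several values by hand can be sketched roughly as follows. The silhouette score is an addition of this example, offered as one possible aid when comparing runs; the answer itself only suggests adjusting the number until the results look right. The corpus and the range of `k` are made up.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus; replace with the real documents.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten played on the mat",
    "stock prices fell sharply on monday",
    "the stock market rallied after the report",
    "investors sold shares as prices dropped",
    "the kitten chased the cat",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Fit KMeans for a few candidate cluster counts and record a score for each.
scores = {}
for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(k, round(score, 3))
```

A higher silhouette score suggests tighter, better-separated clusters, but inspecting the documents in each cluster is still worthwhile.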

I've never worked with document clustering before. If you are aware of tutorials, textbooks or articles which address this issue, please feel free to suggest them.

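Scikit-learn has lots of tutorials on working with text data; just search for "text data" on their site. One is for KMeans and others are for supervised learning, but I suggest you go over those too, to get more familiar with the library. In my opinion, from a coding, style and syntax point of view, unsupervised and supervised learning are very similar in scikit-learn.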

Document clustering is typically done using TF/IDF, which essentially converts the words in the documents to a vector space model which is then input to the algorithm.

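A small correction here: TF-IDF has nothing to do with clustering. It is simply a method for turning text data into numerical data. It does not care what you do with that data afterwards (clustering, classification, regression, search engines and so on).

I understand the message you were trying to get across, but it is incorrect to say that "clustering is done using TF-IDF". It is done using a clustering algorithm; TF-IDF only plays a preprocessing role in document clustering.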
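The split between preprocessing and clustering can be made explicit with a pipeline. This is a minimal sketch under assumed inputs (a made-up corpus and `n_clusters=2`); the second step could be swapped for a classifier or any other estimator without touching the TF-IDF step.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; replace with the real documents.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten played on the mat",
    "stock prices fell sharply on monday",
    "the stock market rallied after the report",
    "investors sold shares as prices dropped",
    "the kitten chased the cat",
]

# Step 1 turns text into numbers; step 2 is the actual clustering algorithm.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)

labels = pipeline.fit_predict(docs)
print(labels)  # one cluster id per document
```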

User

#2 · edited 2024-09-28 17:15:45

This link might be useful. It gives a good amount of explanation, with visual output: http://brandonrose.org/clustering

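User

#3 · edited 2024-09-28 17:15:45

For the large matrix produced by the TF/IDF transformation, consider using a sparse matrix.
You could try different values of k. I am not an expert in unsupervised clustering algorithms, but I bet that with such algorithms and different parameters you could also end up with a varying number of clusters.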
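On the sparse-matrix point: scikit-learn's `TfidfVectorizer` already returns a scipy sparse matrix, so in practice the advice mostly means not converting it to a dense array. A minimal sketch with a made-up corpus:

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; replace with the real documents.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten played on the mat",
    "stock prices fell sharply on monday",
    "the stock market rallied after the report",
]

X = TfidfVectorizer().fit_transform(docs)

# The result is sparse: only non-zero TF-IDF weights are stored.
print(sparse.issparse(X))

# Fraction of cells that are actually non-zero.
density = X.nnz / (X.shape[0] * X.shape[1])
print(X.shape, round(density, 3))
```

With tens of thousands of terms the density is usually far lower than in this toy example, which is where the memory saving comes from.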
