Answers to: Document clustering in Python using scikit-learn


1 answer

Anonymous, 1 day ago

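<blockquote> My data has huge summary descriptions, which end up becoming 10000's of words when I apply TF/IDF. Is there any proper way to handle this high dimensional data. </blockquote>

My first suggestion is not to reduce it at all unless you absolutely have to because of memory or execution-time problems. If you must handle it, use dimensionality reduction (for example <a href="http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html" rel="nofollow noreferrer">PCA</a>) or <a href="http://scikit-learn.org/stable/modules/feature_selection.html" rel="nofollow noreferrer">feature selection</a> (probably the better option in your case; see <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html" rel="nofollow noreferrer">chi2</a>, keeping in mind that chi2 scores features against class labels, so it needs at least some labelled documents).

<blockquote> K-means and other algorithms requires I specify the no. of clusters (centroids), in my case I do not know the number of clusters upfront. This I believe is a completely unsupervised learning. Are there algorithms which can determine the no. of clusters themselves? </blockquote>

If you look at <a href="http://scikit-learn.org/stable/modules/clustering.html#dbscan" rel="nofollow noreferrer">the clustering algorithms available in scikit-learn</a>, you will see that not all of them require you to specify the number of clusters. Another one that does not is hierarchical clustering, <a href="http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html" rel="nofollow noreferrer">implemented in scipy</a>. Also see <a href="https://stackoverflow.com/questions/10136470/unsupervised-clustering-with-unknown-number-of-clusters">this answer</a>. I would also suggest using KMeans and adjusting the number of clusters by hand until you are satisfied with the results.

<blockquote> I've never worked with document clustering before, if you are aware of tutorials, textbooks or articles which address this issue, please feel free to suggest. </blockquote>

scikit-learn has many tutorials on working with text data; search its site for "text data". One of them covers KMeans and the others cover supervised learning, but I suggest going over those too in order to get more familiar with the library. In my opinion, from a code, style, and syntax point of view, unsupervised and supervised learning look very similar in scikit-learn.

<blockquote> Document clustering is typically done using TF/IDF. Which essentially converts the words in the documents to vector space model which is then input to the algorithm. </blockquote>

A small correction here: TF-IDF has nothing to do with clustering. It is simply a method for turning text data into numerical data, and it does not care what you do with that data afterwards (clustering, classification, regression, search, and so on). I understand the message you were trying to get across, but it is incorrect to say that "clustering is done using TF-IDF". Clustering is done by a clustering algorithm; TF-IDF only plays a preprocessing role in document clustering.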
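To make the points above concrete, here is a minimal sketch of one possible pipeline, not a definitive recipe. Several things in it are assumptions that do not come from the answer: the toy `docs` corpus is hypothetical; `TruncatedSVD` is swapped in for the linked PCA because it works directly on the sparse matrix that `TfidfVectorizer` returns; the silhouette score stands in for "tweak the number of clusters until you are satisfied"; and the Ward linkage with a distance cut-off `t=1.0` is an arbitrary illustration of hierarchical clustering without a preset cluster count.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus standing in for the real summary descriptions.
docs = [
    "cat chases mouse in the garden",
    "the cat sleeps while the mouse hides",
    "a mouse and a cat share the garden",
    "stock market prices fall",
    "investors watch the stock market",
    "market prices rise as investors buy stock",
    "team wins the football match",
    "football team loses the match",
]

# Step 1: preprocessing only. TF-IDF turns text into a sparse numeric matrix
# with one column per vocabulary term; it does no clustering itself.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Step 2: dimensionality reduction (assumed choice: TruncatedSVD, which
# accepts sparse input). n_components=3 is an arbitrary value for this toy data.
X_reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)

# Step 3a: KMeans with several candidate cluster counts, scored by silhouette
# (an assumed heuristic; inspecting the clusters by hand works too).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    scores[k] = silhouette_score(X_reduced, labels)
best_k = max(scores, key=scores.get)

# Step 3b: hierarchical clustering in SciPy. No cluster count is given;
# the tree is cut at a distance threshold instead (t=1.0 is arbitrary).
Z = linkage(X_reduced, method="ward")
hier_labels = fcluster(Z, t=1.0, criterion="distance")
n_hier_clusters = len(set(hier_labels))

print("silhouette per k:", scores)
print("best k by silhouette:", best_k)
print("clusters found by hierarchical cut:", n_hier_clusters)
```

On real data the number of SVD components, the range of `k`, and the cut-off `t` would all need tuning, and the hierarchical step trades "choose k" for "choose a distance threshold", so some manual judgment remains either way.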
