<blockquote>
<p>My data has huge summary descriptions, which end up becoming 10000's of words when I apply TF/IDF. Is there any proper way to handle this high dimensional data.</p>
</blockquote>
<p>My first suggestion is not to handle it at all unless you absolutely have to because of memory or execution-time problems.</p>
<p>If you must handle it, you should use dimensionality reduction (for example <a href="http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html" rel="nofollow noreferrer">PCA</a>) or <a href="http://scikit-learn.org/stable/modules/feature_selection.html" rel="nofollow noreferrer">feature selection</a> (probably the better option in your case, see <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html" rel="nofollow noreferrer">chi2</a>).</p>
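<p>As a rough sketch of the dimensionality-reduction route: the snippet below uses <code>TruncatedSVD</code> rather than the PCA linked above, because it works directly on the sparse matrix that <code>TfidfVectorizer</code> produces. That swap, the toy corpus, and the choice of two components are my assumptions for illustration, not something taken from your data.</p>

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the long summary descriptions.
docs = [
    "the cat sat on the mat",
    "a cat and a dog played on the mat",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
]

# TF-IDF yields a sparse matrix with one column per vocabulary term.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Project onto a few components; TruncatedSVD accepts sparse input directly.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
```

<p>On a real corpus you would likely want far more components (a few hundred is a common starting point), and you can also cap the vocabulary up front with the vectorizer's <code>max_features</code> or <code>min_df</code> parameters.</p>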
<blockquote>
<p>K - means and other algorithms requires I specify the no. of clusters ( centroids ), in my case I do not know the number of clusters upfront. This I believe is a completely unsupervised learning. Are there algorithms which can determine the no. of clusters themselves?</p>
</blockquote>
<p>If you look at <a href="http://scikit-learn.org/stable/modules/clustering.html#dbscan" rel="nofollow noreferrer">the clustering algorithms available in scikit-learn</a>, you'll see that not all of them require you to specify the number of clusters.</p>
<p>Another one that does not is hierarchical clustering, <a href="http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html" rel="nofollow noreferrer">implemented in scipy</a>. Also see <a href="https://stackoverflow.com/questions/10136470/unsupervised-clustering-with-unknown-number-of-clusters">this answer</a>.</p>
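<p>To make the hierarchical option concrete, here is a minimal sketch with <code>scipy.cluster.hierarchy</code>. The points, the average linkage, and the distance threshold of <code>1.0</code> are all assumptions chosen for illustration. Note that you still pick a threshold, so the number of clusters is decided by where you cut the tree rather than supplied up front.</p>

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical 2-D points forming two well-separated groups.
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# Build the merge tree with average linkage on Euclidean distances.
Z = linkage(points, method="average", metric="euclidean")

# Cut the tree at a distance threshold instead of asking for k clusters.
labels = fcluster(Z, t=1.0, criterion="distance")

n_clusters = len(set(labels))
print(labels, n_clusters)
```

<p><code>linkage</code> expects a dense array, so for TF-IDF output you would probably need to reduce the dimensionality first and convert the result to dense.</p>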
<p>I would also suggest you use KMeans and try manually tweaking the number of clusters until you are satisfied with the results.</p>
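<p>One way to make that manual tweaking a bit less blind is to score each candidate number of clusters. The sketch below uses the silhouette score for this. The metric, the toy corpus, and the range of <code>k</code> values are my assumptions, and a score like this is only a hint, not a replacement for looking at the clusters yourself.</p>

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus; replace with your own documents.
docs = [
    "the cat sat on the mat",
    "a dog chased the cat around the garden",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
    "the team won the football match",
    "the striker scored twice in the match",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Fit KMeans for several cluster counts and record a quality score for each.
scores = {}
for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(k, round(score, 3))
```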
<blockquote>
<p>I've never worked with document clustering before, if you are aware of tutorials , textbooks or articles which address this issue, please feel free to suggest.</p>
</blockquote>
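<p>Scikit-learn has lots of tutorials for working with text data; just search for "text data" on their site. One is for KMeans and the others are for supervised learning, but I suggest you go over those too to get more familiar with the library. From a coding, style and syntax point of view, unsupervised and supervised learning are pretty similar in scikit-learn, in my opinion.</p>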
<blockquote>
<p>Document clustering is typically done using TF/IDF. Which essentially converts the words in the documents to vector space model which is then input to the algorithm.</p>
</blockquote>
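<p>A small correction here: TF-IDF has nothing to do with clustering. It is simply a method for turning text data into numerical data. It does not care what you do with that data afterwards (clustering, classification, regression, search engines and so on).</p>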
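<p>I understand the message you were trying to get across, but it is incorrect to say that "clustering is done using TF-IDF". It is done using a clustering algorithm; TF-IDF only plays a preprocessing role in document clustering.</p>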
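<p>To illustrate the point, the sketch below feeds the same TF-IDF matrix to a clustering algorithm and to a classifier. The documents and the labels in <code>y</code> are hypothetical, and <code>MultinomialNB</code> is just one classifier picked for the example.</p>

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical documents and labels, for illustration only.
docs = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]
y = [0, 0, 1, 1]

# Preprocessing step: text in, numeric matrix out.
X = TfidfVectorizer().fit_transform(docs)

# The same matrix can feed an unsupervised model...
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# ...or a supervised one; TF-IDF itself is unaware of either.
clf = MultinomialNB().fit(X, y)
predicted = clf.predict(X)

print(cluster_labels, predicted)
```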