How do I correctly remove redundant components from Scikit-Learn's DPGMM?


I am using scikit-learn's implementation of the Dirichlet process Gaussian mixture model:

<a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/dpgmm.py" rel="nofollow noreferrer">https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/dpgmm.py</a>
<a href="http://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html" rel="nofollow noreferrer">http://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html</a>

That is, it is <code>sklearn.mixture.BayesianGaussianMixture()</code> with <code>weight_concentration_prior_type = 'dirichlet_process'</code>, which is the default. Unlike k-means, where the user sets the number of clusters k in advance, DPGMM is an infinite mixture model in which the Dirichlet process acts as a prior distribution over the number of clusters.

My DPGMM model always outputs exactly <code>n_components</code> clusters. As described in this post, the correct way to handle this is to use <code>predict(X)</code> to "reduce the redundant components":

<a href="https://stackoverflow.com/questions/38528311/scikit-learns-dpgmm-fitting-number-of-components">Scikit-Learn's DPGMM fitting: number of components?</a>

However, the example linked there does not actually remove the redundant components and show the "correct" number of clusters in the data. Instead, it simply plots the correct number of clusters:

<a href="http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html" rel="nofollow noreferrer">http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html</a>

How does a user actually remove the redundant components and output an array of the components that should remain? Is this the "official"/only way to remove redundant clusters?

Here is my code:

<pre><code>>>> import pandas as pd
>>> import numpy as np
>>> import random
>>> from sklearn import mixture
>>> X = pd.read_csv(....)  # my matrix
>>> X.shape
(20000, 48)
>>> dpgmm3 = mixture.BayesianGaussianMixture(
...     n_components=20,
...     weight_concentration_prior_type='dirichlet_process',
...     max_iter=1000, verbose=2)
>>> dpgmm3.fit(X)  # Fitting the DPGMM model
>>> labels = dpgmm3.predict(X)  # Generating labels after model is fitted
>>> max(labels)
>>> np.unique(labels)  # Number of labels == n_components specified above
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

>>> # Trying with a different n_components (not specifying n_components)
>>> dpgmm3_1 = mixture.BayesianGaussianMixture(
...     weight_concentration_prior_type='dirichlet_process',
...     max_iter=1000)
>>> dpgmm3_1.fit(X)
>>> labels_1 = dpgmm3_1.predict(X)
>>> labels_1
array([0, 0, 0, ..., 0, 0, 0])  # All were classified under the same label

>>> # Trying with n_components = 7
>>> dpgmm3_2 = mixture.BayesianGaussianMixture(
...     n_components=7,
...     weight_concentration_prior_type='dirichlet_process',
...     max_iter=1000)
>>> dpgmm3_2.fit(X)
>>> labels_2 = dpgmm3_2.predict(X)
>>> np.unique(labels_2)
array([0, 1, 2, 3, 4, 5, 6])  # number of labels == n_components
</code></pre>
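For reference, a minimal sketch of what "reducing redundant components with <code>predict(X)</code>" could look like in practice. It is not an official scikit-learn procedure. It assumes that <code>n_components</code> acts as an upper bound (truncation level) and that a component is "redundant" when no sample is assigned to it. The synthetic three-blob data stands in for the real matrix, and the 0.01 weight cutoff is an arbitrary illustrative value, not a library default.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in for the real data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(200, 2)) for c in centers])

# n_components is only an upper bound on the number of clusters.
model = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=1000,
    random_state=0,
).fit(X)

labels = model.predict(X)

# `used` holds the indices of components that received at least one sample;
# `compact_labels` renumbers the labels to 0..len(used)-1.
used, compact_labels = np.unique(labels, return_inverse=True)

print("components in use:", used)
print("their weights:", np.round(model.weights_[used], 3))
print("their means:\n", np.round(model.means_[used], 2))

# Alternative view: components whose mixing weight exceeds an
# arbitrary cutoff (0.01 here is an assumption for illustration).
heavy = np.flatnonzero(model.weights_ > 0.01)
print("components with weight > 0.01:", heavy)
```

The fitted model itself still stores all <code>n_components</code> components; this sketch only selects the ones that are used. If <code>predict(X)</code> really does assign samples to every component, as in the transcript above, <code>used</code> will list all of them. Note also that <code>n_components</code> defaults to 1, which is consistent with every sample receiving label 0 when it is left unspecified.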
