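<p>Here is a concrete example showing how to match <code>KMeans</code> cluster IDs to the labels of the training data. The underlying idea is that the <code>confusion_matrix</code> should have large values on its diagonal when the classification is correct. This is the confusion matrix before the cluster IDs are associated with the training labels (rows are true labels, columns are cluster IDs):</p>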
<pre><code>cm =
array([[ 0, 395, 0, 5, 0],
[ 0, 2, 5, 391, 2],
[ 2, 0, 0, 0, 398],
[ 0, 0, 400, 0, 0],
[398, 0, 0, 0, 2]])
</code></pre>
<p>Now we only need to reorder the confusion matrix so that its large values end up on the diagonal. This is easy to do:</p>
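<pre><code># For each cluster ID (column), find the true label (row) with the most samples
cm_argmax = cm.argmax(axis=0)
cm_argmax
# array([4, 0, 3, 1, 2])
# Replace every predicted cluster ID with the true label it maps to
y_pred_ = np.array([cm_argmax[i] for i in y_pred])
# Recompute the confusion matrix with the relabeled predictions
cm_ = confusion_matrix(y, y_pred_)
</code></pre>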
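<p>The mapping step can be checked by hand on the matrix shown earlier. The following is a minimal sketch, assuming the matrix is copied in as a literal; <code>y_pred_demo</code> is a hypothetical handful of cluster IDs used only for illustration, and <code>is_one_to_one</code> is a helper name introduced here.</p>

```python
import numpy as np

# Confusion matrix copied from the example above:
# rows are true labels, columns are KMeans cluster IDs.
cm = np.array([[  0, 395,   0,   5,   0],
               [  0,   2,   5, 391,   2],
               [  2,   0,   0,   0, 398],
               [  0,   0, 400,   0,   0],
               [398,   0,   0,   0,   2]])

# For each cluster ID (column), the true label (row) holding most of its samples.
cm_argmax = cm.argmax(axis=0)
print(cm_argmax)  # [4 0 3 1 2]

# The remapping is unambiguous only if no two clusters pick the same label.
is_one_to_one = len(np.unique(cm_argmax)) == cm.shape[1]
print(is_one_to_one)  # True

# Vectorized equivalent of the list comprehension,
# applied to a few hypothetical cluster IDs.
y_pred_demo = np.array([1, 3, 4, 2, 0, 1])
y_pred_demo_ = cm_argmax[y_pred_demo]
print(y_pred_demo_)  # [0 1 2 3 4 0]
```

<p>Indexing <code>cm_argmax</code> with the whole prediction array gives the same result as the list comprehension and avoids a Python-level loop.</p>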
<p>This gives the new confusion matrix, which should now look familiar:</p>
<pre><code>cm_ =
array([[395, 5, 0, 0, 0],
[ 2, 391, 2, 5, 0],
[ 0, 0, 398, 0, 2],
[ 0, 0, 0, 400, 0],
[ 0, 0, 2, 0, 398]])
</code></pre>
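<p>Because the cluster-to-label mapping in this example is a permutation, the reordered matrix can also be obtained by permuting the columns of the original one, without touching the predictions. This is only a sketch under that assumption; it would not be valid if two clusters mapped to the same label.</p>

```python
import numpy as np

# Confusion matrix before relabeling, copied from the example above.
cm = np.array([[  0, 395,   0,   5,   0],
               [  0,   2,   5, 391,   2],
               [  2,   0,   0,   0, 398],
               [  0,   0, 400,   0,   0],
               [398,   0,   0,   0,   2]])

# cluster ID -> true label
cm_argmax = cm.argmax(axis=0)

# Column k of the reordered matrix is the column of the cluster mapped to
# label k. argsort inverts the mapping, which works because it is a permutation.
cm_ = cm[:, np.argsort(cm_argmax)]
print(cm_)
```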
<p>You can further verify the result with <code>accuracy_score</code>:</p>
<pre><code>y_pred_ = np.array([cm_argmax[i] for i in y_pred])
accuracy_score(y,y_pred_)
# 0.991
</code></pre>
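<p>The same figure can be read directly off the confusion matrices, since accuracy is the diagonal sum divided by the total count. The sketch below assumes both matrices are copied in as literals; <code>diagonal_accuracy</code> is a helper name introduced here for illustration.</p>

```python
import numpy as np

# Confusion matrices copied from the example: before and after relabeling.
cm = np.array([[  0, 395,   0,   5,   0],
               [  0,   2,   5, 391,   2],
               [  2,   0,   0,   0, 398],
               [  0,   0, 400,   0,   0],
               [398,   0,   0,   0,   2]])
cm_ = np.array([[395,   5,   0,   0,   0],
                [  2, 391,   2,   5,   0],
                [  0,   0, 398,   0,   2],
                [  0,   0,   0, 400,   0],
                [  0,   0,   2,   0, 398]])

def diagonal_accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly labeled."""
    return float(np.trace(matrix) / matrix.sum())

acc_before = diagonal_accuracy(cm)   # 4 of 2000 samples on the diagonal
acc_after = diagonal_accuracy(cm_)   # 1982 of 2000 samples on the diagonal
print(acc_before)  # 0.002
print(acc_after)   # 0.991
```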
<p>The complete, self-contained code is shown below:</p>
<pre><code>import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix, accuracy_score

# Five blobs with known centers and spreads; y holds the true labels
blob_centers = np.array(
    [[ 0.2, 2.3],
     [-1.5, 2.3],
     [-2.8, 1.8],
     [-2.8, 2.8],
     [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

def plot_clusters(X, y=None):
    plt.scatter(X[:, 0], X[:, 1], c=y, s=1)
    plt.xlabel("$x_1$", fontsize=14)
    plt.ylabel("$x_2$", fontsize=14, rotation=0)

plt.figure(figsize=(8, 4))
plot_clusters(X)
plt.show()

# Cluster the data; the cluster IDs are arbitrary and need not match y
k = 5
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)

# Confusion matrix before relabeling (rows: true labels, columns: cluster IDs)
cm = confusion_matrix(y, y_pred)
print(cm)

# Map each cluster ID to the true label that dominates its column
cm_argmax = cm.argmax(axis=0)
print(cm_argmax)
y_pred_ = np.array([cm_argmax[i] for i in y_pred])

# Confusion matrix and accuracy after relabeling
cm_ = confusion_matrix(y, y_pred_)
print(cm_)
print(accuracy_score(y, y_pred_))
</code></pre>
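<p>The <code>argmax</code> step in the code above works here because every cluster is dominated by a different label. When clusters are less clean, two clusters may pick the same label. One possible alternative, which is not part of the original approach, is to solve for a one-to-one assignment with SciPy's <code>linear_sum_assignment</code>. The sketch below assumes a SciPy version that supports the <code>maximize</code> argument and uses the confusion matrix from the example as a literal; <code>cluster_to_label</code> is a name introduced here.</p>

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Confusion matrix before relabeling, copied from the example above:
# rows are true labels, columns are KMeans cluster IDs.
cm = np.array([[  0, 395,   0,   5,   0],
               [  0,   2,   5, 391,   2],
               [  2,   0,   0,   0, 398],
               [  0,   0, 400,   0,   0],
               [398,   0,   0,   0,   2]])

# Pick exactly one cluster per true label so that the total
# number of matched samples is as large as possible.
true_labels, cluster_ids = linear_sum_assignment(cm, maximize=True)

# Invert the pairs into a lookup table: cluster ID -> true label.
cluster_to_label = np.empty_like(cluster_ids)
cluster_to_label[cluster_ids] = true_labels
print(cluster_to_label)  # [4 0 3 1 2]
```

<p>For this data the result matches <code>cm.argmax(axis=0)</code>, so the relabeling would be applied the same way, for example with <code>cluster_to_label[y_pred]</code>.</p>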