
How to match KMeans cluster labels to the true labels in Python


I have a problem labelling data with the KMeans algorithm. My test sentence is assigned to the right cluster, but it does not get the right label. I used numpy to match the clusters to the true labels, but KMeans can shuffle the cluster IDs from run to run, so the cluster numbers do not line up with the positions in my list of true labels. I need help solving this. Here is my code:

<pre><code>from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import re
import numpy as np
from collections import Counter

stop = set(stopwords.words('indonesian'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

# Cleaning the text sentences so that punctuation marks, stop words & digits are removed
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    processed = re.sub(r"\d+", "", normalized)
    y = processed.split()
    #print (y)
    return y

path = "coba.txt"
train_clean_sentences = []
with open(path, 'r') as fp:
    for line in fp:
        line = line.strip()
        cleaned = clean(line)
        cleaned = ' '.join(cleaned)
        train_clean_sentences.append(cleaned)
#print(train_clean_sentences)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_clean_sentences)

# Clustering the training 30 sentences with K-means technique
modelkmeans = KMeans(n_clusters=3, init='k-means++', max_iter=200, n_init=100)
modelkmeans.fit(X)

teks_satu = "Aplikasi Machine Learning untuk mengenali daun mangga dengan metode CNN"
test_clean_sentence = []
cleaned_test = clean(teks_satu)
cleaned = ' '.join(cleaned_test)
cleaned = re.sub(r"\d+", "", cleaned)
test_clean_sentence.append(cleaned)
Test = vectorizer.transform(test_clean_sentence)

true_test_labels = ['AI', 'VR', 'Sistem Informasi']
predicted_labels_kmeans = modelkmeans.predict(Test)
print(predicted_labels_kmeans)

print("\n-------------------------------PREDICTIONS BY K-Means--------------------------------------")
print("\nIndex of Virtual Reality : ", Counter(modelkmeans.labels_[5:10]).most_common(1)[0][0])
print("Index of Machine Learning : ", Counter(modelkmeans.labels_[0:5]).most_common(1)[0][0])
print("Index of Sistem Informasi : ", Counter(modelkmeans.labels_[10:15]).most_common(1)[0][0])

# np.int was removed from NumPy; take the single prediction and cast it with int().
# The lookup below indexes the label list by raw cluster ID, which is the mismatch I am asking about.
print("\n", teks_satu, ":", true_test_labels[int(predicted_labels_kmeans[0])], ":", predicted_labels_kmeans)
</code></pre>

Category: Python Q&A


1 answer

Anonymous, 1 day ago

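Here is a concrete example that shows how to match <code>KMeans</code> cluster IDs to the training labels. The basic idea is that the <code>confusion_matrix</code> should have large values on its diagonal when the clusters agree with the labels. This is the confusion matrix before the cluster IDs have been associated with the training labels (rows are true labels, columns are cluster IDs):

<pre><code>cm = array([[  0, 395,   0,   5,   0],
            [  0,   2,   5, 391,   2],
            [  2,   0,   0,   0, 398],
            [  0,   0, 400,   0,   0],
            [398,   0,   0,   0,   2]])
</code></pre>

Now we only need to rearrange the confusion matrix so that its large values land on the diagonal. <code>cm.argmax(axis=0)</code> returns, for each cluster ID (column), the true label (row) that contributes the most samples, so it can be used directly as a lookup table from cluster ID to label:

<pre><code>cm_argmax = cm.argmax(axis=0)   # array([4, 0, 3, 1, 2]) for the matrix above
y_pred_ = np.array([cm_argmax[i] for i in y_pred])
</code></pre>

This gives the new confusion matrix, which now looks familiar, right?

<pre><code>cm_ = array([[395,   5,   0,   0,   0],
             [  2, 391,   2,   5,   0],
             [  0,   0, 398,   0,   2],
             [  0,   0,   0, 400,   0],
             [  0,   0,   2,   0, 398]])
</code></pre>

You can verify the result further with <code>accuracy_score</code>:

<pre><code>y_pred_ = np.array([cm_argmax[i] for i in y_pred])
accuracy_score(y, y_pred_)   # 0.991
</code></pre>

The complete standalone code is:

<pre><code>import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix, accuracy_score

# Five synthetic blobs with known labels y
blob_centers = np.array(
    [[ 0.2, 2.3],
     [-1.5, 2.3],
     [-2.8, 1.8],
     [-2.8, 2.8],
     [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

def plot_clusters(X, y=None):
    plt.scatter(X[:, 0], X[:, 1], c=y, s=1)
    plt.xlabel("$x_1$", fontsize=14)
    plt.ylabel("$x_2$", fontsize=14, rotation=0)

plt.figure(figsize=(8, 4))
plot_clusters(X)
plt.show()

k = 5
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)

# Confusion matrix with raw cluster IDs (rows: true labels, columns: cluster IDs)
cm = confusion_matrix(y, y_pred)
print(cm)

# For each cluster ID, pick the true label with the most samples
cm_argmax = cm.argmax(axis=0)
print(cm_argmax)

# Relabel the predictions and recompute the confusion matrix on the relabelled values
y_pred_ = np.array([cm_argmax[i] for i in y_pred])
cm_ = confusion_matrix(y, y_pred_)
print(cm_)

print(accuracy_score(y, y_pred_))
</code></pre>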
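One caveat with the <code>argmax</code> step: nothing stops two clusters from being mapped to the same label when the clustering is less clean than in this example. A possible alternative is to solve a one-to-one assignment on the confusion matrix with <code>scipy.optimize.linear_sum_assignment</code>. The sketch below is not part of the original answer. It assumes the true labels and the cluster IDs are both integers in <code>0..k-1</code> and that there are as many clusters as classes; <code>match_clusters_to_labels</code> is a hypothetical helper name, and the tiny arrays are made-up illustration data.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix


def match_clusters_to_labels(y_true, y_pred):
    """Return `mapping` such that mapping[cluster_id] is the matched true label.

    Assumes integer labels and cluster IDs in 0..k-1, with as many
    clusters as classes (hypothetical helper, for illustration only).
    """
    # Rows are true labels, columns are cluster IDs.
    cm = confusion_matrix(y_true, y_pred)
    # linear_sum_assignment minimizes total cost, so negate the counts
    # to maximize the number of samples on the matched pairs.
    true_idx, cluster_idx = linear_sum_assignment(-cm)
    mapping = np.empty(cm.shape[1], dtype=int)
    mapping[cluster_idx] = true_idx
    return mapping


# Made-up toy data: KMeans-style cluster IDs that are a permutation of the labels.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 0, 0, 0, 0])

mapping = match_clusters_to_labels(y_true, y_pred)
y_aligned = mapping[y_pred]  # cluster IDs translated into label space

print(mapping)
print(y_aligned)
```

For the question's setup, the same kind of lookup table could replace indexing <code>true_test_labels</code> by the raw cluster ID: translate the predicted cluster ID through the mapping first, then look up the label name.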
