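<p>I'm adding another answer here. I think my first answer is quite accurate. However, I did find a way to cluster text using K-means, so I'll share it here, because I'm looking for feedback on the "correctness" of this technique.</p>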
<pre><code>from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

# Turn the documents into TF-IDF vectors, dropping English stop words.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# true_k equals the number of documents here, so each document can end up
# in its own cluster. With n_init=1 and no random_state, runs may differ.
true_k = 8
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
model.fit(X)

print("Top terms per cluster:")
# For each centroid, term indices sorted by descending weight.
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])

print("\n")
print("Prediction")

# New text must go through the same fitted vectorizer before predict().
Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["My cat is hungry."])
prediction = model.predict(Y)
print(prediction)
</code></pre>
<p>Results:</p>
<pre><code>Top terms per cluster:
Cluster 0:
eating
kitty
little
came
restaurant
play
ve
feedback
face
extension
Cluster 1:
translate
app
incredible
google
eating
impressed
feedback
face
extension
ve
Cluster 2:
climbing
ninja
cat
eating
impressed
google
feedback
face
extension
ve
Cluster 3:
kitten
belly
squooshy
merley
best
eating
google
feedback
face
extension
Cluster 4:
100
open
tab
smiley
face
google
feedback
extension
eating
climbing
Cluster 5:
chrome
extension
promoter
key
google
eating
impressed
feedback
face
ve
Cluster 6:
impressed
map
feedback
google
ve
eating
face
extension
climbing
key
Cluster 7:
ve
taken
photo
best
cat
eating
google
feedback
face
extension
</code></pre>
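<p>One thing I noticed in the listing above: with 8 clusters for 8 documents, each centroid is probably just a single document's TF-IDF vector, so the terms after a document's own words (for example <code>eating</code> under Cluster 1) most likely have zero weight and only show up because of how the sort orders ties. Here is a minimal sketch in plain Python of filtering those out. The <code>top_terms</code> helper and the weights are hypothetical, made up for illustration, and are not taken from the run above.</p>

```python
def top_terms(centroid, terms, n=10, min_weight=0.0):
    """Return up to n (term, weight) pairs whose weight exceeds min_weight,
    heaviest first. Ties keep their original (vocabulary) order."""
    ranked = sorted(zip(terms, centroid), key=lambda tw: tw[1], reverse=True)
    return [(t, w) for t, w in ranked[:n] if w > min_weight]


# Hypothetical vocabulary and centroid weights, for illustration only.
terms = ["app", "eating", "google", "incredible", "translate"]
centroid = [0.53, 0.0, 0.38, 0.53, 0.53]

result = top_terms(centroid, terms)
for term, weight in result:
    print("%s %.2f" % (term, weight))
```

<p>With the fitted model, the same idea would presumably be applied to each row of <code>model.cluster_centers_</code> together with the <code>terms</code> array, so that only terms that actually contribute to a cluster are printed.</p>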