Clustering using feature hashing

doc_a = {
    "category": "election, law, politics, civil, government",
    "expertise": "political science, civics, republican"
}
doc_b = {
    "category": "Computers, optimization",
    "expertise": "computer science, graphs, optimization"
}
doc_c = {
    "category": "Election, voting",
    "expertise": "political science, republican"
}
doc_d = {
    "category": "Engineering, Software, computers",
    "expertise": "computers, programming, optimization"
}
doc_e = {
    "category": "International trade, politics",
    "expertise": "civics, political activist"
}

1 Answer

User

Answer #1 · posted 2024-09-28 03:19:24

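You will make things simpler for yourself if you use HashingVectorizer instead of FeatureHasher for this problem. HashingVectorizer takes care of tokenizing the input data and accepts a list of strings.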
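To make that concrete, here is a minimal sketch of HashingVectorizer applied directly to a list of raw strings. The sample strings and the small n_features value are assumptions picked for illustration (the default is much larger); the tokenizing and lowercasing happen inside the vectorizer.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Illustrative inputs (assumed for this sketch): raw, untokenized strings.
texts = ["election, law, politics", "Computers, optimization"]

# n_features is deliberately small here; it only sets the width of the output.
vectorizer = HashingVectorizer(n_features=2**10)

# The vectorizer tokenizes each string and hashes the tokens into columns.
X = vectorizer.fit_transform(texts)

print(X.shape)  # one row per input string, n_features columns
print(X.nnz)    # number of non-zero entries in the sparse matrix
```

The result is a SciPy sparse matrix, so it can be passed to estimators that accept sparse input without converting it to a dense array.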

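The main challenge in this problem is that you actually have two kinds of text features, category and expertise. The trick in this case is to fit a separate hashing vectorizer for each of the two features and then combine the outputs: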

from sklearn.feature_extraction.text import HashingVectorizer
from scipy.sparse import hstack
from sklearn.cluster import KMeans

docs = [doc_a,doc_b, doc_c, doc_d, doc_e]

# vectorize both fields separately
category_vectorizer = HashingVectorizer()
Xc = category_vectorizer.fit_transform([doc["category"] for doc in docs])

expertise_vectorizer = HashingVectorizer()
Xe = expertise_vectorizer.fit_transform([doc["expertise"] for doc in docs])

# combine the features into a single data set
X = hstack((Xc,Xe))
print("X: %d x %d" % X.shape)
print("Xc: %d x %d" % Xc.shape)
print("Xe: %d x %d" % Xe.shape)

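# set up a cluster model (random_state fixed so repeated runs give the same labels)
km = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit the model and report the cluster assigned to each document
for k, v in zip(["a", "b", "c", "d", "e"], km.fit_predict(X)):
    print("%s is in cluster %d" % (k, v))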
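For contrast, here is a rough sketch of what the same task might look like with FeatureHasher, which does no tokenizing of its own. This is not the approach recommended above. The field_tokens helper, the comma-splitting rule, the field-name prefix and the n_features value are all assumptions made for this illustration, and whole comma-separated terms are hashed rather than individual words, so the resulting features differ from the HashingVectorizer version.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction import FeatureHasher

docs = {
    "a": {"category": "election, law, politics, civil, government",
          "expertise": "political science, civics, republican"},
    "b": {"category": "Computers, optimization",
          "expertise": "computer science, graphs, optimization"},
    "c": {"category": "Election, voting",
          "expertise": "political science, republican"},
    "d": {"category": "Engineering, Software, computers",
          "expertise": "computers, programming, optimization"},
    "e": {"category": "International trade, politics",
          "expertise": "civics, political activist"},
}


def field_tokens(doc):
    """Hypothetical helper: split each field on commas and prefix every
    term with its field name, so the two fields hash to distinct features."""
    tokens = []
    for field in ("category", "expertise"):
        for term in doc[field].split(","):
            term = term.strip().lower()
            if term:
                tokens.append(f"{field}={term}")
    return tokens


# input_type="string" expects one iterable of string tokens per sample.
hasher = FeatureHasher(n_features=2**10, input_type="string")
X = hasher.transform([field_tokens(doc) for doc in docs.values()])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

for name, label in zip(docs, labels):
    print(f"{name} is in cluster {label}")
```

Prefixing each token with its field name plays the role that hstack plays in the HashingVectorizer version: it stops a term such as "optimization" in category from being counted as the same feature as "optimization" in expertise.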
