How to cluster text with pyclustering

Published 2024-05-18 11:05:43


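I want to cluster the 20 newsgroups texts with the pyclustering library (https://codedocs.xyz/annoviko/pyclustering/classpyclustering_1_1cluster_1_1cure_1_1cure.html#details), for example with CURE. As far as I understand, it needs input like this: [[0.1, 0.5], [0.3, 0.1], ...]. Can I produce that with scikit-learn's TfidfVectorizer or some other method? Are the required values the ones shown in parentheses by the vectorizer (e.g. (338615161))? My code so far is:

So far I have tried feeding it the vectorizer output, but it did not work.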

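from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from pyclustering.cluster import cluster_visualizer
from pyclustering.cluster.cure import cure

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

print("Loading 20 newsgroups dataset for categories:")
print(categories)

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

# Build a sparse TF-IDF matrix with one row per document.
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
X = vectorizer.fit_transform(dataset.data)
print(X)
# Convert to a dense array; this can be very large for TF-IDF features.
X = X.toarray()
# Allocate 100 clusters.
cure_instance = cure(X, 100)
cure_instance.process()
clusters = cure_instance.get_clusters()
# Visualize allocated clusters.
visualizer = cluster_visualizer()
visualizer.append_clusters(clusters, X)
visualizer.show()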
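
On the question about the values in parentheses: when a SciPy sparse matrix is printed, each nonzero entry is listed as a (row, column) index pair followed by its value, so the parenthesized numbers are coordinates rather than features. The sketch below uses a tiny made-up corpus (`docs` is hypothetical and stands in for `dataset.data`) to show how the TF-IDF matrix could be turned into the list-of-lists form described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for dataset.data.
docs = [
    "the rocket reached orbit",
    "the shuttle reached the station",
    "rendering graphics on the gpu",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Each nonzero entry is addressed by (row, column); these index pairs are
# what print(X) shows in parentheses, followed by the TF-IDF weight.
coo = X.tocoo()
entries = list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))

# Dense list of lists: one inner list of floats per document.
points = X.toarray().tolist()
```

Here `points` has the same shape as the `[[0.1, 0.5], [0.3, 0.1], ...]` example, with one float per vocabulary term in each inner list. It could then be passed as the data argument in place of `X`.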

I just want to cluster the text, the way it can be done with sklearn's Birch. Right now the process simply gets killed.
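
One possible reason for the process being killed is memory and run time: `X.toarray()` stores every zero of a matrix with thousands of columns, and the clustering then works on those very long vectors. A common workaround is to reduce the TF-IDF matrix with scikit-learn's `TruncatedSVD` before clustering. This is a swapped-in technique, not part of the original code. A minimal sketch on a hypothetical toy corpus follows; `n_components=2` is chosen only so the example runs on so few documents.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for dataset.data.
docs = [
    "the rocket reached orbit around the moon",
    "the shuttle docked with the space station",
    "rendering graphics on the gpu with shaders",
    "religion and atheism are debated in this group",
]

X = TfidfVectorizer().fit_transform(docs)  # sparse TF-IDF matrix

# TruncatedSVD accepts the sparse matrix directly, so the full dense
# matrix never has to be built.
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)  # dense array of shape (n_docs, 2)

# Plain list of lists, one short vector per document.
points = X_reduced.tolist()
```

On the real dataset a larger `n_components` would likely be needed to keep enough information. The right value is a tuning choice, not something the original post specifies. With two or three components the data also has a dimensionality that `cluster_visualizer`, which is documented as plotting 1D, 2D, or 3D data, should be able to draw.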

