<p>First, you need to vectorize the text, for example with TF-IDF or word2vec. See the TF-IDF implementation below. I have skipped the preprocessing step, since it varies with the problem statement.</p>
<pre><code>import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN, KMeans

df = pd.read_csv('text.csv')
text = df.text.values

tfidf = TfidfVectorizer(stop_words='english')
vec_fit = tfidf.fit(text)
features = vec_fit.transform(text)

# now comes the clustering part; you can use KMeans, DBSCAN, etc. at will
model = DBSCAN().fit(features)  # can be slow on large corpora, but needs no cluster count
labels = model.labels_  # cluster label per training document (-1 means noise)

# note: DBSCAN has no predict() method for unseen data; for that, fit a
# model that supports it, e.g. KMeans
kmeans = KMeans(n_clusters=5).fit(features)
y_pred = kmeans.predict(vec_fit.transform(unseen_text))
</code></pre>
<p>Cluster-evaluation techniques are covered in the scikit-learn documentation:
<a href="https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation" rel="nofollow noreferrer">https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation</a></p>
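<p>As a minimal sketch of one such evaluation metric, here is the silhouette score applied to TF-IDF features (the tiny corpus and the choice of two clusters below are made up purely for illustration):</p>
<pre><code># toy example: TF-IDF features + KMeans, evaluated with silhouette score
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

corpus = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

features = TfidfVectorizer(stop_words='english').fit_transform(corpus)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# silhouette ranges from -1 (poor) to 1 (well-separated clusters)
score = silhouette_score(features, model.labels_)
print(round(score, 3))
</code></pre>
<p>Unlike the metrics that need ground-truth labels, the silhouette score only uses the features and the predicted labels, so it works for fully unsupervised runs.</p>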