How do I cluster a large text corpus (such as a list of job titles) using Python or R?

Posted 2024-10-02 00:33:47

I have a text corpus: a list of job titles scraped from the web. The list is quite clean and is stored as a single-column CSV file with one title per row.

I have tried TF-IDF followed by affinity propagation, but that runs into memory problems. I also tried word2vec followed by a clustering algorithm, but the results were not satisfactory. What is the most efficient way to cluster a dataset of roughly 75k job titles?


Tags: file, csv, method, memory, text, web, title, list
3 Answers

First, you need to vectorize the text, e.g. with TF-IDF or word2vec. See the TF-IDF implementation below; I have skipped the preprocessing step because it varies with the problem statement:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

df = pd.read_csv('text.csv')
text = df.text.values

# Vectorize the titles; the sparse TF-IDF matrix keeps memory usage low
tfidf = TfidfVectorizer(stop_words='english')
features = tfidf.fit_transform(text)

# Now comes the clustering part; you can use KMeans, DBSCAN, etc.
# DBSCAN does not require the number of clusters up front, but it can
# be slow depending on the size of the corpus.
model = DBSCAN().fit(features)
labels = model.labels_  # one cluster label per title; -1 marks noise

# Note: DBSCAN has no predict() method for unseen data. To label new
# titles, vectorize them with the fitted tfidf (tfidf.transform(...))
# and assign them to the nearest cluster yourself, or use a clusterer
# such as KMeans that supports predict().

Cluster evaluation techniques are described in the scikit-learn documentation: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
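As a minimal sketch of one such metric, assuming the features matrix and fitted model from the snippet above: the silhouette score needs no ground-truth labels (DBSCAN's noise points, labelled -1, are excluded here):

import numpy as np
from sklearn.metrics import silhouette_score

labels = model.labels_
mask = labels != -1  # drop noise points before scoring
if len(np.unique(labels[mask])) > 1:  # the metric needs at least 2 clusters
    print(silhouette_score(features[mask], labels[mask]))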

One alternative is topic modelling, for example a Latent Dirichlet Allocation (LDA) model.

A minimal R example looks like this:

library(topicmodels)
library(tidytext)
library(data.table)
library(tm)

# Reading Craigslist job titles
jobs <- fread('https://raw.githubusercontent.com/h2oai/app-ask-craig/master/workflow/data/craigslistJobTitles.csv')
jobs[, doc_id := 1:.N]

# Building a text corpus
dtm <- DocumentTermMatrix(Corpus(DataframeSource(jobs[, .(doc_id, text = jobtitle)])),
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE,
                                         stemming = TRUE,
                                         wordLengths = c(1, Inf)))

# Let's set number of topics to be equal to number of categories and fit LDA model
n_topics <- length(unique(jobs[, category]))
lda <- LDA(dtm, k = n_topics, method = 'Gibbs', control = list(seed = 1234, iter = 1e4))

# Kind of confusion matrix to inspect relevance
docs <- setDT(tidy(lda, matrix = 'gamma'))[, document := as.numeric(document)]
docs <- docs[, .(topic = paste0('topic_', .SD[gamma == max(gamma)]$topic)), by = .(doc_id = document)]
dcast(merge(jobs, docs)[, .N, by = .(category, topic)], category ~ topic, value.var = 'N')

The nice thing about the Craigslist dataset is that it comes with a label (category) for every job title, so you can build a kind of confusion matrix like the one below:

          category topic_1 topic_2 topic_3 topic_4 topic_5 topic_6
1:      accounting     357     113    1091     194     248     241
2:  administrative     595     216    1550     260     372     526
3: customerservice    1142     458     331     329     320     567
4:       education     296     263     251     280    1638     578
5:    foodbeverage     325     369     287    1578     209     431
6:           labor     546    1098     276     324     332     853

Of course, LDA is unsupervised and the estimated topics need not match the original categories, but we can observe semantic overlap between, for example, the labor category and topic_2.
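If you would rather stay in Python, a rough sketch of the same LDA idea (not part of the original answer) could use scikit-learn's LatentDirichletAllocation; it assumes the same single-column 'text.csv' as in the first answer, and the topic count of 6 is a guess that simply mirrors the six Craigslist categories:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.read_csv('text.csv')  # assumed file name, as in the first answer

# LDA works on raw term counts rather than TF-IDF weights
counts = CountVectorizer(stop_words='english')
dtm = counts.fit_transform(df.text.values)

# n_components=6 mirrors the six Craigslist categories; tune for your data
lda = LatentDirichletAllocation(n_components=6, random_state=1234)
doc_topics = lda.fit_transform(dtm)           # rows: documents, cols: topic weights
topic_assignment = doc_topics.argmax(axis=1)  # most likely topic per title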

You could featurize the titles with word-level embeddings such as gensim.models.word2vec and then cluster with sklearn.cluster.DBSCAN. Without seeing the dataset it is hard to give more specific advice.
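A minimal sketch of that pipeline, assuming the text array from the first answer and the gensim 4.x API (the eps and min_samples values are placeholders to tune):

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import DBSCAN

tokenized = [t.lower().split() for t in text]
w2v = Word2Vec(sentences=tokenized, vector_size=100, min_count=1, seed=1234)

# Represent each title as the mean of its word vectors
def title_vector(tokens):
    vecs = [w2v.wv[w] for w in tokens if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

X = np.vstack([title_vector(t) for t in tokenized])
# Cosine distance tends to suit averaged embeddings; tune eps for your data
labels = DBSCAN(eps=0.3, min_samples=5, metric='cosine').fit_predict(X)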
