How can I cluster a large text corpus (such as job listings) using Python or R?


1 answer

Anonymous, 1 day ago

One alternative is topic modeling, for example a latent Dirichlet allocation (LDA) model. A minimal <code>R</code> example is shown below:

<pre><code>library(topicmodels)
library(tidytext)
library(data.table)
library(tm)

# Reading Craigslist job titles
jobs <- fread('https://raw.githubusercontent.com/h2oai/app-ask-craig/master/workflow/data/craigslistJobTitles.csv')
jobs[, doc_id := 1:.N]

# Building a text corpus
dtm <- DocumentTermMatrix(Corpus(DataframeSource(jobs[, .(doc_id, text = jobtitle)])),
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE,
                                         stemming = TRUE,
                                         wordLengths = c(1, Inf)))

# Let's set number of topics to be equal to number of categories and fit LDA model
n_topics <- length(unique(jobs[, category]))
lda <- LDA(dtm, k = n_topics, method = 'Gibbs', control = list(seed = 1234, iter = 1e4))

# Kind of confusion matrix to inspect relevance
docs <- setDT(tidy(lda, matrix = 'gamma'))[, document := as.numeric(document)]
docs <- docs[, .(topic = paste0('topic_', .SD[gamma == max(gamma)]$topic)), by = .(doc_id = document)]
dcast(merge(jobs, docs)[, .N, by = .(category, topic)], category ~ topic, value.var = 'N')
</code></pre>

The good thing about the Craigslist dataset is that every job title comes with a label (category), so you can build a kind of confusion matrix like this:

<pre><code>          category topic_1 topic_2 topic_3 topic_4 topic_5 topic_6
1:      accounting     357     113    1091     194     248     241
2:  administrative     595     216    1550     260     372     526
3: customerservice    1142     458     331     329     320     567
4:       education     296     263     251     280    1638     578
5:    foodbeverage     325     369     287    1578     209     431
6:           labor     546    1098     276     324     332     853
</code></pre>

Of course, LDA is unsupervised, so the estimated topics are not expected to match the original categories. Still, there is a visible semantic overlap between some of them: for example, 1098 of the <code>labor</code> titles land in <code>topic_2</code>, the largest count in that row.
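Since the question also asks about Python, here is a minimal sketch of the last step only: pick each document's most probable topic and count documents per (category, topic) pair. It uses the standard library alone. The labels and topic weights are toy values invented for illustration, and `dominant_topic` and `cross_tabulate` are hypothetical helper names; in practice the weights would come from a fitted LDA model.

```python
from collections import Counter


def dominant_topic(gamma_row):
    """Return the label of the topic with the largest weight (1-based).

    Ties go to the lowest topic index, whereas the R snippet keeps
    every tied topic.
    """
    best = max(range(len(gamma_row)), key=lambda i: gamma_row[i])
    return f"topic_{best + 1}"


def cross_tabulate(categories, gammas):
    """Count documents for each (category, dominant topic) pair."""
    counts = Counter()
    for category, gamma_row in zip(categories, gammas):
        counts[(category, dominant_topic(gamma_row))] += 1
    return counts


# Toy data (invented): one category label and one row of per-topic
# weights for each document.
categories = ["labor", "labor", "education", "education", "labor"]
gammas = [
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
    [0.3, 0.2, 0.5],
    [0.5, 0.3, 0.2],
]

table = cross_tabulate(categories, gammas)
for (category, topic), n in sorted(table.items()):
    print(category, topic, n)
```

A category whose counts pile up in a single topic, as `labor` does in `topic_2` above, is the kind of overlap this table is meant to reveal.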
