<p>One alternative is topic modeling, for example a Latent Dirichlet Allocation (LDA) model.</p>
<p>A minimal <code>R</code> example looks like this:</p>
<pre><code>library(topicmodels)
library(tidytext)
library(data.table)
library(tm)
# Reading Craigslist job titles
jobs <- fread('https://raw.githubusercontent.com/h2oai/app-ask-craig/master/workflow/data/craigslistJobTitles.csv')
jobs[, doc_id := 1:.N]
# Building a text corpus
dtm <- DocumentTermMatrix(
  Corpus(DataframeSource(jobs[, .(doc_id, text = jobtitle)])),
  control = list(removePunctuation = TRUE,
                 removeNumbers = TRUE,
                 stopwords = TRUE,
                 stemming = TRUE,
                 wordLengths = c(1, Inf)))
# Let's set number of topics to be equal to number of categories and fit LDA model
n_topics <- length(unique(jobs[, category]))
lda <- LDA(dtm, k = n_topics, method = 'Gibbs', control = list(seed = 1234, iter = 1e4))
# Kind of confusion matrix to inspect relevance
docs <- setDT(tidy(lda, matrix = 'gamma'))[, document := as.numeric(document)]
docs <- docs[, .(topic = paste0('topic_', .SD[gamma == max(gamma)]$topic)), by = .(doc_id = document)]
dcast(merge(jobs, docs)[, .N, by = .(category, topic)], category ~ topic, value.var = 'N')
</code></pre>
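<p>To judge what a topic is about, it usually helps to look at its most probable terms. The sketch below is only an illustration: it fits a tiny two-topic model on a handful of made-up job titles (the <code>titles</code> vector is hypothetical, not taken from the Craigslist data) and prints the top terms per topic with <code>terms()</code> from <code>topicmodels</code>. On the model fitted above, something like <code>terms(lda, 10)</code> should give the same kind of overview.</p>
<pre><code>library(tm)
library(topicmodels)

# Hypothetical toy job titles, invented for illustration only
titles <- c('line cook restaurant',
            'restaurant server bartender',
            'cook kitchen restaurant',
            'warehouse driver forklift',
            'forklift operator warehouse',
            'driver delivery warehouse')

# Document-term matrix with tm's default tokenisation
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(titles)))

# Two topics, default (VEM) estimation, fixed seed for reproducibility
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# Matrix with one column per topic and the 3 most probable terms in each
top_terms <- terms(toy_lda, 3)
print(top_terms)
</code></pre>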
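<p>The good news about the Craigslist dataset is that every job title comes with a label (its category), so you can build a kind of confusion matrix like this:</p>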
<pre><code>          category topic_1 topic_2 topic_3 topic_4 topic_5 topic_6
1:      accounting     357     113    1091     194     248     241
2:  administrative     595     216    1550     260     372     526
3: customerservice    1142     458     331     329     320     567
4:       education     296     263     251     280    1638     578
5:    foodbeverage     325     369     287    1578     209     431
6:           labor     546    1098     276     324     332     853
</code></pre>
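<p>Of course, LDA is unsupervised, so the estimated topics are not expected to match the original categories. Still, we can observe a semantic overlap between, for example, the <code>labor</code> category and <code>topic_2</code>: 1098 of the <code>labor</code> titles fall into that topic, more than into any other.</p>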
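<p>One way to make that overlap easier to read is to turn the counts into per-category shares and pick the most frequent topic for each category. The following is a small sketch in base R that simply re-enters the counts from the table above; the names <code>counts</code>, <code>shares</code> and <code>dominant</code> are introduced here for illustration.</p>
<pre><code># Counts copied from the category-by-topic table above
counts <- matrix(
  c( 357,  113, 1091,  194,  248, 241,
     595,  216, 1550,  260,  372, 526,
    1142,  458,  331,  329,  320, 567,
     296,  263,  251,  280, 1638, 578,
     325,  369,  287, 1578,  209, 431,
     546, 1098,  276,  324,  332, 853),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    category = c('accounting', 'administrative', 'customerservice',
                 'education', 'foodbeverage', 'labor'),
    topic = paste0('topic_', 1:6)))

# Share of each category's titles assigned to each topic (rows sum to 1)
shares <- prop.table(counts, margin = 1)

# Most frequent topic for each category
dominant <- colnames(counts)[apply(counts, 1, which.max)]
names(dominant) <- rownames(counts)

print(round(shares, 2))
print(dominant)
</code></pre>
<p>With these counts, <code>accounting</code> and <code>administrative</code> both end up in <code>topic_3</code>, while no category has <code>topic_6</code> as its most frequent topic, which is consistent with the topics not being a one-to-one copy of the labels.</p>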