Specifying "good" vs. "bad" in an LDA model (using gensim in Python)

import csv

# Third-party packages: nltk, stop_words and gensim must be installed
# manually on a fresh Python install.
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from stop_words import get_stop_words
from gensim import corpora
import gensim

# Create a list with the text blocks in rows, based on the CSV file.
# (Named "rows" rather than "list" to avoid shadowing the built-in.)
rows = []
with open('Testfile.csv', 'r', newline='') as csvfile:
    emails = csv.reader(csvfile)
    for row in emails:
        rows.append(row)

# Create doc_set from the first column of each row.
doc_set = []
for row in rows:
    doc_set.append(row[0])

tokenizer = RegexpTokenizer(r'\w+')

# Create English stop words list.
en_stop = get_stop_words('en')

# Create p_stemmer of class PorterStemmer.
p_stemmer = PorterStemmer()

# List for the tokenized documents.
texts = []

# Loop through the document list.
for doc in doc_set:
    # Clean and tokenize the document string.
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # Remove stop words from the tokens.
    stopped_tokens = [t for t in tokens if t not in en_stop]

    # Stem the tokens.
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]

    # Add the tokens to the list.
    texts.append(stemmed_tokens)

# Turn the tokenized documents into an id <-> term dictionary.
dictionary = corpora.Dictionary(texts)

# Convert the tokenized documents into a document-term matrix.
corpus = [dictionary.doc2bow(text) for text in texts]

# Generate the LDA model.
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=10)
print(ldamodel.print_topics(num_topics=5, num_words=5))

# Map topics to documents and write one row per document.
doc_lda = ldamodel[corpus]
with open('doc_lda.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in doc_lda:
        writer.writerow(row)

1 Answer

User

#1 · Posted 2024-05-20 21:28:02

First, some suggestions on your specific questions:

a) incorporate the information of “crisis” vs “non-crisis”

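To do this with a standard LDA model, I would probably look at the mutual information between the document-topic proportions and whether each document falls in a crisis or non-crisis period.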
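A minimal sketch of that mutual-information check, using only the standard library. It assumes each document's topic proportions are available as a dense list (in gensim these could come from `ldamodel.get_document_topics(bow, minimum_probability=0)`), and that a binary crisis label exists per document. The function names, the 0.5 threshold for "topic is present", and the toy data are all illustrative assumptions, not part of the original answer.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (px[x] * py[y])
        mi += p_xy * math.log2(count * n / (px[x] * py[y]))
    return mi


def topic_label_mi(doc_topics, labels, threshold=0.5):
    """Score each topic by the MI between 'proportion >= threshold' and the label."""
    num_topics = len(doc_topics[0])
    scores = {}
    for k in range(num_topics):
        present = [props[k] >= threshold for props in doc_topics]
        scores[k] = mutual_information(present, labels)
    return scores


# Hypothetical toy data: one row of topic proportions per document,
# labels mark crisis (1) vs. non-crisis (0) documents.
doc_topics = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.8],
    [0.1, 0.9],
]
labels = [1, 1, 0, 0]
scores = topic_label_mi(doc_topics, labels)
print(scores)
```

Topics with a high score are the ones whose presence tells you most about the crisis label. Binarizing the proportions is only one way to discretize them; finer bins are equally possible.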

b) to automatically choose the optimal number of topics / words to optimize the predictive power of the model?

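If you want to do this properly, try several settings for the number of topics, and for each one use the topic model to predict conflict/non-conflict for held-out documents (documents that were not included when fitting the topic model).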
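The held-out procedure could be sketched as follows. This is an outline under stated assumptions, not a definitive implementation: `fit_topics` is a hypothetical callback that trains a topic model with `k` topics on the training documents and returns a function mapping a document to its dense topic proportions (for example, a wrapper around gensim's `LdaModel`). The nearest-centroid classifier is a simple stand-in chosen here for brevity; any classifier could take its place.

```python
import random


def split_holdout(n_items, holdout_fraction=0.2, seed=0):
    """Shuffle indices and split them into training and held-out index lists."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    n_holdout = max(1, int(n_items * holdout_fraction))
    return idx[n_holdout:], idx[:n_holdout]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Classify each held-out vector by its nearest class centroid; return accuracy."""
    centroids = {
        label: centroid([x for x, y in zip(train_x, train_y) if y == label])
        for label in set(train_y)
    }

    def predict(x):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )

    hits = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def select_num_topics(docs, labels, candidates, fit_topics,
                      holdout_fraction=0.2, seed=0):
    """Return (best_k, scores), where scores[k] is held-out accuracy for k topics.

    fit_topics(train_docs, k) is a hypothetical callback; it must return a
    function that maps a single document to its list of topic proportions.
    """
    train_idx, test_idx = split_holdout(len(docs), holdout_fraction, seed)
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    scores = {}
    for k in candidates:
        # The topic model only ever sees the training documents.
        infer = fit_topics([docs[i] for i in train_idx], k)
        train_x = [infer(docs[i]) for i in train_idx]
        test_x = [infer(docs[i]) for i in test_idx]
        scores[k] = nearest_centroid_accuracy(train_x, train_y, test_x, test_y)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

A single split gives a noisy estimate; repeating this with several seeds, or using cross-validation, would make the comparison between topic counts more reliable.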

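There are many topic model variants that effectively choose the number of topics themselves ("non-parametric" models). It turns out that the Mallet implementation with hyperparameter optimization effectively does the same, so I would suggest using that: provide a large number of topics, and hyperparameter optimization will leave many of them with very few assigned words. Those topics are just noise.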
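One way to spot such near-empty topics after training is to estimate each topic's share of all tokens and flag the ones below a cutoff. The sketch below assumes dense per-document topic proportions and document lengths are available. The function name `flag_sparse_topics`, the 1% cutoff, and the toy numbers are illustrative assumptions. It does not call Mallet itself.

```python
def flag_sparse_topics(doc_topics, doc_lengths, min_share=0.01):
    """Estimate each topic's share of all tokens and flag topics below min_share.

    doc_topics:  list of per-document topic proportion lists (dense)
    doc_lengths: number of tokens in each document
    Returns (shares, sparse): shares[k] is topic k's estimated token share,
    sparse is the list of topic ids whose share is below min_share.
    """
    num_topics = len(doc_topics[0])
    mass = [0.0] * num_topics
    for props, length in zip(doc_topics, doc_lengths):
        for k, p in enumerate(props):
            # Expected number of tokens in this document assigned to topic k.
            mass[k] += p * length
    total = sum(mass)
    shares = [m / total for m in mass]
    sparse = [k for k, s in enumerate(shares) if s < min_share]
    return shares, sparse


# Hypothetical toy data: three topics, two documents.
doc_topics = [
    [0.6, 0.4, 0.0],
    [0.5, 0.495, 0.005],
]
doc_lengths = [100, 200]
shares, sparse = flag_sparse_topics(doc_topics, doc_lengths)
print(sparse)
```

The flagged topics can then be ignored when interpreting the model or when building features for prediction.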

And some general comments:

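There are many topic model variants, in particular some that incorporate time. These may be a good choice, since they handle changes in topics over time better than standard LDA does, although standard LDA is a good starting point.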

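One model I particularly like uses a Pitman-Yor word prior (which matches Zipf-distributed words better than a Dirichlet prior), accounts for burstiness within topics, and gives clues about junk topics: https://github.com/wbuntine/topic-models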
