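So, I've been messing around with gensim, and I got it to print the top 5 topics and the popular nouns associated with each topic (this was done using the example here: Topic Distribution and clustering using LDA). My case has 51 documents. I'm having trouble getting my last two clusters to work, because I keep getting a "list index out of range" error. I have no idea what I could change to fix my clusters. The approach I tried with if and else conditions gave an incorrect first cluster (you'll find it commented out). Where exactly am I going wrong?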
from gensim import corpora, models, similarities
from itertools import chain
# list of tokenised nouns from the noun documents
nounTokens = []
for index, row in df_Data.iterrows():
    nounTokens.append(row['Noun Tokens'])
# create a dictionary using noun Tokens
id2word = corpora.Dictionary(nounTokens)
# creates the bag of word corpus
mm = [id2word.doc2bow(noun) for noun in nounTokens]
# trains lda models
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=5, update_every=1, chunksize=10000, passes=1)
# prints the topics of the corpus
for topics in lda.print_topics():
    print(topics)
print()
lda_corpus = lda[mm]
# collect the topic probabilities of every document into one flat list
scores = list(chain(*[[score for topic_id, score in topic]
                      for topic in [doc for doc in lda_corpus]]))
# use the mean of all topic probabilities as the clustering threshold
threshold = sum(scores)/len(scores)
print(threshold)
print()
# cluster1 = []
# cluster2 = []
# cluster3 = []
# for i,j in zip(lda_corpus, noun_Docs):
#     if len(i) > 0:
#         if i[0][1] > threshold:
#             cluster1.append(j)
#     elif len(i)>1:
#         if i[1][1] > threshold:
#             cluster2.append(j)
#     elif len(i) > 2:
#         if i[2][1] > threshold:
#             cluster3.append(j)
cluster1 = [j for i, j in zip(lda_corpus, noun_Docs) if i[0][1] > threshold]
cluster2 = [j for i, j in zip(lda_corpus, noun_Docs) if i[1][1] > threshold]
cluster3 = [j for i, j in zip(lda_corpus, noun_Docs) if i[2][1] > threshold]
# for i,j in zip(lda_corpus, noun_Docs):
#     print(i)
print(cluster1)
# print(cluster2)
# print(cluster3)
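
One likely cause, offered as a sketch rather than a confirmed diagnosis: iterating `lda[mm]` yields, for each document, a sparse list of `(topic_id, probability)` pairs, and gensim drops topics whose probability falls below the model's `minimum_probability` (0.01 by default). A document dominated by one topic may therefore carry a single pair, so `i[1]` or `i[2]` raises "list index out of range", and `i[0]` is not guaranteed to be topic 0. The example below looks topics up by id instead of by position. It is plain Python so that it runs without gensim; `cluster_by_topic` and the sample data are hypothetical names made up for illustration.

```python
def cluster_by_topic(doc_topics, docs, topic_id, threshold):
    """Return the docs whose probability for `topic_id` exceeds `threshold`.

    doc_topics: iterable of sparse [(topic_id, probability), ...] lists,
                the shape gensim yields when iterating lda[corpus].
    """
    selected = []
    for topics, doc in zip(doc_topics, docs):
        # A topic missing from the sparse list counts as probability 0.0
        # instead of raising IndexError.
        prob = dict(topics).get(topic_id, 0.0)
        if prob > threshold:
            selected.append(doc)
    return selected


# Hypothetical sample data standing in for lda_corpus and noun_Docs.
sample_topics = [
    [(0, 0.95)],                          # only topic 0 survived the cutoff
    [(1, 0.60), (2, 0.38)],               # topic 0 absent: position 0 is topic 1
    [(0, 0.30), (1, 0.30), (2, 0.40)],
]
sample_docs = ["doc_a", "doc_b", "doc_c"]
sample_threshold = 0.35

cluster1 = cluster_by_topic(sample_topics, sample_docs, 0, sample_threshold)
cluster2 = cluster_by_topic(sample_topics, sample_docs, 1, sample_threshold)
cluster3 = cluster_by_topic(sample_topics, sample_docs, 2, sample_threshold)
print(cluster1)  # ['doc_a']
print(cluster2)  # ['doc_b']
print(cluster3)  # ['doc_b', 'doc_c']
```

With the real model the equivalent call would be `cluster_by_topic(lda_corpus, noun_Docs, 0, threshold)`. Another option is `lda.get_document_topics(bow, minimum_probability=0.0)`, which should return a fuller per-document list, though whether every topic is included may depend on the gensim version.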