Text classification with Gensim LDA

Posted 2024-10-01 15:41:37


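I am posting my question here because there are already some answers on how to use scikit-learn methods with gensim, such as scikit vectorizers with gensim or {a2}, but I have not yet seen a complete pipeline for text classification. I will try to explain my situation.

I want to use the gensim LDA implementation as a step toward text classification. I have a dataset made up of three parts: train (25K), test (25K), and unlabelled data (50K). The idea is to learn the latent topic space from the unlabelled data, and then transform the train and test sets into that learned latent topic space. I currently use the scikit-learn implementation to extract the bag-of-words (BoW) representations. I then convert them into the input format that the LDA implementation requires, and transform the train and test sets into the extracted latent topic space. Finally, I go back to CSR matrices in order to fit a classifier and obtain the accuracy. Although everything looks fine to me, the classifier's accuracy is almost 0%. I have attached part of my code in the hope of getting some additional help, or in case there is something obvious that I am currently missing.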

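# Imports for the snippet below. train_data_unsupervised, train_ds and test_ds
# are arrays defined elsewhere (label in column 0, text in column 2, as used below).
import numpy as np
from gensim import matutils, models
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# BoW representations for the three sets: unlabelled, train and test
vectorizer = CountVectorizer(max_features=3000, stop_words='english')

# Note: these are raw term counts, despite the "tfidf" in the variable names
corpus_tfidf_unsuper = vectorizer.fit_transform(train_data_unsupervised[:, 2])
corpus_tfidf_train = vectorizer.transform(train_ds[:, 2])
corpus_tfidf_test = vectorizer.transform(test_ds[:, 2])

# Transform to gensim-acceptable objects
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
vocab = vectorizer.get_feature_names_out()
id2word_unsuper = dict(enumerate(vocab))
# Sparse2Corpus expects documents as columns by default, hence the transpose
corpus_vect_gensim_unsuper = matutils.Sparse2Corpus(corpus_tfidf_unsuper.T)
corpus_vect_gensim_train = matutils.Sparse2Corpus(corpus_tfidf_train.T)
corpus_vect_gensim_test = matutils.Sparse2Corpus(corpus_tfidf_test.T)

# Fit the model to the unlabelled data
lda = models.LdaModel(corpus_vect_gensim_unsuper,
                      id2word=id2word_unsuper,
                      num_topics=10,
                      passes=1)
# Transform the train and test sets to the latent topic space
docTopicProbMat_train = lda[corpus_vect_gensim_train]
docTopicProbMat_test = lda[corpus_vect_gensim_test]
# Transform to sparse matrices of shape (topics x documents); num_terms fixes
# the number of rows even when low-probability topics are omitted
train_lda = matutils.corpus2csc(docTopicProbMat_train, num_terms=lda.num_topics)
test_lda = matutils.corpus2csc(docTopicProbMat_test, num_terms=lda.num_topics)
# Fit the classifier and print the accuracy
clf = LogisticRegression()
clf.fit(train_lda.transpose(), np.array(train_ds[:, 0]).astype(int))
ypred = clf.predict(test_lda.transpose())
print(accuracy_score(test_ds[:, 0].astype(int), ypred))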
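
One detail of the "transform to csr matrices" step may be worth checking. By default, gensim's `lda[bow]` returns only the `(topic_id, probability)` pairs above a minimum probability, so each document can have a different number of entries, and `corpus2csc` infers the number of rows from the largest topic id it sees unless `num_terms` is given. The sketch below is a plain-Python illustration of building fixed-width feature rows from such output. The helper name `topic_vectors_to_dense` and the example values are hypothetical and not part of gensim.

```python
def topic_vectors_to_dense(doc_topics, num_topics):
    """Turn per-document (topic_id, probability) lists into fixed-width rows.

    Topics that are missing from a document's list are filled with 0.0, so
    every row has exactly num_topics columns.
    """
    rows = []
    for topics in doc_topics:
        row = [0.0] * num_topics
        for topic_id, prob in topics:
            row[topic_id] = float(prob)
        rows.append(row)
    return rows


# Hypothetical model output for three documents and four topics;
# the last document has no topic above the probability threshold.
example = [
    [(0, 0.9), (2, 0.1)],
    [(1, 1.0)],
    [],
]
dense = topic_vectors_to_dense(example, num_topics=4)
print(len(dense), len(dense[0]))
```

Rows built this way always have the same width for the train and test sets, which is what the classifier needs.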

This is my first post, so please let me know if you have any comments.
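
An accuracy close to 0% is far below what guessing the most frequent class would give, which may point to a mismatch in the labels rather than to weak topic features. The sketch below is one possible sanity check, under the assumption that the labels are plain integers like those in column 0 of `train_ds` and `test_ds`. The function name `label_report` and the toy labels are hypothetical.

```python
from collections import Counter


def label_report(y_train, y_test, y_pred):
    """Compare accuracy with a majority-class baseline and check label sets."""
    # Most frequent label in the training set
    majority_label, _ = Counter(y_train).most_common(1)[0]
    # Accuracy obtained by always predicting that label
    baseline = sum(1 for y in y_test if y == majority_label) / len(y_test)
    # Accuracy of the actual predictions
    accuracy = sum(1 for t, p in zip(y_test, y_pred) if t == p) / len(y_test)
    return {
        # Labels that occur in the test set but never in the training set
        "unseen_test_labels": sorted(set(y_test) - set(y_train)),
        "majority_baseline": baseline,
        "accuracy": accuracy,
    }


# Toy labels only: every prediction is wrong on a two-class problem
report = label_report([0, 0, 1, 1, 1], [0, 1, 1, 0], [1, 0, 0, 1])
print(report)
```

If `unseen_test_labels` is not empty, or the accuracy is well below `majority_baseline`, it may be worth checking how the label column is read and encoded before changing the LDA settings.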


0 answers

No answers yet
