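I hope you can bear with my question. I have tried to include a lot of detail, so if anything is unclear, please let me know.
LDA topic modeling aims to derive a set of topics from a given set of documents, so each document can belong to several topics.
We can also evaluate the model we create. One way to do this is with a classification method such as SVM. My goal is to evaluate the model I created.
I came across two pieces of code for generating an LDA model.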
1.
# generate LDA model
id2word = corpora.Dictionary(texts)
# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]
# Trains the LDA models.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10,
                        update_every=1, chunksize=10000, passes=1,
                        gamma_threshold=0.00, minimum_probability=0.00)
With this approach I cannot use fit_transform.
2.
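[the code for the second approach was not preserved]
In the first approach, the LDA model has no fit_transform method. I do not know why, because I do not understand the difference between the two approaches.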
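For context, here is my understanding of the difference, which may be incomplete: a scikit-learn-style fit_transform returns one fixed-length row per document, whereas indexing a gensim LdaModel with a bag-of-words document (as in lda[mm] further down) yields a list of (topic_id, probability) pairs per document. The sketch below uses plain Python with made-up numbers to show how such pairs could be turned into dense rows. The helper name to_dense_rows and the sample data are hypothetical and are not part of gensim or scikit-learn.

```python
# Hypothetical helper (not part of gensim or scikit-learn): turn sparse
# (topic_id, probability) pairs into fixed-length rows, one per document.
def to_dense_rows(doc_topics, num_topics):
    rows = []
    for pairs in doc_topics:
        row = [0.0] * num_topics          # one slot per topic
        for topic_id, prob in pairs:
            row[topic_id] = prob          # fill in the topics that are present
        rows.append(row)
    return rows


# Made-up stand-in for what iterating over lda[mm] would yield.
example_doc_topics = [
    [(0, 0.7), (2, 0.3)],
    [(1, 1.0)],
]

dense = to_dense_rows(example_doc_topics, num_topics=3)
print(dense)  # [[0.7, 0.0, 0.3], [0.0, 1.0, 0.0]]
```

Rows of this shape are what a classifier such as LinearSVC would normally take as features, assuming the topic distribution is the intended input.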
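Anyway, I need to pass the LDA model created with the first approach to an SVM. (I included both approaches because I know the second one runs without error, probably because of fit_transform, but for some reasons I cannot use it.) This is my final code: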
import os
from gensim.models import ldamodel
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = {'a'}
lines = []
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
lisOfFiles = [x[2] for x in os.walk("data")]
fullPath = [x[0] for x in os.walk("data")]
for j in lisOfFiles[2]:
    with open(os.path.join(fullPath[2], j)) as f:
        a = f.read()
        lines.append(a)
for j in lisOfFiles[3]:
    with open(os.path.join(fullPath[3], j)) as f:
        a = f.read()
        lines.append(a)
for j in lisOfFiles[4]:
    with open(os.path.join(fullPath[4], j)) as f:
        a = f.read()
        lines.append(a)
# compile sample documents into a list
doc_set = lines
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)
# generate LDA model
id2word = corpora.Dictionary(texts)
# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]
# Trains the LDA models.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10,
                        update_every=1, chunksize=10000, passes=1,
                        gamma_threshold=0.00, minimum_probability=0.00)
# Assigns the topics to the documents in corpus
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
#creating the labels
lda_corpus = lda[mm]
label_y=[]
for i in lda_corpus:
    new_y = []
    for l in i:
        sorted_labels = sorted(i, key=lambda z: z[0], reverse=True)
        if l[1] > 0.005:
            new_y.append(l[0])
    label_y.append(new_y)
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_df=2, min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(lda, label_y)
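As you can see in the code, for some reasons I used the first approach, but the last line raises an error (object of type 'int' has no len()). It seems the pipeline cannot accept an lda created this way (I suspect this is because I am not using fit_transform here).
How can I fix this error in the code?
Thank you very much for your patience and help.
This is the full stack trace: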
/home/saria/tfwithpython3.6/bin/python /home/saria/PycharmProjects/TfidfLDA/test4.py
Using TensorFlow backend.
Traceback (most recent call last):
  File "/home/saria/PycharmProjects/TfidfLDA/test4.py", line 92, in <module>
    classifier.fit(lda, label_y)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 268, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 234, in _fit
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 839, in fit_transform
    self.fixed_vocabulary_)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 760, in _count_vocab
    for doc in raw_documents:
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 1054, in __getitem__
    return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 922, in get_document_topics
    gamma, phis = self.inference([bow], collect_sstats=per_word_topics)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 429, in inference
    if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
TypeError: object of type 'int' has no len()
Process finished with exit code 1
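Reading the traceback, the CountVectorizer step runs `for doc in raw_documents:` over whatever is passed as X, which here is the lda object itself, and the very next frame is LdaModel.__getitem__. A plausible reading is that Python's iteration falls back to calling __getitem__ with 0, 1, 2, ... when an object defines no __iter__, so the model receives an int where it expects a bag-of-words list. The sketch below reproduces that mechanism in plain Python. FakeModel is a hypothetical stand-in, not gensim code.

```python
class FakeModel:
    """Hypothetical stand-in for a model that only defines __getitem__."""

    def __getitem__(self, bow):
        # Like the check in the traceback, len() is called on the argument.
        if len(bow) > 0:
            return [(0, 1.0)]
        return []


model = FakeModel()

# Indexing with a bag-of-words list works as intended.
ok = model[[(0, 1)]]

# Iterating has no __iter__ to use, so Python calls model[0], model[1], ...
try:
    for doc in model:
        pass
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)  # object of type 'int' has no len()
```

If this reading is right, the first pipeline step would need an iterable of raw document strings (for example doc_set) rather than the model. That is an assumption on my part and not something the traceback confirms on its own.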