Fitting classifier: object of type 'int' has no len()

Posted 2024-10-01 04:45:25


I hope my question is acceptable; if anything is unclear, please tell me. There are many details I could add, but I am not sure which ones are needed, so just ask.

We have an LDA topic model, whose purpose is to produce a set of topics from a given collection of documents, so each document can belong to several topics.

We can also evaluate the model we created; one way is to use a classification method such as an SVM. My goal is to evaluate the model I created.

I came across two pieces of code for generating the LDA model.

1.

# generate LDA model
id2word = corpora.Dictionary(texts)

# Creates the Bag of Words corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA model.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10,
                        update_every=1, chunksize=10000, passes=1,
                        gamma_threshold=0.00, minimum_probability=0.00)

With this approach I cannot use fit_transform.

2.

(The second code snippet was not preserved in the post.)

With the first approach, the LDA model has no fit_transform method, and I don't know why, because I don't understand the difference between the two approaches.
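For reference, here is a minimal sketch of what the second approach presumably looked like, assuming it used scikit-learn's LatentDirichletAllocation; unlike gensim's LdaModel, the scikit-learn class is an estimator and therefore does expose fit_transform (the variable names here are my own):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Build a document-term matrix from the raw document strings
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(doc_set)

# fit_transform returns the per-document topic distribution, shape (n_docs, 10)
lda_sk = LatentDirichletAllocation(n_components=10)
doc_topics = lda_sk.fit_transform(X)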

In any case, I need to pass the LDA model created with the first approach to the SVM (the reason I show both approaches here is that I know the second one raises no error, probably because of fit_transform, but for some reason I cannot use it). Here is my final code:

import os
from gensim.models import ldamodel
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC


tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = {'a'}

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

# Walk the data directory: file names and their parent paths
lines = []
lisOfFiles = [x[2] for x in os.walk("data")]
fullPath = [x[0] for x in os.walk("data")]
# Read every file from the three data subdirectories into `lines`
for d in (2, 3, 4):
    for j in lisOfFiles[d]:
        with open(os.path.join(fullPath[d], j)) as f:
            lines.append(f.read())

# compile sample documents into a list
doc_set = lines
# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

# generate LDA model
id2word = corpora.Dictionary(texts)

# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA model.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10,
                        update_every=1, chunksize=10000, passes=1,
                        gamma_threshold=0.00, minimum_probability=0.00)

dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
# (this rebuilds the same dictionary and corpus as id2word and mm above)
corpus = [dictionary.doc2bow(text) for text in texts]


# creating the labels: assign the topics to the documents in corpus and,
# for each document, keep every topic id whose weight exceeds 0.005
# (sorted_labels is computed but never used)
lda_corpus = lda[mm]
label_y = []
for i in lda_corpus:
    new_y = []
    for l in i:
        sorted_labels = sorted(i, key=lambda z: z[0], reverse=True)
        if l[1] > 0.005:
            new_y.append(l[0])
    label_y.append(new_y)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_df=2, min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(lda, label_y)

As you can see, for the reasons above I used the first approach, but the last line raises an error: object of type 'int' has no len(). The pipeline does not seem to accept an lda model created this way (I suspect it is because I did not use fit_transform). How can I fix this error in the code?
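Reading the traceback, my understanding is that CountVectorizer iterates over whatever it receives as X; iterating the LdaModel ends up calling lda[i] with an integer i, which gensim then tries to treat as a bag-of-words document, so len(doc) fails on the int. A minimal sketch of a call that would at least type-check, assuming the intent is to classify the raw documents (MultiLabelBinarizer is my addition, since OneVsRestClassifier needs a binary indicator matrix for multi-label y):

from sklearn.preprocessing import MultiLabelBinarizer

# CountVectorizer expects raw text documents, not a gensim LdaModel,
# so pass doc_set to the pipeline; binarize the list-of-topic-ids labels
# into the indicator matrix that OneVsRestClassifier expects.
y = MultiLabelBinarizer().fit_transform(label_y)
classifier.fit(doc_set, y)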

Thank you very much for your patience and help.

Here is the full stack trace:

/home/saria/tfwithpython3.6/bin/python /home/saria/PycharmProjects/TfidfLDA/test4.py
Using TensorFlow backend.
Traceback (most recent call last):
  File "/home/saria/PycharmProjects/TfidfLDA/test4.py", line 92, in <module>
    classifier.fit(lda, label_y)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 268, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 234, in _fit
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 839, in fit_transform
    self.fixed_vocabulary_)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 760, in _count_vocab
    for doc in raw_documents:
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 1054, in __getitem__
    return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 922, in get_document_topics
    gamma, phis = self.inference([bow], collect_sstats=per_word_topics)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 429, in inference
    if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
TypeError: object of type 'int' has no len()

Process finished with exit code 1
