Error when processing data in Gensim LDA using a Pandas DataFrame

PMID      Text
12755608  The DNA complexation and condensation properties
12755609  Three proteins namely protective antigen PA edition
12755610  Lecithin retinol acyltransferase LRAT catalyze

data = pd.read_csv("h1.csv", delimiter = "\t")
data = data.dropna(axis=0, subset=['Text'])
data['Index'] = data.index
data["Text"] = data['Text'].str.replace('[^\w\s]','')
data.head()

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token):
            result.append(lemmatize_stemming(token))
    return result

input_data = data.Text.str.strip().str.split('[\W_]+')
print('\n\n tokenized and lemmatized document: ')
print(preprocess(input_data))

1 answer

User

#1 · Posted on 2024-10-16 22:24:24

Try this:

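import gensim
from gensim import corpora

def preprocess(text):
    # Tokenize one document, dropping stop words and tokens of 2 characters or fewer
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2:
            result.append(token)
    return result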

# Apply preprocess to each row's text; `data` is the DataFrame loaded in the question
doc_processed = data['Text'].map(preprocess)

dictionary = corpora.Dictionary(doc_processed)
# Build the document-term matrix (one bag-of-words per document)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_processed]

# LDA model class
Lda = gensim.models.ldamodel.LdaModel
# num_topics is the number of topics to extract;
# passes is the number of training passes over the corpus
ldamodel = Lda(doc_term_matrix, num_topics=2, id2word=dictionary, passes=2)
# Print the 2 trained topics with their 15 highest-weighted words
result = ldamodel.print_topics(num_topics=2, num_words=15)
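
To see what the per-document mapping and the document-term matrix look like, here is a small plain-Python sketch that uses only the standard library. It is not gensim's implementation: `simple_tokens`, `to_bow`, `token2id`, and the two-word `STOPWORDS` set are hypothetical stand-ins, and the three texts are the sample rows from the question. The point is that each document is tokenized on its own and then becomes a list of (token_id, count) pairs.

```python
import re
from collections import Counter

# Sample rows copied from the question's data (PMID -> Text).
docs = {
    12755608: "The DNA complexation and condensation properties",
    12755609: "Three proteins namely protective antigen PA edition",
    12755610: "Lecithin retinol acyltransferase LRAT catalyze",
}

# Tiny stand-in stop-word list (assumption; gensim ships a much larger one).
STOPWORDS = {"the", "and"}

def simple_tokens(text):
    """Lowercase, keep runs of letters, drop stop words and tokens of <= 2 chars."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

# One token list per document, processed row by row rather than all at once.
tokenized = [simple_tokens(text) for text in docs.values()]

# Give each distinct token an integer id in order of first appearance.
token2id = {}
for doc in tokenized:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in tokenized]

for pmid, bow in zip(docs, bow_corpus):
    print(pmid, bow)
```

The code in the question passes the whole column to `preprocess` in a single call, while the tokenizer expects one string at a time. Calling it once per row gives a corpus with one entry per document.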
