Error when processing Pandas DataFrame data with Gensim LDA

Posted 2024-10-16 22:24:24

I am using Gensim LDA for topic modeling, and I am processing the data with a Pandas DataFrame. However, I get this error:

TypeError: decoding to str: need a bytes-like object, Series found

I need to process the data using Pandas only. The input data looks like this (one document per row):

 PMID           Text
12755608    The DNA complexation and condensation properties
12755609    Three proteins namely protective antigen PA edition
12755610    Lecithin retinol acyltransferase LRAT catalyze

My code is:

data = pd.read_csv("h1.csv", delimiter = "\t")
data = data.dropna(axis=0, subset=['Text'])
data['Index'] = data.index
data["Text"] = data['Text'].str.replace(r'[^\w\s]', '', regex=True)
data.head()

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token):
            result.append(lemmatize_stemming(token))
    return result


input_data = data.Text.str.strip().str.split('[\W_]+')
print('\n\n tokenized and lemmatized document: ')
print(preprocess(input_data))

1 answer
User
Answer 1 · posted 2024-10-16 22:24:24

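Try this. The error occurs because preprocess(input_data) passes the whole Series to gensim.utils.simple_preprocess, which expects a single string. Applying the function to each row with map avoids that: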

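import gensim
from gensim import corpora

def preprocess(text):
    # Tokenize one document (a single string) and drop stopwords and short tokens
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2:
            result.append(token)
    return result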

# data is the DataFrame from the question; map calls preprocess once per row
doc_processed = data['Text'].map(preprocess)
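
For context, the TypeError from the question can be reproduced without Gensim. The sketch below assumes, based on the error message, that the tokenizer ends up calling str(value, encoding) on any input that is not already a str. The helper to_text and the plain list standing in for a Series are hypothetical and only illustrate the idea.

```python
def to_text(value, encoding="utf8"):
    """Hypothetical stand-in for a text-coercion helper that accepts str or bytes."""
    if isinstance(value, str):
        return value
    # str(obj, encoding) only works for bytes-like objects
    return str(value, encoding)

# A plain list stands in for the 'Text' column (assumption for illustration)
rows = ["The DNA complexation", "Three proteins namely"]

# Passing the whole container fails, much like passing a whole Series
try:
    to_text(rows)
    whole_column_error = None
except TypeError as exc:
    whole_column_error = str(exc)

# Passing one row at a time works, which is what Series.map does
per_row = [to_text(row) for row in rows]

print(whole_column_error)
print(per_row)
```

The takeaway is that the function itself is fine as long as it receives one string per call.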

dictionary = corpora.Dictionary(doc_processed)
# Prepare the document-term matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_processed]
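
To make the document-term matrix step more concrete, here is a pure-Python sketch of the bag-of-words idea: each distinct token gets an integer id, and each document becomes a list of (token_id, count) pairs. It only imitates the shape of what Dictionary.doc2bow returns. The names build_vocab and to_bow are made up for this illustration, and Gensim's own id assignment may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Turn a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Two tiny tokenized documents (made-up sample data)
docs = [["dna", "complexation", "dna"], ["proteins", "antigen"]]
vocab = build_vocab(docs)
bow = [to_bow(doc, vocab) for doc in docs]
print(bow)
```

Each inner list plays the role of one row of the document-term matrix, stored sparsely.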

# LDA model class
Lda = gensim.models.ldamodel.LdaModel
# num_topics is the number of topics to extract;
# passes is the number of training passes over the corpus
ldamodel = Lda(doc_term_matrix, num_topics=2, id2word=dictionary, passes=2)
result = ldamodel.print_topics(num_topics=2, num_words=15)
