我正在使用Gensim LDA进行主题建模。我正在使用熊猫数据帧进行处理。但我犯了个错误
TypeError: decoding to str: need a bytes-like object, Series found
我只需要使用Pandas处理数据,输入数据就像(一行)
PMID Text
12755608 The DNA complexation and condensation properties
12755609 Three proteins namely protective antigen PA edition
12755610 Lecithin retinol acyltransferase LRAT catalyze
我的密码是
data = pd.read_csv("h1.csv", delimiter = "\t")
data = data.dropna(axis=0, subset=['Text'])
data['Index'] = data.index
data["Text"] = data['Text'].str.replace('[^\w\s]','')
data.head()
def lemmatize_stemming(text):
return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
def preprocess(text):
result = []
for token in gensim.utils.simple_preprocess(text):
if token not in gensim.parsing.preprocessing.STOPWORDS and len(token):
result.append(lemmatize_stemming(token))
return result
input_data = data.Text.str.strip().str.split('[\W_]+')
print('\n\n tokenized and lemmatized document: ')
print(preprocess(input_data))
试试这个
相关问题 更多 >
编程相关推荐