I want to remove all German stop words from my data.

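import nltk
import numpy as np
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nltk.download()

stemmer = SnowballStemmer('german', ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Stem every token produced by the standard CountVectorizer analyzer
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

# Note: the only built-in stop-word string scikit-learn accepts is 'english'
stemmed_count_vect = StemmedCountVectorizer(stop_words='german')

text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                             ('tfidf', TfidfTransformer()),
                             ('mnb', MultinomialNB(fit_prior=False))])

# X and y are the documents and labels (not shown in the question)
text_mnb_stemmed = text_mnb_stemmed.fit(X, y)
predicted_mnb_stemmed = text_mnb_stemmed.predict(X)
np.mean(predicted_mnb_stemmed == y)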

1 Answer

User

#1 · Posted 2024-09-27 07:32:09

If you only want to remove German stop words from your documents, you can pass a stop-word list to CountVectorizer:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer


german_stop_words = stopwords.words('german')

vect = CountVectorizer(stop_words = german_stop_words) # Now use this in your pipeline
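To check that the list is actually applied, you can inspect the fitted vocabulary. The following is only a minimal sketch: the two sample sentences are made up, and the short hand-written stop-word list is a stand-in for `stopwords.words('german')` so that the example runs without downloading the NLTK corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative subset only (assumption); in practice use stopwords.words('german')
german_stop_words = ["der", "die", "das", "und", "ist", "ein", "eine", "nicht"]

# Made-up sample documents
docs = [
    "Der Hund ist nicht müde",
    "Die Katze und das Haus",
]

vect = CountVectorizer(stop_words=german_stop_words)
counts = vect.fit_transform(docs)

# vocabulary_ maps each kept token to its column index
vocabulary = sorted(vect.vocabulary_)
print(vocabulary)
```

CountVectorizer lowercases the text before filtering by default, so a lowercase stop-word list also catches capitalized forms such as "Der".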

I'm not sure whether you want to remove the German stop words from the column itself or just exclude them during vectorization.

CountVectorizer is not meant for removing stop words from the column itself; it is used to vectorize the corpus.

If you just want to remove the stop words from a column in your DataFrame, you can simply do the following:

^{pr2}$
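One way such a step might look, as a rough sketch: it assumes a pandas DataFrame with a hypothetical `text` column, splits on whitespace only (no punctuation handling), and again uses a short hand-written stop-word set in place of `stopwords.words('german')`. The helper `remove_stop_words` is made up for this illustration.

```python
import pandas as pd

# Illustrative subset only (assumption); in practice use set(stopwords.words('german'))
german_stop_words = {"der", "die", "das", "und", "ist", "nicht"}

def remove_stop_words(text, stop_words):
    """Drop whitespace-separated words whose lowercase form is a stop word."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)

# Hypothetical data; 'text' is an assumed column name
df = pd.DataFrame({"text": ["Der Hund ist nicht müde", "Die Katze und das Haus"]})

# Write the cleaned text to a new column so the original stays available
df["text_clean"] = df["text"].apply(lambda t: remove_stop_words(t, german_stop_words))
print(df["text_clean"].tolist())
```

Using a set rather than a list keeps each membership check cheap when the column is large.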
