sklearn: how to add a custom stop word list from a txt file

Posted 2024-09-27 18:23:48


I have done TF-IDF with Sklearn, but the problem is that I cannot use the built-in English stop words, because my native language is Malay (not English). I need to import a txt file that contains my stop word list.

stopword.txt:

saya
cintakan
awak

tfidf.py:

^{pr2}$
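The original code block was not rendered above; judging from the answer's closing warning, it presumably passed the word list through TfidfVectorizer's vocabulary parameter rather than stop_words. A rough, hypothetical sketch of that kind of setup (the corpus and file handling here are assumptions, not the asker's actual code):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical reconstruction: read the Malay word list from the txt file...
with open('stopword.txt') as f:
    words = [line.strip() for line in f if line.strip()]

# ...but pass it as `vocabulary`, which keeps ONLY these words as features
# instead of excluding them -- the mistake the answer's warning points at
corpus = ['Saya benci awak', 'Saya cinta awak']  # assumed sample documents
vectorizer = TfidfVectorizer(analyzer='word', vocabulary=words)
X = vectorizer.fit_transform(corpus)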

Tags: file, py, txt, list, sklearn, tfidf, native-language, stopwords
1 answer
User
#1 · Posted 2024-09-27 18:23:48

You can load your specific stop word list and pass it as a parameter to TfidfVectorizer. In your case:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['Saya benci awak',
          'Saya cinta awak',
          'Saya x happy awak',
          'Saya geram awak',
          'Saya taubat awak']

# HERE YOU DO YOUR MAGIC: you open your file and load the list of STOP WORDS
stop_words = [unicode(x.strip(), 'utf-8') for x in open('stopword.txt','r').read().split('\n')]

vectorizer = TfidfVectorizer(analyzer='word', stop_words = stop_words)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print dict(zip(vectorizer.get_feature_names(), idf))

Output with the stop words:

{u'benci': 2.09861228866811, u'taubat': 2.09861228866811, u'geram': 2.09861228866811, u'cinta': 2.09861228866811, u'happy': 2.09861228866811}

Output without the stop_words parameter:

{u'benci': 2.09861228866811, u'taubat': 2.09861228866811, u'saya': 1.0, u'awak': 1.0, u'geram': 2.09861228866811, u'cinta': 2.09861228866811, u'happy': 2.09861228866811}

Warning: I wouldn't use the vocabulary param, because it tells the TfidfVectorizer to pay attention only to the words specified in it, and it is usually harder to enumerate every word you want to keep than to list the ones you want to discard. So if you remove the vocabulary param from your example and add the stop_words param with your list, it will work as you expect.
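The snippet above is written for Python 2 (unicode(), print as a statement). A minimal Python 3 sketch of the same approach, assuming the same stopword.txt and corpus, and using get_feature_names_out(), which replaces the older get_feature_names() in recent scikit-learn releases:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Saya benci awak',
          'Saya cinta awak',
          'Saya x happy awak',
          'Saya geram awak',
          'Saya taubat awak']

# Load the custom stop word list, skipping blank lines
with open('stopword.txt', encoding='utf-8') as f:
    stop_words = [line.strip() for line in f if line.strip()]

vectorizer = TfidfVectorizer(analyzer='word', stop_words=stop_words)
X = vectorizer.fit_transform(corpus)

# Map each remaining feature to its idf weight
print(dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_)))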
