sklearn: how to add a custom stop word list from a txt file

Posted 2024-09-27 18:23:48


I have done TF-IDF with Sklearn, but the problem is that I cannot use the built-in English stop words, because my native language is Malay (not English). I need to import a txt file that contains my stop word list.

stopword.txt:

saya
cintakan
awak

tfidf.py:

^{pr2}$
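The original code block was not rendered above; judging from the answer's closing warning, it presumably passed the word list through TfidfVectorizer's vocabulary parameter rather than stop_words. A rough, hypothetical sketch of that kind of setup (the corpus and file handling here are assumptions, not the asker's actual code):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical reconstruction: read the Malay word list from the txt file...
with open('stopword.txt') as f:
    words = [line.strip() for line in f if line.strip()]

# ...but pass it as `vocabulary`, which keeps ONLY these words as features
# instead of excluding them -- the mistake the answer's warning points at
corpus = ['Saya benci awak', 'Saya cinta awak']  # assumed sample documents
vectorizer = TfidfVectorizer(analyzer='word', vocabulary=words)
X = vectorizer.fit_transform(corpus)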

Tags: file, py, txt, list, sklearn, tfidf, native-language, stopwords
1 answer
User
#1 · Posted 2024-09-27 18:23:48

You can load your specific stop word list and pass it as a parameter to TfidfVectorizer. In your case:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['Saya benci awak',
          'Saya cinta awak',
          'Saya x happy awak',
          'Saya geram awak',
          'Saya taubat awak']

# HERE YOU DO YOUR MAGIC: you open your file and load the list of STOP WORDS
stop_words = [unicode(x.strip(), 'utf-8') for x in open('stopword.txt','r').read().split('\n')]

vectorizer = TfidfVectorizer(analyzer='word', stop_words = stop_words)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print dict(zip(vectorizer.get_feature_names(), idf))

Output with the stop words:

{u'benci': 2.09861228866811, u'taubat': 2.09861228866811, u'geram': 2.09861228866811, u'cinta': 2.09861228866811, u'happy': 2.09861228866811}

Output without the stop_words parameter:

{u'benci': 2.09861228866811, u'taubat': 2.09861228866811, u'saya': 1.0, u'awak': 1.0, u'geram': 2.09861228866811, u'cinta': 2.09861228866811, u'happy': 2.09861228866811}

Warning: I wouldn't use the vocabulary param, because it tells the TfidfVectorizer to pay attention only to the words specified in it, and it is usually harder to enumerate every word you want to keep than to list the ones you want to discard. So if you remove the vocabulary param from your example and add the stop_words param with your list, it will work as you expect.
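The snippet above is written for Python 2 (unicode(), print as a statement). A minimal Python 3 sketch of the same approach, assuming the same stopword.txt and corpus, and using get_feature_names_out(), which replaces the older get_feature_names() in recent scikit-learn releases:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Saya benci awak',
          'Saya cinta awak',
          'Saya x happy awak',
          'Saya geram awak',
          'Saya taubat awak']

# Load the custom stop word list, skipping blank lines
with open('stopword.txt', encoding='utf-8') as f:
    stop_words = [line.strip() for line in f if line.strip()]

vectorizer = TfidfVectorizer(analyzer='word', stop_words=stop_words)
X = vectorizer.fit_transform(corpus)

# Map each remaining feature to its idf weight
print(dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_)))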
