from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Saya benci awak',
          'Saya cinta awak',
          'Saya x happy awak',
          'Saya geram awak',
          'Saya taubat awak']

# Load the stop words from a file, one word per line, skipping blank lines.
with open('stopword.txt', encoding='utf-8') as f:
    stop_words = [line.strip() for line in f if line.strip()]

vectorizer = TfidfVectorizer(analyzer='word', stop_words=stop_words)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
print(dict(zip(vectorizer.get_feature_names_out(), idf)))
Warning: I wouldn't use the vocabulary parameter, because it tells TfidfVectorizer to pay attention only to the words listed in it, and it is usually harder to know every word you want to take into account than to list the ones you want to dismiss. So if you remove the vocabulary parameter from your example and add the stop_words parameter with your list, it will work as you expect.
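You can load a specific list of stop words and pass it as a parameter to TfidfVectorizer, as in the example above for your case. Run with the stop_words parameter, the printed dictionary omits every word in the list; run without it, those words appear alongside the others with their IDF weights.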
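The loading step itself could be wrapped in a small helper. This is only a sketch: load_stop_words is a hypothetical name, and it assumes one stop word per line. It lowercases the words because TfidfVectorizer lowercases text by default before it filters stop words, so an entry like 'Saya' would otherwise never match.

```python
import io

def load_stop_words(lines):
    """Return lowercase, de-duplicated stop words from an iterable of lines."""
    words = []
    seen = set()
    for line in lines:
        word = line.strip().lower()
        # Skip blank lines and repeated entries.
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words

# An in-memory stand-in for stopword.txt so the sketch runs on its own.
# With a real file: with open('stopword.txt', encoding='utf-8') as f: ...
sample_file = io.StringIO('Saya\nawak\n\nsaya\n')
stop_words = load_stop_words(sample_file)
print(stop_words)
```

The resulting list can be passed directly as TfidfVectorizer(stop_words=stop_words).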