I have a set of documents, such as:
D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."
and a set of words, such as:
"sky","land","sea","water","sun","moon"
I want to build a matrix like this:
x       D1      D2      D3
sky     tf-idf  0       tf-idf
land    0       0       0
sea     0       0       0
water   0       0       0
sun     0       tf-idf  tf-idf
moon    0       0       0
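This is similar to the example table given here: http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html. The linked example uses the words that occur in the documents themselves, but I need to use the set of words listed above.
If a particular word occurs in a document, I put its tf-idf value in the matrix; otherwise I put 0.
Any idea how I can build such a matrix? Python would be best, but R is also appreciated.
I am using the following code, but I am not sure whether I am doing the right thing. My code is: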
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
train_set = "The sky is blue.", "The sun is bright.", "The sun in the sky is bright." #Documents
test_set = ["sky","land","sea","water","sun","moon"] #Query
stopWords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
#print 'Fit Vectorizer to train set', trainVectorizerArray
#print 'Transform Vectorizer to test set', testVectorizerArray
transformer.fit(trainVectorizerArray)
#print
#print transformer.transform(trainVectorizerArray).toarray()
transformer.fit(testVectorizerArray)
#print
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()
The result I get looks absurd (the values are only 0 and 1, whereas I expected values between 0 and 1):
[[ 0. 0. 1. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 1.]
[ 0. 0. 0. 0.]
[ 1. 0. 0. 0.]]
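A likely explanation for the 0/1 pattern, offered as a sketch of the arithmetic rather than a trace through scikit-learn: every entry of test_set is treated as a one-word document, so each row has at most one non-zero count, and the default L2 normalization of TfidfTransformer (norm='l2') rescales that single weight to exactly 1. The helper l2_normalize and the weights below are hypothetical and exist only for illustration.

```python
import math

def l2_normalize(row):
    """Scale a row to unit Euclidean length; all-zero rows stay zero."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm if norm else 0.0 for v in row]

# Columns follow the vocabulary fitted on train_set: blue, bright, sky, sun.
# The weights are made-up idf-like values, not scikit-learn output.
sky_row = [0.0, 0.0, 2.0, 0.0]       # one-word document "sky"
land_row = [0.0, 0.0, 0.0, 0.0]      # "land" is not in the fitted vocabulary
two_word_row = [0.0, 0.0, 3.0, 4.0]  # a document with two known words

print(l2_normalize(sky_row))       # the single weight becomes 1
print(l2_normalize(land_row))      # stays all zeros
print(l2_normalize(two_word_row))  # values strictly between 0 and 1
```

Only rows with at least two non-zero weights can end up with values strictly between 0 and 1 after this normalization.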
I am also open to other libraries for computing tf-idf. I just want a correct matrix like the one described above.
An R solution might look like this:
I believe what you want is
(As noted earlier, this is not a test set but a vocabulary.)
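A minimal sketch of the vocabulary idea, assuming scikit-learn's TfidfVectorizer is acceptable: the word list is passed as the vocabulary parameter instead of being transformed as documents, and the result is transposed so that rows are words and columns are documents. The variable names documents, vocabulary, doc_term and word_doc are chosen here for illustration and are not taken from the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The sky is blue.",
    "The sun is bright.",
    "The sun in the sky is bright.",
]
vocabulary = ["sky", "land", "sea", "water", "sun", "moon"]

# A fixed vocabulary restricts the columns to exactly these words,
# in this order; tokens outside the list are ignored.
vectorizer = TfidfVectorizer(vocabulary=vocabulary)
doc_term = vectorizer.fit_transform(documents)  # shape: (3 documents, 6 words)

# Transpose to get the words-by-documents layout from the question.
word_doc = doc_term.T.toarray()                 # shape: (6 words, 3 documents)

for word, row in zip(vocabulary, word_doc):
    print(word, row.round(3))
```

With the default L2 normalization, a document that contains only one word from the vocabulary still gets a weight of 1 for that word (D1 for "sky", D2 for "sun"), while D3 splits its weight between "sky" and "sun". Passing norm=None to TfidfVectorizer keeps the unnormalized tf-idf products instead, if that is closer to what is wanted.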