Creating a matrix of tf-idf values

x      D1      D2      D3
sky    tf-idf  0       tf-idf
land   0       0       0
sea    0       0       0
water  0       0       0
sun    0       tf-idf  tf-idf
moon   0       0       0

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords

train_set = ("The sky is blue.", "The sun is bright.", "The sun in the sky is bright.")  # Documents
test_set = ["sky", "land", "sea", "water", "sun", "moon"]  # Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words=stopWords)
# print(vectorizer)
transformer = TfidfTransformer()
# print(transformer)

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
# print('Fit Vectorizer to train set', trainVectorizerArray)
# print('Transform Vectorizer to test set', testVectorizerArray)

transformer.fit(trainVectorizerArray)
# print()
# print(transformer.transform(trainVectorizerArray).toarray())

transformer.fit(testVectorizerArray)
# print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

2 Answers

User

#1 · edited 2024-06-23 03:12:42

An R solution might look like this:

library(tm)
docs <- c(D1 = "The sky is blue.",
          D2 = "The sun is bright.",
          D3 = "The sun in the sky is bright.")
dict <- c("sky","land","sea","water","sun","moon")
mat <- TermDocumentMatrix(Corpus(VectorSource(docs)), 
                          control=list(weighting =  weightTfIdf, 
                                       dictionary = dict))
as.matrix(mat)[dict, ]
#         Docs
# Terms          D1        D2        D3
#   sky   0.5849625 0.0000000 0.2924813
#   land  0.0000000 0.0000000 0.0000000
#   sea   0.0000000 0.0000000 0.0000000
#   water 0.0000000 0.0000000 0.0000000
#   sun   0.0000000 0.5849625 0.2924813
#   moon  0.0000000 0.0000000 0.0000000
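
The values in this output can be checked by hand. The sketch below is not tm's code; it assumes that `weightTfIdf` with default settings divides each term count by the number of dictionary terms in the document and multiplies by `log2(N / df)`, which is consistent with the numbers shown. The helper name `tfidf_tm_style` is made up for illustration.

```python
import math
import re

docs = {
    "D1": "The sky is blue.",
    "D2": "The sun is bright.",
    "D3": "The sun in the sky is bright.",
}
dictionary = ["sky", "land", "sea", "water", "sun", "moon"]


def tfidf_tm_style(docs, dictionary):
    """Term-by-document tf-idf: normalized tf times log2(N / df).

    Assumption: this mirrors tm's default weightTfIdf; it is a hand-written
    approximation, not the library implementation.
    """
    # Count only the dictionary terms in each document.
    counts = {}
    for name, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        counts[name] = {term: tokens.count(term) for term in dictionary}

    n_docs = len(docs)
    matrix = {}
    for term in dictionary:
        # Document frequency: how many documents contain the term.
        df = sum(1 for name in docs if counts[name][term] > 0)
        idf = math.log2(n_docs / df) if df else 0.0
        row = {}
        for name in docs:
            total = sum(counts[name].values())
            tf = counts[name][term] / total if total else 0.0
            row[name] = tf * idf
        matrix[term] = row
    return matrix


matrix = tfidf_tm_style(docs, dictionary)
for term in dictionary:
    print(f"{term:<6}" + "".join(f"{matrix[term][d]:>11.7f}" for d in docs))
```

For example, `sky` occurs in two of the three documents, so its idf is `log2(3/2)`; in D3 it shares the document with `sun`, so its normalized tf is 1/2 and the weight is half of the D1 value.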

User

#2 · edited 2024-06-23 03:12:42

I believe what you want is:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stopWords, vocabulary=test_set)
matrix = vectorizer.fit_transform(train_set)

(As mentioned above, this is not a test set but a vocabulary.)
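
To see what kind of numbers this produces without installing anything, here is a dependency-free sketch of the weighting `TfidfVectorizer` applies with its default settings: raw counts times a smoothed idf, `ln((1 + N) / (1 + df)) + 1`, followed by L2 normalization of each document row. The function name `tfidf_sklearn_style` is hypothetical and the tokenization is simplified, so treat it as an approximation of the library's behavior. The result is document-by-term, so it is transposed at the end to match the term-by-document layout in the question.

```python
import math
import re

train_set = ["The sky is blue.", "The sun is bright.", "The sun in the sky is bright."]
vocabulary = ["sky", "land", "sea", "water", "sun", "moon"]


def tfidf_sklearn_style(documents, vocabulary):
    """Document-by-term tf-idf with smoothed idf and L2-normalized rows.

    Assumption: approximates scikit-learn's default settings; it is not
    the library implementation.
    """
    n_docs = len(documents)

    # Raw counts, restricted to the fixed vocabulary (words of 2+ characters).
    counts = []
    for text in documents:
        tokens = re.findall(r"\w\w+", text.lower())
        counts.append([tokens.count(term) for term in vocabulary])

    # Smoothed idf, so terms that occur in no document do not divide by zero.
    idf = []
    for j in range(len(vocabulary)):
        df = sum(1 for row in counts if row[j] > 0)
        idf.append(math.log((1 + n_docs) / (1 + df)) + 1)

    # Weight each row, then scale it to unit Euclidean length.
    matrix = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        matrix.append([v / norm if norm else 0.0 for v in weighted])
    return matrix


doc_term = tfidf_sklearn_style(train_set, vocabulary)
term_doc = [list(col) for col in zip(*doc_term)]  # transpose: terms as rows
for term, row in zip(vocabulary, term_doc):
    print(f"{term:<6}" + "".join(f"{v:>8.4f}" for v in row))
```

Because each document row is normalized to unit length, these values differ from the R output above even though the nonzero cells fall in the same places.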
