TF*IDF for search queries

Posted 2024-06-23 03:38:27

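I have been following these two posts on TF*IDF, but I am a bit confused: http://css.dzone.com/articles/machine-learning-text-feature

Basically, I want to build a search query that runs against multiple documents. I would like to use the scikit-learn toolkit together with the NLTK library for Python.

The problem is that I can't figure out where the two TF*IDF vectors come from. I need one search query and multiple documents to search. My plan was to compute the TF*IDF score of each document against each query, find the cosine similarity between them, and then rank the documents by sorting the scores in descending order. However, the code doesn't seem to produce the right vectors.

Whenever I reduce the query to a single search, it returns a huge list of 0s, which is really odd.

Here is the code: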

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords

train_set = ("The sky is blue.", "The sun is bright.") #Documents
test_set = ("The sun in the sky is bright.") #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
print('Transform Vectorizer to test set', testVectorizerArray)

transformer.fit(trainVectorizerArray)
print(transformer.transform(trainVectorizerArray).toarray())

transformer.fit(testVectorizerArray)

tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

1 answer

#1 · Posted 2024-06-23 03:38:27

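You defined train_set and test_set as tuples, but I think they should be lists. Note that test_set as written is not actually a tuple: parentheses around a single string leave it a plain string, so the vectorizer is not handed a collection containing one query. Wrapping the query in a list fixes that: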

train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query

With this change the code seems to run fine.
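
The ranking procedure described in the question (one TF*IDF vector per document, one for the query, cosine similarity, then a descending sort) can be sketched in plain Python as below. This is only a minimal illustration and not scikit-learn's implementation: the tiny stop-word set, the helper names (tokenize, fit_idf, tfidf_vector, cosine, rank) and the smoothed IDF formula are all assumptions made for the example. In scikit-learn the same steps would more likely be written with TfidfVectorizer and sklearn.metrics.pairwise.cosine_similarity.

```python
import math
import re
from collections import Counter

# Hypothetical minimal stop-word list, for illustration only.
STOP_WORDS = {"the", "is", "in"}


def tokenize(text):
    """Lowercase, keep runs of letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def fit_idf(documents):
    """Compute smoothed IDF weights from the documents only (assumed formula)."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Map a text to a sparse {term: weight} vector; unseen terms are ignored."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, documents):
    """Return (score, document) pairs sorted by descending similarity."""
    idf = fit_idf(documents)  # fitted on the documents, never on the query
    query_vec = tfidf_vector(query, idf)
    scored = [(cosine(query_vec, tfidf_vector(doc, idf)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


documents = ["The sky is blue.", "The sun is bright."]
results = rank("The sun in the sky is bright.", documents)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

The two vectors being compared are therefore a document vector and the query vector, both weighted with IDF values learned from the documents alone. The query is only projected into that space, which is why the sketch never fits anything on the query itself.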
