Using CountVectorizer in Python to count word occurrences for my own vocabulary

Posted 2024-06-26 01:57:41


Doc1: ['And that was the fallacy. Once I was free to talk with staff members']

Doc2: ['In the new, stripped-down, every-job-counts business climate, these human']

Doc3 : ['Another reality makes emotional intelligence ever more crucial']

Doc4: ['The globalization of the workforce puts a particular premium on emotional']

Doc5: ['As business changes, so do the traits needed to excel. Data tracking']

Here is an example of my vocabulary:

^{pr2}$

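The point is that every entry in my vocabulary is a bigram or a trigram. My vocabulary contains all possible bigrams and trigrams in the document set; I have only shown a sample here. The application requires the vocabulary to be in this form. I tried using CountVectorizer as follows: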

from sklearn.feature_extraction.text import CountVectorizer
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]
vectorizer = CountVectorizer( vocabulary=my_vocabulary)
tf = vectorizer.fit_transform(doc_set) 

I expected to get something like this:

print tf:
(0, 126)    1
(0, 6804)   1
(0, 5619)   1
(0, 5019)   2
(0, 5012)   1
(0, 999)    1
(0, 996)    1
(0, 4756)   4

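where the first column is the document ID, the second column is the ID of the word in the vocabulary, and the third column is the number of times that word occurs in that document. But tf is empty. I know that, as a last resort, I could write code that loops over every word in the vocabulary, counts its occurrences, and builds the matrix, but can I use CountVectorizer on my input and save time? Am I doing something wrong? If CountVectorizer is not the right approach, any suggestions would be appreciated.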


1 answer
User
Answer 1 · posted 2024-06-26 01:57:41

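By specifying the ngram_range parameter of CountVectorizer, you can build a vocabulary that contains all possible bigrams and trigrams. After fit_transform, you can inspect the vocabulary and the frequencies with the get_feature_names_out() and toarray() methods (in scikit-learn versions before 1.0 the former is called get_feature_names()). toarray() returns the frequency matrix with one row per document. More information: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction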

from sklearn.feature_extraction.text import CountVectorizer

Doc1 = 'And that was the fallacy. Once I was free to talk with staff members'
Doc2 = 'In the new, stripped-down, every-job-counts business climate, these human'
Doc3 = 'Another reality makes emotional intelligence ever more crucial'
Doc4 = 'The globalization of the workforce puts a particular premium on emotional'
Doc5 = 'As business changes, so do the traits needed to excel. Data tracking'
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]

vectorizer = CountVectorizer(ngram_range=(2, 3))
tf = vectorizer.fit_transform(doc_set)
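print(vectorizer.vocabulary_)              # maps each n-gram to its column index
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(tf.toarray())                        # dense count matrix, one row per document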
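
The question asks for output as (document ID, word ID) count triples. One way to list them, shown here only as a minimal sketch, is to convert the sparse result to COO format and read its row, col and data arrays. The two-document `docs` list and the `triples` name are chosen for illustration and are not part of the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two of the documents from the question, as plain strings.
docs = [
    'And that was the fallacy. Once I was free to talk with staff members',
    'As business changes, so do the traits needed to excel. Data tracking',
]

vectorizer = CountVectorizer(ngram_range=(2, 3))
tf = vectorizer.fit_transform(docs)

# COO format exposes parallel arrays: row (document ID),
# col (word ID in the vocabulary) and data (occurrence count).
coo = tf.tocoo()
triples = sorted(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))

# Show the first few entries in the layout used in the question.
for doc_id, word_id, count in triples[:5]:
    print((doc_id, word_id), count)
```

Calling print(tf) on the sparse matrix directly gives a similar listing; the explicit triples are mainly useful if the counts need further processing.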

As for what you did: it will work if you fit CountVectorizer on your vocabulary and then transform the documents.

^{pr2}$
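
One plausible reason for the empty tf in the question is that CountVectorizer only extracts unigrams by default (ngram_range=(1, 1)), so a vocabulary made of bigrams and trigrams never matches anything. The sketch below passes a fixed vocabulary together with ngram_range=(2, 3). Note that `my_vocabulary` here is a small hypothetical sample, not the asker's real list, and its entries are lowercase because the vectorizer lowercases text by default.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc_set = [
    'And that was the fallacy. Once I was free to talk with staff members',
    'Another reality makes emotional intelligence ever more crucial',
]

# Hypothetical sample vocabulary (an assumption for illustration only).
my_vocabulary = [
    'that was',
    'was the fallacy',
    'emotional intelligence',
    'staff members',
]

# Default ngram_range=(1, 1): only single words are extracted,
# so none of the multi-word vocabulary entries can match.
unigram_tf = CountVectorizer(vocabulary=my_vocabulary).fit_transform(doc_set)
print(unigram_tf.nnz)  # number of non-zero entries

# With ngram_range=(2, 3) the extracted bigrams and trigrams are
# looked up in the fixed vocabulary; columns follow the list order.
vectorizer = CountVectorizer(vocabulary=my_vocabulary, ngram_range=(2, 3))
tf = vectorizer.fit_transform(doc_set)
print(tf.toarray())
```

With a fixed vocabulary passed as a list, the column index of each n-gram is its position in that list, so the word IDs stay stable across document sets.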
