Counting word occurrences for my own vocabulary with CountVectorizer in Python

Doc1: ['And that was the fallacy. Once I was free to talk with staff members']
Doc2: ['In the new, stripped-down, every-job-counts business climate, these human']
Doc3: ['Another reality makes emotional intelligence ever more crucial']
Doc4: ['The globalization of the workforce puts a particular premium on emotional']
Doc5: ['As business changes, so do the traits needed to excel. Data tracking']

1 answer

User

#1 · Posted 2024-06-26 01:57:41

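By setting the ngram_range parameter of CountVectorizer, you can build a vocabulary containing every bigram and trigram that occurs in your documents. After fit_transform, you can inspect the vocabulary with get_feature_names_out() (named get_feature_names() in older scikit-learn releases; that name was removed in version 1.2) and the counts with toarray(). The latter returns a dense count matrix with one row per document and one column per n-gram. More information: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction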

from sklearn.feature_extraction.text import CountVectorizer

Doc1 = 'And that was the fallacy. Once I was free to talk with staff members'
Doc2 = 'In the new, stripped-down, every-job-counts business climate, these human'
Doc3 = 'Another reality makes emotional intelligence ever more crucial'
Doc4 = 'The globalization of the workforce puts a particular premium on emotional'
Doc5 = 'As business changes, so do the traits needed to excel. Data tracking'
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]

# Build a vocabulary of every bigram and trigram found in the documents
vectorizer = CountVectorizer(ngram_range=(2, 3))
# Learn the vocabulary and return a sparse document-term count matrix
tf = vectorizer.fit_transform(doc_set)
print(vectorizer.vocabulary_)              # mapping: n-gram -> column index
print(vectorizer.get_feature_names_out())  # n-grams in column order (scikit-learn >= 1.0)
print(tf.toarray())                        # dense counts, one row per document

As for your own approach, it will work if you fit the CountVectorizer on your vocabulary and then transform the documents.

