CountVectorizer runs out of memory when converting from sparse to dense

vect = CountVectorizer(vocabulary=list(word_to_index.keys()), tokenizer=lambda x: x.split())
X = vect.fit_transform(docs)
X_arr = X.toarray()
rel_freq = np.sum(X_arr, axis=0) / len(docs)
names = vect.get_feature_names()

1 answer

User

#1 · Posted on 2024-10-03 09:18:25

If you only need the frequencies, you can sum with the sparse matrix's own sum method, without converting to a dense array:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

corpus = ['This is the first document.','This is the second second document.',
'And the third one.','Is this the first document?']

X = vectorizer.fit_transform(corpus)

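# Sparse sum: the full dense array is never built
X.sum(axis=0)/len(corpus)
matrix([[0.25, 0.75, 0.5 , 0.75, 0.25, 0.5 , 1.  , 0.25, 0.75]])

# Dense equivalent, shown only to confirm the values match; this is the step that exhausts memory on a large corpus
X.toarray().sum(axis=0)/ len(corpus)
array([0.25, 0.75, 0.5 , 0.75, 0.25, 0.5 , 1.  , 0.25, 0.75])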
