How do I use sklearn CountVectorizer with multi-word strings?

Posted 2024-06-26 11:30:18

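I have a list of strings (10,000 of them). Some of the strings consist of more than one word. I also have a second list that contains sentences. I am trying to count how many times each string from the first list appears in each sentence.

At the moment I am using sklearn's feature extraction tools, because they run very quickly when there are 10,000 strings and 10,000 sentences to search.

Below is a simplified version of my code.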

import numpy as np
from sklearn import feature_extraction

sentences = ["hi brown cow", "red ants", "fierce fish"]

listOfStrings = ["brown cow", "ants", "fish"]

cv = feature_extraction.text.CountVectorizer(vocabulary=listOfStrings)
taggedSentences = cv.fit_transform(sentences).toarray()

taggedSentencesCutDown = taggedSentences > 0
# Here we get an array of tuples <sentenceIndex, stringIndexfromStringList>
taggedSentencesCutDown = np.column_stack(np.where(taggedSentencesCutDown))

If I run this, the output is:

In [1]: taggedSentencesCutDown
Out[1]: array([[1, 1], [2, 2]])

What I want is:

In [2]: taggedSentencesCutDown
Out[2]: array([[0,0], [1, 1], [2, 2]])

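The way I am currently using CountVectorizer, it evidently does not look for multi-word strings. Is there another way to do this without resorting to a long for loop? Efficiency and run time matter a lot for my application, because both of my lists contain around 10,000 items.

Thanks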

1 Answer

Answer 1 · posted 2024-06-26 11:30:18

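I solved this by using the ngram_range parameter of CountVectorizer.

If I find the largest number of words in any string in my list of strings, I can use that as the upper bound of the n-gram range. In the example above, the longest entry is "brown cow", which has two words.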

cv = feature_extraction.text.CountVectorizer(vocabulary=listOfStrings,
       ngram_range=(1, 2))
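
For completeness, here is a minimal end-to-end sketch of the same idea using the toy data from the question. The names maxWords, counts and pairs are only illustrative, and counting words with str.split() is an assumption that each vocabulary entry is separated by plain whitespace. It also reads the matches straight from the sparse matrix instead of calling toarray(), which should avoid building a dense 10000 x 10000 array.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["hi brown cow", "red ants", "fierce fish"]
listOfStrings = ["brown cow", "ants", "fish"]

# Longest vocabulary entry, measured in whitespace-separated words
# (assumption: entries are plain space-separated words).
maxWords = max(len(s.split()) for s in listOfStrings)

# Build n-grams from 1 word up to maxWords so multi-word entries can match.
cv = CountVectorizer(vocabulary=listOfStrings, ngram_range=(1, maxWords))

# counts[i, j] = number of times listOfStrings[j] occurs in sentences[i]
counts = cv.fit_transform(sentences)

# Row/column indices of the non-zero counts, taken from the sparse matrix.
rows, cols = counts.nonzero()
pairs = np.column_stack((rows, cols))  # <sentenceIndex, stringIndex> pairs
```

One thing to keep in mind: with the default settings, CountVectorizer lowercases the text and only keeps tokens of two or more alphanumeric characters, so a vocabulary entry has to look like the lowercased, space-joined tokens in order to match.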
