How do I vectorize and de-vectorize text with sklearn's CountVectorizer?

Published 2024-06-28 09:45:11


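I want to vectorize some text into its corresponding integers, that is, convert the text to the integers it maps to, and then build a new sentence from a new integer input such as [2,9,39,46,56,12,89,9].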

I have seen custom functions used for this purpose, but I would like to know whether sklearn itself provides such a function.

from sklearn.feature_extraction.text import CountVectorizer

a=["""Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Morbi imperdiet mauris posuere, condimentum odio et, volutpat orci.
Curabitur sodales vulputate eros eu gravida. Sed pharetra imperdiet nunc et tempor.
Nullam lectus est, rhoncus vitae lacus at, fermentum aliquam metus.
Phasellus a sollicitudin tortor, non tempor nulla.
Etiam mattis felis enim, a malesuada ligula dignissim at.
Integer congue dolor ut magna blandit, lobortis consequat ante aliquam.
Nulla imperdiet libero eget lorem sagittis, eget iaculis orci dignissim. 
Phasellus sit amet sodales odio. Pellentesque commodo tempor risus, et tincidunt neque. 
Praesent et sem velit. Maecenas id risus sit amet ex convallis ultrices vel sed purus. 
Sed fringilla, leo quis congue sollicitudin, mauris nunc vehicula mi, et laoreet ligula 
urna et nulla. Nam sollicitudin urna sed dolor vehicula euismod. Mauris bibendum pulvinar
ornare. In suscipit sed mi ut posuere.
Proin egestas, nibh ut egestas mattis, ipsum nulla bibendum enim, ac suscipit nisl justo 
id metus. Nam est dui, elementum eget suscipit nec, aliquam in mi. Integer tortor erat,
aliquet at sapien et, fringilla posuere leo. Praesent non congue est. Vivamus tincidunt
tellus eu placerat tincidunt. Phasellus convallis lacus vitae ex congue efficitur.
Sed ut bibendum massa, vitae molestie ligula. Phasellus purus felis, fermentum vitae 
hendrerit vel, vulputate quis metus."""]


vec = CountVectorizer()
dtm=vec.fit_transform(a)
print(vec.vocabulary_)

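# convert text to corresponding vectors
mapped_a = ...  # placeholder: this is what the question asks for

# new sentence using below mapped values
# input [2,9,39,46,56,12,89,9]
# creating sentence using specific sequence

new_sentence = ...  # placeholder: this is what the question asks for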

2 Answers

To vectorize a sentence into integers, use the transform function. Its output is a term-count feature vector: one count per vocabulary term.

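from sklearn.feature_extraction.text import CountVectorizer

# `a` is the list holding the Lorem ipsum text from the question
vec = CountVectorizer()
vec.fit(a)
print(vec.vocabulary_)  # dict mapping each term to its column index

new_sentence = "dolor nulla enim"
mapped_a = vec.transform([new_sentence])
print(mapped_a.toarray())  # dense view of the sparse feature vector

tokenizer = vec.build_tokenizer()
# print the vocabulary id of each token, in sentence order
# (the tokenizer does not lowercase; the vocabulary is lowercase by default)
for token in tokenizer(new_sentence):
    print(vec.vocabulary_.get(token))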

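The second part of the question is not as straightforward. CountVectorizer has an inverse_transform function that takes a sparse feature vector as input. In your case, however, you want to build a sentence in which the same term may appear more than once, and that is not possible with this function.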
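
To illustrate why the built-in inverse is not enough here, the following minimal sketch runs inverse_transform on a count matrix. The short `corpus` string is a stand-in chosen for this example, not the text from the question. The output contains each term once, so neither word order nor repetition survives the round trip.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in corpus (an assumption for this sketch); "dolor" occurs twice.
corpus = ["dolor sit amet dolor nulla enim"]

vec = CountVectorizer()
counts = vec.fit_transform(corpus)  # sparse matrix, one row per document

# inverse_transform returns, for each document, the terms with a nonzero count.
recovered = vec.inverse_transform(counts)[0]

# Each distinct term appears once, however often it occurred in the text.
print(sorted(str(term) for term in recovered))
```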

However, the reverse mapping can be built from the vocabulary. CountVectorizer has no inverse_vocabulary by default; you have to create it from vocabulary_.
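
One possible way to build such a reverse mapping is sketched below. The names `inverse_vocabulary`, `encode`, and `decode` are hypothetical helpers introduced for this example and are not part of sklearn; only `vocabulary_` and `build_analyzer()` are real CountVectorizer members. The short `corpus` is again a stand-in for the text from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in corpus (an assumption for this sketch).
corpus = ["dolor sit amet dolor nulla enim"]

vec = CountVectorizer()
vec.fit(corpus)

# vocabulary_ maps term -> column index; flip it to get index -> term.
# (hypothetical helper dict, not an sklearn attribute)
inverse_vocabulary = {int(idx): term for term, idx in vec.vocabulary_.items()}

# build_analyzer() applies the same lowercasing and tokenization used by fit/transform.
analyze = vec.build_analyzer()


def encode(text):
    """Text -> list of term ids, keeping order and repeats; unknown tokens are skipped."""
    return [int(vec.vocabulary_[tok]) for tok in analyze(text) if tok in vec.vocabulary_]


def decode(ids):
    """List of term ids -> space-joined sentence; raises KeyError for an unknown id."""
    return " ".join(inverse_vocabulary[i] for i in ids)


ids = encode("Dolor nulla dolor")
print(ids)          # [1, 3, 1]
print(decode(ids))  # dolor nulla dolor
```

With a vectorizer fitted on the full text from the question, a call like `decode([2, 9, 39, 46, 56, 12, 89, 9])` would produce the new sentence, provided every id exists in the vocabulary.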


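Have a look at the preprocessing module in sklearn: LabelEncoder and OneHotEncoder are commonly used to encode categorical variables. Encoding full text this way is not recommended, though!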
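
For reference, a minimal sketch of what LabelEncoder does when applied to a list of tokens. The whitespace split and the sample string are assumptions made for this illustration; as noted above, this encoder is intended for categorical labels rather than full text.

```python
from sklearn.preprocessing import LabelEncoder

# Naive whitespace tokenization of a stand-in sentence (an assumption for this sketch).
tokens = "dolor sit amet dolor nulla enim".split()

encoder = LabelEncoder()
ids = encoder.fit_transform(tokens)        # one integer per token, repeats preserved
restored = encoder.inverse_transform(ids)  # back to the original tokens, in order

print([int(i) for i in ids])
print(" ".join(restored))
```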
