Associating text vectors with their dictionary keys

Posted 2024-09-29 09:34:11


I have text retrieved from a sqlite3 database. I want to use CountVectorizer to get a text vector for each message so that I can compare text similarity. I also have a dictionary in which each text is stored under its messageID (as the dictionary key). How can I associate each text vector with its messageID? E.g., given an array of vectors like this:

    [[1 1 0 1 1 0 1]
     [0 1 1 1 1 0 1]
     [0 1 0 1 1 1 1]]

I want to know that messageID = 0 has the vector [1 1 0 1 1 0 1]. The vector size and the size of the array grow with every new message.

I have tried passing the dictionary into CountVectorizer and tried evaluating just a single message, but neither worked.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cosineSimilarity


def getVectorsAndFeatures(strs):
    text = [t for t in strs]
    # CountVectorizer's constructor does not take the corpus;
    # pass the documents to fit()/transform() instead
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    vectors = vectorizer.transform(text).toarray()
    features = vectorizer.get_feature_names()
    return vectors, features


text = ['This is the first sentence', 'This is the second sentence',
        'This is the third sentence']
messageDict = {0: 'This is the first sentence', 1: 'This is the second sentence', 2: 'This is the third sentence'}

vectors, features = getVectorsAndFeatures(text)

1 Answer

#1

Following your example, you have a mapping between message IDs and sentences:

>>> text = ['This is the first sentence', 'This is the second sentence',
 'This is the third sentence']
>>> message_map = dict(zip(range(len(text)), text))
>>> message_map
{0: 'This is the first sentence', 1: 'This is the second sentence', 2: 'This is the third sentence'}

Then you need to use CountVectorizer to count how many times each text feature appears in each sentence. You can run the same analysis as before:

>>> vectorizer = CountVectorizer() 
>>> # Learn the vocabulary dictionary and return the term-document matrix
>>> vectors = vectorizer.fit_transform(message_map.values()).toarray()
>>> vectors
array([[1, 1, 0, 1, 1, 0, 1],
       [0, 1, 1, 1, 1, 0, 1],
       [0, 1, 0, 1, 1, 1, 1]], dtype=int64)
>>> # get a mapping of the feature associated with each count entry
>>> features = vectorizer.get_feature_names()
>>> features
['first', 'is', 'second', 'sentence', 'the', 'third', 'this']

The fit_transform() documentation says:

fit_transform(self, raw_documents, y=None)

Parameters: raw_documents : iterable

An iterable which yields either str, unicode or file objects.

Returns: X : array, [n_samples, n_features]

Document-term matrix.

This means each vector corresponds to a sentence of the input text in the same order (i.e. message_map.values()). If you want to map an ID to each vector, you only need to do the following (note that order is preserved):

>>> vector_map = dict(zip(message_map.keys(), vectors.tolist()))
>>> vector_map
{0: [1, 1, 0, 1, 1, 0, 1], 1: [0, 1, 1, 1, 1, 0, 1], 2: [0, 1, 0, 1, 1, 1, 1]}
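Since your question imports cosine_similarity, the ID-keyed vectors can then be compared directly. A minimal sketch putting the pieces together:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

message_map = {0: 'This is the first sentence',
               1: 'This is the second sentence',
               2: 'This is the third sentence'}

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(message_map.values()).toarray()
# messageID -> count vector, order preserved
vector_map = dict(zip(message_map.keys(), vectors.tolist()))

# Cosine similarity between messageID 0 and messageID 1
sim = cosine_similarity([vector_map[0]], [vector_map[1]])[0][0]
print(round(sim, 2))  # the two sentences share 4 of their 5 words
```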

I believe what you are asking about is fitting on one corpus and then transforming new sentences into feature vectors. Note, however, that any new words that are not in the original corpus will be ignored, as this example shows:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first sentence', 'This is the second sentence']
vectorizer = CountVectorizer() 
vectorizer.fit(corpus)

message_map = {0:'This is the first sentence', 1:'This is the second sentence', 2:'This is the third sentence'}

vector_map = { k: vectorizer.transform([v]).toarray().tolist()[0] for k, v in message_map.items()}

You will get:

>>> vector_map
{0: [1, 1, 0, 1, 1, 1], 1: [0, 1, 1, 1, 1, 1], 2: [0, 1, 0, 1, 1, 1]}

Note that there is now one feature fewer than before, because the word third is no longer part of the feature set.

>>> vectorizer.get_feature_names()
['first', 'is', 'second', 'sentence', 'the', 'this']

This can be somewhat problematic when computing the similarity between vectors, because you are discarding words that would help distinguish them.

On the other hand, you could use an English dictionary, or a subset of one, as the corpus for the vectorizer. However, the resulting vectors would become much sparser, and again this could cause problems when comparing them. But that depends on the method you use to compute the distance between vectors.
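One way to keep the feature space stable without refitting on new text is CountVectorizer's vocabulary parameter, which fixes the feature set up front. A sketch, where the word list is a hypothetical stand-in for a real dictionary subset:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical fixed word list standing in for an English dictionary subset
vocabulary = ['first', 'is', 'second', 'sentence', 'the', 'third', 'this']
vectorizer = CountVectorizer(vocabulary=vocabulary)

# With a fixed vocabulary, transform() works without a prior fit(),
# and every message maps into the same 7-dimensional space
vec = vectorizer.transform(['This is the third sentence']).toarray()[0]
print(vec.tolist())
```

Words outside the fixed vocabulary are still ignored, but the vector length no longer changes as messages arrive, which keeps ID-to-vector mappings comparable over time.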
