Associating text vectors with their dictionary keys

Posted 2024-09-29 09:34:11


I have text retrieved from a sqlite3 database. I want to use CountVectorizer to get a text vector for each message so that I can compare text similarity. I also have a dictionary in which each text is stored under its messageID (as the dictionary key). How can I associate each text vector with its messageID? E.g., given an array of vectors like this:

    [[1 1 0 1 1 0 1]
     [0 1 1 1 1 0 1]
     [0 1 0 1 1 1 1]]

I want to know that messageID = 0 has the vector [1 1 0 1 1 0 1]. The vector size and the size of the array grow with every new message.

I have tried passing the dictionary into CountVectorizer and tried evaluating just a single message, but neither worked.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cosineSimilarity


def getVectorsAndFeatures(strs):
    text = [t for t in strs]
    # CountVectorizer's constructor does not take the corpus;
    # pass the documents to fit()/transform() instead
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    vectors = vectorizer.transform(text).toarray()
    features = vectorizer.get_feature_names()
    return vectors, features


text = ['This is the first sentence', 'This is the second sentence',
        'This is the third sentence']
messageDict = {0: 'This is the first sentence', 1: 'This is the second sentence', 2: 'This is the third sentence'}

vectors, features = getVectorsAndFeatures(text)

1 Answer

#1

Following your example, you have a mapping between message IDs and sentences:

>>> text = ['This is the first sentence', 'This is the second sentence',
 'This is the third sentence']
>>> message_map = dict(zip(range(len(text)), text))
>>> message_map
{0: 'This is the first sentence', 1: 'This is the second sentence', 2: 'This is the third sentence'}

Then you need to use CountVectorizer to count how many times each text feature appears in each sentence. You can run the same analysis as before:

>>> vectorizer = CountVectorizer() 
>>> # Learn the vocabulary dictionary and return the term-document matrix
>>> vectors = vectorizer.fit_transform(message_map.values()).toarray()
>>> vectors
array([[1, 1, 0, 1, 1, 0, 1],
       [0, 1, 1, 1, 1, 0, 1],
       [0, 1, 0, 1, 1, 1, 1]], dtype=int64)
>>> # get a mapping of the feature associated with each count entry
>>> features = vectorizer.get_feature_names()
>>> features
['first', 'is', 'second', 'sentence', 'the', 'third', 'this']

The fit_transform() documentation says:

fit_transform(self, raw_documents, y=None)

Parameters: raw_documents : iterable

An iterable which yields either str, unicode or file objects.

Returns: X : array, [n_samples, n_features]

Document-term matrix.

This means each vector corresponds to a sentence of the input text in the same order (i.e. message_map.values()). If you want to map an ID to each vector, you only need to do the following (note that order is preserved):

>>> vector_map = dict(zip(message_map.keys(), vectors.tolist()))
>>> vector_map
{0: [1, 1, 0, 1, 1, 0, 1], 1: [0, 1, 1, 1, 1, 0, 1], 2: [0, 1, 0, 1, 1, 1, 1]}
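Since your question imports cosine_similarity, the ID-keyed vectors can then be compared directly. A minimal sketch putting the pieces together:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

message_map = {0: 'This is the first sentence',
               1: 'This is the second sentence',
               2: 'This is the third sentence'}

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(message_map.values()).toarray()
# messageID -> count vector, order preserved
vector_map = dict(zip(message_map.keys(), vectors.tolist()))

# Cosine similarity between messageID 0 and messageID 1
sim = cosine_similarity([vector_map[0]], [vector_map[1]])[0][0]
print(round(sim, 2))  # the two sentences share 4 of their 5 words
```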

I believe what you are asking about is fitting on one corpus and then transforming new sentences into feature vectors. Note, however, that any new words that are not in the original corpus will be ignored, as this example shows:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first sentence', 'This is the second sentence']
vectorizer = CountVectorizer() 
vectorizer.fit(corpus)

message_map = {0:'This is the first sentence', 1:'This is the second sentence', 2:'This is the third sentence'}

vector_map = { k: vectorizer.transform([v]).toarray().tolist()[0] for k, v in message_map.items()}

You will get:

>>> vector_map
{0: [1, 1, 0, 1, 1, 1], 1: [0, 1, 1, 1, 1, 1], 2: [0, 1, 0, 1, 1, 1]}

Note that there is now one feature fewer than before, because the word third is no longer part of the feature set.

>>> vectorizer.get_feature_names()
['first', 'is', 'second', 'sentence', 'the', 'this']

This can be somewhat problematic when computing the similarity between vectors, because you are discarding words that would help distinguish them.

On the other hand, you could use an English dictionary, or a subset of one, as the corpus for the vectorizer. However, the resulting vectors would become much sparser, and again this could cause problems when comparing them. But that depends on the method you use to compute the distance between vectors.
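One way to keep the feature space stable without refitting on new text is CountVectorizer's vocabulary parameter, which fixes the feature set up front. A sketch, where the word list is a hypothetical stand-in for a real dictionary subset:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical fixed word list standing in for an English dictionary subset
vocabulary = ['first', 'is', 'second', 'sentence', 'the', 'third', 'this']
vectorizer = CountVectorizer(vocabulary=vocabulary)

# With a fixed vocabulary, transform() works without a prior fit(),
# and every message maps into the same 7-dimensional space
vec = vectorizer.transform(['This is the third sentence']).toarray()[0]
print(vec.tolist())
```

Words outside the fixed vocabulary are still ignored, but the vector length no longer changes as messages arrive, which keeps ID-to-vector mappings comparable over time.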
