Averaging word embeddings using TF-IDF

Published 2024-09-29 22:20:48

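I have been developing a Python script that classifies whether a headline is related to the body of an article. For this I use ML (an SVM classifier) with several features, one of which is the average of the word embeddings.

The code that computes the similarity between the averaged word embeddings of each headline and its body is as follows: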

import gensim
import numpy as np
from scipy.spatial import distance
from tqdm import tqdm

# clean(), lemmatize_str() and the stop-word collection `stop` are defined elsewhere (not shown).
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# Note: gensim 4.x renamed index2word to index_to_key.
setm = set(word2vec_model.index2word)

def avg_feature_vector(words, model, num_features, index2word_set):
    # Average the vectors of all in-vocabulary, non-stop words in a paragraph.
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    for word in words:
        if word in index2word_set and word not in stop:
            try:
                featureVec = np.add(featureVec, model[word])
                nwords = nwords + 1
            except KeyError:
                # The word is in index2word_set but missing from the model; skip it.
                pass
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

def doc_similatiry(headlines, bodies):
    X = []
    docs = []  # never filled; returned empty
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        headline_avg_vector = avg_feature_vector(lemmatize_str(clean(headline)).split(), word2vec_model, 300, setm)
        body_avg_vector = avg_feature_vector(lemmatize_str(clean(body)).split(), word2vec_model, 300, setm)
        # The cosine is undefined if either vector is all zeros (no known words).
        similarity = 1 - distance.cosine(headline_avg_vector, body_avg_vector)
        X.append(similarity)
    return X, docs

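The word2vec average seems to be computed correctly. However, it scores worse than the cosine similarity of the TF-IDF vectors. So my idea is to combine the two features, that is, to multiply each word's word2vec vector by its TF-IDF score before averaging.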
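One way to write this idea down, as a sketch rather than the exact formula used in the question: here e_w is the word2vec vector of word w, tfidf(w, d) is its TF-IDF score in document d, and h and b stand for a headline and a body. Normalizing by the sum of the weights is an assumption; the question only says that each vector is multiplied by its TF-IDF score.

```latex
\mathbf{v}_d = \frac{\sum_{w \in d} \operatorname{tfidf}(w, d)\, \mathbf{e}_w}{\sum_{w \in d} \operatorname{tfidf}(w, d)},
\qquad
\operatorname{sim}(h, b) = \frac{\mathbf{v}_h \cdot \mathbf{v}_b}{\lVert \mathbf{v}_h \rVert \, \lVert \mathbf{v}_b \rVert}
```

Because cosine similarity does not change when a vector is rescaled by a positive constant, the choice of denominator in the first formula (sum of weights, number of words, or none at all) does not affect the similarity score. Only the relative weights of the words matter.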

My code is as follows:

[code block missing from the source page]

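My problem is that this method gives poor results, and I don't know whether there is some logic that explains this (in theory it should do better) or whether I made a mistake in my code.

Can anyone help me figure this out? I am also open to other approaches to this problem.

Note: I did not post the code of some of the functions used here because I considered them unnecessary. If anything is unclear, I will explain it here.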
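For comparison, here is a minimal, self-contained sketch of a TF-IDF-weighted embedding average. It is not the code from the question. The toy vectors, the function names, and the weighting scheme (raw term count times a smoothed IDF, normalized by the sum of the weights) are all assumptions made for illustration. With a real model, the `TOY_VECTORS` dictionary would be replaced by lookups into the loaded word2vec vectors.

```python
import math
from collections import Counter

# Hypothetical toy embeddings standing in for a real word2vec model.
TOY_VECTORS = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "the": [0.5, 0.5],
}


def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def weighted_avg_vector(tokens, vectors, idf, dim):
    """TF-IDF-weighted mean of the vectors of in-vocabulary tokens."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in Counter(tokens).items():
        if tok not in vectors or tok not in idf:
            continue  # skip tokens without a vector or an IDF value
        weight = count * idf[tok]  # raw term frequency times IDF
        total = [t + weight * v for t, v in zip(total, vectors[tok])]
        weight_sum += weight
    if weight_sum == 0.0:
        return None  # no usable tokens; the caller decides what to do
    return [t / weight_sum for t in total]


def cosine_similarity(a, b):
    """Cosine similarity, or 0.0 if either vector is missing or all zeros."""
    if a is None or b is None:
        return 0.0
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Tiny corpus: "the" appears everywhere, so its IDF weight is the lowest.
docs = [["the", "cat"], ["the", "dog"], ["the", "car"]]
idf = idf_weights(docs)
vecs = [weighted_avg_vector(d, TOY_VECTORS, idf, 2) for d in docs]
sim_cat_dog = cosine_similarity(vecs[0], vecs[1])
sim_cat_car = cosine_similarity(vecs[0], vecs[2])
```

Two design choices are worth checking against one's own code. First, the IDF values should be fitted on the same tokenization (cleaning, lemmatization, stop-word removal) that is used to look up the embeddings, otherwise many words end up with no weight. Second, documents with no usable tokens are handled explicitly here instead of passing an all-zero vector to the cosine, for which the similarity is undefined.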


0 answers

No answers yet.
