A faster way to compute PMI with CountVectorizer

Posted 2024-10-01 00:25:07

Male | A programmer who enjoys writing Python code.

First, I build a term-document matrix, i.e. each term is represented as a vector with one dimension per document.

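To find the PMI, I take the count of a bigram such as this is (the number of documents that contain it), together with the counts of the individual words in the bigram, this and is, and then compute (4.0 * math.log(1.0*bigramLength,2))/(1.0*math.log(word0Length,2)*math.log(1.0*word1Length,2)).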
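As a quick sanity check on the formula above, the score can be isolated in a small helper. This is only a sketch: the function name bigram_score and its parameter names are hypothetical, and the three arguments are assumed to be document counts (the number of documents containing the bigram and each of its two words).

```python
import math


def bigram_score(bigram_docs, word0_docs, word1_docs):
    """Score from the question: 4*log2(b) / (log2(w0) * log2(w1)).

    Returns None when the denominator is zero, which happens when
    either word occurs in exactly one document (log2(1) == 0).
    """
    denom = math.log2(word0_docs) * math.log2(word1_docs)
    if denom == 0:
        return None
    return 4.0 * math.log2(bigram_docs) / denom


# Bigram in 2 documents, each word in 4 documents: 4*1 / (2*2)
print(bigram_score(2, 4, 4))  # 1.0
```

Returning None for a zero denominator is a design choice made for this sketch; the loop in the question prints the three counts in that case instead.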

Is there a faster way to do this? I am not very familiar with numpy or {}.

Note that I need this value for every possible bigram in the list featureBigrams.

import math
from collections import defaultdict

import numpy
from sklearn.feature_extraction.text import CountVectorizer

f4 = ['this is sentence1','not sentence1 becuase this is not sentence1','why this this is called this is sentence1, its always setence1','fourth time this is not sentene1']

Vcount = CountVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
countMatrix = Vcount.fit_transform(f4)

# all unigrams and bigrams, as a list so that .index() works
# (use get_feature_names() on scikit-learn versions older than 1.0)
feature_names = list(Vcount.get_feature_names_out())

# all bigrams
featureBigrams = [item for item in feature_names if len(item.split()) == 2]

# document-term matrix
arrays = countMatrix.toarray()

# term-document matrix
arrayTrans = arrays.transpose()

print(len(featureBigrams))
PMIMatrix = defaultdict(dict)
for item in featureBigrams:
    words = item.split()
    # number of documents that contain the bigram
    bigramLength = len(numpy.where(arrayTrans[feature_names.index(item)] > 0)[0])
    if bigramLength < 2:
        continue
    # number of documents that contain each word of the bigram
    word0Length = len(numpy.where(arrayTrans[feature_names.index(words[0])] > 0)[0])
    word1Length = len(numpy.where(arrayTrans[feature_names.index(words[1])] > 0)[0])
    try:
        PMIMatrix[words[0]][words[1]] = (4.0 * math.log(bigramLength, 2)) / (math.log(word0Length, 2) * math.log(word1Length, 2))
    except ZeroDivisionError:
        # log2(1) == 0 when a word occurs in only one document
        print(bigramLength, word0Length, word1Length)
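One possible speed-up, offered as a sketch under assumptions rather than a definitive answer: most of the time in the loop above is likely spent in feature_names.index(...), which scans the whole feature list on every call, and in recounting documents for each bigram. The version below computes all document counts in a single pass over the sparse matrix and uses the vectorizer's vocabulary_ dictionary for constant-time column lookups. The function name bigram_scores and the toy corpus are hypothetical; the scoring formula is the one from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def bigram_scores(docs):
    """Return {word0: {word1: score}} for bigrams seen in at least 2 documents."""
    vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2),
                                 stop_words='english')
    counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

    # Number of documents containing each term, computed once for all terms
    # without converting the sparse matrix to a dense array.
    doc_freq = np.asarray((counts > 0).sum(axis=0)).ravel()

    vocab = vectorizer.vocabulary_  # term -> column index (dict lookup)

    scores = {}
    for term, col in vocab.items():
        words = term.split()
        if len(words) != 2 or doc_freq[col] < 2:
            continue
        w0 = doc_freq[vocab[words[0]]]
        w1 = doc_freq[vocab[words[1]]]
        denom = np.log2(w0) * np.log2(w1)
        if denom == 0:
            continue  # a word seen in only one document; skip the pair
        score = 4.0 * np.log2(doc_freq[col]) / denom
        scores.setdefault(words[0], {})[words[1]] = float(score)
    return scores


# Hypothetical toy corpus, chosen so that two bigrams occur in two documents.
toy_docs = ['red apple pie', 'red apple tart', 'green apple pie', 'red berry']
print(bigram_scores(toy_docs))
```

The remaining Python loop runs once per vocabulary entry and does only dictionary and array lookups, so it should scale much better than the repeated list scans, although this has not been benchmarked here.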


0 answers

No answers yet.
