A fast way to find PMI using count vectorizer

Posted 2024-10-01 00:25:07


First, I build a term-document matrix, i.e. every term is represented by a vector with one dimension per document.

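To find the PMI, I count the documents that contain a bigram, e.g. this is, and the documents that contain each individual word of the bigram, this and {}, and then compute (4.0 * math.log(1.0*bigramLength,2))/(1.0*math.log(word0Length,2)*math.log(1.0*word1Length,2))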
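As a minimal sketch of the formula above, the score can be written as a small function. The name pmi_like_score and its parameter names are hypothetical, and math.log2(x) is used here in place of math.log(x, 2); the three arguments are the document counts that the question calls bigramLength, word0Length and word1Length.

```python
import math


def pmi_like_score(bigram_docs, word0_docs, word1_docs):
    """Score from the question: 4 * log2(b) / (log2(w0) * log2(w1)).

    Each argument is the number of documents containing the bigram,
    its first word, and its second word respectively.
    """
    numerator = 4.0 * math.log2(bigram_docs)
    denominator = math.log2(word0_docs) * math.log2(word1_docs)
    # Raises ZeroDivisionError when either word occurs in exactly one document.
    return numerator / denominator


# Bigram in 2 documents, each word in 4 documents: 4 * 1 / (2 * 2)
print(pmi_like_score(2, 4, 4))  # 1.0
```

Note that the denominator is zero whenever one of the words appears in exactly one document, which is presumably why the loop in the question wraps the computation in a try block.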

Is there a faster way to achieve this? I am not very familiar with numpy or {}.

Please note that I need the value for every possible bigram in the list featureBigrams

f4 = ['this is sentence1','not sentence1 becuase this is not sentence1','why this this is called this is sentence1, its always setence1','fourth time this is not sentene1']


from sklearn.feature_extraction.text import CountVectorizer

Vcount = CountVectorizer(analyzer='word',ngram_range=(1,2),stop_words='english')
countMatrix = Vcount.fit_transform(f4)

# all unigrams and bigrams
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_names = Vcount.get_feature_names_out().tolist()

# finding all bigrams
featureBigrams = [item for item in feature_names if len(item.split()) == 2]

#document term matrix
arrays = countMatrix.toarray()

#term document matrix
arrayTrans = arrays.transpose()

from collections import defaultdict
import math
import numpy

print(len(featureBigrams))
# nested dict: PMIMatrix[first word][second word] = score
PMIMatrix = defaultdict(dict)
for item in featureBigrams:
    words = item.split()
    bigramLength = len(numpy.where(arrayTrans[feature_names.index(item)] > 0)[0])
    if bigramLength < 2:
        continue
    word0Length = len(numpy.where(arrayTrans[feature_names.index(words[0])] > 0)[0])
    word1Length = len(numpy.where(arrayTrans[feature_names.index(words[1])] > 0)[0])
    try:
        PMIMatrix[words[0]][words[1]] = (4.0 * math.log(1.0*bigramLength,2))/(1.0*math.log(word0Length,2)*math.log(1.0*word1Length,2))
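    except (ValueError, ZeroDivisionError):
        # log of zero, or a zero denominator (a word found in only one document)
        print(bigramLength, word0Length, word1Length)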
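One possible way to speed up the loop above, offered only as a sketch: count the documents containing every feature once with a single numpy reduction, and replace the repeated feature_names.index(...) scans with a dictionary lookup. The function name pmi_scores and the parameter min_bigram_docs are hypothetical. The sketch assumes counts is a dense document-term array such as arrays above, and that every word of a bigram also appears in feature_names as a unigram, which holds for ngram_range=(1,2) with no vocabulary pruning.

```python
import math
from collections import defaultdict

import numpy as np


def pmi_scores(counts, feature_names, min_bigram_docs=2):
    """Return {word0: {word1: score}} using the formula from the question."""
    counts = np.asarray(counts)
    # Number of documents in which each feature occurs at least once.
    doc_freq = (counts > 0).sum(axis=0)
    # O(1) lookup of a feature's column instead of list.index().
    index = {name: i for i, name in enumerate(feature_names)}

    scores = defaultdict(dict)
    for name, i in index.items():
        words = name.split()
        if len(words) != 2 or doc_freq[i] < min_bigram_docs:
            continue
        w0, w1 = words
        denom = math.log2(doc_freq[index[w0]]) * math.log2(doc_freq[index[w1]])
        if denom == 0:
            # A word seen in a single document gives log2(1) == 0; skip it.
            continue
        scores[w0][w1] = 4.0 * math.log2(doc_freq[i]) / denom
    return scores


# Tiny hand-made example: 4 documents, features "a", "a b", "b", "c".
example_names = ["a", "a b", "b", "c"]
example_counts = [[1, 1, 1, 0],
                  [2, 1, 1, 1],
                  [1, 0, 0, 1],
                  [0, 0, 1, 0]]
example_scores = pmi_scores(example_counts, example_names)
```

With the variables from the question, the call would be pmi_scores(arrays, feature_names). The per-bigram work then reduces to two dictionary lookups and a few logarithms, instead of three list scans and three numpy.where calls.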
