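First, I build a term-document matrix, i.e. every term is represented as a vector whose dimension is the number of documents.
To compute the PMI, I take the number of documents containing a bigram, e.g. this is, and the number of documents containing each of its two words, this and is, and then calculate

(4.0 * math.log(1.0*bigramLength,2))/(1.0*math.log(word0Length,2)*math.log(1.0*word1Length,2))

Is there a faster way to achieve this? I am not very familiar with numpy.

Note that I need this value for every possible bigram in the list featureBigrams.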
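To make the formula concrete, here is a minimal sketch of it as a standalone function. The name bigram_score and its argument names are hypothetical; the three arguments stand for the number of documents containing the bigram and each of its two words. The denominator is zero whenever one of the words occurs in exactly one document, because log2(1) = 0, so this sketch returns None in that case instead of raising.

```python
import math


def bigram_score(bigram_df, word0_df, word1_df):
    """Score from the question: 4*log2(bigram_df) / (log2(word0_df) * log2(word1_df)).

    Hypothetical helper. Each argument is a document frequency (number of
    documents in which the bigram or word occurs). Returns None when the
    score is undefined.
    """
    # log2(0) is undefined, and log2(1) == 0 would zero the denominator.
    if bigram_df < 1 or word0_df < 2 or word1_df < 2:
        return None
    return 4.0 * math.log2(bigram_df) / (math.log2(word0_df) * math.log2(word1_df))


print(bigram_score(2, 4, 4))  # 4 * 1 / (2 * 2) = 1.0
print(bigram_score(2, 1, 4))  # None: the first word occurs in only one document
```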
from sklearn.feature_extraction.text import CountVectorizer

f4 = ['this is sentence1','not sentence1 becuase this is not sentence1','why this this is called this is sentence1, its always setence1','fourth time this is not sentene1']
Vcount = CountVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
countMatrix = Vcount.fit_transform(f4)
# all unigrams and bigrams
# (scikit-learn versions before 1.0 use get_feature_names() instead)
feature_names = Vcount.get_feature_names_out().tolist()
# all bigrams
featureBigrams = [item for item in feature_names if len(item.split()) == 2]
# document-term matrix
arrays = countMatrix.toarray()
# term-document matrix
arrayTrans = arrays.transpose()
from collections import defaultdict
import math
import numpy

print(len(featureBigrams))
PMIMatrix = defaultdict(dict)
for item in featureBigrams:
    words = item.split()
    # number of documents that contain the bigram
    bigramLength = len(numpy.where(arrayTrans[feature_names.index(item)] > 0)[0])
    if bigramLength < 2:
        continue
    # number of documents that contain each word of the bigram
    word0Length = len(numpy.where(arrayTrans[feature_names.index(words[0])] > 0)[0])
    word1Length = len(numpy.where(arrayTrans[feature_names.index(words[1])] > 0)[0])
    try:
        PMIMatrix[words[0]][words[1]] = (4.0 * math.log(1.0*bigramLength,2))/(1.0*math.log(word0Length,2)*math.log(1.0*word1Length,2))
    except (ValueError, ZeroDivisionError):
        # log of a zero count, or a zero denominator because log2(1) == 0
        print(bigramLength, word0Length, word1Length)
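
One possible way to speed this up, offered as a sketch rather than a definitive answer: the loop above calls feature_names.index() three times per bigram, each a linear scan of the list, and recounts the nonzero entries of a row every time. Both can be done once up front. The function name bigram_scores and the parameter min_bigram_df are hypothetical, and the sketch assumes a dense document-term array (such as arrays above) whose columns line up with feature_names. The tiny matrix at the end is made-up data used only to exercise the function.

```python
import numpy as np


def bigram_scores(counts, feature_names, min_bigram_df=2):
    """Compute the question's score for every bigram in feature_names.

    counts: dense (n_docs, n_features) array of term counts.
    feature_names: list of strings aligned with the columns of counts.
    Returns a nested dict: scores[word0][word1] = score.
    """
    counts = np.asarray(counts)
    # Document frequency of every feature in a single vectorized pass.
    df = (counts > 0).sum(axis=0)
    # A dict lookup replaces the repeated linear list.index() scans.
    index = {name: i for i, name in enumerate(feature_names)}

    scores = {}
    for name, i in index.items():
        words = name.split()
        if len(words) != 2 or df[i] < min_bigram_df:
            continue
        if words[0] not in index or words[1] not in index:
            continue  # a unigram is missing from the vocabulary
        d0, d1 = df[index[words[0]]], df[index[words[1]]]
        if d0 < 2 or d1 < 2:
            continue  # log2(1) == 0 would make the denominator zero
        score = 4.0 * np.log2(df[i]) / (np.log2(d0) * np.log2(d1))
        scores.setdefault(words[0], {})[words[1]] = float(score)
    return scores


# Made-up example: 'a' and 'b' each occur in 4 documents, 'a b' in 2.
example_names = ['a', 'a b', 'b']
example_counts = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
]
example_scores = bigram_scores(example_counts, example_names)
print(example_scores)  # {'a': {'b': 1.0}}
```

With the variables from the question, the call would be bigram_scores(arrays, feature_names). For a large vocabulary the dense array itself may become the bottleneck, in which case computing the document frequencies directly on the sparse matrix would be the next thing to try.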