Answers to the question: Kneser-Ney smoothing of trigrams using Python NLTK


1 answer

Anonymous, 1 day ago


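I think you are misunderstanding what Kneser-Ney is computing. From <a href="https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing" rel="noreferrer">Wikipedia</a>:

<blockquote> The normalizing constant λ<sub>w<sub>i-1</sub></sub> has value chosen carefully to make the sum of conditional probabilities p<sub>KN</sub>(w<sub>i</sub>|w<sub>i-1</sub>) equal to one. </blockquote>

Of course, we are talking about bigrams here, but the same principle holds for higher-order models. What the quote means is that, for a fixed context w<sub>i-1</sub> (or a longer context for higher-order models), the probabilities of all w<sub>i</sub> must add up to one. When you add up the probabilities of all samples, you are including multiple contexts, which is why you end up with a "probability" greater than 1. If you keep the context fixed, as in the following code sample, you end up with a number &lt;= 1.

<pre> <code>
import nltk
from nltk.util import ngrams
from nltk.corpus import gutenberg

# Build padded word trigrams over every sentence in the Gutenberg corpus.
gut_ngrams = (
    ngram
    for sent in gutenberg.sents()
    for ngram in ngrams(sent, 3, pad_left=True, pad_right=True,
                        right_pad_symbol='EOS', left_pad_symbol="BOS"))
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

# Sum the probabilities for one fixed context: ("I", "confess").
prob_sum = 0
for i in kneser_ney.samples():
    if i[0] == "I" and i[1] == "confess":
        prob_sum += kneser_ney.prob(i)
        print("{0}:{1}".format(i, kneser_ney.prob(i)))
print(prob_sum)
</code> </pre>

The output, based on the NLTK Gutenberg corpus subset, is as follows.

<pre><code>
(u'I', u'confess', u'.--'):0.00657894736842
(u'I', u'confess', u'what'):0.00657894736842
(u'I', u'confess', u'myself'):0.00657894736842
(u'I', u'confess', u'also'):0.00657894736842
(u'I', u'confess', u'there'):0.00657894736842
(u'I', u'confess', u',"'):0.0328947368421
(u'I', u'confess', u'that'):0.164473684211
(u'I', u'confess', u'"--'):0.00657894736842
(u'I', u'confess', u'it'):0.0328947368421
(u'I', u'confess', u';'):0.00657894736842
(u'I', u'confess', u','):0.269736842105
(u'I', u'confess', u'I'):0.164473684211
(u'I', u'confess', u'unto'):0.00657894736842
(u'I', u'confess', u'is'):0.00657894736842
0.723684210526
</code></pre>

The reason this sum (.72) is less than 1 is that the probability is calculated only over trigrams that appear in the corpus with "I" as the first word and "confess" as the second. The remaining .28 of the probability is reserved for w<sub>i</sub>s that do not follow "I" and "confess" in the corpus. This is the whole point of smoothing: to reallocate some probability mass from the n-grams that appear in the corpus to those that do not, so that you do not end up with a bunch of zero-probability n-grams.

Also, doesn't the line

<pre><code>
ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!")
</code></pre>

compute character trigrams? I think the text needs to be tokenized first in order to compute word trigrams.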
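Returning to the .72 sum for the fixed context ("I", "confess"): the arithmetic can be checked by hand with a small sketch. This is not NLTK code and not part of the original answer. It assumes that each seen trigram gets probability (count - discount) / context count, with a discount of 0.75 (believed to be the default for `KneserNeyProbDist`, but treat that as an assumption). The counts below are not reported in the post; they are inferred from the printed probabilities under that assumption.

```python
# Assumed absolute discount (believed to be NLTK's default; an assumption here).
DISCOUNT = 0.75

# Hypothetical counts of each word seen after ("I", "confess").
# These are inferred from the printed probabilities, not taken from the post.
counts = {
    ".--": 1, "what": 1, "myself": 1, "also": 1, "there": 1,
    ',"': 2, "that": 7, '"--': 1, "it": 2, ";": 1,
    ",": 11, "I": 7, "unto": 1, "is": 1,
}

# Number of times the context ("I", "confess") occurs in total.
context_count = sum(counts.values())

# Discounted probability of every continuation seen in the corpus.
seen_probs = {w: (c - DISCOUNT) / context_count for w, c in counts.items()}
seen_mass = sum(seen_probs.values())

# Mass removed by discounting: one DISCOUNT per distinct seen continuation.
# This is what remains available for continuations not seen in the corpus.
reserved_mass = DISCOUNT * len(counts) / context_count

print(round(seen_mass, 4), round(reserved_mass, 4))
```

Under these assumptions the seen continuations sum to about .72 and the reserved mass is about .28, so the two parts together account for the whole distribution for this one context.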
