<p>I think you are misunderstanding what Kneser-Ney is computing.</p>
<p>From <a href="https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing" rel="noreferrer">Wikipedia</a>:</p>
<blockquote>
The normalizing constant λ<sub>w<sub>i-1</sub></sub>
has value chosen carefully to make the sum of conditional probabilities p<sub>KN</sub>(w<sub>i</sub>|w<sub>i-1</sub>) equal to one.
</blockquote>
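<p>Granted, the quote is talking about bigrams, but the same principle holds for higher-order models. What it says is that for a fixed context w<sub>i-1</sub> (or a longer context in a higher-order model), the probabilities of all possible w<sub>i</sub> must sum to one. When you add up the probabilities of all samples, you are summing over many different contexts, which is why you end up with a "probability" greater than 1. If you hold the context fixed, as in the following code example, you end up with a number &lt;= 1.</p>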
<pre>
<code>
import nltk
from nltk.util import ngrams
from nltk.corpus import gutenberg

# Padded trigrams from every sentence in the Gutenberg corpus
gut_ngrams = (
    ngram
    for sent in gutenberg.sents()
    for ngram in ngrams(sent, 3, pad_left=True, pad_right=True,
                        left_pad_symbol="BOS", right_pad_symbol="EOS")
)
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

# Sum the probabilities for one fixed context: ("I", "confess")
prob_sum = 0
for i in kneser_ney.samples():
    if i[0] == "I" and i[1] == "confess":
        prob_sum += kneser_ney.prob(i)
        print("{0}:{1}".format(i, kneser_ney.prob(i)))
print(prob_sum)
</code>
</pre>
<p>The output, based on the NLTK Gutenberg corpus subset, is as follows.</p>
<pre><code>
(u'I', u'confess', u'.--'):0.00657894736842
(u'I', u'confess', u'what'):0.00657894736842
(u'I', u'confess', u'myself'):0.00657894736842
(u'I', u'confess', u'also'):0.00657894736842
(u'I', u'confess', u'there'):0.00657894736842
(u'I', u'confess', u',"'):0.0328947368421
(u'I', u'confess', u'that'):0.164473684211
(u'I', u'confess', u'"--'):0.00657894736842
(u'I', u'confess', u'it'):0.0328947368421
(u'I', u'confess', u';'):0.00657894736842
(u'I', u'confess', u','):0.269736842105
(u'I', u'confess', u'I'):0.164473684211
(u'I', u'confess', u'unto'):0.00657894736842
(u'I', u'confess', u'is'):0.00657894736842
0.723684210526
</code></pre>
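<p>The sum (.72) is less than 1 because the loop only covers trigrams that actually occur in the corpus with "I" as the first word and "confess" as the second. The remaining .28 of the probability is reserved for the w<sub>i</sub>s that never follow "I" and "confess" in the corpus. That is the whole point of smoothing: to reallocate some probability mass from n-grams that appear in the corpus to those that don't, so that you don't end up with a bunch of zero-probability n-grams.</p>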
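<p>As a sanity check on those numbers, here is a minimal sketch that reproduces the .72 and the .28 by hand. The counts are not read from the corpus; I inferred them from the printed probabilities, assuming the estimate for a seen trigram is (count - discount) / context count with NLTK's default discount of 0.75. The names <code>continuation_counts</code> and <code>seen_prob</code> are made up for illustration.</p>

```python
# Assumption: discount = 0.75, the default of nltk.KneserNeyProbDist.
DISCOUNT = 0.75

# Trigram counts for the context ("I", "confess"), inferred from the
# printed probabilities (e.g. 0.00657894736842 == (1 - 0.75) / 38).
continuation_counts = {
    ".--": 1, "what": 1, "myself": 1, "also": 1, "there": 1,
    ',"': 2, "that": 7, '"--': 1, "it": 2, ";": 1,
    ",": 11, "I": 7, "unto": 1, "is": 1,
}

# How often the context ("I", "confess") occurs in total.
context_count = sum(continuation_counts.values())


def seen_prob(count, context_count, discount=DISCOUNT):
    """Discounted estimate for a trigram that was observed in the corpus."""
    return (count - discount) / context_count


# Mass assigned to continuations that were actually observed.
seen_mass = sum(seen_prob(c, context_count) for c in continuation_counts.values())

# Mass withheld by the discount: one discount per distinct continuation.
reserved_mass = DISCOUNT * len(continuation_counts) / context_count

print(context_count)
print(seen_mass)
print(reserved_mass)
```

<p>Under these assumptions each of the 14 observed continuations gives up 0.75 of a count, so 0.75 * 14 / 38 of the mass is left over for continuations that were never seen after "I confess", and the seen and reserved parts add up to one for this single context.</p>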
<p>Also, doesn't</p>
<pre><code>
ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!")
</code></pre>
<p>compute character trigrams? I think the string needs to be tokenized first in order to compute word trigrams.</p>
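<p>To illustrate the difference, here is a small sketch that does not depend on NLTK. The helper <code>trigrams</code> is a hypothetical stand-in written for this example, and <code>str.split()</code> is only a crude substitute for a real tokenizer. Iterating over a string yields characters, so the tokenization step decides whether you get character or word trigrams.</p>

```python
def trigrams(sequence):
    """Return every run of three consecutive items in the sequence."""
    items = list(sequence)
    return list(zip(items, items[1:], items[2:]))


text = "What a piece of work is man! how noble in reason!"

# A raw string is a sequence of characters, so these are character trigrams.
char_trigrams = trigrams(text)

# Splitting on whitespace first (a rough assumption for tokenization)
# gives trigrams over words instead.
word_trigrams = trigrams(text.split())

print(char_trigrams[0])
print(word_trigrams[0])
```

<p>In practice you would probably want a proper tokenizer rather than a whitespace split, since the split leaves punctuation attached to words ("man!").</p>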