Ngram model and perplexity in NLTK

    import nltk

    print "... build"
    brown = nltk.corpus.brown
    corpus = [word.lower() for word in brown.words()]

    # Train on 95% of the corpus and test on the rest
    spl = 95*len(corpus)/100
    train = corpus[:spl]
    test = corpus[spl:]

    # Remove rare words (fewer than 5 occurrences) from the corpus
    fdist = nltk.FreqDist(w for w in train)
    vocabulary = set(map(lambda x: x[0], filter(lambda x: x[1] >= 5, fdist.iteritems())))

    # Map out-of-vocabulary words to a single placeholder token
    train = map(lambda x: x if x in vocabulary else "*unknown*", train)
    test = map(lambda x: x if x in vocabulary else "*unknown*", test)

    print "... train"
    from nltk.model import NgramModel
    from nltk.probability import LidstoneProbDist

    # Lidstone smoothing with gamma = 0.2, applied to a 5-gram model
    estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
    lm = NgramModel(5, train, estimator=estimator)

    print "len(corpus) = %s, len(vocabulary) = %s, len(train) = %s, len(test) = %s" % ( len(corpus), len(vocabulary), len(train), len(test) )
    print "perplexity(test) =", lm.perplexity(test)

1 Answer

User

Reply #1 · Posted 2024-05-19 01:04:50

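You are getting a low perplexity because you are using a pentagram (5-gram) model. If you used a bigram model, your results would fall in the more usual range of about 50-1000 (or about 5 to 10 bits).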
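The "bits" figure is just the base-2 logarithm of the perplexity: log2(50) is about 5.6 and log2(1000) is about 10. As a rough illustration of what a bigram perplexity computation looks like, here is a minimal sketch in plain Python 3 that does not depend on `nltk.model`. The function names (`train_bigram`, `lidstone_prob`, `perplexity`) and the toy sentences are made up for this example, and this is not NLTK's implementation; it only assumes Lidstone smoothing with gamma = 0.2, as in the question.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count bigrams and the contexts (first words) they start from."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return contexts, bigrams


def lidstone_prob(prev, word, contexts, bigrams, vocab_size, gamma=0.2):
    """P(word | prev) with Lidstone (add-gamma) smoothing."""
    return (bigrams[(prev, word)] + gamma) / (contexts[prev] + gamma * vocab_size)


def perplexity(tokens, contexts, bigrams, vocab_size, gamma=0.2):
    """Return (perplexity, cross-entropy in bits per word) over the bigrams of tokens."""
    log2_sum = 0.0
    count = 0
    for prev, word in zip(tokens, tokens[1:]):
        p = lidstone_prob(prev, word, contexts, bigrams, vocab_size, gamma)
        log2_sum += math.log2(p)
        count += 1
    bits = -log2_sum / count
    return 2.0 ** bits, bits


# Toy data, purely illustrative (hypothetical, not the Brown corpus).
train_tokens = "the cat sat on the mat the cat ate".split()
test_tokens = "the cat sat on the mat".split()
vocab = set(train_tokens)

contexts, bigrams = train_bigram(train_tokens)
ppl, bits = perplexity(test_tokens, contexts, bigrams, len(vocab))
print("perplexity = %.3f, bits per word = %.3f" % (ppl, bits))
```

On a real corpus you would build the vocabulary the same way as in the question (replacing rare words with "*unknown*") so that every test word has a smoothed probability under the model.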

Given your comments, are you using NLTK-3.0alpha? You shouldn't, at least not for language modeling:

https://github.com/nltk/nltk/issues?labels=model

As a matter of fact, the whole model module has been dropped from the NLTK-3.0a4 pre-release until the issues are fixed.
