from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
    print(grams)
In [34]: sentence = "I really like python, it's pretty awesome.".split()
In [35]: N = 4
In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]
In [37]: for gram in grams: print(gram)
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']
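The sliding-window slices above can be wrapped in a small reusable helper. A minimal sketch (the function name `ngrams_list` is my own, not from the answer above):

```python
def ngrams_list(tokens, n):
    """Return all contiguous n-grams of `tokens` as lists, via slicing."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

words = "I really like python, it's pretty awesome.".split()
for gram in ngrams_list(words, 4):
    print(gram)
```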
Other users have given excellent pure-Python answers, but here is the nltk approach (just in case the OP gets penalized for reinventing something that already exists in the nltk library). There is an ngram module in nltk that people seldom use. It's not that reading ngrams is hard, but training a model on ngrams where n > 3 will result in a lot of data sparsity. I'm surprised this hasn't come up yet:
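The sparsity point is easy to see by counting n-grams with the standard library. A minimal sketch (the helper name `ngram_counts` and the sample sentence are my own, for illustration): as n grows, almost every n-gram in a small corpus occurs exactly once.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in `tokens`."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

tokens = 'to be or not to be that is the question'.split()
for n in (2, 4, 6):
    counts = ngram_counts(tokens, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    # number of distinct n-grams vs. how many occur only once
    print(n, len(counts), singletons)
```

At n = 2 the bigram ('to', 'be') repeats, but by n = 4 every n-gram in this sentence is already unique, which is the data-sparsity problem the answer refers to.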
Here is another simple way to do n-grams:
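The code for this answer did not survive extraction. A common pure-Python approach in this style zips n shifted copies of the token list; this is a sketch, not necessarily the original answer's code:

```python
def ngrams_zip(tokens, n):
    """n-grams via zip over n shifted copies of the token list;
    zip stops at the shortest slice, so no bounds check is needed."""
    return list(zip(*(tokens[i:] for i in range(n))))

print(ngrams_zip('this is a foo bar'.split(), 3))
# → [('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar')]
```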