Unable to import bigrams from the NLTK library

#!/usr/bin/env python
from nltk.corpus import brown


def alter_list(row):
    # Wrap a sentence in <s> ... </s> markers; a final '.' is replaced by </s>.
    row = list(row)
    if row and row[-1] == '.':
        row[-1] = '</s>'
    else:
        row.append('</s>')
    return ['<s>'] + row


news = brown.sents(categories='editorial')
print(len(news), '\n')
for row in news:
    print(alter_list(row))

1 Answer

User

#1 · Posted 2024-06-30 15:49:31

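I tested it in a virtualenv and it works. Note that in NLTK 3 and later, bigrams returns a generator, so wrap it in list() to display the pairs:

In [20]: from nltk import bigrams

In [21]: list(bigrams('This is a test'))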
Out[21]: 
[('T', 'h'),
 ('h', 'i'),
 ('i', 's'),
 ('s', ' '),
 (' ', 'i'),
 ('i', 's'),
 ('s', ' '),
 (' ', 'a'),
 ('a', ' '),
 (' ', 't'),
 ('t', 'e'),
 ('e', 's'),
 ('s', 't')]

Is this the only error you are getting?
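If the import itself is what fails, one thing worth ruling out is that Python is picking up something other than the installed NLTK package, for example a local file named nltk.py. This is only a guess, since the traceback is not shown. A minimal sketch using the standard library to see where a module would be loaded from (locate_module is a hypothetical helper name):

```python
import importlib.util


def locate_module(name):
    """Return the file Python would import `name` from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# None means the package is not installed in this environment; a path inside
# your own project directory suggests a local file is shadowing the real package.
print(locate_module("nltk"))
```

If the printed path points into site-packages of the virtualenv you expect, the import problem lies elsewhere.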

By the way, regarding your second question:

from collections import Counter
In [44]: b = bigrams('This is a test')

In [45]: Counter(b)
Out[45]: Counter({('i', 's'): 2, ('s', ' '): 2, ('a', ' '): 1, (' ', 't'): 1, ('e', 's'): 1, ('h', 'i'): 1, ('t', 'e'): 1, ('T', 'h'): 1, (' ', 'i'): 1, (' ', 'a'): 1, ('s', 't'): 1})

With words instead of characters:

In [49]: b = list(bigrams("This is a test".split(' ')))

In [50]: b
Out[50]: [('This', 'is'), ('is', 'a'), ('a', 'test')]

In [51]: Counter(b)
Out[51]: Counter({('is', 'a'): 1, ('a', 'test'): 1, ('This', 'is'): 1})
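The difference between the two runs above comes purely from what is iterated: a string yields characters, a list of words yields words. As a rough illustration, here is a pure-Python sketch of the same pairing idea; ngrams_sketch is a hypothetical helper written for this answer, not NLTK's implementation.

```python
from collections import Counter


def ngrams_sketch(sequence, n):
    """Return an iterator over consecutive n-item tuples of `sequence`."""
    items = list(sequence)
    # n staggered slices of the same list, zipped together position by position.
    return zip(*(items[i:] for i in range(n)))


char_counts = Counter(ngrams_sketch("This is a test", 2))
word_pairs = list(ngrams_sketch("This is a test".split(" "), 2))

print(char_counts[("i", "s")])
print(word_pairs)
```

Passing the string gives character pairs, while passing the split list gives word pairs, which matches the two transcripts above.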

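Splitting on spaces like this is obviously very crude, but depending on your application it may be enough. You can of course use NLTK's tokenize module instead, which is considerably more sophisticated.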
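To show what the crude split misses, here is a small standard-library sketch that separates punctuation from words with a regular expression. simple_tokenize is a hypothetical helper, and the pattern is only an approximation; NLTK's own tokenizers (for example nltk.tokenize.wordpunct_tokenize or word_tokenize) cover many more cases.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


sentence = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."

# Splitting on spaces leaves the full stop glued to the last word.
print(sentence.split(" ")[-1])

# The regex version emits the word and the full stop as separate tokens.
print(simple_tokenize(sentence)[-2:])
```

With the space split, 'industry.' and 'industry' count as different words, which skews n-gram counts; separating punctuation avoids that.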

To get to your end goal, you can do something like this:

In [56]: d = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

In [56]: from nltk import trigrams
In [57]: tri = trigrams(d.split(' '))

In [60]: counter = Counter(tri)

In [61]: import random

In [62]: random.sample(list(counter), 5)
Out[62]: 
[('Ipsum', 'has', 'been'),
 ('industry.', 'Lorem', 'Ipsum'),
 ('Ipsum', 'passages,', 'and'),
 ('was', 'popularised', 'in'),
 ('galley', 'of', 'type')]

I trimmed the output because it was unnecessarily large, but you get the idea.
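For reference, the same steps can be sketched as a plain script for Python 3, using only the standard library and a shortened version of the sample text. The variable names and the fixed seed are choices made for this example; the zip call stands in for trigrams and is not NLTK's code.

```python
import random
from collections import Counter

text = ("Lorem Ipsum is simply dummy text of the printing and typesetting industry. "
        "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.")

tokens = text.split(" ")

# Three staggered views of the token list, zipped into consecutive triples.
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

# Most frequent trigrams first; in a text this short every trigram occurs once.
print(trigram_counts.most_common(3))

# random.sample needs a sequence in Python 3, so turn the keys into a list first.
rng = random.Random(0)  # fixed seed only so that repeated runs match
print(rng.sample(list(trigram_counts), 5))
```

On a larger corpus, most_common is usually more informative than a random sample, since it surfaces the trigrams that actually repeat.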
