Cannot import bigrams from the NLTK library

Posted 2024-06-30 15:49:31


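A quick question that is puzzling me. I have NLTK installed and it has been working fine. However, I am trying to get the bigrams of a corpus and basically want to call bigrams(corpus), but when I run "from nltk import bigrams" it says that bigrams is not defined.

The same happens with trigrams. Am I missing something? Also, how can I get the bigrams from a corpus manually?

I also want to compute the frequencies of bigrams, trigrams and four-grams, but I am not sure exactly how to go about it.

I have already marked up the corpus with "<s>" and "</s>" at the beginning and end of each sentence. The program so far is: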

#!/usr/bin/env python
from nltk.corpus import brown

def alter_list(row):
    """Wrap a tokenized sentence in <s> ... </s> boundary markers."""
    row = list(row)  # copy, so the corpus sentence itself is not modified
    if row[-1] == '.':
        row[-1] = '</s>'  # a final period is replaced by the end marker
    else:
        row.append('</s>')
    return ['<s>'] + row

news = brown.sents(categories='editorial')
print(len(news), '\n')

for row in news:
    print(alter_list(row))

1 answer
User
#1 · Posted 2024-06-30 15:49:31

I tested it in a virtualenv and it works:

In [20]: from nltk import bigrams

In [21]: list(bigrams('This is a test'))  # list() is needed on NLTK 3, where bigrams returns a generator
Out[21]: 
[('T', 'h'),
 ('h', 'i'),
 ('i', 's'),
 ('s', ' '),
 (' ', 'i'),
 ('i', 's'),
 ('s', ' '),
 (' ', 'a'),
 ('a', ' '),
 (' ', 't'),
 ('t', 'e'),
 ('e', 's'),
 ('s', 't')]

Is that the only error you get?
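If the import still fails, one thing worth checking is which file Python actually loads for the name nltk. The sketch below is only a diagnostic idea, not something from the original question: locate_module is a hypothetical helper built on the standard library's importlib.util.find_spec, and the guess that a local file named nltk.py could be shadowing the installed package is an assumption.

```python
import importlib.util

def locate_module(name):
    """Return the file Python would load for `name`, or None if it cannot be found.

    Hypothetical helper for diagnosis only; it does not import the module.
    """
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

# If this prints a path inside your own project (for example ./nltk.py)
# rather than site-packages, a local file is shadowing the real library.
print(locate_module("nltk"))
```

A result of None would mean that NLTK is not installed in the interpreter you are running.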

By the way, regarding your second question:

In [43]: from collections import Counter
In [44]: b = bigrams('This is a test')

In [45]: Counter(b)
Out[45]: Counter({('i', 's'): 2, ('s', ' '): 2, ('a', ' '): 1, (' ', 't'): 1, ('e', 's'): 1, ('h', 'i'): 1, ('t', 'e'): 1, ('T', 'h'): 1, (' ', 'i'): 1, (' ', 'a'): 1, ('s', 't'): 1})

With words:

In [49]: b = list(bigrams("This is a test".split(' ')))

In [50]: b
Out[50]: [('This', 'is'), ('is', 'a'), ('a', 'test')]

In [51]: Counter(b)
Out[51]: Counter({('is', 'a'): 1, ('a', 'test'): 1, ('This', 'is'): 1})

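Splitting on spaces like this is obviously very crude, but depending on your application it may be enough. Otherwise you can use NLTK's tokenizers, which are considerably more sophisticated.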
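To get the n-grams "manually", without relying on NLTK at all, something along these lines might work. It is a minimal sketch: simple_tokenize and ngrams_manual are hypothetical names, and the regular expression is just one rough way to separate words from punctuation, not a replacement for a real tokenizer.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks (rough heuristic)."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def ngrams_manual(tokens, n):
    """Return all n-grams over `tokens` as a list of tuples."""
    tokens = list(tokens)
    # zip over n shifted copies of the token list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = simple_tokenize("This is a test.")
print(tokens)
print(ngrams_manual(tokens, 2))
print(ngrams_manual(tokens, 3))
```

Because zip stops at the shortest input, a sequence shorter than n simply yields an empty list.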

To get to your end goal, you could do something like this:

In [56]: d = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

In [56]: from nltk import trigrams
In [57]: tri = trigrams(d.split(' '))

In [60]: counter = Counter(tri)

In [61]: import random

In [62]: random.sample(list(counter), 5)  # Python 3 requires a sequence here, not a dict
Out[62]: 
[('Ipsum', 'has', 'been'),
 ('industry.', 'Lorem', 'Ipsum'),
 ('Ipsum', 'passages,', 'and'),
 ('was', 'popularised', 'in'),
 ('galley', 'of', 'type')]

I trimmed the output because it is unnecessarily large, but you get the idea.
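For the frequency counts over sentences that carry "<s>" and "</s>" markers, the same Counter idea can be applied sentence by sentence so that n-grams never cross a sentence boundary. This is a sketch under assumptions: pad_sentence and ngram_counts are hypothetical helpers, the two toy sentences stand in for the real corpus, and the padding here simply adds the markers instead of replacing a final period as the question's alter_list does.

```python
from collections import Counter

def pad_sentence(tokens):
    """Add sentence boundary markers around one tokenized sentence."""
    return ['<s>'] + list(tokens) + ['</s>']

def ngram_counts(sentences, n):
    """Count n-grams per sentence, so no n-gram spans two sentences."""
    counts = Counter()
    for tokens in sentences:
        padded = pad_sentence(tokens)
        counts.update(zip(*(padded[i:] for i in range(n))))
    return counts

# Toy data standing in for the corpus (an assumption for illustration).
sentences = [["the", "cat", "sat"], ["the", "cat", "ran"]]

for n in (2, 3, 4):
    print(n, ngram_counts(sentences, n).most_common(2))
```

With the Brown corpus you would pass brown.sents(categories='editorial') as sentences; most_common(k) then gives the k most frequent n-grams.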
