Python NLTK: counting word and phrase frequencies

from nltk.util import ngrams
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.collocations import *

data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a", "test", "this", "is", "this", "is", "real", "not", "a", "test"]

bigrams = ngrams(data, 2)

bigrams_c = {}
for b in bigrams:
    if b not in bigrams_c:
        bigrams_c[b] = 1
    else:
        bigrams_c[b] += 1

2 answers

User

#1 · edited 2024-05-17 08:09:19

Since you tagged this question nltk, here is how to do it with nltk's own tools, which offer more functionality than the ones in Python's standard collections.

from nltk import ngrams, FreqDist
all_counts = dict()
for size in 2, 3, 4, 5:
    all_counts[size] = FreqDist(ngrams(data, size))

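Each value in the dictionary is a frequency distribution (FreqDist) of the n-grams of that size. For example, you can get the five most common trigrams like this: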

all_counts[3].most_common(5)
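As a rough illustration of the extra functionality mentioned above, the sketch below rebuilds `all_counts` from the question's `data` list and queries the trigram distribution. It assumes nltk is installed; `N()` and `freq()` are FreqDist methods, and the chosen trigram is just an example.

```python
from nltk import FreqDist, ngrams

data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a",
        "test", "this", "is", "this", "is", "real", "not", "a", "test"]

# One FreqDist per n-gram size, keyed by that size.
all_counts = {size: FreqDist(ngrams(data, size)) for size in (2, 3, 4, 5)}

trigrams = all_counts[3]
print(trigrams.N())                         # total number of trigrams counted
print(trigrams[("not", "a", "test")])       # raw count of one trigram
print(trigrams.freq(("not", "a", "test")))  # relative frequency: count / N
print(trigrams.most_common(1))              # the single most frequent trigram
```

Because FreqDist builds on `collections.Counter`, looking up an n-gram that never occurs returns 0 rather than raising a KeyError.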

User

#2 · edited 2024-05-17 08:09:19

Right, don't write that loop by hand. Use collections.Counter(bigrams) or pandas.Series(bigrams).value_counts() to get the counts in a single line.
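A minimal sketch of the Counter approach, using only the standard library. Here `zip(data, data[1:])` is an assumed stand-in for `nltk.util.ngrams(data, 2)`; both yield each bigram once and are used up after a single pass, so the counting has to replace the manual loop rather than follow it.

```python
from collections import Counter

data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a",
        "test", "this", "is", "this", "is", "real", "not", "a", "test"]

# Stand-in for nltk.util.ngrams(data, 2): pair each token with its successor.
bigrams = zip(data, data[1:])

# Counter consumes the iterator and tallies each bigram in one pass.
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(3))

# The iterator is now exhausted, so counting it a second time finds nothing.
leftover = Counter(bigrams)
print(len(leftover))
```

If the bigrams are needed more than once, materialize them first with `list(...)` and pass that list to Counter.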
