Python list of n-grams with frequencies

import nltk

# `words` (a token list) and `filter_stops` (a predicate) are defined elsewhere
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)
finder.apply_word_filter(filter_stops)
matches1 = finder.nbest(bigram_measures.pmi, 20)

4 Answers

User

#1 · Edited 2024-10-02 12:33:19

Update

Since scikit-learn 0.14, the format has changed to:

n_grams = CountVectorizer(ngram_range=(1, 5))

Full example:

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

from sklearn.feature_extraction.text import CountVectorizer

c_vec = CountVectorizer(ngram_range=(1, 5))

# input to fit_transform() should be an iterable with strings
ngrams = c_vec.fit_transform([test_str1, test_str2])

# needs to happen after fit_transform()
vocab = c_vec.vocabulary_

count_values = ngrams.toarray().sum(axis=0)

# output n-grams
for ng_count, ng_text in sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True):
    print(ng_count, ng_text)

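It outputs the following (note that the word I is dropped not because it is a stop word (it isn't) but because of its length: https://stackoverflow.com/a/20743758/). The tuples below are Python 2 print output; Python 3 prints `3 to` and so on: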

> (3, u'to')
> (3, u'from')
> (2, u'ngrams')
> (2, u'need')
> (1, u'words')
> (1, u'trigrams but need better solutions')
> (1, u'trigrams but need better')
...

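This should/could be much simpler these days, imo. You can try textacy, but that sometimes comes with complications of its own, such as initializing a Doc, which currently does not work in v.0.6.2 the way their docs show. If Doc initialization worked as promised, the following would work in theory (but in practice it doesn't):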

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

import textacy

# some version of the following line
doc = textacy.Doc([test_str1, test_str2])

ngrams = doc.to_bag_of_terms(ngrams={1, 5}, as_strings=True)
print(ngrams)

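Old answer

WordNGramAnalyzer has indeed been deprecated since scikit-learn 0.11. Creating n-grams and getting term frequencies are now combined in sklearn.feature_extraction.text.CountVectorizer. You can create all n-grams from 1 to 5 as follows: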

n_grams = CountVectorizer(min_n=1, max_n=5)

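More examples and information can be found in the scikit-learn documentation on text feature extraction.

User

#2 · Edited 2024-10-02 12:33:19

Have a look at http://nltk.org/_modules/nltk/util.html. I think that behind the scenes nltk.util.bigrams() and nltk.util.trigrams() are implemented using nltk.util.ngrams().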

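User

#3 · Edited 2024-10-02 12:33:19

If you want to generate the raw n-grams (or count them yourself), there is also nltk.util.ngrams(sequence, n). It generates a sequence of n-grams for any value of n. It has padding options; see the documentation.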
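As a rough illustration of "count them yourself", here is a dependency-free sketch. The sliding window built with zip stands in for nltk.util.ngrams(sequence, n) without padding; the helper name ngram_counts and the crude tokenisation are assumptions made for this example only.

```python
from collections import Counter


def ngram_counts(tokens, min_n=1, max_n=5):
    """Count every n-gram of length min_n..max_n in a list of tokens."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Zipping n staggered slices yields each consecutive window of n tokens.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


text = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
# Crude tokenisation for illustration only: lowercase, drop periods, split on whitespace.
tokens = text.lower().replace(".", "").split()

counts = ngram_counts(tokens)
for ngram, freq in counts.most_common(5):
    print(freq, " ".join(ngram))
```

Because Counter keys are tuples of tokens, the result doubles as the list of n-grams with frequencies that the question asks for; swapping the zip expression for nltk.util.ngrams(tokens, n) should give the same counts.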

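User

#4 · Edited 2024-10-02 12:33:19

If you want to generate the raw n-grams (perhaps to count them yourself), there is also nltk.util.ngrams(sequence, n). It generates a sequence of n-grams for any value of n. It has padding options; see the documentation.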
