<p><strong>Update</strong></p>
<p>Since scikit-learn 0.14 the format has changed to:</p>
<pre><code>n_grams = CountVectorizer(ngram_range=(1, 5))
</code></pre>
<p>Full example:</p>
<pre><code>test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."
from sklearn.feature_extraction.text import CountVectorizer
c_vec = CountVectorizer(ngram_range=(1, 5))
# input to fit_transform() should be an iterable with strings
ngrams = c_vec.fit_transform([test_str1, test_str2])
# needs to happen after fit_transform()
vocab = c_vec.vocabulary_
count_values = ngrams.toarray().sum(axis=0)
# output n-grams
for ng_count, ng_text in sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True):
    print(ng_count, ng_text)
</code></pre>
<p>It outputs the following (note that the word <code>I</code> is removed not because it is a stop word (it isn't), but because of its length: <a href="https://stackoverflow.com/a/20743758/">https://stackoverflow.com/a/20743758/</a>):</p>
<pre><code>> (3, u'to')
> (3, u'from')
> (2, u'ngrams')
> (2, u'need')
> (1, u'words')
> (1, u'trigrams but need better solutions')
> (1, u'trigrams but need better')
...
</code></pre>
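<p>If you only need the n-gram counts themselves, not the document-term matrix, a possibly simpler route (a sketch, not part of the original answer) is to reuse the vectorizer's analyzer, via <code>build_analyzer()</code>, together with a plain <code>collections.Counter</code>:</p>
<pre><code>from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

# build_analyzer() returns a callable that applies the same preprocessing,
# tokenization and n-gram generation that fit_transform() would use
analyzer = CountVectorizer(ngram_range=(1, 5)).build_analyzer()

counts = Counter()
for doc in (test_str1, test_str2):
    counts.update(analyzer(doc))

# most frequent n-grams first
for ng_text, ng_count in counts.most_common(5):
    print(ng_count, ng_text)
</code></pre>
<p>This avoids materializing the full array and gives you the sorted top-k directly via <code>most_common()</code>.</p>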
<p>This should/could be a lot simpler now, imo. You can try <a href="https://textacy.readthedocs.io/en/latest/api_reference.html#textacy.doc.Doc.to_bag_of_terms" rel="noreferrer"><code>textacy.Doc.to_bag_of_terms</code></a>, but that sometimes comes with complications of its own, like initializing a Doc, which currently (as of v0.6.2) doesn't work <a href="https://textacy.readthedocs.io/en/latest/api_reference.html#textacy.doc.Doc" rel="noreferrer">as shown on their docs</a>. <a href="https://stackoverflow.com/q/51431112/">If doc initialization worked as promised</a>, the following would in theory work (but in practice doesn't):</p>
<pre><code>test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."
import textacy
# some version of the following line
doc = textacy.Doc([test_str1, test_str2])
# note: the ngrams argument takes the set of n values to include,
# so list all of 1..5 rather than {1, 5}
ngrams = doc.to_bag_of_terms(ngrams={1, 2, 3, 4, 5}, as_strings=True)
print(ngrams)
</code></pre>
<p><strong>Old answer</strong></p>
<p><code>WordNGramAnalyzer</code> was indeed deprecated as of scikit-learn 0.11. Creating n-grams and getting term frequencies are now combined in <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer" rel="noreferrer">sklearn.feature_extraction.text.CountVectorizer</a>. You can create all n-grams from 1 up to 5 like this:</p>
<pre><code>n_grams = CountVectorizer(min_n=1, max_n=5)
</code></pre>
<p>More examples and information can be found in scikit-learn's documentation on <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction" rel="noreferrer">text feature extraction</a>.</p>