<p><strong>Update</strong></p>
<p>Since scikit-learn 0.14 the format has changed to:</p>
<pre><code>n_grams = CountVectorizer(ngram_range=(1, 5))
</code></pre>
<p>Full example:</p>
<pre><code>test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."
from sklearn.feature_extraction.text import CountVectorizer
c_vec = CountVectorizer(ngram_range=(1, 5))
# input to fit_transform() should be an iterable with strings
ngrams = c_vec.fit_transform([test_str1, test_str2])
# needs to happen after fit_transform()
vocab = c_vec.vocabulary_
count_values = ngrams.toarray().sum(axis=0)
# output n-grams
for ng_count, ng_text in sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True):
    print(ng_count, ng_text)
</code></pre>
<p>It outputs the following (note that the word <code>I</code> is removed not because it is a stop word (it isn't), but because of its length: <a href="https://stackoverflow.com/a/20743758/">https://stackoverflow.com/a/20743758/</a>):</p>
<pre><code>> (3, u'to')
> (3, u'from')
> (2, u'ngrams')
> (2, u'need')
> (1, u'words')
> (1, u'trigrams but need better solutions')
> (1, u'trigrams but need better')
...
</code></pre>
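<p>If you only need the n-gram counts themselves, not the document-term matrix, a possibly simpler route (a sketch, not part of the original answer) is to reuse the vectorizer's analyzer, via <code>build_analyzer()</code>, together with a plain <code>collections.Counter</code>:</p>
<pre><code>from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

# build_analyzer() returns a callable that applies the same preprocessing,
# tokenization and n-gram generation that fit_transform() would use
analyzer = CountVectorizer(ngram_range=(1, 5)).build_analyzer()

counts = Counter()
for doc in (test_str1, test_str2):
    counts.update(analyzer(doc))

# most frequent n-grams first
for ng_text, ng_count in counts.most_common(5):
    print(ng_count, ng_text)
</code></pre>
<p>This avoids materializing the full array and gives you the sorted top-k directly via <code>most_common()</code>.</p>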
<p>This should/could be a lot simpler now, imo. You can try <a href="https://textacy.readthedocs.io/en/latest/api_reference.html#textacy.doc.Doc.to_bag_of_terms" rel="noreferrer"><code>textacy.Doc.to_bag_of_terms</code></a>, but that sometimes comes with complications of its own, like initializing a Doc, which currently (as of v0.6.2) doesn't work <a href="https://textacy.readthedocs.io/en/latest/api_reference.html#textacy.doc.Doc" rel="noreferrer">as shown on their docs</a>. <a href="https://stackoverflow.com/q/51431112/">If doc initialization worked as promised</a>, the following would in theory work (but in practice doesn't):</p>
<pre><code>test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."
import textacy
# some version of the following line
doc = textacy.Doc([test_str1, test_str2])
# note: the ngrams argument takes the set of n values to include,
# so list all of 1..5 rather than {1, 5}
ngrams = doc.to_bag_of_terms(ngrams={1, 2, 3, 4, 5}, as_strings=True)
print(ngrams)
</code></pre>
<p><strong>Old answer</strong></p>
<p><code>WordNGramAnalyzer</code> was indeed deprecated as of scikit-learn 0.11. Creating n-grams and getting term frequencies are now combined in <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer" rel="noreferrer">sklearn.feature_extraction.text.CountVectorizer</a>. You can create all n-grams from 1 up to 5 like this:</p>
<pre><code>n_grams = CountVectorizer(min_n=1, max_n=5)
</code></pre>
<p>More examples and information can be found in scikit-learn's documentation on <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction" rel="noreferrer">text feature extraction</a>.</p>