<p><code>TfidfVectorizer</code> accepts a custom preprocessor, which you can use to make any adjustments you need.</p>
<p>For example, to remove every consecutive "red" + "roses" token pair from the example corpus (case-insensitively), use:</p>
<pre><code>import re

import numpy as np
from sklearn.feature_extraction import text

cases = ["I like red roses as much as I like blue tulips.",
         "It would be quite unusual to see red tulips, but not RED ROSES",
         "It is almost impossible to find blue roses",
         "I like most red flowers, but roses are my favorite.",
         "Could you buy me some red roses?",
         "John loves the color red. Roses are Mary's favorite flowers."]

# remove_stop_phrases() is our custom preprocessing function.
def remove_stop_phrases(doc):
    # note: this regex also considers "... red. Roses ..." fair game for removal.
    # if that's not what you want, just use [r"red roses"] instead.
    stop_phrases = [r"red(\s?\.?\s?)roses"]
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc

sw = text.ENGLISH_STOP_WORDS
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2, 3),
    stop_words=sw,
    norm='l2',
    min_df=1,
    preprocessor=remove_stop_phrases  # plug in our custom preprocessor
)

dtm = mod_vectorizer.fit_transform(cases).toarray()
vocab = np.array(mod_vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn
</code></pre>
<p>Now <code>vocab</code> is free of all <code>red roses</code> references.</p>
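<p>If you want to sanity-check the phrase regex on its own before wiring it into the vectorizer, a minimal stdlib-only sketch (mirroring the <code>remove_stop_phrases</code> function above) looks like this:</p>

```python
import re

def remove_stop_phrases(doc, stop_phrases=(r"red(\s?\.?\s?)roses",)):
    # same regex as above: matches "red roses" and "red. Roses", in any case
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc

print(remove_stop_phrases("Could you buy me some red roses?"))
# -> "Could you buy me some ?"
print(remove_stop_phrases("John loves the color red. Roses are Mary's favorite flowers."))
```

<p>Note that the optional <code>\.?</code> in the pattern is what lets the second example match across the sentence boundary.</p>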
<p><strong>UPDATE</strong> (per the comment thread):</p>
<p>To pass the desired stop phrases along with custom stop words to a wrapper function, use:</p>
<pre><code>desired_stop_phrases = [r"red(\s?\.?\s?)roses"]
desired_stop_words = ['could', 'buy']  # tokens are lowercased by default, so supply lowercase stop words

def wrapper(stop_words, stop_phrases):
    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc

    sw = text.ENGLISH_STOP_WORDS.union(stop_words)
    mod_vectorizer = text.TfidfVectorizer(
        ngram_range=(2, 3),
        stop_words=sw,
        norm='l2',
        min_df=1,
        preprocessor=remove_stop_phrases
    )
    dtm = mod_vectorizer.fit_transform(cases).toarray()
    vocab = np.array(mod_vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn
    return vocab

wrapper(desired_stop_words, desired_stop_phrases)
</code></pre>