<p><code>TfidfVectorizer</code> accepts a custom preprocessor, which you can use to make any adjustments you need.</p>
<p>For example, to remove every consecutive "red" + "roses" token pair from the example corpus (case-insensitively), use:</p>
<pre><code>import re

import numpy as np
from sklearn.feature_extraction import text

cases = ["I like red roses as much as I like blue tulips.",
         "It would be quite unusual to see red tulips, but not RED ROSES",
         "It is almost impossible to find blue roses",
         "I like most red flowers, but roses are my favorite.",
         "Could you buy me some red roses?",
         "John loves the color red. Roses are Mary's favorite flowers."]

# remove_stop_phrases() is our custom preprocessing function.
def remove_stop_phrases(doc):
    # note: this regex also considers "... red. Roses ..." fair game for removal.
    # if that's not what you want, just use [r"red roses"] instead.
    stop_phrases = [r"red(\s?\.?\s?)roses"]
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc

sw = text.ENGLISH_STOP_WORDS
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2, 3),
    stop_words=sw,
    norm='l2',
    min_df=1,
    preprocessor=remove_stop_phrases  # plug in our custom preprocessor
)

dtm = mod_vectorizer.fit_transform(cases).toarray()
vocab = np.array(mod_vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn
</code></pre>
<p>Now <code>vocab</code> is free of all <code>red roses</code> references.</p>
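<p>If you want to sanity-check the phrase regex on its own before wiring it into the vectorizer, a minimal stdlib-only sketch (mirroring the <code>remove_stop_phrases</code> function above) looks like this:</p>

```python
import re

def remove_stop_phrases(doc, stop_phrases=(r"red(\s?\.?\s?)roses",)):
    # same regex as above: matches "red roses" and "red. Roses", in any case
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc

print(remove_stop_phrases("Could you buy me some red roses?"))
# -> "Could you buy me some ?"
print(remove_stop_phrases("John loves the color red. Roses are Mary's favorite flowers."))
```

<p>Note that the optional <code>\.?</code> in the pattern is what lets the second example match across the sentence boundary.</p>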
<p><strong>UPDATE</strong> (per the comment thread):</p>
<p>To pass the desired stop phrases along with custom stop words to a wrapper function, use:</p>
<pre><code>desired_stop_phrases = [r"red(\s?\.?\s?)roses"]
desired_stop_words = ['could', 'buy']  # tokens are lowercased by default, so supply lowercase stop words

def wrapper(stop_words, stop_phrases):
    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc

    sw = text.ENGLISH_STOP_WORDS.union(stop_words)
    mod_vectorizer = text.TfidfVectorizer(
        ngram_range=(2, 3),
        stop_words=sw,
        norm='l2',
        min_df=1,
        preprocessor=remove_stop_phrases
    )
    dtm = mod_vectorizer.fit_transform(cases).toarray()
    vocab = np.array(mod_vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn
    return vocab

wrapper(desired_stop_words, desired_stop_phrases)
</code></pre>