<p>You can swap in your own tokenizer for <code>TfidfVectorizer</code> by passing the keyword argument <code>tokenizer</code> <a href="https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/feature_extraction/text.py#L1143" rel="nofollow noreferrer">(doc-src)</a>.</p>
<p>The original source looks like this:</p>
<pre><code>def build_tokenizer(self):
    """Return a function that splits a string into a sequence of tokens"""
    if self.tokenizer is not None:
        return self.tokenizer
    token_pattern = re.compile(self.token_pattern)
    return lambda doc: token_pattern.findall(doc)
</code></pre>
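<p>As the source shows, the default tokenizer just applies the <code>token_pattern</code> regex to the document. You can reproduce that behavior in isolation (a minimal sketch of what the method returns when no custom tokenizer is set):</p>

```python
import re

# the library's default pattern: tokens of 2+ word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")
default_tokenizer = lambda doc: token_pattern.findall(doc)

# single-character tokens like "a" are dropped by the pattern
print(default_tokenizer("a quick test"))
```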
<p>So let's write a function that removes all the word combinations you don't want. First, define the unwanted expressions:</p>
<pre><code>unwanted_expressions = [("this", "is"),
                        ("not", "good")]   # for example; the list is up to you
</code></pre>
<p>The function then needs to look like this:</p>
<pre><code>import re

token_pattern_str = r"(?u)\b\w\w+\b"

def my_tokenizer(doc):
    """Split a string into a sequence of tokens
    and remove some word combinations along the way."""
    token_pattern = re.compile(token_pattern_str)
    tokens = token_pattern.findall(doc)
    for i in range(len(tokens)):
        for expr in unwanted_expressions:
            if i + len(expr) &gt; len(tokens):
                continue  # expression would run past the end of the token list
            if all(tokens[i + j] == word for j, word in enumerate(expr)):
                # mark the matched words for removal
                tokens[i:i + len(expr)] = len(expr) * [None]
    tokens = [x for x in tokens if x is not None]
    return tokens
</code></pre>
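<p>As a quick sanity check (using a hypothetical word list), the tokenizer drops matched combinations while keeping everything else:</p>

```python
import re

token_pattern_str = r"(?u)\b\w\w+\b"
# hypothetical example list; each entry is a run of words to drop
unwanted_expressions = [("not", "good")]

def my_tokenizer(doc):
    """Split a string into tokens and remove unwanted word combinations."""
    token_pattern = re.compile(token_pattern_str)
    tokens = token_pattern.findall(doc)
    for i in range(len(tokens)):
        for expr in unwanted_expressions:
            if i + len(expr) > len(tokens):
                continue  # expression would run past the end
            if all(tokens[i + j] == word for j, word in enumerate(expr)):
                tokens[i:i + len(expr)] = len(expr) * [None]
    tokens = [x for x in tokens if x is not None]
    return tokens

print(my_tokenizer("this movie is not good at all"))
# → ['this', 'movie', 'is', 'at', 'all']
```

You would then pass it in as <code>TfidfVectorizer(tokenizer=my_tokenizer)</code>.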
<p>I haven't tried this particular function myself, but I have swapped out the tokenizer before, and it worked well.</p>
<p>Good luck :)</p>