<p>You need to override the analyzer, as <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes" rel="nofollow">described in the documentation</a>, by passing a callable that yields the bigrams one line at a time:</p>
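<pre><code>import re

from sklearn.feature_extraction.text import CountVectorizer

def bigrams_per_line(doc):
    # Tokenize each line separately so no bigram spans a line break.
    for ln in doc.split('\n'):
        terms = re.findall(r'\w{2,}', ln)  # words of two or more characters
        for bigram in zip(terms, terms[1:]):
            yield '%s %s' % bigram

cv = CountVectorizer(analyzer=bigrams_per_line)
cv.fit(['This is a\nmultiline string'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(list(cv.get_feature_names_out()))
# ['This is', 'multiline string']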
</code></pre>
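<p>To see why the analyzer has to work line by line, here is a minimal standard-library sketch that compares it with a baseline that tokenizes the whole document at once. The function <code>bigrams_whole_doc</code> is a hypothetical name introduced only for this comparison; it is not part of scikit-learn and does not claim to reproduce the default analyzer.</p>

```python
import re


def bigrams_per_line(doc):
    """Yield word bigrams that never span a line break."""
    for ln in doc.split('\n'):
        terms = re.findall(r'\w{2,}', ln)
        for bigram in zip(terms, terms[1:]):
            yield '%s %s' % bigram


def bigrams_whole_doc(doc):
    """Hypothetical baseline: bigrams over the whole text, ignoring line breaks."""
    terms = re.findall(r'\w{2,}', doc)
    for bigram in zip(terms, terms[1:]):
        yield '%s %s' % bigram


doc = 'This is a\nmultiline string'
per_line = list(bigrams_per_line(doc))
whole = list(bigrams_whole_doc(doc))

print(per_line)  # ['This is', 'multiline string']
print(whole)     # ['This is', 'is multiline', 'multiline string']
```

<p>The baseline produces the extra bigram <code>'is multiline'</code>, which crosses the line break; the per-line analyzer does not.</p>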