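<p>The accepted answer works well, but it only finds bigrams (tokens consisting of exactly two words). To generalize it to n-grams (as in the example code in my question, which uses the <code>ngram_range=(min,max)</code> parameter), the following code can be used:</p>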
<pre><code>from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.collocations import *
from nltk.probability import FreqDist
import nltk
import re
from itertools import tee, islice

# custom ngram analyzer function, matching only ngrams that belong to the same line
def ngrams_per_line(doc):
    # analyze each line of the input string separately
    for ln in doc.split('\n'):
        # tokenize the line (customize the regex as desired)
        terms = re.findall(u'(?u)\\b\\w+\\b', ln)
        # loop ngram creation for every length between min and max ngram length
        for ngramLength in range(minNgramLength, maxNgramLength + 1):
            # find and yield all ngrams of this length
            # alternative without a generator (works the same but has higher memory usage):
            # for ngram in zip(*[terms[i:] for i in range(ngramLength)]):
            # solution using a generator:
            for ngram in zip(*[islice(seq, i, len(terms)) for i, seq in enumerate(tee(terms, ngramLength))]):
                ngram = ' '.join(ngram)
                yield ngram
</code></pre>
<p>Then pass the custom analyzer as an argument to CountVectorizer:</p>
^{pr2}$
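<p>Make sure that <code>minNgramLength</code> and <code>maxNgramLength</code> are defined in such a way that the <code>ngrams_per_line</code> function can see them (for example, by declaring them as global variables), since they cannot be passed to it as arguments (at least I don't know how).</p>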
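<p>One possible way to avoid the global variables is to build the analyzer inside a factory function, so that the two limits are captured in a closure. The following is only a minimal sketch: <code>make_ngrams_per_line</code>, <code>min_n</code> and <code>max_n</code> are hypothetical names chosen for illustration and are not part of scikit-learn or of the code above.</p>

```python
import re
from itertools import islice, tee


def make_ngrams_per_line(min_n, max_n):
    """Return an analyzer yielding per-line n-grams of length min_n..max_n.

    Hypothetical helper (an assumption, not a scikit-learn API): the limits
    are captured in a closure, so no global variables are needed.
    """
    def ngrams_per_line(doc):
        # analyze each line separately so no n-gram spans a line break
        for ln in doc.split('\n'):
            # same tokenization regex as in the answer above
            terms = re.findall(r'(?u)\b\w+\b', ln)
            for n in range(min_n, max_n + 1):
                # n staggered iterators over the same tokens, zipped into n-grams
                windows = (islice(seq, i, None) for i, seq in enumerate(tee(terms, n)))
                for ngram in zip(*windows):
                    yield ' '.join(ngram)
    return ngrams_per_line


analyzer = make_ngrams_per_line(1, 2)
print(list(analyzer("a b c\nd e")))
# prints ['a', 'b', 'c', 'a b', 'b c', 'd', 'e', 'd e']
```

<p>Since <code>CountVectorizer</code> accepts a callable for its <code>analyzer</code> parameter, the returned function could presumably be used as <code>CountVectorizer(analyzer=make_ngrams_per_line(1, 3))</code>. Note that no n-gram such as <code>c d</code> is produced, because it would cross a line boundary.</p>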