Using scikit-learn's CountVectorizer to create ngrams only from words on the same line (so ngrams do not cross line breaks)

Posted 2024-10-01 09:18:01


When using the scikit-learn library in Python, I can use CountVectorizer to create ngrams of a desired length (for example, 2 words) like this:

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.collocations import *
from nltk.probability import FreqDist
import nltk

myString = 'This is a\nmultiline string'

countVectorizer = CountVectorizer(ngram_range=(2,2))
analyzer = countVectorizer.build_analyzer()

listNgramQuery = analyzer(myString)
NgramQueryWeights = nltk.FreqDist(listNgramQuery)

print(NgramQueryWeights.items())

This prints:

dict_items([('this is', 1), ('is multiline', 1), ('multiline string', 1)])

As you can see from the is multiline ngram that was created (the single-character word a is dropped because the default token pattern only matches words of two or more characters), the engine does not care about the line break in the string.

How can I modify the ngram-building engine so that it respects line breaks in the string and only creates ngrams whose words all belong to the same line of text? My expected output is:

dict_items([('multiline string', 1), ('this is', 1)])

I know I can modify the tokenizer pattern by passing token_pattern=someRegex to CountVectorizer. I also read somewhere that the default regex used is u'(?u)\\b\\w\\w+\\b'. Still, I think this is a question about the creation of the ngrams rather than about the tokenizer, since the problem is not that tokens are created without respecting the line break, but that the ngrams are.
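To illustrate that point (a small sketch of my own, not from the original post): even spelling out the default token_pattern explicitly leaves the line break invisible to the ngram step, because the analyzer tokenizes the whole string first and only then forms the ngrams.

from sklearn.feature_extraction.text import CountVectorizer

# even with the default token pattern made explicit, the bigrams still
# cross the line break: tokenization flattens the string before the
# ngram step ever runs
cv = CountVectorizer(ngram_range=(2, 2), token_pattern=r'(?u)\b\w\w+\b')
analyzer = cv.build_analyzer()
print(analyzer('This is a\nmultiline string'))
# ['this is', 'is multiline', 'multiline string']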


2 Answers

You need to override the analyzer, as described in the documentation.

import re

from sklearn.feature_extraction.text import CountVectorizer

def bigrams_per_line(doc):
    # build bigrams for each line separately, so they never span a line break
    for ln in doc.split('\n'):
        terms = re.findall(r'\w{2,}', ln)
        for bigram in zip(terms, terms[1:]):
            yield '%s %s' % bigram


cv = CountVectorizer(analyzer=bigrams_per_line)
cv.fit(['This is a\nmultiline string'])
print(cv.get_feature_names())  # on scikit-learn >= 1.2, use cv.get_feature_names_out()
# ['This is', 'multiline string']
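To get the dict_items output from the question, the generator can be fed straight into nltk.FreqDist (a small usage sketch of mine, not from the original answer; note that a custom analyzer bypasses CountVectorizer's built-in lowercasing, so lowercase the input yourself if you want this is rather than This is):

import nltk

myString = 'This is a\nmultiline string'

# lowercase manually, since the custom analyzer skips sklearn's preprocessing
NgramQueryWeights = nltk.FreqDist(bigrams_per_line(myString.lower()))
print(NgramQueryWeights.items())
# dict_items([('this is', 1), ('multiline string', 1)])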

The accepted answer is fine, but it only finds bigrams (ngrams consisting of exactly two words). To generalize this to ngrams (as in the example code from my question, which uses the ngram_range=(min,max) parameter), you can use the following code:

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.collocations import *
from nltk.probability import FreqDist
import nltk
import re
from itertools import tee, islice

# custom ngram analyzer function, matching only ngrams that belong to the same line
def ngrams_per_line(doc):

    # analyze each line of the input string seperately
    for ln in doc.split('\n'):

        # tokenize the input string (customize the regex as desired)
        terms = re.findall(u'(?u)\\b\\w+\\b', ln)

        # loop ngram creation for every number between min and max ngram length
        for ngramLength in range(minNgramLength, maxNgramLength+1):

            # find and return all ngrams
            # for ngram in zip(*[terms[i:] for i in range(3)]): <  solution without a generator (works the same but has higher memory usage)
            for ngram in zip(*[islice(seq, i, len(terms)) for i, seq in enumerate(tee(terms, ngramLength))]): # <  solution using a generator
                ngram = ' '.join(ngram)
                yield ngram

Then use the custom analyzer as the argument to CountVectorizer:

# define the ngram lengths so that ngrams_per_line can see them (see the note below)
minNgramLength = 1
maxNgramLength = 2

cv = CountVectorizer(analyzer=ngrams_per_line)

Make sure that minNgramLength and maxNgramLength are defined in such a way that the ngrams_per_line function knows about them (for example, by declaring them as globals), since they cannot be passed to it as arguments (at least I don't know how).
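For what it's worth, one way to avoid the globals (my own suggestion, not part of the original answer; the function name is mine, for illustration) is to give the function explicit length parameters and bind them with functools.partial, since CountVectorizer only ever calls the analyzer with the document as its single argument:

import re
from functools import partial

from sklearn.feature_extraction.text import CountVectorizer

# same logic as ngrams_per_line above, but with the lengths as explicit
# keyword parameters instead of globals
def ngrams_per_line_bound(doc, minNgramLength, maxNgramLength):
    for ln in doc.split('\n'):
        terms = re.findall(u'(?u)\\b\\w+\\b', ln)
        for ngramLength in range(minNgramLength, maxNgramLength+1):
            for ngram in zip(*[terms[i:] for i in range(ngramLength)]):
                yield ' '.join(ngram)

# partial pre-fills the length arguments, so scikit-learn still sees a
# one-argument callable
cv = CountVectorizer(analyzer=partial(ngrams_per_line_bound,
                                      minNgramLength=1, maxNgramLength=2))
cv.fit(['This is a\nmultiline string'])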
