How to remove stop phrases / stop ngrams (multi-word strings) with pandas/sklearn?

Posted 2024-09-22 20:33:41


I want to keep certain phrases out of my model. For example, I want to stop "red roses" from entering my analysis. I know how to add individual stop words, as shown in Adding words to scikit-learn's CountVectorizer's stop list, by doing:

from sklearn.feature_extraction import text
additional_stop_words = ['red', 'roses']
stop_words = text.ENGLISH_STOP_WORDS.union(additional_stop_words)  # combine with the built-in list, per the linked answer

However, this also prevents other ngrams such as "red tulips" or "blue roses" from being detected.

I am building a TfidfVectorizer as part of my model, and I realize the processing I need probably has to come in after this stage, but I am not sure how to do that.

My end goal is to do topic modeling on a piece of text. Here is the code snippet I am working with (borrowed almost directly from https://de.dariah.eu/tatom/topic_model_python.html#index-0):

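(The original snippet did not survive the page conversion; as a stand-in, below is a minimal sketch of the tatom-style pipeline it described. The TfidfVectorizer settings, the NMF choice, and num_topics are assumptions, and it runs against the sample dataframe df from the edit below.)

import numpy as np
from sklearn import decomposition
from sklearn.feature_extraction import text

# vectorize the corpus, then factor the document-term matrix into topics
vectorizer = text.TfidfVectorizer(stop_words='english', ngram_range=(2, 3), min_df=1)
dtm = vectorizer.fit_transform(df['Content']).toarray()
vocab = np.array(vectorizer.get_feature_names_out())

num_topics = 2  # assumed value for this tiny corpus
clf = decomposition.NMF(n_components=num_topics, random_state=1)
doctopic = clf.fit_transform(dtm)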

EDIT

Sample dataframe (I tried to include as many edge cases as possible), df:

   Content
0  I like red roses as much as I like blue tulips.
1  It would be quite unusual to see red tulips, but not RED ROSES
2  It is almost impossible to find blue roses
3  I like most red flowers, but roses are my favorite.
4  Could you buy me some red roses?
5  John loves the color red. Roses are Mary's favorite flowers.
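
For reference, the frame can be built with pandas like this (a sketch; only the Content column is assumed):

import pandas as pd

df = pd.DataFrame({'Content': [
    "I like red roses as much as I like blue tulips.",
    "It would be quite unusual to see red tulips, but not RED ROSES",
    "It is almost impossible to find blue roses",
    "I like most red flowers, but roses are my favorite.",
    "Could you buy me some red roses?",
    "John loves the color red. Roses are Mary's favorite flowers.",
]})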

3 Answers

TfidfVectorizer allows a custom preprocessor. You can use it to make any adjustments you need.

For example, to remove all consecutive occurrences of "red" + "roses" tokens (case-insensitive) from your sample corpus, use:

import re

import numpy as np
from sklearn.feature_extraction import text

cases = ["I like red roses as much as I like blue tulips.",
         "It would be quite unusual to see red tulips, but not RED ROSES",
         "It is almost impossible to find blue roses",
         "I like most red flowers, but roses are my favorite.",
         "Could you buy me some red roses?",
         "John loves the color red. Roses are Mary's favorite flowers."]

# remove_stop_phrases() is our custom preprocessing function.
def remove_stop_phrases(doc):
    # note: this regex treats "... red. Roses ..." as fair game for removal.
    #       if that's not what you want, just use [r"red roses"] instead.
    stop_phrases = [r"red(\s?\.?\s?)roses"]
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    # a custom preprocessor replaces the vectorizer's built-in lowercasing,
    # so lowercase here to keep the stop-word filtering intact
    return doc.lower()

sw = text.ENGLISH_STOP_WORDS
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2,3),
    stop_words=sw,
    norm='l2',
    min_df=1,
    preprocessor=remove_stop_phrases  # define our custom preprocessor
)

dtm = mod_vectorizer.fit_transform(cases).toarray()
vocab = np.array(mod_vectorizer.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0

Now vocab has all of the red roses references removed.

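As a quick sanity check, running the preprocessor by itself on the trickiest sample sentence shows the phrase being stripped (the output follows from the regex and the lowercasing above):

print(remove_stop_phrases("John loves the color red. Roses are Mary's favorite flowers."))
# john loves the color  are mary's favorite flowers.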

Update (per the comment thread):

To pass the desired stop phrases along with custom stop words to a wrapper function, use:

desired_stop_phrases = [r"red(\s?\.?\s?)roses"]
desired_stop_words = ['Could', 'buy']

def wrapper(stop_words, stop_phrases):

    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc.lower()  # custom preprocessors skip built-in lowercasing (see note above)

    sw = text.ENGLISH_STOP_WORDS.union(stop_words)
    mod_vectorizer = text.TfidfVectorizer(
        ngram_range=(2,3),
        stop_words=sw,
        norm='l2',
        min_df=1,
        preprocessor=remove_stop_phrases
    )

    dtm = mod_vectorizer.fit_transform(cases).toarray()
    vocab = np.array(mod_vectorizer.get_feature_names_out())

    return vocab

wrapper(desired_stop_words, desired_stop_phrases)
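
Note that wrapper() reads cases from the enclosing scope. If your documents live in the dataframe from the question instead, rebind cases from the Content column first (a sketch assuming the sample df above):

cases = df['Content'].tolist()
vocab = wrapper(desired_stop_words, desired_stop_phrases)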

Before you pass df to mod_vectorizer, you should use something like this:

df=["I like red roses as much as I like blue tulips.",
"It would be quite unusual to see red tulips, but not RED ROSES",
"It is almost impossible to find blue roses",
"I like most red flowers, but roses are my favorite.",
"Could you buy me some red roses?",
"John loves the color red. Roses are Mary's favorite flowers."]

df=[ i.lower() for i in df]
df=[i if 'red roses' not in i else i.replace('red roses','') for i in df]

If you want to check for more than just "red roses", replace the last line above with:

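(The original replacement line was lost; a minimal reconstruction that loops over a list of phrases. The stop_phrases contents are illustrative, and must be lowercase because df was lowercased above:)

stop_phrases = ['red roses', 'blue tulips']
for phrase in stop_phrases:
    df = [i.replace(phrase, '') for i in df]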

You can swap out TfidfVectorizer's tokenizer by passing the keyword argument tokenizer (doc-src).

The original looks like this:

def build_tokenizer(self):
    """Return a function that splits a string into a sequence of tokens"""
    if self.tokenizer is not None:
        return self.tokenizer
    token_pattern = re.compile(self.token_pattern)
    return lambda doc: token_pattern.findall(doc)

So let's make a function that removes all of the word combinations you don't want. First, define the unwanted expressions:

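(The original definition was lost; reconstructed from how my_tokenizer consumes it below. Each expression is a tuple of lowercase words, and the phrases themselves are illustrative:)

unwanted_expressions = [('red', 'roses'), ('blue', 'tulips')]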

The function then needs to look something like this:

token_pattern_str = r"(?u)\b\w\w+\b"

def my_tokenizer(doc):
    """split a string into a sequence of tokens
    and remove some word combinations along the way."""
    token_pattern = re.compile(token_pattern_str)
    tokens = token_pattern.findall(doc)
    for i in range(len(tokens)):
        for expr in unwanted_expressions:
            # guard against indexing past the end of the token list
            if i + len(expr) > len(tokens):
                continue
            if all(tokens[i + j] == word for j, word in enumerate(expr)):
                # blank out the matched span; the Nones are dropped below
                tokens[i:i + len(expr)] = len(expr) * [None]
    tokens = [x for x in tokens if x is not None]
    return tokens
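
To wire this in, pass the function via the tokenizer keyword (a sketch; token_pattern=None simply silences scikit-learn's warning that the default pattern goes unused):

from sklearn.feature_extraction import text

# the default preprocessor still lowercases the text, so the lowercase
# expressions in unwanted_expressions will match
vectorizer = text.TfidfVectorizer(tokenizer=my_tokenizer, token_pattern=None)
dtm = vectorizer.fit_transform(cases)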

I haven't tried this function in particular, but I have swapped out the tokenizer before and it worked well.

Good luck :)
