How to remove stop phrases / stop ngrams (multi-word strings) with pandas/sklearn?

Posted 2024-09-22 20:33:41


I want to keep certain phrases out of my model. For example, I want to stop "red roses" from entering my analysis. I know how to add individual stop words, as shown in Adding words to scikit-learn's CountVectorizer's stop list, by doing:

from sklearn.feature_extraction import text
additional_stop_words = ['red', 'roses']
stop_words = text.ENGLISH_STOP_WORDS.union(additional_stop_words)  # combine with the built-in list, per the linked answer

However, this also prevents other ngrams such as "red tulips" or "blue roses" from being detected.

I am building a TfidfVectorizer as part of my model, and I realize the processing I need probably has to come in after this stage, but I am not sure how to do that.

My end goal is to do topic modeling on a piece of text. Here is the code snippet I am working with (borrowed almost directly from https://de.dariah.eu/tatom/topic_model_python.html#index-0):

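(The original snippet did not survive the page conversion; as a stand-in, below is a minimal sketch of the tatom-style pipeline it described. The TfidfVectorizer settings, the NMF choice, and num_topics are assumptions, and it runs against the sample dataframe df from the edit below.)

import numpy as np
from sklearn import decomposition
from sklearn.feature_extraction import text

# vectorize the corpus, then factor the document-term matrix into topics
vectorizer = text.TfidfVectorizer(stop_words='english', ngram_range=(2, 3), min_df=1)
dtm = vectorizer.fit_transform(df['Content']).toarray()
vocab = np.array(vectorizer.get_feature_names_out())

num_topics = 2  # assumed value for this tiny corpus
clf = decomposition.NMF(n_components=num_topics, random_state=1)
doctopic = clf.fit_transform(dtm)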

EDIT

Sample dataframe (I tried to include as many edge cases as possible), df:

   Content
0  I like red roses as much as I like blue tulips.
1  It would be quite unusual to see red tulips, but not RED ROSES
2  It is almost impossible to find blue roses
3  I like most red flowers, but roses are my favorite.
4  Could you buy me some red roses?
5  John loves the color red. Roses are Mary's favorite flowers.
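
For reference, the frame can be built with pandas like this (a sketch; only the Content column is assumed):

import pandas as pd

df = pd.DataFrame({'Content': [
    "I like red roses as much as I like blue tulips.",
    "It would be quite unusual to see red tulips, but not RED ROSES",
    "It is almost impossible to find blue roses",
    "I like most red flowers, but roses are my favorite.",
    "Could you buy me some red roses?",
    "John loves the color red. Roses are Mary's favorite flowers.",
]})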

3 Answers

TfidfVectorizer allows a custom preprocessor. You can use it to make any adjustments you need.

For example, to remove all consecutive occurrences of "red" + "roses" tokens (case-insensitive) from your sample corpus, use:

import re

import numpy as np
from sklearn.feature_extraction import text

cases = ["I like red roses as much as I like blue tulips.",
         "It would be quite unusual to see red tulips, but not RED ROSES",
         "It is almost impossible to find blue roses",
         "I like most red flowers, but roses are my favorite.",
         "Could you buy me some red roses?",
         "John loves the color red. Roses are Mary's favorite flowers."]

# remove_stop_phrases() is our custom preprocessing function.
def remove_stop_phrases(doc):
    # note: this regex treats "... red. Roses ..." as fair game for removal.
    #       if that's not what you want, just use [r"red roses"] instead.
    stop_phrases = [r"red(\s?\.?\s?)roses"]
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    # a custom preprocessor replaces the vectorizer's built-in lowercasing,
    # so lowercase here to keep the stop-word filtering intact
    return doc.lower()

sw = text.ENGLISH_STOP_WORDS
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2,3),
    stop_words=sw,
    norm='l2',
    min_df=1,
    preprocessor=remove_stop_phrases  # define our custom preprocessor
)

dtm = mod_vectorizer.fit_transform(cases).toarray()
vocab = np.array(mod_vectorizer.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0

Now vocab has all of the red roses references removed.

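As a quick sanity check, running the preprocessor by itself on the trickiest sample sentence shows the phrase being stripped (the output follows from the regex and the lowercasing above):

print(remove_stop_phrases("John loves the color red. Roses are Mary's favorite flowers."))
# john loves the color  are mary's favorite flowers.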

Update (per the comment thread):

To pass the desired stop phrases along with custom stop words to a wrapper function, use:

desired_stop_phrases = [r"red(\s?\.?\s?)roses"]
desired_stop_words = ['Could', 'buy']

def wrapper(stop_words, stop_phrases):

    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc.lower()  # custom preprocessors skip built-in lowercasing (see note above)

    sw = text.ENGLISH_STOP_WORDS.union(stop_words)
    mod_vectorizer = text.TfidfVectorizer(
        ngram_range=(2,3),
        stop_words=sw,
        norm='l2',
        min_df=1,
        preprocessor=remove_stop_phrases
    )

    dtm = mod_vectorizer.fit_transform(cases).toarray()
    vocab = np.array(mod_vectorizer.get_feature_names_out())

    return vocab

wrapper(desired_stop_words, desired_stop_phrases)
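
Note that wrapper() reads cases from the enclosing scope. If your documents live in the dataframe from the question instead, rebind cases from the Content column first (a sketch assuming the sample df above):

cases = df['Content'].tolist()
vocab = wrapper(desired_stop_words, desired_stop_phrases)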

Before you pass df to mod_vectorizer, you should use something like this:

df=["I like red roses as much as I like blue tulips.",
"It would be quite unusual to see red tulips, but not RED ROSES",
"It is almost impossible to find blue roses",
"I like most red flowers, but roses are my favorite.",
"Could you buy me some red roses?",
"John loves the color red. Roses are Mary's favorite flowers."]

df=[ i.lower() for i in df]
df=[i if 'red roses' not in i else i.replace('red roses','') for i in df]

If you want to check for more than just "red roses", replace the last line above with:

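(The original replacement line was lost; a minimal reconstruction that loops over a list of phrases. The stop_phrases contents are illustrative, and must be lowercase because df was lowercased above:)

stop_phrases = ['red roses', 'blue tulips']
for phrase in stop_phrases:
    df = [i.replace(phrase, '') for i in df]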

You can swap out TfidfVectorizer's tokenizer by passing the keyword argument tokenizer (doc-src).

The original looks like this:

def build_tokenizer(self):
    """Return a function that splits a string into a sequence of tokens"""
    if self.tokenizer is not None:
        return self.tokenizer
    token_pattern = re.compile(self.token_pattern)
    return lambda doc: token_pattern.findall(doc)

So let's make a function that removes all of the word combinations you don't want. First, define the unwanted expressions:

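(The original definition was lost; reconstructed from how my_tokenizer consumes it below. Each expression is a tuple of lowercase words, and the phrases themselves are illustrative:)

unwanted_expressions = [('red', 'roses'), ('blue', 'tulips')]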

The function then needs to look something like this:

token_pattern_str = r"(?u)\b\w\w+\b"

def my_tokenizer(doc):
    """split a string into a sequence of tokens
    and remove some word combinations along the way."""
    token_pattern = re.compile(token_pattern_str)
    tokens = token_pattern.findall(doc)
    for i in range(len(tokens)):
        for expr in unwanted_expressions:
            # guard against indexing past the end of the token list
            if i + len(expr) > len(tokens):
                continue
            if all(tokens[i + j] == word for j, word in enumerate(expr)):
                # blank out the matched span; the Nones are dropped below
                tokens[i:i + len(expr)] = len(expr) * [None]
    tokens = [x for x in tokens if x is not None]
    return tokens
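
To wire this in, pass the function via the tokenizer keyword (a sketch; token_pattern=None simply silences scikit-learn's warning that the default pattern goes unused):

from sklearn.feature_extraction import text

# the default preprocessor still lowercases the text, so the lowercase
# expressions in unwanted_expressions will match
vectorizer = text.TfidfVectorizer(tokenizer=my_tokenizer, token_pattern=None)
dtm = vectorizer.fit_transform(cases)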

I haven't tried this function in particular, but I have swapped out the tokenizer before and it worked well.

Good luck :)
