CountVectorizer does not respect the regular expression

Posted 2024-09-27 07:31:05


I am using the following code to build a document-term matrix:

from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english", ignore_stopwords=True)


class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the stock analyzer (tokenizing, stop words), then stem each token.
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

stemmed_count_vect = StemmedCountVectorizer(stop_words='english', 
                                            ngram_range=(1,1), 
                                            token_pattern=r'\b\w+\b', 
                                            min_df=1, 
                                            max_df=0.6)

However, I still end up with features like these:

How can I fix this?


1 answer

User

#1 · Posted 2024-09-27 07:31:05

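The pattern token_pattern=r'\b\w+\b' asks for one or more members of the \w character class between word boundaries. In Python's re module, that character class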

[m]atches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore.
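To see what that means in practice, here is a minimal sketch using only the standard re module. The sample string is made up for illustration and does not come from the question; the second pattern, [^\W\d_], is one possible way to spell "word character that is neither a digit nor an underscore".

```python
import re

# Hypothetical sample text, not taken from the original post.
sample = "model_v2 costs 300 dollars in 2024"

# \w covers letters, digits and the underscore, so purely numeric tokens survive.
word_tokens = re.findall(r"\b\w+\b", sample)

# [^\W\d_] keeps only letters. Because of the \b anchors, a mixed token such as
# "model_v2" is dropped entirely rather than being split into "model" and "v".
letter_tokens = re.findall(r"\b[^\W\d_]+\b", sample)

print(word_tokens)
print(letter_tokens)
```

Note that the letters-only pattern discards mixed tokens altogether; whether that is acceptable depends on the corpus.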

So it looks to me like you need a narrower character class (leaving out digits would be a start).
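A minimal sketch of that suggestion, assuming the unwanted features are numeric tokens (the question does not show its output, so this is a guess). The documents are hypothetical, and the plain CountVectorizer is used instead of the stemming subclass to keep the example free of NLTK; the same token_pattern could presumably be passed to StemmedCountVectorizer, since its analyzer wraps the stock one.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, only for illustration.
docs = [
    "In 2023 the team shipped 12 releases",
    "The team fixed 404 errors in release 7",
]

# Letters-only token pattern: [^\W\d_] excludes digits and underscores,
# so tokens such as "2023" or "404" never reach the vocabulary.
vect = CountVectorizer(stop_words="english", token_pattern=r"\b[^\W\d_]+\b")
matrix = vect.fit_transform(docs)

# vocabulary_ maps each kept token to its column index in the matrix.
vocab = sorted(vect.vocabulary_)
print(vocab)
print(matrix.shape)
```

If only ASCII letters matter, r'\b[a-zA-Z]+\b' would be a simpler alternative, at the cost of dropping accented and non-Latin words.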
