Word tokenization with NLTK: handling abbreviations

1 answer

User

#1 · Posted 2024-06-14 16:44:37

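NLTK's regexp_tokenize function uses a regular expression to split a string into substrings. You define a regex pattern, and it builds a tokenizer that returns the substrings matching the alternatives in that pattern. For your specific use case we can write a pattern that looks for words, abbreviations (upper and lower case), and symbols such as '.', ';', and so on.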

import nltk

sent = "I am good. I e.g. wash the dishes."
pattern = r'''(?x)              # verbose mode: whitespace and comments in the pattern are ignored
        (?:[A-Za-z]\.)+       # abbreviations, upper or lower case, like "e.g.", "U.S.A."
        | \w+(?:-\w+)*        # words with optional internal hyphens
        | [][.,;"'?():_`-]    # punctuation as separate tokens; includes ] and [
    '''
print(nltk.regexp_tokenize(sent, pattern))
# Output:
# ['I', 'am', 'good', '.', 'I', 'e.g.', 'wash', 'the', 'dishes', '.']

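The regex pattern for abbreviations is (?:[A-Za-z]\.)+. Here \. matches a literal "." only when it immediately follows a single letter in a-z or A-Z, and the + repeats that letter-plus-period pair, so "e.g." and "U.S.A." each come out as one token. Note that (?:...) is a non-capturing group, not a lookahead. Because this alternative is listed first, it is tried before the word and punctuation alternatives.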
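As a quick check of what the abbreviation alternative accepts on its own, here is a minimal sketch using Python's standard re module instead of NLTK. The helper name is_abbreviation is made up for illustration and is not part of NLTK.

```python
import re

# The abbreviation alternative from the pattern above, on its own:
# one or more "single letter followed by a period" pairs.
ABBREVIATION = re.compile(r"(?:[A-Za-z]\.)+")

def is_abbreviation(token):
    """Hypothetical helper: True if the whole token is letter-period pairs."""
    return ABBREVIATION.fullmatch(token) is not None

for candidate in ["e.g.", "U.S.A.", "good.", "eg.", "B."]:
    print(candidate, is_abbreviation(candidate))
```

"good." and "eg." are rejected because a letter is followed by another letter rather than a period. "B." is accepted, which means a sentence ending in a one-letter word (for example "Plan B.") would keep its final period attached to that letter.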

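On the other hand, the period is matched as a standalone symbol by the following pattern, which is a plain character class: it is not tied to any preceding letter and involves no positive or negative lookahead:

[][.,;"'?():_`-]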
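To see the two cases side by side, the sketch below tokenizes the same sentence with and without the abbreviation alternative. It uses re.findall from the standard library rather than NLTK; for a pattern with no capturing groups like this one, it should give the same tokens as nltk.regexp_tokenize. The tokenize function and its with_abbreviations flag are illustrative names, not an NLTK API.

```python
import re

ABBREV = r"(?:[A-Za-z]\.)+"        # letter-period pairs such as "e.g."
WORD = r"\w+(?:-\w+)*"             # words with optional internal hyphens
PUNCT = r"""[][.,;"'?():_`-]"""    # single punctuation characters

def tokenize(text, with_abbreviations=True):
    """Illustrative tokenizer: alternatives are tried left to right."""
    parts = [ABBREV, WORD, PUNCT] if with_abbreviations else [WORD, PUNCT]
    return re.findall("|".join(parts), text)

sent = "I am good. I e.g. wash the dishes."
print(tokenize(sent))
# ['I', 'am', 'good', '.', 'I', 'e.g.', 'wash', 'the', 'dishes', '.']
print(tokenize(sent, with_abbreviations=False))
# ['I', 'am', 'good', '.', 'I', 'e', '.', 'g', '.', 'wash', 'the', 'dishes', '.']
```

Without the abbreviation alternative, each period in "e.g." falls through to the punctuation class and becomes its own token, which is why the abbreviation alternative has to come first.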
