nltk/re: "nothing to repeat" error when trying to tokenize with a regex

import nltk

text = """The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital computer or the gears of a cycle transmission as he does at the top of a mountain or in the petals of a flower. To think otherwise is to demean the Buddha...which is to demean oneself."""

sentence_re = r'''(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'''

toks = nltk.regexp_tokenize(text, sentence_re)

1 answer

User

#1 · Posted on 2024-10-17 06:18:04

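The error you are getting comes from applying a quantifier to the `$` end-of-string anchor. An unescaped `$` is a zero-width assertion that matches at the end of the string, so `$?` gives the quantifier nothing to repeat. To match a literal `$`, you need to escape it as `\$`.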
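As a minimal sketch of this point, the snippet below compiles a reduced version of the currency branch with and without the escape. The variable names (`broken`, `fixed`, `error_message`, `amounts`) and the sample string are illustrative and not part of the original question.

```python
import re

# Reduced form of the failing branch: '?' is applied to the '$' anchor.
broken = r'(?:$?\d+)'
try:
    re.compile(broken)
    error_message = None
except re.error as exc:
    # The message is expected to start with "nothing to repeat".
    error_message = str(exc)

# Escaping the dollar sign makes it a literal character that '?' can quantify.
fixed = re.compile(r'(?:\$?\d+)')
amounts = fixed.findall('$12 and 7')

print(error_message)
print(amounts)
```

The escaped version treats the dollar sign as optional, so it matches both the prefixed and the bare number in the sample string.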

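The `.` characters in the expression also need to be escaped (`\.`) to match literal dots; unescaped, each one matches any character except a newline.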
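A short illustration of the difference, using a reduced version of the decimal-number branch. The sample string `'3x50'` is an assumption chosen so that the two patterns disagree.

```python
import re

sample = '3x50'

# Unescaped dot: '.' accepts the 'x', so the whole string is one match.
unescaped = re.findall(r'\d+(?:.\d+)?', sample)

# Escaped dot: only a literal '.' may join the two digit runs.
escaped = re.findall(r'\d+(?:\.\d+)?', sample)

print(unescaped)  # ['3x50']
print(escaped)    # ['3', '50']
```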

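There is also a problem with the `-` in the character class [][.,;"'?():-_`]: there it forms a range, `:-_`, which covers every character from `:` to `_` (including the uppercase letters) rather than a literal hyphen. To make sure `-` matches a literal `-`, put it at the end of the class, right before the closing `]`.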
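The effect of the accidental range can be seen on a small character class. This is only a sketch: the two classes below are cut down from the original one, and the sample string is made up for the comparison.

```python
import re

sample = 'A-b_C:'

# ':-_' is read as a range from ':' (0x3A) to '_' (0x5F), which includes
# the uppercase letters but not '-' (0x2D).
as_range = re.findall(r'[:-_]', sample)

# With '-' moved to the end, the class holds three literal characters.
as_literal = re.findall(r'[:_-]', sample)

print(as_range)    # ['A', '_', 'C', ':']
print(as_literal)  # ['-', '_', ':']
```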

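Also, it looks like you want to match words that do not contain underscores (since you put `_` in the last character class). So I suggest subtracting `_` from the `\w` pattern, replacing `\w+(?:-\w+)*` with `[^\W_]+(?:-[^\W_]+)*`.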
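To show what the subtraction changes, here is a small comparison of the two word patterns on a made-up sample string (the string and variable names are assumptions for illustration only).

```python
import re

sample = 'snake_case x-ray'

# \w includes the underscore, so 'snake_case' stays in one piece.
with_underscore = re.findall(r'\w+(?:-\w+)*', sample)

# [^\W_] means "a word character that is not an underscore".
without_underscore = re.findall(r'[^\W_]+(?:-[^\W_]+)*', sample)

print(with_underscore)     # ['snake_case', 'x-ray']
print(without_underscore)  # ['snake', 'case', 'x-ray']
```

With the second pattern the underscore is left over for the punctuation class to pick up as its own token.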

Here is the pattern with my suggestions applied:

sentence_re = r'''\$?\d+(?:\.\d+)?%?|[A-Z](?:\.[A-Z])+\.?|[^\W_]+(?:-[^\W_]+)*|(?:\.{3}|)[][.,;"'?():_`-]'''
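A quick way to try the pattern is sketched below. It uses the standard library's `re.findall` as a stand-in for `nltk.regexp_tokenize`, on the assumption that the two return the same tokens for a pattern without capturing groups; the sample sentence is invented for the test and is not from the question.

```python
import re

sentence_re = r'''\$?\d+(?:\.\d+)?%?|[A-Z](?:\.[A-Z])+\.?|[^\W_]+(?:-[^\W_]+)*|(?:\.{3}|)[][.,;"'?():_`-]'''

# Hypothetical sample covering each branch: currency, percentage,
# abbreviation, hyphenated word, underscore, and plain punctuation.
sample = 'The U.S.A. paid $3.50 (about 10%) for a well-known snake_case name.'

# Assumption: stands in for nltk.regexp_tokenize(sample, sentence_re).
toks = re.findall(sentence_re, sample)
print(toks)
```

The pattern now compiles without the "nothing to repeat" error, and the underscore comes out as a separate token.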

See the regex demo.
