Capturing apostrophes with a regular expression

Posted 2024-06-28 20:30:52


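I am using Python's re module to capture all modified forms of the word color in American English (AmE) and British English (BrE). I have managed to capture almost all of them, except for the words that end with an apostrophe. The problem comes from Watt's book.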

Here is the sample text:

Red is a color.
His collar is too tight or too colouuuurful.
These are bright colours.
These are bright colors.
Calorific is a scientific term.
“Your life is very colorful,” she said.
color (U.S. English, singular noun)
colour (British English, singular noun)
colors (U.S. English, plural noun)
colours (British English, plural noun)
color’s (U.S. English, possessive singular)
colour’s (British English, possessive singular)
colors’ (U.S. English, possessive plural)
colours’ (British English, possessive plural)

Here is my regular expression: \bcolou?r(?:[a-zA-Z’s]+)?\b

Explanation:

\b                 # Start at a word boundary
colou?r            # "u" is optional (it is absent in AmE)
    (?:            # Start non-capturing group
    [a-zA-Z’s]+    # color may be followed by a suffix (e.g. "ful") or an apostrophe
    )?             # End non-capturing group; the whole group is optional
\b                 # End at a word boundary

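The problem is that colors’ and colours’ are matched only up to the s; the apostrophe is ignored. Can someone explain what is wrong with my code? I researched this on SO in "Regex Apostrophe how to match?", as well as in questions about escaping ' and ".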

Here it is on Regex101.
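
The behavior described above can be reproduced with a short script. This is only a minimal sketch: the pattern is the one from the question, the three input lines are taken from the sample text, and the variable names are illustrative.

```python
import re

# The pattern from the question, unchanged (note the trailing \b).
pattern = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?\b")

# Three lines from the sample text: one possessive singular, two possessive plurals.
sample = [
    "color’s (U.S. English, possessive singular)",
    "colors’ (U.S. English, possessive plural)",
    "colours’ (British English, possessive plural)",
]

# Collect the first match on each line.
matches = [pattern.search(line).group() for line in sample]
print(matches)  # ['color’s', 'colors', 'colours']
```

The possessive singular is matched in full because the match ends on the letter s, while both possessive plurals lose their final apostrophe.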

Thanks in advance.


2 Answers

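The problem is that \b is a word boundary, and in ...lors’ the position between ’ and the following space is not a word boundary, because neither ’ nor the space is a word character. So the engine backtracks to the boundary between s and ’ and the match ends there. Instead of \b, use a lookahead for a space, period, comma, or whatever else may follow: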

\bcolou?r(?:[a-zA-Z’s]+)?(?=[ .,])

https://regex101.com/r/lB49Nr/3
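
As a rough check of the lookahead version, here is a small sketch that runs the pattern above over a few lines from the question's sample text. The variable names are illustrative, and the set of characters in the lookahead is just the one suggested in this answer.

```python
import re

# Trailing \b replaced by a lookahead for a space, period, or comma.
pattern = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?(?=[ .,])")

text = (
    "Red is a color.\n"
    "“Your life is very colorful,” she said.\n"
    "color’s (U.S. English, possessive singular)\n"
    "colours’ (British English, possessive plural)\n"
    "His collar is too tight or too colouuuurful.\n"
)

# The pattern has no capturing groups, so findall returns whole matches.
found = pattern.findall(text)
print(found)  # ['color', 'colorful', 'color’s', 'colours’']

# The lookahead needs a following character from the set, so a word at the
# very end of the input is not matched.
at_end = pattern.search("bright colors")
print(at_end)  # None
```

If words at the end of the input or before other punctuation also need to match, the lookahead set has to be widened accordingly, for example with an alternative such as (?=[ .,]|$).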

The problem is the trailing \b. It is defined as:

\b Matches, without consuming any characters, immediately between a character matched by \w and a character not matched by \w (in either order). It cannot be used to separate non words from words.

’ is not in the \w group. Try removing the trailing \b: \bcolou?r(?:[a-zA-Z’s]+)?
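
A quick way to check both claims in Python is sketched below: it confirms that ’ is not matched by \w and then runs the pattern without the trailing \b. The input lines come from the question's sample text, and the variable names are illustrative.

```python
import re

# The typographic apostrophe (U+2019) is punctuation, so \w does not match it.
apostrophe_is_word_char = re.fullmatch(r"\w", "’") is not None
print(apostrophe_is_word_char)  # False

# Same pattern as in the question, with the trailing \b removed.
pattern = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?")

lines = [
    "colors’ (U.S. English, possessive plural)",
    "colours’ (British English, possessive plural)",
    "These are bright colors.",
]

# Without the trailing \b the greedy character class keeps the apostrophe.
matches = [pattern.search(line).group() for line in lines]
print(matches)  # ['colors’', 'colours’', 'colors']
```

Without the trailing boundary the match simply stops at the first character outside the class [a-zA-Z’s], which is enough for this sample text.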
