Remove repeated letters except in abbreviations

aaa --> it is untouched because all are the same letters
aa --> it is untouched because all are the same letters
a --> not touched, just one letter
broom --> brom
school --> schol
boo --> should be bo
gool --> gol
ooow --> should be ow

2 answers

User

#1 · edited 2024-09-29 19:30:54

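The regex does not match boo because it only looks for repeats that have at least one different character both before and after them.

One option is to use a simpler regex that collapses every run of repeated letters, and then fall back to the original string when the result is a single character.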

import re

def remove_duplicate(string):
    # Collapse every run of the same ASCII letter into one letter
    new_string = re.sub(r'([a-zA-Z])\1+', r'\1', string)
    return new_string if len(new_string) > 1 else string
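
The function above treats its argument as a single word. As a minimal sketch, it could be applied to each word of a longer text by using a replacement function. This assumes that a word is a run of ASCII letters, and the helper name remove_duplicates_in_text is hypothetical, not part of the answer:

```python
import re

def remove_duplicate(string):
    # Same function as in the answer: collapse runs, keep one-letter words intact
    new_string = re.sub(r'([a-zA-Z])\1+', r'\1', string)
    return new_string if len(new_string) > 1 else string

def remove_duplicates_in_text(text):
    # Hypothetical helper: run remove_duplicate on every run of ASCII letters
    return re.sub(r'[a-zA-Z]+', lambda m: remove_duplicate(m.group(0)), text)

print(remove_duplicates_in_text("aaa, aa, a,broom, school...boo, gool, ooow."))
# => aaa, aa, a,brom, schol...bo, gol, ow.
```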

Here is a possible solution without a regex. It is faster, but it also removes repeated whitespace and punctuation, not just letters.

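def remove_duplicate(string):
    new_string = ''
    last_c = None
    for c in string:
        # Keep a character only if it differs from the one kept just before it
        if c != last_c:
            new_string += c
            last_c = c
    # A one-character result means the input was one repeated character: leave it alone
    return new_string if len(new_string) > 1 else string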
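
If collapsing whitespace and punctuation is not wanted, one possible variant groups consecutive characters with itertools.groupby and collapses only the runs of letters. This is a sketch rather than part of the answer: the name remove_duplicate_letters is hypothetical, and str.isalpha() accepts any Unicode letter, not only ASCII ones.

```python
from itertools import groupby

def remove_duplicate_letters(string):
    # Hypothetical variant: collapse runs of letters, keep other runs as they are
    parts = []
    for char, run in groupby(string):
        run = ''.join(run)
        parts.append(char if char.isalpha() else run)
    new_string = ''.join(parts)
    # Same fallback as above: a one-character result returns the input unchanged
    return new_string if len(new_string) > 1 else string

print(remove_duplicate_letters("boo!!"))  # bo!!
print(remove_duplicate_letters("aaa"))    # aaa
```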

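User

#2 · edited 2024-09-29 19:30:54

You can match whole words that consist of a single repeated character and capture them into a capturing group, then match repeated consecutive letters in all other contexts, and replace accordingly: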

import re
text = "aaa, aa, a,broom, school...boo, gool, ooow."
print( re.sub(r'\b(([a-zA-Z])\2+)\b|([a-zA-Z])\3+', r'\1\3', text) )
# => aaa, aa, a,brom, schol...bo, gol, ow.

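See the Python demo and the regex demo.

Regex details

\b - a word boundary
(([a-zA-Z])\2+) - Group 1: an ASCII letter (captured into Group 2), followed by one or more occurrences of the same letter
\b - a word boundary
| - or
([a-zA-Z]) - Group 3: an ASCII letter
\3+ - one or more occurrences of the letter captured in Group 3

The replacement is the concatenation of the Group 1 and Group 3 values, so a word made of one repeated letter is kept whole and any other run is reduced to a single letter.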
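
To make the role of the two groups more explicit, the same substitution can be sketched with a replacement function instead of the r'\1\3' template. The names PATTERN and collapse are hypothetical, chosen only for this illustration:

```python
import re

PATTERN = re.compile(r'\b(([a-zA-Z])\2+)\b|([a-zA-Z])\3+')

def collapse(match):
    # Group 1 participates only when the whole word is one repeated letter
    if match.group(1) is not None:
        return match.group(1)  # keep the word unchanged
    return match.group(3)      # otherwise keep a single copy of the letter

text = "aaa, aa, a,broom, school...boo, gool, ooow."
print(PATTERN.sub(collapse, text))
# => aaa, aa, a,brom, schol...bo, gol, ow.
```

The r'\1\3' template relies on unmatched groups being replaced with an empty string, which Python does from version 3.5 onward. The function form avoids that dependency.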

To match any Unicode letter, replace [a-zA-Z] with [^\W\d_].
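
A short sketch of the Unicode variant, assuming Python 3 str patterns (where \W, \d and \b are Unicode-aware by default). The name UNICODE_PATTERN and the Greek sample text are made up for illustration:

```python
import re

# [^\W\d_] matches a word character that is neither a digit nor an underscore
UNICODE_PATTERN = re.compile(r'\b(([^\W\d_])\2+)\b|([^\W\d_])\3+')

text = "ααα, λλαμα, boo"
print(UNICODE_PATTERN.sub(r'\1\3', text))
# => ααα, λαμα, bo
```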
