How can I optimize string replacement in Python?

Posted 2024-06-28 10:33:42


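Let me say up front that this is not a duplicate of questions about replacing letters in a string. This is about replacing whole substrings.

I have a set of documents in which I need to replace several different substrings with an empty string or some other value. What is the fastest way to do this? Is there anything faster than using regular expressions?

Word-by-word or character-by-character replacement performs very differently, but that approach does not work here unless some form of string matching is used.

Here are my attempts so far: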

import re 

def standard_for_loop(string, replacements):
    # case sensitive for loop
    for key, value in replacements.items():
        # would not work unless case specific
        string = string.replace(key, value)
        
    return string 


def regex_loop(string, replacements):
    #case insensitive regex substitution in for
    for key, value in replacements.items():
        # flags must be passed by keyword: the fourth positional argument of re.sub is count
        string = re.sub(key, value, string, flags=re.IGNORECASE)
        
    return string
    

def regex_multiple(string, replacements):
    # case insensitive regex substitution using lambda 
    pattern = re.compile("({})".format("|".join(replacements.keys())), re.IGNORECASE)
    return pattern.sub(lambda m: replacements[m.string.lower()[m.start():m.end()]], string)
    

    
def case_insensitive_for_loop(string, replacements):
    def find_next(string, pattern, sub):
        if pattern.lower() in string.lower():
            
            match = string.lower().index(pattern.lower())
            end = int(match + len(pattern))
            
            new_string = string[end:]
            
            # yield a replaced substring of original string
            yield string[:match] + sub
            yield from find_next(new_string, pattern, sub)
            
    '''
    # this is what I'm unsure about. How to negate need for 
    # for loop here and how to fix the append issue.
    # Currently the functionality works but it appends output 
    # replacement to the result. I know the "+=" is the 
    # cause of the problem, but I'm not sure how to fix this. 
    '''
    result = ''
    for k, v in replacements.items():
        for output in find_next(string, k, v):
            result += output
    return result

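There are two problems. In my experience, regex_multiple is the most accurate, but it takes a considerable amount of time to finish. The second most accurate is case_insensitive_for_loop, but I don't know how to get around the replace-versus-append problem.

For example, it would replace the document: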

# for a sample document 

doc = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan pulvinar massa ut pulvinar. Cras blandit quam non dictum tempus. Maecenas id posuere nibh. Nullam sit amet pharetra neque. Etiam nec imperdiet tellus. Nulla facilisi. Proin sit amet massa aliquam, pulvinar justo in, suscipit purus. Fusce in tempus orci. In consectetur, ipsum nec volutpat dapibus, felis magna scelerisque enim, et rutrum nunc ligula eget augue. Phasellus aliquam feugiat venenatis. Sed lobortis pharetra ipsum ut venenatis.

Nullam ut accumsan orci. Vivamus faucibus augue in facilisis facilisis. Donec ut scelerisque ipsum. Ut mollis elit nibh, ut vulputate eros ultrices ac. Nunc ac urna sed libero imperdiet maximus non sed dui. Morbi ornare eu eros eget pharetra. Vivamus vestibulum nisi eu eros pulvinar aliquet.

Maecenas at justo bibendum, viverra urna nec, pellentesque orci. Cras ut molestie sem. Proin in tincidunt ex. Aliquam euismod id ligula a bibendum. Morbi at diam euismod, auctor ex non, venenatis ante. Proin convallis ex eu semper posuere. Etiam sed tincidunt massa. Vivamus aliquam mollis massa, nec lacinia est dictum vitae. In varius convallis pulvinar. Pellentesque aliquet pulvinar nibh vel dictum"""

#replacement strings where k is the substring to be searched and v is the value to be replaced with
repl = { 'venenatis ante':'', 
'ipsum nec volutpat dapibus' : '',
'ipsum vulputate accumsan' : '',
'dolor sit amet':'', 
'vivamus aliquam mollis massa':''
}

with:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan pulvinar massa ut pulvinar. Cras blandit quam non dictum tempus. Maecenas id posuere nibh. Nullam sit amet pharetra neque. Etiam nec imperdiet tellus. Nulla facilisi. Proin sit amet massa aliquam, pulvinar justo in, suscipit purus. Fusce in tempus orci. In consectetur, ipsum nec volutpat dapibus, felis magna scelerisque enim, et rutrum nunc ligula eget augue. Phasellus aliquam feugiat venenatis. Sed lobortis pharetra ipsum ut venenatis.

Nullam ut accumsan orci. Vivamus faucibus augue in facilisis facilisis. Donec ut scelerisque ipsum. Ut mollis elit nibh, ut vulputate eros ultrices ac. Nunc ac urna sed libero imperdiet maximus non sed dui. Morbi ornare eu eros eget pharetra. Vivamus vestibulum nisi eu eros pulvinar aliquet.

Maecenas at justo bibendum, viverra urna nec, pellentesque orci. Cras ut molestie sem. Proin in tincidunt ex. Aliquam euismod id ligula a bibendum. Morbi at diam euismod, auctor ex non, Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan pulvinar massa ut pulvinar. Cras blandit quam non dictum tempus. Maecenas id posuere nibh. Nullam sit amet pharetra neque. Etiam nec imperdiet tellus. Nulla facilisi. Proin sit amet massa aliquam, pulvinar justo in, suscipit purus. Fusce in tempus orci. In consectetur, Lorem ipsum Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan pulvinar massa ut pulvinar. Cras blandit quam non dictum tempus. Maecenas id posuere nibh. Nullam sit amet pharetra neque. Etiam nec imperdiet tellus. Nulla facilisi. Proin sit amet massa aliquam, pulvinar justo in, suscipit purus. Fusce in tempus orci. In consectetur, ipsum nec volutpat dapibus, felis magna scelerisque enim, et rutrum nunc ligula eget augue. Phasellus aliquam feugiat venenatis. Sed lobortis pharetra ipsum ut venenatis.

Nullam ut accumsan orci. Vivamus faucibus augue in facilisis facilisis. Donec ut scelerisque ipsum. Ut mollis elit nibh, ut vulputate eros ultrices ac. Nunc ac urna sed libero imperdiet maximus non sed dui. Morbi ornare eu eros eget pharetra. Vivamus vestibulum nisi eu eros pulvinar aliquet.

Maecenas at justo bibendum, viverra urna nec, pellentesque orci. Cras ut molestie sem. Proin in tincidunt ex. Aliquam euismod id ligula a bibendum. Morbi at diam euismod, auctor ex non, venenatis ante. Proin convallis ex eu semper posuere. Etiam sed tincidunt massa.

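After comparing them, standard_for_loop is the fastest at 4 µs per loop over 50k loops. The second fastest is regex_loop at 14 µs per loop over 20k loops, followed by case_insensitive_for_loop at 28.3 µs per loop over 10k loops. regex_multiple, with its lambda expression, takes the longest at 103 µs per loop over 2k loops.

These figures come from the Python timeit output for each function.

I would like to know whether there are any string-matching algorithms I have overlooked. Any suggestions are welcome.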


1 answer
User
#1 · Posted 2024-06-28 10:33:42

regex_multiple is inefficient because it lowercases the entire string every time there is a match. You only need to lowercase the matched text. Here is how:

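import re

def regex_multiple(string, replacements):
    # case insensitive regex substitution using lambda
    # (the keys of replacements are expected to be lower case)
    pattern = re.compile("({})".format("|".join(replacements.keys())), re.IGNORECASE)
    # lower-case only the matched text (m[0]), not the whole input string
    return pattern.sub(lambda m: replacements[m[0].lower()], string)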

This solution should be faster than the other case-insensitive implementations, and much faster than the original on large documents.
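One caveat with the pattern above is that the keys are joined without escaping, so a key containing regex metacharacters such as `+` or `.` would be interpreted as a pattern. The following is a minimal sketch of a more defensive variant, not part of the original answer. The name `regex_multiple_escaped` is hypothetical, and the sketch assumes that keys should match literally and that keys are unique once lowercased.

```python
import re

def regex_multiple_escaped(string, replacements):
    # Normalise keys to lower case so the lookup in the callback can succeed
    # regardless of how the keys were written.
    lowered = {key.lower(): value for key, value in replacements.items()}
    # Try longer keys first so a key that is a prefix of another cannot shadow it.
    keys = sorted(lowered, key=len, reverse=True)
    # re.escape makes every key match literally.
    pattern = re.compile("|".join(re.escape(k) for k in keys), re.IGNORECASE)
    # Lower-case only the matched text; if the lowered match is not a key
    # (possible for a few non-ASCII characters), leave the text unchanged.
    return pattern.sub(lambda m: lowered.get(m[0].lower(), m[0]), string)
```

For instance, `regex_multiple_escaped("Dolor Sit Amet, a+b", {"dolor sit amet": "X", "a+b": "Y"})` replaces both substrings, whereas the unescaped pattern would read `a+b` as "one or more a followed by b".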

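Note, however, that you are comparing case-sensitive and case-insensitive methods. Case-insensitive replacement is computationally much more expensive and therefore slower. To be fair, you should compare methods that do the same thing.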
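A like-for-like comparison could be set up roughly as follows. This is only a sketch: the function names `ci_regex_loop` and `ci_regex_single_pass`, the sample text, and the repetition count are assumptions made for illustration, and the timings it prints will vary by machine. Both functions are case-insensitive, and the script first checks that they produce the same output before timing them.

```python
import re
import timeit

def ci_regex_loop(string, replacements):
    # One case-insensitive pass over the text per key.
    for key, value in replacements.items():
        string = re.sub(re.escape(key), value, string, flags=re.IGNORECASE)
    return string

def ci_regex_single_pass(string, replacements):
    # A single case-insensitive pass using an alternation of all keys
    # (keys are assumed to be lower case).
    pattern = re.compile("|".join(re.escape(k) for k in replacements), re.IGNORECASE)
    return pattern.sub(lambda m: replacements[m[0].lower()], string)

sample = "Lorem ipsum Dolor Sit Amet, consectetur. Morbi auctor ex non, venenatis ANTE. " * 50
repl = {"dolor sit amet": "", "venenatis ante": ""}

# Only time functions that agree on the result.
assert ci_regex_loop(sample, repl) == ci_regex_single_pass(sample, repl)

for func in (ci_regex_loop, ci_regex_single_pass):
    seconds = timeit.timeit(lambda: func(sample, repl), number=200)
    print(f"{func.__name__}: {seconds / 200 * 1e6:.1f} µs per call")
```

Note that the two approaches agree here because the replacements do not interact; when one replacement can create or destroy a match for another key, a sequential loop and a single pass may legitimately give different results.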

Finally, if you are processing ASCII documents, you can add the re.ASCII flag to the regexp. This speeds up matching a little.
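As a rough sketch of how the flag could be combined with re.IGNORECASE (the helper name `build_pattern` and the sample keys are assumptions for illustration): with re.ASCII set, case-insensitive matching only considers ASCII letters, so this is appropriate only when the documents really are ASCII. Any speed-up should be measured on your own data.

```python
import re

def build_pattern(keys, ascii_only=False):
    # Combine flags with bitwise OR; re.ASCII restricts case-insensitive
    # matching (and \w, \b, \d, \s) to ASCII characters.
    flags = re.IGNORECASE | re.ASCII if ascii_only else re.IGNORECASE
    return re.compile("|".join(re.escape(k) for k in keys), flags)

repl = {"dolor sit amet": "", "venenatis ante": ""}
text = "Lorem ipsum DOLOR SIT AMET, auctor ex non, Venenatis Ante."

unicode_pattern = build_pattern(repl)
ascii_pattern = build_pattern(repl, ascii_only=True)

# On pure ASCII input both patterns produce the same result.
unicode_result = unicode_pattern.sub(lambda m: repl[m[0].lower()], text)
ascii_result = ascii_pattern.sub(lambda m: repl[m[0].lower()], text)
```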
