Python: speeding up re.sub when redacting files (multiprocessing doesn't help)

Posted 2024-06-26 08:32:10

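I need to redact certain words from several thousand files. The words to remove are listed in a separate reference file of about 50,000 entries.

With the code I wrote, this process would take weeks, and I need to make it faster.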

import glob, re, sys
import multiprocessing  # needed for multiprocessing.cpu_count() below
from multiprocessing.dummy import Pool as ThreadPool

def redact_file(file):
    with open(file, 'r') as myfile:
        data=myfile.read()

        for word in words_to_redact:
            search_term = r"(?:\b)"+word+r"(?:\b)"
            data = re.sub(search_term, '#', data, flags=re.IGNORECASE)  #this seems to be the slow bit?

    with open(file+'_REDACTED', 'w') as file:
        file.write(data)


if __name__ == "__main__":
    words_to_redact = []
    with open ("words_to_redact.txt") as myfile:    #about 50,000 rows in this reference file
        words_to_redact=myfile.read().splitlines()

    input_files = glob.glob("input_*.txt")

    pool = ThreadPool(multiprocessing.cpu_count()) 
    pool.map(redact_file, input_files)

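Using multiprocessing doesn't seem to help.

I think the performance problem comes from calling re.sub 50,000 times per file. Each iteration creates a new copy of the "data" string, so I suspect the process is limited by memory/cache speed.

I think I have to use re.sub, because a regex is the only way to match whole words.

Is there a way to run re.sub without making a copy each time, or some other way to make this faster?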

1 answer

Answer #1 · posted 2024-06-26 08:32:10

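Compile the pattern once with re.compile() instead of every time you run a search.
Put all of your words into one big pattern.

Your code might then look like this: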

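import re

words_to_redact = ['aa', 'bb', 'cc']  # placeholder list; load the real words from the file

# Build one alternation and compile it once. re.escape() assumes the entries are
# literal words, not regexes; IGNORECASE mirrors the flag used in the question.
patt = re.compile(r"\b(?:" + '|'.join(map(re.escape, words_to_redact)) + r")\b", re.IGNORECASE)

data = "aa bb xaa CC"  # sample text, standing in for the contents of one file
data = patt.sub('#', data)  # one call per file, no loop over the words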
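
An alternation of 50,000 words may still be slow, because Python's re module tries the alternatives one after another at each position. A different technique (not the one in the answer above) is to match every word with one generic pattern and look each token up in a set. The sketch below assumes that every entry in the reference file is a single run of word characters (no spaces, hyphens, or apostrophes) and that entries are literal words rather than regexes; the names make_redactor, replace, and redact are made up for illustration.

```python
import re

# One generic word pattern; the 50,000 words never go into the regex itself.
WORD = re.compile(r"\w+")


def make_redactor(words_to_redact):
    """Return a function that replaces every listed word with '#'."""
    # casefold() makes the comparison case-insensitive (assumption: this is
    # an acceptable stand-in for re.IGNORECASE on the reference words).
    blocked = {w.casefold() for w in words_to_redact if w}

    def replace(match):
        token = match.group(0)
        # Set membership is a hash lookup, independent of the list size.
        return "#" if token.casefold() in blocked else token

    def redact(text):
        # A single pass over the text; each token is checked once.
        return WORD.sub(replace, text)

    return redact


redact = make_redactor(["apple", "Banana"])
print(redact("An APPLE, a banana and a pineapple."))
# prints: An #, a # and a pineapple.
```

Because the tokens are whole runs of word characters, "pineapple" is left alone even though it contains "apple", which is the same effect the \b anchors give in the regex version.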

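On the multiprocessing part of the question: multiprocessing.dummy is a thread-based wrapper around the threading module, and in CPython threads generally cannot run CPU-bound regex work in parallel because of the global interpreter lock. A process pool may help more. The following is a minimal sketch, assuming the file names from the question; build_pattern, init_worker, and redact_all are hypothetical helper names, and the pattern is compiled once per worker rather than once per file.

```python
import glob
import re
from multiprocessing import Pool

_pattern = None  # compiled once in each worker process by init_worker()


def build_pattern(words):
    """Compile one case-insensitive alternation from literal words."""
    # Longest first, so an entry that contains a shorter entry plus
    # non-word characters (e.g. "a-b" vs "a") is tried before the short one.
    escaped = sorted((re.escape(w) for w in words if w), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def init_worker(words):
    """Pool initializer: build the pattern once per process."""
    global _pattern
    _pattern = build_pattern(words)


def redact_file(path):
    """Write a redacted copy of path next to it and return the path."""
    with open(path, "r") as src:
        data = src.read()
    with open(path + "_REDACTED", "w") as dst:
        dst.write(_pattern.sub("#", data))
    return path


def redact_all(words_path, file_glob):
    """Redact every file matching file_glob using real worker processes."""
    with open(words_path) as ref:
        words = ref.read().splitlines()
    # Pool() defaults to one worker per CPU.
    with Pool(initializer=init_worker, initargs=(words,)) as pool:
        return list(pool.imap_unordered(redact_file, glob.glob(file_glob)))
```

A call such as redact_all("words_to_redact.txt", "input_*.txt") should sit under an `if __name__ == "__main__":` guard, since platforms that spawn worker processes re-import the main module.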