Memory leak(ish?) when using re and mmap

import re
import mmap

pattern = re.compile(b"PATTERN.{1,20}", re.DOTALL)
f = open("file.bin", "rb")
mem = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
results = []
for match in pattern.finditer(mem):
    results.append(match.group(0))
f.close()

1 answer

User

#1 · Posted 2024-10-01 07:12:20

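I'm not sure there is a way to fix this. You are reading a lot of data as fast as the disk can supply it. Unless you have a huge amount of memory, you will run out at some point and something will have to be freed. Most operating systems use an LRU-style (least recently used) policy to decide what to evict from RAM. Because you are accessing the data as fast as possible, most of the memory used by the memory-mapped file will have a recent-ish access time. That makes those pages "poor" candidates for eviction from RAM (at least as far as the OS is concerned).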

Basically, when memory runs low, the OS is making a bad choice about what to evict from RAM.
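To make the eviction argument concrete, here is a toy simulation of a strict LRU page cache. It is only a sketch: real kernels use approximations of LRU, and the `LRUCache` class, the page names and the sizes are all made up for illustration. It shows that a single sequential pass over a file larger than the cache pushes out every page that was resident before the scan.

```python
from collections import OrderedDict

class LRUCache:
    """Toy strict-LRU page cache (hypothetical; real kernels only approximate LRU)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # least recently used first

    def touch(self, page):
        # Accessing a page makes it the most recently used one.
        if page in self.pages:
            self.pages.move_to_end(page)
        else:
            self.pages[page] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=4)

# Pages belonging to other programs are resident first.
for page in ["app-0", "app-1", "app-2"]:
    cache.touch(page)

# A sequential scan over a 10-page file, each page touched exactly once.
for n in range(10):
    cache.touch(f"file-{n}")

resident = list(cache.pages)
print(resident)  # ['file-6', 'file-7', 'file-8', 'file-9']
```

Everything left in the cache is a file page that the scan will never read again, while the pages the other programs still need are gone.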

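However, you know better than the OS which memory can safely be freed. So you could scan the file in chunks. That explicitly lets the OS release the memory for the earlier parts of the file once you no longer need them. Of course, this creates problems at chunk boundaries.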
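The boundary problem is easy to reproduce on a small in-memory buffer. The sketch below uses made-up sample data and a deliberately tiny chunk size; it compares a single search over the whole buffer with a naive chunked search that has no overlap between chunks.

```python
import re

regex = re.compile(rb"PATTERN\d{1,20}\n")
data = b"xxPATTERN123\nyyyyPATTERN4567\nzz"  # made-up sample data

# Reference result: search the whole buffer at once.
whole = [m.group() for m in regex.finditer(data)]

# Naive chunking: fixed-size pieces with no overlap between them.
chunk_size = 20
naive = []
for i in range(0, len(data), chunk_size):
    chunk = data[i:i + chunk_size]
    naive.extend(m.group() for m in regex.finditer(chunk))

print(whole)  # [b'PATTERN123\n', b'PATTERN4567\n']
print(naive)  # [b'PATTERN123\n']
```

The second match straddles byte 20, so neither chunk contains it in full and the naive version silently drops it.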

For example, here is something you could do to improve your program's memory behavior:

import re
import mmap
import os

filename = "some_file.txt"
file_size = os.stat(filename).st_size
# mmap offsets must be a multiple of mmap.ALLOCATIONGRANULARITY,
# so chunks start at multiples of chunk_size (2**32 qualifies).
chunk_size = 2**32
# chunk_size = 50  # smaller chunk_size I used for testing
regex = re.compile(rb"PATTERN\d{1,20}\n")
max_length = len(b"PATTERN") + 20 + len(b"\n")  # longest possible match
overlap = max_length - 1  # extra bytes mapped past the end of each chunk

matches = []
with open(filename, "rb") as f:
    for i in range(0, file_size, chunk_size):
        # map the chunk plus the overlap, clipped to the end of the file
        length = min(chunk_size + overlap, file_size - i)

        m = mmap.mmap(f.fileno(), length=length, offset=i, access=mmap.ACCESS_READ)
        # f.seek(i)  # used for testing
        # m = f.read(length)

        for match in regex.finditer(m):
            # A match that starts in this chunk always fits inside the
            # mapped window. A match that starts in the overlap belongs
            # to the next chunk; skipping it here avoids duplicates.
            if match.start() < chunk_size:
                matches.append(match.group())
        m.close()
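Boundary handling is easy to get wrong, so it is worth checking on a small in-memory buffer before trusting it on a huge file. The sketch below uses one possible scheme: each window extends one match-length past its chunk, and only matches that start inside the chunk are kept. `find_chunked` is a hypothetical helper that slices a bytes object instead of mapping a file, and the sample data is made up. It assumes matches of the pattern cannot overlap each other, which holds for this regex.

```python
import re

def find_chunked(data, regex, chunk_size, max_length):
    """Search `data` window by window; hypothetical helper for testing only."""
    overlap = max_length - 1
    found = []
    for i in range(0, len(data), chunk_size):
        # Window = this chunk plus enough extra bytes to finish any
        # match that starts on the chunk's last byte.
        window = data[i:i + chunk_size + overlap]
        for match in regex.finditer(window):
            # Matches starting in the overlap are left to the next window.
            if match.start() < chunk_size:
                found.append(match.group())
    return found

regex = re.compile(rb"PATTERN\d{1,20}\n")
max_length = len(b"PATTERN") + 20 + len(b"\n")
data = b"xxPATTERN123\nyyyyPATTERN4567\nzz" * 50  # made-up sample data

# The chunked search should agree with one search over the whole buffer,
# whatever the chunk size is.
expected = [m.group() for m in regex.finditer(data)]
for chunk_size in (1, 7, 20, 50, 1000):
    assert find_chunked(data, regex, chunk_size, max_length) == expected
print(len(expected))  # 100
```

Trying several chunk sizes, including ones smaller than a single match, moves the boundaries across every part of the pattern.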
