How can I read a text file line by line in reverse order?

3 Answers

User

Answer 1 · edited 2024-05-02 03:38:48

While @martineau's solution gets the job done without loading the entire file into memory, it does waste time reading the whole file twice.

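An arguably more efficient, one-pass approach is to read from the end of the file into a buffer in reasonably large chunks and search backwards from the end of the buffer for the next newline (skipping the last character, which may be the current line's own trailing newline). If no newline is found, seek further back and keep reading chunks, prepending each one to the buffer, until a newline turns up. Use a larger chunk size for more efficient reads, as long as it stays within your memory limit: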

class ReversedTextReader:
    """Iterate over the lines of a seekable text stream from last to first."""

    def __init__(self, file, chunk_size=50):
        self.file = file
        file.seek(0, 2)  # Jump to the end of the stream.
        self.position = file.tell()  # Everything before this offset is unread.
        self.chunk_size = chunk_size
        self.buffer = ''  # Text already read but not yet returned.

    def __iter__(self):
        return self

    def __next__(self):
        if not self.position and not self.buffer:
            raise StopIteration
        while True:
            # Skip the buffer's last character so that a trailing newline
            # stays attached to the line it terminates.
            line_start = self.buffer.rfind('\n', 0, len(self.buffer) - 1) + 1
            if line_start or not self.position:
                break
            # No complete line buffered yet: prepend the preceding chunk.
            chunk_size = min(self.chunk_size, self.position)
            self.position -= chunk_size
            self.file.seek(self.position)
            self.buffer = self.file.read(chunk_size) + self.buffer
        line = self.buffer[line_start:]
        self.buffer = self.buffer[:line_start]
        return line

So that:

from io import StringIO

f = StringIO('''2018/03/25-00:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/25-10:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
2018/03/25-20:08:50.486601 1.5M     7FE9D3D41706     qojfcmqcacaeia
2018/03/25-24:08:50.980519  16K     7FE9BD1AF707     user: number is 93823004
2018/03/26-00:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/26-10:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/26-15:08:51.066968    1     7FE9BDC91700     std:ZMD:
''')

for line in ReversedTextReader(f):
    print(line, end='')

Output:

2018/03/26-15:08:51.066968    1     7FE9BDC91700     std:ZMD:
2018/03/26-10:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/26-00:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/25-24:08:50.980519  16K     7FE9BD1AF707     user: number is 93823004
2018/03/25-20:08:50.486601 1.5M     7FE9D3D41706     qojfcmqcacaeia
2018/03/25-10:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
2018/03/25-00:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
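
A note for files on disk: in text mode, Python only supports seeking to offsets previously returned by tell() (or to the very start or end), so the arbitrary seeks above are only safe on something like StringIO. A minimal sketch of one possible adaptation, not part of the original answer, opens the file in binary mode and decodes each line afterwards. The function name reversed_lines, its parameters, and the 4096-byte default are assumptions made for illustration, and the byte-level search for b'\n' assumes an ASCII-compatible encoding such as UTF-8.

```python
import os
import tempfile


def reversed_lines(path, chunk_size=4096, encoding='utf-8'):
    """Yield the lines of the file at `path` from last to first.

    Hypothetical helper: assumes an ASCII-compatible encoding, so that
    b'\n' never occurs inside a multi-byte character.
    """
    with open(path, 'rb') as file:
        file.seek(0, os.SEEK_END)
        position = file.tell()  # Bytes before this offset are still unread.
        buffer = b''            # Bytes read but not yet yielded.
        while position or buffer:
            # Skip the final byte so a trailing newline stays with its line.
            line_start = buffer.rfind(b'\n', 0, len(buffer) - 1) + 1
            if not line_start and position:
                # No complete line buffered yet: prepend an earlier chunk.
                size = min(chunk_size, position)
                position -= size
                file.seek(position)
                buffer = file.read(size) + buffer
                continue
            yield buffer[line_start:].decode(encoding)
            buffer = buffer[:line_start]


# Usage: a tiny chunk size forces several backward reads.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'log.txt')
    with open(path, 'wb') as f:
        f.write('first\nsecond\nthird\n'.encode('utf-8'))
    print(list(reversed_lines(path, chunk_size=4)))
    # prints ['third\n', 'second\n', 'first\n']
```

Decoding only complete lines means a chunk boundary can fall in the middle of a multi-byte character without causing a decode error.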

User

Answer 2 · edited 2024-05-02 03:38:48

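No, there isn't a better way to do this. By definition, a file is a sequential organization of some basic data type. A text file's type is character. You are trying to impose a different organization on the file: strings separated by newlines.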

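Thus, you have to do the work of reading the file, recasting it into your desired format, and then traversing that organization in reverse order. For instance, if you need this multiple times, read the file as lines, store the lines as database records, and then iterate through the records as you see fit.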
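
As a rough sketch of the database idea, assuming SQLite via Python's built-in sqlite3 module (the answer does not name a database), the lines can be stored once with their line numbers and then read back in any order. The table name lines, its columns, and the two helper functions are hypothetical names chosen for illustration.

```python
import sqlite3
from io import StringIO


def store_lines(connection, lines):
    """Store each line together with its 1-based line number."""
    connection.execute(
        'CREATE TABLE IF NOT EXISTS lines (line_no INTEGER PRIMARY KEY, text TEXT)'
    )
    connection.executemany(
        'INSERT INTO lines (line_no, text) VALUES (?, ?)',
        enumerate(lines, start=1),
    )
    connection.commit()


def lines_in_reverse(connection):
    """Return an iterator over the stored lines, last line first."""
    cursor = connection.execute('SELECT text FROM lines ORDER BY line_no DESC')
    return (text for (text,) in cursor)


source = StringIO('first\nsecond\nthird\n')  # Stand-in for a real file.
db = sqlite3.connect(':memory:')  # Pass a file path instead to keep the records.
store_lines(db, source)
for line in lines_in_reverse(db):
    print(line, end='')
db.close()
```

This still reads the whole file once up front; the benefit only shows when the same data has to be traversed repeatedly.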

The file interface reads in only one direction. You can seek() to another location, but the standard I/O operations work only with increasing positions.

For your solution to work, you will need to read in the entire file, because you cannot reverse the file descriptor's implicit iterator.
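
To illustrate with a small sketch (StringIO stands in for a real file, and the helper name whole_file_reversed is made up for this example): calling reversed() directly on a file object raises TypeError, whereas reading all lines into a list first makes them reversible, at the cost of holding the whole file in memory.

```python
from io import StringIO


def whole_file_reversed(file):
    """Read every line into memory, then iterate over them backwards."""
    return reversed(file.readlines())


f = StringIO('first\nsecond\nthird\n')

try:
    reversed(f)  # File objects are iterators, not sequences.
except TypeError as error:
    print('cannot reverse the file object:', error)

for line in whole_file_reversed(f):
    print(line, end='')
```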

User

Answer 3 · edited 2024-05-02 03:38:48

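Here is a way to do it without reading the whole file into memory at once. It does require reading the entire file first, but it only stores where each line starts. Once those offsets are known, it can use the seek() method to access the lines randomly, in any order desired.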

Here is an example using your input file:

# Preprocess - read whole file and note where lines start.
# (Needs to be done in binary mode.)
with open('text_file.txt', 'rb') as file:
    offsets = [0]  # First line is always at offset 0.
    for line in file:
        offsets.append(file.tell())  # Append where *next* line would start.

# Now reread lines in file in reverse order.
with open('text_file.txt', 'rb') as file:
    for index in reversed(range(len(offsets)-1)):
        file.seek(offsets[index])
        size = offsets[index+1] - offsets[index]  # Difference with next.
        # Read bytes, convert them to a string, and remove whitespace at end.
        line = file.read(size).decode().rstrip()
        print(line)

Output:

2018/03/26-15:08:51.066968    1     7FE9BDC91700     std:ZMD:
2018/03/26-10:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/26-00:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/25-24:08:50.980519  16K     7FE9BD1AF707     user: number is 93823004
2018/03/25-20:08:50.486601 1.5M     7FE9D3D41706     qojfcmqcacaeia
2018/03/25-10:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
2018/03/25-00:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr

Update

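Here is a version that does the same thing but uses Python's mmap module to memory-map the file, which should provide better performance by taking advantage of your OS/hardware's virtual-memory capabilities.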

This is because, as PyMOTW-3 says:

Memory-mapping typically improves I/O performance because it does not involve a separate system call for each access and it does not require copying data between buffers – the memory is accessed directly by both the kernel and the user application.

Code:

import mmap

with open('text_file.txt', 'rb') as file:
    with mmap.mmap(file.fileno(), length=0, access=mmap.ACCESS_READ) as mm_file:

        # First preprocess the file and note where lines start.
        # (Needs to be done in binary mode.)
        offsets = [0]  # First line is always at offset 0.
        for line in iter(mm_file.readline, b""):
            offsets.append(mm_file.tell())  # Append where *next* line would start.

        # Now process the lines in file in reverse order.
        for index in reversed(range(len(offsets)-1)):
            mm_file.seek(offsets[index])
            size = offsets[index+1] - offsets[index]  # Difference with next.
            # Read bytes, convert them to a string, and remove whitespace at end.
            line = mm_file.read(size).decode().rstrip()
            print(line)
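
One possible variation, offered as a sketch and not part of the code above: mmap objects have an rfind() method, so the file can be walked backwards directly, without first collecting the offsets. The function name reversed_lines_mmap and its parameters are hypothetical, and, like the version above, it assumes a non-empty file, since mapping an empty file with length=0 fails.

```python
import mmap
import os
import tempfile


def reversed_lines_mmap(path, encoding='utf-8'):
    """Yield the lines of a non-empty file from last to first."""
    with open(path, 'rb') as file:
        with mmap.mmap(file.fileno(), length=0, access=mmap.ACCESS_READ) as mm_file:
            end = len(mm_file)  # One past the last byte of the current line.
            while end > 0:
                # Search before the line's own trailing newline, if it has one.
                start = mm_file.rfind(b'\n', 0, end - 1) + 1
                yield mm_file[start:end].decode(encoding)
                end = start


# Usage with a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'text_file.txt')
    with open(path, 'wb') as f:
        f.write(b'first\nsecond\nthird\n')
    for line in reversed_lines_mmap(path):
        print(line.rstrip())
```

Unlike the offsets version, this needs no extra list, but it keeps the mapping open for as long as the generator is being consumed.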
