How do I iterate over a very large text file delimited by semicolons?

2 answers

User

Reply #1 · Edited 2024-05-19 05:07:09

The .readlines() method reads the entire file into a list. For a 7 GB file, that is probably not feasible.

For the example you added, you can use mmap together with a regular expression to match across the whole file without loading it all into memory:

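import mmap
import re

fn = 'file.txt'  # placeholder path; point this at your own file

# Read-only access is enough, so no write permission is needed.
with open(fn, 'rb') as f_in:
    mm = mmap.mmap(f_in.fileno(), 0, access=mmap.ACCESS_READ)
    # A mmap is bytes-like, so Python 3 requires a bytes pattern here.
    # '+' instead of '*' skips the empty matches the old 'if txt' filtered out.
    for m in re.finditer(rb'[^;]+', mm):
        txt = m.group(0).decode('utf-8')  # assumes the file is UTF-8 text
        print('|{}|'.format(txt))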

For example, this prints:

|AAAA|
|BBBBB
BB|
|CCC|
|
DDDDD
D
D|
|
EEEE|
|F|
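
If you would rather not use a regular expression, the same memory-mapped scan can be sketched with mmap.find. This is only an illustration: the function name iter_fields_mmap, the temporary demo file, and the UTF-8 decoding are my own assumptions, not part of the answer above.

```python
import mmap
import os
import tempfile


def iter_fields_mmap(path, delim=b";"):
    """Yield each delimiter-separated field of the file at `path` as bytes."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap raises ValueError on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            while True:
                end = mm.find(delim, start)
                if end == -1:
                    # No more delimiters: the rest of the file is the last field.
                    yield mm[start:]
                    return
                yield mm[start:end]
                start = end + len(delim)


# Small demonstration on a temporary file (hypothetical sample data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"AAAA;BBBBB\nBB;CCC")
    demo_path = tmp.name

fields = [field.decode("utf-8") for field in iter_fields_mmap(demo_path)]
os.remove(demo_path)
print(fields)
```

Unlike the regex version, this keeps empty fields (for example between two adjacent semicolons), which matches what str.split would give you.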

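User

Reply #2 · Edited 2024-05-19 05:07:09

Here is a "reader" object that reads blocks from the file (of a size you choose) and emits the semicolon-delimited items as it finds them: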

class MyReader:
    """Iterate over delimiter-separated items, reading the file in blocks."""

    def __init__(self, handle, delim, read_size=512):
        self.handle = handle        # file object opened in text mode
        self.delim = delim          # separator, e.g. ';'
        self.read_size = read_size  # characters to read per block

    def __iter__(self):
        buffer = []  # pieces of the current item, which may span several blocks
        while True:
            block = self.handle.read(self.read_size)
            if not block:
                break  # reached EOF

            while block:
                before, sep, block = block.partition(self.delim)
                buffer.append(before)

                if sep:  # separator found: the current item is complete
                    yield ''.join(buffer)
                    buffer = []

        # EOF: flush whatever follows the last separator
        yield ''.join(buffer)

For example, you can use it like this:

with open('file.txt') as f:
    reader = MyReader(f, ';')

    for chunk in reader:
        print(repr(chunk))

Output:

'AAAA'
'BBBBB\nBB'
'CCC'
'\nDDDDD\nD\nD'
'\nEEEE'
'F'
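
One caveat with the block-based reader: because each block is searched on its own, a delimiter longer than one character could be missed if it happens to straddle two blocks. A rough sketch of a generator variant that carries the unfinished tail over to the next block is below. The name iter_delimited, the default block size, and the io.StringIO sample data are assumptions made for illustration only.

```python
import io


def iter_delimited(handle, delim=";", read_size=64 * 1024):
    """Yield delimiter-separated items from a text-mode file-like object."""
    pending = ""  # text read so far that is not yet part of a complete item
    while True:
        block = handle.read(read_size)
        if not block:
            break  # reached EOF
        pending += block
        # split() gives every complete item plus the unfinished remainder,
        # so a delimiter cut in half by a block boundary stays in `pending`.
        *items, pending = pending.split(delim)
        yield from items
    # Whatever remains after EOF is the final item (possibly empty).
    yield pending


# Tiny read_size values are used here only to force block boundaries.
sample = io.StringIO("AAAA;BBBBB\nBB;CCC;\nDDDDD\nD\nD;\nEEEE;F")
items = list(iter_delimited(sample, ";", read_size=4))

multi = list(iter_delimited(io.StringIO("a<>b<>c"), "<>", read_size=2))
print(items)
print(multi)
```

The trade-off is that a single very long item gets re-scanned each time a new block is appended, so the list-of-pieces buffer in the class above is probably the better choice when items can be huge and the delimiter is a single character.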
