Trade-off between contiguous buffered reads and partial reads plus seeks

Published 2024-06-02 09:43:22


I have a large binary file (say, 1 GB) consisting of concatenated elements of the following kind:

| -------- element 1 --------| ------- element 2 -------- | ...
| length (uint32) | contents | length (uint32) | contents | ...

The contents are, of course, of variable length.
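
For concreteness, a small script that writes a file in this format (a sketch: the file name, element count, and content size are made-up illustrative values):

import os
import struct

FILENAME = 'elements.bin'   # hypothetical test file
N_ELEMENTS = 500000         # roughly 1 GB with the size below
CONTENT_SIZE = 2000         # ~2 kB of contents per element

with open(FILENAME, 'wb') as f:
    for _ in range(N_ELEMENTS):
        contents = os.urandom(CONTENT_SIZE)
        # The length field is a big-endian uint32 counting itself
        # plus the contents.
        f.write(struct.pack('>I', 4 + len(contents)))
        f.write(contents)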

I am scanning the file to collect a list of the offsets of all the elements in it: that is, I read each element's length field but discard its contents.

There are two options (in simplified pseudocode):

import array

file = open('...', 'rb')
while True:
    element = array.array('B')
    element.fromfile(file, 4)           # raises EOFError at end of file
    length = int.from_bytes(element, 'big')
    file.seek(length - 4, 1)            # skip over the contents

and

import array

file = open('...', 'rb')
while True:
    element = array.array('B')
    element.fromfile(file, 4)           # raises EOFError at end of file
    length = int.from_bytes(element, 'big')
    element.fromfile(file, length - 4)  # contents read but never used

To my surprise, the option that reads the contents and throws them away is 30-50% faster than the one that skips them with a seek.
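
For reference, a minimal harness to reproduce the comparison (a sketch; it assumes a test file like the hypothetical elements.bin generated above):

import array
import time

def scan(filename, skip):
    """Collect element offsets, skipping contents by seek or by a dummy read."""
    offsets = []
    with open(filename, 'rb') as f:
        while True:
            element = array.array('B')
            try:
                element.fromfile(f, 4)
            except EOFError:
                break
            offsets.append(f.tell() - 4)
            length = int.from_bytes(element, 'big')
            if skip:
                f.seek(length - 4, 1)             # skip over the contents
            else:
                element.fromfile(f, length - 4)   # read and discard
    return offsets

for skip in (True, False):
    start = time.perf_counter()
    scan('elements.bin', skip)
    print('seek' if skip else 'read', time.perf_counter() - start)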

In the particular case I tested, the elements are about 2 kB long and io.DEFAULT_BUFFER_SIZE is 8 kB, so the seeks frequently land inside a portion of the file that has already been read into the buffer, which partly explains the slowdown.
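
The default buffer size can be checked, and a different buffer size can be passed per file through the buffering argument of open(); whether a larger buffer actually helps in this situation would have to be measured (a minimal sketch, with an arbitrary 64 kB value):

import io

print(io.DEFAULT_BUFFER_SIZE)   # typically 8192

# open() accepts an explicit buffer size; 64 kB here is an arbitrary guess.
f = open('elements.bin', 'rb', buffering=64 * 1024)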

However, for files whose elements are much larger than the buffer, the seek option clearly ought to be faster.

Questions:

  1. Is there a good way to choose the better of the two code paths (i.e., full read vs. seek) dynamically? (A sketch of one possible heuristic follows below.)
  2. What does file.seek(..., 1) do to the buffer that causes such a slowdown? Does it discard the read buffer? If that part of the file is already in the buffer, shouldn't the seek simply be ignored, apart from moving the position pointer?
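
For question 1, one conceivable heuristic is to compare each element's length against the buffer size and only seek when the contents would overshoot the buffer. This is a sketch of the idea, not a tested answer; the threshold is an assumption:

import array
import io

def scan_adaptive(filename, threshold=io.DEFAULT_BUFFER_SIZE):
    """Read-and-discard small contents, seek over large ones.

    The threshold is a guess: the real crossover point would have
    to be measured for a given disk and element-size distribution.
    """
    offsets = []
    with open(filename, 'rb') as f:
        while True:
            element = array.array('B')
            try:
                element.fromfile(f, 4)
            except EOFError:
                break
            offsets.append(f.tell() - 4)
            length = int.from_bytes(element, 'big')
            if length - 4 > threshold:
                f.seek(length - 4, 1)             # large: skip via seek
            else:
                element.fromfile(f, length - 4)   # small: read and discard
    return offsets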

Environment:

Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)] on win32

Update 1

Quite apart from any disk transfer speeds, there seems to be a general performance problem with seeking in Python streams, even when the stream is backed by memory rather than by a disk (and is therefore entirely "cached"). While repeated reads of a stream appear to get some additional buffering, repeated seeks evidently do not.

In [1]: from io import BytesIO

In [2]: data = bytearray(int(10e6))

In [3]: stream = BytesIO(data)

In [4]: chunk_size = 1000

In [5]: chunk = bytearray(chunk_size)

In [6]: %%timeit -r1000 -n1
   ...: bytes_read = True
   ...: while bytes_read:
   ...:     bytes_read = stream.readinto(chunk)
   ...: 
The slowest run took 5150.00 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 1000: 1.54 µs per loop
--> Slowest run: 7.9 ms

In [7]: from math import ceil

In [8]: no_reps = ceil(len(data) / chunk_size)

In [9]: %%timeit -r1000 -n1
  ....: for i in range(no_reps):
  ....:    stream.seek(chunk_size)
  ....: 
1 loop, best of 1000: 4.4 ms per loop

In [10]: chunk_size = 1000000

In [11]: chunk = bytearray(chunk_size)

In [12]: %%timeit -r1000 -n1
   ....: bytes_read = True
   ....: while bytes_read:
   ....:     bytes_read = stream.readinto(chunk)
   ....: 
The slowest run took 3428.00 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 1000: 1.54 µs per loop
--> Slowest run: 5.3 ms

In [13]: no_reps = ceil(len(data) / chunk_size)

In [14]: %%timeit -r1000 -n1
   ....: for i in range(no_reps):
   ....:    stream.seek(chunk_size)
   ....: 
The slowest run took 36.69 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 1000: 6.67 µs per loop
--> Slowest run:  245 µs

The seek penalty disappears when the stream is fresh, i.e. has not yet been read by Python:

In [15]: chunk_size = 1000

In [16]: chunk = bytearray(chunk_size)

In [17]: %%timeit -r1000 -n1
   ....: stream = BytesIO(data)
   ....: bytes_read = True
   ....: while bytes_read:
   ....:     bytes_read = stream.readinto(chunk)
   ....: 
1 loop, best of 1000: 16.7 ms per loop

In [18]: no_reps = ceil(len(data) / chunk_size)

In [19]: %%timeit -r1000 -n1
   ....: stream = BytesIO(data)
   ....: for i in range(no_reps):
   ....:    stream.seek(chunk_size)
   ....: 
1 loop, best of 1000: 15.4 ms per loop

In [20]: chunk_size = 1000000

In [21]: chunk = bytearray(chunk_size)

In [22]: %%timeit -r1000 -n1
   ....: stream = BytesIO(data)
   ....: bytes_read = True
   ....: while bytes_read:
   ....:     bytes_read = stream.readinto(chunk)
   ....: 
1 loop, best of 1000: 13.7 ms per loop

In [23]: no_reps = ceil(len(data) / chunk_size)

In [24]: %%timeit -r1000 -n1
   ....: stream = BytesIO(data)
   ....: for i in range(no_reps):
   ....:    stream.seek(chunk_size)
   ....: 
1 loop, best of 1000: 9.98 ms per loop
