
Published 2024-06-02 09:43:22


I have a large binary file (say 1 GB) made up of concatenated elements of the following form:

| -------- element 1 --------| ------- element 2 -------- | ...
| length (uint32) | contents | length (uint32) | contents | ...
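
To make the layout concrete, here is a minimal sketch that builds and parses a few elements in memory. It assumes the length field is big-endian and counts its own 4 header bytes, which is inferred from the `'big'` and `length - 4` in the question's loops rather than stated anywhere. The helper names `pack_element` and `unpack_elements` are made up for illustration.

```python
import struct
from io import BytesIO

HEADER_SIZE = 4  # size of the uint32 length field


def pack_element(contents):
    """Serialize one element: big-endian uint32 length, then the contents.

    Assumption: the length includes the 4 header bytes.
    """
    return struct.pack('>I', len(contents) + HEADER_SIZE) + contents


def unpack_elements(buf):
    """Yield the contents of each element found in a bytes object."""
    stream = BytesIO(buf)
    while True:
        header = stream.read(HEADER_SIZE)
        if len(header) < HEADER_SIZE:
            break  # end of data (or a truncated header)
        (length,) = struct.unpack('>I', header)
        yield stream.read(length - HEADER_SIZE)


blob = pack_element(b'abc') + pack_element(b'') + pack_element(b'hello')
print(list(unpack_elements(blob)))  # [b'abc', b'', b'hello']
```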




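import array

with open('...', 'rb') as file:
    while True:
        element = array.array('B')
        try:
            element.fromfile(file, 4)    # read the uint32 length header
        except EOFError:
            break                        # end of file reached
        length = int.from_bytes(element, 'big')
        file.seek(length - 4, 1)    # skip reading contents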


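import array

with open('...', 'rb') as file:
    while True:
        element = array.array('B')
        try:
            element.fromfile(file, 4)    # read the uint32 length header
            length = int.from_bytes(element, 'big')
            element.fromfile(file, length - 4)   # contents read but never used
        except EOFError:
            break                        # end of file reached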





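  1. Is there a good way to dynamically choose the better of the two code paths? (That is, read the contents in full or seek past them.)
  2. What does file.seek(..., 1) do to the buffer that causes such a slowdown? Does it discard the read buffer? If that part of the file is already in the buffer, the seek should be a no-op, shouldn't it? (Apart from moving the pointer, of course.)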
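
One possible way to approach the first question is a size cutoff: read small payloads through the buffer and seek past large ones. The sketch below is only a heuristic under stated assumptions. The function name `iter_lengths` and the `threshold` parameter are hypothetical, and using `io.DEFAULT_BUFFER_SIZE` as the cutoff is a guess that would need to be tuned by measuring on the actual file and platform.

```python
import io
import struct
from io import BytesIO

HEADER_SIZE = 4  # size of the uint32 length field


def iter_lengths(f, threshold=io.DEFAULT_BUFFER_SIZE):
    """Yield each element's length, skipping its contents.

    Payloads up to `threshold` bytes are read and discarded; larger ones
    are skipped with a relative seek. The cutoff is an assumption.
    """
    while True:
        header = f.read(HEADER_SIZE)
        if len(header) < HEADER_SIZE:
            return  # end of file (or a truncated header)
        (length,) = struct.unpack('>I', header)
        to_skip = length - HEADER_SIZE
        if to_skip < 0:
            raise ValueError('element length smaller than its header')
        if to_skip <= threshold:
            f.read(to_skip)        # small payload: consume it
        else:
            f.seek(to_skip, 1)     # large payload: jump over it
        yield length


# Demo on an in-memory stream with three elements of different sizes.
sample = b''.join(struct.pack('>I', n) + bytes(n - HEADER_SIZE)
                  for n in (10, 20000, 4))
print(list(iter_lengths(BytesIO(sample))))  # [10, 20000, 4]
```

For a real file the same function can be called as `iter_lengths(open(path, 'rb'))`, where `path` stands for whatever file is being scanned.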


Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)] on win32



In [1]: from io import BytesIO

In [2]: data = bytearray(int(10e6))

In [3]: stream = BytesIO(data)

In [4]: chunk_size = 1000

In [5]: chunk = bytearray(chunk_size)

In [6]: %%timeit -r1000 -n1
   ...: bytes_read = True
   ...: while bytes_read:
   ...:     bytes_read = stream.readinto(chunk)
The slowest run took 5150.00 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 1000: 1.54 µs per loop
--> Slowest run: 7.9 ms

In [7]: from math import ceil

In [8]: no_reps = ceil(len(data) / chunk_size)

In [9]: %%timeit -r1000 -n1
  ....: for i in range(no_reps):
  ....:    stream.seek(chunk_size)
1 loop, best of 1000: 4.4 ms per loop

In [10]: chunk_size = 1000000

In [11]: chunk = bytearray(chunk_size)

In [12]: %%timeit -r1000 -n1
   ....: bytes_read = True
   ....: while bytes_read:
   ....:     bytes_read = stream.readinto(chunk)
The slowest run took 3428.00 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 1000: 1.54 µs per loop
--> Slowest run: 5.3 ms

In [13]: no_reps = ceil(len(data) / chunk_size)

In [14]: %%timeit -r1000 -n1
   ....: for i in range(no_reps):
   ....:    stream.seek(chunk_size)
The slowest run took 36.69 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 1000: 6.67 µs per loop
--> Slowest run:  245 µs


In [15]: chunk_size = 1000

In [16]: chunk = bytearray(chunk_size)

In [17]: %%timeit -r1000 -n1
   ....: stream = BytesIO(data)
   ....: bytes_read = True
   ....: while bytes_read:
   ....:     bytes_read = stream.readinto(chunk)
1 loop, best of 1000: 16.7 ms per loop

In [18]: no_reps = ceil(len(data) / chunk_size)

In [19]: %%timeit -r1000 -n1
   ....: stream = BytesIO(data)
   ....: for i in range(no_reps):
   ....:    stream.seek(chunk_size)
1 loop, best of 1000: 15.4 ms per loop

In [20]: chunk_size = 1000000

In [21]: chunk = bytearray(chunk_size)

In [22]: %%timeit -r1000 -n1
   ....: stream = BytesIO(data)
   ....: bytes_read = True
   ....: while bytes_read:
   ....:     bytes_read = stream.readinto(chunk)
1 loop, best of 1000: 13.7 ms per loop

In [23]: no_reps = ceil(len(data) / chunk_size)

In [24]: %%timeit -r1000 -n1
   ....: stream = BytesIO(data)
   ....: for i in range(no_reps):
   ....:    stream.seek(chunk_size)
1 loop, best of 1000: 9.98 ms per loop
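
Two details in the session above are worth keeping in mind when reading the numbers. `stream.seek(chunk_size)` with a single argument is an absolute seek (`whence` defaults to 0), so every iteration moves to the same offset instead of advancing through the data. And in cells 6 and 12 the stream is not recreated between runs, so after the first run `readinto` returns 0 immediately, which likely explains the microsecond-level best times. The following is a sketch of a comparable benchmark that uses a relative seek and a fresh stream per run; the function names are illustrative and no timings are claimed here.

```python
import timeit
from io import BytesIO

data = bytes(10 * 1000 * 1000)  # 10 MB of zeros, as in the session


def read_all(chunk_size):
    """Read the whole stream in chunks; return the number of bytes read."""
    stream = BytesIO(data)
    chunk = bytearray(chunk_size)
    total = 0
    while True:
        n = stream.readinto(chunk)
        if not n:
            break
        total += n
    return total


def seek_all(chunk_size):
    """Walk the stream with relative seeks; return the final position."""
    stream = BytesIO(data)
    size = len(data)
    while stream.tell() < size:
        stream.seek(chunk_size, 1)  # whence=1: relative to current position
    return stream.tell()


for chunk_size in (1000, 1000000):
    for fn in (read_all, seek_all):
        best = min(timeit.repeat(lambda: fn(chunk_size), number=1, repeat=5))
        print('{:8s} chunk={:7d} best={:.3f} ms'.format(
            fn.__name__, chunk_size, best * 1e3))
```

Note that `BytesIO` is purely in-memory, so this only compares Python-level call overhead; it does not exercise the buffering of a real file opened with `open(..., 'rb')`.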
