I have a large binary file (say, 1 GB) consisting of concatenated elements of the following form:
| -------- element 1 --------| ------- element 2 -------- | ...
| length (uint32) | contents | length (uint32) | contents | ...
The contents are, of course, of variable length.
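For concreteness, a test file in this layout can be generated with something like the sketch below (the file name elements.bin, the element count, and the content size are just illustrative values; the length field is taken to cover the whole element, i.e. the 4-byte length plus the contents, which matches the length - 4 arithmetic further down):

import os
import struct

def write_test_file(path='elements.bin', n_elements=500000, content_size=2044):
    # Each element: 4-byte big-endian length followed by content_size bytes.
    # The stored length covers the whole element (4 + content_size), so a
    # reader skips length - 4 bytes after the length field.
    contents = os.urandom(content_size)
    with open(path, 'wb') as f:
        for _ in range(n_elements):
            f.write(struct.pack('>I', 4 + content_size))
            f.write(contents)

write_test_file()  # ~1 GB: 500,000 elements of 2 kB each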
I am scanning the file to collect a list of the offsets of all elements in it: that is, I read each element's length field but discard the contents.
There are two options (in simplified pseudocode):
import array

file = open('...', 'rb')
while True:
    element = array.array('B')
    element.fromfile(file, 4)           # read the 4-byte length field
    length = int.from_bytes(element, 'big')
    file.seek(length - 4, 1)            # skip reading contents
and
import array

file = open('...', 'rb')
while True:
    element = array.array('B')
    element.fromfile(file, 4)           # read the 4-byte length field
    length = int.from_bytes(element, 'big')
    element.fromfile(file, length - 4)  # contents read but never used
To my surprise, the option that reads the contents and discards them is 30-50% faster than the one that skips over them with seek.
In the particular case I tested, the elements are about 2 kB long and io.DEFAULT_BUFFER_SIZE is 8 kB, so I am frequently seeking to a position in the file that has already been read into the buffer, which explains part of the slowdown.
However, for a file whose elements are much larger than the buffer, the seek option should clearly be faster.
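A self-contained way to compare the two variants on such a file is sketched below (struct is used instead of array purely for brevity, and 'elements.bin' refers to the illustrative file above):

import struct
import time

def scan_seek(path):
    offsets = []
    with open(path, 'rb') as f:
        while True:
            offset = f.tell()
            header = f.read(4)
            if len(header) < 4:             # end of file
                break
            offsets.append(offset)
            length = struct.unpack('>I', header)[0]
            f.seek(length - 4, 1)           # skip the contents
    return offsets

def scan_read(path):
    offsets = []
    with open(path, 'rb') as f:
        while True:
            offset = f.tell()
            header = f.read(4)
            if len(header) < 4:             # end of file
                break
            offsets.append(offset)
            length = struct.unpack('>I', header)[0]
            f.read(length - 4)              # read the contents, then discard them
    return offsets

for scan in (scan_seek, scan_read):
    start = time.perf_counter()
    scan('elements.bin')
    print(scan.__name__, time.perf_counter() - start, 'seconds')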
Question: what does file.seek(..., 1) do to the buffer that causes such a slowdown? Does it discard the read buffer? If that part of the file is already in the buffer, shouldn't the seek be a no-op apart from moving the position? Environment:
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)] on win32
Update 1
Independently of any disk transfer speed, there seems to be a general performance problem with seek on Python streams, even when the stream is backed by memory rather than by a disk (and is therefore fully "cached"). While repeated reads of the stream seem to be buffered in some additional way, repeated seeks apparently are not.
In [1]: from io import BytesIO
In [2]: data = bytearray(int(10e6))
In [3]: stream = BytesIO(data)
In [4]: chunk_size = 1000
In [5]: chunk = bytearray(chunk_size)
In [6]: %%timeit -r1000 -n1
...: bytes_read = True
...: while bytes_read:
...:     bytes_read = stream.readinto(chunk)
...:
The slowest run took 5150.00 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 1000: 1.54 µs per loop
--> Slowest run: 7.9 ms
In [7]: from math import ceil
In [8]: no_reps = ceil(len(data) / chunk_size)
In [9]: %%timeit -r1000 -n1
....: for i in range(no_reps):
....:     stream.seek(chunk_size)
....:
1 loop, best of 1000: 4.4 ms per loop
In [10]: chunk_size = 1000000
In [11]: chunk = bytearray(chunk_size)
In [12]: %%timeit -r1000 -n1
....: bytes_read = True
....: while bytes_read:
....:     bytes_read = stream.readinto(chunk)
....:
The slowest run took 3428.00 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 1000: 1.54 µs per loop
--> Slowest run: 5.3 ms
In [13]: no_reps = ceil(len(data) / chunk_size)
In [14]: %%timeit -r1000 -n1
....: for i in range(no_reps):
....:     stream.seek(chunk_size)
....:
The slowest run took 36.69 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 1000: 6.67 µs per loop
--> Slowest run: 245 µs
When the stream is fresh (i.e., has not yet been read by Python), the seek disadvantage disappears:
In [15]: chunk_size = 1000
In [16]: chunk = bytearray(chunk_size)
In [17]: %%timeit -r1000 -n1
....: stream = BytesIO(data)
....: bytes_read = True
....: while bytes_read:
....:     bytes_read = stream.readinto(chunk)
....:
1 loop, best of 1000: 16.7 ms per loop
In [18]: no_reps = ceil(len(data) / chunk_size)
In [19]: %%timeit -r1000 -n1
....: stream = BytesIO(data)
....: for i in range(no_reps):
....:     stream.seek(chunk_size)
....:
1 loop, best of 1000: 15.4 ms per loop
In [20]: chunk_size = 1000000
In [21]: chunk = bytearray(chunk_size)
In [22]: %%timeit -r1000 -n1
....: stream = BytesIO(data)
....: bytes_read = True
....: while bytes_read:
....:     bytes_read = stream.readinto(chunk)
....:
1 loop, best of 1000: 13.7 ms per loop
In [23]: no_reps = ceil(len(data) / chunk_size)
In [24]: %%timeit -r1000 -n1
....: stream = BytesIO(data)
....: for i in range(no_reps):
....:     stream.seek(chunk_size)
....:
1 loop, best of 1000: 9.98 ms per loop
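For what it's worth, the fresh-stream comparison can also be reproduced outside IPython with the timeit module, along these lines (same illustrative sizes as above):

from io import BytesIO
from math import ceil
from timeit import timeit

data = bytearray(int(10e6))
chunk_size = 1000
chunk = bytearray(chunk_size)
no_reps = ceil(len(data) / chunk_size)

def read_all():
    # Fresh stream each run; read it to the end in chunk_size pieces.
    stream = BytesIO(data)
    while stream.readinto(chunk):
        pass

def seek_all():
    # Fresh stream each run; repeated seek, same call as in the session above.
    stream = BytesIO(data)
    for _ in range(no_reps):
        stream.seek(chunk_size)

print('readinto:', timeit(read_all, number=10) / 10, 's per pass')
print('seek:    ', timeit(seek_all, number=10) / 10, 's per pass')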