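<p>I spent a while on this. Hopefully this code does the trick. It uses zlib and makes no external calls.</p>
<p>The <strong>gunzipchunks</strong> method reads the compressed gzip file in chunks and yields the decompressed data piece by piece, so it can be iterated over (it is a generator).</p>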
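<p>To see the chunked decompression idea in isolation, here is a minimal in-memory sketch for Python 3. The names <code>iter_decompressed</code>, <code>original</code> and <code>pieces</code> are made up for this illustration and the sample data is synthetic; the zlib calls are the same ones the answer's code relies on.</p>

```python
import gzip
import zlib

def iter_decompressed(data, chunk_size=4096):
    """Yield gzip-compressed bytes as decompressed pieces of roughly chunk_size bytes."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    inflator = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for start in range(0, len(data), chunk_size):
        to_do = data[start:start + chunk_size]
        while to_do:
            # The second argument caps how much output one call may return.
            piece = inflator.decompress(to_do, chunk_size)
            if not piece:
                break
            yield piece
            # Input that did not fit under the output cap is kept here.
            to_do = inflator.unconsumed_tail
    tail = inflator.flush()
    if tail:
        yield tail

# Synthetic sample data (an assumption for the demo, not the real big file).
original = b"".join(b"line %d\n" % i for i in range(10000))
compressed = gzip.compress(original)

pieces = list(iter_decompressed(compressed, chunk_size=1024))
restored = b"".join(pieces)
print(len(pieces), "pieces,", len(restored), "bytes restored")
```

<p>Capping the output per call is what keeps memory use flat: a small compressed packet can expand to far more data than <code>chunk_size</code>, and the remainder waits in <code>unconsumed_tail</code> until the next call.</p>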
<p>The <strong>gunziplines</strong> method reads these uncompressed chunks and gives you one line at a time, which can also be iterated over (another generator).</p>
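<p>The tricky part of turning chunks into lines is that a line can straddle a chunk boundary. A small sketch of that carry-over logic, for Python 3, with a hand-made list of chunks standing in for the decompressed data (<code>iter_lines</code> and the sample chunks are illustrative names and values, and the splitting is a slight variation on the loop used in the answer's code):</p>

```python
def iter_lines(chunks, line_ending=b"\n"):
    """Yield complete lines from an iterable of byte chunks."""
    leftovers = b""
    for chunk in chunks:
        chunk = leftovers + chunk
        # Every element except the last is a complete line; the last one
        # is an unfinished tail that must wait for the next chunk.
        *lines, leftovers = chunk.split(line_ending)
        for line in lines:
            yield line
    # A final line without a trailing line ending is still a line.
    if leftovers:
        yield leftovers

# Chunk boundaries deliberately fall in the middle of lines.
chunks = [b"alpha\nbe", b"ta\ngam", b"ma", b"\ndelta"]
result = list(iter_lines(chunks))
print(result)  # [b'alpha', b'beta', b'gamma', b'delta']
```

<p>Note the third chunk, which contains no line ending at all: it has to be carried forward in full, otherwise part of a long line would be lost.</p>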
<p>Finally, the <strong>gunziplinescounter</strong> method gives you what you are looking for: the line count.</p>
<p>Cheers!</p>
<pre><code>import zlib

# Runs on Python 2 and 3; chunks and lines are byte strings.
file_name = 'big.txt.gz'
#file_name = 'mini.txt.gz'

# Usage: for chunk in gunzipchunks(file_name): print(chunk)
def gunzipchunks(file_name, chunk_size=4096):
    # 16 + MAX_WBITS makes zlib expect a gzip header and trailer
    inflator = zlib.decompressobj(16 + zlib.MAX_WBITS)
    with open(file_name, 'rb') as f:
        while True:
            packet = f.read(chunk_size)
            if not packet:
                break
            to_do = inflator.unconsumed_tail + packet
            while to_do:
                # return at most chunk_size bytes of output per call
                decompressed = inflator.decompress(to_do, chunk_size)
                if not decompressed:
                    break
                yield decompressed
                to_do = inflator.unconsumed_tail
        leftovers = inflator.flush()
        if leftovers:
            yield leftovers

# Usage: for line in gunziplines(file_name): print(line)
def gunziplines(file_name, leftovers=b"", line_ending=b'\n'):
    for chunk in gunzipchunks(file_name):
        chunk = leftovers + chunk
        while line_ending in chunk:
            line, chunk = chunk.split(line_ending, 1)
            yield line
        # keep the unfinished tail, even if this chunk had no line ending
        leftovers = chunk
    if leftovers:
        yield leftovers

def gunziplinescounter(file_name):
    counter, line = 0, None  # defaults for an empty file
    # count from 1 so the final value is the number of lines
    for counter, line in enumerate(gunziplines(file_name), 1):
        if counter % 1000000 == 0:
            print("%12s: %10d" % ("checkpoint", counter))
    print("%12s: %10d" % ("final result", counter))
    print("DEBUG: last line: [%s]" % (line,))

gunziplinescounter(file_name)
</code></pre>
<p>This should be much faster than using the built-in gzip module on very large files.</p>
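<p>If you want to check the speed difference on your own machine, a rough timing sketch for Python 3 could look like the following. Everything here is an assumption made for the test: the temporary file, the 200,000 synthetic lines, and the helper names <code>count_lines_gzip</code>, <code>count_lines_zlib</code> and <code>timed</code>. Results will depend on the Python version and the data, so treat the numbers as indicative only.</p>

```python
import gzip
import os
import tempfile
import time
import zlib

def count_lines_gzip(path):
    """Count lines by iterating a file opened with the gzip module."""
    with gzip.open(path, "rb") as f:
        return sum(1 for _ in f)

def count_lines_zlib(path, chunk_size=1 << 16):
    """Count newline bytes while streaming the file through zlib directly."""
    inflator = zlib.decompressobj(16 + zlib.MAX_WBITS)
    count = 0
    with open(path, "rb") as f:
        for packet in iter(lambda: f.read(chunk_size), b""):
            # No output cap here: simpler, but uses more memory per call.
            count += inflator.decompress(packet).count(b"\n")
    count += inflator.flush().count(b"\n")
    return count

def timed(func, path):
    """Return (result, elapsed seconds) for one call of func(path)."""
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start

# Synthetic test file; every line ends with a newline, so counting
# newline bytes and counting lines give the same number.
fd, path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(path, "wb") as f:
    f.write(b"".join(b"sample line %d\n" % i for i in range(200000)))

gzip_count, gzip_seconds = timed(count_lines_gzip, path)
zlib_count, zlib_seconds = timed(count_lines_zlib, path)
os.remove(path)

print("gzip module: %d lines in %.3f s" % (gzip_count, gzip_seconds))
print("zlib direct: %d lines in %.3f s" % (zlib_count, zlib_seconds))
```

<p>For a fair comparison, run it several times and against a file of realistic size, since a single run on a small file is dominated by noise.</p>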