Python text file processing speed issues

3 answers

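User

#1 · edited 2024-10-01 17:30:42

If you search for "why is python gzip slow" you will find plenty of discussion of this problem, including patches with improvements in Python 2.7 and 3.2. Meanwhile, using zcat as you do in Perl is very fast. On a 5 MB compressed file, your (first) function took 4.19 s and the second took 0.78 s. However, I don't know what is going on with your uncompressed files. If I decompress the log files (Apache logs) and run the two functions on them, using a plain Python open(file) and Popen('cat'), Python (0.17 s) is faster than cat (0.48 s).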

#!/usr/bin/python

import gzip
from subprocess import PIPE, Popen
import sys
import timeit

#pathToLog = 'big.log.gz' # 50M compressed (*10 uncompressed)
pathToLog = 'small.log.gz' # 5M ""

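def test_ori():
    # Read the compressed log with the built-in gzip module (Python 2 syntax).
    counter = 0
    f = gzip.open(pathToLog, 'r')
    for line in f:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line
    f.close()  # the original said "f.close", which never actually called close

def test_new():
    # Let an external zcat decompress, then split its whole output into lines.
    counter = 0
    content = Popen(["zcat", pathToLog], stdout=PIPE).communicate()[0].split('\n')
    for line in content:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line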

if '__main__' == __name__:
    to = timeit.Timer('test_ori()', 'from __main__ import test_ori')
    print "Original function time", to.timeit(1)

    tn = timeit.Timer('test_new()', 'from __main__ import test_new')
    print "New function time", tn.timeit(1)
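test_new above collects the whole decompressed output in memory through communicate() before splitting it. Below is a minimal Python 3 sketch of the same external-process idea that streams the pipe line by line instead. The function name count_lines_external and the default command gzip -dc (used here in place of zcat and assumed to be on PATH) are illustrative assumptions, not part of the original answer.

```python
import subprocess

def count_lines_external(path, command=("gzip", "-dc")):
    """Count lines in a gzip file by streaming an external decompressor.

    `command` is an assumption: any program that writes the decompressed
    data to stdout when given the file path as its last argument.
    """
    counter = 0
    proc = subprocess.Popen(list(command) + [path], stdout=subprocess.PIPE)
    with proc.stdout:
        for line in proc.stdout:  # bytes, one line per iteration
            counter += 1
    # Surface decompressor failures instead of silently returning a count.
    if proc.wait() != 0:
        raise RuntimeError("decompressor exited with status %d" % proc.returncode)
    return counter
```

Because the pipe is consumed as it is produced, memory use stays roughly constant regardless of file size, whereas communicate() holds the entire uncompressed log at once.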

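User

#2 · edited 2024-10-01 17:30:42

In Python (at least <= 2.6.x), gzip format parsing is implemented in Python (on top of zlib). What's more, it seems to do something strange: it decompresses up to the end of the file into memory and then discards everything beyond the requested read size (and then does it again on the next read). Disclaimer: I only looked at gzip.read() for three minutes, so I may be wrong. Whether or not my reading of gzip.read() is correct, the gzip module does not appear to be optimized for large volumes of data. Try doing the same thing as in Perl, i.e. launch an external process (see, for example, the subprocess module).

Edit: Actually, I missed the OP's remark that plain file I/O is just as slow as compressed I/O (thanks to ire_and_curses for pointing it out). That struck me as unlikely, so I made some measurements...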

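import gzip
from timeit import Timer

def w(n):
    # Write n lines of 80 asterisks to the plain file "ttt".
    L = "*"*80+"\n"
    with open("ttt", "w") as f:
        for i in xrange(n):
            f.write(L)

def r():
    # Read "ttt" line by line, printing a checkpoint every million lines.
    with open("ttt", "r") as f:
        for n, line in enumerate(f):
            if n % 1000000 == 0:
                print n

def g():
    # Same loop over the gzip-compressed copy "ttt.gz".
    f = gzip.open("ttt.gz", "r")
    for n, line in enumerate(f):
        if n % 1000000 == 0:
            print n
    f.close()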

Now, running it...

[interpreter session missing in the source]

...after a tea break I found it was still running, so I killed it, sorry. Then I tried again with just 100,000 lines:

>>> Timer("w(100000)", "from __main__ import w").timeit(1)
0.05810999870300293
>>> Timer("r()", "from __main__ import r").timeit(1)
0.09662318229675293
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)
11.939290046691895

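The gzip module's read time is O(file_size**2), so once the line count reaches the millions, gzip read time cannot be the same as plain read time (as our experiment confirms). 匿名旅人, please check again.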
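To repeat this kind of measurement on a current interpreter, a self-contained Python 3 harness along the following lines could be used. It is a sketch, not the answerer's script: the helper names write_files, count_lines and compare are made up for illustration, and the numbers it produces will depend on the Python version, since the measurements above were taken on an old 2.x release.

```python
import gzip
import os
import tempfile
from timeit import Timer

def write_files(n, plain_path, gz_path):
    """Write the same n lines of 80 asterisks to a plain file and a gzip file."""
    line = b"*" * 80 + b"\n"
    with open(plain_path, "wb") as plain, gzip.open(gz_path, "wb") as gz:
        for _ in range(n):
            plain.write(line)
            gz.write(line)

def count_lines(opener, path):
    """Count lines using `opener`, either the built-in open or gzip.open."""
    with opener(path, "rb") as f:
        return sum(1 for _ in f)

def compare(n):
    """Return (plain_seconds, gzip_seconds) for reading n lines once each."""
    tmpdir = tempfile.mkdtemp()
    plain_path = os.path.join(tmpdir, "ttt")
    gz_path = os.path.join(tmpdir, "ttt.gz")
    try:
        write_files(n, plain_path, gz_path)
        t_plain = Timer(lambda: count_lines(open, plain_path)).timeit(1)
        t_gzip = Timer(lambda: count_lines(gzip.open, gz_path)).timeit(1)
    finally:
        # Clean up the temporary files even if a step fails.
        for path in (plain_path, gz_path):
            if os.path.exists(path):
                os.remove(path)
        os.rmdir(tmpdir)
    return t_plain, t_gzip
```

Calling compare(n) for n = 100000, 200000 and 400000 and watching whether the gzip time roughly doubles or roughly quadruples at each step is one way to check the quadratic claim on a given installation.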

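User

#3 · edited 2024-10-01 17:30:42

I spent a while on this. Hopefully this code does the trick. It uses zlib and makes no external calls.

The gunzipchunks method reads the compressed gzip file in chunks, which can be iterated over (a generator).

The gunziplines method reads those decompressed chunks and hands you one line at a time, which can also be iterated over (another generator).

Finally, the gunziplinescounter method gives you what you are looking for.

Cheers!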

import zlib

file_name = 'big.txt.gz'
#file_name = 'mini.txt.gz'

#for i in gunzipchunks(file_name): print i
def gunzipchunks(file_name,chunk_size=4096):
    # 16+zlib.MAX_WBITS tells zlib to expect a gzip header and trailer.
    inflator = zlib.decompressobj(16+zlib.MAX_WBITS)
    f = open(file_name,'rb')
    while True:
        packet = f.read(chunk_size)
        if not packet: break
        to_do = inflator.unconsumed_tail + packet
        while to_do:
            decompressed = inflator.decompress(to_do, chunk_size)
            if not decompressed:
                to_do = None
                break
            yield decompressed
            to_do = inflator.unconsumed_tail
    leftovers = inflator.flush()
    if leftovers: yield leftovers
    f.close()

#for i in gunziplines(file_name): print i
def gunziplines(file_name,leftovers="",line_ending='\n'):
    for chunk in gunzipchunks(file_name):
        chunk = "".join([leftovers,chunk])
        while line_ending in chunk:
            line, chunk = chunk.split(line_ending,1)
            yield line
        # Carry the unterminated tail forward, even when this chunk
        # contained no line ending at all (the original dropped it then).
        leftovers = chunk
    if leftovers: yield leftovers

def gunziplinescounter(file_name):
    for counter,line in enumerate(gunziplines(file_name)):
        if (counter % 1000000 != 0): continue
        print "%12s: %10d" % ("checkpoint", counter)
    print "%12s: %10d" % ("final result", counter)
    print "DEBUG: last line: [%s]" % (line)

gunziplinescounter(file_name)

This should be much faster than using the built-in gzip module on very large files.
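For readers on Python 3, where file data arrives as bytes, the same two-generator idea might look like the sketch below. The names gunzip_chunks and gunzip_lines are hypothetical adaptations rather than the answerer's code, and the sketch simplifies one thing: it does not cap the size of each decompressed piece with max_length as the code above does. Like the original, it reads only the first member of a multi-member gzip file.

```python
import zlib

def gunzip_chunks(file_name, chunk_size=4096):
    """Yield decompressed byte chunks from a gzip file."""
    # 16 + MAX_WBITS makes zlib expect gzip framing (header and trailer).
    inflator = zlib.decompressobj(16 + zlib.MAX_WBITS)
    with open(file_name, "rb") as f:
        while True:
            packet = f.read(chunk_size)
            if not packet:
                break
            data = inflator.decompress(packet)
            if data:
                yield data
    tail = inflator.flush()
    if tail:
        yield tail

def gunzip_lines(file_name, line_ending=b"\n"):
    """Yield lines as bytes, without the line ending."""
    leftovers = b""
    for chunk in gunzip_chunks(file_name):
        # Everything before the last separator is complete lines;
        # the final piece may be a partial line and is carried over.
        *lines, leftovers = (leftovers + chunk).split(line_ending)
        for line in lines:
            yield line
    if leftovers:
        yield leftovers
```

A line count is then sum(1 for _ in gunzip_lines(path)), and call line.decode() on each item if text rather than bytes is wanted.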
