Tarfile in Python: can I untar more efficiently by extracting only some of the data?

import tarfile

fileName = "LT50250232011160-SC20140922132408.tar.gz"
tfile = tarfile.open(fileName, 'r:gz')
membersList = tfile.getmembers()
namesList = tfile.getnames()
# keep only the members whose name contains "band"
bandsList = [x for x, y in zip(membersList, namesList) if "band" in y]
print("extracting...")
tfile.extractall("newfolder/", members=bandsList)

2 Answers

User

Answer 1 · edited 2024-10-05 11:44:34

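The problem is that a tar file has no central file list. Instead, it stores files sequentially, each one preceded by a header. The tar file is then compressed with gzip to produce the tar.gz. With a plain tar file, if you don't want to extract a particular file, you simply skip the next header->size bytes in the archive and then read the next header. If the archive is additionally compressed, you still have to skip that many bytes, but in the decompressed data stream rather than in the archive file itself. That works for some compression formats, while others require you to decompress everything in between.

gzip belongs to the latter class of compression schemes. So although you save some time by not writing the unwanted files to disk, your code still decompresses them. For non-gzip archives you might be able to work around this by overriding the relevant class in tarfile, but for your gz files there is nothing you can do.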
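
To make the header-then-data layout described above concrete, here is a minimal sketch that walks an uncompressed tar archive by hand and skips over each file's data without reading it. It assumes the simple ustar layout (512-byte blocks, name in the first 100 bytes of the header, size as octal text at offset 124) and short file names. The helpers `list_members` and `build_demo_archive` are made up for illustration and are not part of tarfile.

```python
import io
import tarfile

BLOCK = 512  # tar stores headers and file data in 512-byte blocks


def list_members(raw):
    """Walk an uncompressed tar archive header by header, skipping file data."""
    members = []
    offset = 0
    while offset + BLOCK <= len(raw):
        header = raw[offset:offset + BLOCK]
        if header == b"\0" * BLOCK:  # an all-zero block marks the end of the archive
            break
        name = header[0:100].rstrip(b"\0").decode("utf-8")
        # The size field is octal text, padded with NUL bytes or spaces.
        size = int(header[124:136].rstrip(b"\0 ").decode("ascii") or "0", 8)
        members.append((name, size))
        # Jump over the header plus the data, rounded up to whole blocks.
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return members


def build_demo_archive():
    """Build a small uncompressed tar archive in memory (illustrative data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        for name, payload in [("a.txt", b"x" * 10), ("band1.tif", b"y" * 1000)]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


demo_members = list_members(build_demo_archive())
print(demo_members)
```

On a plain tar file the jump is just an offset change. On a tar.gz the same jump has to happen in the decompressed stream, which is why the bytes in between still have to be decompressed.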

User

Answer 2 · edited 2024-10-05 11:44:34

You can do this more efficiently by opening the tarfile as a stream. (https://docs.python.org/2/library/tarfile.html#tarfile.open)

mkdir tartest
cd tartest/
dd if=/dev/urandom of=file1 count=100 bs=1M
dd if=/dev/urandom of=file2 count=100 bs=1M
dd if=/dev/urandom of=file3 count=100 bs=1M
dd if=/dev/urandom of=file4 count=100 bs=1M
dd if=/dev/urandom of=file5 count=100 bs=1M
cd ..
tar czvf test.tgz tartest

Now read it like this:

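import tarfile

fileName = "test.tgz"
# 'r|gz' opens the archive as a sequential stream (no backward seeking)
tfile = tarfile.open(fileName, 'r|gz')
for t in tfile:
    if "file3" in t.name:
        # extractfile() returns None for members that are not regular files
        f = tfile.extractfile(t)
        if f:
            print(len(f.read()))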

Note the | in the open call. We only read file3.

$ time python test.py

104857600

real    0m1.201s
user    0m0.820s
sys     0m0.377s

If I change r|gz back to r:gz, I get:

$ time python test.py 
104857600

real    0m7.033s
user    0m6.293s
sys     0m0.730s

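The stream version is about 5 times faster (since we have 5 files of the same size). This is because the standard way of opening allows seeking backwards, and in a compressed tar file it can only do that by extracting (I don't know the exact reason). If you open the archive as a stream you can no longer seek randomly, but if you read sequentially, which is possible in your case, it is much faster. You also can no longer call getnames in advance, but that isn't needed here.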
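
Applied to the original question, the stream approach might look like the sketch below: a single sequential pass over the tar.gz that writes out only members whose name contains "band". The function `extract_matching`, the helper `make_demo_archive`, and the demo file names are hypothetical. As a simplification, it writes each file under its base name rather than recreating the directory structure from the archive.

```python
import io
import os
import shutil
import tarfile
import tempfile


def extract_matching(archive_path, dest_dir, keyword="band"):
    """Stream through a .tar.gz once and write only members whose name contains keyword."""
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    with tarfile.open(archive_path, "r|gz") as tfile:  # '|' selects sequential stream mode
        for member in tfile:
            if not member.isfile() or keyword not in member.name:
                continue  # skipped members are read past but never written to disk
            source = tfile.extractfile(member)
            # Simplification: keep only the base name, dropping any directory part.
            base_name = os.path.basename(member.name)
            with open(os.path.join(dest_dir, base_name), "wb") as out:
                shutil.copyfileobj(source, out)
            written.append(base_name)
    return written


def make_demo_archive(path):
    """Create a tiny tar.gz with made-up member names for the demonstration."""
    with tarfile.open(path, "w:gz") as tf:
        for name in ["scene/meta.txt", "scene/band1.tif", "scene/band2.tif"]:
            payload = name.encode("ascii")
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))


demo_dir = tempfile.mkdtemp()
demo_archive = os.path.join(demo_dir, "demo.tar.gz")
make_demo_archive(demo_archive)
extracted = extract_matching(demo_archive, os.path.join(demo_dir, "newfolder"))
print(extracted)
```

Because the archive is read strictly front to back, each member has to be handled while the loop is positioned on it. Collecting the members first and extracting them afterwards would require seeking backwards, which stream mode does not allow.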
