在archi上不带时间戳的情况下，将数据帧写入gzip csv

import pandas as pd df = pd.DataFrame({'a': [1]}) df.to_csv('df.csv.gz', compression='gzip') # Timestamp is the large number per https://unix.stackexchange.com/a/79546/88807. !<df.csv.gz dd bs=4 skip=1 count=1 | od -t d4 # 1+0 records in # 1+0 records out # 4 bytes copied, 5.6233e-05 s, 71.1 kB/s # 0000000 1546978755 # 0000004df.csv

2条回答

网友

1楼 · 编辑于 2024-09-27 17:59:33

在浏览了Pandas的CSV writing代码之后，我建议最好直接使用gzip模块。这样您就可以直接设置^{} attribute，这似乎就是您想要的：

import pandas as pd
from gzip import GzipFile
from io import TextIOWrapper

def to_gzip_csv_no_timestamp(df, f, *kwargs):
    # Write pandas DataFrame to a .csv.gz file, without a timestamp in the archive
    # header, using GzipFile and TextIOWrapper.
    #
    # Args:
    #     df: pandas DataFrame.
    #     f: Filename string ending in .csv (not .csv.gz).
    #     *kwargs: Other arguments passed to to_csv().
    #
    # Returns:
    #     Nothing.
    with TextIOWrapper(GzipFile(f, 'w', mtime=0), encoding='utf-8') as fd:
        df.to_csv(fd, *kwargs)

to_gzip_csv_no_timestamp(df, 'df.csv.gz')
to_gzip_csv_no_timestamp(df, 'df2.csv.gz')

filecmp.cmp('df.csv.gz', 'df2.csv.gz')
# True

对于这个小数据集，这优于下面的两步subprocess方法：

%timeit to_gzip_csv_no_timestamp(df, 'df.csv')                                                                                                                                                                                                                                    
693 us +- 14.6 us per loop (mean +- std. dev. of 7 runs, 1000 loops each)

%timeit to_gzip_csv_no_timestamp_subprocess(df, 'df.csv')
10.2 ms +- 167 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

我使用TextIOWrapper()将字符串转换为字节作为Pandas does处理，但如果您知道不会保存太多数据，也可以这样做：

with GzipFile('df.csv.gz', 'w', mtime=0) as fd:
    fd.write(df.to_csv().encode('utf-8'))

注意，gzip -lv df.csv.gz仍然显示“当前时间”，但它只是从inode的mtime中提取这个值。使用hexdump -C转储显示值保存在文件中，更改文件mtime（使用touch -mt 0711171533 df.csv.gz）会导致gzip显示不同的值

还要注意，原始的“filename”也是gzip文件的一部分，因此您需要写入相同的名称（或者重写此名称）以使其具有确定性。你知道吗

网友

2楼 · 编辑于 2024-09-27 17:59:33

您可以导出为未压缩的CSV，然后使用-n标志调用gzip，以避免时间戳（这也是不将文件名保存在存档中的说明）：

import subprocess

def to_gzip_csv_no_timestamp_subprocess(df, f, *kwargs):
    # Write pandas DataFrame to a .csv.gz file, without a timestamp in the archive
    # header.
    # Args:
    #     df: pandas DataFrame.
    #     f: Filename string ending in .csv (not .csv.gz).
    #     *kwargs: Other arguments passed to to_csv().
    # Returns:
    #     Nothing.
    import subprocess
    df.to_csv(f, *kwargs)
    # -n for the timestamp, -f to overwrite.
    subprocess.check_call(['gzip', '-nf', f])

to_gzip_csv_no_timestamp(df, 'df.csv')
to_gzip_csv_no_timestamp(df, 'df2.csv')
filecmp.cmp('df.csv.gz', 'df2.csv.gz')
# True

相关问题更多 >

编程相关推荐

热门问题

热门文章