<p>After looking through Pandas' <a href="https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/io/formats/csvs.py#L123" rel="nofollow noreferrer">CSV writing</a> code, I'd suggest using the <code>gzip</code> module directly. That way you can set the <a href="https://docs.python.org/3/library/gzip.html#gzip.GzipFile" rel="nofollow noreferrer"><code>mtime</code> argument</a> yourself, which seems to be what you want:</p>
<pre><code>import filecmp
import os
from gzip import GzipFile
from io import TextIOWrapper

import pandas as pd


def to_gzip_csv_no_timestamp(df, f, **kwargs):
    """Write a pandas DataFrame to a .csv.gz file without a timestamp in the
    archive header, using GzipFile and TextIOWrapper.

    Args:
        df: pandas DataFrame.
        f: Output path, e.g. 'df.csv.gz'.
        **kwargs: Other keyword arguments passed to to_csv().
    """
    # mtime=0 zeroes the timestamp field of the gzip header.
    with TextIOWrapper(GzipFile(f, 'w', mtime=0), encoding='utf-8') as fd:
        df.to_csv(fd, **kwargs)


# df is the DataFrame from the question.
# GzipFile also stores the base name (minus '.gz') in the header, so the
# comparison uses two files that share a base name.
os.makedirs('run1', exist_ok=True)
os.makedirs('run2', exist_ok=True)
to_gzip_csv_no_timestamp(df, 'run1/df.csv.gz')
to_gzip_csv_no_timestamp(df, 'run2/df.csv.gz')
filecmp.cmp('run1/df.csv.gz', 'run2/df.csv.gz', shallow=False)
# True
</code></pre>
<p>For this small dataset, this outperforms the two-step <code>subprocess</code> approach below:</p>
<pre><code>%timeit to_gzip_csv_no_timestamp(df, 'df.csv')
693 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit to_gzip_csv_no_timestamp_subprocess(df, 'df.csv')
10.2 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
</code></pre>
<p>I'm using <code>TextIOWrapper()</code> to handle the conversion from strings to bytes, as <a href="https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/io/common.py#L298" rel="nofollow noreferrer">Pandas does</a>, but if you know you won't be saving much data you could also do this:</p>
<pre><code>with GzipFile('df.csv.gz', 'w', mtime=0) as fd:
    # to_csv() with no path returns the whole CSV as one str in memory.
    fd.write(df.to_csv().encode('utf-8'))
</code></pre>
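<p>Note that <code>gzip -lv df.csv.gz</code> still shows the "current time", but it is only pulling that from the inode's mtime. Dumping the file with <code>hexdump -C</code> shows that the <code>mtime=0</code> value is saved in the file, and changing the file's mtime (with <code>touch -mt 0711171533 df.csv.gz</code>) causes <code>gzip</code> to display a different value.</p>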
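<p>The same header check can be done from Python instead of <code>hexdump</code>. This is a minimal sketch, assuming the RFC 1952 layout in which the timestamp is a 4-byte little-endian integer at offset 4. <code>read_gzip_header</code> is a hypothetical helper, not part of the <code>gzip</code> module, and the archive is built in memory purely for illustration:</p>

```python
import gzip
import io
import struct


def read_gzip_header(data):
    """Return (mtime, flags) parsed from the start of a gzip stream."""
    # Layout (RFC 1952): 2-byte magic, 1-byte method, 1-byte flags,
    # then a 4-byte little-endian MTIME field.
    magic, method, flags, mtime = struct.unpack('<2sBBI', data[:8])
    if magic != b'\x1f\x8b':
        raise ValueError('not a gzip stream')
    return mtime, flags


# Build a small archive in memory with the timestamp zeroed.
buf = io.BytesIO()
with gzip.GzipFile(filename='', mode='wb', fileobj=buf, mtime=0) as gz:
    gz.write(b'a,b\n1,2\n')

mtime, flags = read_gzip_header(buf.getvalue())
print(mtime)  # 0
```

<p>To inspect a file on disk, pass the bytes read from it (for example the first 8 bytes of <code>df.csv.gz</code>) to the helper instead.</p>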
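<p>Note also that the original file name is part of the gzip file, so you need to write to the same name (or override the stored name) for the output to be deterministic.</p>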
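<p>One way to override the stored name is to open the output file yourself and hand it to <code>GzipFile</code> with an empty <code>filename</code>. The sketch below is an illustration rather than part of the timings above: <code>write_gzip_no_name</code> is a hypothetical helper, and the byte string stands in for <code>df.to_csv().encode('utf-8')</code> so that the example runs without pandas:</p>

```python
import gzip
import os
import tempfile


def write_gzip_no_name(path, data):
    """Write bytes to `path` as gzip with no timestamp and no stored name."""
    with open(path, 'wb') as raw:
        # filename='' keeps GzipFile from copying the path's base name into
        # the header; mtime=0 zeroes the timestamp field.
        with gzip.GzipFile(filename='', mode='wb', fileobj=raw, mtime=0) as gz:
            gz.write(data)


payload = b'a,b\n1,2\n'  # stand-in for df.to_csv().encode('utf-8')
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, 'df.csv.gz')
    second = os.path.join(tmp, 'df2.csv.gz')
    write_gzip_no_name(first, payload)
    write_gzip_no_name(second, payload)
    with open(first, 'rb') as f1, open(second, 'rb') as f2:
        same = f1.read() == f2.read()

print(same)  # True
```

<p>With the name left out of the header, two archives written to different paths come out byte-identical as long as the data and compression settings match.</p>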