<p>I do like @root's solution, but I would go a bit further and optimize memory usage: keep only the aggregated DF in memory, and read only those columns that you really need:</p>
<pre><code>cols = ['Geography','Count']
df = pd.DataFrame()
chunksize = 2 # adjust it! for example --> 10**5
for chunk in (pd.read_csv(filename,
usecols=cols,
chunksize=chunksize)
):
# merge previously aggregated DF with a new portion of data and aggregate it again
df = (pd.concat([df,
chunk.groupby('Geography')['Count'].sum().to_frame()])
.groupby(level=0)['Count']
.sum()
.to_frame()
)
df.reset_index().to_csv('c:/temp/result.csv', index=False)
</code></pre>
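<p>An equivalent variant (just a sketch, using the same hypothetical <code>filename</code>) keeps the running totals as a plain Series and adds each chunk's partial sums to it; <code>fill_value=0</code> takes care of Geography values that haven't been seen yet:</p>
<pre><code># sketch of an equivalent variant: a running Series of per-Geography totals
import pandas as pd

totals = pd.Series(dtype='int64')
for chunk in pd.read_csv(filename, usecols=['Geography', 'Count'],
                         chunksize=10**5):
    # align on Geography and add; fill_value=0 handles unseen keys
    totals = totals.add(chunk.groupby('Geography')['Count'].sum(),
                        fill_value=0)

# alignment may upcast to float, so cast back before writing
(totals.astype(int)
       .rename('Count')
       .rename_axis('Geography')
       .reset_index()
       .to_csv('c:/temp/result.csv', index=False))
</code></pre>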
<p>Test data:</p>
<pre><code>Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111
</code></pre>
<p>result.csv:</p>
<pre><code>Geography,Count
County1,36
County2,147
County3,1122
County5,24
County6,33
County7,11
County8,111
County9,1111
</code></pre>
<p>PS: using this approach, one can process huge files.</p>
<p>PPS: this chunking approach should work unless you need your data to be sorted; in that case I would first sort the input with classic UNIX tools like <code>awk</code>, <code>sort</code>, etc.</p>
<p>I would also recommend using PyTables (HDF5 storage) instead of CSV files: it's very fast, it allows you to read data conditionally (using the <code>where</code> parameter), so it's very handy, it saves a lot of resources, and it's usually <a href="https://stackoverflow.com/questions/37010212/what-is-the-fastest-way-to-upload-a-big-csv-file-in-notebook-to-work-with-python/37012035#37012035">much faster</a> compared to CSV.</p>
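<p>A minimal sketch of that workflow, assuming PyTables is installed (the file name <code>data.h5</code> and the key <code>'df'</code> are hypothetical): convert the CSV into an HDF5 table once, chunk by chunk, then query it conditionally:</p>
<pre><code># minimal sketch, assuming PyTables is installed;
# 'data.h5' and the key 'df' are hypothetical names
import pandas as pd

# one-off conversion: append the CSV to an HDF5 table chunk by chunk,
# so the whole file never has to fit into memory;
# data_columns makes 'Geography' queryable via `where`
for chunk in pd.read_csv('data.csv', chunksize=10**5):
    chunk.to_hdf('data.h5', key='df', format='table', append=True,
                 data_columns=['Geography'])

# later: read back only the matching rows, filtered on disk
subset = pd.read_hdf('data.h5', key='df', where="Geography == 'County1'")
</code></pre>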