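<p>You can use <a href="http://dask.pydata.org/en/latest/dataframe.html" rel="noreferrer"><code>dask.dataframe</code></a>, which is syntactically similar to <code>pandas</code> but performs operations out of core, so memory should not be an issue:</p>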
<pre><code>import dask.dataframe as dd
df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()
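# Depending on the Dask version, this may write one file per partition;
# passing single_file=True should produce a single CSV instead.
df.to_csv('my_output.csv')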
</code></pre>
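<p>Alternatively, if <code>pandas</code> is a requirement, you can read the file in chunks, as mentioned by @chrisaycock. You may want to experiment with the <code>chunksize</code> parameter.</p>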
<pre><code>import pandas as pd

# Aggregate each chunk separately so only one chunk is in memory at a time.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)
# Combine the chunked data.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
</code></pre>
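<p>If even the list of per-chunk results gets large, a variation is to keep a running total instead of collecting the partial results and concatenating them at the end. The following is only a sketch under assumptions: the helper name <code>sum_counts_by_geography</code> is hypothetical, the column names are the ones used above, and the totals come out as floats because of how <code>Series.add</code> fills missing keys.</p>

```python
import pandas as pd


def sum_counts_by_geography(source, chunksize=10**5):
    """Sum 'Count' per 'Geography', reading `source` chunk by chunk.

    Hypothetical helper for illustration; `source` is anything
    pd.read_csv accepts (a path or a file-like object).
    """
    # Running totals, indexed by Geography. Starts empty.
    totals = pd.Series(dtype="float64")
    for chunk in pd.read_csv(source, chunksize=chunksize):
        partial = chunk.groupby("Geography")["Count"].sum()
        # Align on the Geography index; a key missing on one side counts as 0.
        totals = totals.add(partial, fill_value=0)
    return totals.rename("Count").rename_axis("Geography").to_frame()
```

<p>Calling <code>sum_counts_by_geography('my_file.csv').to_csv('my_output.csv')</code> would then write the same kind of output as the snippets above, while holding only one chunk and the running totals in memory.</p>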