<p><strong>UPDATE:</strong> my notebook has 16GB of RAM, so I'll test it with a DataFrame 4 times smaller (64GB/16GB=4):</p>
<p>Setup:</p>
<pre><code>In [1]: df = pd.DataFrame(np.random.randint(0, 10*6, (12000, 47395)), dtype=np.int32)
In [2]: df.shape
Out[2]: (12000, 47395)
In [3]: %timeit -n 1 -r 1 df.to_csv('c:/tmp/big.csv', chunksize=1000)
1 loop, best of 1: 5min 34s per loop
</code></pre>
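<p>Let's also save this DataFrame in Feather format (for example with <code>feather.write_dataframe(df, 'c:/tmp/big.feather')</code>, the counterpart of the read call used next).</p>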
<p>Then read it back:</p>
<pre><code>In [10]: %timeit -n 1 -r 1 df = feather.read_dataframe('c:/tmp/big.feather')
1 loop, best of 1: 17.4 s per loop # reading is reasonably fast as well
</code></pre>
<p>Reading the CSV file in chunks is much slower, but it still doesn't give me a <code>MemoryError</code>:</p>
<pre><code>In [2]: %%timeit -n 1 -r 1
   ...: df = pd.DataFrame()
   ...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000):
   ...:     df = pd.concat([df, chunk])
   ...:     print(df.shape)
   ...: print(df.dtypes.unique())
   ...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int64')]
1 loop, best of 1: 9min 25s per loop
</code></pre>
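<p>The loop above calls <code>pd.concat</code> on every iteration, which copies all rows accumulated so far each time. A minimal sketch of an alternative is to collect the chunks in a list and concatenate once at the end. The small in-memory CSV and the chunk size of 3 are illustrative assumptions, not the data timed above, and this variant was not benchmarked here:</p>

```python
import io

import numpy as np
import pandas as pd

# Small stand-in for the big CSV file (hypothetical data: 10 rows x 4 columns).
small = pd.DataFrame(np.arange(40).reshape(10, 4))
buf = io.StringIO()
small.to_csv(buf)
buf.seek(0)

# Collect the chunks first, then concatenate exactly once.
chunks = []
for chunk in pd.read_csv(buf, index_col=0, chunksize=3):
    chunks.append(chunk)
df = pd.concat(chunks)

print(df.shape)  # (10, 4)
```

<p>All chunks still have to fit in memory together, so this only avoids the repeated copying, not the peak memory requirement.</p>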
<p>Now let's specify <code>dtype=np.int32</code> explicitly:</p>
<pre><code>In [1]: %%timeit -n 1 -r 1
   ...: df = pd.DataFrame()
   ...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32):
   ...:     df = pd.concat([df, chunk])
   ...:     print(df.shape)
   ...: print(df.dtypes.unique())
   ...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int32')]
1 loop, best of 1: 10min 38s per loop
</code></pre>
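<p>Passing the dtype did not make the read faster in the run above, but it does change how much memory the result needs: <code>read_csv</code> parses integers as <code>int64</code> unless told otherwise. The sketch below works out the raw data size of a 12000 x 47395 frame for both dtypes (roughly 2.1 GiB versus 4.2 GiB) and repeats the dtype check on a tiny, made-up frame. The small frame is an assumption for illustration only:</p>

```python
import io

import numpy as np
import pandas as pd

# Raw data size of a 12000 x 47395 frame for each integer dtype.
rows, cols = 12000, 47395
for name in ("int32", "int64"):
    n_bytes = rows * cols * np.dtype(name).itemsize
    print(name, round(n_bytes / 2**30, 2), "GiB")

# Tiny illustrative frame (not the data from the timings above).
small = pd.DataFrame(np.arange(12, dtype=np.int32).reshape(3, 4))
csv_text = small.to_csv()

# Without dtype= the integers come back as int64; with it they stay int32.
default = pd.read_csv(io.StringIO(csv_text), index_col=0)
narrow = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype=np.int32)

print(default.dtypes.unique())
print(narrow.dtypes.unique())
```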
<p>Testing HDF storage:</p>
<pre><code>In [10]: %timeit -n 1 -r 1 df.to_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 22.5 s per loop
In [11]: del df
In [12]: %timeit -n 1 -r 1 df = pd.read_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 1.04 s per loop
</code></pre>
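<h2>Conclusion:</h2>
<p>If you have the chance to change your storage file format, by all means avoid CSV files and use HDF5 (.h5) or Feather format instead.</p>
<p><strong>OLD answer:</strong></p>
<p>I would simply use the native pandas <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html" rel="nofollow noreferrer">read_csv()</a> method:</p>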
<pre><code>import pandas as pd

chunksize = 10**6  # rows per chunk
reader = pd.read_csv(filename, index_col=0, chunksize=chunksize)
# keep the index built from the first column, so no ignore_index here
df = pd.concat([chunk for chunk in reader])
</code></pre>
<p>According to your code:</p>
<blockquote>
<p>tag = row[0]</p>
<p>df.loc[tag] = np.array(row[1:], dtype=dftype)</p>
</blockquote>
<p>it looks like you want to use the first column of the CSV file as the index, hence <code>index_col=0</code>.</p>
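<p>As a small illustration of what <code>index_col=0</code> does, here is a sketch on a made-up three-row CSV. The column names <code>tag</code>, <code>a</code>, <code>b</code> and the values are hypothetical. The first column becomes the row label, so a row can be looked up by its tag with <code>df.loc</code>, much like the row-by-row assignment quoted above:</p>

```python
import io

import pandas as pd

# Hypothetical CSV: the first column holds the tag, the rest are values.
csv_text = "tag,a,b\nx,1,2\ny,3,4\nz,5,6\n"

# Read in chunks of 2 rows and use the first column as the index.
reader = pd.read_csv(io.StringIO(csv_text), index_col=0, chunksize=2)
df = pd.concat(list(reader))

print(df.index.tolist())     # ['x', 'y', 'z']
print(df.loc["y"].tolist())  # [3, 4]
```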