The file contains 2,000,000 lines; each line has 208 comma-separated columns, like this:
0.0863314058048,0.0208767447842,0.03358010485,0.0,1.0,0.0,0.314285714286,0.336293217457,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
The program reads this file into a numpy ndarray, which I expected to consume roughly (2000000 * 208 * 8 B) ≈ 3.2 GB of memory.
However, when the program actually reads the file, I find that it consumes about 20 GB.
I am confused: why does my program consume so much more memory than expected?
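The question does not show the loading code; a sketch of what the step presumably looks like, assuming np.loadtxt is used and "data.csv" stands in for the file above:

    import numpy as np

    # Read the whole file into one float64 ndarray.
    data = np.loadtxt("data.csv", delimiter=",")
    print(data.shape)         # (2000000, 208)
    print(data.nbytes / 1e9)  # ~3.3 GB for the final array itself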
I think you should try pandas for handling large data (text files). Pandas is well suited to this kind of work in Python, and it uses numpy internally to represent the data. HDF5 is another option: large data sets can be saved to HDF5 binary files. This question gives some ideas on how to handle large files: "Large data" work flows using pandas.
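As a sketch of that suggestion ("data.csv" is again a placeholder name), pandas' C parser avoids most of the per-value boxing overhead:

    import numpy as np
    import pandas as pd

    # header=None: the file has no header row; a fixed dtype keeps all
    # 208 columns as float64 in contiguous numpy-backed blocks.
    df = pd.read_csv("data.csv", header=None, dtype=np.float64)
    data = df.to_numpy()      # the underlying numpy ndarray
    print(data.nbytes / 1e9)  # close to the theoretical ~3.2 GB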
I am using Numpy 1.9.0, and the memory inefficiency of np.loadtxt() and np.genfromtxt() seems to be directly related to the fact that they build the data in temporary Python lists: every parsed value is boxed as a Python float object and every list slot holds a pointer, so each element costs far more than 8 bytes while the file is being read.
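The per-element cost is easy to check; a small sketch, assuming 64-bit CPython (exact sizes vary by build):

    import sys

    print(sys.getsizeof(0.0))          # 24 bytes for one boxed float
    print(sys.getsizeof([0.0] * 208))  # list header plus 208 8-byte pointers

    # At roughly (24 + 8) bytes per value instead of 8, the temporary
    # lists alone need about 2_000_000 * 208 * 32 B ≈ 13 GB, which puts
    # the observed ~20 GB in a plausible range.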
By knowing the shape of the array in advance, you can write a file reader that stores the data directly with the corresponding dtype, and the amount of memory consumed comes very close to the theoretical amount (3.2 GB for this case):