Why does reading a numpy ndarray from a file take up so much memory?

2024-10-02 10:30:14 发布


The file contains 2,000,000 lines; each line has 208 comma-separated columns, like this:

0.0863314058048,0.0208767447842,0.03358010485,0.0,1.0,0.0,0.314285714286,0.336293217457,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0

The program reads this file into a numpy ndarray. I expected it to consume roughly 2,000,000 × 208 × 8 B ≈ 3.2 GB of memory. However, while reading the file, the program actually consumes about 20 GB.

I am confused: why does my program consume so much more memory than expected?
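For reference, the expected footprint stated above is just arithmetic over the array dimensions; a quick sketch of that calculation (the observed 20 GB is the anomaly, not this number):

```python
# expected size of a dense float64 array for the file in the question
nrows, ncols, itemsize = 2_000_000, 208, 8  # float64 is 8 bytes per element
expected_bytes = nrows * ncols * itemsize
expected_gib = expected_bytes / 1024**3
print(round(expected_gib, 1))  # about 3.1 GiB (~3.3 GB in decimal units)
```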


Tags: file memory program comma gb numpy ndarray
2 Answers

I think you should try pandas for handling big data (text files). pandas is very good at this in Python, and it uses numpy internally to represent the data.

HDF5 is another option: you can save the large data into an HDF5 binary file.

This question offers some ideas on how to handle large files: "Large data" work flows using pandas
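As a minimal sketch of the suggestion above, pandas.read_csv parses numeric text directly into a float64-backed DataFrame, which is much more memory-efficient than np.loadtxt; the tiny inline CSV here is illustrative only, standing in for the 208-column file:

```python
import io

import numpy as np
import pandas as pd

# a stand-in for the real comma-separated file from the question
csv_text = "0.1,0.2,1.0\n0.3,0.4,0.0\n"

# header=None: the file has no header row; dtype pins the storage to float64
df = pd.read_csv(io.StringIO(csv_text), header=None, dtype=np.float64)
arr = df.to_numpy()          # the underlying float64 ndarray
print(arr.shape, arr.dtype)  # (2, 3) float64
```

For a file that still does not fit comfortably in memory, read_csv also accepts a chunksize argument to process the file in pieces.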

I am using Numpy 1.9.0, and the memory inefficiency of np.loadtxt() and np.genfromtxt() seems directly related to the fact that they build on temporary lists to store the data:

  • see here for np.loadtxt()
  • and here for np.genfromtxt()
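The cost of those temporary lists can be seen directly: a Python list of float objects needs far more than the 8 bytes per value a float64 array does (exact numbers vary by platform and CPython version; this sketch only illustrates the ratio, before even counting the intermediate string copies the loaders keep around):

```python
import sys

import numpy as np

values = [float(i) for i in range(100_000)]

# list overhead: the list's pointer array plus one boxed float object per value
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# ndarray: a flat buffer of 8-byte float64 values
array_bytes = np.array(values, dtype=np.float64).nbytes

print(list_bytes / array_bytes)  # typically several times larger
```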

If you know the shape of your array beforehand, you can write a file reader that stores the data with the corresponding dtype and consumes an amount of memory very close to the theoretical one (3.2 GB for this case):

import numpy as np

def read_large_txt(path, delimiter=None, dtype=None):
    with open(path) as f:
        nrows = sum(1 for line in f)           # first pass: count the rows
        f.seek(0)
        ncols = len(next(f).split(delimiter))  # column count from the first line
        out = np.empty((nrows, ncols), dtype=dtype)  # preallocate the full array once
        f.seek(0)
        for i, line in enumerate(f):
            out[i] = line.split(delimiter)     # parse each row directly into the array
    return out
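A quick self-contained check of the reader above on a small temporary comma-separated file (the function is repeated here, in Python 3 form, so the snippet runs on its own; the 3×4 file is made up for the demonstration):

```python
import os
import tempfile

import numpy as np

def read_large_txt(path, delimiter=None, dtype=None):
    # same reader as in the answer above (next(f) is the Python 3 spelling)
    with open(path) as f:
        nrows = sum(1 for line in f)
        f.seek(0)
        ncols = len(next(f).split(delimiter))
        out = np.empty((nrows, ncols), dtype=dtype)
        f.seek(0)
        for i, line in enumerate(f):
            out[i] = line.split(delimiter)
    return out

# write a small 3x4 file and read it back
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1,2,3,4\n5,6,7,8\n9,10,11,12\n")
    path = tmp.name

arr = read_large_txt(path, delimiter=",", dtype=np.float64)
os.remove(path)
print(arr.shape, arr[2, 3])  # (3, 4) 12.0
```

Note the trade-off: the file is scanned twice (once to count rows), but the payoff is a single preallocated array with no per-row temporary lists surviving the loop.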
