Reading large HDF5 files

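User

Reply #1 · edited 2024-09-27 09:36:18

A crash most likely means you ran out of memory. As Vignesh Pillay suggested, I would try chunking the data and processing one small piece at a time. If you use the pandas function read_hdf, you can control the chunking with the iterator and chunksize parameters: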

import pandas as pd

# iterator=True together with chunksize returns an iterator of DataFrames
data_iter = pd.read_hdf('/tmp/test.hdf', key='test_key', iterator=True, chunksize=100)
for chunk in data_iter:
    # train the CNN on this chunk here
    print(chunk.shape)

Note: this requires the HDF file to have been written in table format.
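To illustrate the table-format requirement, here is a minimal sketch that writes a small file with format="table" and reads it back in chunks. The DataFrame contents, the temporary path, and the column names are made up for the demonstration; only the key and chunk size mirror the snippet above. It assumes pandas and PyTables are installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo data: 250 rows, 3 columns.
df = pd.DataFrame(np.arange(750).reshape(250, 3), columns=["a", "b", "c"])

path = os.path.join(tempfile.mkdtemp(), "test.hdf")

# format="table" is what allows chunked reads; a file written in the
# default "fixed" format cannot be read with chunksize.
df.to_hdf(path, key="test_key", format="table")

chunk_shapes = []
for chunk in pd.read_hdf(path, key="test_key", iterator=True, chunksize=100):
    chunk_shapes.append(chunk.shape)  # each chunk is an ordinary DataFrame

print(chunk_shapes)
```

The last chunk is smaller than the others whenever the row count is not a multiple of chunksize, so downstream code should not assume a fixed chunk length.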

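User

Reply #2 · edited 2024-09-27 09:36:18

Your problem comes from running out of memory. Virtual datasets are very handy for large datasets like yours. A virtual dataset maps several real datasets into a single, sliceable dataset through an interface layer. You can read more about them here: https://docs.h5py.org/en/stable/vds.html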
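As a rough sketch of that mapping, the example below stacks two small source files into one virtual dataset. The file names, the tiny (4, 4, 3) image shape, and the number of images per file are all invented for illustration and are not taken from the question. It assumes h5py with virtual dataset support and NumPy are installed.

```python
import os
import tempfile

import h5py
import numpy as np

tmp = tempfile.mkdtemp()

# Two hypothetical source files, each holding 5 tiny "images" of shape (4, 4, 3).
sources = []
for i in range(2):
    path = os.path.join(tmp, f"part_{i}.hdf5")
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=np.full((5, 4, 4, 3), i, dtype=np.uint8))
    sources.append(path)

# The layout describes the combined dataset: 10 images stacked along axis 0.
layout = h5py.VirtualLayout(shape=(10, 4, 4, 3), dtype=np.uint8)
for i, path in enumerate(sources):
    vsource = h5py.VirtualSource(path, "data", shape=(5, 4, 4, 3))
    layout[i * 5:(i + 1) * 5] = vsource  # map this file onto its block of rows

vds_path = os.path.join(tmp, "virtual.hdf5")
with h5py.File(vds_path, "w", libver="latest") as f:
    f.create_virtual_dataset("data", layout, fillvalue=0)

# Slicing the virtual dataset reads from the underlying files on demand.
with h5py.File(vds_path, "r") as f:
    vds_shape = f["data"].shape
    first_of_first_file = f["data"][0]
    first_of_second_file = f["data"][5]
```

The virtual file stores only the mapping, so the source files have to remain where the mapping expects them.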

I suggest you start with one file at a time. First, create a virtual dataset file for your existing data, like this:

import os
import h5py
import numpy as np

with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    data_shape = db['data'].shape
    layout = h5py.VirtualLayout(shape=data_shape, dtype=np.uint8)
    vsource = h5py.VirtualSource(db['data'])
    layout[...] = vsource  # map the whole source dataset into the layout
    with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'w', libver='latest') as file:
        file.create_virtual_dataset('data', layout=layout, fillvalue=0)

This creates a virtual dataset of your existing training data. Now, if you want to manipulate the data, open the file in r+ mode like this:

with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'r+', libver='latest') as file:
    # Do whatever manipulation you want to do here, for example:
    data = file['data']

One more thing I would suggest: make sure the indices you slice with are of int type, otherwise you will get an error.
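A non-integer index typically sneaks in when a split point is computed with floating-point arithmetic. The sketch below uses a plain Python list as a stand-in for the dataset, since lists also reject float slice indices; the 80/20 split ratio is an assumption made for the example.

```python
# Stand-in for an HDF5 dataset: any sliceable sequence of "images".
images = list(range(20670))

split_float = 0.8 * len(images)  # multiplying by a float gives a float
split = int(split_float)         # cast before using it as an index

try:
    images[:split_float]         # a float slice index is rejected
    float_index_ok = True
except TypeError:
    float_index_ok = False

train, test = images[:split], images[split:]
```

Casting once, right where the index is computed, keeps every later slice safe.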

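User
Reply #3 · edited 2024-09-27 09:36:18

I updated my answer on 2020-08-03 to reflect the code you added to your question. As @Tober noted, you are running out of memory. Reading a dataset of shape (20670, 224, 224, 3) produces roughly 3.1 billion elements, and reading 3 image sets needs even more RAM. I assume this is image data (perhaps 20670 images of shape (224, 224, 3))? If so, you can read the data in slices with either h5py or tables (PyTables). Both return the data as a NumPy array, which you can use directly, with no need to convert it into a different data structure.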
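The 3.1 billion figure is just the product of the dimensions. A quick back-of-the-envelope check, assuming one byte per element (uint8 pixels, which is an assumption here rather than something the question confirms):

```python
import math

shape = (20670, 224, 224, 3)  # dataset shape from the question
bytes_per_element = 1         # assumption: uint8 pixels

n_elements = math.prod(shape)
full_gb = n_elements * bytes_per_element / 1e9
one_image_kb = math.prod(shape[1:]) * bytes_per_element / 1e3

print(f"{n_elements:,} elements, about {full_gb:.1f} GB as uint8")
print(f"one image: about {one_image_kb:.0f} kB")
```

If the pixels are converted to float32 for training, the full array takes four times as much memory, while a single image still only needs a few hundred kilobytes.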

The basic process looks like this:

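import os
import h5py

with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    training_db = db['data']
    # loop to get images one at a time (shape[0] is 20670 in the question)
    for icnt in range(training_db.shape[0]):
        # reads a single image from disk as a NumPy array
        image_arr = training_db[icnt, :, :, :]
        # then do something with the image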

You can also read several images at once by setting the first index to a range (for example icnt:icnt+100) and adjusting the loop accordingly.
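One way to handle the loop for ranged reads is sketched below. The helper names batch_slices and iter_image_batches are made up for this example, and the dataset name "data" and batch size of 100 simply follow the snippets above. The second helper assumes h5py is installed.

```python
def batch_slices(n_items, batch_size):
    """Yield slices that cover range(n_items) in steps of batch_size."""
    for start in range(0, n_items, batch_size):
        yield slice(start, min(start + batch_size, n_items))


def iter_image_batches(path, name="data", batch_size=100):
    """Yield NumPy arrays of up to batch_size images from an HDF5 dataset."""
    import h5py  # imported lazily so batch_slices works without h5py installed

    with h5py.File(path, "r") as db:
        dset = db[name]
        for sl in batch_slices(dset.shape[0], batch_size):
            yield dset[sl]  # only this slice is read from disk


# The last slice is shorter when batch_size does not divide the total.
slices = list(batch_slices(250, 100))
```

A training loop would then look like `for batch in iter_image_batches("Training_Dataset.hdf5"): ...`, with each batch having shape (n, 224, 224, 3) for n up to 100 if the data matches the shape assumed above.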
