Forcing or explicitly triggering data rebalancing with dask.distributed

# connect to the Dask scheduler
from dask.distributed import Client, Sub, Pub, fire_and_forget
client = Client(scheduler_file='../scheduler.json', set_as_default=True)

# load data into a numpy array
import numpy as np
npvol = np.array(np.fromfile('/home/nleaf/data/RegGrid/Vorts_t50_128x128x128_f32.raw', dtype=np.float32))
npvol = npvol.reshape([128,128,128])

# convert numpy array to a dask array
# (N, the number of chunks along the last axis, is not defined in the post)
import dask.array as da
ar = da.from_array(npvol).rechunk([npvol.shape[0], npvol.shape[1], npvol.shape[2]//N])

def test(ar):
    # report the MPI rank of the worker that processes this block
    from mpi4py import MPI
    rank = MPI.COMM_WORLD.Get_rank()
    return np.array([rank], ndmin=3, dtype=int)

client.rebalance()
print(client.persist(ar.map_blocks(test, chunks=(1,1,1))).compute())

1 answer

User

#1 · Posted on 2024-09-27 18:23:53

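Because your total dataset is not that large, the initial call to from_array creates just one chunk, so the data ends up on a single worker (you can specify otherwise with chunks=). The rechunk that follows prefers not to move data if it can avoid it.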
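As a rough illustration of the chunks= option mentioned above, here is a minimal sketch. The zero-filled array stands in for the loaded volume, and n_parts is a hypothetical chunk count that does not appear in the original post.

```python
import dask.array as da
import numpy as np

# Stand-in for the 128x128x128 float32 volume loaded in the question.
vol = np.zeros((128, 128, 128), dtype=np.float32)

n_parts = 4  # hypothetical number of chunks; pick one that divides the axis evenly

# Ask for n_parts slabs along the first axis instead of relying on the default.
chunked = da.from_array(
    vol, chunks=(vol.shape[0] // n_parts, vol.shape[1], vol.shape[2])
)
print(chunked.numblocks)  # (4, 1, 1)
```

This only changes how the array is split. The data is still read on the client first, which is why loading inside the workers may be the better option.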

Assuming every worker has access to your file, it is better to load the chunks inside the workers in the first place.

You would need a function like this:

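import numpy as np

def get_chunk(fn, offset, count, shape, dtype):
    # read `count` items starting at byte `offset`, then reshape them into one chunk
    with open(fn, 'rb') as f:
        f.seek(offset)
        return np.fromfile(f, dtype=dtype, count=count).reshape(shape)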

and pass a different offset for each chunk:

import dask
import dask.array as da
parts = [da.from_delayed(dask.delayed(get_chunk)(fn, offset, count, shape, dtype), shape, dtype) for offset in [...]]
arr = da.concatenate(parts)
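To make the offsets concrete, here is a self-contained sketch of the approach above, run against a small temporary file instead of the real volume. The helper slab_layout, the 8x4x4 test file, and n_chunks are assumptions introduced for illustration. The file is split into slabs along the first axis because those are the contiguous byte ranges in a raw C-ordered file.

```python
import os
import tempfile

import dask
import dask.array as da
import numpy as np


def get_chunk(fn, offset, count, shape, dtype):
    # Read `count` items starting at byte `offset` and reshape them into one chunk.
    with open(fn, "rb") as f:
        f.seek(offset)
        return np.fromfile(f, dtype=dtype, count=count).reshape(shape)


def slab_layout(full_shape, dtype, n_chunks):
    # Hypothetical helper: split a C-ordered raw file into equal slabs along axis 0.
    # Returns the shape of one slab, its element count, and each slab's byte offset.
    chunk_shape = (full_shape[0] // n_chunks,) + tuple(full_shape[1:])
    count = int(np.prod(chunk_shape))
    itemsize = np.dtype(dtype).itemsize
    offsets = [i * count * itemsize for i in range(n_chunks)]
    return chunk_shape, count, offsets


# Small stand-in for the raw volume so the sketch runs anywhere.
full_shape = (8, 4, 4)
dtype = np.dtype(np.float32)
expected = np.arange(np.prod(full_shape), dtype=dtype).reshape(full_shape)
fn = os.path.join(tempfile.mkdtemp(), "volume.raw")
expected.tofile(fn)

n_chunks = 4  # hypothetical; must divide the first axis evenly here
chunk_shape, count, offsets = slab_layout(full_shape, dtype, n_chunks)

# One lazy read per slab; nothing is loaded until compute() is called.
parts = [
    da.from_delayed(
        dask.delayed(get_chunk)(fn, offset, count, chunk_shape, dtype),
        shape=chunk_shape,
        dtype=dtype,
    )
    for offset in offsets
]
arr = da.concatenate(parts, axis=0)
result = arr.compute()
```

For the 128x128x128 float32 file in the question, the same arithmetic with four slabs gives 524288 elements per slab and byte offsets 0, 2097152, 4194304 and 6291456. With a distributed client active, each read should run as a separate task on whichever worker the scheduler assigns it to.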

This is very similar to what the npy source in Intake does automatically; code: https://github.com/intake/intake/blob/master/intake/source/npy.py#L11
