In-place transpose of a 3D array in PyCUDA

import numpy as np
from pycuda import compiler, gpuarray
import pycuda.driver as cuda
import pycuda.autoinit

kernel_code = """
__global__ void test_indexTranspose(uint*A){
    const size_t size_x = 4;
    const size_t size_y = 4;
    const size_t size_z = 3;

    // Thread position in each dimension
    const size_t tx = blockDim.x * blockIdx.x + threadIdx.x;
    const size_t ty = blockDim.y * blockIdx.y + threadIdx.y;
    const size_t tz = blockDim.z * blockIdx.z + threadIdx.z;

    if(tx < size_x && ty < size_y && tz < size_z){
        // Flat index
        const size_t ti = tz * size_x * size_y + ty * size_x + tx;
        // Transposed flat index
        const size_t tiT = tz * size_x * size_y + tx * size_x + ty;
        A[ti] = tiT;
    }
}
"""

A = np.zeros((4,4,3),dtype=np.uint32)
mod = compiler.SourceModule(kernel_code)
test_indexTranspose = mod.get_function('test_indexTranspose')
A_gpu = gpuarray.to_gpu(A)
test_indexTranspose(A_gpu, block=(2, 2, 1), grid=(2,2,3))

A_gpu.get()[:,:,0]
array([[0, 4, 8, 12],
       [1, 5, 9, 13],
       [2, 6, 10, 14],
       [3, 7, 11, 15]], dtype=uint32)

A_gpu.get()[:,:,1]
array([[16, 20, 24, 28],
       [17, 21, 25, 29],
       [18, 22, 26, 30],
       [19, 23, 27, 31]], dtype=uint32)

A_gpu.get()[:,:,2]
...

1 answer

User

#1 · Posted 2024-10-03 23:21:36

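Creating the numpy array with strides that match the CUDA kernel code solved the problem. The default layout of a numpy array is not the row, column, depth order that the kernel assumes. However, the strides can be set when the array is created.
If the array is created like this, the kernel above works as expected: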

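nRows = 4
nCols = 4
nSlices = 3
nBytes = np.dtype(np.uint32).itemsize
# Strides are given in bytes, one per axis: (row, column, slice).
# Note: np.ndarray without a buffer leaves the memory uninitialized.
A = np.ndarray(shape=(nRows, nCols, nSlices),
               dtype=np.uint32,
               strides=(nCols*nBytes, 1*nBytes, nCols*nRows*nBytes))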

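A stride is the jump in memory, in bytes, between consecutive indices along one dimension. For example, going from the first element of row 1 to the first element of row 2 takes nCols * nBytes, i.e. 16 bytes.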
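
The stride arithmetic can be checked on the CPU without a GPU. The sketch below is not from the answer: it obtains the same strides by allocating a zero-initialized array in (slice, row, column) order and viewing it as (row, column, slice) with `transpose`, then runs a plain Python loop as a stand-in for the kernel's index computation. The names `base`, `flat`, `n_rows`, `n_cols` and `n_slices` are illustrative assumptions.

```python
import numpy as np

n_rows, n_cols, n_slices = 4, 4, 3
item = np.dtype(np.uint32).itemsize  # 4 bytes per element

# Allocate in (slice, row, column) C order, then view it as (row, column, slice).
# The view shares memory with `base`; only shape and strides change.
base = np.zeros((n_slices, n_rows, n_cols), dtype=np.uint32)
A = base.transpose(1, 2, 0)

# Same strides as the answer sets by hand: (row, column, slice) in bytes.
expected_strides = (n_cols * item, 1 * item, n_cols * n_rows * item)
print(A.shape, A.strides, A.strides == expected_strides)

# CPU stand-in for the kernel: store the transposed flat index at each flat index.
flat = base.reshape(-1)  # a view, because `base` is C-contiguous
for tz in range(n_slices):
    for ty in range(n_rows):
        for tx in range(n_cols):
            ti = tz * n_cols * n_rows + ty * n_cols + tx
            tiT = tz * n_cols * n_rows + tx * n_cols + ty
            flat[ti] = tiT

print(A[:, :, 0])
```

With this layout, `A[:, :, 0]` reads `[[0, 4, 8, 12], [1, 5, 9, 13], ...]`, which matches the output listed in the question. Whether `gpuarray.to_gpu` preserves such a non-default layout on transfer should be checked against the PyCUDA version in use.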
