Multithreaded integer matrix multiplication in NumPy/SciPy

array: np.random.randint(2, size=shape).astype(dtype)

dtype     shape             %time (average)
float32   (2000, 2000)      62.5 ms
float32   (3000, 3000)      219 ms
float32   (4000, 4000)      328 ms
float32   (10000, 10000)    4.09 s
int8      (2000, 2000)      13 s
int8      (3000, 3000)      3 min 26 s
int8      (4000, 4000)      12 min 20 s
int8      (10000, 10000)    Did not finish in 6 hours
float16   (2000, 2000)      2 min 25 s
float16   (3000, 3000)      Not tested
float16   (4000, 4000)      Not tested
float16   (10000, 10000)    Not tested
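For anyone who wants to reproduce the shape of these timings on a smaller scale, here is a minimal sketch. The helper name `time_matmul`, the use of `np.random.default_rng`, and the matrix sizes are my own assumptions, not part of the original benchmark; absolute numbers will depend on the machine and the BLAS build.

```python
import time

import numpy as np


def time_matmul(dtype, n, repeats=3):
    """Return the best wall-clock time in seconds for an n x n product.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(0)
    # 0/1 matrix, as in the benchmark above, cast to the dtype under test.
    a = rng.integers(0, 2, size=(n, n)).astype(dtype)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a.dot(a)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Small sizes so the integer cases finish quickly.
    for dtype in (np.float32, np.int8):
        for n in (200, 400):
            elapsed = time_matmul(dtype, n)
            print(f"{np.dtype(dtype).name:8s} ({n}, {n})  {elapsed:.4f} s")
```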

2 Answers

User
Answer #1 · edited 2024-09-30 01:22:26

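Option 5 - Roll a custom solution: Partition the matrix product into a few sub-products and compute them in parallel. This is relatively easy to implement with standard Python modules. The sub-products are computed with numpy.dot, which releases the global interpreter lock. It is therefore possible to use threads, which are relatively lightweight and can access the arrays of the main thread directly, which keeps memory use low.
Implementation: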
import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time


def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (nrows, ncols, n, m) where
    n * nrows, m * ncols = arr.shape.
    This should be a view of the original array.
    """
    h, w = arr.shape
    n, m = h // nrows, w // ncols
    return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)


def do_dot(a, b, out):
    #np.dot(a, b, out)  # does not work. maybe because out is not C-contiguous?
    out[:] = np.dot(a, b)  # less efficient because the output is stored in a temporary array?


def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
    """
    Return the matrix product a * b.
    The product is split into nblocks * mblocks partitions that are performed
    in parallel threads.
    """
    n_jobs = nblocks * mblocks
    print('running {} jobs in parallel'.format(n_jobs))

    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)

    out_blocks = blockshaped(out, nblocks, mblocks)
    a_blocks = blockshaped(a, nblocks, 1)
    b_blocks = blockshaped(b, 1, mblocks)

    threads = []
    for i in range(nblocks):
        for j in range(mblocks):
            th = threading.Thread(target=dot_func,
                                  args=(a_blocks[i, 0, :, :],
                                        b_blocks[0, j, :, :],
                                        out_blocks[i, j, :, :]))
            th.start()
            threads.append(th)

    for th in threads:
        th.join()

    return out


if __name__ == '__main__':
    a = np.ones((4, 3), dtype=int)
    b = np.arange(18, dtype=int).reshape(3, 6)
    assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))

    a = np.random.randn(1500, 1500).astype(int)

    start = time()
    pardot(a, a, 2, 4)
    time_par = time() - start
    print('pardot: {:.2f} seconds taken'.format(time_par))

    start = time()
    np.dot(a, a)
    time_dot = time() - start
    print('np.dot: {:.2f} seconds taken'.format(time_dot))
With this implementation I get a speedup of about 4x, which matches the number of physical cores in my machine.
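The same idea can be sketched more compactly with a thread pool. This is a variation of my own, not the code from the answer: the name `pardot_rows`, the `n_workers` parameter, and the choice to split only the rows of `a` are assumptions made for illustration. Unlike the block-reshaping version, it does not require the row count to be divisible by the number of workers. It relies on the same premise stated above, that numpy.dot releases the GIL while computing.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def pardot_rows(a, b, n_workers=4):
    """Compute a @ b by giving each worker thread a slab of rows of `a`.

    Hypothetical helper; each thread writes to a disjoint slice of `out`,
    so no locking is needed.
    """
    out = np.empty((a.shape[0], b.shape[1]), dtype=np.result_type(a, b))
    # Row boundaries; slabs may differ in size by one row (or be empty).
    bounds = np.linspace(0, a.shape[0], n_workers + 1).astype(int)

    def work(lo, hi):
        out[lo:hi] = np.dot(a[lo:hi], b)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(work, lo, hi)
                   for lo, hi in zip(bounds[:-1], bounds[1:])]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return out


if __name__ == "__main__":
    a = np.arange(35).reshape(7, 5)
    b = np.arange(15).reshape(5, 3)
    assert np.array_equal(pardot_rows(a, b, n_workers=3), np.dot(a, b))
```

Splitting only along rows keeps every output slab C-contiguous; whether that matters for speed on a given machine is something to measure rather than assume.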

User
Answer #2 · edited 2024-09-30 01:22:26

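"Why is it faster to perform float by float matrix multiplication compared to int by int?" explains why integers are so slow: first, CPUs have high-throughput floating-point pipelines. Second, BLAS has no integer types.
Workaround: Casting the matrices to float32 gives a large speedup. On a 2015 MacBook Pro the speedup was about 90x. (With float64 the speedup is about half that.)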
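import numpy as np
import time


def timeit(func):
    """Return the wall-clock time in seconds of a single call to func."""
    start = time.time()
    func()
    end = time.time()
    return end - start


# np.random.random_integers(0, 9) is gone from current NumPy;
# randint(0, 10) draws from the same range 0..9.
a = np.random.randint(0, 10, size=(1000, 1000)).astype(np.int8)

print(timeit(lambda: a.dot(a)))  # ≈0.9 sec
print(timeit(lambda: a.astype(np.float32).dot(a.astype(np.float32)).astype(np.int8)))  # ≈0.01 sec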
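One caveat with the float32 route is exactness. float32 represents every integer of magnitude up to 2**24 exactly, so the result is exact as long as no partial sum can exceed that. The sketch below checks a conservative bound before casting. The function name `int_dot_via_float32`, the default int64 output dtype, and the decision to raise an error are my own assumptions, not part of the answer above.

```python
import numpy as np

# Every integer with magnitude <= 2**24 is exactly representable in float32.
FLOAT32_EXACT_INT = 2 ** 24


def int_dot_via_float32(a, b, out_dtype=np.int64):
    """Integer matrix product computed in float32, refusing inexact cases.

    Hypothetical helper: raises ValueError when the worst-case partial sum
    could exceed the range where float32 holds integers exactly.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    # Widen before abs() so that e.g. int8 -128 does not overflow.
    max_a = int(np.abs(a.astype(np.int64)).max(initial=0))
    max_b = int(np.abs(b.astype(np.int64)).max(initial=0))
    # Each output entry sums a.shape[1] products, each at most max_a * max_b.
    bound = a.shape[1] * max_a * max_b
    if bound > FLOAT32_EXACT_INT:
        raise ValueError(
            "worst-case sum {} exceeds 2**24; float32 may round".format(bound))
    product = a.astype(np.float32).dot(b.astype(np.float32))
    return product.astype(out_dtype)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.integers(0, 10, size=(300, 300)).astype(np.int8)
    reference = a.astype(np.int64).dot(a.astype(np.int64))
    assert np.array_equal(int_dot_via_float32(a, a), reference)
```

For the 0/1 matrices in the question the bound is just the inner dimension, so even the (10000, 10000) case stays far below 2**24. Returning int64 rather than int8 avoids wrapping results that do not fit in eight bits.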
