Numba and guvectorize for CUDA target: code running slower than expected

Notable details <ul> <li>Large datasets (10 million x 5), (200 x 10 million x 5)</li> <li>Mostly NumPy</li> <li>Takes longer after every run</li> <li>Using Spyder3</li> <li>Windows 10</li> </ul>

The first thing I tried was guvectorize with the following function. I pass in a bunch of NumPy arrays and use them to multiply across two of the arrays. This works when run with a target other than cuda. However, when switched to cuda, it fails with an unknown error:

<blockquote> File "C:\ProgramData\Anaconda3\lib\site-packages\numba\cuda\decorators.py", line 82, in jitwrapper debug=debug) TypeError: __init__() got an unexpected keyword argument 'debug' </blockquote>

After following up everything I could find about this error, I hit nothing but dead ends. I'm guessing it's a really simple fix that I'm completely missing. It should also be said that this error only occurs after running the code once and having it crash due to memory overload.

<pre><code>os.environ["NUMBA_ENABLE_CUDASIM"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"
...
</code></pre>

All of the arrays are NumPy arrays. [code listing missing in the source]

Attempting to run the code with nvprofiler from the command line results in the following error:

<blockquote> Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower.
More details can be found at: <a href="http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory" rel="nofollow noreferrer">http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory</a> </blockquote>

I realized that I am using SLI-enabled graphics cards (both cards are identical, EVGA GTX 1080 Ti, and have the same device ID), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try to exclude the other card, but the result is the same.

I can still run the code with nvprof, but the cuda function is slower than njit(parallel=True) with prange. With a smaller data size the code does run, but it is slower than target='parallel' and target='cpu'.

Why is cuda so much slower, and what do these errors mean?

Thanks for any help!

Edit: here is a working example of the code:

<pre><code>import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0,countRow):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
</code></pre>

I can run the code in cuda on a GTX 1080 Ti, but it is much slower than running it with the parallel or cpu targets. I have looked at other posts about guvectorize, but none of them helped me understand what is and is not a good fit for guvectorize. Is there any way to make this code "cuda friendly"? Or is multiplying across arrays simply too trivial an operation to see any benefit?

1 Answer

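The reason the gufunc that Numba emits runs so slowly becomes immediately obvious on profiling (Numba 0.38.1 with CUDA 8.0):

<pre><code>==24691== Profiling application: python slowvec.py
==24691== Profiling result:
   Start  Duration  Grid Size  Block Size  Regs*  SSMem*  DSMem*  Size  Throughput  Device  Context  Stream  Name
271.33ms  1.2800us  -  -  -  -  -  8B  5.9605MB/s  GeForce GTX 970  1  7  [CUDA memcpy HtoD]
271.65ms  14.591us  -  -  -  -  -  156.25KB  10.213GB/s  GeForce GTX 970  1  7  [CUDA memcpy HtoD]
272.09ms  2.5868ms  -  -  -  -  -  15.259MB  5.7605GB/s  GeForce GTX 970  1  7  [CUDA memcpy HtoD]
274.98ms  992ns  -  -  -  -  -  8B  7.6909MB/s  GeForce GTX 970  1  7  [CUDA memcpy HtoD]
275.17ms  640ns  -  -  -  -  -  8B  11.921MB/s  GeForce GTX 970  1  7  [CUDA memcpy HtoD]
276.33ms  657.28ms  (1 1 1)  (64 1 1)  40  0B  0B  -  -  GeForce GTX 970  1  7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [38]
933.62ms  3.5128ms  -  -  -  -  -  15.259MB  4.2419GB/s  GeForce GTX 970  1  7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
</code></pre>

The final kernel launch, which is the one that runs the code, uses a single block of 64 threads. On a GPU that can theoretically hold up to 2048 threads per MP, with 23 MPs, that means about 99.9% of the theoretical processing capacity of your GPU is not being used. This looks like a ridiculous design choice by the Numba developers, and I would report it as a bug if you are hindered by it (and it seems you are).

The obvious solution is to rewrite your function as a <code>@cuda.jit</code> function in the CUDA Python kernel dialect and explicitly control the execution parameters. That way you can at least ensure that the code runs with enough threads to potentially use all the capacity of your hardware. It is still a very memory-bound operation, so the speedup you can achieve may be limited to well below the ratio of your GPU's memory bandwidth to your CPU's. That may well not be enough to amortize the cost of the host-to-device memory transfers, so there may be no performance gain even in the best possible case, and the current code is far from that best case.

In short, beware the perils of automagic compiler-generated parallelism.

Postscript: I managed to work out how to get the PTX of the code emitted by Numba, and suffice it to say it is absolutely terrible (and I can't really post all of it):

[PTX listing missing in the source]

All of those integer operations just to perform exactly one double-precision multiply!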
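To make the suggested rewrite concrete, here is a minimal sketch of the same element-wise multiply as a `@cuda.jit` kernel with an explicitly chosen launch configuration. The names `launch_config`, `idle_fraction`, `vest_discount_gpu` and `vest_discount_kernel`, and the choice of 256 threads per block, are illustrative assumptions and not part of the original answer; the 2048 threads per MP and 23 MPs are the figures quoted above. The sketch assumes float64, C-contiguous inputs shaped like the question's `multBy` and `discount`. It has not been benchmarked, and the transfer costs discussed above may still dominate.

```python
import math


def launch_config(n_elements, threads_per_block=256):
    """Return (blocks_per_grid, threads_per_block) giving one thread per element."""
    blocks_per_grid = math.ceil(n_elements / threads_per_block)
    return blocks_per_grid, threads_per_block


def idle_fraction(threads_launched, threads_per_mp=2048, n_mp=23):
    """Fraction of the resident-thread capacity that a launch leaves unused.

    The defaults are the figures quoted in the answer, not queried from a device.
    """
    return 1.0 - threads_launched / (threads_per_mp * n_mp)


def vest_discount_gpu(mult_by, discount):
    """Compute cv[d, i, k] = mult_by[i, k] * discount[d, i, k] on the GPU.

    Needs numba and a CUDA-capable device. The imports are local so the two
    helpers above stay usable without either; in real code the kernel would be
    defined once at module level so it is not recompiled on every call.
    """
    import numpy as np
    from numba import cuda

    @cuda.jit
    def vest_discount_kernel(mult_flat, disc_flat, cv_flat):
        idx = cuda.grid(1)  # global 1D thread index
        if idx < cv_flat.shape[0]:  # guard the partially filled last block
            # With C-ordered (n, m, o) data, idx % (m * o) is the matching
            # position in the flattened (m, o) multiplier array.
            cv_flat[idx] = mult_flat[idx % mult_flat.shape[0]] * disc_flat[idx]

    # Explicit host-to-device copies of the flattened inputs.
    d_mult = cuda.to_device(np.ascontiguousarray(mult_by).ravel())
    d_disc = cuda.to_device(np.ascontiguousarray(discount).ravel())
    d_cv = cuda.device_array(discount.size, dtype=discount.dtype)

    blocks, threads = launch_config(discount.size)
    vest_discount_kernel[blocks, threads](d_mult, d_disc, d_cv)

    # Copy the result back and restore the original 3D shape.
    return d_cv.copy_to_host().reshape(discount.shape)
```

For the 100 x 4000 x 5 example in the question, `launch_config(2_000_000)` yields 7813 blocks of 256 threads, compared with the single 64-thread block seen in the profile, and `idle_fraction(64)` reproduces the roughly 99.9% idle figure.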
