<p>First, the basic operation you have shown is to take two matrices, transfer them to the GPU, do some element-wise multiplications to produce a third array, and then pass that third array back to the host.</p>
<p>It may be possible to make a numba/cuda guvectorize (or <code>cuda.jit</code> kernel) implementation that runs faster than a naive serial python implementation, but I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing. This is because the arithmetic intensity per byte of data transferred between host and device is simply too low. This operation is too simple.</p>
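<p>To put a rough number on that (my own back-of-envelope estimate, using the sizes visible in the profiler traces below): the operation performs 2,000,000 double-precision multiplies, while roughly 30-45&nbsp;MB of data has to cross the PCIe bus (the arrays involved are ~15.3&nbsp;MB each, with two or three of them moving per run). That is on the order of one floating-point operation per 16-24 bytes transferred, so the transfers will dominate no matter how fast the kernel itself is.</p>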
<p>Secondly, I believe it is important, to begin with, to understand the intent of numba <code>vectorize</code> and <code>guvectorize</code>. The basic principle is to write the ufunc definition from the standpoint of "what will one worker do?" and then allow numba to spin up multiple workers from that. The way you instruct numba to spin up multiple workers is to pass a data set that is larger than the signature you have given. The important takeaway is that <em>numba does not know how to parallelize a for-loop inside the ufunc definition</em>. It gets its parallel "strength" by taking the ufunc definition and running it across parallel workers, where each worker handles a "slice" of the data, but runs the entire ufunc definition on that slice. As some additional reading, I have covered some of this ground <a href="https://stackoverflow.com/questions/49646182/numba-vectorize-for-cuda-what-is-the-correct-signature-to-return-arrays/49661183#49661183">here</a> as well.</p>
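<p>As a minimal toy illustration of that idea (my own example, not code from the question; it uses the default CPU target, since the worker/slice mapping works the same way regardless of target):</p>
<pre><code>import numpy as np
from numba import guvectorize

# The body below is what ONE worker executes, over ONE 1-D slice (one row).
# numba does not parallelize the loop inside; it maps one worker per row,
# because the inputs have one more dimension than the (n),(n)->(n) signature.
@guvectorize(['void(float64[:], float64[:], float64[:])'], '(n),(n)->(n)')
def row_mult(a, b, out):
    for i in range(a.shape[0]):
        out[i] = a[i] * b[i]

a = np.arange(12, dtype=np.float64).reshape(4, 3)  # 4 rows -> up to 4 workers
b = np.ones_like(a)
print(row_mult(a, b))
</code></pre>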
<p>Therefore, a problem we have in your realization is that you have written a signature (and ufunc) which maps the entire input data set to a single worker. As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even apart from the above statements about arithmetic intensity), but I suspect that 64 is actually just a numba minimum threadblock size, and that in fact only 1 thread in that threadblock is doing any useful computational work. That one thread is executing the entire ufunc, including all the for-loops, in a serial fashion.</p>
<p>That is obviously not what anyone would intend by rational use of <code>vectorize</code> or <code>guvectorize</code>.</p>
<p>So let's revisit what you are trying to do. Ultimately, your ufunc wants to multiply an input value from one array by an input value from another array, and store the result in a third array. We want to repeat that process many times. If all 3 array sizes were the same, we could actually realize this with <code>vectorize</code> and would not even have to resort to the more complicated <code>guvectorize</code>. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here is a worked example, where t14.py is your original code, run with the profiler, and t15.py is a <code>vectorize</code> version of it, acknowledging that we have changed the size of the <code>multBy</code> array to match <code>cv</code> and <code>discount</code>:</p>
<pre><code>$ nvprof print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
312.36ms 1.2160us - - - - - 8B 6.2742MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
312.81ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
313.52ms 5.8696ms - - - - - 15.259MB 2.5387GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
319.74ms 1.0880us - - - - - 8B 7.0123MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
319.93ms 896ns - - - - - 8B 8.5149MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
321.40ms 1.22538s (1 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s 7.1816ms - - - - - 15.259MB 2.0749GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer
@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
193.92ms 6.2729ms - - - - - 15.259MB 2.3755GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
201.09ms 5.7101ms - - - - - 15.259MB 2.6096GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
364.92ms 842.49us (15625 1 1) (128 1 1) 13 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms 7.1528ms - - - - - 15.259MB 2.0833GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
</code></pre>
<p>We see that your application reported a run time of about 1.244 seconds, whereas the vectorize version reports a run time of about 0.375 seconds. But both of these numbers include python overhead. If we look at the durations of the generated CUDA kernels in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds, whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the compute kernel time is now much smaller than the time it takes to transfer the 3 arrays to/from the GPU (about 20 milliseconds total), and that the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2,000,000 — exactly matching the total number of multiply operations to be done, and far more than the paltry 64 threads (and possibly, really only 1 thread) doing work in the original code.</p>
<p>Given the simplicity of the above <code>vectorize</code> approach, if what you really want to do is this element-wise multiplication, then you might consider simply replicating <code>multBy</code> so that it matches the other two arrays dimensionally, and be done with it.</p>
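<p>One possible (hypothetical, not from the question) way to do that replication on the host, using ordinary numpy broadcasting before calling the t15.py-style ufunc:</p>
<pre><code>import numpy as np

multBy   = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))

# expand multBy (4000,5) to match discount (100,4000,5); broadcast_to gives a view,
# ascontiguousarray materializes it so it can be copied to the device as-is
multBy3d = np.ascontiguousarray(np.broadcast_to(multBy, discount.shape))

# cv = cVestDiscount(multBy3d, discount)   # the vectorize ufunc from t15.py above
</code></pre>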
<p>But the question remains: how do I handle dissimilar input array sizes, as in the original problem? For that I think we need to go to <code>guvectorize</code> (or, as @talonmies indicated, write your own <code>@cuda.jit</code> kernel, which is probably the best advice overall, notwithstanding that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).</p>
<p>In order to tackle this with <code>guvectorize</code>, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your <code>guvectorize</code> kernel so that it only operates on a "slice" of the overall data, and then allow the <code>guvectorize</code> launch function to spin up multiple workers to tackle it, one worker per slice.</p>
<p>In CUDA, we like to have lots of workers; you really can't have too many. So this will affect how we "slice" the arrays, so as to give multiple workers the possibility to act. If we were to slice along the 3rd (last, <code>n</code>) dimension, we would only have 5 slices to work with, so a maximum of 5 workers. Likewise, if we slice along the first, or <code>countRow</code> dimension, we will have 100 slices, so a maximum of 100 workers. Ideally, we would slice along the 2nd, or <code>countCol</code> dimension. For simplicity, however, I will slice along the first, or <code>countRow</code> dimension. This is clearly non-optimal, but see the worked example further below for how you might handle the slice-along-the-2nd-dimension problem. Slicing by the first dimension means we will remove the first for-loop from the guvectorize kernel, and allow the ufunc system to parallelize along that dimension (based on the size of the arrays we pass). The code could look something like this:</p>
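<p>(This listing, t16.py, is the same ufunc as the CPU-only t16a.py shown at the end of this answer, with <code>target='cuda', nopython=True</code> in the <code>guvectorize</code> decorator — the same options used in t17.py below.)</p>
<pre><code>$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
</code></pre>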
<p>Observations:</p>
<ol>
<li><p>The code changes were related to removing the <code>countRow</code> parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also changed the 3-dimensional entries in the signature to two-dimensional ones. We are taking a two-dimensional "slice" of the 3-dimensional data, after all, and letting each worker work on one slice.</p></li>
<li><p>The kernel dimensions reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization there was really only 1 "slice" presented, and therefore 1 worker needed, and therefore 1 thread (but numba spun up 1 threadblock of 64 threads). In this realization there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads, to provide the needed 100 workers/threads.</p></li>
<li><p>The kernel performance reported by the profiler, 47.4ms, is now somewhere in between the original (~1.224s) and the massively parallel <code>vectorize</code> version (~0.001s). So going from 1 worker to 100 workers has sped things up considerably, but there are more performance gains possible. If you figure out how to slice on the <code>countCol</code> dimension, you can probably get closer to the <code>vectorize</code> version, performance-wise (see below). Note that the difference between where we are here (~47ms) and the vectorize version (~1ms) is more than enough to make up for the additional transfer cost (~5ms, or less) of moving the slightly larger <code>multBy</code> matrix to the device, to facilitate the <code>vectorize</code> simplicity.</p></li>
</ol>
<p>Some additional comments on the python timing: I believe the exact behavior of how python compiles the necessary kernels for the original, vectorize, and improved guvectorize versions is different. If we modify the t15.py code to run a "warm-up" run, then at least the python timing is consistent, trend-wise, with the overall wall time and the kernel-only timing:</p>
<pre><code>$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer
@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839
real 0m2.522s
user 0m1.572s
sys 0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091
real 0m1.050s
user 0m0.473s
sys 0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283
real 0m1.252s
user 0m0.680s
sys 0m0.441s
$
</code></pre>
<p>Responding now, effectively, to a question in the comments: "How would I recast the problem to slice along the 4000 (<code>countCol</code>, or "middle") dimension?"</p>
<p>We can be guided by what worked for slicing along the first dimension. One possible approach is to rearrange the shape of the arrays so that the 4000 dimension is the first dimension, then remove it, similar to what we did in the previous treatment of <code>guvectorize</code>. Here is a worked example:</p>
<pre><code>$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countCol, multBy, discount, n, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[num] * discount[ID][num]
countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
304.92ms 1.1840us - - - - - 8B 6.4437MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
305.36ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
306.08ms 6.0208ms - - - - - 15.259MB 2.4749GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
312.44ms 1.0880us - - - - - 8B 7.0123MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
313.59ms 8.9961ms (63 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms 7.2772ms - - - - - 15.259MB 2.0476GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
</code></pre>
<p>Somewhat predictably, we observe that the execution time has dropped from ~47ms when slicing into 100 workers to ~9ms when slicing into 4000 workers. Similarly, we observe that numba is choosing to spin up 63 blocks of 64 threads each, for a total of 4032 threads, to handle the 4000 workers needed for this "slicing".</p>
<p>Still not as fast as the ~1ms <code>vectorize</code> kernel (which has many more available parallel "slices" for workers), but quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall wall time of the python code is about 2x faster, even with all the python overhead.</p>
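<p>As a practical aside (my own addition, not part of the listings above): if your <code>discount</code> data already exists in the original (100,4000,5) layout, one way to get it into the (4000,100,5) layout that t17.py expects is an explicit transpose on the host before the call:</p>
<pre><code>import numpy as np

discount_orig = np.float64(np.arange(2000000).reshape(100, 4000, 5))

# move the 4000 (countCol) dimension to the front and make the result contiguous,
# so the host-to-device copy is a single straightforward transfer
discount_t = np.ascontiguousarray(discount_orig.transpose(1, 0, 2))  # shape (4000, 100, 5)
</code></pre>
<p>The transpose itself costs some host time and memory bandwidth, so whether it is worthwhile depends on how often the rearranged array gets reused.</p>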
<p>Finally, let's revisit my earlier statement (similar to statements made in the comments and the other answer):</p>
<blockquote>
<p>"I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing."</p>
</blockquote>
<p>We now have convenient test cases in t16.py or t17.py that we can use to test this. For simplicity I'll choose t16.py. We can "convert this code back to a CPU code" simply by removing the target designation from the <code>guvectorize</code> ufunc:</p>
<pre><code>$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741
real 0m0.528s
user 0m0.474s
sys 0m0.047s
$
</code></pre>
<p>So we see that this CPU-only version runs the function in about 6 milliseconds, and it has none of the GPU "overhead" such as CUDA initialization and copying data to/from the GPU. The overall wall time is also our best measurement, at about 0.5s, compared to about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, probably isn't well-suited to GPU computation.</p>
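<p>For completeness (a speculative variant I have not timed here): numba's <code>guvectorize</code> also accepts <code>target='parallel'</code>, which is one way to get the "well-written host code (using some parallelization method)" referred to in the quoted statement, by spreading the slices across CPU threads:</p>
<pre><code># hypothetical multi-threaded CPU variant of the t16a.py decorator
from numba import guvectorize

@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'],
             '(m,o),(m,o),(),() -> (m,o)', target='parallel')
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
</code></pre>
<p>Given the numbers above, though, the single-threaded CPU version is already fast enough that the data-transfer argument stands either way.</p>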