Numba and guvectorize for the CUDA target: code running slower than expected

2 answers

User

Answer 1 · edited 2024-09-28 22:34:54

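First of all, the basic operation you have shown is to take two matrices, transfer them to the GPU, do some elementwise multiplications to produce a third array, and pass that third array back to the host.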

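It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that runs faster than a naive serial python implementation, but I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing. This is because the arithmetic intensity per byte transferred between host and device is just too low. The operation is far too simple.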
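To put a rough number on "too low", here is a back-of-the-envelope sketch. It assumes the array sizes that appear in the profiler traces in this answer (a 100 x 4000 x 5 float64 array, a 4000 x 5 float64 array, and a result the size of the larger one), and it counts only the array payloads, not the scalar arguments.

```python
# Back-of-the-envelope arithmetic intensity for the elementwise multiply.
# Array sizes are assumptions taken from the test cases in this answer.
BYTES_PER_FLOAT64 = 8

n_discount = 100 * 4000 * 5      # elements in discount (and in the result cv)
n_multby = 4000 * 5              # elements in multBy

flops = n_discount               # one multiply per output element

bytes_to_device = (n_discount + n_multby) * BYTES_PER_FLOAT64
bytes_to_host = n_discount * BYTES_PER_FLOAT64
bytes_moved = bytes_to_device + bytes_to_host

intensity = flops / bytes_moved  # floating-point operations per byte transferred
print(f"{flops} multiplies for {bytes_moved / 1e6:.2f} MB moved")
print(f"arithmetic intensity: {intensity:.4f} FLOP/byte")
```

Well under one multiply per ten bytes moved means the bus, not the arithmetic units, sets the pace.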

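Secondly, it is important, I believe, to start with an understanding of what numba vectorize and guvectorize are intended to do. The basic principle is to write the ufunc definition from the standpoint of "what will one worker do?" and then let numba spin up multiple workers from that. The way you instruct numba to spin up multiple workers is to pass a data set that is larger than the signatures you have given. Note that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets its parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data but runs your entire ufunc definition on that slice. As some additional reading, I have also covered some of this ground elsewhere.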
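The "one worker" idea can be illustrated on the CPU without numba at all. The sketch below uses NumPy's `np.vectorize` purely as a stand-in for the ufunc machinery: it is a serial Python loop, not a parallel launcher, and the name `one_worker` is made up for this illustration.

```python
import numpy as np

def one_worker(a, b):
    # What a single worker does: one scalar multiply, no loops.
    return a * b

# CPU stand-in for the ufunc machinery: np.vectorize applies the scalar
# function to every element of the (broadcast) inputs.
elementwise = np.vectorize(one_worker, otypes=[np.float64])

discount = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
mult_by = np.full_like(discount, 2.0)

# The function is written for scalars, but the data is a 2 x 3 x 4 array,
# so the function body runs once per element: 24 units of work in total.
cv = elementwise(mult_by, discount)
print(cv.shape, cv.size)
```

The data being larger than the scalar signature is what creates the 24 independent units of work; with the CUDA target, each such unit would become a candidate for its own thread.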

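So one problem in your realization is that you have written a signature (and ufunc) that maps the entire input data set to a single worker. As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even apart from the above statements about arithmetic intensity), but I suspect that 64 is really just a numba minimum threadblock size, and in fact only 1 thread in that threadblock is doing any useful calculation work. That one thread is executing your entire ufunc, including all the for-loops, in a serial fashion.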

That is obviously not what anyone would intend for a rational use of vectorize or guvectorize.

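So let's revisit what you are trying to do. Ultimately your ufunc wants to multiply an input value from one array by an input value from another array and store the result in a third array. We want to repeat that process many times. If all 3 array sizes were the same, we could actually realize this with vectorize and would not even have to resort to the more complicated guvectorize. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here is a worked example, where t14.py is your original code, run with the profiler, and t15.py is a vectorize version of it, acknowledging that we have changed the size of your multBy array to match cv and discount: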

$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
312.36ms  1.2160us                    -               -         -         -         -        8B  6.2742MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.81ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.52ms  5.8696ms                    -               -         -         -         -  15.259MB  2.5387GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.74ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.93ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
321.40ms  1.22538s              (1 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s  7.1816ms                    -               -         -         -         -  15.259MB  2.0749GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
193.92ms  6.2729ms                    -               -         -         -         -  15.259MB  2.3755GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
201.09ms  5.7101ms                    -               -         -         -         -  15.259MB  2.6096GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
364.92ms  842.49us          (15625 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms  7.1528ms                    -               -         -         -         -  15.259MB  2.0833GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

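We see that your application reported a run time of about 1.244 seconds, whereas the vectorize version reports a run time of about 0.375 seconds. But there is python overhead in both of these numbers. If we look at the generated CUDA kernel duration in the profiler, the difference is even more stark. The original kernel took about 1.225 seconds, whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the computation kernel time is now much, much smaller than the time it takes to transfer the 3 arrays to/from the GPU (about 20 milliseconds in total). Finally, the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2000000, exactly matching the total number of multiply operations to be done, and substantially more than the paltry 64 threads (and possibly, really, only 1 thread) in action with your original code.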
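The two figures in that comparison can be checked with a few lines of arithmetic. This sketch only re-derives numbers from the two profiler traces above; nothing in it is measured anew.

```python
# Figures copied from the two profiler traces above.
grid_blocks, block_threads = 15625, 128      # vectorize kernel launch
elements = 100 * 4000 * 5                    # multiplications to perform

total_threads = grid_blocks * block_threads
print(total_threads, total_threads == elements)

original_kernel_s = 1.22538                  # guvectorize kernel, 1 block of 64
vectorize_kernel_s = 842.49e-6               # vectorize kernel
speedup = original_kernel_s / vectorize_kernel_s
print(f"kernel-only speedup: about {speedup:.0f}x")
```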

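Given the simplicity of the above vectorize approach, if what you really want to do is this element-wise multiplication, then you might consider just replicating multBy so that it dimensionally matches the other two arrays, and be done with it.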
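One way to do that replication on the host is sketched below with plain NumPy. It assumes the original shapes, (4000, 5) for multBy and (100, 4000, 5) for discount; the name `multBy_full` is made up for this illustration.

```python
import numpy as np

discount = np.arange(2000000, dtype=np.float64).reshape(100, 4000, 5)
multBy = np.arange(20000, dtype=np.float64).reshape(4000, 5)

# Replicate the (4000, 5) array along a new leading axis of length 100.
# broadcast_to returns a read-only view; ascontiguousarray makes a real copy
# with the same memory layout as discount.
multBy_full = np.ascontiguousarray(np.broadcast_to(multBy, discount.shape))

print(multBy_full.shape, multBy_full.nbytes)
```

`multBy_full` could then be passed to the vectorize ufunc together with discount. The price is that multBy now costs as much to transfer as discount does.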

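But the question remains: how do we handle dissimilar input array sizes, as in the original problem? For that, I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, which is probably the best advice, even though none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).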

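In order to tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's rewrite your guvectorize kernel so that it only operates on a "slice" of the overall data, and then let the guvectorize launch function spin up multiple workers to tackle it, one worker per slice.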

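In CUDA, we like to have lots of workers; you really can't have too many. So this will affect how we "slice" our arrays, so as to give multiple workers the possibility to act. If we were to slice along the 3rd (last, n) dimension, we would only have 5 slices to work with, so a maximum of 5 workers. Likewise, if we slice along the first, or countRow, dimension, we would have 100 slices, so a maximum of 100 workers. Ideally, we would slice along the 2nd, or countCol, dimension. However, for simplicity, I will slice along the first, or countRow, dimension. This is clearly non-optimal, but see below for a worked example of how you might approach slicing by the second dimension. Slicing by the first dimension means we will remove the first for-loop from our guvectorize kernel and allow the ufunc system to parallelize along that dimension (based on the sizes of the arrays we pass). The code could look like this: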

^{pr2}$

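Observations:

The code changes involved removing the countRow parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also changed the 3-dimensional arrays in the signature to 2-dimensional ones. We are taking a 2-dimensional "slice" of the 3-dimensional data, after all, and letting each worker work on one slice.
The kernel dimensions reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization there was really only 1 "slice" presented, and therefore 1 worker needed, and therefore 1 thread (but numba spun up 1 threadblock of 64 threads). In this realization there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads to provide the needed 100 workers/threads.
The kernel performance reported by the profiler, 47.4 ms, is now somewhere between the original (~1.224 s) and the massively parallel vectorize version (~0.001 s). So going from 1 to 100 workers has sped things up considerably, but there are more performance gains possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version, performance-wise (see below). Note that the difference between where we are here (~47 ms) and the vectorize version (~1 ms) is more than enough to make up for the additional transfer cost (~5 ms, or less) of transferring a slightly larger multBy matrix to the device, to facilitate the vectorize simplicity.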
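The relation between the slicing axis, the number of workers and the number of thread blocks can be sketched as simple arithmetic. This assumes numba keeps using 64-thread blocks for gufunc kernels, which is what these traces show but is an observation rather than a documented guarantee; `blocks_needed` is a helper made up for this illustration.

```python
import math

shape = (100, 4000, 5)        # (countRow, countCol, n) of discount
BLOCK_SIZE = 64               # threads per block seen in the gufunc traces

def blocks_needed(workers, block_size=BLOCK_SIZE):
    # Enough whole blocks to give every worker its own thread.
    return math.ceil(workers / block_size)

for axis, workers in enumerate(shape):
    blocks = blocks_needed(workers)
    print(f"slice along axis {axis}: {workers} workers, "
          f"{blocks} block(s), {blocks * BLOCK_SIZE} threads launched")
```

For the 100 slices used here this gives the 2 blocks reported by the profiler; the 4000-long axis would need far more blocks, which is the reason it is the better axis to slice on.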

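Some additional comments about python timing: I believe the exact behavior of how python compiles the necessary kernels differs between the original, the vectorize, and the improved guvectorize versions. If we modify the t15.py code to do a "warm-up" run first, then at least the python timing is consistent, trend-wise, with the overall wall time and with the kernel-only timing: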

$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839

real    0m2.522s
user    0m1.572s
sys     0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091

real    0m1.050s
user    0m0.473s
sys     0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283

real    0m1.252s
user    0m0.680s
sys     0m0.441s
$

And now, responding to a question in the comments, effectively: "How would I recast the problem to slice along the 4000 (countCol, or "middle") dimension?"

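We can be guided by what worked when slicing along the first dimension. One possible approach is to rearrange the shape of the arrays so that the 4000 dimension is the first dimension, and then remove it, similar to what we did in the previous treatment of guvectorize. Here is a worked example: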

$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


# Each worker receives one (100, 5) slice of discount and one (5,) row of multBy.
@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, cv):
        for ID in range(0,countRow):
            for num in range(0,n):
                cv[ID][num] = multBy[num] * discount[ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
304.92ms  1.1840us                    -               -         -         -         -        8B  6.4437MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
305.36ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
306.08ms  6.0208ms                    -               -         -         -         -  15.259MB  2.4749GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.44ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.59ms  8.9961ms             (63 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms  7.2772ms                    -               -         -         -         -  15.259MB  2.0476GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

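Somewhat predictably, we observe that the execution time has dropped from about 47 ms when we sliced into 100 workers to about 9 ms when we slice into 4000 workers. Similarly, we observe that numba is choosing to spin up 63 blocks of 64 threads each, for a total of 4032 threads, to handle the 4000 workers needed for this "slicing".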

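That is still not as fast as the ~1 ms vectorize kernel (which has many more available parallel "slices" for workers), but it is quite a bit faster than the ~1.2 s kernel proposed in the original question. And the overall run time of the python code is about 2x faster, even with all the python overhead.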
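Note that t17.py builds its test data directly in the (4000, 100, 5) shape. For data that already exists in the original (100, 4000, 5) layout, the axis would have to be moved explicitly. A minimal NumPy sketch of that rearrangement follows; the names `discount_t` and `cv_t` are made up for this illustration, and the multiply is done with host-side broadcasting only to provide a reference result.

```python
import numpy as np

discount = np.arange(2000000, dtype=np.float64).reshape(100, 4000, 5)
multBy = np.arange(20000, dtype=np.float64).reshape(4000, 5)

# Move the 4000-long axis to the front: (100, 4000, 5) -> (4000, 100, 5).
# swapaxes returns a strided view, so copy it into contiguous memory
# before handing it to a device kernel.
discount_t = np.ascontiguousarray(discount.swapaxes(0, 1))

# Reference result for the sliced layout: slice k of discount_t, shape (100, 5),
# is multiplied by row k of multBy, shape (5,).
cv_t = discount_t * multBy[:, np.newaxis, :]

# Swap the axes back to recover the original (100, 4000, 5) layout.
cv = cv_t.swapaxes(0, 1)
print(discount_t.shape, cv.shape)
```

The extra host-side copy is not free, so it belongs in any timing comparison against the versions that work on the original layout.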

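As a final observation, let's revisit the statement I made earlier (which is similar to statements made in the comments and in the other answer):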

"I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing."

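Now that we have convenient test cases in t16.py and t17.py, we can use them to test this. I have chosen t16.py for simplicity. We can "convert this back to a CPU code" simply by removing the target designation from the guvectorize ufunc: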

$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741

real    0m0.528s
user    0m0.474s
sys     0m0.047s
$

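So we see that this CPU-only version runs the function in about 6 milliseconds, and it has no GPU "overhead" such as CUDA initialization and the copying of data to/from the GPU. The overall wall time is also our best measurement, at about 0.5 s compared to about 1.0 s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transferred, probably is not well suited to GPU computation.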
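The conclusion can be made concrete by adding up the durations already reported above. This sketch only recombines numbers from the t15.py trace and the t16a.py run; keep in mind that the GPU figures are device-side profiler durations while the CPU figure is a Python wall-clock measurement, so the comparison is indicative rather than exact.

```python
# Durations copied from the t15.py trace (vectorize on GPU) and the t16a.py run.
htod_ms = 6.2729 + 5.7101     # two 15.259 MB host-to-device copies
kernel_ms = 0.84249           # vectorize kernel
dtoh_ms = 7.1528              # one 15.259 MB device-to-host copy
cpu_ms = 6.57796859741        # CPU guvectorize, timed in Python

transfer_ms = htod_ms + dtoh_ms
gpu_total_ms = transfer_ms + kernel_ms

print(f"GPU transfers: {transfer_ms:.2f} ms, kernel: {kernel_ms:.2f} ms")
print(f"kernel share of GPU time: {kernel_ms / gpu_total_ms:.1%}")
print(f"transfers alone vs CPU total: {transfer_ms / cpu_ms:.1f}x")
```

Even an infinitely fast kernel would leave the GPU path slower than the CPU path here, because the copies alone take longer than the whole CPU computation.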

User
Answer 2 · edited 2024-09-28 22:34:54

The reason why the gufunc that Numba emits and runs is so slow becomes immediately obvious when profiling (numba 0.38.1 with CUDA 8.0):
==24691== Profiling application: python slowvec.py
==24691== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
271.33ms  1.2800us                    -               -         -         -         -        8B  5.9605MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
271.65ms  14.591us                    -               -         -         -         -  156.25KB  10.213GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
272.09ms  2.5868ms                    -               -         -         -         -  15.259MB  5.7605GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
274.98ms     992ns                    -               -         -         -         -        8B  7.6909MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
275.17ms     640ns                    -               -         -         -         -        8B  11.921MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
276.33ms  657.28ms              (1 1 1)        (64 1 1)        40        0B        0B         -           -  GeForce GTX 970         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [38]
933.62ms  3.5128ms                    -               -         -         -         -  15.259MB  4.2419GB/s  GeForce GTX 970         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
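The final kernel launch which runs the code uses a single block of 64 threads. On a GPU which can theoretically have up to 2048 threads per MP, and 23 MP, that means that about 99.9% of the theoretical processing capacity of your GPU is not being used. This looks like a ridiculous design choice by the numba developers, and I would report it as a bug if you are hindered by it (and it seems you are).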
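The 99.9% figure follows directly from the numbers quoted in that paragraph, as this short calculation shows. It uses only the thread and multiprocessor counts stated above and makes no claim about any particular GPU model.

```python
# Figures quoted in the paragraph above.
threads_launched = 64
max_threads_per_mp = 2048
multiprocessors = 23

capacity = max_threads_per_mp * multiprocessors
used = threads_launched / capacity
print(f"resident-thread capacity: {capacity}")
print(f"used: {used:.3%}, idle: {1 - used:.3%}")
```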
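The obvious solution is to rewrite your function as a @cuda.jit function in the CUDA python kernel dialect, and explicitly control the execution parameters. That way you can at least ensure that the code will run with enough threads to potentially use all the capacity of your hardware. It is still a memory-bound operation, so the speed-up you can achieve may well be limited to much less than the ratio of the memory bandwidth of your GPU to that of your CPU. And that might not be enough to amortize the cost of host-to-device memory transfer, so there may be no performance gain even in the best possible case, and the current code is far from that.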
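What "explicitly controlling the execution parameters" means can be shown without a GPU. The sketch below is a serial, pure-Python emulation of a 1-D kernel launch, not numba code: `launch` and `multiply_kernel` are made-up names, and in a real @cuda.jit kernel the global index would come from the runtime (for example via `cuda.grid(1)`) instead of being passed in.

```python
import math

def launch(kernel, n_blocks, threads_per_block, *args):
    # Serial emulation of a 1-D launch: call the kernel body once per thread.
    for block in range(n_blocks):
        for thread in range(threads_per_block):
            kernel(block, thread, threads_per_block, *args)

def multiply_kernel(block, thread, threads_per_block, mult_by, discount, cv):
    i = block * threads_per_block + thread   # global thread index
    if i < len(cv):                          # guard: the last block may overhang
        cv[i] = mult_by[i] * discount[i]

n = 1000
discount = [float(i) for i in range(n)]
mult_by = [2.0] * n
cv = [0.0] * n

threads_per_block = 128                      # chosen explicitly by the caller
n_blocks = math.ceil(n / threads_per_block)  # 8 blocks -> 1024 threads for 1000 items
launch(multiply_kernel, n_blocks, threads_per_block, mult_by, discount, cv)
print(n_blocks, cv[:3], cv[-1])
```

The point of the pattern is that the caller, not the library, decides that there is one thread per element, so the launch size scales with the data rather than staying at a single block.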
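In short, beware the perils of automagic compiler-generated parallelism.
Postscript: I managed to work out how to get the PTX of the code emitted by numba, and suffice it to say that it is absolute crap (and I can't really post all of it):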
^{pr2}$
All of those integer operations to perform exactly one double-precision multiply!
