<p>First, the basic operation you have shown is to take two matrices, transfer them to the GPU, do some element-wise multiplications to produce a third array, and then pass that third array back to the host.</p>
<p>It may be possible to make a numba/cuda guvectorize (or <code>cuda.jit</code> kernel) implementation that runs faster than a naive serial python implementation, but I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing. This is because the arithmetic intensity per byte of data transferred between host and device is simply too low. This operation is too simple.</p>
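<p>To put a rough number on that (my own back-of-envelope estimate, using the sizes visible in the profiler traces below): the operation performs 2,000,000 double-precision multiplies, while roughly 30-45&nbsp;MB of data has to cross the PCIe bus (the arrays involved are ~15.3&nbsp;MB each, with two or three of them moving per run). That is on the order of one floating-point operation per 16-24 bytes transferred, so the transfers will dominate no matter how fast the kernel itself is.</p>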
<p>Secondly, I believe it is important, to begin with, to understand the intent of numba <code>vectorize</code> and <code>guvectorize</code>. The basic principle is to write the ufunc definition from the standpoint of "what will one worker do?" and then allow numba to spin up multiple workers from that. The way you instruct numba to spin up multiple workers is to pass a data set that is larger than the signature you have given. The important takeaway is that <em>numba does not know how to parallelize a for-loop inside the ufunc definition</em>. It gets its parallel "strength" by taking the ufunc definition and running it across parallel workers, where each worker handles a "slice" of the data, but runs the entire ufunc definition on that slice. As some additional reading, I have covered some of this ground <a href="https://stackoverflow.com/questions/49646182/numba-vectorize-for-cuda-what-is-the-correct-signature-to-return-arrays/49661183#49661183">here</a> as well.</p>
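<p>As a minimal toy illustration of that idea (my own example, not code from the question; it uses the default CPU target, since the worker/slice mapping works the same way regardless of target):</p>
<pre><code>import numpy as np
from numba import guvectorize

# The body below is what ONE worker executes, over ONE 1-D slice (one row).
# numba does not parallelize the loop inside; it maps one worker per row,
# because the inputs have one more dimension than the (n),(n)->(n) signature.
@guvectorize(['void(float64[:], float64[:], float64[:])'], '(n),(n)->(n)')
def row_mult(a, b, out):
    for i in range(a.shape[0]):
        out[i] = a[i] * b[i]

a = np.arange(12, dtype=np.float64).reshape(4, 3)  # 4 rows -> up to 4 workers
b = np.ones_like(a)
print(row_mult(a, b))
</code></pre>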
<p>Therefore, a problem we have in your realization is that you have written a signature (and ufunc) which maps the entire input data set to a single worker. As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even apart from the above statements about arithmetic intensity), but I suspect that 64 is actually just a numba minimum threadblock size, and that in fact only 1 thread in that threadblock is doing any useful computational work. That one thread is executing the entire ufunc, including all the for-loops, in a serial fashion.</p>
<p>That is obviously not what anyone would intend by rational use of <code>vectorize</code> or <code>guvectorize</code>.</p>
<p>So let's revisit what you are trying to do. Ultimately, your ufunc wants to multiply an input value from one array by an input value from another array, and store the result in a third array. We want to repeat that process many times. If all 3 array sizes were the same, we could actually realize this with <code>vectorize</code> and would not even have to resort to the more complicated <code>guvectorize</code>. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here is a worked example, where t14.py is your original code, run with the profiler, and t15.py is a <code>vectorize</code> version of it, acknowledging that we have changed the size of the <code>multBy</code> array to match <code>cv</code> and <code>discount</code>:</p>
<pre><code>$ nvprof print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
312.36ms 1.2160us - - - - - 8B 6.2742MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
312.81ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
313.52ms 5.8696ms - - - - - 15.259MB 2.5387GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
319.74ms 1.0880us - - - - - 8B 7.0123MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
319.93ms 896ns - - - - - 8B 8.5149MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
321.40ms 1.22538s (1 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s 7.1816ms - - - - - 15.259MB 2.0749GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer
@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
193.92ms 6.2729ms - - - - - 15.259MB 2.3755GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
201.09ms 5.7101ms - - - - - 15.259MB 2.6096GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
364.92ms 842.49us (15625 1 1) (128 1 1) 13 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms 7.1528ms - - - - - 15.259MB 2.0833GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
</code></pre>
<p>We see that your application reported a run time of about 1.244 seconds, whereas the vectorize version reports a run time of about 0.375 seconds. But both of these numbers include python overhead. If we look at the durations of the generated CUDA kernels in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds, whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the compute kernel time is now much smaller than the time it takes to transfer the 3 arrays to/from the GPU (about 20 milliseconds total), and that the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2,000,000 — exactly matching the total number of multiply operations to be done, and far more than the paltry 64 threads (and possibly, really only 1 thread) doing work in the original code.</p>
<p>Given the simplicity of the above <code>vectorize</code> approach, if what you really want to do is this element-wise multiplication, then you might consider simply replicating <code>multBy</code> so that it matches the other two arrays dimensionally, and be done with it.</p>
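<p>One possible (hypothetical, not from the question) way to do that replication on the host, using ordinary numpy broadcasting before calling the t15.py-style ufunc:</p>
<pre><code>import numpy as np

multBy   = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))

# expand multBy (4000,5) to match discount (100,4000,5); broadcast_to gives a view,
# ascontiguousarray materializes it so it can be copied to the device as-is
multBy3d = np.ascontiguousarray(np.broadcast_to(multBy, discount.shape))

# cv = cVestDiscount(multBy3d, discount)   # the vectorize ufunc from t15.py above
</code></pre>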
<p>But the question remains: how do I handle dissimilar input array sizes, as in the original problem? For that I think we need to go to <code>guvectorize</code> (or, as @talonmies indicated, write your own <code>@cuda.jit</code> kernel, which is probably the best advice overall, notwithstanding that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).</p>
<p>In order to tackle this with <code>guvectorize</code>, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your <code>guvectorize</code> kernel so that it only operates on a "slice" of the overall data, and then allow the <code>guvectorize</code> launch function to spin up multiple workers to tackle it, one worker per slice.</p>
<p>In CUDA, we like to have lots of workers; you really can't have too many. So this will affect how we "slice" the arrays, so as to give multiple workers the possibility to act. If we were to slice along the 3rd (last, <code>n</code>) dimension, we would only have 5 slices to work with, so a maximum of 5 workers. Likewise, if we slice along the first, or <code>countRow</code> dimension, we will have 100 slices, so a maximum of 100 workers. Ideally, we would slice along the 2nd, or <code>countCol</code> dimension. For simplicity, however, I will slice along the first, or <code>countRow</code> dimension. This is clearly non-optimal, but see the worked example further below for how you might handle the slice-along-the-2nd-dimension problem. Slicing by the first dimension means we will remove the first for-loop from the guvectorize kernel, and allow the ufunc system to parallelize along that dimension (based on the size of the arrays we pass). The code could look something like this:</p>
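<p>(This listing, t16.py, is the same ufunc as the CPU-only t16a.py shown at the end of this answer, with <code>target='cuda', nopython=True</code> in the <code>guvectorize</code> decorator — the same options used in t17.py below.)</p>
<pre><code>$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
</code></pre>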
<p>Observations:</p>
<ol>
<li><p>The code changes were related to removing the <code>countRow</code> parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also changed the 3-dimensional entries in the signature to two-dimensional ones. We are taking a two-dimensional "slice" of the 3-dimensional data, after all, and letting each worker work on one slice.</p></li>
<li><p>The kernel dimensions reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization there was really only 1 "slice" presented, and therefore 1 worker needed, and therefore 1 thread (but numba spun up 1 threadblock of 64 threads). In this realization there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads, to provide the needed 100 workers/threads.</p></li>
<li><p>The kernel performance reported by the profiler, 47.4ms, is now somewhere in between the original (~1.224s) and the massively parallel <code>vectorize</code> version (~0.001s). So going from 1 worker to 100 workers has sped things up considerably, but there are more performance gains possible. If you figure out how to slice on the <code>countCol</code> dimension, you can probably get closer to the <code>vectorize</code> version, performance-wise (see below). Note that the difference between where we are here (~47ms) and the vectorize version (~1ms) is more than enough to make up for the additional transfer cost (~5ms, or less) of moving the slightly larger <code>multBy</code> matrix to the device, to facilitate the <code>vectorize</code> simplicity.</p></li>
</ol>
<p>Some additional comments on the python timing: I believe the exact behavior of how python compiles the necessary kernels for the original, vectorize, and improved guvectorize versions is different. If we modify the t15.py code to run a "warm-up" run, then at least the python timing is consistent, trend-wise, with the overall wall time and the kernel-only timing:</p>
<pre><code>$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer
@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839
real 0m2.522s
user 0m1.572s
sys 0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091
real 0m1.050s
user 0m0.473s
sys 0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283
real 0m1.252s
user 0m0.680s
sys 0m0.441s
$
</code></pre>
<p>Responding now, effectively, to a question in the comments: "How would I recast the problem to slice along the 4000 (<code>countCol</code>, or "middle") dimension?"</p>
<p>We can be guided by what worked for slicing along the first dimension. One possible approach is to rearrange the shape of the arrays so that the 4000 dimension is the first dimension, then remove it, similar to what we did in the previous treatment of <code>guvectorize</code>. Here is a worked example:</p>
<pre><code>$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countCol, multBy, discount, n, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[num] * discount[ID][num]
countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
304.92ms 1.1840us - - - - - 8B 6.4437MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
305.36ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
306.08ms 6.0208ms - - - - - 15.259MB 2.4749GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
312.44ms 1.0880us - - - - - 8B 7.0123MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
313.59ms 8.9961ms (63 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms 7.2772ms - - - - - 15.259MB 2.0476GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
</code></pre>
<p>Somewhat predictably, we observe that the execution time has dropped from ~47ms when slicing into 100 workers to ~9ms when slicing into 4000 workers. Similarly, we observe that numba is choosing to spin up 63 blocks of 64 threads each, for a total of 4032 threads, to handle the 4000 workers needed for this "slicing".</p>
<p>Still not as fast as the ~1ms <code>vectorize</code> kernel (which has many more available parallel "slices" for workers), but quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall wall time of the python code is about 2x faster, even with all the python overhead.</p>
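<p>As a practical aside (my own addition, not part of the listings above): if your <code>discount</code> data already exists in the original (100,4000,5) layout, one way to get it into the (4000,100,5) layout that t17.py expects is an explicit transpose on the host before the call:</p>
<pre><code>import numpy as np

discount_orig = np.float64(np.arange(2000000).reshape(100, 4000, 5))

# move the 4000 (countCol) dimension to the front and make the result contiguous,
# so the host-to-device copy is a single straightforward transfer
discount_t = np.ascontiguousarray(discount_orig.transpose(1, 0, 2))  # shape (4000, 100, 5)
</code></pre>
<p>The transpose itself costs some host time and memory bandwidth, so whether it is worthwhile depends on how often the rearranged array gets reused.</p>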
<p>Finally, let's revisit my earlier statement (similar to statements made in the comments and the other answer):</p>
<blockquote>
<p>"I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing."</p>
</blockquote>
<p>We now have convenient test cases in t16.py or t17.py that we can use to test this. For simplicity I'll choose t16.py. We can "convert this code back to a CPU code" simply by removing the target designation from the <code>guvectorize</code> ufunc:</p>
<pre><code>$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741
real 0m0.528s
user 0m0.474s
sys 0m0.047s
$
</code></pre>
<p>So we see that this CPU-only version runs the function in about 6 milliseconds, and it has none of the GPU "overhead" such as CUDA initialization and copying data to/from the GPU. The overall wall time is also our best measurement, at about 0.5s, compared to about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, probably isn't well-suited to GPU computation.</p>
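<p>For completeness (a speculative variant I have not timed here): numba's <code>guvectorize</code> also accepts <code>target='parallel'</code>, which is one way to get the "well-written host code (using some parallelization method)" referred to in the quoted statement, by spreading the slices across CPU threads:</p>
<pre><code># hypothetical multi-threaded CPU variant of the t16a.py decorator
from numba import guvectorize

@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'],
             '(m,o),(m,o),(),() -> (m,o)', target='parallel')
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
</code></pre>
<p>Given the numbers above, though, the single-threaded CPU version is already fast enough that the data-transfer argument stands either way.</p>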