  • Large datasets (10 million x 5), (200 x 10 million x 5)
  • Mostly embarrassingly parallel
  • Takes longer after each run
  • Using Spyder3
  • Windows 10


File "C:\ProgramData\Anaconda3\lib\site-packages\numba\cuda\", line 82, in jitwrapper debug=debug)

TypeError: __init__() got an unexpected keyword argument 'debug'


os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"




Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at:

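I realized that I am using SLI-enabled graphics cards (both cards are identical, EVGA GTX 1080 Ti, and have the same device ID), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try to restrict execution to one card, but the result was the same.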




Edit: here is a working example of the code:

import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer

@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0,countRow):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing = timer() - func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))


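It is possible to craft a numba/cuda guvectorize (or cuda.jit kernel) realization that runs faster than a naive serial Python realization, but I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) doing the same thing. This is because the arithmetic intensity per byte transferred between host and device is just too low. The operation is far too simple.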
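To put a rough number on "arithmetic intensity" for this case, the following back-of-the-envelope sketch counts the bytes that have to cross the bus against the multiplications performed, using the array shapes from the question. The variable names are illustrative only, and it assumes both inputs are copied to the device in full and the whole result is copied back.

```python
# Rough arithmetic-intensity estimate for cv = multBy * discount
# with the shapes from the question (float64 = 8 bytes per element).
BYTES_PER_F64 = 8

discount_elems = 100 * 4000 * 5      # discount, shape (100, 4000, 5)
mult_by_elems = 4000 * 5             # multBy, shape (4000, 5)
cv_elems = discount_elems            # the result has the same shape as discount

# Host -> device copies of both inputs, device -> host copy of the result.
bytes_moved = BYTES_PER_F64 * (discount_elems + mult_by_elems + cv_elems)

# One multiplication per output element.
flops = cv_elems

intensity = flops / bytes_moved      # floating-point operations per byte transferred
print(f"bytes moved: {bytes_moved:,}")
print(f"flops:       {flops:,}")
print(f"intensity:   {intensity:.4f} flop/byte")
```

At well under one operation per byte moved, the copies are likely to cost more than the multiplication itself, whatever the kernel looks like.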

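Secondly, I believe it is important to start with an understanding of what numba vectorize and guvectorize are intended to do. The basic principle is to write the ufunc definition from the standpoint of "what will one worker do?" and then let numba spin up multiple workers from that. The way you instruct numba to spin up multiple workers is to pass a data set that is larger than the signature you have given. Note that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets its parallel "strength" by taking your ufunc definition and running it across parallel workers, where each worker handles a "slice" of the data but runs the entire ufunc definition on that slice. As additional reading, I have covered some of this ground elsewhere as well.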
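The "one worker per slice" idea can be tried on the CPU with plain NumPy, without a GPU. The sketch below uses np.vectorize with a signature as a stand-in for guvectorize. That is an assumption made purely for illustration: it mimics the calling convention, not how numba schedules CUDA threads. The core function is written for one 2-D slice, and it gets called once per slice when the input has an extra leading dimension. The names scale_slice, worker and calls are made up for this example.

```python
import numpy as np

calls = []  # records the shape of each slice a "worker" receives

def scale_slice(mult_by, discount):
    """Core function: written from the point of view of ONE worker on ONE 2-D slice."""
    calls.append(discount.shape)
    return mult_by * discount

# Layout string in the same style as guvectorize: two (m, o) inputs, one (m, o) output.
worker = np.vectorize(scale_slice, signature="(m,o),(m,o)->(m,o)")

mult_by = np.arange(20, dtype=np.float64).reshape(4, 5)

# Case 1: the data exactly matches the signature, so there is only one slice to process.
one_slice = np.ones((4, 5))
worker(mult_by, one_slice)
calls_for_one_slice = len(calls)

# Case 2: an extra leading dimension of 3, so the core function runs once per slice.
calls.clear()
three_slices = np.ones((3, 4, 5))
result = worker(mult_by, three_slices)
calls_for_three_slices = len(calls)

print(calls_for_one_slice, calls_for_three_slices, result.shape)
```

The loops inside the core function are never split up; only the number of slices handed in changes how many times it runs. That is why a data set that exactly matches the signature ends up with a single worker.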




$ nvprof --print-gpu-trace python
==4145== NVPROF is profiling process 4145, command: python
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python
==4145== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
312.36ms  1.2160us                    -               -         -         -         -        8B  6.2742MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.81ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.52ms  5.8696ms                    -               -         -         -         -  15.259MB  2.5387GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.74ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.93ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
321.40ms  1.22538s              (1 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s  7.1816ms                    -               -         -         -         -  15.259MB  2.0749GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer

@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing = timer() - func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python
==4167== NVPROF is profiling process 4167, command: python
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python
==4167== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
193.92ms  6.2729ms                    -               -         -         -         -  15.259MB  2.3755GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
201.09ms  5.7101ms                    -               -         -         -         -  15.259MB  2.6096GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
364.92ms  842.49us          (15625 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms  7.1528ms                    -               -         -         -         -  15.259MB  2.0833GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
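To see how the GPU activity in the vectorize trace above splits between copying and computing, the durations can simply be added up. This is only a sketch of that arithmetic; the numbers are copied from the trace, and the variable names are made up for the example.

```python
# Durations copied from the vectorize trace above, in milliseconds.
htod_ms = [6.2729, 5.7101]   # two host-to-device copies of the inputs
kernel_ms = 0.84249          # the multiply kernel (842.49 us)
dtoh_ms = [7.1528]           # device-to-host copy of the result

transfer_ms = sum(htod_ms) + sum(dtoh_ms)
gpu_total_ms = transfer_ms + kernel_ms
transfer_share = transfer_ms / gpu_total_ms

print(f"transfers: {transfer_ms:.2f} ms, kernel: {kernel_ms:.2f} ms")
print(f"share of GPU activity spent copying: {transfer_share:.1%}")
```

The kernel is a small fraction of the traced GPU activity, which is consistent with the low arithmetic intensity of the operation.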








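  1. The code changes consisted of removing the countCol parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also changed the 3-dimensional arrays in the signature to 2-dimensional. After all, we are taking a 2-dimensional "slice" of the 3-dimensional data and letting each worker work on one slice.

  2. The kernel dimensions reported by the profiler are now 2 blocks instead of 1. This makes sense: in the original realization there was really only 1 "slice" presented, so 1 worker was needed, and therefore 1 thread (although numba spun up 1 threadblock of 64 threads). In this realization there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads to provide the 100 workers/threads needed.

  3. The kernel time of 47.4 ms reported by the profiler now sits between the original version (~1.224 s) and the massively parallel vectorize version (~0.001 s). So going from 1 to 100 workers has sped things up considerably, but more performance gains are possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the performance of the vectorize version (see below). Note that the difference between where we are now (~47 ms) and the vectorize version (~1 ms) is more than enough to cover the additional cost (~5 ms or less) of transferring the slightly larger multBy matrix to the device, which is what makes the simple vectorize approach possible.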
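The grid sizes that appear in the profiler traces are consistent with one thread per worker, rounded up to whole thread blocks. A small sketch of that arithmetic follows. The helper blocks_needed is a made-up name, and the block sizes (64 for the gufunc kernels, 128 for the vectorize kernel) are simply read off the nvprof output rather than taken from the numba documentation.

```python
import math

def blocks_needed(workers, threads_per_block):
    """Number of thread blocks required to give every worker its own thread."""
    return math.ceil(workers / threads_per_block)

# (description, number of workers, threads per block as seen in the traces)
cases = [
    ("original gufunc: 1 slice", 1, 64),
    ("gufunc with 100 slices", 100, 64),
    ("gufunc with 4000 slices", 4000, 64),
    ("vectorize: one worker per element", 100 * 4000 * 5, 128),
]

for name, workers, block in cases:
    print(f"{name}: {blocks_needed(workers, block)} block(s) of {block} threads")
```

These values line up with the (1 1 1), (63 1 1) and (15625 1 1) grid sizes in the traces, and with the 2 blocks described in point 2.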


$ cat
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer

@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing = timer() - func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python
Function: discount factor cumVest duration (seconds):1.24376320839

real    0m2.522s
user    0m1.572s
sys     0m0.809s
$ time python
Function: discount factor cumVest duration (seconds):0.0228319168091

real    0m1.050s
user    0m0.473s
sys     0m0.445s
$ time python
Function: discount factor cumVest duration (seconds):0.0665760040283

real    0m1.252s
user    0m0.680s
sys     0m0.441s



$ cat
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer

@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
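def cVestDiscount (countRow, multBy, discount, n, cv):
    # Each worker receives one (m, o) = (100, 5) slice of discount and one (o,) row of multBy,
    # so the outer loop runs over the 100 rows of that slice (hence countRow is passed in below).
    for row in range(0,countRow):
        for num in range(0,n):
            cv[row][num] = multBy[num] * discount[row][num]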

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing = timer() - func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python
==8544== NVPROF is profiling process 8544, command: python
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python
==8544== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
304.92ms  1.1840us                    -               -         -         -         -        8B  6.4437MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
305.36ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
306.08ms  6.0208ms                    -               -         -         -         -  15.259MB  2.4749GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.44ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.59ms  8.9961ms             (63 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms  7.2772ms                    -               -         -         -         -  15.259MB  2.0476GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy




"I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing."


$ cat
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer

@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing = timer() - func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python
Function: discount factor cumVest duration (seconds):0.00657796859741

real    0m0.528s
user    0m0.474s
sys     0m0.047s
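For comparison, the same computation can also be expressed on the host with plain NumPy broadcasting, with no numba involved. This is a sketch of an alternative baseline, not something that was measured here, so no timing is claimed for it. The array cv_loop is just a reference built with an explicit loop to check the result, and the shapes are shrunk so that the loop finishes quickly.

```python
import numpy as np

# Small shapes so the explicit reference loop finishes quickly;
# the question uses (4000, 5) and (100, 4000, 5).
rows, cols, n = 3, 4, 5
mult_by = np.arange(cols * n, dtype=np.float64).reshape(cols, n)
discount = np.arange(rows * cols * n, dtype=np.float64).reshape(rows, cols, n)

# Broadcasting: the (cols, n) array is reused for every leading index.
cv = mult_by * discount

# Reference result using the same triple loop as the original kernel.
cv_loop = np.zeros_like(discount)
for r in range(rows):
    for c in range(cols):
        for k in range(n):
            cv_loop[r, c, k] = mult_by[c, k] * discount[r, c, k]

print(np.array_equal(cv, cv_loop))
```

Because multBy is broadcast rather than expanded, there is no need to build a full-size copy of it with np.full_like, as the GPU vectorize version does.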


The reason the gufunc that Numba emits runs so slowly becomes immediately obvious when profiling (Numba 0.38.1 with CUDA 8.0):

==24691== Profiling application: python
==24691== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
271.33ms  1.2800us                    -               -         -         -         -        8B  5.9605MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
271.65ms  14.591us                    -               -         -         -         -  156.25KB  10.213GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
272.09ms  2.5868ms                    -               -         -         -         -  15.259MB  5.7605GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
274.98ms     992ns                    -               -         -         -         -        8B  7.6909MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
275.17ms     640ns                    -               -         -         -         -        8B  11.921MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
276.33ms  657.28ms              (1 1 1)        (64 1 1)        40        0B        0B         -           -  GeForce GTX 970         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [38]
933.62ms  3.5128ms                    -               -         -         -         -  15.259MB  4.2419GB/s  GeForce GTX 970         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.







