Why does setting values on the GPU slow down computation?

import torch

A = torch.rand(600, 600, device='cuda:0')
row0 = torch.tensor(100, device='cuda:0')
col0 = torch.tensor(100, device='cuda:0')
row1 = torch.tensor(356, device='cuda:0')
col1 = torch.tensor(356, device='cuda:0')
B = torch.rand(256, 256, device='cuda:0')
a = 10

%timeit B[:] = A[row0:row1, col0:col1]
# 395 µs ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit a*A + a**2
# 17 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

A = torch.rand(600, 600, device='cuda:0')
row0 = 100
col0 = 100
row1 = 356
col1 = 356
B = torch.rand(256, 256, device='cuda:0')
a1 = torch.as_tensor(a).cuda()

%timeit B[:] = A[row0:row1, col0:col1]
# 10.6 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit a1*A + a1**2
# 30.2 µs ± 584 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

1 Answer

User

#1 · Posted 2024-06-30 15:46:04

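Your code is slow because there is nothing to parallelize, so all you do is pay unnecessary GPU overhead.

GPU parallelism works by launching a large number of threads that compute chunks of an operation at the same time. Operations such as matrix multiplication and convolution are very GPU-friendly because they can be broken down into many similar, smaller operations.

However, performing operations on the GPU also carries overhead.

We only observe a speedup when enough threads are launched to outweigh the CUDA overhead. Let's look at an example: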

import torch

device = torch.device('cuda:0')

A = torch.randn(5, 10, device=device)
B = torch.randn(10, 5, device=device)
A_ = torch.randn(5, 10)
B_ = torch.randn(10, 5)

%timeit A @ B
# 10.5 µs ± 745 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit A_ @ B_
# 5.21 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

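You might think this defies common sense: how can a CPU matrix multiplication be faster than the GPU one? It is simply because the operation is not yet large enough to benefit from parallelization. Let's retry the same operation on larger inputs: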

A = torch.randn(100, 200, device=device)
B = torch.randn(200, 100, device=device)
A_ = torch.randn(100, 200)
B_ = torch.randn(200, 100)

%timeit A @ B
# 10.4 µs ± 333 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit A_ @ B_
# 45.3 µs ± 647 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

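We scaled every dimension of the inputs by a factor of 20. The GPU still shows essentially the same time, which is almost entirely overhead, while the CPU time increased substantially. Because the inputs are larger, GPU parallelism can now work its magic.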
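The trade-off described here can be sketched as a toy cost model in which each device pays a fixed per-call overhead plus a term proportional to the amount of work. Every constant below is an illustrative assumption, loosely chosen to resemble the timings in this answer rather than measured, and the function names are hypothetical.

```python
# Toy cost model. All constants are illustrative assumptions, not measurements.
GPU_OVERHEAD_US = 10.0         # assumed fixed cost of one GPU operation
CPU_OVERHEAD_US = 5.0          # assumed fixed cost of one CPU operation
GPU_MACS_PER_US = 5_000_000.0  # assumed GPU throughput (multiply-adds per µs)
CPU_MACS_PER_US = 50_000.0     # assumed CPU throughput (multiply-adds per µs)


def matmul_work(m, k, n):
    """Multiply-adds needed for an (m x k) @ (k x n) product."""
    return m * k * n


def gpu_time_us(work):
    """Modelled GPU time: fixed overhead plus work divided by throughput."""
    return GPU_OVERHEAD_US + work / GPU_MACS_PER_US


def cpu_time_us(work):
    """Modelled CPU time: fixed overhead plus work divided by throughput."""
    return CPU_OVERHEAD_US + work / CPU_MACS_PER_US


def break_even_work():
    """Work at which both modelled times are equal (solves the linear equation)."""
    overhead_gap = GPU_OVERHEAD_US - CPU_OVERHEAD_US
    return overhead_gap / (1.0 / CPU_MACS_PER_US - 1.0 / GPU_MACS_PER_US)


small = matmul_work(5, 10, 5)        # the first example's shapes
large = matmul_work(100, 200, 100)   # the second example's shapes

for name, work in (("small", small), ("large", large)):
    print(name, round(cpu_time_us(work), 2), round(gpu_time_us(work), 2))
```

In this model the GPU only wins once the work exceeds `break_even_work()`. Below that point the fixed overhead dominates, which is the situation the small example illustrates.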

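In your case, you are not parallelizing anything at all. You are just slicing a tensor with GPU scalars as the bounds, which incurs overhead without any benefit. The other operation is similar: there is nothing to parallelize.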

%timeit a**2
# 200 ns ± 11.9 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a1**2
# 16.8 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

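The operation a1 ** 2 cannot be broken down into smaller repeatable chunks. Knowing when and when not to use the GPU is very important. This can also serve as a useful starting point for understanding how CUDA works under the hood.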
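To decide case by case, it helps to time both variants yourself. CUDA kernels are launched asynchronously with respect to the host, so a timing loop should synchronize before the clock is stopped. The helper below is a minimal sketch under that assumption: `bench` and its parameters are hypothetical names, and it uses only the standard library so it also runs on a machine without a GPU.

```python
import time


def bench(fn, sync=None, warmup=3, repeats=100):
    """Return the mean wall-clock time of fn() in microseconds.

    sync is an optional zero-argument callable invoked right before the
    clock starts and right before it stops. For CUDA work one would pass
    torch.cuda.synchronize so that queued kernels are included.
    """
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1e6


# CPU-only usage, so the sketch runs without PyTorch or a GPU.
cpu_us = bench(lambda: sum(i * i for i in range(1000)))
print(f"{cpu_us:.1f} µs per call")
```

With PyTorch installed, `bench(lambda: A @ B, sync=torch.cuda.synchronize)` and `bench(lambda: A_ @ B_)` would compare the two devices on your own shapes.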
