
Posted 2024-06-30 15:46:04



import torch

A = torch.rand(600, 600, device='cuda:0')
# Case 1: slice bounds are 0-d CUDA tensors
row0 = torch.tensor(100, device='cuda:0')
col0 = torch.tensor(100, device='cuda:0')
row1 = torch.tensor(356, device='cuda:0')
col1 = torch.tensor(356, device='cuda:0')
B = torch.rand(256, 256, device='cuda:0')
a = 10  # plain Python int

%timeit B[:] = A[row0:row1, col0:col1]
# 395 µs ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a*A + a**2
# 17 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

A = torch.rand(600, 600, device='cuda:0')
# Case 2: slice bounds are plain Python ints
row0 = 100
col0 = 100
row1 = 356
col1 = 356
B = torch.rand(256, 256, device='cuda:0')
a1 = torch.as_tensor(a).cuda()  # the same scalar as a 0-d CUDA tensor

%timeit B[:] = A[row0:row1, col0:col1]
# 10.6 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a1*A + a1**2
# 30.2 µs ± 584 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Answer #1 · Posted 2024-06-30 15:46:04





import torch

device = torch.device('cuda:0')

A = torch.randn(5, 10, device=device)
B = torch.randn(10, 5, device=device)
A_ = torch.randn(5, 10)
B_ = torch.randn(10, 5)

%timeit A @ B
# 10.5 µs ± 745 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit A_ @ B_
# 5.21 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


A = torch.randn(100, 200, device=device)
B = torch.randn(200, 100, device=device)
A_ = torch.randn(100, 200)
B_ = torch.randn(200, 100)

%timeit A @ B
# 10.4 µs ± 333 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit A_ @ B_
# 45.3 µs ± 647 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)



%timeit a**2
# 200 ns ± 11.9 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a1**2
# 16.8 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

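The operation a1 ** 2 works on a single scalar, so it cannot be broken down into smaller repeatable chunks for the GPU to process in parallel. It is very important to know when to use the GPU and when not to. This can also serve as a useful starting point for understanding how CUDA works under the hood.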
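One caveat when reading numbers like the ones above: CUDA kernels are launched asynchronously, so a host-side timer can stop before the GPU has actually finished. Below is a minimal sketch of a timing helper that calls a synchronization hook before reading the clock. The name `time_op` and its parameters are hypothetical and not part of the original post; the helper itself uses only the standard library.

```python
import time


def time_op(fn, sync=None, warmup=3, repeats=100):
    """Return the mean wall-clock seconds per call of fn().

    sync is an optional zero-argument callable that blocks until all
    queued device work has finished (hypothetical parameter name).
    """
    # Warm-up calls so one-time setup costs are not measured.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work is done before starting the clock

    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()  # wait for queued work before stopping the clock
    elapsed = time.perf_counter() - start
    return elapsed / repeats


if __name__ == "__main__":
    # CPU-only demonstration; no synchronization hook is needed here.
    mean_s = time_op(lambda: 10 ** 2)
    print(f"{mean_s * 1e9:.1f} ns per call")
```

With PyTorch on a CUDA device, one would pass `sync=torch.cuda.synchronize`, for example `time_op(lambda: a1 ** 2, sync=torch.cuda.synchronize)`, and compare the result against `time_op(lambda: a ** 2)`.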
