No option to change the GPU when using torch.distributed.init_process_group

Posted 2024-05-07 18:50:21

When running the NVIDIA code for mixed-precision training on ImageNet, we execute the following:

$ python -m torch.distributed.launch --nproc_per_node=n main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 ./

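Once torch.distributed.init_process_group(backend='nccl', init_method='env://') is executed, PyTorch spawns n processes on n GPUs, as given by the argument torch.distributed.launch --nproc_per_node=n. These processes use GPU indices starting at 0 and going up to n-1. Unfortunately, there seems to be no way to make the GPU indices start anywhere other than zero, or to pick an arbitrary set of GPUs. I also tried the following: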

$ CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node=4 main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 ./

I got torch._C._cuda_setDevice(device) RuntimeError: CUDA error: invalid device ordinal

I also set os.environ['CUDA_VISIBLE_DEVICES'] to the string 4,5,6,7, but got the same error as above.
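For context on the two attempts above: CUDA_VISIBLE_DEVICES renumbers the devices it exposes, so inside each process the four listed GPUs are addressed as 0 to 3, not 4 to 7. Any index of 4 or higher is then an invalid device ordinal. The pure-Python sketch below only illustrates that mapping. The helper name `physical_gpu_for` is hypothetical (it is not a PyTorch or CUDA API), and it assumes the variable holds plain integer ids and that the training script passes the local rank to torch.cuda.set_device.

```python
def physical_gpu_for(local_rank, cuda_visible_devices):
    """Map a process-local device index to the physical GPU id it refers to.

    Hypothetical helper for illustration only. It assumes
    CUDA_VISIBLE_DEVICES contains comma-separated integer ids.
    """
    # The position in the list is the index the process sees (0, 1, 2, ...).
    visible = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    if not 0 <= local_rank < len(visible):
        # Mirrors the situation in which CUDA reports "invalid device ordinal".
        raise ValueError(
            f"invalid device ordinal {local_rank}: only {len(visible)} device(s) visible"
        )
    return visible[local_rank]


if __name__ == "__main__":
    for rank in range(4):
        print(f"local rank {rank} -> physical GPU {physical_gpu_for(rank, '4,5,6,7')}")
```

Under these assumptions, an "invalid device ordinal" error with CUDA_VISIBLE_DEVICES=4,5,6,7 would point to something requesting an index outside 0 to 3, or to the machine exposing fewer GPUs than the list names. Note also that setting os.environ['CUDA_VISIBLE_DEVICES'] inside the script only takes effect if it happens before CUDA is initialized in that process.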

Can you help me with this?

My guess is that we need to somehow change the source code of torch.distributed.init_process_group so that it can select GPUs whose indices do not start from zero.
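An alternative that avoids touching the library source may be to choose the device in the training script itself, since the GPU a process uses is normally selected with torch.cuda.set_device before the process group is initialized. The following is a minimal sketch under stated assumptions, not the NVIDIA script's actual code: `GPU_IDS`, `device_for_rank`, and `init_distributed` are hypothetical names, CUDA_VISIBLE_DEVICES is assumed to be unset so that all GPUs are visible, and the local rank is assumed to come from the launcher (torch.distributed.launch passes it as the --local_rank argument).

```python
# Hypothetical list of GPU indices this job should use (an assumption for illustration).
GPU_IDS = [4, 5, 6, 7]


def device_for_rank(local_rank, gpu_ids=GPU_IDS):
    """Return the CUDA device index the process with this local rank should use."""
    if not 0 <= local_rank < len(gpu_ids):
        raise ValueError(f"local_rank {local_rank} has no GPU in {gpu_ids}")
    return gpu_ids[local_rank]


def init_distributed(local_rank, gpu_ids=GPU_IDS):
    """Bind this process to its GPU, then join the process group."""
    # Imported lazily so device_for_rank can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    device = device_for_rank(local_rank, gpu_ids)
    torch.cuda.set_device(device)  # select the GPU before any NCCL communication
    dist.init_process_group(backend="nccl", init_method="env://")
    return device
```

With --nproc_per_node=4, local ranks 0 to 3 would map to devices 4 to 7 in this sketch. The indices here follow CUDA's own device ordering, which may differ from the order shown by nvidia-smi unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set.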

Tags: py node gpu init main torch launch cuda

0 answers

No answers yet
