I'm using PyTorch DataParallel with 2 GPUs. Why is my model's state_dict empty on one GPU and missing keys on the other?

Posted 2024-09-26 22:10:38


I have a question about this GitHub project: https://github.com/researchmm/TTSR. If I run it on a single GPU, everything works fine. As soon as I enable a second GPU and use torch.nn.DataParallel, I get a "Missing key(s) in state_dict" error:

[2021-08-03 09:01:00,829] - [trainer.py file line:70] - INFO: Current epoch learning rate: 1.000000e-04
Traceback (most recent call last):
  File "/rwthfs/rz/cluster/home/ps815691/git/TTSR/main.py", line 53, in <module>
    t.train(current_epoch=epoch, is_init=False)
  File "/rwthfs/rz/cluster/home/ps815691/git/TTSR/trainer.py", line 126, in train
    sr_lv1, sr_lv2, sr_lv3 = self.model(sr=sr) 
  File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/ps815691/.local/lib/python3.9/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/rwthfs/rz/cluster/home/ps815691/git/TTSR/model/TTSR.py", line 32, in forward
    self.LTE_copy.load_state_dict(self.LTE.state_dict())#, strict=False) 
  File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LTE:
    Missing key(s) in state_dict: "slice1.0.weight", "slice1.0.bias", "slice2.2.weight", "slice2.2.bias", "slice2.5.weight", "slice2.5.bias", "slice3.7.weight", "slice3.7.bias", "slice3.10.weight", "slice3.10.bias". 

I printed the state_dicts of LTE and LTE_copy on each GPU:

LTE GPU1    odict_keys([])
LTE GPU0    odict_keys(['sub_mean.weight', 'sub_mean.bias'])
LTE_Copy GPU1    odict_keys([])
LTE_Copy GPU0    odict_keys(['slice1.0.weight', 'slice1.0.bias', 'slice2.2.weight', 'slice2.2.bias', 'slice2.5.weight', 'slice2.5.bias', 'slice3.7.weight', 'slice3.7.bias', 'slice3.10.weight', 'slice3.10.bias', 'sub_mean.weight', 'sub_mean.bias'])

I don't understand why this happens. Let me briefly walk through the code: execution starts in main.py. First, the model is initialized from model/TTSR.py. The TTSR model consists of several sub-models; two of them are LTE and LTE_copy. The model is then wrapped in nn.DataParallel, the trainer (trainer.py) is initialized with it, and t.train starts the training:

_model = TTSR.TTSR(args).to(device)
_model = nn.DataParallel(_model, list(range(args.num_gpu)))
t = Trainer(args, _logger, _dataloader, _model, _loss_all)
t.train(current_epoch=epoch, is_init=True)
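My suspicion is that when DataParallel replicates the module for each device, the replicas' parameters are re-attached as plain tensors rather than registered nn.Parameters, and state_dict() only collects registered parameters and buffers. Here is a minimal CPU-only sketch of that mechanism (the LTEToy module is made up for illustration, not from TTSR; the manual _parameters manipulation just simulates what replication does to a replica):

```python
import torch
import torch.nn as nn

# state_dict() only collects entries registered in _parameters/_buffers.
# If replication re-attaches weights as plain tensors (as DataParallel
# does inside its per-device replicas), they vanish from state_dict().

class LTEToy(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)

m = LTEToy()
print(list(m.state_dict().keys()))  # ['conv.weight', 'conv.bias']

# Simulate replication: drop the registered parameters and re-attach
# them as plain (unregistered) tensors.
w, b = m.conv.weight.detach(), m.conv.bias.detach()
m.conv._parameters.clear()
m.conv.weight = w  # plain tensor, not nn.Parameter
m.conv.bias = b

print(list(m.state_dict().keys()))  # the keys are gone
```

If that is what happens, it would explain why a state_dict printed from inside forward() on a replica comes out empty or partial.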

In the train function, after a batch has been passed through the model, the model's output is fed back into the model to obtain parts of the loss function (trainer.py line 97). The model then executes this code in TTSR.py:

### used in transferal perceptual loss
self.LTE_copy.load_state_dict(self.LTE.state_dict())
sr_lv1, sr_lv2, sr_lv3 = self.LTE_copy((sr + 1.) / 2.)
return sr_lv1, sr_lv2, sr_lv3
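One workaround I have been considering (untested against TTSR itself; the Toy module and names below are made up for illustration) is to move the weight sync out of forward() entirely and run it on the underlying module, where the parameters are real registered nn.Parameters, before the parallel call:

```python
import torch
import torch.nn as nn

# Sketch: sync LTE -> LTE_copy outside forward(), on the unwrapped
# module, instead of calling load_state_dict inside forward() where
# DataParallel replicas hold unregistered plain tensors.

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.LTE = nn.Linear(4, 4)
        self.LTE_copy = nn.Linear(4, 4)

    def forward(self, x):
        # forward() no longer touches state_dict()
        return self.LTE_copy(x)

model = Toy()
# with the real model this would be _model.module.LTE etc., run in
# trainer.py before self.model(sr=sr):
target = model.module if isinstance(model, nn.DataParallel) else model
target.LTE_copy.load_state_dict(target.LTE.state_dict())
out = model(torch.randn(2, 4))
```

Whether that is acceptable here depends on whether the sync really has to happen per forward pass, which I am not sure about.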

Does anyone know why the error above is thrown? It does not occur if I use load_state_dict(..., strict=False), but doesn't that just ignore the underlying problem? For example, GPU1 does not seem to have any LTE state_dict in memory at all.

