<p>Welcome to PyTorch!</p>
<p>Here is how I would set up your training; please check the comments.</p>
<pre><code># how the community usually does the imports:
import torch  # some people do: import torch as th
import torch.nn as nn
import torch.optim as optim

if __name__ == '__main__':
    # setting some parameters:
    batch_size = 32
    n_dims = 128
    # select the GPU if one is available
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # initializing a simple neural net
    net = nn.Sequential(nn.Linear(n_dims, n_dims // 2),  # batch norm is not usually applied directly to the input
                        nn.BatchNorm1d(n_dims // 2),  # batch norm goes before the activation function (it centers the input and helps make the dims of the previous layer independent of each other)
                        nn.ReLU(),  # the most common activation function
                        nn.Linear(n_dims // 2, 1))  # final layer
    net.to(device)  # the model is copied to the GPU if it is available
    optimizer = optim.SGD(net.parameters(), lr=0.01)  # it is better to start with a low lr and increase it in later experiments to avoid training divergence; the range [1.e-6, 5.e-2] is recommended.
    for i in range(10):
        # generating random data:
        board = torch.rand([batch_size, n_dims])
        # for sequences: [batch_size, channels, L]
        # for image data: [batch_size, channels, W, H]
        # for videos: [batch_size, channels, L, W, H]
        board = board.to(device)  # the data is copied to the GPU if it is available
        optimizer.zero_grad()  # the convention the community uses, though the result is the same as net.zero_grad()
        nn_outputs = net(board)  # don't call net.forward(x), call net(x): PyTorch applies some hooks in net.__call__(x) that are needed for backpropagation.
        loss = ((nn_outputs - 1)**2).mean()  # using .mean() makes your training less sensitive to the batch size.
        print(i, nn_outputs, loss.item())
        loss.backward()
        optimizer.step()
</code></pre>
<p>One comment about batch norm: for each dimension, it computes the mean and standard deviation over the batch (see the docs <a href="https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html#torch.nn.BatchNorm2d" rel="nofollow noreferrer">https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html#torch.nn.BatchNorm2d</a>):</p>
<pre><code>x_normalized = (x.mean(dim=0) / (x.std(dim=0) + e-6)) * scale + shift
</code></pre>
<p>where <code>scale</code> and <code>shift</code> are learnable parameters. If only one example is given per batch, <code>x.std(0) = 0</code>, which would make <code>x_normalized</code> degenerate.</p>
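<p>A minimal sketch (assuming a freshly constructed <code>nn.BatchNorm1d</code>, so <code>scale</code> starts at 1 and <code>shift</code> at 0) that checks the formula against PyTorch and shows how the one-example-per-batch case is handled. Note that PyTorch keeps eps inside the square root (<code>sqrt(var + eps)</code>) and uses the biased std, so the manual version only agrees up to a small tolerance:</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand([32, 8])

bn = nn.BatchNorm1d(8)  # fresh layer: scale = 1, shift = 0
bn.train()
out_bn = bn(x)

# manual normalization following the formula above
out_manual = (x - x.mean(dim=0)) / (x.std(dim=0, unbiased=False) + 1e-5)
print('max difference:', (out_bn - out_manual).abs().max().item())  # tiny

# with a batch of one, the per-batch std is undefined, so in training
# mode PyTorch raises an error instead of emitting degenerate values
single = torch.rand([1, 8])
try:
    bn(single)
except ValueError as err:
    print('train mode:', err)

# in eval mode the stored running statistics are used instead of the
# batch statistics, so a single example works fine
bn.eval()
print('eval mode output shape:', bn(single).shape)
```

This is also why batch norm behaves differently between <code>net.train()</code> and <code>net.eval()</code>: at evaluation time it normalizes with running estimates collected during training rather than with the current batch.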