Karpathy Pong: cross-entropy / log-loss explanation of y - aprob

    # forward the policy network and sample an action from the returned probability
    # action 2 is up and 3 is down
    aprob, h = policy_forward(x)
    print("aprob\n {}\n h\n {}\n".format(aprob, h))
    # 2 is up, 3 is down
    action = 2 if np.random.uniform() < aprob else 3  # roll the dice!
    print("action\n {}\n".format(action))

    # record various intermediates (needed later for backprop)
    xs.append(x)  # observation, ie. the difference frame?
    #print("xs {}".format(xs))
    hs.append(h)  # hidden state obtained from forward pass
    #print("hs {}".format(hs))
    # if action is up, y = 1, else 0
    y = 1 if action == 2 else 0  # a "fake label"
    print("y \n{}\n".format(y))
    dlogps.append(y - aprob)  # grad that encourages the action that was taken to be taken (see http://cs231n.github.io/neural-networks-2/#losses if confused)
    print("dlogps\n {}\n".format(dlogps))

    # step the environment and get new measurements
    observation, reward, done, info = env.step(action)
    print("observation\n {}\n reward\n {}\n done\n {}\n ".format(observation, reward, done))
    reward_sum += reward
    print("reward_sum\n {}\n".format(reward_sum))
    drs.append(reward)  # record reward (has to be done after we call step() to get reward for previous action)
    print("drs\n {}\n".format(drs))

    if done:  # an episode finished
        episode_number += 1

1 answer

User

Answer #1 · Posted 2024-09-30 16:31:34

My explanation of how he arrives at y - aprob:

In the forward pass through his network, the last step applies the sigmoid S(x) to the output of the last neuron:

S(x) = 1 / (1+e^-x)

and its gradient is

grad S(x) = S(x)(1 - S(x))
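As a quick sanity check of the identity grad S(x) = S(x)(1 - S(x)), here is a minimal sketch that compares the analytic gradient against a central finite difference. The helper names (`sigmoid_grad`, `numeric_grad`) and the sample points are made up for illustration and are not part of Karpathy's script.

```python
import math

def sigmoid(x):
    """Logistic function S(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Analytic derivative S(x) * (1 - S(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# Arbitrary sample points, chosen only for illustration.
points = [-3.0, -0.5, 0.0, 0.5, 3.0]
max_err = max(abs(numeric_grad(sigmoid, x) - sigmoid_grad(x)) for x in points)
print("largest gap between numeric and analytic gradient:", max_err)
```

The gap should be tiny (limited by floating-point round-off), which is what you would expect if the closed form is right.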

To increase or decrease the likelihood of the action you took, you have to compute the log probability of your "label":

L = log p(y|x)

To backpropagate this, you have to compute the gradient of your log-likelihood L:

grad L = grad log p(y|x)

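Since the sigmoid is applied to the output, p = S(z) with z the output of the last neuron before the sigmoid, you are actually computing (for the case y = 1, i.e. the action "up" was taken)

grad L = grad log S(z)  
grad L = 1 / S(z) * S(z)(1 - S(z))  
grad L = (1 - S(z))  
**grad L = (1 - p)**

For y = 0 the same steps applied to log(1 - S(z)) give grad L = -p.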

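This is really nothing more than the log loss / cross-entropy with its sign flipped. The more general formula covering both labels is:

L = y log p + (1-y) log(1-p)  
grad L = y - p, with y either 0 or 1 and the gradient taken with respect to the pre-sigmoid output

The cross-entropy loss is -L, so minimizing it is the same as maximizing L; its gradient is p - y.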
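To see that y - p really is the gradient of the log-likelihood y log p + (1-y) log(1-p) with respect to the pre-sigmoid output, the following sketch checks both labels numerically. The variable `z` (the pre-sigmoid output) and its value are assumptions picked for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(z, y):
    """log p(y|x) for a Bernoulli output with p = sigmoid(z)."""
    p = sigmoid(z)
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)

def numeric_grad(z, y, eps=1e-6):
    """Central finite difference of the log-likelihood with respect to z."""
    return (log_likelihood(z + eps, y) - log_likelihood(z - eps, y)) / (2.0 * eps)

z = 0.3          # arbitrary pre-sigmoid output (illustrative value)
p = sigmoid(z)   # plays the role of aprob in the question's code

gaps = {}
for y in (0, 1):
    analytic = y - p                 # the quantity appended to dlogps
    numeric = numeric_grad(z, y)
    gaps[y] = abs(numeric - analytic)
    print("y =", y, "analytic:", analytic, "numeric:", numeric)
```

For y = 1 the gradient is 1 - p (positive, pushing p up) and for y = 0 it is -p (negative, pushing p down), which matches the idea of encouraging whichever action was sampled.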

Since Andrej did not use a framework like TensorFlow or PyTorch in his example, he did part of the backpropagation by hand right there.
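To illustrate what "backpropagation by hand" means here, the sketch below pushes y - p one step further back through a single linear layer using the chain rule, then verifies the result with finite differences. This is a deliberately simplified, hypothetical one-layer policy with made-up weights and inputs; it is not the two-layer network from his script.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x):
    """Hypothetical one-layer policy: probability of 'up' for input x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)

def log_prob(w, x, y):
    """Log probability of the label y under the policy."""
    p = forward(w, x)
    return math.log(p) if y == 1 else math.log(1.0 - p)

def manual_grad(w, x, y):
    """Chain rule by hand: d log p(y|x) / d w_i = (y - p) * x_i."""
    p = forward(w, x)
    return [(y - p) * xi for xi in x]

w = [0.2, -0.4, 0.1]   # made-up weights
x = [1.0, 0.5, -2.0]   # made-up input features
y = 1                  # "fake label": pretend the sampled action was 'up'

grad = manual_grad(w, x, y)

# Finite-difference check of every weight's gradient.
eps = 1e-6
numeric = []
for i in range(len(w)):
    w_plus = list(w)
    w_plus[i] += eps
    w_minus = list(w)
    w_minus[i] -= eps
    numeric.append((log_prob(w_plus, x, y) - log_prob(w_minus, x, y)) / (2.0 * eps))

max_gap = max(abs(a - b) for a, b in zip(grad, numeric))
print("manual gradient:", grad)
print("largest gap to numeric gradient:", max_gap)
```

A framework with automatic differentiation would derive the same gradient from the log-probability alone; writing it out by hand means the y - p term has to be supplied explicitly as the starting point of the backward pass.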

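I was confused at first too, and it took me a while to figure out what magic was happening there. Maybe he could have been a bit clearer and given some hints.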

At least, this is my humble understanding of his code :)
