Binary cross-entropy backpropagation with TensorFlow

Posted 2024-06-25 07:07:13


I am trying to implement a TensorFlow version of this gist on reinforcement learning. According to the comments there, it uses binary cross-entropy from logits. I tried tf.keras.losses.binary_crossentropy, but with identical inputs and initial weights it produces completely different gradients. During training the TensorFlow version performs badly and does not learn at all, so something is clearly wrong, but I cannot figure out what. Here is the test I ran; a smaller sanity check of the loss itself follows after the loop:

import numpy as np
import tensorflow as tf

x_size = 2
h_size = 3
y_size = 1
rms_discount = 0.99
epsilon = 1e-7
learning_rate = 0.001

x = np.arange(x_size).astype('float32').reshape([1, -1])
y = np.zeros([1, y_size]).astype('float32')
r = np.ones([1, 1]).astype('float32')

wh1 = np.arange(x_size * h_size).astype('float32').reshape([x_size, h_size])
wy1 = np.arange(h_size * y_size).astype('float32').reshape([h_size, y_size])

cache_wh1 = np.zeros_like(wh1)
cache_wy1 = np.zeros_like(wy1)

# RMSprop optimizer (not used below; the updates are applied manually)
optimizer = tf.keras.optimizers.RMSprop(learning_rate, rms_discount, epsilon=epsilon)

# Keras layers initialised with the same constant weights as the NumPy version, no bias
wh2 = tf.keras.layers.Dense(
  h_size,
  'relu',
  False,
  tf.keras.initializers.constant(wh1)
)
wy2 = tf.keras.layers.Dense(
  y_size,
  None,
  False,
  tf.keras.initializers.constant(wy1)
)

cache_wh2 = np.zeros_like(wh1)
cache_wy2 = np.zeros_like(wy1)

for i in range(100):
  # NumPy forward pass: ReLU hidden layer, linear output
  h1 = np.matmul(x, wh1)
  h1[h1 < 0] = 0.
  y_pred1 = np.matmul(h1, wy1)

  # Hand-derived gradients, taking -(y - y_pred1) as dC/dy_pred
  dCdy = -(y - y_pred1)
  dCdwy = np.matmul(h1.T, dCdy)
  dCdh = np.matmul(dCdy, wy1.T)
  dCdh[h1 <= 0] = 0  # ReLU backward mask (h1 was clipped in place above, so use <= 0)
  dCdwh = np.matmul(x.T, dCdh)

  gradients1 = [dCdwh, dCdwy]

  cache_wh1 = rms_discount * cache_wh1 + (1 - rms_discount) * dCdwh**2
  wh1 -= learning_rate * dCdwh / (np.sqrt(cache_wh1) + epsilon)

  cache_wy1 = rms_discount * cache_wy1 + (1 - rms_discount) * dCdwy**2
  wy1 -= learning_rate * dCdwy / (np.sqrt(cache_wy1) + epsilon)

  # TensorFlow forward pass; gradients via GradientTape and binary_crossentropy
  with tf.GradientTape() as tape:
    h2 = wh2(x)
    y_pred2 = wy2(h2)

    loss = tf.keras.losses.binary_crossentropy(y, y_pred2, from_logits=True)

  gradients2 = tape.gradient(loss, wh2.trainable_variables + wy2.trainable_variables)

  cache_wh2 = rms_discount * cache_wh2 + (1 - rms_discount) * gradients2[0]**2
  wh2.set_weights(wh2.get_weights() - learning_rate * gradients2[0] / (np.sqrt(cache_wh2) + epsilon))

  cache_wy2 = rms_discount * cache_wy2 + (1 - rms_discount) * gradients2[1]**2
  wy2.set_weights(wy2.get_weights() - learning_rate * gradients2[1] / (np.sqrt(cache_wy2) + epsilon))

  print('1', gradients1[0])
  print('1', gradients1[1])
  print('2', gradients2[0])
  print('2', gradients2[1])
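
For reference, here is the smaller sanity check mentioned above: my understanding of what binary_crossentropy(from_logits=True) evaluates to for a single raw output, compared against the textbook sigmoid cross-entropy formula. This is just a sketch I wrote for clarity, not part of the gist:

import numpy as np
import tensorflow as tf

z = np.array([[2.5]], dtype='float32')       # a raw output (logit), no sigmoid applied
y_true = np.array([[0.]], dtype='float32')

tf_loss = tf.keras.losses.binary_crossentropy(y_true, z, from_logits=True)

p = 1. / (1. + np.exp(-z))                   # sigmoid(z)
manual_loss = -(y_true * np.log(p) + (1. - y_true) * np.log(1. - p))

print(tf_loss.numpy(), manual_loss.ravel())  # I expect these two values to agree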

The partial derivative of the cost/loss with respect to y_pred is the same, so what remains should be standard backpropagation, just with RMSprop. Yet the two versions behave differently. Why?
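
To narrow things down, this is the kind of isolated check I have in mind: differentiate only the loss with respect to the raw output and compare it with the -(y - y_pred) term my NumPy code backpropagates. Again just a sketch with placeholder values, separate from the training loop:

import numpy as np
import tensorflow as tf

y_true = np.zeros([1, 1], dtype='float32')
logit = tf.Variable([[1.5]], dtype='float32')  # stands in for y_pred2

with tf.GradientTape() as tape:
  loss = tf.keras.losses.binary_crossentropy(y_true, logit, from_logits=True)

dloss_dlogit = tape.gradient(loss, logit)      # what TensorFlow backpropagates
manual = -(y_true - logit.numpy())             # what my NumPy version backpropagates

print('tf    :', dloss_dlogit.numpy())
print('manual:', manual)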


Tags: cache, size, rate, tf, np, discount, keras, learning