Python TensorFlow DQN: what to do next

Published 2024-09-27 21:32:18


I'm not sure what the next step is for my deep Q-network. I'm trying to optimize a bus route. I have a distance matrix and data on how popular each stop is.

The distance matrix is a 2D array giving the distance between every pair of stops. With 4 stops it looks like this:

distance = np.array([[0, stop1-stop2, stop1-stop3, stop1-stop4],
                     [stop2-stop1, 0, stop2-stop3, stop2-stop4],
                     [stop3-stop1, stop3-stop2, 0, stop3-stop4],
                     [stop4-stop1, stop4-stop2, stop4-stop3, 0]])

The rewards matrix is simple:

(1/distance) * (percent of total riders who get on and off at specific stop)

This is meant to ensure that stops that are a short distance away and serve a high number of riders get the highest reward value.
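That reward computation can be sketched in NumPy (a minimal sketch with made-up distances and ridership fractions; the diagonal is left at 0 so a stop never rewards travelling to itself, and to avoid dividing by zero):

```python
import numpy as np

# Hypothetical example with 4 stops: pairwise distances (km) and the
# fraction of total riders who get on/off at each stop.
distance = np.array([[0.0, 2.0, 4.0, 1.0],
                     [2.0, 0.0, 3.0, 5.0],
                     [4.0, 3.0, 0.0, 2.5],
                     [1.0, 5.0, 2.5, 0.0]])
ridership = np.array([0.4, 0.3, 0.2, 0.1])  # sums to 1.0

# 1/distance off the diagonal, 0 on the diagonal (no self-division).
inv_dist = np.divide(1.0, distance,
                     out=np.zeros_like(distance),
                     where=distance > 0)

# (1/distance) * (fraction of riders at the destination stop);
# ridership broadcasts across the columns.
rewards = inv_dist * ridership
```

So `rewards[i, j]` is high when stop `j` is both close to stop `i` and popular.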

I made a class for each stop. It tracks how many people are waiting at the stop and periodically updates as more people arrive. When the bus 'visits' a stop, its waiting count is reset to 0, so its reward becomes 0 until more people show up.
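One way such a stop class might look (the names and the arrival rule are hypothetical, since the post doesn't show the actual class):

```python
import random

class Stop:
    """A bus stop that tracks how many riders are currently waiting."""

    def __init__(self, popularity):
        self.popularity = popularity  # hypothetical: fraction of total ridership
        self.waiting = 0              # riders currently at the stop

    def tick(self):
        # Periodic update: a new rider arrives with probability equal to
        # the stop's popularity (a toy arrival rule for illustration).
        self.waiting += int(random.random() < self.popularity)

    def visit(self):
        # The bus arrives: everyone boards and the count resets to 0,
        # so the stop's reward stays 0 until more riders show up.
        boarded = self.waiting
        self.waiting = 0
        return boarded
```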

I set up the model with the following code:

    import tensorflow as tf

    # Current game states: rows of the rewards matrix corresponding to the
    # agent's current stop. These are the inputs to the neural network.
    observations = tf.placeholder('float32', shape=[None, num_stops])

    # Actions: an integer from 0 to num_stops - 1, denoting which stop the
    # agent traveled to from its current location.
    actions = tf.placeholder('int32', shape=[None])

    # Rewards received by the agent for its decisions: +1 if the agent 'wins'
    # the game (drives the system score to 0, which will only happen if the
    # bus stops are not updated periodically), with discounts applied.
    rewards = tf.placeholder('float32', shape=[None])  # +1, -1 with discounts


# Model


    # First layer of the neural network: takes the observations tensor as
    # input and has a hidden layer of 200 units. This number is arbitrary;
    # I'm not sure how to tune it for peak performance.
    Y = tf.layers.dense(observations, 200, activation=tf.nn.relu)
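One possible continuation (not the post's code, just a sketch in the same TF 1.x placeholder style, written against `tf.compat.v1` so it also runs under TF 2.x) is to add a logits layer over the stops and a policy-gradient loss, i.e. cross-entropy of the chosen actions weighted by the collected rewards, as in the Pong write-ups:

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

num_stops = 4  # assumed value for the sketch

observations = tf1.placeholder(tf.float32, shape=[None, num_stops])
actions = tf1.placeholder(tf.int32, shape=[None])
rewards = tf1.placeholder(tf.float32, shape=[None])

hidden = tf1.layers.dense(observations, 200, activation=tf.nn.relu)
logits = tf1.layers.dense(hidden, num_stops)  # one score per stop

# Policy gradient: cross-entropy of the taken actions, weighted by the
# (discounted) rewards those actions led to. Actions followed by positive
# reward are made more likely, and vice versa.
cross_entropy = tf1.nn.sparse_softmax_cross_entropy_with_logits(
    labels=actions, logits=logits)
loss = tf1.reduce_mean(cross_entropy * rewards)
train_op = tf1.train.AdamOptimizer(1e-3).minimize(loss)
```

At play time you would run `logits` on the current observation and sample (or argmax) a stop from the softmax over them.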

From here I don't know what to do. I want to run the neural network in batches rather than updating the weights after every bus action (moving from one stop to another). Instead, I want to wait until a full 'game' is complete, e.g. a predetermined number of actions taken by the bus before the game ends. A reward is given if the game is won, e.g. the bus reaches every stop within the allotted time; I want to keep it simple with +1. Earlier actions would be discounted by a discount rate.
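The "discount earlier actions" step can be done in a few lines once the game's reward list is collected; this is the standard discounted-return computation (`gamma` is an assumed hyperparameter):

```python
import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """Propagate each reward backwards through time with discount factor
    gamma, so earlier actions get (discounted) credit for later outcomes."""
    discounted = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t] = running
    return discounted
```

For example, a game that only pays +1 at the end spreads that credit back over every action that led to it.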

I'm taking this approach because I want the agent to see the long-term consequences of its individual actions. I saw this in a write-up about an agent learning to play Pong, and I'm trying to implement a similar agent for my system. Thanks in advance for your help.


Tags: stops, tf, matrix, agent, distance
