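<p>I implemented the <strong>Q-learning</strong> algorithm and ran it on <strong>FrozenLake-v0</strong> from OpenAI gym.
Over 10000 episodes I get a total reward of 185 during training and 7333 during testing.
Is this good?</p>
<p>I also tried the <strong>Dyna-Q</strong> algorithm, but it performs worse than Q-learning.
Over 10000 episodes with 50 planning steps, the total reward is about 200 during training and about 700-900 during testing.</p>
<p>Why is this happening?</p>
<p>The code is below. Is there something wrong with it?</p>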
<pre><code># Setup
import gym
import numpy as np

env = gym.make('FrozenLake-v0')
epsilon = 0.9
lr_rate = 0.1
gamma = 0.99
planning_steps = 0
total_episodes = 10000
max_steps = 100
</code></pre>
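<p>One thing that may be worth checking in this setup: with epsilon = 0.9 and a <code>choose_action</code> that explores whenever the random draw falls below epsilon, roughly 90% of training actions are random, which could account for a low training reward next to a much higher greedy test reward. Below is a minimal sketch of a decaying exploration rate. The function <code>decayed_epsilon</code>, its parameters and the decay constant are hypothetical and not part of my original code.</p>

```python
import numpy as np


def decayed_epsilon(episode, eps_start=0.9, eps_min=0.01, decay=0.999):
    """Exponentially decayed exploration rate (hypothetical schedule)."""
    return max(eps_min, eps_start * decay ** episode)


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy choice over one row of a Q-table."""
    if rng.uniform(0, 1) < epsilon:
        return int(rng.integers(len(q_row)))  # explore: uniform random action
    return int(np.argmax(q_row))              # exploit: greedy action


rng = np.random.default_rng(0)
q_row = np.array([0.0, 0.5, 0.1, 0.2])
for episode in (0, 1000, 10000):
    eps = decayed_epsilon(episode)
    print(episode, round(eps, 4), choose_action(q_row, eps, rng))
```

<p>Whether a decay like this helps on FrozenLake is something to test rather than assume. It only makes training reward and test reward more comparable.</p>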
<p>Training and testing()</p>
<pre><code>while t < max_steps:
    action = agent.choose_action(state)
    state2, reward, done, info = agent.env.step(action)
    # Removed in testing
    agent.learn(state, state2, reward, action)
    agent.model.add(state, action, state2, reward)
    agent.planning(planning_steps)
    # Till here
    state = state2
</code></pre>
<pre><code>def add(self, state, action, state2, reward):
    self.transitions[state, action] = state2
    self.rewards[state, action] = reward

def sample(self, env):
    state, action = 0, 0
    # Random visited state
    if all(np.sum(self.transitions, axis=1)) <= 0:
        state = np.random.randint(env.observation_space.n)
    else:
        state = np.random.choice(np.where(np.sum(self.transitions, axis=1) > 0)[0])
    # Random action in that state
    if all(self.transitions[state]) <= 0:
        action = np.random.randint(env.action_space.n)
    else:
        action = np.random.choice(np.where(self.transitions[state] > 0)[0])
    return state, action

def step(self, state, action):
    state2 = self.transitions[state, action]
    reward = self.rewards[state, action]
    return state2, reward

def choose_action(self, state):
    if np.random.uniform(0, 1) < epsilon:
        return self.env.action_space.sample()
    else:
        return np.argmax(self.Q[state, :])

def learn(self, state, state2, reward, action):
    # predict = Q[state, action]
    # Q[state, action] = Q[state, action] + lr_rate * (target - predict)
    target = reward + gamma * np.max(self.Q[state2, :])
    self.Q[state, action] = (1 - lr_rate) * self.Q[state, action] + lr_rate * target

def planning(self, n_steps):
    # if len(self.transitions)>planning_steps:
    for i in range(n_steps):
        state, action = self.model.sample(self.env)
        state2, reward = self.model.step(state, action)
        self.learn(state, state2, reward, action)
</code></pre>
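<p>Three things in the model code above look suspicious to me, though I have not confirmed that they explain the gap. First, <code>all(...) <= 0</code> compares a single boolean with 0 instead of testing each entry. Second, a stored <code>state2</code> of 0 is indistinguishable from "never visited" in the <code>> 0</code> checks. Third, FrozenLake-v0 is slippery by default, so a model that keeps only the last observed next state is a rough approximation. Below is a minimal sketch of an alternative model that counts observed outcomes. The class <code>CountModel</code> and its fields are hypothetical and not part of my original code.</p>

```python
import numpy as np


class CountModel:
    """Tabular model that counts observed (state, action, next_state) outcomes."""

    def __init__(self, n_states, n_actions, seed=None):
        self.counts = np.zeros((n_states, n_actions, n_states), dtype=np.int64)
        self.reward_sum = np.zeros((n_states, n_actions, n_states))
        self.rng = np.random.default_rng(seed)

    def add(self, state, action, state2, reward):
        # Record one real transition; a next state of 0 is counted like any other.
        self.counts[state, action, state2] += 1
        self.reward_sum[state, action, state2] += reward

    def sample(self):
        """Pick a previously observed (state, action) pair uniformly at random."""
        visited = np.argwhere(self.counts.sum(axis=2) > 0)
        if len(visited) == 0:
            raise ValueError("no transitions recorded yet")
        state, action = visited[self.rng.integers(len(visited))]
        return int(state), int(action)

    def step(self, state, action):
        """Sample a next state in proportion to how often it was observed."""
        n = self.counts[state, action]
        state2 = int(self.rng.choice(len(n), p=n / n.sum()))
        reward = self.reward_sum[state, action, state2] / n[state2]
        return state2, float(reward)


model = CountModel(n_states=16, n_actions=4, seed=0)
model.add(4, 1, 0, 0.0)    # a transition into state 0 still marks (4, 1) as visited
model.add(14, 2, 15, 1.0)
state, action = model.sample()
print(state, action, model.step(state, action))
```

<p>With a model like this, <code>planning</code> would call <code>model.sample()</code> only once at least one real transition has been recorded, so it never updates Q from unvisited pairs.</p>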