我目前正在阅读Reinforcement Learning: An Introduction(RL:AI),并尝试用一个n武装的强盗和简单的奖励平均来重现第一个例子。你知道吗
平均值
new_estimate = current_estimate + 1.0 / step * (reward - current_estimate)
为了从PDF中复制图表,我生成了2000个强盗游戏,让不同的代理玩2000个强盗1000步(如PDF中所述),然后平均奖励以及最佳行动的百分比。你知道吗
在PDF中,结果如下所示:
然而,我无法重现这一点。如果我使用简单的平均法,所有有探索(epsilon > 0
)的代理实际上比没有探索的代理玩得更糟。这很奇怪,因为探索的可能性应该允许代理更频繁地离开局部最优,并接触到更好的行动。你知道吗
正如您在下面看到的,我的实现不是这样的。还要注意,我添加了使用加权平均的代理。但即使在这种情况下,提高epsilon
也会导致代理性能的下降。你知道吗
你知道我的代码有什么问题吗?你知道吗
from abc import ABC
from typing import List
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from multiprocessing.pool import Pool
class Strategy(ABC):
def update_estimates(self, step: int, estimates: np.ndarray, action: int, reward: float):
raise NotImplementedError()
class Averaging(Strategy):
def __str__(self):
return 'avg'
def update_estimates(self, step: int, estimates: np.ndarray, action: int, reward: float):
current = estimates[action]
return current + 1.0 / step * (reward - current)
class WeightedAveraging(Strategy):
def __init__(self, alpha):
self.alpha = alpha
def __str__(self):
return 'weighted-avg_alpha=%.2f' % self.alpha
def update_estimates(self, step: int, estimates: List[float], action: int, reward: float):
current = estimates[action]
return current + self.alpha * (reward - current)
class Agent:
def __init__(self, nb_actions, epsilon, strategy: Strategy):
self.nb_actions = nb_actions
self.epsilon = epsilon
self.estimates = np.zeros(self.nb_actions)
self.strategy = strategy
def __str__(self):
return ','.join(['eps=%.2f' % self.epsilon, str(self.strategy)])
def get_action(self):
best_known = np.argmax(self.estimates)
if np.random.rand() < self.epsilon and len(self.estimates) > 1:
explore = best_known
while explore == best_known:
explore = np.random.randint(0, len(self.estimates))
return explore
return best_known
def update_estimates(self, step, action, reward):
self.estimates[action] = self.strategy.update_estimates(step, self.estimates, action, reward)
def reset(self):
self.estimates = np.zeros(self.nb_actions)
def play_bandit(agent, nb_arms, nb_steps):
agent.reset()
bandit_rewards = np.random.normal(0, 1, nb_arms)
rewards = list()
optimal_actions = list()
for step in range(1, nb_steps + 1):
action = agent.get_action()
reward = bandit_rewards[action] + np.random.normal(0, 1)
agent.update_estimates(step, action, reward)
rewards.append(reward)
optimal_actions.append(np.argmax(bandit_rewards) == action)
return pd.DataFrame(dict(
optimal_actions=optimal_actions,
rewards=rewards
))
def main():
nb_tasks = 2000
nb_steps = 1000
nb_arms = 10
fig, (ax_rewards, ax_optimal) = plt.subplots(2, 1, sharex='col', figsize=(8, 9))
pool = Pool()
agents = [
Agent(nb_actions=nb_arms, epsilon=0.00, strategy=Averaging()),
Agent(nb_actions=nb_arms, epsilon=0.01, strategy=Averaging()),
Agent(nb_actions=nb_arms, epsilon=0.10, strategy=Averaging()),
Agent(nb_actions=nb_arms, epsilon=0.00, strategy=WeightedAveraging(0.5)),
Agent(nb_actions=nb_arms, epsilon=0.01, strategy=WeightedAveraging(0.5)),
Agent(nb_actions=nb_arms, epsilon=0.10, strategy=WeightedAveraging(0.5)),
]
for agent in agents:
print('Agent: %s' % str(agent))
args = [(agent, nb_arms, nb_steps) for _ in range(nb_tasks)]
results = pool.starmap(play_bandit, args)
df_result = sum(results) / nb_tasks
df_result.rewards.plot(ax=ax_rewards, label=str(agent))
df_result.optimal_actions.plot(ax=ax_optimal)
ax_rewards.set_title('Rewards')
ax_rewards.set_ylabel('Average reward')
ax_rewards.legend()
ax_optimal.set_title('Optimal action')
ax_optimal.set_ylabel('% optimal action')
ax_optimal.set_xlabel('steps')
plt.xlim([0, nb_steps])
plt.show()
if __name__ == '__main__':
main()
在更新规则的公式中
参数
step
应该是特定action
被执行的次数,而不是模拟的总步数。因此,您需要将该变量与操作值一起存储,以便将其用于更新。你知道吗这也可以从第2.4章增量实现末尾的伪代码框中看到:
(来源:Richard S.Sutton和Andrew G.Barto:强化学习-简介,第二版,2018年,第2.4章增量实施)
相关问题 更多 >
编程相关推荐