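I am trying to use VowpalWabbit (following the vw tutorial) to optimize the click-through rate of articles or ads (actions) given a device type (context). However, I cannot get it to converge reliably to the optimal action.
I created a minimal working example (sorry for the length):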
import random

import numpy as np
from matplotlib import pyplot as plt
from vowpalwabbit import pyvw

plt.ion()

action_space = ["article-1", "article-2", "article-3"]


def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / float(N)


def to_vw_example_format(context, cb_label=None):
    if cb_label is not None:
        chosen_action, cost, prob = cb_label
    example_string = ""
    example_string += "shared |User device={} \n".format(context)
    for action in action_space:
        if cb_label is not None and action == chosen_action:
            example_string += "1:{}:{} ".format(cost, prob)
        example_string += "|Action ad={} \n".format(action)
    # Strip the last newline
    return example_string[:-1]


# definition of problem to solve, playing out the article with highest ctr given a context
context_to_action_ctr = {
    "device-1": {"article-1": 0.05, "article-2": 0.06, "article-3": 0.04},
    "device-2": {"article-1": 0.08, "article-2": 0.07, "article-3": 0.05},
    "device-3": {"article-1": 0.01, "article-2": 0.04, "article-3": 0.09},
    "device-4": {"article-1": 0.04, "article-2": 0.04, "article-3": 0.045},
    "device-5": {"article-1": 0.09, "article-2": 0.01, "article-3": 0.07},
    "device-6": {"article-1": 0.03, "article-2": 0.09, "article-3": 0.04}
}

#vw = "--cb_explore 3 -q UA -q UU --epsilon 0.1"
vw = "--cb_explore_adf -q UA -q UU --bag 5 "
#vw = "--cb_explore_adf -q UA --epsilon 0.2"
actor = pyvw.vw(vw)

random_rewards = []
actor_rewards = []
optimal_rewards = []

for step in range(200000):
    # pick a random context
    device = random.choice(list(context_to_action_ctr.keys()))

    # let vw generate probability distribution
    # action_probabilities = np.array(actor.predict(f"|x device:{device}"))
    action_probabilities = np.array(actor.predict(to_vw_example_format(device)))

    # sample action
    probabilities = action_probabilities / action_probabilities.sum()
    action_idx = np.random.choice(len(probabilities), 1, p=probabilities)[0]
    probability = action_probabilities[action_idx]

    # get reward/regret
    action_to_reward_regret = {
        action: (1, 0) if random.random() < context_to_action_ctr[device][action] else (0, 1) for action in action_space
    }
    actor_action = action_space[action_idx]
    random_action = random.choice(action_space)
    optimal_action = {
        "device-1": "article-2",
        "device-2": "article-1",
        "device-3": "article-3",
        "device-4": "article-3",
        "device-5": "article-1",
        "device-6": "article-2",
    }[device]

    # update statistics
    actor_rewards.append(action_to_reward_regret[actor_action][0])
    random_rewards.append(action_to_reward_regret[random_action][0])
    optimal_rewards.append(action_to_reward_regret[optimal_action][0])

    # learn online
    reward, regret = action_to_reward_regret[actor_action]
    cost = -1 if reward == 1 else 0
    # actor.learn(f"{action_idx+1}:{cost}:{probability} |x device:{device}")
    actor.learn(to_vw_example_format(device, (actor_action, cost, probability)))

    if step % 100 == 0 and step > 1000:
        plt.clf()
        axes = plt.gca()
        plt.title("Reward over time")
        plt.plot(running_mean(actor_rewards, 10000), label=str(vw))
        plt.plot(running_mean(random_rewards, 10000), label="Random actions")
        plt.plot(running_mean(optimal_rewards, 10000), label="Optimal actions")
        plt.legend()
        plt.pause(0.0001)
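Essentially, there are three possible actions (articles 1-3) and six contexts (devices 1-6). Each combination has a specific CTR (click-through rate), and each context has an optimal action (the article with the highest CTR for that device). In each iteration, a random context is sampled and the reward/regret of every action is computed. If the reward is 1 (the user clicked), the cost passed to VowpalWabbit for learning is -1; if the reward is 0 (the user did not click), the cost is 0. Over time, the algorithm should find the best article for each device.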
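To make the reward-to-cost mapping concrete, here is a small sketch that prints the multi-line example text for one labeled interaction. It is a reworked, standalone variant of the to_vw_example_format helper from the code above (it drops the trailing spaces), and the device, article, and probability values are made up for illustration.

```python
action_space = ["article-1", "article-2", "article-3"]


def to_vw_example_format(context, cb_label=None):
    """Build one multi-line example; only the chosen action's line gets a label."""
    chosen_action, cost, prob = cb_label if cb_label is not None else (None, None, None)
    lines = ["shared |User device={}".format(context)]
    for action in action_space:
        # The label has the form action:cost:probability, as in the question's code.
        prefix = "1:{}:{} ".format(cost, prob) if action == chosen_action else ""
        lines.append(prefix + "|Action ad={}".format(action))
    return "\n".join(lines)


# Illustrative values: the user clicked (reward 1), so the cost is -1.
reward = 1
cost = -1 if reward == 1 else 0
example = to_vw_example_format("device-3", ("article-3", cost, 0.8))
print(example)
```

Only the line for the article that was actually shown carries a label; the other action lines stay unlabeled because their outcome was never observed.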
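The problem:
Because the CTRs are quite small, many playouts are needed for convergence, so I understand that the problem is hard. Still, I would expect the algorithm to find the optimum over time.
Am I missing something in the VowpalWabbit configuration?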
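If all policies agree, "--bag 5" without an epsilon parameter can start producing zero probabilities.
I modified your code slightly to track the probabilities returned by actor.predict.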
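The modified code is not shown here, so the following is only a guess at what such tracking could look like. OptimalProbabilityTracker and optimal_action_index are hypothetical names; the index mapping is derived from the optimal_action dictionary in the question, and the probability lists fed in below are invented stand-ins for the output of actor.predict.

```python
from collections import defaultdict, deque

# Index of the optimal article per device, derived from the question's
# optimal_action mapping (article-1 -> 0, article-2 -> 1, article-3 -> 2).
optimal_action_index = {
    "device-1": 1, "device-2": 0, "device-3": 2,
    "device-4": 2, "device-5": 0, "device-6": 1,
}


class OptimalProbabilityTracker:
    """Hypothetical helper: keeps the last `window` probabilities that the
    learner assigned to the optimal action of each device."""

    def __init__(self, window):
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, device, action_probabilities):
        best = optimal_action_index[device]
        self.history[device].append(action_probabilities[best])

    def moving_average(self, device):
        values = self.history[device]
        return sum(values) / len(values) if values else None


tracker = OptimalProbabilityTracker(window=3)
# Invented predictions for device-1; the last ones have collapsed to [1, 0, 0].
for probs in ([0.2, 0.6, 0.2], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]):
    tracker.record("device-1", probs)
print(tracker.moving_average("device-1"))
```

In the training loop, record would be called with the device and the list returned by actor.predict at every step, and the moving averages plotted per device.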
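First plot: the moving average of the predicted probability of the "optimal action" for each context. We can see that for several of them the probability is in fact around 0. The tail of the recorded decisions shows that all distributions effectively look like [1,0,0]. There is therefore no chance of recovering from this point, because exploration has basically been switched off.
Adding a small amount of exploration (--bag 5 --epsilon 0.02) helps the learner eventually converge to the global minimum and gives a plot like the following: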
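The effect can be illustrated with a toy model of bagging exploration. This is a simplification and not VowpalWabbit's actual implementation: each of the five policies is assumed to vote for one greedy action, the action probabilities are taken to be the vote fractions, and epsilon is assumed to be mixed in uniformly. The function names are hypothetical.

```python
from collections import Counter


def bag_distribution(greedy_votes, num_actions):
    """Toy model: probability of an action = fraction of policies voting for it."""
    counts = Counter(greedy_votes)
    return [counts[a] / len(greedy_votes) for a in range(num_actions)]


def with_epsilon_floor(distribution, epsilon):
    """Assumed mixing rule: spread epsilon uniformly over all actions."""
    k = len(distribution)
    return [(1 - epsilon) * p + epsilon / k for p in distribution]


# Five policies that disagree still explore a little...
split = bag_distribution([0, 0, 2, 0, 1], num_actions=3)
# ...but once they all agree, the other actions get probability zero.
unanimous = bag_distribution([0, 0, 0, 0, 0], num_actions=3)
# A small epsilon keeps every action reachable.
explored = with_epsilon_floor(unanimous, epsilon=0.02)

print(split)
print(unanimous)
print(explored)
```

Under this model, a unanimous bag never samples the other two articles again, so a wrong early consensus cannot be corrected, while any positive epsilon leaves each article a nonzero chance of being shown.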
Learning does not seem fast, but the problematic contexts are actually the most ambiguous ones and do not cause much regret.
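One way to see which contexts are ambiguous is to compute, for each device, the gap between the best and the second-best CTR in the question's table. This is only a rough proxy, since it is not stated which contexts were the problematic ones; ctr_gap is a hypothetical helper.

```python
# CTR table copied from the question.
context_to_action_ctr = {
    "device-1": {"article-1": 0.05, "article-2": 0.06, "article-3": 0.04},
    "device-2": {"article-1": 0.08, "article-2": 0.07, "article-3": 0.05},
    "device-3": {"article-1": 0.01, "article-2": 0.04, "article-3": 0.09},
    "device-4": {"article-1": 0.04, "article-2": 0.04, "article-3": 0.045},
    "device-5": {"article-1": 0.09, "article-2": 0.01, "article-3": 0.07},
    "device-6": {"article-1": 0.03, "article-2": 0.09, "article-3": 0.04},
}


def ctr_gap(ctrs):
    """Expected reward lost per step by playing the runner-up instead of the best."""
    best, runner_up = sorted(ctrs.values(), reverse=True)[:2]
    return best - runner_up


gaps = {device: ctr_gap(ctrs) for device, ctrs in context_to_action_ctr.items()}

# List devices from most ambiguous (smallest gap) to least ambiguous.
for device, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print("{}: gap = {:.3f}".format(device, gap))
```

A small gap means the two best articles are hard to tell apart from click data, but it also means that choosing the runner-up costs little expected reward per step.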