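<p>Steven,<br/>
<em>(I'm writing this 7 months after your post. Hopefully you'll reply and share some of your own insights, along with the code you used to get them. I started the AI Crash Course at the end of November 2020 and, like you, I was curious about Thompson sampling in chapter 5. In my case, I was mainly interested in the cases where Thompson sampling does not pick the best machine. I was curious how often the "worst machine" gets selected. So over the past six weeks I've probably tried a thousand different code variations to gain some insight. I've probably made a thousand coding mistakes and gone down a hundred different rabbit holes trying to "catch" the cases where Thompson doesn't work, and trying to understand how the betaRandom function and the adding of posRewards and negRewards work. There are likely mistakes in the code below, and the overall approach to gaining insight could be made more graphical, so please be kind. :-)</em></p>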
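<p>curlycharcoal provided some insight in a thoughtful answer. Hadelin also offers the reader many insights in the same chapter. What follows is an attempt at an iterative, "catch the error" approach that helped me gain some insight. We can try the code below and compare the results of posReward+negReward against posRewardOnly.</p>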
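<p>Consider the following. First, insert a few lines of code that accumulate only the positive rewards (posRewardOnly). Also insert some additional arguments into the concluding print statements so you can see the results of both approaches. You can also insert the true values of the conversion rates (i.e. the mean of the X values), so you can show the conversion rates that were actually used. Remove the multiple print statements about how often the other machines were selected, just to clean up the output.</p>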
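<p>Second, wrap most of Hadelin's original code in one big loop, then iterate over that loop. Because we inserted the posRewardOnly results into the concluding print, you can compare the results obtained when negative rewards are added against the results obtained when the best machine is chosen from positive rewards only. (You can think of this outer loop as a crude "AI" test environment in which you gain insight into which approach performs better.)</p>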
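<p>We could even insert an array that records, on each iteration, whether the correct machine was picked, comparing the pos+neg approach against the positive-only approach, and plot it at the end. (I haven't done this, but it would be nice to see.)</p>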
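<p>We could also insert an array that tracks the raw betaRandom selections in the inner loop and compares them with the actual best machine, to watch how the drunkard chooses at each time step, eventually sobers up, and selects the best machine once N is large enough (typically a few thousand time steps, N above 5000).</p>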
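<p>A minimal sketch of such a per-step trace, under stated assumptions: the name <code>thompson_trace</code>, the seed, and the block size of 1000 steps are hypothetical choices made for illustration, payouts are simulated on the fly rather than read from a precomputed X table, and "best" means the machine with the highest target conversion rate. It records which machine is played at each step and reports how often the best machine was played in each block of steps.</p>

```python
import numpy as np

def thompson_trace(conversion_rates, n_steps, seed=0):
    """Run one Thompson sampling round and record the machine played at each step."""
    rng = np.random.default_rng(seed)
    d = len(conversion_rates)
    n_pos = np.zeros(d)
    n_neg = np.zeros(d)
    choices = np.empty(n_steps, dtype=int)
    for t in range(n_steps):
        # One Beta draw per machine; play the machine with the largest draw.
        draws = rng.beta(n_pos + 1, n_neg + 1)
        selected = int(np.argmax(draws))
        choices[t] = selected
        # Simulate the payout of the selected machine only (assumption:
        # simulated on the fly instead of looked up in an X table).
        if rng.random() < conversion_rates[selected]:
            n_pos[selected] += 1
        else:
            n_neg[selected] += 1
    return choices

rates = [0.15, 0.04, 0.13, 0.11, 0.05]
choices = thompson_trace(rates, n_steps=5000)
best = int(np.argmax(rates))

# Fraction of plays that went to the best machine in each block of 1000 steps.
hits = (choices == best).reshape(5, 1000).mean(axis=1)
print(hits)
```

<p>Plotting <code>hits</code>, or a running mean of <code>choices == best</code>, would give the graphical view mentioned above.</p>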
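<p>In addition, we can check how often the best of the five machines is not selected, which gives some insight into the overall error rate of Thompson sampling. Interestingly, with N=600 the best machine is sometimes missed as much as 25% of the time, and occasionally even the worst machine is selected (although rarely).</p>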
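<p>One way to measure this could look like the sketch below. The helper names <code>pick_most_played</code> and <code>error_rates</code>, the seed, and the 100 trials are assumptions made for illustration. Here "best" and "worst" are judged against the target conversion rates, not against the sample mean of an X table, and payouts are simulated on the fly.</p>

```python
import numpy as np

def pick_most_played(rates, n_steps, rng):
    """One Thompson sampling run; return the index of the most-played machine."""
    d = len(rates)
    n_pos = np.zeros(d)
    n_neg = np.zeros(d)
    for _ in range(n_steps):
        # Play the machine with the largest Beta draw and update only that machine.
        selected = int(np.argmax(rng.beta(n_pos + 1, n_neg + 1)))
        if rng.random() < rates[selected]:
            n_pos[selected] += 1
        else:
            n_neg[selected] += 1
    return int(np.argmax(n_pos + n_neg))

def error_rates(rates, n_steps=600, n_trials=100, seed=0):
    """Fraction of runs that miss the best machine, and that land on the worst one."""
    rng = np.random.default_rng(seed)
    best, worst = int(np.argmax(rates)), int(np.argmin(rates))
    picks = np.array([pick_most_played(rates, n_steps, rng)
                      for _ in range(n_trials)])
    return float(np.mean(picks != best)), float(np.mean(picks == worst))

rates = [0.15, 0.04, 0.13, 0.11, 0.05]
missed_best, picked_worst = error_rates(rates)
print(f"best machine missed in {missed_best:.0%} of runs")
print(f"worst machine picked in {picked_worst:.0%} of runs")
```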
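<p>Also, as curlycharcoal pointed out, a negative reward is not assigned to every machine at each of the N steps. A machine receives a negative reward only when its betaRandom draw is the maximum, so that it is the machine selected to provide the "sample", and that sample turns out to be a loss. That said, if you play with the code below, you may find that your posRewardOnly idea performs better and converges faster than pos+neg rewards... or does it? ;-)</p>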
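<p>To make that point concrete, here is a small sketch (the seed and the vectorized construction of X are assumptions, and the variable names are chosen for illustration). It counts the losses that were actually recorded for each machine and compares them with the losses sitting in the full payout table. Only one machine is updated per step, so the recorded counts add up to N.</p>

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen arbitrarily
rates = np.array([0.15, 0.04, 0.13, 0.11, 0.05])
N, d = 200, len(rates)

# Payout table: X[t, j] == 1 if machine j would pay out at step t.
X = (rng.random((N, d)) < rates).astype(int)

n_pos = np.zeros(d)
n_neg = np.zeros(d)
for t in range(N):
    draws = rng.beta(n_pos + 1, n_neg + 1)
    selected = int(np.argmax(draws))
    # Only the selected machine's counters change at this step.
    if X[t, selected] == 1:
        n_pos[selected] += 1
    else:
        n_neg[selected] += 1

losses_in_table = (X == 0).sum(axis=0)  # losses each machine would have had
print("losses recorded:", n_neg)
print("losses in table:", losses_in_table)
print("total plays:", int((n_pos + n_neg).sum()))  # one play per step, so N
```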
<hr/>
<pre><code>######################################################################
# Try and catch when Thompson fails:
# Also, compare performance of selecting
# based on negRewards+posRewards vs just posRewards
# 03 Jan 2021 JFB
#
#######################################################################
import numpy as np
np.set_printoptions(precision=2)
# Note: the following are the target conversion rates.
# Further down, please remember to compare the actual rates against the selected machine.
# Also, in later versions reorder the rates from low to high and vice versa
# to determine if there is some "greedy Thompson" bias
# based on the order of the best rates.
conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]  # Hadelin's AI Crash Course
N = 200
# Increasing N improves the result; Hadelin explains this in the same chapter.
# I've found that N = 10,000 results in about 1% error,
# 2000 in about 20% error, give or take, when using
# Hadelin's original conversion rates above.
# N = 100 results in about 48% error,
# and posRewards + negRewards disagree with posRewardOnly by a varying percentage.
# My initial sampling indicates it will be tricky to determine which
# performs better over a variety of situations. But Hadelin provides code
# to create "tests" with various input states and policies.
catchLimit = 100
d = len(conversionRates)
wrong = 0.0
pcntWrong = 0.0
selectedWrong = 0.0
posOnlyWrong = 0.0
pcntPosOnlyWrong = 0.0
posOnlyVsActual = 0.0
pcntPosOnlyVsActual = 0.0
nSelectedArgMax = -1
nSelectedArgMaxPosOnly = -1
# Run catchLimit independent trials (range(1, catchLimit) would run one too few
# for the percentages computed after the loop).
for ii in range(catchLimit):
    ################ Original X generator ##########################
    # Create the table of bandit payouts at each time step:
    # five columns (one per machine), N rows (one per time step).
    # A value of 1 means the slot machine pays out
    # if you select that machine at that point in time.
    # This can be improved so we can order the rates
    # best to worst, and vice versa.
    #
    X = np.zeros((N, d))
    for i in range(N):
        for j in range(d):
            if np.random.rand() < conversionRates[j]:
                X[i][j] = 1
    Xmean = X.mean(axis=0)
    ############## end of the Original X generator ###################
    # Arrays that count wins and losses taken from the payout table.
    nPosReward = np.zeros(d)
    nNegReward = np.zeros(d)
    # At each time step, draw one beta sample per machine, play the machine
    # with the largest draw, and update only that machine's wins or losses.
    # The goal is to determine which slot machine is the best,
    # because sometimes the best slot machine isn't found.
    for i in range(N):
        selected = 0
        maxRandom = 0
        for j in range(d):
            randomBeta = np.random.beta(nPosReward[j] + 1,
                                        nNegReward[j] + 1)
            if randomBeta > maxRandom:
                maxRandom = randomBeta
                selected = j
        if X[i][selected] == 1:
            nPosReward[selected] += 1
        else:
            nNegReward[selected] += 1
    nSelected = nPosReward + nNegReward
    nSelectedPosOnly = nPosReward
    nSelectedArgMax = np.argmax(nSelected) + 1
    nSelectedArgMaxPosOnly = np.argmax(nSelectedPosOnly) + 1
    XMeanArgMax = np.argmax(Xmean) + 1  # find the actual best slot machine in this X table
    if (nSelectedArgMax != XMeanArgMax and
            XMeanArgMax != nSelectedArgMaxPosOnly):
        #for i in range(d):
            #print('Machine number ' + str(i+1) + ' was selected ' + str(nSelected[i]) + ' times')
        print('Fail: Pos&Neg predct slot ' + str(nSelectedArgMax),
              'posOnly predct ' + str(nSelectedArgMaxPosOnly),
              'But Xconv rates', Xmean, 'actual best=', XMeanArgMax, '<>')
        wrong += 1
    elif (nSelectedArgMax != XMeanArgMax and
            XMeanArgMax == nSelectedArgMaxPosOnly):
        print('PosOnly==Actual pos&neg ' + str(nSelectedArgMax),
              'posOnly predct ' + str(nSelectedArgMaxPosOnly),
              'But Xconv rates', Xmean, 'actual best=', XMeanArgMax, '*')
        selectedWrong += 1
    elif (nSelectedArgMax == XMeanArgMax and
            XMeanArgMax != nSelectedArgMaxPosOnly):
        print('PosNeg==Actual predcts ' + str(nSelectedArgMax),
              'posOnly predct ' + str(nSelectedArgMaxPosOnly),
              'But Xconv rates', Xmean, 'actual best=', XMeanArgMax, '***')
        posOnlyWrong += 1
    # Note: this last branch is never reached, because the case where both
    # methods agree but miss the actual best is already caught by the first branch.
    elif (nSelectedArgMax == nSelectedArgMaxPosOnly and
            XMeanArgMax != nSelectedArgMax):
        print('PosNeg == PosOnly but != actual ' + str(nSelectedArgMax),
              'posOnly predct ' + str(nSelectedArgMaxPosOnly),
              'But Xconv rates', Xmean, 'actual best=', XMeanArgMax, '<>')
        wrong += 1
pcntWrong = wrong / catchLimit * 100
pcntSelectedWrong = selectedWrong / catchLimit * 100
pcntPosOnlyVsActual = posOnlyWrong / catchLimit * 100
print('Catch Limit =', catchLimit, 'N=', N)
print('<>wrong: pos+neg != Actual, and PosOnly != Actual Failure Rate= %.1f' %pcntWrong, '%')
print('* PosOnly == Actual but Actual != pos+neg Failure rate = %.1f' %pcntSelectedWrong,'%')
print('*** pos+Neg == Actual but Actual != PosOnly Failure rate = %.1f' %pcntPosOnlyVsActual, '%')
############# END #################
</code></pre>
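<p>The comments in the code above suggest reordering the conversion rates from low to high and vice versa to look for an order-dependent bias. A rough sketch of that experiment follows. The function name <code>failure_rate</code>, the seed, and the trial counts are assumptions made for illustration, payouts are simulated on the fly, and the "best" machine is judged against the target rates. With only 100 trials per ordering, small differences between the two numbers are likely to be noise.</p>

```python
import numpy as np

def failure_rate(rates, n_steps, n_trials, rng):
    """Fraction of runs where the most-played machine is not the best one."""
    rates = np.asarray(rates)
    d = len(rates)
    best = int(np.argmax(rates))
    failures = 0
    for trial in range(n_trials):
        n_pos = np.zeros(d)
        n_neg = np.zeros(d)
        for step in range(n_steps):
            # Play the machine with the largest Beta draw, update only that one.
            selected = int(np.argmax(rng.beta(n_pos + 1, n_neg + 1)))
            if rng.random() < rates[selected]:
                n_pos[selected] += 1
            else:
                n_neg[selected] += 1
        if int(np.argmax(n_pos + n_neg)) != best:
            failures += 1
    return failures / n_trials

rng = np.random.default_rng(0)
rates = [0.15, 0.04, 0.13, 0.11, 0.05]
orderings = {
    "low to high": sorted(rates),
    "high to low": sorted(rates, reverse=True),
}
results = {}
for label, ordered in orderings.items():
    results[label] = failure_rate(ordered, n_steps=200, n_trials=100, rng=rng)
    print(f"{label}: best machine missed in {results[label]:.0%} of runs")
```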