Thompson Sampling: Adding Positive and Negative Rewards in Python for Artificial Intelligence


In Chapter 5 of AI Crash Course, the author writes:

<pre><code>nSelected = nPosReward + nNegReward
for i in range(d):
    print('Machine number ' + str(i + 1) + ' was selected ' + str(nSelected[i]) + ' times')
print('Conclusion: Best machine is machine number ' + str(np.argmax(nSelected) + 1))
</code></pre>

Why is the number of negative rewards added to the number of positive rewards? To find the best machine, shouldn't we only care about the machine with the highest rate of return? I don't understand why we add the negative rewards to the positive ones.

I also understand that this is a simulation in which the success rates are assigned in advance and the successes are then generated at random. In real life, however, how would you know each slot machine's success rate ahead of time? How do you know which machines should be assigned a "1"?

Thank you very much! Here is the full code:

<pre><code># Importing the libraries
import numpy as np

# Setting conversion rates and the number of samples
conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]
N = 10000
d = len(conversionRates)

# Creating the dataset
X = np.zeros((N, d))
for i in range(N):
    for j in range(d):
        if np.random.rand() < conversionRates[j]:
            X[i][j] = 1

# Making arrays to count our losses and wins
nPosReward = np.zeros(d)
nNegReward = np.zeros(d)

# Taking our best slot machine through beta distribution and updating its losses and wins
for i in range(N):
    selected = 0
    maxRandom = 0
    for j in range(d):
        randomBeta = np.random.beta(nPosReward[j] + 1, nNegReward[j] + 1)
        if randomBeta > maxRandom:
            maxRandom = randomBeta
            selected = j
    if X[i][selected] == 1:
        nPosReward[selected] += 1
    else:
        nNegReward[selected] += 1

# Showing which slot machine is considered the best
nSelected = nPosReward + nNegReward
for i in range(d):
    print('Machine number ' + str(i + 1) + ' was selected ' + str(nSelected[i]) + ' times')
print('Conclusion: Best machine is machine number ' + str(np.argmax(nSelected) + 1))
</code></pre>
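For reference, here is a small sketch (not from the book) of what the sum in the quoted code tracks. In each round exactly one machine is pulled and exactly one of its two counters is incremented, so `nPosReward + nNegReward` is simply the number of times each machine was selected. The function name `run_thompson`, the snake_case variable names, and the fixed seed are assumptions made for illustration. Unlike the book's code, this sketch draws the reward only for the machine that was pulled instead of pre-generating the whole table `X`; that is closer to how a live system would observe rewards without knowing the rates in advance.

```python
import numpy as np

def run_thompson(conversion_rates, n_rounds, seed=0):
    """Thompson sampling with Beta(n_pos + 1, n_neg + 1) draws per machine.

    Illustrative only: the reward is sampled on the fly for the pulled
    machine, which is an assumption of this sketch, not the book's setup.
    """
    rng = np.random.default_rng(seed)
    d = len(conversion_rates)
    n_pos = np.zeros(d)
    n_neg = np.zeros(d)
    for _ in range(n_rounds):
        # One beta draw per machine; pull the machine with the largest draw
        samples = rng.beta(n_pos + 1, n_neg + 1)
        selected = int(np.argmax(samples))
        # Only the pulled machine's counters change in this round
        if rng.random() < conversion_rates[selected]:
            n_pos[selected] += 1
        else:
            n_neg[selected] += 1
    return n_pos, n_neg

n_pos, n_neg = run_thompson([0.15, 0.04, 0.13, 0.11, 0.05], n_rounds=10000)
n_selected = n_pos + n_neg                          # pulls per machine
observed_rate = n_pos / np.maximum(n_selected, 1)   # wins / pulls, guarded against 0 pulls
print('Pulls per machine:', n_selected)
print('Observed conversion rates:', np.round(observed_rate, 3))
```

The pull counts always sum to the number of rounds. The observed rate for a rarely pulled machine rests on few samples, so it is a noisier signal than the pull count.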


1 Answer

Anonymous, 1 day ago

Steven,

I am writing this 7 months after your post. I hope you will reply and share some of your own insights, along with the code you used to get them. I started AI Crash Course at the end of November 2020 and, like you, I was curious about the Thompson sampling in Chapter 5. In my case I was mainly interested in the runs where Thompson sampling does not select the best machine, and I was curious how often the "worst machine" gets selected. So over the past six weeks I have tried perhaps a thousand code variations to gain some insight. I have probably made a thousand coding mistakes and gone down a hundred rabbit holes trying to "catch" the cases where Thompson fails, and trying to understand how the betaRandom function and the addition of posRewards and negRewards work. The code below may contain errors, and the overall approach to gaining insight could be made more graphical, so please be kind. :-)

curlycharcoal offered some insight in his thoughtful answer. Even Hadelin, in the same chapter, gives the reader a lot of insight. What follows is an attempt at an "iterate and catch the errors" approach that helped me gain some insight. We can try the code below and compare the results of posReward + negReward against posRewardOnly.

Consider the following:

First, insert a few lines of code that accumulate only the posRewards. Also add some extra arguments to the conclusion print statements so that you can see both results. You can also insert the true conversion rates (that is, the mean of the X values), so that you can show the conversion rates actually used. Remove the multiple print statements about the other machines' selection counts, just to clean up the output.

Second, wrap most of Hadelin's original code in a big loop and iterate over it. Because the posRewardOnly result is now part of the conclusion print, you can compare the result when negative rewards are added with the result when the best machine is chosen from positive rewards alone. (You can think of this outer loop as a crude "AI" test environment in which you gain insight into which approach performs better.)

We could even insert an array that records, on each iteration, whether the machine was chosen correctly under pos+neg versus pos-only, and plot it at the end. (I did not do this, but it would be nice to see.)

We could also insert an array to track the raw betaRandom selections in the inner loop, compare them with the actual best machine, and watch how the drunkard chooses at each time step, eventually sobers up, and picks the best machine once N is large enough (usually a few thousand time steps, N &gt; 5000).

In addition, we can count how often the best of the five machines is not selected, which gives some idea of the overall error rate of Thompson sampling. With N=600, interestingly, sometimes as many as 25% of the runs fail to select the best machine, and occasionally even the worst machine is selected (though rarely).

Also, as curlycharcoal pointed out, negative rewards are not assigned to every machine at each of the N steps. A machine can receive a negative reward only at a step where its betaRandom draw is the maximum, so that it is the machine selected to provide the "sample". That said, if you play with the code below you may find that your posOnlyReward idea performs better and converges faster than pos+neg rewards... or does it? ;-)

<hr/>

<pre><code>######################################################################
# Try to catch the runs where Thompson sampling fails.
# Also compare the performance of selecting the best machine
# based on negRewards + posRewards vs. just posRewards.
# 03 Jan 2021 JFB
######################################################################
import numpy as np

np.set_printoptions(precision=2)

# Note: the following are the target conversion rates.
# Further down, remember to compare the actual rates against the selected machine.
# Also, in later versions, reorder the rates from low to high and vice versa
# to determine whether there is some "greedy Thompson" bias
# based on the order of the best rates.
conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]  # Hadelin's AI Crash Course

N = 200
# Increasing N improves the result; Hadelin explains this in the same chapter.
# I've found that 10,000 results in about 1% error and
# 2000 in about 20% error, give or take, when using
# Hadelin's original conversion rates above.
# 100 results in about 48% error,
# and posRewards + negRewards disagrees with posRewardOnly a varying percent of the time.
# My initial sampling indicates it will be tricky to determine which
# performs better over a variety of situations. But Hadelin provides code
# to create "tests" with various input states and policies.

catchLimit = 100  # number of repeated experiments
d = len(conversionRates)

wrong = 0          # neither pos+neg nor posOnly found the actual best machine
selectedWrong = 0  # pos+neg missed the actual best machine, posOnly found it
posOnlyWrong = 0   # posOnly missed the actual best machine, pos+neg found it

for ii in range(catchLimit):

    ################ Original X generator ##########################
    # Creating the set of bandit payouts at each time t.
    # Five columns, many rows.
    # A value of 1 means the slot machine paid out
    # if you selected that machine at this point in time.
    # This can be improved upon so we can order
    # the machines from best to worst, and vice versa.
    X = np.zeros((N, d))
    for i in range(N):
        for j in range(d):
            if np.random.rand() < conversionRates[j]:
                X[i][j] = 1
    Xmean = X.mean(axis=0)  # conversion rates actually realized in this run
    ############## End of the original X generator ###################

    # Make arrays to count rewards from the table of losses and wins.
    nPosReward = np.zeros(d)
    nNegReward = np.zeros(d)

    # Take the slot machines through the beta distribution and update
    # the losses and wins of whichever one is selected, with the goal of
    # determining which slot machine is the best,
    # because sometimes the best slot machine isn't found.
    for i in range(N):
        selected = 0
        maxRandom = 0
        for j in range(d):
            randomBeta = np.random.beta(nPosReward[j] + 1, nNegReward[j] + 1)
            if randomBeta > maxRandom:
                maxRandom = randomBeta
                selected = j
        if X[i][selected] == 1:
            nPosReward[selected] += 1
        else:
            nNegReward[selected] += 1

    nSelected = nPosReward + nNegReward
    nSelectedPosOnly = nPosReward

    nSelectedArgMax = np.argmax(nSelected) + 1
    nSelectedArgMaxPosOnly = np.argmax(nSelectedPosOnly) + 1
    XMeanArgMax = np.argmax(Xmean) + 1  # find the actual true best slot machine

    # The case where both criteria agree with each other but not with the
    # actual best machine is covered by the first branch.
    if nSelectedArgMax != XMeanArgMax and XMeanArgMax != nSelectedArgMaxPosOnly:
        print('Fail: Pos&Neg predict slot ' + str(nSelectedArgMax),
              'posOnly predict ' + str(nSelectedArgMaxPosOnly),
              'But Xconv rates', Xmean, 'actual best=', XMeanArgMax, '<>')
        wrong += 1
    elif nSelectedArgMax != XMeanArgMax and XMeanArgMax == nSelectedArgMaxPosOnly:
        print('PosOnly==Actual pos&neg ' + str(nSelectedArgMax),
              'posOnly predict ' + str(nSelectedArgMaxPosOnly),
              'But Xconv rates', Xmean, 'actual best=', XMeanArgMax, '*')
        selectedWrong += 1
    elif nSelectedArgMax == XMeanArgMax and XMeanArgMax != nSelectedArgMaxPosOnly:
        print('PosNeg==Actual predicts ' + str(nSelectedArgMax),
              'posOnly predict ' + str(nSelectedArgMaxPosOnly),
              'But Xconv rates', Xmean, 'actual best=', XMeanArgMax, '***')
        posOnlyWrong += 1

# Summarize the failure rates over all experiments
pcntWrong = wrong / catchLimit * 100
pcntSelectedWrong = selectedWrong / catchLimit * 100
pcntPosOnlyWrong = posOnlyWrong / catchLimit * 100

print('Catch Limit =', catchLimit, 'N=', N)
print('<>  wrong: pos+neg != Actual, and PosOnly != Actual Failure rate = %.1f' % pcntWrong, '%')
print('*   PosOnly == Actual but Actual != pos+neg Failure rate = %.1f' % pcntSelectedWrong, '%')
print('*** pos+neg == Actual but Actual != PosOnly Failure rate = %.1f' % pcntPosOnlyWrong, '%')
############# END #################
</code></pre>
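The per-step tracking idea mentioned in this answer (recording which machine the beta draws pick at each time step and comparing it with the best machine) could be sketched roughly as follows. This is only an illustration under stated assumptions: `track_selections`, `picks`, `block_share`, the block size of 500, and the fixed seed are hypothetical names and choices, not part of the answer's code. Here "best" means the machine with the highest target conversion rate rather than the highest realized `Xmean`, and rewards are drawn on the fly for the pulled machine only.

```python
import numpy as np

def track_selections(conversion_rates, n_rounds, seed=0):
    """Run Thompson sampling and record which machine is pulled at each step."""
    rng = np.random.default_rng(seed)
    rates = np.asarray(conversion_rates, dtype=float)
    d = len(rates)
    n_pos = np.zeros(d)
    n_neg = np.zeros(d)
    picks = np.empty(n_rounds, dtype=int)
    for t in range(n_rounds):
        # The machine with the largest beta draw is pulled at step t
        picks[t] = int(np.argmax(rng.beta(n_pos + 1, n_neg + 1)))
        # Reward is sampled for the pulled machine only (assumption of this sketch)
        if rng.random() < rates[picks[t]]:
            n_pos[picks[t]] += 1
        else:
            n_neg[picks[t]] += 1
    return picks

rates = [0.15, 0.04, 0.13, 0.11, 0.05]
picks = track_selections(rates, n_rounds=5000)
best = int(np.argmax(rates))  # index of the machine with the highest target rate

# Share of pulls that went to the best machine in each block of 500 steps
hit = (picks == best).astype(float)
block_share = hit.reshape(-1, 500).mean(axis=1)
print('Share of pulls on the best machine per 500-step block:')
print(np.round(block_share, 2))
```

Plotting `block_share` (or a running mean of `hit`) against the step number would give the graphical view of how the selection settles down over time.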
