Thompson Sampling: adding positive and negative rewards to an AI in Python

Published 2024-09-27 21:31:09


In Chapter 5 of AI Crash Course, the author writes:

nSelected = nPosReward + nNegReward

for i in range(d):
    print('Machine number ' + str(i + 1) + ' was selected ' + str(nSelected[i]) + ' times')
print('Conclusion: Best machine is machine number ' + str(np.argmax(nSelected) + 1))

Why is the number of negative rewards added to the number of positive rewards? To find the best machine, shouldn't we only look at which machine has the highest number of positive rewards? I don't understand why we add the negative rewards to the positive ones. I also understand that this is a simulation in which you assign the success rates yourself, in advance and at random. In real life, though, how would you know each slot machine's success rate ahead of time? How would you know which samples should be assigned a "1"? Thank you very much! Here is the full code:

# Importing the libraries
import numpy as np

# Setting conversion rates and the number of samples
conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]
N = 10000
d = len(conversionRates)


# Creating the dataset
X = np.zeros((N, d))

for i in range(N):

    for j in range(d):
        if np.random.rand() < conversionRates[j]:
            X[i][j] = 1


# Making arrays to count our losses and wins
nPosReward = np.zeros(d)
nNegReward = np.zeros(d)


# Taking our best slot machine through beta distribution and updating its losses and wins
for i in range(N):
    selected = 0
    maxRandom = 0


    for j in range(d):
        randomBeta = np.random.beta(nPosReward[j] + 1, nNegReward[j] + 1)
        if randomBeta > maxRandom:
            maxRandom = randomBeta
            selected = j


    if X[i][selected] == 1:
        nPosReward[selected] += 1
    else:
        nNegReward[selected] += 1

# Showing which slot machine is considered the best

nSelected = nPosReward + nNegReward 

for i in range(d):
    print('Machine number ' + str(i + 1) + ' was selected ' + str(nSelected[i]) + ' times')
print('Conclusion: Best machine is machine number ' + str(np.argmax(nSelected) + 1))

2 Answers

As it gathers more and more feedback, Thompson sampling shifts its focus from exploration toward exploitation. That is, for large values of nSelected (which come with a large N), each beta distribution becomes tightly concentrated around its mean (nPosReward[i] / nSelected[i]), so in later iterations Thompson sampling picks, with ever-increasing probability, the machine it believes to be the most valuable. Over a long enough horizon, the probability that the machine considered best is also the most frequently selected machine is pushed toward 1.
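That concentration is easy to see directly. A standalone sketch (my own, not from the book), sampling the beta posterior for a hypothetical machine with a true success rate of about 0.15 after increasingly many plays:

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior spread for one machine (~15% success rate)
# after n selections: Beta(wins + 1, losses + 1)
for n in (10, 100, 10000):
    pos = int(0.15 * n)        # observed wins
    neg = n - pos              # observed losses
    draws = rng.beta(pos + 1, neg + 1, size=100_000)
    print(f'n={n:6d}  mean={draws.mean():.3f}  std={draws.std():.4f}')
```

The standard deviation shrinks roughly like 1/sqrt(n), so for large N the draws sit right on top of the empirical mean and the best machine wins almost every comparison.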

In short, your intuition is correct: the machine with the highest expected value (given the feedback observed so far) is the one with the highest empirical mean. Because of the concentration phenomenon just described, if you run the algorithm long enough, the most frequently picked machine and the machine with the highest expected reward will coincide with probability approaching 1.
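A minimal sketch of that agreement, using hypothetical end-of-run counts (not actual output of the book's code) in which most pulls have piled onto machine 1:

```python
import numpy as np

# Hypothetical final counts after a long Thompson-sampling run
nPosReward = np.array([1300.0, 2.0, 40.0, 25.0, 3.0])
nNegReward = np.array([7600.0, 30.0, 260.0, 210.0, 30.0])

nSelected = nPosReward + nNegReward       # how often each machine was played
empiricalRate = nPosReward / nSelected    # observed success rate per machine

mostPlayed = np.argmax(nSelected) + 1     # ranking by selection count
bestRate = np.argmax(empiricalRate) + 1   # ranking by empirical success rate
print(mostPlayed, bestRate)               # both identify machine 1
```

So `np.argmax(nSelected)` and `np.argmax(nPosReward / nSelected)` point at the same machine once the counts are this lopsided, which is why the book can get away with ranking by selections alone.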

Regarding the second part of your question: we do not know the success rates. If we did, the optimal algorithm would simply always pick the machine with the highest success rate. What we do in real life is observe the outputs of these random processes. For example, when you show online ads, you don't know the probability that a given person clicks. But assuming everyone behaves in a similar way, by showing the ad and observing whether people click, we can quickly learn the success rate.
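A standalone sketch of that idea (the "true" rate below exists only to simulate users; the learner sees nothing but the clicks):

```python
import numpy as np

rng = np.random.default_rng(42)

trueRate = 0.11                        # unknown in real life
shows = 20_000                         # times the ad was displayed
clicks = rng.random(shows) < trueRate  # True where a simulated user clicked

# All we can compute from observations is the empirical click rate
estimate = clicks.mean()
print(f'estimated click rate: {estimate:.3f}')
```

With enough impressions the empirical rate converges to the true one; Thompson sampling just does this estimation for several machines at once while steering traffic toward the promising ones.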

Steven,
(I'm writing this seven months after your post. I hope you'll reply and share some insights of your own, along with the code you used to get them. I started AI Crash Course at the end of November 2020 and, like you, was curious about the Thompson sampling in Chapter 5. In my case, I was mainly interested in the cases where Thompson sampling does not pick the best machine. I was curious how often the "worst machine" gets selected. So over the past six weeks I've probably tried a thousand different variations of the code to gain some insight. I've probably made a thousand coding mistakes and gone down a hundred different rabbit holes trying to "catch" the cases where Thompson sampling fails, and also to understand the beta-random function and how adding posRewards and negRewards works. There may be errors in the code below, and the overall approach to gaining insight could be made more graphical, so please be kind. :-)

curlycharcoal provides some insight in his thoughtful answer, and even Hadelin offers plenty of insight in the same chapter. What follows is an attempt at an "iterate and catch the failures" approach that helped me learn a few things. We can run the code below and compare the results of posReward + negReward against posReward only.

Consider the following. First, insert a few lines of code that accumulate the positive rewards only. Also, insert some additional arguments into the concluding print statement so you can see the results of both approaches, and include the true conversion rates (i.e., the column means of X) so you can display the rates actually used. Remove the repeated print statements about the other machines' selections, just to clean up the output.

Second, wrap a big loop around most of Hadelin's original code and iterate over it. Since the posRewardOnly result is inserted into the concluding print, you can compare the result of adding negative rewards against the result of selecting the best machine from positive rewards only. (You can think of this outer loop as a crude "AI" test harness that gives you insight into which approach performs better.)

We could even insert an array that records, on each iteration, whether the negative-plus-positive approach or the positive-only approach picked the correct machine, and plot it at the end. (I didn't do this, but it would be nice to see.)

We could also insert an array to track the raw beta-random selections in the inner loop, compare them against the actual best machine, and watch how the "drunkard" stumbles around at each time step, eventually sobering up and picking the best machine once N is large enough (usually a few thousand time steps, N > 5000).

In addition, we can check how often, across the five machines, the best machine was not selected (which gives some insight into the overall error rate of Thompson sampling). Interestingly, with N = 600 the best machine is sometimes missed up to 25% of the time, and occasionally (though rarely) the worst machine is even selected.

Also, as curlycharcoal pointed out, a negative reward isn't assigned on every pass through N for every machine: a negative reward is only assigned when the beta-random draw for a machine comes back as the maximum, so that machine is selected to provide a "sample". That said, if you run the code below you may find that your posRewardOnly idea performs better and converges faster than pos + neg rewards... or does it? ;-)



######################################################################
# Try and catch when Thompson fails:
# Also, compare performance of selecting 
# based on negRewards+posRewards vs just posRewards
# 03 Jan 2021 JFB
#
#######################################################################
import numpy as np
np.set_printoptions(precision=2)

# Note the following are the target conversion rates.
# Further down, please remember to compare actual rates against the selected machine.
# Also, in later versions reorder the rates from low to high and vice versa
# to determine if there is some "greedy Thompson" bias
# based on the order of the best rates.
conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]  # Hadelin's AI Crash Course

N = 200   
# Increasing N improves the result; Hadelin explains this in the same chapter.
# I've found that 10,000 results in about 1% error,
# 2000 in about 20% error, give or take, when using
# Hadelin's original conversion rates above.
# 100 results in about 48% error,
# and posRewards + negRewards disagree with posRewardOnly varying percent,
# my initial sampling of this indicates will be tricky to determine which
# performs better over a variety of situations.  But Hadelin provides code
# to create "tests" with various input states and policies.

catchLimit = 100

d = len(conversionRates)
wrong = 0.0
pcntWrong = 0.0

selectedWrong = 0.0

posOnlyWrong = 0.0
pcntPosOnlyWrong = 0.0

posOnlyVsActual = 0.0
pcntPosOnlyVsActual = 0.0

nSelectedArgMax = -1
nSelectedArgMaxPosOnly = -1

for ii in range(catchLimit):  # run catchLimit trials, matching the divisor below

    ################   Original X generator  ##########################
    # Creating the set of the bandit payouts at each time t:
    # five columns, many rows.
    # A 1 value means the slot machine
    # paid out if you selected that machine at this point in time.
    # This can be improved upon so we can order
    # the rates from best to worst, and vice versa.
    #
    X = np.zeros((N, d))
    for i in range(N):
        for j in range(d):
            if np.random.rand() < conversionRates[j]:
                X[i][j] = 1

    Xmean = X.mean(axis=0)
    ##############  end of the Original X generator  ###################
    
    #make arrays to count  rewards from the table of losses and wins.
    nPosReward = np.zeros(d)
    nNegReward = np.zeros(d)
    
    # Take the slot machines through the beta distribution,
    # updating the selected machine's losses and wins,
    # with the goal of determining which slot machine is the best
    # (because sometimes the best slot machine isn't found).
    for i in range(N):
        selected = 0
        maxRandom = 0
        for j in range(d):
            randomBeta = np.random.beta(nPosReward[j] + 1, 
                                        nNegReward[j] + 1)
            if randomBeta > maxRandom:
                maxRandom = randomBeta
                selected = j
        if X[i][selected] == 1:
            nPosReward[selected] +=1
        else:
            nNegReward[selected] +=1
            
    nSelected = nPosReward + nNegReward
    nSelectedPosOnly = nPosReward
    
    nSelectedArgMax = np.argmax(nSelected) + 1
    nSelectedArgMaxPosOnly = np.argmax(nSelectedPosOnly) + 1
    
    XMeanArgMax = np.argmax(Xmean) + 1  # find the actual true best slot machine

    if ( nSelectedArgMax != XMeanArgMax and
        XMeanArgMax != nSelectedArgMaxPosOnly):
        #for i in range(d):
            #print('Machine number ' + str(i+1) + ' was selected ' + str(nSelected[i]) + ' times')
        print('Fail: Pos&Neg predct slot ' + str(nSelectedArgMax),
              'posOnly predct ' + str(nSelectedArgMaxPosOnly),
             'But Xconv rates', Xmean,'actual best=',XMeanArgMax,'<>' )
        wrong +=1
     
    elif ( nSelectedArgMax != XMeanArgMax and
             XMeanArgMax == nSelectedArgMaxPosOnly):
        
        print('PosOnly==Actual pos&neg ' + str(nSelectedArgMax),
              'posOnly predct ' + str(nSelectedArgMaxPosOnly),
             'But Xconv rates', Xmean,'actual best=',XMeanArgMax,'*' )
        selectedWrong +=1
        
    elif ( nSelectedArgMax == XMeanArgMax and
                 XMeanArgMax != nSelectedArgMaxPosOnly):
        print('PosNeg==Actual predcts ' + str(nSelectedArgMax),
              'posOnly predct ' + str(nSelectedArgMaxPosOnly),
             'But Xconv rates', Xmean,'actual best=',XMeanArgMax,'***' )
        posOnlyWrong +=1
        
    elif ( nSelectedArgMax == nSelectedArgMaxPosOnly and
                 XMeanArgMax != nSelectedArgMax):
        print('PosNeg == PosOnly but != actual ' + str(nSelectedArgMax),
              'posOnly predct ' + str(nSelectedArgMaxPosOnly),
             'But Xconv rates', Xmean,'actual best=',XMeanArgMax,'<>' )
        wrong +=1  
        
pcntWrong = wrong / catchLimit * 100
pcntSelectedWrong = selectedWrong / catchLimit * 100
pcntPosOnlyVsActual = posOnlyWrong / catchLimit * 100

print('Catch Limit =', catchLimit, 'N=', N)
print('<>wrong: pos+neg != Actual, and PosOnly != Actual  Failure Rate=  %.1f' %pcntWrong, '%')
print('* PosOnly == Actual but Actual != pos+neg  Failure rate =  %.1f' %pcntSelectedWrong,'%')
print('*** pos+Neg == Actual but Actual != PosOnly  Failure rate =   %.1f' %pcntPosOnlyVsActual, '%')

############# END #################
