Product of a masked DataFrame

Posted 2024-09-29 06:34:46


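I have a DataFrame with 6M+ observations, in which 20 columns are weights to be applied to a single score column, i.e. Wgt1 * Wgt2 * Wgt3 ... * score. Not every weight applies to every observation, so I created 20 more columns that act as weight masks, i.e. (Wgt1*Msk1) * (Wgt2*Msk2) * (Wgt3*Msk3) ... * score. When a mask is 0 the weight does not apply; when a mask is 1 the weight applies.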

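For each row in the DataFrame, I want to:
1. Check 2 qualifying conditions that indicate whether the row should be processed.
2. Compute the product of the weights whose corresponding mask is set (ttl_wgt).
3. Multiply this product by the score (prob) to get the final weighted score.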

To do this, I wrote a user-defined function:

import functools
import operator
import time

import numpy as np

def mymult(a):
    ttl_wgt = float('NaN')  # initialize to NaN
    if not np.isnan(a['ID']):  # condition 1: only process if an ID is present
        if a['prob'] > -1.0:  # condition 2: only process if the unweighted score is not -1.0
            b = np.where(a[msks] == 1)[0]  # positions of the masks that are 1
            ttl_wgt = functools.reduce(operator.mul, a[np.asarray(wgt_nms)[b]], 1)
    return ttl_wgt

I ran out of memory during development, so I decided to process the data 500,000 rows at a time. I apply the function to each chunk with a lambda:

msks = ['Msk1','Msk2','Msk3','Msk4',...,'Msk20']
wgt_nms = ['Wgt1','Wgt2','Wgt3','Wgt4',...,'Wgt20']
print('Determining final weights...')
chunksize = 500000 #we'll operate on this many rows at a time
start_time = time.time()
ttl_wgts = [] #initialize list to hold weight products
for i in range(0,len(df),chunksize): 
    ttl_wgts.extend(df[i:(i+chunksize)].apply(lambda x: mymult(x), axis=1))
print("--- %s seconds ---" % (time.time() - start_time)) #Expect between 30 and 40 minutes
print('Done!')

I then assign the ttl_wgts list as a new column in the DataFrame and compute the final product of weight * initial score.

#Initialize the fields
#Might not be necessary or even useful
df['ttl_wgt'] = float('NaN')
df['wgt_prob'] = float('NaN')

df['ttl_wgt'] = ttl_wgts
df['wgt_prob'] = df['ttl_wgt'] * df['prob']

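I looked at a post on multiplying elements in a list. It was good food for thought, but I could not turn it into anything more efficient for my 6M+ observations. Is there another approach I should consider?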

Adding a sample df, as suggested

A sample of the DataFrame might look like this, with only 3 masks/weights:

df = pd.DataFrame({'id': [999999999,136550,80010170,80010177,90002408,90002664,16207501,62992,np.nan,80010152], 
                   'prob': [-1,0.180274382,0.448361456,0.000945058,0.005060279,0.009893078,0.169686288,0.109541453,0.117907763,0.266242921],
                   'Msk1': [0,1,1,1,0,0,1,0,0,0],
                   'Msk2': [0,0,1,0,0,0,0,1,0,0],
                   'Msk3': [1,0,0,0,1,1,0,0,1,1],
                   'Wgt1': [np.nan,0.919921875,1.08984375,1.049804688,np.nan,np.nan,np.nan,0.91015625,np.nan,0.810058594],
                   'Wgt2': [np.nan,1.129882813,1.120117188,0.970214844,np.nan,np.nan,np.nan,1.0703125,np.nan,0.859863281],
                   'Wgt3': [np.nan,1.209960938,1.23046875,1,np.nan,np.nan,np.nan,1.150390625,np.nan,0.649902344]
                   })

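In the first observation, the prob field is -1, so the row is not processed. In the second observation, Msk1 is on while Msk2 and Msk3 are off, so the final weight is the value of Wgt1 = 0.919922. In the third row, Msk1 and Msk2 are on while Msk3 is off, so the final weight is Wgt1*Wgt2 = 1.089844*1.120117 = 1.220753.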


1 Answer
Answer 1 · posted 2024-09-29 06:34:46

IIUC:

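You want to fill the masked-out weights with 1. You can then multiply all the weights together without the masked-out ones having any effect. That is the trick; you will have to adapt it to your needs.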
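The idea rests on 1 being the neutral element of multiplication. A minimal sketch on hypothetical toy data (the `weights` and `masks` arrays below are made up for illustration and are not from the question):

```python
import numpy as np

# Hypothetical toy data: 3 rows, 2 weight/mask pairs.
weights = np.array([[0.5, 2.0],
                    [3.0, 4.0],
                    [1.5, 0.1]])
masks = np.array([[1, 0],
                  [1, 1],
                  [0, 0]])

# Keep a weight where its mask is 1, otherwise substitute the neutral element 1.
effective = np.where(masks == 1, weights, 1.0)

# Row-wise product over the weight columns.
row_product = effective.prod(axis=1)
print(row_product)
```

The three rows come out as 0.5, 12.0, and 1.0. Note that a row with every mask off yields 1.0, which matches the initial value of 1 passed to `functools.reduce` in the question's function.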

Create msk

msk = df.filter(like='Msk')
print(msk)

   Msk1  Msk2  Msk3
0     0     0     1
1     1     0     0
2     1     1     0
3     1     0     0
4     0     0     1
5     0     0     1
6     1     0     0
7     0     1     0
8     0     0     1
9     0     0     1

Create wgt

wgt = df.filter(like='Wgt')
print(wgt)

       Wgt1      Wgt2      Wgt3
0       NaN       NaN       NaN
1  0.919922  1.129883  1.209961
2  1.089844  1.120117  1.230469
3  1.049805  0.970215  1.000000
4       NaN       NaN       NaN
5       NaN       NaN       NaN
6       NaN       NaN       NaN
7  0.910156  1.070312  1.150391
8       NaN       NaN       NaN
9  0.810059  0.859863  0.649902

Create new_wgt

new_wgt = np.where(msk, wgt, 1)
print(new_wgt)

[[ 1.          1.                 nan]
 [ 0.91992188  1.          1.        ]
 [ 1.08984375  1.12011719  1.        ]
 [ 1.04980469  1.          1.        ]
 [ 1.          1.                 nan]
 [ 1.          1.                 nan]
 [        nan  1.          1.        ]
 [ 1.          1.0703125   1.        ]
 [ 1.          1.                 nan]
 [ 1.          1.          0.64990234]]
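The nan entries in the array above appear wherever a mask is 1 but the matching weight is missing, and they carry through to the product. If it helps to find such rows first, here is a rough sketch on a hypothetical three-row subset of the sample data (rows 1, 6, and 9), assuming the mask and weight columns are in the same order:

```python
import numpy as np
import pandas as pd

# Hypothetical subset: rows 1, 6, and 9 of the sample data.
msk = pd.DataFrame({'Msk1': [1, 1, 0], 'Msk2': [0, 0, 0], 'Msk3': [0, 0, 1]})
wgt = pd.DataFrame({'Wgt1': [0.919921875, np.nan, 0.810058594],
                    'Wgt2': [1.129882813, np.nan, 0.859863281],
                    'Wgt3': [1.209960938, np.nan, 0.649902344]})

# True where a mask is switched on but the matching weight is missing.
active_but_missing = (msk.to_numpy() == 1) & wgt.isna().to_numpy()

# Rows with at least one such pair will produce NaN in the product.
bad_rows = active_but_missing.any(axis=1)
print(bad_rows)
```

Only the middle row (row 6 in the sample) is flagged here, since its Msk1 is 1 while Wgt1 is NaN.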

Finally, prod_wgt

prod_wgt = pd.Series(new_wgt.prod(1), wgt.index)
print(prod_wgt)

0         NaN
1    0.919922
2    1.220753
3    1.049805
4         NaN
5         NaN
6         NaN
7    1.070312
8         NaN
9    0.649902
dtype: float64
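The steps above do not apply the question's two qualifying conditions or the final multiplication by prob. One possible way to put the pieces together is sketched below on the question's sample data. It assumes the ID column is named `id` as in the sample (the question's function uses `'ID'`), and `msk_cols`, `wgt_cols`, and `eligible` are names chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Sample frame from the question (three mask/weight pairs).
df = pd.DataFrame({'id': [999999999,136550,80010170,80010177,90002408,90002664,16207501,62992,np.nan,80010152],
                   'prob': [-1,0.180274382,0.448361456,0.000945058,0.005060279,0.009893078,0.169686288,0.109541453,0.117907763,0.266242921],
                   'Msk1': [0,1,1,1,0,0,1,0,0,0],
                   'Msk2': [0,0,1,0,0,0,0,1,0,0],
                   'Msk3': [1,0,0,0,1,1,0,0,1,1],
                   'Wgt1': [np.nan,0.919921875,1.08984375,1.049804688,np.nan,np.nan,np.nan,0.91015625,np.nan,0.810058594],
                   'Wgt2': [np.nan,1.129882813,1.120117188,0.970214844,np.nan,np.nan,np.nan,1.0703125,np.nan,0.859863281],
                   'Wgt3': [np.nan,1.209960938,1.23046875,1,np.nan,np.nan,np.nan,1.150390625,np.nan,0.649902344]})

msk_cols = ['Msk1', 'Msk2', 'Msk3']  # must be in the same order as wgt_cols
wgt_cols = ['Wgt1', 'Wgt2', 'Wgt3']

# Masked-out weights become 1 so they do not affect the product.
new_wgt = np.where(df[msk_cols].to_numpy() == 1, df[wgt_cols].to_numpy(), 1.0)
ttl_wgt = new_wgt.prod(axis=1)

# The question's two conditions: an id is present and prob is greater than -1.
eligible = df['id'].notna() & (df['prob'] > -1.0)

# Rows that fail either condition keep NaN, as in the original function.
df['ttl_wgt'] = np.where(eligible, ttl_wgt, np.nan)
df['wgt_prob'] = df['ttl_wgt'] * df['prob']
print(df[['id', 'prob', 'ttl_wgt', 'wgt_prob']])
```

Because everything operates on whole columns, there is no row-wise `apply`. Whether this also removes the need for chunking on the full 6M+ rows depends on available memory and would need to be checked on the real data.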
