How to compute a per-variable summary in Python and save it as a DataFrame

Posted 2024-09-28 17:30:49

How can I compute an outcome summary for each variable in Python and save it as a DataFrame?

I have a pandas DataFrame:

Age_Bin Cat_Bin Outcome
  Age1    Cat2     0
  Age1    Cat1     1
  Age2    Cat2     1
  Age1    Cat1     1
  Age2    Cat1     0
  Age3    Cat1     0
  Age3    Cat2     0
  Age1    Cat1     1
  Age3    Cat2     1

I compute a summary of the Outcome distribution for each variable, as shown below, using the code given further down.

Example for the Age_Bin variable:

Age_Bin Outcome_0_cnt Outcome_1_cnt Total_cnt Outcome_0_cnt% Outcome_1_cnt%
  Age1         1         3           4           1/4            3/5
  Age2         1         1           2           1/4            1/5
  Age3         2         1           3           2/4            1/5

This is achieved with the following code:

    df1 = (
        df.groupby(['Age_Bin', 'Outcome'])['Cat_Bin']
          .size()
          .unstack(fill_value=0)
          .add_prefix('Outcome_')
    )
    df = df1.assign(Total_cnt=lambda x: x.sum(1)).join(df1.div(df1.sum()).add_suffix('%'))

    print (df) 

    Outcome Outcome_0 Outcome_1 Total_cnt Outcome_0% Outcome_1%
   Age_Bin 
    Age1         1       3          4         0.25     0.6 
    Age2         1       1          2         0.25     0.2 
    Age3         2       1          3         0.50     0.2

In addition to the output above, I need one more column, Z, placed next to Outcome_1%:

Z_Age= log(Outcome_1%/Outcome_0%).

Then I want to map each variable's Z value back onto the original df, according to the category in each row:

     Age_Bin Cat_Bin Outcome Z_Age Z_Cat
      Age1    Cat2     0
      Age1    Cat1     1
      Age2    Cat2     1
      Age1    Cat1     1
      Age2    Cat1     0
      Age3    Cat1     0
      Age3    Cat2     0
      Age1    Cat1     1
      Age3    Cat2     1

1 answer
User
#1 · Posted 2024-09-28 17:30:49

Use:

df1 = (
       df.groupby(['Age_Bin','Outcome'])['Cat_Bin']
         .size()
         .unstack(fill_value=0)
         .add_prefix('Outcome_')
      )

df2 = df1.assign(Total_cnt=lambda x: x.sum(1)).join(df1.div(df1.sum()).add_suffix('%'))
print (df2)
Outcome  Outcome_0  Outcome_1  Total_cnt  Outcome_0%  Outcome_1%
Age_Bin                                                         
Age1             1          3          4        0.25         0.6
Age2             1          1          2        0.25         0.2
Age3             2          1          3        0.50         0.2
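
As a cross-check, the same summary can probably also be built with `pd.crosstab`. The sketch below is not part of the original answer: it rebuilds the question's sample data by hand, and the names `counts`, `shares` and `summary` are illustrative assumptions. `normalize="columns"` divides each Outcome column by its own total, which is what `df1.div(df1.sum())` does above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Age_Bin": ["Age1", "Age1", "Age2", "Age1", "Age2", "Age3", "Age3", "Age1", "Age3"],
    "Cat_Bin": ["Cat2", "Cat1", "Cat2", "Cat1", "Cat1", "Cat1", "Cat2", "Cat1", "Cat2"],
    "Outcome": [0, 1, 1, 1, 0, 0, 0, 1, 1],
})

# Counts of each Outcome value per Age_Bin.
counts = pd.crosstab(df["Age_Bin"], df["Outcome"]).add_prefix("Outcome_")

# Column-wise shares: each Outcome column is divided by its own total.
shares = (
    pd.crosstab(df["Age_Bin"], df["Outcome"], normalize="columns")
      .add_prefix("Outcome_")
      .add_suffix("%")
)

# Hypothetical name for the combined table (plays the role of df2 above).
summary = counts.assign(Total_cnt=counts.sum(axis=1)).join(shares)
print(summary)
```

Whether `groupby(...).size().unstack()` or `crosstab` reads better is a matter of taste; both should give the same counts and shares for this data.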

Then add:

import numpy as np

# Z as defined in the question: log(Outcome_1% / Outcome_0%)
df2 = df2.assign(Z_age=np.log(df2['Outcome_1%'] / df2['Outcome_0%']))
print (df2)
Outcome  Outcome_0  Outcome_1  Total_cnt  Outcome_0%  Outcome_1%     Z_age
Age_Bin                                                                   
Age1             1          3          4        0.25         0.6  0.875469
Age2             1          1          2        0.25         0.2 -0.223144
Age3             2          1          3        0.50         0.2 -0.916291

# Map the new column back by Age_Bin. Z_Cat cannot be mapped from df2,
# because df2 holds no information about Cat_Bin.
df['Z_Age'] = df['Age_Bin'].map(df2['Z_age'])
print (df)
  Age_Bin Cat_Bin  Outcome     Z_Age
0    Age1    Cat2        0  0.875469
1    Age1    Cat1        1  0.875469
2    Age2    Cat2        1 -0.223144
3    Age1    Cat1        1  0.875469
4    Age2    Cat1        0 -0.223144
5    Age3    Cat1        0 -0.916291
6    Age3    Cat2        0 -0.916291
7    Age1    Cat1        1  0.875469
8    Age3    Cat2        1 -0.916291
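
The question also asks for a Z_Cat column. One way to get it, sketched here as an assumption rather than as part of the original answer, is to repeat the same summary once per variable and map each result back. The helper name `z_by_category` is hypothetical, the sample data is copied from the question, and the ratio follows the question's definition, log(Outcome_1% / Outcome_0%).

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Age_Bin": ["Age1", "Age1", "Age2", "Age1", "Age2", "Age3", "Age3", "Age1", "Age3"],
    "Cat_Bin": ["Cat2", "Cat1", "Cat2", "Cat1", "Cat1", "Cat1", "Cat2", "Cat1", "Cat2"],
    "Outcome": [0, 1, 1, 1, 0, 0, 0, 1, 1],
})


def z_by_category(frame, col, target="Outcome"):
    """Hypothetical helper: log(share of 1s / share of 0s) per category of `col`."""
    # Rows: categories of `col`; columns: the target values 0 and 1.
    counts = frame.groupby([col, target]).size().unstack(fill_value=0)
    # Divide each target column by its own total (column-wise shares).
    shares = counts.div(counts.sum())
    return np.log(shares[1] / shares[0])


# Compute Z per variable and map it back onto the original rows.
for col, z_name in [("Age_Bin", "Z_Age"), ("Cat_Bin", "Z_Cat")]:
    df[z_name] = df[col].map(z_by_category(df, col))

print(df)
```

Note that a category with no 0s or no 1s makes the ratio zero or infinite, so the logarithm becomes -inf or inf; real data may need smoothing or a minimum count per bin before taking the log.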
