在每列上分别分组

2024-07-05 14:32:43 发布

您现在位置:Python中文网/ 问答频道 /正文

我需要分别对每一列进行分组,并通过分组找出几个度量标准。 假设我有一堆feature列和一个二进制目标列。每个特征都是一个bin(字符串)。目标是整数列。为了简单起见,这里只有1和0。你知道吗

示例

import pandas as pd


var1 = ['var1_bin1', 'var1_bin2', 'var1_bin2', 'var1_bin3', 'var1_bin4', 'var1_bin4', 'var1_bin4', 'var1_bin5', 'var1_bin5', 'var1_bin5']
var2 = ['var2_bin1', 'var2_bin1', 'var2_bin2', 'var2_bin3', 'var2_bin3', 'var2_bin4', 'var2_bin4', 'var2_bin5', 'var2_bin5', 'var2_bin5']
var3 = ['var3_bin2', 'var3_bin2', 'var3_bin2', 'var3_bin3', 'var3_bin3', 'var3_bin3', 'var3_bin3', 'var3_bin4', 'var3_bin5', 'var3_bin5']
var4 = ['var4_bin1', 'var4_bin1', 'var4_bin2', 'var4_bin2', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4']
target = [1, 0, 0, 1, 1, 1, 0, 0, 0, 0]

df = pd.DataFrame({
    'var1' : var1,
    'var2' : var2,
    'var3' : var3,
    'target' : target
})

print(df)
cols = ['var1', 'var2', 'var3', 'var4', 'target']

# need groupby for each column separately:
#  For each column, I want to group by categorical elements in column and sum elements from target variable and also count how many zeros are there

for col in cols:
    x = df.groupby([col, target])[[target]].sum() #expecting aggregated metrics
    print(x)

我所期待的是,结果是一个数据帧的数据帧(或任何更好的方式),我可以直观地与您沟通如下:

Result representation
        var1                     | var2 ...
    ---------------------------- |
    bin    | sum | total_zeros   |
      -----------------          |
var1_bin1  | 1   | 0             |
var1_bin2  | 0   | 2             |
var1_bin3  | 1   | 0             |
var1_bin4  | 2   | 1             |
var1_bin5  | 0   | 3             |

Tags: target目标dfbincolumnsumvar1var2
2条回答

熊猫回答

我们可以首先使用for col in df.columns对所有列进行迭代,然后使用^{}

然后我们在这些列上GroupBy,并使用^{}。在这个聚合中,我们取target的sumtotal zeros。你知道吗

最后,我们使用^{}来连接相邻的每个组。你知道吗

dfg = pd.concat([
    (df.groupby(col)['target']
       .agg([(f'sum_{col}', 'sum'),(f'total_zeros_{col}', lambda x: x.eq(0).sum())])
       .reset_index()
    ) for col in df.columns
], axis=1)
        var1  sum_var1  total_zeros_var1       var2  sum_var2  total_zeros_var2       var3  sum_var3  total_zeros_var3       var4  sum_var4  total_zeros_var4  target  sum_target  total_zeros_target
0  var1_bin1         1                 0  var2_bin1         1                 1  var3_bin2      1.00              2.00  var4_bin1      1.00              1.00    0.00        0.00                6.00
1  var1_bin2         0                 2  var2_bin2         0                 1  var3_bin3      3.00              1.00  var4_bin2      1.00              1.00    1.00        4.00                0.00
2  var1_bin3         1                 0  var2_bin3         2                 0  var3_bin4      0.00              1.00  var4_bin4      2.00              4.00     nan         nan                 nan
3  var1_bin4         2                 1  var2_bin4         1                 1  var3_bin5      0.00              2.00        NaN       nan               nan     nan         nan                 nan
4  var1_bin5         0                 3  var2_bin5         0                 3        NaN       nan               nan        NaN       nan               nan     nan         nan                 nan

因为性能很重要,在groupby之前计数0值,而不是每个组,所以for count可能是两列的sum的聚合:

df1 = pd.concat([
    (df.assign(total_zeros = df[col].eq(0).astype(int))
       .groupby(col)['target','total_zeros']
       .sum()
       .add_suffix(f'_{col}')
       .reset_index()
    ) for col in df.columns
], axis=1)

print(df1)
        var1  target_var1  total_zeros_var1       var2  target_var2  \
0  var1_bin1            1                 0  var2_bin1            1   
1  var1_bin2            0                 0  var2_bin2            0   
2  var1_bin3            1                 0  var2_bin3            2   
3  var1_bin4            2                 0  var2_bin4            1   
4  var1_bin5            0                 0  var2_bin5            0   

   total_zeros_var2       var3  target_var3  total_zeros_var3  target  \
0                 0  var3_bin2          1.0               0.0     0.0   
1                 0  var3_bin3          3.0               0.0     1.0   
2                 0  var3_bin4          0.0               0.0     NaN   
3                 0  var3_bin5          0.0               0.0     NaN   
4                 0        NaN          NaN               NaN     NaN   

   target_target  total_zeros_target  
0            0.0                 6.0  
1            4.0                 0.0  
2            NaN                 NaN  
3            NaN                 NaN  
4            NaN                 NaN  

相关问题 更多 >