Getting descriptive statistics with (analytic) weights using describe() in Python

I'm trying to translate some code from Stata to Python.

The original code in Stata:

by year, sort : summarize age [aweight = wt]

Normally a simple describe() call would do the job:

dataframe.groupby("year")["age"].describe()

But I can't find a way to translate the aweight option into Python, i.e. to get descriptive statistics of the dataset under analytic/variance weighting.

The code to generate the dataset in Python:

dataframe = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                          'age': [41, 65, 35, 28], 'wt': [1.2, 0.7, 0.8, 1.5]})

If I run by year, sort : summarize age [aweight = wt] in Stata,

the result (for the 2016 group) is: mean = 49.842, standard deviation = 16.37.

What should I do to get the same result in Python?
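For reference, those two numbers can be reproduced by hand in a few lines of numpy, under the assumption that Stata's analytic weights amount to a weighted mean plus an unbiased variance whose degrees-of-freedom correction uses the observation count rather than the weight sum:

import numpy as np

# 2016 group only; a quick sanity check, not the general solution
age = np.array([41.0, 65.0])
wt = np.array([1.2, 0.7])

mean = np.average(age, weights=wt)               # weighted mean
var = np.average((age - mean) ** 2, weights=wt)  # biased weighted variance
std = (var * len(age) / (len(age) - 1)) ** 0.5   # unbiased: n = observation count
print(round(mean, 3), round(std, 3))             # 49.842 16.372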


1 Answer

So I wrote a function that does the same thing as describe(), except that it accepts a weights argument. I tested it on the small dataframe you provided, but I haven't checked it in detail. I tried to avoid .apply in case you have a large dataframe, although I haven't run benchmarks to see whether my approach is faster or uses less memory than writing a function that does the weighted describe for a single group and then using apply to run it on each by-group. That would probably be the simplest approach.
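For what it's worth, that apply-based route might look roughly like the sketch below (the helper name weighted_describe is just illustrative, and it skips the percentiles):

import numpy as np
import pandas as pd

def weighted_describe(g, col, wt):
    # Weighted mean and unbiased std. dev. for a single by-group
    w = g[wt] / g[wt].sum()
    mean = (g[col] * w).sum()
    n = len(g)
    var = ((g[col] - mean) ** 2 * w).sum() * n / (n - 1)
    return pd.Series({'count': n, 'mean': mean, 'std': np.sqrt(var),
                      'min': g[col].min(), 'max': g[col].max()})

df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28], 'wt': [1.2, 0.7, 0.8, 1.5]})
df.groupby('year').apply(weighted_describe, col='age', wt='wt')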

Count, min, and max can be taken without weighting. I then do a straightforward weighted mean and standard deviation, using an unbiased variance formula. I included a frequency-weighting option, which should only affect the sample size used in the degrees-of-freedom adjustment of the unbiased variance estimator: with frequency weights, the sum of the weights is used as the sample size; otherwise, the count of observations in the data is used. I used this answer to help get the weighted percentiles.

import pandas as pd
import numpy as np

df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28], 'wt': [1.2, 0.7, 0.8, 1.5]})
df
   year  age   wt
0  2016   41  1.2
1  2016   65  0.7
2  2020   35  0.8
3  2020   28  1.5

Then I define the following function:

def weighted_groupby_describe(df, col, by, wt, frequency=False):
    '''
    df : dataframe
    col : column for which you want statistics, must be single column
    by : groupby column(s)
    wt : column to use for weights
    frequency : if True, use the sum of weights as the sample size (only affects
    the degrees-of-freedom correction for the unbiased variance)
    '''
    
    if isinstance(by, list):
        df = df.sort_values(by+[col])
    else:
        df = df.sort_values([by] + [col])
    
    newcols = ['gb_weights', 'col_weighted', 'col_mean', 
        'col_sqdiff', 'col_sqdiff_weighted', 'gb_weights_cumsum', 'ngroup']
    assert all([c not in df.columns for c in newcols])
    
    # Weights normalized to sum to 1 within each by-group
    df['gb_weights'] = df[wt]/df.groupby(by)[wt].transform('sum')
    
    df['gb_weights_cumsum'] = df.groupby(by)['gb_weights'].cumsum()
    
    # Weighted mean: sum of value*weight within each by-group
    df['col_weighted'] = df.eval('{}*gb_weights'.format(col))
    
    df['col_mean'] = df.groupby(by)['col_weighted'].transform('sum')
    
    # Weighted (biased) variance: weighted sum of squared deviations from the mean
    df['col_sqdiff'] = df.eval('({}-col_mean)**2'.format(col))
    df['col_sqdiff_weighted'] = df.eval('col_sqdiff*gb_weights')
    
    wstd = df.groupby(by)['col_sqdiff_weighted'].sum()**(0.5)
    wstd.name = 'std'
    
    wmean = df.groupby(by)['col_weighted'].sum()
    wmean.name = 'mean'
    
    df['ngroup'] = df.groupby(by).ngroup()
    quantiles = np.array([0.25, 0.5, 0.75])
    # Offset each group's cumulative-weight positions by its group index so one
    # np.interp call evaluates the 25/50/75% points for every group at once
    weighted_quantiles = df['gb_weights_cumsum'] - 0.5*df['gb_weights'] + df['ngroup']
    ngroups = df['ngroup'].max()+1
    x = np.hstack([quantiles+i for i in range(ngroups)])
    quantvals = np.interp(x, weighted_quantiles, df[col])
    quantvals = np.reshape(quantvals, (ngroups, -1))
    
    other = df.groupby(by)[col].agg(['min', 'max', 'count'])
    
    stats = pd.concat([wmean, wstd, other], axis=1, sort=False)
    
    stats['25%'] = quantvals[:, 0]
    stats['50%'] = quantvals[:, 1]
    stats['75%'] = quantvals[:, 2]
    
    colorder = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    stats = stats[colorder]
    
    if frequency:
        sizes = df.groupby(by)[wt].sum()
    else:
        sizes = stats['count']
    
    stats['weight'] = sizes
    
    # use the "sample size" (weight) to obtain std. deviation from unbiased
    # variance
    stats['std'] = stats.eval('((std**2)*(weight/(weight-1)))**(1/2)')
    
    return stats

And test it:

weighted_groupby_describe(df, 'age', 'year', 'wt')
      count       mean        std  min  ...        50%        75%  max  weight
year                                    ...                                   
2016      2  49.842105  16.372398   41  ...  49.842105  61.842105   65       2
2020      2  30.434783   4.714936   28  ...  30.434783  33.934783   35       2

Compare this to the output without weights:

df.groupby('year')['age'].describe()
      count  mean        std   min    25%   50%    75%   max
year                                                        
2016    2.0  53.0  16.970563  41.0  47.00  53.0  59.00  65.0
2020    2.0  31.5   4.949747  28.0  29.75  31.5  33.25  35.0
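If you want the frequency-weighting behaviour described above, where the sum of the weights is used as the sample size in the degrees-of-freedom correction, pass frequency=True; only the std and weight columns change relative to the run above:

weighted_groupby_describe(df, 'age', 'year', 'wt', frequency=True)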
