Aggregate, group by, and unstack multiple columns

Posted 2024-09-29 17:14:12


I am using Python and have a DataFrame with 177 columns that holds 24 hours of patient measurements, as shown below:

subject_id  hour_measure  urinecolor  Respiraory
3           1.00          red         40
3           1.15          red         90
4           2.00          yellow      60

I want to compute some statistics for each hour, such as the mean, maximum, standard deviation, and skewness.

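Because the DataFrame contains both text and numeric columns, I cannot simply loop over every column and aggregate it, so I tried aggregating each column separately, as in the code below: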

import pandas as pd

grouped = df.groupby(['hour_measure', 'subject_id']).agg({"Heart Rate": ['sum', 'min', 'max', 'std', 'count', 'var', 'skew']})
grouped2 = df.groupby(['hour_measure', 'subject_id']).agg({"Respiraory": ['sum', 'min', 'max', 'std', 'count']})

# flatten the two-level column names, then write the aggregated values to a CSV file
grouped.columns = ["_".join(x) for x in grouped.columns]
grouped.to_csv('temp3.csv')

# append the second aggregate to the same file
with open('temp3.csv', 'a') as f:
    grouped2.to_csv(f, header=True)

# unstack to pivot the hour level from rows into columns
df.set_index(['subject_id', 'hour_measure']).unstack()

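This code works, but I would like to use a loop to aggregate every numeric column. For each text column, instead of the statistical functions, I want to pick the most frequent value within the hour and add it to the file. The result should finally be unstacked by subject_id and hour_measure, so that it looks like this: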

             heart rate
             1                          2                    3 ... to 24, then the next feature
subject_id   min   max   std   skew     min   max   std
1            40    110   50    60       60    290   40

1 answer
User
#1 · Posted 2024-09-29 17:14:12

Use:

print (df)
   hour  subject_id  hour_measure urinecolor  Respiraory
0     1           3          1.00        red          40
1     1           3          1.15        red          90
2     1           4          2.00     yellow          60

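import pandas as pd

# aggregate numeric columns only; recent pandas versions raise on text columns here
num_cols = df.select_dtypes('number').columns.difference(['hour_measure', 'subject_id', 'hour'])
df1 = (df.groupby(['hour_measure','subject_id', 'hour'])[num_cols]
        .agg(['sum','min','max','std', 'count','var','skew']))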
print (df1)
                             Respiraory                           
                                    sum min max std count var skew
hour_measure subject_id hour                                      
1.00         3          1            40  40  40 NaN     1 NaN  NaN
1.15         3          1            90  90  90 NaN     1 NaN  NaN
2.00         4          1            60  60  60 NaN     1 NaN  NaN
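
The NaN entries above appear because every group in the sample holds a single row. A minimal sketch of that behaviour, using a hypothetical three-value Series rather than the real data: the sample standard deviation and variance need at least two observations, and pandas' skew needs at least three.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate group sizes 1, 2 and 3.
values = pd.Series([40, 90, 60])

one = values.iloc[:1]   # a "group" with a single observation
two = values.iloc[:2]   # a "group" with two observations

print(one.std(), one.var())   # both NaN: sample statistics need >= 2 values
print(two.std())              # a finite number
print(two.skew())             # NaN: skew needs >= 3 values
print(values.skew())          # a finite number
```

With real hourly data that has several measurements per subject and hour, these columns should fill in.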

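# most frequent value in each group (None if the group has no non-null values)
f = lambda x: next(iter(x.mode()), None)
# text (object) columns only
cols = df.select_dtypes(object).columns
df2 = df.groupby(['hour_measure','subject_id', 'hour'])[cols].agg(f)
# add a second column level so df2 matches df1's (column, statistic) layout
df2.columns = pd.MultiIndex.from_product([df2.columns, ['mode']])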
print (df2)
                             urinecolor
                                   mode
hour_measure subject_id hour           
1.00         3          1           red
1.15         3          1           red
2.00         4          1        yellow
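
How the mode helper behaves on edge cases may be worth checking before relying on it. The sketch below uses a named function, `first_mode`, which is a hypothetical equivalent of the lambda above, and three made-up Series. `Series.mode()` returns the modes in sorted order and ignores missing values by default, so a tie resolves to the first value in sort order and an all-missing group yields `None`.

```python
import pandas as pd

def first_mode(s):
    """Return the first mode of a Series, or None if there is no non-null value."""
    return next(iter(s.mode()), None)

# Made-up inputs for illustration only.
clear = pd.Series(["red", "red", "yellow"])      # one clear winner
tied = pd.Series(["yellow", "red"])              # two values tie
empty = pd.Series([None, None], dtype=object)    # nothing but missing values

print(first_mode(clear))   # red
print(first_mode(tied))    # red (modes are sorted, so "red" precedes "yellow")
print(first_mode(empty))   # None
```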

df3 = pd.concat([df1, df2], axis=1).unstack().reorder_levels([0,2,1], axis=1)
print (df3)
                        Respiraory                            urinecolor
hour                             1                                     1
                               sum min max std count var skew       mode
hour_measure subject_id                                                 
1.00         3                  40  40  40 NaN     1 NaN  NaN        red
1.15         3                  90  90  90 NaN     1 NaN  NaN        red
2.00         4                  60  60  60 NaN     1 NaN  NaN     yellow
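
The steps above can also be combined into a single `agg` call driven by a per-column dictionary, which may be closer to the loop the question asks for. This is only a sketch under stated assumptions: the sample data, the integer `hour` column, the helper name `most_frequent`, and the chosen statistics are all hypothetical and should be adapted to the real 177-column frame.

```python
import pandas as pd

# Hypothetical sample data modelled on the question.
df = pd.DataFrame({
    "subject_id": [3, 3, 3, 4, 4],
    "hour": [1, 1, 2, 1, 1],
    "urinecolor": ["red", "red", "yellow", "yellow", "yellow"],
    "Respiraory": [40, 90, 70, 60, 80],
})

keys = ["subject_id", "hour"]

def most_frequent(s):
    """Return the first mode of a Series, or None if it has no non-null values."""
    return next(iter(s.mode()), None)

# Split the non-key columns into numeric and text columns.
numeric_cols = df.select_dtypes("number").columns.difference(keys)
text_cols = df.select_dtypes(exclude="number").columns.difference(keys)

# Build one aggregation spec: statistics for numbers, the mode for text.
spec = {col: ["min", "max", "mean", "std"] for col in numeric_cols}
spec.update({col: [most_frequent] for col in text_cols})

per_hour = df.groupby(keys).agg(spec)

# Move the hour level into the columns: (column, hour, statistic).
wide = per_hour.unstack("hour").reorder_levels([0, 2, 1], axis=1)

# Flatten the column names so the frame writes cleanly to a single CSV.
wide.columns = ["_".join(map(str, c)) for c in wide.columns]
csv_text = wide.to_csv()   # pass a file path instead to write to disk
print(wide)
```

Building the dictionary with a comprehension avoids writing one `agg` call per column, and flattening the column names before `to_csv` yields one header row instead of three.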
