How to compute the expanding mean of every column in a DataFrame and add the results to the DataFrame

Posted 2024-10-01 09:33:54


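I am trying to calculate the mean of all previous rows for each column of the DataFrame and add the resulting mean columns to the DataFrame.

I am working with a set of NBA game data containing 20+ features (columns), and I am trying to compute the mean of each of these features. A sample of the dataset is shown below (note: "...." stands for the remaining feature columns).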

Team TeamPoints OpponentPoints.... TeamPoints_mean  OpponentPoints_mean 
ATL      102        109       ....     nan               nan
ATL      102        92        ....     102               109
ATL      92         94        ....     102               100.5
BOS      119        122       ....     98.67             98.33
BOS      103        96        ....     103.75            104.25

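Example that computes the means for two columns:

import pandas as pd

dataset = pd.read_csv('nba.games.stats.csv')
df = dataset

# Mean of all previous games per team; shift() excludes the current row
df['TeamPoints_mean'] = df.groupby('Team')['TeamPoints'].transform(lambda x: x.shift().expanding().mean())
df['OpponentPoints_mean'] = df.groupby('Team')['OpponentPoints'].transform(lambda x: x.shift().expanding().mean())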

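Again, this code computes the mean and adds the new column for only one feature at a time. Is there a way to compute the means of all the columns and add them to the DataFrame without handling one column at a time? A loop? Below is an example of what I am looking for.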

Team TeamPoints OpponentPoints.... TeamPoints_mean  OpponentPoints_mean ...("..." = mean columns of rest of the feature columns) 
ATL      102        109       ....     nan               nan
ATL      102        92        ....     102               109
ATL      92         94        ....     102               100.5
BOS      119        122       ....     98.67             98.33
BOS      103        96        ....     103.75            104.25


2 Answers

Try this:

(0) Sample input:

>>> df
       col1      col2      col3
0  1.490977  1.784433  0.852842
1  3.726663  2.845369  7.766797
2  0.042541  1.196383  6.568839
3  4.784911  0.444671  8.019933
4  3.831556  0.902672  0.198920
5  3.672763  2.236639  1.528215
6  0.792616  2.604049  0.373296
7  2.281992  2.563639  1.500008
8  4.096861  0.598854  4.934116
9  3.632607  1.502801  0.241920

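Then process it:

(1) Build a side table that holds all the running means (I did not find a cumulative mean function, so I used cumsum plus a row count):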

>>> df_side=df.assign(col_temp=1).cumsum()
>>> df_side
        col1       col2       col3  col_temp
0   1.490977   1.784433   0.852842       1.0
1   5.217640   4.629801   8.619638       2.0
2   5.260182   5.826184  15.188477       3.0
3  10.045093   6.270855  23.208410       4.0
4  13.876649   7.173527  23.407330       5.0
5  17.549412   9.410166  24.935545       6.0
6  18.342028  12.014215  25.308841       7.0
7  20.624021  14.577855  26.808849       8.0
8  24.720882  15.176708  31.742965       9.0
9  28.353489  16.679509  31.984885      10.0
>>> for el in df.columns:
...     df_side["{}_mean".format(el)]=df_side[el]/df_side.col_temp
>>> df_side=df_side.drop([el for el in df.columns] + ["col_temp"], axis=1)
>>> df_side
   col1_mean  col2_mean  col3_mean
0   1.490977   1.784433   0.852842
1   2.608820   2.314901   4.309819
2   1.753394   1.942061   5.062826
3   2.511273   1.567714   5.802103
4   2.775330   1.434705   4.681466
5   2.924902   1.568361   4.155924
6   2.620290   1.716316   3.615549
7   2.578003   1.822232   3.351106
8   2.746765   1.686301   3.526996
9   2.835349   1.667951   3.198489

(2) Join it back on the index:

>>> df_final=df.join(df_side)
>>> df_final
       col1      col2      col3  col1_mean  col2_mean  col3_mean
0  1.490977  1.784433  0.852842   1.490977   1.784433   0.852842
1  3.726663  2.845369  7.766797   2.608820   2.314901   4.309819
2  0.042541  1.196383  6.568839   1.753394   1.942061   5.062826
3  4.784911  0.444671  8.019933   2.511273   1.567714   5.802103
4  3.831556  0.902672  0.198920   2.775330   1.434705   4.681466
5  3.672763  2.236639  1.528215   2.924902   1.568361   4.155924
6  0.792616  2.604049  0.373296   2.620290   1.716316   3.615549
7  2.281992  2.563639  1.500008   2.578003   1.822232   3.351106
8  4.096861  0.598854  4.934116   2.746765   1.686301   3.526996
9  3.632607  1.502801  0.241920   2.835349   1.667951   3.198489
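
pandas does provide a cumulative mean through `expanding().mean()`, so the cumsum/count step above can probably be written more directly. The following is a minimal sketch, assuming a small all-numeric frame whose values are made up for illustration (they are not the sample input above); the `_prev_mean` suffix is likewise just an illustrative name. Adding `shift()` gives the mean of the previous rows only, which is what the question asks for.

```python
import pandas as pd

# Illustrative data only; not the sample input from the answer above.
df = pd.DataFrame({"col1": [1.0, 3.0, 5.0, 7.0],
                   "col2": [2.0, 4.0, 6.0, 8.0]})

# Running mean including the current row (matches cumsum / row count).
running = df.expanding().mean().add_suffix("_mean")

# Running mean of the previous rows only: shift every column down one row first,
# so the first row has no history and comes out as NaN.
previous = df.shift().expanding().mean().add_suffix("_prev_mean")

df_final = df.join(running).join(previous)
print(df_final)
```

Because both results keep the original index, `join` lines them up with the source rows without any extra key.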

I am trying to calculate the means of all previous rows for each column of the DataFrame

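To get all the columns, you can do the following:

# Running sum divided by running row count gives the running mean;
# rsuffix renames the overlapping columns from the right-hand frame.
df_means = df.join(df.cumsum()/
                     df.applymap(lambda x:1).cumsum(),
                   rsuffix = "_mean")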

However, if Team is a column rather than the index, you should drop it first:

df_data = df.drop('Team', axis=1)
df_means = df.join(df_data.cumsum()/
                     df_data.applymap(lambda x:1).cumsum(),
                   rsuffix = "_mean")
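
If the means should restart for each team, as in the question's own `groupby('Team')` code, the same shift-then-expand idea can be applied to every feature column in one call. This is a sketch under assumptions: it uses only the five rows and three columns shown in the question, and `feature_cols` and `prev_means` are names introduced here for illustration.

```python
import pandas as pd

# The five rows shown in the question, limited to the visible columns.
df = pd.DataFrame({
    "Team": ["ATL", "ATL", "ATL", "BOS", "BOS"],
    "TeamPoints": [102, 102, 92, 119, 103],
    "OpponentPoints": [109, 92, 94, 122, 96],
})

# Every column except the grouping key is treated as a feature.
feature_cols = list(df.columns.drop("Team"))

# Within each team, average all earlier rows of every feature column.
prev_means = (
    df.groupby("Team")[feature_cols]
      .transform(lambda s: s.shift().expanding().mean())
      .add_suffix("_mean")
)

df_means = df.join(prev_means)
print(df_means)
```

With this grouping, the first game of each team has no history and gets NaN, which differs from the ungrouped running mean shown in the question's sample table.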

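You could also keep only the numeric columns:

import numpy as np
# issubdtype expects a dtype, so pass df[col].dtype rather than the Series
df_data = df[[col for col in df.columns 
              if np.issubdtype(df[col].dtype, np.number)]]

Or manually define a list of the columns to average, cols_for_mean, and then do the following: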

df_data = df[cols_for_mean]
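
As another option for picking the numeric columns, pandas has `select_dtypes`. A minimal sketch, assuming a two-row frame built from the question's column names purely for illustration:

```python
import pandas as pd

# Illustrative frame: one text column and two numeric columns.
df = pd.DataFrame({
    "Team": ["ATL", "BOS"],
    "TeamPoints": [102, 119],
    "OpponentPoints": [109, 122],
})

# Keep only the numeric columns; the text column Team is left out.
df_data = df.select_dtypes(include="number")
print(list(df_data.columns))
```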
