How to calculate the expanding mean of all columns in a DataFrame and add it to the DataFrame

Team  TeamPoints  OpponentPoints  ....  TeamPoints_mean  OpponentPoints_mean
ATL          102             109  ....              nan                  nan
ATL          102              92  ....              102                  109
ATL           92              94  ....              102                100.5
BOS          119             122  ....            98.67                98.33
BOS          103              96  ....           103.75               104.25

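import pandas as pd

dataset = pd.read_csv('nba.games.stats.csv')
df = dataset
# Per team: mean of all previous rows (shift() excludes the current game).
# transform keeps the original index, so the result can be assigned directly.
df['TeamPoints_mean'] = (df.groupby('Team')['TeamPoints']
                           .transform(lambda x: x.shift().expanding().mean()))
df['OpponentPoints_mean'] = (df.groupby('Team')['OpponentPoints']
                               .transform(lambda x: x.shift().expanding().mean()))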

Team  TeamPoints  OpponentPoints  ....  TeamPoints_mean  OpponentPoints_mean  ...
ATL          102             109  ....              nan                  nan
ATL          102              92  ....              102                  109
ATL           92              94  ....              102                100.5
BOS          119             122  ....            98.67                98.33
BOS          103              96  ....           103.75               104.25

("..." = mean columns for the rest of the feature columns)

2 Answers

User

Answer 1 · Edited 2024-10-01 09:33:54

Try this:

(0) Sample input:

>>> df
       col1      col2      col3
0  1.490977  1.784433  0.852842
1  3.726663  2.845369  7.766797
2  0.042541  1.196383  6.568839
3  4.784911  0.444671  8.019933
4  3.831556  0.902672  0.198920
5  3.672763  2.236639  1.528215
6  0.792616  2.604049  0.373296
7  2.281992  2.563639  1.500008
8  4.096861  0.598854  4.934116
9  3.632607  1.502801  0.241920

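Then process it:

(1) Build a side table holding all the running means (I could not find a cumulative-mean function, so I use cumsum divided by a running count):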

>>> df_side=df.assign(col_temp=1).cumsum()
>>> df_side
        col1       col2       col3  col_temp
0   1.490977   1.784433   0.852842       1.0
1   5.217640   4.629801   8.619638       2.0
2   5.260182   5.826184  15.188477       3.0
3  10.045093   6.270855  23.208410       4.0
4  13.876649   7.173527  23.407330       5.0
5  17.549412   9.410166  24.935545       6.0
6  18.342028  12.014215  25.308841       7.0
7  20.624021  14.577855  26.808849       8.0
8  24.720882  15.176708  31.742965       9.0
9  28.353489  16.679509  31.984885      10.0
>>> for el in df.columns:
...     df_side["{}_mean".format(el)]=df_side[el]/df_side.col_temp
>>> df_side=df_side.drop([el for el in df.columns] + ["col_temp"], axis=1)
>>> df_side
   col1_mean  col2_mean  col3_mean
0   1.490977   1.784433   0.852842
1   2.608820   2.314901   4.309819
2   1.753394   1.942061   5.062826
3   2.511273   1.567714   5.802103
4   2.775330   1.434705   4.681466
5   2.924902   1.568361   4.155924
6   2.620290   1.716316   3.615549
7   2.578003   1.822232   3.351106
8   2.746765   1.686301   3.526996
9   2.835349   1.667951   3.198489
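
For what it is worth, pandas does ship a cumulative mean: `DataFrame.expanding().mean()`. Below is a minimal sketch, assuming a small made-up frame (the values are illustrative, not the sample input above), that suggests it gives the same result as the cumsum-divided-by-count approach:

```python
import numpy as np
import pandas as pd

# Illustrative data only; not the sample input from the answer.
df = pd.DataFrame({"col1": [1.0, 3.0, 2.0, 6.0],
                   "col2": [4.0, 0.0, 2.0, 2.0]})

# Manual running mean: cumulative sum divided by a running row count.
counts = pd.Series(np.arange(1, len(df) + 1), index=df.index)
manual = df.cumsum().div(counts, axis=0).add_suffix("_mean")

# Built-in equivalent: expanding-window mean over all rows seen so far.
builtin = df.expanding().mean().add_suffix("_mean")

# Attach the running means next to the original columns.
df_final = df.join(builtin)
```

Note that both variants include the current row in the mean; to average only the previous rows, call `shift()` before `expanding()`.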

(2) Join it back on the index:

>>> df_final=df.join(df_side)
>>> df_final
       col1      col2      col3  col1_mean  col2_mean  col3_mean
0  1.490977  1.784433  0.852842   1.490977   1.784433   0.852842
1  3.726663  2.845369  7.766797   2.608820   2.314901   4.309819
2  0.042541  1.196383  6.568839   1.753394   1.942061   5.062826
3  4.784911  0.444671  8.019933   2.511273   1.567714   5.802103
4  3.831556  0.902672  0.198920   2.775330   1.434705   4.681466
5  3.672763  2.236639  1.528215   2.924902   1.568361   4.155924
6  0.792616  2.604049  0.373296   2.620290   1.716316   3.615549
7  2.281992  2.563639  1.500008   2.578003   1.822232   3.351106
8  4.096861  0.598854  4.934116   2.746765   1.686301   3.526996
9  3.632607  1.502801  0.241920   2.835349   1.667951   3.198489

User

Answer 2 · Edited 2024-10-01 09:33:54

I am trying to calculate the means of all previous rows for each column of the DataFrame

To do this for all columns, you can do the following:

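# Running sum divided by running row count gives the cumulative mean.
# Note: DataFrame.applymap was renamed to DataFrame.map in pandas 2.1.
df_means = df.join(df.cumsum() /
                     df.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")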

However, if Team is a column rather than the index, you should drop it first:

df_data = df.drop('Team', axis=1)
df_means = df.join(df_data.cumsum() /
                     df_data.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")

You can also select only the numeric columns like this:

import numpy as np
# Check each column's dtype (not the Series itself) against np.number.
df_data = df[[col for col in df.columns
              if np.issubdtype(df[col].dtype, np.number)]]
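
Another way to pick the numeric columns, sketched here on a made-up frame (the rows are hypothetical, only shaped like the question's data), is `DataFrame.select_dtypes`, which avoids the explicit loop over columns:

```python
import pandas as pd

# Hypothetical frame mixing a text column with numeric ones.
df = pd.DataFrame({"Team": ["ATL", "ATL", "BOS"],
                   "TeamPoints": [102, 102, 119],
                   "OpponentPoints": [109.0, 92.0, 122.0]})

# Keep only integer and float columns; the text column "Team" is excluded.
df_data = df.select_dtypes(include="number")
```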

Or manually define a list cols_for_mean of the columns to average, then do the following:

df_data = df[cols_for_mean]
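
The snippets above compute a running mean over the whole frame that includes the current row. If the goal is the mean of the previous rows only, computed separately per team as in the question, one possible sketch combines `groupby`, `shift` and `expanding`. The rows below are made up in the shape of the question's data, and `means` is an illustrative name:

```python
import pandas as pd

# Made-up rows in the shape of the question's data; not the real dataset.
df = pd.DataFrame({
    "Team": ["ATL", "ATL", "ATL", "BOS", "BOS"],
    "TeamPoints": [102, 102, 92, 119, 103],
    "OpponentPoints": [109, 92, 94, 122, 96],
})

cols_for_mean = ["TeamPoints", "OpponentPoints"]

# Within each team: shift by one row so the current game is excluded,
# then take the expanding mean of everything before it.
means = (
    df.groupby("Team")[cols_for_mean]
      .transform(lambda s: s.shift().expanding().mean())
      .add_suffix("_mean")
)

# transform preserves the original index, so a plain join lines rows up.
df_means = df.join(means)
```

With this approach the first row of every team comes out as NaN, because no earlier game exists for that team.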
