Computing the mean within multi-column groups with a Dask DataFrame

sub_erp_pd = pd.DataFrame()
for j in range(1, 4):
    sub_c = subp[subp['condition'] == j]
    for i in range(1, 3073):
        # Note: DataFrame.append was removed in pandas 2.0, so this loop needs an older pandas
        sub_erp_pd = sub_erp_pd.append(sub_c[sub_c['sample'] == i].mean(), ignore_index=True)

%%time
sub_erp = pd.DataFrame()
for subno in progressbar.progressbar(range(1, 82)):
    try:
        sub = pd.read_csv('../input/data/{}.csv'.format(subno, subno), header=None)
    except:
        sub = pd.read_csv('../input/data/{}.csv'.format(subno, subno), header=None)
    sub_erp = sub_erp.append(sub.groupby(['condition', 'sample'], as_index=False).mean())

1 Answer

User

#1 · Posted 2024-05-20 19:23:08

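If I understand correctly, you need to

use .groupby() to group on the subject, condition and sample columns
- this gathers all rows that share the same values in these three columns into one group
use .mean() to take the average
- this gives the mean of each group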
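
As a minimal sketch of the two steps above, here is a tiny hand-made frame (the values are invented for illustration and are not from the question's data). The first two rows share the same subject, condition and sample, so they fall into one group and their trial values are averaged.

```python
import pandas as pd

# Hypothetical toy data: the first two rows share all three key values.
toy = pd.DataFrame({
    "subject":   [1, 1, 1, 2],
    "condition": [1, 1, 2, 1],
    "sample":    [5, 5, 5, 5],
    "trial":     [10, 20, 30, 40],
})

# Group on the three key columns, then average 'trial' within each group.
toy_means = toy.groupby(["subject", "condition", "sample"], as_index=False)["trial"].mean()
print(toy_means)
```

Four input rows collapse to three groups, because only the first two rows agree on every key column.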

Generate some dummy data

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)),
                  columns=['trial', 'condition', 'sample'])
df.insert(0, 'subject', [1]*10 + [2]*30 + [5]*60)

print(df.head())
   subject  trial  condition  sample
0        1     71         96      34
1        1      2         89      66
2        1     90         90      81
3        1     93         43      18
4        1     29         82      32

Pandas approach

Aggregate and take the mean

df_grouped = df.groupby(['subject','condition','sample'], as_index=False)['trial'].mean()

print(df_grouped.head(15))
    subject  condition  sample  trial
0         1         18      24     89
1         1         43      18     93
2         1         67      47     81
3         1         82      32     29
4         1         85      28     97
5         1         88      13     48
6         1         89      59     23
7         1         89      66      2
8         1         90      81     90
9         1         96      34     71
10        2          0      81     19
11        2          2      39     58
12        2          2      59     94
13        2          5      42     13
14        2          9      42      4
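
The question's own code calls .mean() without selecting a column first. As a hedged sketch of what that does, the frame below has two measurement columns (`ch1` and `ch2` are invented names, and the values are random). Leaving out the `['trial']` selection averages every non-key column within each group.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two measurement columns ('ch1', 'ch2' are invented names).
rng = np.random.default_rng(0)
wide = pd.DataFrame({
    "condition": [1, 1, 2, 2],
    "sample":    [1, 1, 1, 1],
    "ch1":       rng.normal(size=4),
    "ch2":       rng.normal(size=4),
})

# No column selection: every non-key column is averaged within each group.
wide_means = wide.groupby(["condition", "sample"], as_index=False).mean()
print(wide_means)
```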

Dask approach

Step 1. Imports

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

Step 2. Use dd.from_pandas() to convert the pandas DataFrame into a Dask DataFrame

ddf = dd.from_pandas(df, npartitions=2)

Step 3. Aggregate and take the mean

ddf_grouped = (
    ddf.groupby(['subject','condition','sample'])['trial']
    .mean()
    .reset_index(drop=False)
)

with ProgressBar():
    df_grouped = ddf_grouped.compute()
[                                        ] | 0% Completed |  0.0s
[########################################] | 100% Completed |  0.1s

print(df_grouped.head(15))
    subject  condition  sample  trial
0         1         18      24     89
1         1         43      18     93
2         1         67      47     81
3         1         82      32     29
4         1         85      28     97
5         1         88      13     48
6         1         89      59     23
7         1         89      66      2
8         1         90      81     90
9         1         96      34     71
10        2          0      81     19
11        2          2      39     58
12        2          2      59     94
13        2          5      42     13
14        2          9      42      4

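Important: the approach in this answer does not create an empty Dask DataFrame and append values to it in order to compute the mean within each subject, condition and sample group. Instead, it offers an alternative (using GROUP BY) that produces the same end result: the mean within each subject, condition and sample group.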
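
If the per-subject loop from the question is kept, one option is to collect each subject's grouped means in a list and concatenate once at the end, rather than appending to a DataFrame on every iteration. The sketch below uses small in-memory frames in place of the CSV files. `make_subject`, its column names and its random values are hypothetical stand-ins, not the question's data.

```python
import numpy as np
import pandas as pd

def make_subject(subno, seed):
    """Hypothetical stand-in for reading one subject's CSV file."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "subject":   subno,
        "condition": np.repeat([1, 2, 3], 4),
        "sample":    np.tile([1, 1, 2, 2], 3),
        "value":     rng.normal(size=12),
    })

parts = []
for subno in range(1, 4):
    sub = make_subject(subno, seed=subno)
    # Reduce each subject to its group means before storing it.
    parts.append(sub.groupby(["subject", "condition", "sample"], as_index=False).mean())

# A single concatenation at the end replaces the repeated appends.
sub_erp = pd.concat(parts, ignore_index=True)
print(sub_erp.shape)
```

Each hypothetical subject contributes 3 conditions times 2 samples, so the three subjects give 18 rows in total.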
