Python: Groupby, perform multiple operations, and output a new summary df

   file_name iteration         x         y  z
0     File_1         A  0.000000  0.561140  1
1     File_1         A  0.285714  0.412811  3
2     File_1         A  0.571429  0.957206  7
3     File_1         A  0.857143  0.667926  3
4     File_1         B  1.142857  0.998238  5
5     File_1         B  1.428571  0.975049  3
6     File_1         B  1.714286  0.690555  4
7     File_1         B  2.000000  0.896192  9
8     File_2         C  0.000000  0.241320  0
9     File_2         C  0.285714  0.476498  0
10    File_2         C  0.571429  0.295382  9
11    File_2         C  0.857143  0.809189  6
12    File_2         D  1.142857  0.601656  6
13    File_2         D  1.428571  0.447752  0
14    File_2         D  1.714286  0.210794  3
15    File_2         D  2.000000  0.788367  6

import numpy as np
import pandas as pd

# Example functions I want to apply to my grouped df
# (each takes some columns as input and returns one number per group)
def fun_a(x, y):
    return sum(x) + sum(y) // 3 * max(x)

def fun_b(x, z):
    return np.mean(x) - np.median(z)

# Example dataframe (the real dataframe has 3 columns to group by)
data = {
    "file_name": ["File_1"] * 8 + ["File_2"] * 8,
    "iteration": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4,
    "x": list(np.linspace(0, 2, 8)) * 2,
    "y": list(np.random.random(16)),
    "z": list(np.random.randint(0, 10, 16)),
}
df = pd.DataFrame(data)

# --------------------------
# How I am currently solving it: nested groupby

# Lists that store all values of the final df
file_name_lst = []
iterations_name_lst = []
fun_a_lst = []
fun_b_lst = []

# Group by file_name -> iteration -> apply the functions.
# A plain string key (not a one-element list) keeps key1 and key2 scalar.
for key1, grp1 in df.groupby("file_name"):
    for key2, grp2 in grp1.groupby("iteration"):
        # Filter rows out of grp2 based on column values
        # grp2 = grp2[grp2["x"] != "SOME_VALUE_I_AM_FILTERING_OUT"]

        # In some cases filter out the second half of grp2 (measurement repeated)
        # grp2 = grp2[:len(grp2) // 2]

        # Sometimes filter out everything smaller than some value
        # grp2 = grp2[grp2["x"] > some_value]

        # Keep only every 5th value (measurements have been repeated 5 times)
        # grp2 = grp2[::5]

        # Append the corresponding key values to the lists
        file_name_lst.append(key1)
        iterations_name_lst.append(key2)

        # Append the results of applying the functions to the lists
        fun_a_lst.append(fun_a(x=grp2["x"], y=grp2["y"]))
        fun_b_lst.append(fun_b(x=grp2["x"], z=grp2["z"]))

df_final = pd.DataFrame({
    "file_name": file_name_lst,
    "iteration": iterations_name_lst,
    "fun_a_res": fun_a_lst,
    "fun_b_res": fun_b_lst,
})

print(df)
print("\n summary dataframe: \n")
print(df_final)

   file_name iteration         x         y  z
0     File_1         A  0.000000  0.998349  1
1     File_1         A  0.285714  0.073390  5
2     File_1         A  0.571429  0.188568  9
3     File_1         A  0.857143  0.236174  3
4     File_1         B  1.142857  0.934034  4
5     File_1         B  1.428571  0.906336  6
6     File_1         B  1.714286  0.258148  4
7     File_1         B  2.000000  0.717714  9
8     File_2         C  0.000000  0.137624  1
9     File_2         C  0.285714  0.773759  6
10    File_2         C  0.571429  0.902748  9
11    File_2         C  0.857143  0.186516  8
12    File_2         D  1.142857  0.496885  6
13    File_2         D  1.428571  0.074975  1
14    File_2         D  1.714286  0.132376  3
15    File_2         D  2.000000  0.421217  2

 summary dataframe:

  file_name iteration  fun_a_res  fun_b_res
0    File_1         A   1.714286  -3.571429
1    File_1         B   6.285714  -3.428571
2    File_2         C   1.714286  -6.571429
3    File_2         D   6.285714  -0.928571

1 Answer

User
#1 · Posted 2024-09-28 21:18:32

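It is still easier to filter out the unwanted rows before the main aggregation. Keeping the filtering and aggregation steps separate also makes the code easier to maintain.
Code
1. Filtering
For each run, choose one of (a), (b), or (c).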
# Get the running number within each group and the group size
dfgp = df.groupby(["file_name", "iteration"])
df["cumcount"] = dfgp.cumcount()
df["size"] = dfgp["cumcount"].transform("size")

# (a) filter out the last 50% of each group
df = df[df["cumcount"] < df["size"] // 2]

# (b) filter out everything smaller than 0.5
df = df[df["x"] >= 0.5]

# (c) keep only every 2nd value
df = df[df["cumcount"] % 2 == 0]
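To make the two helper columns concrete, here is a minimal sketch on a toy frame. The data values and the names `first_half` and `every_second` are made up for illustration and are not part of the answer; the group size is taken from the `x` column here, which gives the same number as above.

```python
import pandas as pd

# Toy data (made up for illustration): two groups of four rows each.
df = pd.DataFrame({
    "file_name": ["File_1"] * 8,
    "iteration": ["A"] * 4 + ["B"] * 4,
    "x": [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
})

dfgp = df.groupby(["file_name", "iteration"])
df["cumcount"] = dfgp.cumcount()          # position of each row inside its group
df["size"] = dfgp["x"].transform("size")  # group length, repeated on every row

# Filter (a): keep the first half of each group.
first_half = df[df["cumcount"] < df["size"] // 2]

# Filter (c): keep every 2nd row of each group.
every_second = df[df["cumcount"] % 2 == 0]

print(first_half)
print(every_second)
```

Because both filters are plain boolean masks on the whole frame, swapping one for another does not touch the aggregation step.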
2. Aggregation
The groupby-apply construct is used to compute aggregations that involve more than one column.
dfgp = df.groupby(["file_name", "iteration"])  # regroup on the new df
df_ans = pd.concat([
    dfgp[["x", "y"]].apply(lambda df_: fun_a(df_["x"], df_["y"])).rename("fun_a_res"),
    dfgp[["x", "z"]].apply(lambda df_: fun_b(df_["x"], df_["z"])).rename("fun_b_res")
], axis=1).reset_index()
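As a possible variation on the aggregation step, a single `apply` can return a `pd.Series` holding both results, so the `pd.concat` is not needed. This is only a sketch: the helper name `summarize` and the fixed toy data are assumptions made for illustration, while `fun_a` and `fun_b` are copied from the question.

```python
import numpy as np
import pandas as pd

def fun_a(x, y):
    return sum(x) + sum(y) // 3 * max(x)

def fun_b(x, z):
    return np.mean(x) - np.median(z)

def summarize(grp):
    # Hypothetical helper: one labelled value per summary column.
    return pd.Series({
        "fun_a_res": fun_a(grp["x"], grp["y"]),
        "fun_b_res": fun_b(grp["x"], grp["z"]),
    })

# Fixed toy data (made up) so the result is reproducible.
df = pd.DataFrame({
    "file_name": ["File_1"] * 4 + ["File_2"] * 4,
    "iteration": ["A"] * 2 + ["B"] * 2 + ["C"] * 2 + ["D"] * 2,
    "x": [0, 1, 2, 3, 0, 1, 2, 3],
    "y": [1.0, 2.0, 0.5, 0.5, 3.0, 3.0, 1.0, 1.0],
    "z": [1, 3, 5, 7, 2, 2, 4, 8],
})

# Selecting the value columns explicitly keeps the grouping keys out of
# the frame that is passed to summarize.
df_ans = (
    df.groupby(["file_name", "iteration"])[["x", "y", "z"]]
      .apply(summarize)
      .reset_index()
)
print(df_ans)
```

The trade-off is that every summary function runs inside one Python-level call per group, which reads more compactly but is not necessarily faster than the two separate `apply` calls.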
Results
(a)
print(df_ans)

  file_name iteration  fun_a_res  fun_b_res
0    File_1         A   0.285714  -1.857143
1    File_1         B   2.571428  -2.714286
2    File_2         C   0.285714   0.142857
3    File_2         D   2.571428  -1.714286

(b)
  file_name iteration  fun_a_res  fun_b_res
0    File_1         A   1.428572  -4.285714
1    File_1         B   8.285714  -2.928571
2    File_2         C   1.428572  -6.785714
3    File_2         D   6.285714  -2.928571

(c)
  file_name iteration  fun_a_res  fun_b_res
0    File_1         A   0.571429  -3.714285
1    File_1         B   2.857143  -3.071429
2    File_2         C   0.571429  -4.214285
3    File_2         D   2.857143  -3.071429
