I have a DataFrame:
file_name iteration x y z
0 File_1 A 0.000000 0.561140 1
1 File_1 A 0.285714 0.412811 3
2 File_1 A 0.571429 0.957206 7
3 File_1 A 0.857143 0.667926 3
4 File_1 B 1.142857 0.998238 5
5 File_1 B 1.428571 0.975049 3
6 File_1 B 1.714286 0.690555 4
7 File_1 B 2.000000 0.896192 9
8 File_2 C 0.000000 0.241320 0
9 File_2 C 0.285714 0.476498 0
10 File_2 C 0.571429 0.295382 9
11 File_2 C 0.857143 0.809189 6
12 File_2 D 1.142857 0.601656 6
13 File_2 D 1.428571 0.447752 0
14 File_2 D 1.714286 0.210794 3
15 File_2 D 2.000000 0.788367 6
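I want to:
Below is a working example. However, it is hard to read, and it quickly grows longer as I add more groupby operations (and more lists to store the values in). My actual code uses 3 groupbys, and each function takes a combination of columns and outputs a single value.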
import numpy as np
import pandas as pd

# Example functions I want to apply to my grouped df
# (each takes some columns as input and returns a single number)
def fun_a(x, y):
    # Returns one value per group. Note: // and * bind tighter than +.
    return sum(x) + sum(y) // 3 * max(x)

def fun_b(x, z):
    return np.mean(x) - np.median(z)

# Example dataframe (the real dataframe has 3 columns to group by)
data = {
    "file_name": ["File_1"] * 8 + ["File_2"] * 8,
    "iteration": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4,
    "x": list(np.linspace(0, 2, 8)) * 2,
    "y": list(np.random.random(16)),  # 16 random floats in [0, 1)
    "z": list(np.random.randint(0, 10, 16)),
}
df = pd.DataFrame(data)

# --------------------------#
# How I am currently solving it: nested groupby,
# storing all values of the final df in lists
file_name_lst = []
iterations_name_lst = []
fun_a_lst = []
fun_b_lst = []

# Grouping by file_name --> iteration --> applying the functions.
# A plain string key (not a one-element list) keeps key1/key2 as scalars.
for key1, grp1 in df.groupby("file_name"):
    for key2, grp2 in grp1.groupby("iteration"):
        # Filter grp2 based on column values
        # grp2 = grp2[grp2["x"] != "SOME_VALUE_I_AM_FILTERING_OUT"]
        # In some cases drop the second half of grp2 (measurement repeated)
        # grp2 = grp2[:len(grp2) // 2]
        # Sometimes drop everything smaller than some value
        # grp2 = grp2[grp2["x"] > some_value]
        # Keep only every 5th value (measurements were repeated 5 times)
        # grp2 = grp2[::5]

        # Append the corresponding key values to the lists
        file_name_lst.append(key1)
        iterations_name_lst.append(key2)
        # Append the results of applying the functions to the lists
        fun_a_lst.append(fun_a(x=grp2["x"], y=grp2["y"]))
        fun_b_lst.append(fun_b(x=grp2["x"], z=grp2["z"]))

df_final = pd.DataFrame({
    "file_name": file_name_lst,
    "iteration": iterations_name_lst,
    "fun_a_res": fun_a_lst,
    "fun_b_res": fun_b_lst,
})

print(df)
print("\n summary dataframe: \n")
print(df_final)
Output:
file_name iteration x y z
0 File_1 A 0.000000 0.998349 1
1 File_1 A 0.285714 0.073390 5
2 File_1 A 0.571429 0.188568 9
3 File_1 A 0.857143 0.236174 3
4 File_1 B 1.142857 0.934034 4
5 File_1 B 1.428571 0.906336 6
6 File_1 B 1.714286 0.258148 4
7 File_1 B 2.000000 0.717714 9
8 File_2 C 0.000000 0.137624 1
9 File_2 C 0.285714 0.773759 6
10 File_2 C 0.571429 0.902748 9
11 File_2 C 0.857143 0.186516 8
12 File_2 D 1.142857 0.496885 6
13 File_2 D 1.428571 0.074975 1
14 File_2 D 1.714286 0.132376 3
15 File_2 D 2.000000 0.421217 2
summary dataframe:
file_name iteration fun_a_res fun_b_res
0 File_1 A 1.714286 -3.571429
1 File_1 B 6.285714 -3.428571
2 File_2 C 1.714286 -6.571429
3 File_2 D 6.285714 -0.928571
Is there a more concise way to create a DataFrame made up of the results of functions applied to the groups of a groupby on an existing DataFrame?
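It is still easier to filter out the unwanted rows before the main aggregation. Keeping the filtering and aggregation steps separate also makes the code easier to maintain.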
Code
1. Filtering
Choose one of (a), (b), or (c) for each run.
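The filtering options themselves did not survive in this copy of the answer. As a rough sketch only, here are three filters modeled on the ones commented out in the question, applied to the whole frame before grouping; they may not match the original (a), (b), (c). The names `keys`, `pos`, `size`, `min_x`, and `step`, and the sample values for `y` and `z`, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Small deterministic stand-in for the question's frame (values are illustrative).
df = pd.DataFrame({
    "file_name": ["File_1"] * 8 + ["File_2"] * 8,
    "iteration": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4,
    "x": list(np.linspace(0, 2, 8)) * 2,
    "y": np.arange(16) / 16.0,
    "z": list(range(8)) * 2,
})
keys = ["file_name", "iteration"]

# Position of each row inside its group (0, 1, 2, ...) and the size of that group.
pos = df.groupby(keys).cumcount()
size = df.groupby(keys)["x"].transform("size")

# Hypothetical option 1: keep only rows whose x exceeds a threshold.
min_x = 0.5  # assumed threshold
by_value = df[df["x"] > min_x]

# Hypothetical option 2: keep only the first half of every group.
first_half = df[pos < size // 2]

# Hypothetical option 3: keep every n-th row of every group.
step = 2  # the question mentions 5; 2 suits this small frame
every_nth = df[pos % step == 0]
```

Because each filter is a boolean mask on the original frame, whichever result is chosen can be passed straight into the aggregation step without nested loops.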
2. Aggregation
The groupby-apply construct is used to compute aggregate results that involve multiple columns.
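The answer's own aggregation code is missing from this copy, so the following is a minimal sketch of one way a groupby-apply could replace the nested loops. It reuses `fun_a` and `fun_b` from the question and the `y`/`z` values from the question's first table; the helper name `summarize` is hypothetical.

```python
import numpy as np
import pandas as pd

def fun_a(x, y):
    return sum(x) + sum(y) // 3 * max(x)

def fun_b(x, z):
    return np.mean(x) - np.median(z)

# Frame rebuilt from the first table in the question.
df = pd.DataFrame({
    "file_name": ["File_1"] * 8 + ["File_2"] * 8,
    "iteration": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4,
    "x": list(np.linspace(0, 2, 8)) * 2,
    "y": [0.561140, 0.412811, 0.957206, 0.667926,
          0.998238, 0.975049, 0.690555, 0.896192,
          0.241320, 0.476498, 0.295382, 0.809189,
          0.601656, 0.447752, 0.210794, 0.788367],
    "z": [1, 3, 7, 3, 5, 3, 4, 9, 0, 0, 9, 6, 6, 0, 3, 6],
})

def summarize(grp):
    # Hypothetical helper: one row of results per group, one entry per function.
    return pd.Series({
        "fun_a_res": fun_a(grp["x"], grp["y"]),
        "fun_b_res": fun_b(grp["x"], grp["z"]),
    })

df_final = (
    df.groupby(["file_name", "iteration"])[["x", "y", "z"]]  # only the value columns
      .apply(summarize)   # each group becomes one row of the result
      .reset_index()      # turn the group keys back into ordinary columns
)
print(df_final)
```

Adding a third grouping column then only means extending the list passed to `groupby`, and adding another function only means adding one entry to the `pd.Series` in `summarize`.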
(a)
(b)
(c)