How do I operate on grouped arrays in a nested DataFrame?

Posted 2024-09-28 01:25:03


I have a set of nested DataFrames containing several (hundreds of) arrays, and I want to average each variable at different nesting levels.

The variable mydatadf holds a very simple, representative example of the actual data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Nesting order: participant -> session -> set -> rep -> {variable name: array}
mydata = dict()
participant = ['participantA', 'participantB']
for p in participant:
    ses = dict()
    session = ['ses_1', 'ses_2']
    for s in session:
        series = dict()
        sets = ['s_1', 's_2', 's_3']
        for se in sets:
            reps = dict()
            rep = ['r_1', 'r_2', 'r_3', 'r_4', 'r_5']
            for r in rep:
                # Two random signals of 1000 samples each
                variables = {'var1': np.sin(np.random.rand(1000)*2),
                             'var2': np.sin(np.random.rand(1000)*2)}
                reps[r] = variables
            series[se] = reps
        ses[s] = series
    mydata[p] = ses
# Columns are participants, the index is the sessions, each cell is a nested dict
mydatadf = pd.DataFrame(mydata)

How can I efficiently average (for example) var1 across the nesting levels reps, series, ses and/or participant?

In the end, I want to plot all var1 objects and highlight, in different colors, the averaged data for each nesting level of interest.

# Plot every var1 array, walking through all nesting levels
for p in mydatadf.keys():
    for ses in mydatadf[p].keys():
        for se in mydatadf[p][ses].keys():
            for rep in mydatadf[p][ses][se].keys():
                data = mydatadf[p][ses][se][rep]['var1']
                plt.plot(data)
plt.show()

1 answer
Anonymous user
Reply 1 · posted 2024-09-28 01:25:03

You can always flatten the DataFrame and run a standard groupby operation (I don't know whether it is optimal, but it works):

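# pd.json_normalize is the top-level name (pd.io.json.json_normalize is deprecated).
# The result has a single row; each column name is the dotted key path to one array.
df = pd.json_normalize(mydata)
# Split the dotted paths into one column per nesting level and attach the arrays
df_flat = pd.DataFrame(df.T.index.str.split('.').tolist()).assign(values=df.T.values)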


df_flat.head(3)
>>   0      1    2    3     4  \
0  participantA  ses_1  s_1  r_1  var1   
1  participantA  ses_1  s_1  r_1  var2   
2  participantA  ses_1  s_1  r_2  var1   

                                              values  
0  [0.7267196257553268, 0.9822775511169437, 0.991...  
1  [0.6633676714415264, 0.2823588336690545, 0.977...  
2  [0.2211576389168905, 0.9399581790280525, 0.645...  
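
The integer column labels 0 to 4 are easy to mix up. As a minimal sketch, assuming the same nesting order as in the question, the flat table could also be built with explicit loops and descriptive column names. The names participant, session, set, rep and var are labels chosen for this illustration, and the data is a smaller stand-in for mydata:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Smaller stand-in for the question's nested dict (sizes are hypothetical).
mydata = {
    p: {
        s: {
            se: {
                r: {"var1": np.sin(rng.random(10) * 2),
                    "var2": np.sin(rng.random(10) * 2)}
                for r in ["r_1", "r_2"]
            }
            for se in ["s_1", "s_2", "s_3"]
        }
        for s in ["ses_1", "ses_2"]
    }
    for p in ["participantA", "participantB"]
}

# One row per (participant, session, set, rep, var); the whole array
# stays in a single cell of the 'values' column.
rows = [
    (p, s, se, r, v, arr)
    for p, sessions in mydata.items()
    for s, sets in sessions.items()
    for se, reps in sets.items()
    for r, variables in reps.items()
    for v, arr in variables.items()
]
df_flat = pd.DataFrame(
    rows, columns=["participant", "session", "set", "rep", "var", "values"]
)

print(df_flat.head(3))
```

With named columns, a later groupby can refer to "var" or "participant" instead of a column number.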

Edit: group by a column and apply a function (for example, the mean):

# Here I pick column 4, which holds the variable name ('var1' or 'var2').
# Columns can be given descriptive names with df_flat.rename(columns=...).
# np.hstack is needed because each group is an array of arrays.
column = 4
df_flat.groupby(column)['values'].apply(lambda x: np.hstack(x).mean())
>> 4
var1    0.707803
var2    0.707821
Name: values, dtype: float64
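
The call above pools every sample in a group into a single scalar. If a mean trace per group is wanted instead (for example, to overlay on the individual var1 curves in a plot), one option is an element-wise mean. A minimal sketch, assuming all arrays in a group have the same length; the column names and the helper mean_trace are hypothetical, chosen for this illustration:

```python
import numpy as np
import pandas as pd

def mean_trace(arrays):
    """Element-wise mean of equal-length 1-D arrays (hypothetical helper)."""
    return np.stack(list(arrays)).mean(axis=0)

# Tiny stand-in for the flat table, with named columns (an assumption).
rows = [
    ("participantA", "r_1", "var1", np.array([0.0, 2.0])),
    ("participantA", "r_2", "var1", np.array([2.0, 4.0])),
    ("participantB", "r_1", "var1", np.array([10.0, 10.0])),
    ("participantB", "r_2", "var1", np.array([20.0, 30.0])),
]
df_flat = pd.DataFrame(rows, columns=["participant", "rep", "var", "values"])

# Choose the levels to keep; every level left out is averaged over.
levels = ["participant", "var"]
means = {
    key: mean_trace(group["values"])
    for key, group in df_flat.groupby(levels)
}

print(means[("participantA", "var1")])  # -> [1. 3.]
```

Adding more names to levels keeps more nesting levels separate. Each resulting array can then be passed to plt.plot with its own color to highlight it against the individual traces.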
