如何在Pandas中使用Groupby在聚合后取消堆栈

2024-05-19 05:52:43 发布

您现在位置:Python中文网/ 问答频道 /正文

你好,数据科学家和熊猫专家

在应用groupby聚合方法后,我需要一些帮助来找出如何更好地组织数据。我已经尝试取消堆叠到新的数据帧,但它没有产生预期的结果

这是我的数据框:

df = [{'Store': 's1', 'Date': Timestamp('2020-08-01 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's1ad1', 'Level': 2, 'duties': 'O'},\
 {'Store': 's1', 'Date': Timestamp('2020-08-02 00:00:00'), 'Employee': 'a', 'Department': 'd2', 'ID': 's1ad2', 'Level': 2, 'duties': 'C'},\
 {'Store': 's1', 'Date': Timestamp('2020-08-03 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's1ad1', 'Level': 2, 'duties': 'O'},\
 {'Store': 's1', 'Date': Timestamp('2020-08-04 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's1ad1', 'Level': 2, 'duties': 'O'},\
 {'Store': 's1', 'Date': Timestamp('2020-08-05 00:00:00'), 'Employee': 'a', 'Department': 'd2', 'ID': 's1ad2', 'Level': 2, 'duties': 'C'},\
 {'Store': 's2', 'Date': Timestamp('2020-08-08 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's2ad1', 'Level': 2, 'duties': 'O'},\
 {'Store': 's2', 'Date': Timestamp('2020-08-09 00:00:00'), 'Employee': 'a', 'Department': 'd3', 'ID': 's2ad3', 'Level': 2, 'duties': 'C'},\
 {'Store': 's2', 'Date': Timestamp('2020-08-10 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's2ad1', 'Level': 2, 'duties': 'O'},\
 {'Store': 's2', 'Date': Timestamp('2020-08-11 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's2ad1', 'Level': 2, 'duties': 'O'},\
 {'Store': 's2', 'Date': Timestamp('2020-08-12 00:00:00'), 'Employee': 'a', 'Department': 'd3', 'ID': 's3ad1', 'Level': 2, 'duties': 'C'},\
 {'Store': 's1', 'Date': Timestamp('2020-08-01 00:00:00'), 'Employee': 'b', 'Department': 'd1', 'ID': 's1bd1', 'Level': 1, 'duties': 'O'},\
 {'Store': 's1', 'Date': Timestamp('2020-08-02 00:00:00'), 'Employee': 'b', 'Department': 'd2', 'ID': 's1bd2', 'Level': 1, 'duties': 'C'},\
 {'Store': 's1', 'Date': Timestamp('2020-08-03 00:00:00'), 'Employee': 'b', 'Department': 'd1', 'ID': 's1bd1', 'Level': 1, 'duties': 'O'},\
 {'Store': 's1', 'Date': Timestamp('2020-08-04 00:00:00'), 'Employee': 'b', 'Department': 'd1', 'ID': 's1bd1', 'Level': 1, 'duties': 'O'},\
 {'Store': 's1', 'Date': Timestamp('2020-08-05 00:00:00'), 'Employee': 'b', 'Department': 'd2', 'ID': 's1bd2', 'Level': 1, 'duties': 'C'},\
 {'Store': 's2', 'Date': Timestamp('2020-08-08 00:00:00'), 'Employee': 'c', 'Department': 'd1', 'ID': 's2ac1', 'Level': 3, 'duties': 'O'},\
 {'Store': 's2', 'Date': Timestamp('2020-08-09 00:00:00'), 'Employee': 'c', 'Department': 'd3', 'ID': 's2cd3', 'Level': 3, 'duties': 'C'},\
 {'Store': 's2', 'Date': Timestamp('2020-08-10 00:00:00'), 'Employee': 'c', 'Department': 'd1', 'ID': 's2cd1', 'Level': 3, 'duties': 'O'},\
 {'Store': 's2', 'Date': Timestamp('2020-08-11 00:00:00'), 'Employee': 'c', 'Department': 'd1', 'ID': 's2cd1', 'Level': 3, 'duties': 'O'},\
 {'Store': 's2', 'Date': Timestamp('2020-08-12 00:00:00'), 'Employee': 'c', 'Department': 'd3', 'ID': 's3cd1', 'Level': 3, 'duties': 'C'},\
 {'Store': 's3', 'Date': Timestamp('2020-08-08 00:00:00'), 'Employee': 'd', 'Department': 'd1', 'ID': 's3cd1', 'Level': 3, 'duties': 'O'},\
 {'Store': 's3', 'Date': Timestamp('2020-08-09 00:00:00'), 'Employee': 'd', 'Department': 'd3', 'ID': 's3dd3', 'Level': 3, 'duties': 'C'},\
 {'Store': 's3', 'Date': Timestamp('2020-08-10 00:00:00'), 'Employee': 'd', 'Department': 'd1', 'ID': 's3dd1', 'Level': 3, 'duties': 'O'},\
 {'Store': 's3', 'Date': Timestamp('2020-08-11 00:00:00'), 'Employee': 'd','Department': 'd1', 'ID': 's3dd1', 'Level': 3, 'duties': 'O'},\
 {'Store': 's3', 'Date': Timestamp('2020-08-12 00:00:00'), 'Employee': 'd', 'Department': 'd3', 'ID': 's3dd1', 'Level': 3, 'duties': 'C'}]

我想组织我的输出,使其在存储中能够很好地堆叠-->;部门-->;受雇者如下所示(很抱歉,输出没有很好地排列):

Store           s1                      s2                                     s3
Department      d1          d2          d1          d3          d1          d3
Employee    ID   Level duties   first shift   last shift #ofshift   first shift   last shift #ofshift   first shift   last shift #ofshift   first shift   last shift #ofshift   first shift   last shift #ofshift   first shift   last shift #ofshift
a   s1ad1 2 O   2020-08-01 2020-08-04 3 
a   s1ad2 2 C               2020-08-02 2020-08-05 2
a   s2ad1 2 O                           2020-08-08 2020-08-11 3
a   s2ad3 2 O                                       2020-08-09 2020-08-12 2
b   s1bd1 1 O   2020-08-01 2020-08-04 3 
b   s1bd2 1 C               2020-08-02 2020-08-05 2
c   s2cd1 3 O                           2020-08-08 2020-08-11 3
c   s3ad3 3 O                                       2020-08-09 2020-08-12 2
d   s3dd1 3 O                                                   2020-08-08 2020-08-11 3
d   s3dd3 3 O                                                               2020-08-09 2020-08-12 2

因此,我试着按表达式分组如下:

df = df.groupby(['Employee', 'Store', 'Department'])\
                .agg({'Date':['first', 'last', 'size'],
                      'ID': 'first',
                      'Level': 'first',
                      'duties': 'first'})

# Join the Each Column with its operation.
df.columns = df.columns.map('_'.join)     

# Reset the Index
df = df.reset_index().set_index('Employee')

# Renaming Columns of Dataframe.
df.rename(columns={'Date_first':'First Shift',
        'Date_last':'Last Shift',
        'Date_size':'# of shift',
        'ID_first':'ID',
        'Level_first':'Level',
        'duties_first':'duties'},
        inplace=True)

这将打印以下结果:

            Store  Department  First Shift       Last Shift      # of shift    ID_first     Level    duties
Employee
a               s1         d1                    2020-08-01    2020-08-04           3          s1ad1             2      O
a               s1         d2                    2020-08-02   2020-08-05           2           s1ad2            2      C
a               s2         d1                    2020-08-08   2020-08-11            3           s2ad1            2      O
a               s2         d3                   2020-08-09    2020-08-12           2           s2ad3           2      C
b               s1         d1                    2020-08-01     2020-08-04           3          s1bd1            1      O
b               s1         d2                   2020-08-02     2020-08-05           2          s1bd2           1      C
c               s2         d1                   2020-08-08     2020-08-11            3           s2ac1           3      O
c               s2         d3                  2020-08-09     2020-08-12           2           s2cd3          3      C
d               s3         d1                   2020-08-08    2020-08-11           3           s3cd1           3      O
d               s3         d3                  2020-08-09     2020-08-12           2         s3dd3           3      C

然后我应用了unstack表达式,如下所示:

df = df.groupby(['Employee', 'Store', 'Department', 'First Shift', 'Last Shift', '# of shift', 'ID_first',\
                     'Level', 'duties'])\
            .size()\
            .unstack(['Store', 'Department']).fillna(0)

它将打印结果,如下所示:

Store                                                              s1        s2        s3     
Department                                                         d1   d2   d1   d3   d1   d3
Employee First Shift Last Shift # of shift ID_first Level duties
a        2020-08-01  2020-08-04 3          s1ad1    2     O       1.0  0.0  0.0  0.0  0.0  0.0
         2020-08-02  2020-08-05 2          s1ad2    2     C       0.0  1.0  0.0  0.0  0.0  0.0
         2020-08-08  2020-08-11 3          s2ad1    2     O       0.0  0.0  1.0  0.0  0.0  0.0
         2020-08-09  2020-08-12 2          s2ad3    2     C       0.0  0.0  0.0  1.0  0.0  0.0
b        2020-08-01  2020-08-04 3          s1bd1    1     O       1.0  0.0  0.0  0.0  0.0  0.0
         2020-08-02  2020-08-05 2          s1bd2    1     C       0.0  1.0  0.0  0.0  0.0  0.0
c        2020-08-08  2020-08-11 3          s2ac1    3     O       0.0  0.0  1.0  0.0  0.0  0.0
         2020-08-09  2020-08-12 2          s2cd3    3     C       0.0  0.0  0.0  1.0  0.0  0.0
d        2020-08-08  2020-08-11 3          s3cd1    3     O       0.0  0.0  0.0  0.0  1.0  0.0
         2020-08-09  2020-08-12 2          s3dd3    3     C       0.0  0.0  0.0  0.0  0.0  1.0

我认为我错误地使用了大小和展开。然而,我似乎不知道如何重新组织数据

我非常感谢专家对如何正确组织我的数据的意见

再次感谢您的帮助和关心

多谢各位


Tags: storeiddfdateshiftemployeeleveltimestamp
1条回答
网友
1楼 · 发布于 2024-05-19 05:52:43

问题的关键在于,在使用.unstack()之前需要重新构造数据,因为所需的格式是一个矩阵,其值是三个重复的列。因此,您需要将数据帧从宽改为长,并在一列Values中使用这三个值创建一个新列,在另一列中对它们进行分类Shift

# Step 1: Named Groupby Agregation naming columns ins specific format required for `pd.wide_to_long`. Must end with integer.
df = (df.groupby(['Employee', 'Store', 'Department'])
        .agg(Shift_1=('Date','first'),
             Shift_2=('Date','last'),
             Shift_3=('Date','size'),
             ID=('ID', 'first'),
             Level=('Level', 'first'),
             duties=('duties', 'first'))
        .reset_index())


# Step 2: In preparation for a matrix The data must be transformed so that the three columns that are values in the matrix must be in long format.
df = pd.wide_to_long(df, stubnames='Shift_', i='ID', j='Shift').reset_index().rename(columns={'Shift_':'Values'})

# Step 3: 1,2,3 were required integer suffixes for wide_to_long but now let's change to what we want the columns to be called.
df['Shift'] = df['Shift'].replace([1,2,3],['First Shift','Last Shift','# of shift'])

# Step 4: Create the matrix be setting index and unstacking to columns
df = (df.sort_values(['Employee', 'Store', 'Department']) #values must be sorted in order for how we want columns to appear in matrix format
        .set_index(['Employee', 'ID', 'Level', 'duties', 'Store', 'Department', 'Shift'])
        .unstack(['Store', 'Department', 'Shift']).fillna(0)) 

# Step 5: Cleanup of Multi-index into desred format
df.columns = df.columns.reorder_levels([1,2,3,0]).droplevel(3)
df = df.reset_index()
df
Out[1]: 
Store      Employee     ID Level duties                   s1  \
Department                                                d1   
Shift                                            First Shift   
0                 a  s1ad1     2      O  2020-08-01 00:00:00   
1                 a  s1ad2     2      C                    0   
2                 a  s2ad1     2      O                    0   
3                 a  s2ad3     2      C                    0   
4                 b  s1bd1     1      O  2020-08-01 00:00:00   
5                 b  s1bd2     1      C                    0   
6                 c  s2ac1     3      O                    0   
7                 c  s2cd3     3      C                    0   
8                 d  s3cd1     3      O                    0   
9                 d  s3dd3     3      C                    0   

Store                                                            \
Department                                                   d2   
Shift                Last Shift # of shift          First Shift   
0           2020-08-04 00:00:00          3                    0   
1                             0          0  2020-08-02 00:00:00   
2                             0          0                    0   
3                             0          0                    0   
4           2020-08-04 00:00:00          3                    0   
5                             0          0  2020-08-02 00:00:00   
6                             0          0                    0   
7                             0          0                    0   
8                             0          0                    0   
9                             0          0                    0   

Store                                       ...         s2  \
Department                                  ...         d1   
Shift                Last Shift # of shift  ... # of shift   
0                             0          0  ...          0   
1           2020-08-05 00:00:00          2  ...          0   
2                             0          0  ...          3   
3                             0          0  ...          0   
4                             0          0  ...          0   
5           2020-08-05 00:00:00          2  ...          0   
6                             0          0  ...          3   
7                             0          0  ...          0   
8                             0          0  ...          0   
9                             0          0  ...          0   

Store                                                            \
Department                   d3                                   
Shift               First Shift           Last Shift # of shift   
0                             0                    0          0   
1                             0                    0          0   
2                             0                    0          0   
3           2020-08-09 00:00:00  2020-08-12 00:00:00          2   
4                             0                    0          0   
5                             0                    0          0   
6                             0                    0          0   
7           2020-08-09 00:00:00  2020-08-12 00:00:00          2   
8                             0                    0          0   
9                             0                    0          0   

Store                        s3                                  \
Department                   d1                                   
Shift               First Shift           Last Shift # of shift   
0                             0                    0          0   
1                             0                    0          0   
2                             0                    0          0   
3                             0                    0          0   
4                             0                    0          0   
5                             0                    0          0   
6                             0                    0          0   
7                             0                    0          0   
8           2020-08-08 00:00:00  2020-08-11 00:00:00          3   
9                             0                    0          0   

Store                                                            
Department                   d3                                  
Shift               First Shift           Last Shift # of shift  
0                             0                    0          0  
1                             0                    0          0  
2                             0                    0          0  
3                             0                    0          0  
4                             0                    0          0  
5                             0                    0          0  
6                             0                    0          0  
7                             0                    0          0  
8                             0                    0          0  
9           2020-08-09 00:00:00  2020-08-12 00:00:00          2  

[10 rows x 22 columns]

相关问题 更多 >

    热门问题