How to group by multiple columns in Python

Posted 2024-06-25 22:50:35


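I want to group a DataFrame by several columns: user ID, date, and category (how often each category is used per day), together with the maximum duration for each category and the most-used part of the day, and finally store the result in a .csv file.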

name     duration  UserId  category    part_of_day    Date 
Settings   3.436    1    System tool      evening   2020-09-10 
Calendar   2.167    1    Calendar         night     2020-09-11 
Calendar   5.705    1    Calendar         night     2020-09-11 
Messages   7.907    1   Phone_and_SMS     night     2020-09-11 
Instagram   50.285   9   Social            night    2020-09-28  
Drive       30.260   9  Productivity       night    2020-09-28   

df.groupby(["UserId", "Date","category"])["category"].count()

The result of my code is:

 UserId      Date        category               
1       2020-09-10    System tool                  1
       2020-09-11     Calendar                     8
                     Clock                         2
                    Communication                  86
                  Health & Fitness                 5     

But I want this result:

 UserId      Date        category             count(category)  max-duration 
1       2020-09-10  System tool                  1            3

        2020-09-11  Calendar                     2            5

2       2020-09-28    Social                     1            50
                      Productivity               1            30

How can I do this? I could not find any solution that produces the desired result.


3 Answers

Data:

import pandas as pd

df = pd.DataFrame({'name    ': {0: 'Settings', 1: 'Calendar', 2: 'Calendar', 3: 'Messages', 4: 'Instagram', 5: 'Drive'}, '  duration': {0: 3.4360000000000004, 1: 2.167, 2: 5.705, 3: 7.907, 4: 50.285, 5: 30.26}, ' UserId': {0: 1, 1: 1, 2: 1, 3: 1, 4: 9, 5: 9}, '  category': {0: '       System tool', 1: '       Calendar', 2: '       Calendar', 3: '       Phone_and_SMS', 4: '       Social', 5: '       Productivity'}, '     part_of_day': {0: '  evening', 1: '     night  ', 2: '     night  ', 3: 'night  ', 4: '       night  ', 5: ' night  '}, ' Date': {0: '     2020-09-10', 1: '     2020-09-11', 2: '     2020-09-11', 3: '     2020-09-11', 4: '     2020-09-28', 5: '     2020-09-28'}})
df.columns = df.columns.str.strip()

df:

        name  duration  UserId              category     part_of_day             Date
0   Settings     3.436       1           System tool         evening       2020-09-10
1   Calendar     2.167       1              Calendar         night         2020-09-11
2   Calendar     5.705       1              Calendar         night         2020-09-11
3   Messages     7.907       1         Phone_and_SMS         night         2020-09-11
4  Instagram    50.285       9                Social         night         2020-09-28
5      Drive    30.260       9          Productivity         night         2020-09-28
# Pass 'max' as a string; recent pandas versions warn when the builtin max is passed to agg.
grouping = df.groupby(["UserId", "Date", "category"]).agg({"category": "count", "duration": "max"}).rename(columns={"duration": "max-duration"})

grouping:

                                             category  max-duration
UserId Date            category                                    
1           2020-09-10        System tool           1         3.436
            2020-09-11        Calendar              2         5.705
                              Phone_and_SMS         1         7.907
9           2020-09-28        Productivity          1        30.260
                              Social                1        50.285
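Note that `df.columns.str.strip()` only cleans the column names; the string values in the frame above keep their padding. Keys that differ only in whitespace, such as the `part_of_day` values, would land in separate groups. The following is a minimal sketch of stripping the values before grouping. The small frame and the list of columns to clean are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Illustrative data with inconsistent padding, similar to the frame above.
df = pd.DataFrame({
    "UserId": [1, 1, 1],
    "part_of_day": ["  night  ", "night  ", " evening"],
    "duration": [2.167, 5.705, 3.436],
})

# Strip leading/trailing whitespace from the text columns used as group keys.
for col in ["part_of_day"]:  # assumed list of text columns
    df[col] = df[col].str.strip()

# Both padded spellings of "night" now fall into the same group.
counts = df.groupby(["UserId", "part_of_day"])["duration"].count()
print(counts)
```

Without the strip step, the two "night" rows would be counted as two different groups.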

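You can produce the desired output in one line with pandas.DataFrame.groupby and pandas.DataFrame.aggregate. Named aggregation sets the output column names directly, so a separate pandas.DataFrame.rename is not needed: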


Code:

import pandas as pd

df = pd.DataFrame({'name': ['Settings','Calendar','Calendar', 'Messages', 'Instagram', 'Drive'],
                   'duration': [3.436, 2.167, 5.7050, 7.907, 50.285, 30.260],
                   'UserId': [1, 1, 1, 1, 2, 2],
                   'category' : ['System_tool', 'Calendar', 'Calendar', 'Phone_and_SMS', 'Social', 'Productivity'],
                   'part_of_day' : ['evening', 'night','night','night','night','night' ],
                   'Date' : ['2020-09-10', '2020-09-11', '2020-09-11', '2020-09-11', '2020-09-28', '2020-09-28'] })

df.groupby(['UserId', 'Date', 'category']).aggregate( count_cat = ('category', 'count'), max_duration = ('duration', 'max'))

Output:

[Image: output from one line]
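The question also asks for the most-used part of the day and for the result to be saved as a .csv file, which the one-liner does not cover. One possible extension is sketched below. It assumes the most-used part of the day is wanted per (UserId, Date, category) group, takes it as the mode of `part_of_day`, and counts rows via the `name` column. The column name `top_part_of_day` and the file name `usage_summary.csv` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Settings", "Calendar", "Calendar", "Messages", "Instagram", "Drive"],
    "duration": [3.436, 2.167, 5.705, 7.907, 50.285, 30.260],
    "UserId": [1, 1, 1, 1, 2, 2],
    "category": ["System_tool", "Calendar", "Calendar", "Phone_and_SMS", "Social", "Productivity"],
    "part_of_day": ["evening", "night", "night", "night", "night", "night"],
    "Date": ["2020-09-10", "2020-09-11", "2020-09-11", "2020-09-11", "2020-09-28", "2020-09-28"],
})

summary = (
    df.groupby(["UserId", "Date", "category"])
      .agg(
          count_cat=("name", "count"),        # number of rows in each group
          max_duration=("duration", "max"),   # longest single use in each group
          # Most frequent part of day; on a tie, the first value in sorted order wins.
          top_part_of_day=("part_of_day", lambda s: s.mode().iloc[0]),
      )
      .reset_index()  # turn the group keys back into ordinary columns
)

# Hypothetical output file name; index=False leaves out the row numbers.
summary.to_csv("usage_summary.csv", index=False)
print(summary)
```

Calling `reset_index()` before `to_csv` keeps UserId, Date, and category as regular columns in the file. If the most-used part of the day is wanted per day rather than per category, group by `["UserId", "Date"]` for that column instead.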

Use agg:

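import numpy as np

# np.ptp returns the per-group range (max - min), so it needs a numeric column
# such as duration; use 'max' instead to get the maximum itself.
df.groupby(["UserId", "Date", "category"]).agg({'category': 'count',
                                                 'duration': np.ptp})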

Alternatively, replace np.ptp with lambda x: x.max() - x.min().
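The lambda computes a range (max minus min), which is not the same as the maximum the question asks for. A small sketch that puts the two side by side on the `duration` column follows. The three-row frame and the output column names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": [1, 1, 1],
    "Date": ["2020-09-11"] * 3,
    "category": ["Calendar", "Calendar", "Phone_and_SMS"],
    "duration": [2.167, 5.705, 7.907],
})

result = df.groupby(["UserId", "Date", "category"]).agg(
    count_cat=("duration", "count"),
    # Range within each group: largest duration minus smallest duration.
    duration_range=("duration", lambda x: x.max() - x.min()),
    # Maximum within each group, as requested in the question.
    max_duration=("duration", "max"),
)
print(result)
```

For a group with a single row the range is 0, while the maximum is that row's duration.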
