I'm looking at using Python and pandas to flatten our VLE (Blackboard Inc.) activity table. I'm trying to compute the total time spent per day accessing courses, as opposed to the other, non-course activity recorded in the activity log/table.
I've created some fake data and code (Python) below to simulate the problem and show where I'm stuck. It's close enough to my real situation to capture what I'm struggling with.
The log data typically looks like the following, which I generate in the code sample below (the activity dataframe in the code):
DAY event somethingelse timespent logtime
0 2013-01-02 null foo 0.274139 2013-01-02 00:00:00
0 2013-01-02 course1 foo 1.791061 2013-01-02 01:00:00
1 2013-01-02 course1 foo 0.824152 2013-01-02 02:00:00
2 2013-01-02 course1 foo 1.626477 2013-01-02 03:00:00
In the real data there is a field called logtime. This is an actual datetime, not a time-spent field (a fake timespent field is also included in my fake data from when I was experimenting).
How do I record the total time spent on event = course (there are many courses), using logtime?
Each record's logtime is the datetime at which a page was accessed; the next record's logtime is the datetime at which a new page was accessed, and therefore (close enough) when the old page was left. How do I get the total time where event is not null? If I just use max/min values, the result is an overestimate, because gaps between course accesses (event = null) get included too. Note that I've simplified the fake data so that each record advances by exactly one hour; that isn't the case in the real data.
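To make the overestimate concrete, here is a minimal sketch (hypothetical toy data, not the code below): with hourly timestamps and a null gap in the middle, max - min counts the gap as course time:

```python
import pandas as pd

# Hypothetical toy log: course1 at 00:00 and 01:00, a two-hour null gap,
# then course1 again at 04:00.
log = pd.DataFrame({
    'event': ['course1', 'course1', 'null', 'null', 'course1'],
    'logtime': pd.to_datetime(['2013-01-02 00:00', '2013-01-02 01:00',
                               '2013-01-02 02:00', '2013-01-02 03:00',
                               '2013-01-02 04:00']),
})

# Dropping the null rows and taking max - min spans the whole session...
courses = log[log['event'] != 'null']
naive = courses['logtime'].max() - courses['logtime'].min()
print(naive)  # 4 hours -- but 2 of those hours were spent on null events
```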
Thanks for any tips, Jason
The code is:
# dataframe example
# How do I record total time spent on event = course (many courses)?
# Each record contains logtime which shows datetime to access page
# Next record logtime shows the datetime accessing new page and
# therefore leaving old page (close enough)
#
#
import pandas as pd
import numpy as np
import datetime
# Creating fake data with string null and course1, course2
df = pd.DataFrame({
'DAY' : pd.Timestamp('20130102'),
'timespent' : abs(np.random.randn(5)),
'event' : "course1",
'somethingelse' : 'foo' })
df2 = pd.DataFrame({
'DAY' : pd.Timestamp('20130102'),
'timespent' : abs(np.random.randn(5)),
'event' : "course2",
'somethingelse' : 'foo' })
dfN = pd.DataFrame({
'DAY' : pd.Timestamp('20130102'),
'timespent' : abs(np.random.randn(1)),
'event' : "null",
'somethingelse' : 'foo' })
dfLog = [dfN, df, df2, dfN, dfN, dfN, df2, dfN, dfN, df, dfN, df2, dfN, df, df2, dfN]
activity = pd.concat(dfLog)
# add time column
times = pd.date_range('20130102', periods=activity.shape[0], freq='H')
activity['logtime'] = times
# activity contains a DAY field (probably not required)
# timespent -this is fake time spent on each event. This is
# not in my real data but I started this way when faking data
# event -either a course or null (not a course)
# somethingelse -just there to indicate other data.
#
print(activity)  # This is quite close to real data.
# Fake activity date created above to demo question.
# *********************************************
# Actual code to extract time spent on courses
# *********************************************
# Lambda function to aggregate data -max and min
# Where time diff each minutes.
def agg_timespent(a, b):
c = abs(b-a)
return c
# Where the time difference is not explicit but is
# record of time recorded when accessing page (course event)
def agg_logtime(a, b):
# In real data b and a are strings
# b = datetime.datetime.strptime(b, '%Y-%m-%d %H:%M:%S')
# a = datetime.datetime.strptime(a, '%Y-%m-%d %H:%M:%S')
c = abs(b-a).seconds
return c
# Remove 'null' data as that's not of interest here.
# null means non course activity e.g. checking email
# or timetable -non course stuff.
activity = activity[activity.event != 'null']
print(activity)  # This shows *just* course activity info
# pivot by Day (only 1 day in fake data but 1 year in real data)
# Don't need DAY field but helped me fake-up data
flattened_v1 = activity.pivot_table(index=['DAY'], values=["timespent"],aggfunc=[min, max],fill_value=0)
flattened_v1['time_diff'] = flattened_v1.apply(lambda row: agg_timespent(row[0], row[1]), axis=1)
# How to achieve this?
# Where NULL has been removed I think this is wrong as NULL records could
# indicate several hours gap between course accesses but as
# I'm using MAX and MIN then I'm ignoring the periods of null
# This is overestimating time on courses
# I need to subtract/remove/ignore?? the hours spent on null times
flattened_v2 = activity.pivot_table(index=['DAY'], values=["logtime"],aggfunc=[min, max],fill_value=0)
flattened_v2['time_diff'] = flattened_v2.apply(lambda row: agg_logtime(row[0], row[1]), axis=1)
print()
print('*****Wrong!**********')
print('This is not what I have, just showing how I thought it might work.')
print(flattened_v1)
print()
print('******Not sure how to do this*********')
print('This is wrong as nulls/gaps are included too')
print(flattened_v2)
You're right (in your comment): you need dataframe.shift. If I understand your question correctly, you want to record the time elapsed since the previous timestamp, so each timestamp marks the start of an activity, and when the previous activity was null we shouldn't record any elapsed time. Assuming all that is correct, use shift to add a column of time differences. The first row will then hold the special "not a time" value NaT, but that's fine, since there is no earlier timestamp to compute an elapsed time from. Next, we can fill in more NaT values for any elapsed time that immediately follows a null event. Finally, when we want to know how much time was spent on course1, we have to shift again, because indexing on the course1 rows gives the rows where course1 is just starting; we need the rows where course1 is finishing or has just finished. On your example data, course1 comes out to 15 hours and course2 to 20 hours.
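The steps above can be sketched as follows (a minimal, self-contained example with hypothetical hourly data, not the asker's real log; filtering on the shifted event column handles both dropping null time and attributing time to the right course in one step):

```python
import pandas as pd

# Hypothetical hourly log in the shape of the question's fake data.
activity = pd.DataFrame({
    'event': ['null', 'course1', 'course1', 'null', 'course2', 'course2', 'null'],
})
activity['logtime'] = pd.date_range('2013-01-02', periods=len(activity), freq='h')

# Step 1: elapsed time since the previous timestamp. The first row gets NaT,
# which is fine -- there is no earlier timestamp to measure from.
activity['elapsed'] = activity['logtime'] - activity['logtime'].shift()

# Step 2: each row's elapsed time was spent on the *previous* row's event,
# so shift the event column too. Rows whose previous event was 'null'
# are then excluded, which removes the gaps from the totals.
activity['spent_on'] = activity['event'].shift()

course_time = (activity[activity['spent_on'] != 'null']
               .groupby('spent_on')['elapsed']
               .sum())
print(course_time)  # course1 and course2 each total 2 hours here
```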