Python: flattening an activity log with pandas.pivot_table to show time spent on activities


I am looking at using Python and pandas to flatten the activity table from our VLE (Blackboard Inc.). I am trying to calculate the total time per day spent accessing courses, as opposed to the other, non-course activity in the activity log/table.

Below I have created some fake data and Python code to simulate the problem and show where I am struggling. It is quite close to my actual situation.

The log data typically looks like the following (I create it in the code sample below as the activity dataframe):

         DAY    event somethingelse  timespent             logtime
0 2013-01-02     null           foo   0.274139 2013-01-02 00:00:00
0 2013-01-02  course1           foo   1.791061 2013-01-02 01:00:00
1 2013-01-02  course1           foo   0.824152 2013-01-02 02:00:00
2 2013-01-02  course1           foo   1.626477 2013-01-02 03:00:00

The real data has a field called logtime. It is an actual datetime, not a time-spent field (although while experimenting I also included a fake timespent field in my fake data).
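(Aside: if, as the commented-out strptime calls in the code below suggest, logtime actually arrives as strings, a single vectorized parse is simpler than converting row by row. A minimal sketch, assuming the '%Y-%m-%d %H:%M:%S' format shown in the sample:)

# Parse the whole logtime column in one call rather than per-row strptime
activity['logtime'] = pd.to_datetime(activity['logtime'], format='%Y-%m-%d %H:%M:%S')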

How do I record the total time spent (using logtime) on event = course (there are many courses)?

Each record's logtime shows the datetime at which a page was accessed. The next record's logtime shows the datetime at which a new page was accessed, and therefore (close enough) when the old page was left. How do I get the total time where event is not null? If I just use the max/min values, the result is an overestimate, because gaps in course access (event = null) get included too. I simplified the data so that each record advances by one hour; that is not the case in the real data.
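To make the overestimate concrete, here is a minimal sketch with hypothetical timestamps (not from my data): max - min spans the null gap, whereas summing per-row differences over just the course rows does not.

import pandas as pd

demo = pd.DataFrame({
    'event': ['course1', 'null', 'course1', 'null'],
    'logtime': pd.to_datetime(['2013-01-02 01:00', '2013-01-02 02:00',
                               '2013-01-02 04:00', '2013-01-02 05:00'])})

courses = demo[demo.event != 'null']
print(courses.logtime.max() - courses.logtime.min())  # 3 hours: the null gap is included

# Duration of each row's activity = next row's logtime minus this row's logtime
demo['diff'] = demo.logtime.shift(-1) - demo.logtime
print(demo.loc[demo.event != 'null', 'diff'].sum())   # 2 hours: the null gap is excluded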

Thanks for any tips, Jason

The code is:

# dataframe example
# How do I record total time spent on event = course (many courses)?
# Each record contains logtime which shows datetime to access page
# Next record logtime shows the datetime accessing new page and
# therefore leaving old page (close enough)
# 
#

import pandas as pd
import numpy as np
import datetime


# Creating fake data with string null and course1, course2
df = pd.DataFrame({
    'DAY' : pd.Timestamp('20130102'),
    'timespent' : abs(np.random.randn(5)),
    'event' : "course1",
    'somethingelse' : 'foo' })

df2 = pd.DataFrame({
    'DAY' : pd.Timestamp('20130102'),
    'timespent' : abs(np.random.randn(5)),
    'event' : "course2",
    'somethingelse' : 'foo' })

dfN = pd.DataFrame({
    'DAY' : pd.Timestamp('20130102'),
    'timespent' : abs(np.random.randn(1)),
    'event' : "null",
    'somethingelse' : 'foo' })


dfLog = [dfN, df, df2, dfN, dfN, dfN, df2, dfN, dfN, df, dfN, df2, dfN, df, df2, dfN]
activity = pd.concat(dfLog)
# add time column
times = pd.date_range('20130102', periods=activity.shape[0], freq='H')
activity['logtime'] = times

# activity contains a DAY field (probably not required)
# timespent -this is fake time spent on each event. This is
# not in my real data but I started this way when faking data
# event -either a course or null (not a course)
# somethingelse -just there to indicate other data. 
#

print(activity)  # This is quite close to real data.

# Fake activity date created above to demo question.

# *********************************************
# Actual code to extract time spent on courses
# *********************************************

# Helper functions to compute the difference between the min and max aggregates.

# Where the time difference is an explicit duration (the fake timespent column).
def agg_timespent(a, b):
    c = abs(b - a)
    return c

# Where the time difference is not explicit but must be derived from the
# time recorded when a page (course event) was accessed.
def agg_logtime(a, b):
    # In the real data, b and a are strings:
    # b = datetime.datetime.strptime(b, '%Y-%m-%d %H:%M:%S')
    # a = datetime.datetime.strptime(a, '%Y-%m-%d %H:%M:%S')
    c = abs(b - a).total_seconds()  # .total_seconds(), not .seconds, so gaps over a day are counted
    return c



# Remove 'null' rows as they are not of interest here.
# null means non-course activity, e.g. checking email
# or the timetable.
activity = activity[activity.event != 'null']

print(activity)  # This shows *just* course activity info

# pivot by Day (only 1 day in fake data but 1 year in real data)
# Don't need DAY field but helped me fake-up data
flattened_v1 = activity.pivot_table(index=['DAY'], values=['timespent'], aggfunc=[min, max], fill_value=0)
flattened_v1['time_diff'] = flattened_v1.apply(lambda row: agg_timespent(row.iloc[0], row.iloc[1]), axis=1)


# How to achieve this?
# With the 'null' rows removed I think this is wrong: null records could
# indicate gaps of several hours between course accesses, but because
# I'm using MAX and MIN, those null periods get included.
# This overestimates the time spent on courses.
# I need to subtract/remove/ignore(?) the hours spent on null events.

flattened_v2 = activity.pivot_table(index=['DAY'], values=['logtime'], aggfunc=[min, max], fill_value=0)
flattened_v2['time_diff'] = flattened_v2.apply(lambda row: agg_logtime(row.iloc[0], row.iloc[1]), axis=1)

print()
print('*****Wrong!**********')
print('This is not what I want, but it shows how I thought it might work.')
print(flattened_v1)
print()
print('******Not sure how to do this*********')
print('This is wrong, as the null gaps are included too.')
print(flattened_v2)

1 Answer

You're right (as you said in your comment): you need DataFrame.shift.

If I've understood your question correctly, you want to record the time elapsed since the previous timestamp: each timestamp marks the start of an activity, and when the previous activity was null we shouldn't record any elapsed time. Assuming all that is correct, use shift to add a column of time differences (note that this needs the 'null' rows still present, so apply it before the activity = activity[activity.event != 'null'] filter in your code):

activity['timelog_diff'] = activity['logtime'] - activity['logtime'].shift()

Now the first row holds the special "not a time" value NaT, but that's fine, since we can't compute an elapsed time for it anyway. Next, we can fill in more NaT values for any elapsed time whose preceding event was null:

mask = activity.event == 'null'
activity.loc[mask.shift(1).fillna(False), 'timelog_diff'] = pd.NaT

When we want to know how much time was spent on course1, we have to shift once more, because indexing on the course1 rows would select the rows where course1 starts; we need the rows where course1 is finishing/has finished:

activity[(activity.event == 'course1').shift().fillna(False)]['timelog_diff'].sum()

With your example data this gives 15 hours for course1 and 20 hours for course2.
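For many courses at once, the same idea can be collapsed into a single groupby over a shifted event column. A minimal end-to-end sketch building on the code above (the prev_event column name is my own):

# Attribute each elapsed interval to the event that was running during it,
# i.e. the event on the previous row, then total per course.
activity['prev_event'] = activity['event'].shift()
per_course = (activity[activity['prev_event'] != 'null']
              .groupby('prev_event')['timelog_diff']
              .sum())
print(per_course)
# Expected with the fake data above:
# prev_event
# course1   0 days 15:00:00
# course2   0 days 20:00:00

Adding 'DAY' to the groupby key would give the per-day totals the pivot_table in the question was aiming for.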
