I'm looking at using Python and pandas to flatten our VLE (Blackboard Inc.) activity table. I'm trying to compute the total time spent per day accessing courses, as opposed to the other, non-course activity recorded in the activity log/table.
I've created some fake data and code (Python) below to simulate the problem and show where I'm stuck. It's close enough to my real situation to capture what I'm struggling with.
The log data typically looks like the following, which I generate in the code sample below (the activity dataframe in the code):
DAY event somethingelse timespent logtime
0 2013-01-02 null foo 0.274139 2013-01-02 00:00:00
0 2013-01-02 course1 foo 1.791061 2013-01-02 01:00:00
1 2013-01-02 course1 foo 0.824152 2013-01-02 02:00:00
2 2013-01-02 course1 foo 1.626477 2013-01-02 03:00:00
In the real data there is a field called logtime. This is an actual datetime, not a time-spent field (a fake timespent field is also included in my fake data from when I was experimenting).
How do I record the total time spent on event = course (there are many courses), using logtime?
Each record's logtime is the datetime at which a page was accessed; the next record's logtime is the datetime at which a new page was accessed, and therefore (close enough) when the old page was left. How do I get the total time where event is not null? If I just use max/min values, the result is an overestimate, because gaps between course accesses (event = null) get included too. Note that I've simplified the fake data so that each record advances by exactly one hour; that isn't the case in the real data.
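To make the overestimate concrete, here is a minimal sketch (hypothetical toy data, not the code below): with hourly timestamps and a null gap in the middle, max - min counts the gap as course time:

```python
import pandas as pd

# Hypothetical toy log: course1 at 00:00 and 01:00, a two-hour null gap,
# then course1 again at 04:00.
log = pd.DataFrame({
    'event': ['course1', 'course1', 'null', 'null', 'course1'],
    'logtime': pd.to_datetime(['2013-01-02 00:00', '2013-01-02 01:00',
                               '2013-01-02 02:00', '2013-01-02 03:00',
                               '2013-01-02 04:00']),
})

# Dropping the null rows and taking max - min spans the whole session...
courses = log[log['event'] != 'null']
naive = courses['logtime'].max() - courses['logtime'].min()
print(naive)  # 4 hours -- but 2 of those hours were spent on null events
```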
Thanks for any tips, Jason
The code is:
# dataframe example
# How do I record total time spent on event = course (many courses)?
# Each record contains logtime which shows datetime to access page
# Next record logtime shows the datetime accessing new page and
# therefore leaving old page (close enough)
#
#
import pandas as pd
import numpy as np
import datetime
# Creating fake data with string null and course1, course2
df = pd.DataFrame({
'DAY' : pd.Timestamp('20130102'),
'timespent' : abs(np.random.randn(5)),
'event' : "course1",
'somethingelse' : 'foo' })
df2 = pd.DataFrame({
'DAY' : pd.Timestamp('20130102'),
'timespent' : abs(np.random.randn(5)),
'event' : "course2",
'somethingelse' : 'foo' })
dfN = pd.DataFrame({
'DAY' : pd.Timestamp('20130102'),
'timespent' : abs(np.random.randn(1)),
'event' : "null",
'somethingelse' : 'foo' })
dfLog = [dfN, df, df2, dfN, dfN, dfN, df2, dfN, dfN, df, dfN, df2, dfN, df, df2, dfN]
activity = pd.concat(dfLog)
# add time column
times = pd.date_range('20130102', periods=activity.shape[0], freq='H')
activity['logtime'] = times
# activity contains a DAY field (probably not required)
# timespent -this is fake time spent on each event. This is
# not in my real data but I started this way when faking data
# event -either a course or null (not a course)
# somethingelse -just there to indicate other data.
#
print(activity)  # This is quite close to real data.
# Fake activity date created above to demo question.
# *********************************************
# Actual code to extract time spent on courses
# *********************************************
# Lambda function to aggregate data -max and min
# Where time diff each minutes.
def agg_timespent(a, b):
c = abs(b-a)
return c
# Where the time difference is not explicit but is
# record of time recorded when accessing page (course event)
def agg_logtime(a, b):
# In real data b and a are strings
# b = datetime.datetime.strptime(b, '%Y-%m-%d %H:%M:%S')
# a = datetime.datetime.strptime(a, '%Y-%m-%d %H:%M:%S')
c = abs(b-a).seconds
return c
# Remove 'null' data as that's not of interest here.
# null means non course activity e.g. checking email
# or timetable -non course stuff.
activity = activity[activity.event != 'null']
print(activity)  # This shows *just* course activity info
# pivot by Day (only 1 day in fake data but 1 year in real data)
# Don't need DAY field but helped me fake-up data
flattened_v1 = activity.pivot_table(index=['DAY'], values=["timespent"],aggfunc=[min, max],fill_value=0)
flattened_v1['time_diff'] = flattened_v1.apply(lambda row: agg_timespent(row[0], row[1]), axis=1)
# How to achieve this?
# Where NULL has been removed I think this is wrong as NULL records could
# indicate several hours gap between course accesses but as
# I'm using MAX and MIN then I'm ignoring the periods of null
# This is overestimating time on courses
# I need to subtract/remove/ignore?? the hours spent on null times
flattened_v2 = activity.pivot_table(index=['DAY'], values=["logtime"],aggfunc=[min, max],fill_value=0)
flattened_v2['time_diff'] = flattened_v2.apply(lambda row: agg_logtime(row[0], row[1]), axis=1)
print()
print('*****Wrong!**********')
print('This is not what I have, just showing how I thought it might work.')
print(flattened_v1)
print()
print('******Not sure how to do this*********')
print('This is wrong as nulls/gaps are included too')
print(flattened_v2)
You're right (in your comment): you need dataframe.shift. If I understand your question correctly, you want to record the time elapsed since the previous timestamp, so each timestamp marks the start of an activity, and when the previous activity was null we shouldn't record any elapsed time. Assuming all that is correct, use shift to add a column of time differences. The first row will then hold the special "not a time" value NaT, but that's fine, since there is no earlier timestamp to compute an elapsed time from. Next, we can fill in more NaT values for any elapsed time that immediately follows a null event. Finally, when we want to know how much time was spent on course1, we have to shift again, because indexing on the course1 rows gives the rows where course1 is just starting; we need the rows where course1 is finishing or has just finished. On your example data, course1 comes out to 15 hours and course2 to 20 hours.
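The steps above can be sketched as follows (a minimal, self-contained example with hypothetical hourly data, not the asker's real log; filtering on the shifted event column handles both dropping null time and attributing time to the right course in one step):

```python
import pandas as pd

# Hypothetical hourly log in the shape of the question's fake data.
activity = pd.DataFrame({
    'event': ['null', 'course1', 'course1', 'null', 'course2', 'course2', 'null'],
})
activity['logtime'] = pd.date_range('2013-01-02', periods=len(activity), freq='h')

# Step 1: elapsed time since the previous timestamp. The first row gets NaT,
# which is fine -- there is no earlier timestamp to measure from.
activity['elapsed'] = activity['logtime'] - activity['logtime'].shift()

# Step 2: each row's elapsed time was spent on the *previous* row's event,
# so shift the event column too. Rows whose previous event was 'null'
# are then excluded, which removes the gaps from the totals.
activity['spent_on'] = activity['event'].shift()

course_time = (activity[activity['spent_on'] != 'null']
               .groupby('spent_on')['elapsed']
               .sum())
print(course_time)  # course1 and course2 each total 2 hours here
```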