I have time-series entries that need to be resampled. In the more extreme case, I imagine someone generating 15 months of data, which tends to be about 1,300 records (roughly 5 location entries for every 2 metric entries). But after resampling to 15-minute intervals, the full set is about 41,000 rows.
My data has fewer than a couple dozen columns for now, so that's about 20 columns * 40k rows ≈ 800k values to compute. It seems like I really ought to be able to get this under 10 seconds. I've done an initial profile, and the bottleneck appears to be mostly in a couple of pandas resample calls I make; they're surprisingly slow! To the point where I'm wondering whether something is wrong... Why is pandas resample so slow?
This causes a timeout in a Google Cloud Function, which is what I need to avoid.
There are two sets of data: locations and metrics. Sample location data might look like this:
location bar girlfriends grocers home lunch park relatives work
date user
2018-01-01 00:00:01 0ce65715-4ec7-4ca2-aab0-323c57603277 0 0 0 1 0 0 0 0
Sample metric data might look like this:
user date app app_id metric
0 4fb488bc-aea0-4f1e-9bc8-d7a8382263ef 2018-01-01 01:30:43 app_2 c2bfd6fb-44bb-499d-8e53-4d5af522ad17 0.02
1 6ca1a9ce-8501-49f5-b7d9-70ac66331fdc 2018-01-01 04:14:59 app_2 c2bfd6fb-44bb-499d-8e53-4d5af522ad17 0.10
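For reproduction, here is a minimal sketch that builds toy frames in roughly these shapes (sizes, user ids, and values are made up; note that the pipeline below expects locationDf in long form with a location column, of which the printout above is the pivoted view):

import numpy as np
import pandas as pd

# toy frames in roughly the shapes above; sizes, user ids and values are made up
np.random.seed(0)
dates = pd.date_range('2018-01-01', periods=600, freq='11min')
users = np.random.choice(['user_a', 'user_b'], size=len(dates))

# long-form location change events (the code below pivots a 'location' column)
locationDf = pd.DataFrame({
    'date': dates,
    'user': users,
    'location': np.random.choice(['bar', 'grocers', 'home', 'work'], size=len(dates)),
})

# metric samples, one row per app reading
appDf = pd.DataFrame({
    'date': dates,
    'user': users,
    'app': np.random.choice(['app_1', 'app_2'], size=len(dates)),
    'app_id': 'c2bfd6fb-44bb-499d-8e53-4d5af522ad17',
    'metric': np.random.random(len(dates)).round(2),
})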
I need to merge the two subsets into a single ledger with a column per location name and per app. The values in the app columns are samples of a continuous quantity, so I need to "connect the dots". The values in the location columns are location-change events, so I need to keep repeating the same value until the next change event. All in all, it ends up like this:
app_1 app_2 user bar grocers home lunch park relatives work
date
2018-01-31 00:00:00 0.146250 0.256523 4fb488bc-aea0-4f1e-9bc8-d7a8382263ef 0 0 1 0 0 0 0
2018-01-31 00:15:00 0.146290 0.256562 4fb488bc-aea0-4f1e-9bc8-d7a8382263ef 0 0 0 0 0 0 1
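Put differently, the two column families need different NaN handling after resampling: location columns are step functions, app columns are samples of a continuous value. A tiny illustration of the two fills, on made-up series:

import numpy as np
import pandas as pd

idx = pd.date_range('2018-01-31', periods=5, freq='15T')

# location columns: repeat the last change event until the next one
loc = pd.Series([1, np.nan, np.nan, 0, np.nan], index=idx)
print(loc.ffill().tolist())  # [1.0, 1.0, 1.0, 0.0, 0.0]

# app columns: samples of a continuous quantity, so "connect the dots"
met = pd.Series([0.10, np.nan, np.nan, 0.25, np.nan], index=idx)
print(met.interpolate(method='linear', limit_area='inside').tolist())
# [0.10, 0.15, 0.20, 0.25, nan]; the trailing gap stays NaN with limit_area='inside'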
This code does the job, but it needs to be optimized. What's the weakest link here? I've added basic per-section profiling:
import time
import numpy as np
import pandas as pd

start = time.time()

locDf = locationDf.copy()
locDf.set_index('date', inplace=True)

# convert location data to "15 minute interval" rows
locDfs = {}
for user, user_loc_dc in locDf.groupby('user'):
    locDfs[user] = user_loc_dc.resample('15T').agg('max').bfill()

aDf = appDf.copy()
aDf.set_index('date', inplace=True)
print("section1:", time.time() - start)

userLocAppDfs = {}
for user, a2_df in aDf.groupby('user'):
    start = time.time()
    # per user, convert app data to 15m intervals
    userDf = a2_df.resample('15T').agg('max')
    print("section2.1:", time.time() - start)

    start = time.time()
    # assign the metric for each app to a per-app column, per user
    userDf.reset_index(inplace=True)
    userDf = pd.crosstab(index=userDf['date'], columns=userDf['app'],
                         values=userDf['metric'], aggfunc=np.mean).fillna(np.nan, downcast='infer')
    userDf['user'] = user
    userDf.reset_index(inplace=True)
    userDf.set_index('date', inplace=True)
    print("section2.2:", time.time() - start)

    start = time.time()
    # reapply 15m intervals now that we have new data per app
    userLocAppDfs[user] = userDf.resample('15T').agg('max')
    print("section2.3:", time.time() - start)

    start = time.time()
    # assign location data to per-location columns: creates a "1" at the 15m
    # interval of each location change event in that location's column
    loDf = locDfs[user]
    loDf.reset_index(inplace=True)
    loDf = pd.crosstab([loDf.date, loDf.user], loDf.location)
    loDf.reset_index(inplace=True)
    loDf.set_index('date', inplace=True)
    loDf.drop('user', axis=1, inplace=True)
    print("section2.4:", time.time() - start)

    start = time.time()
    # join the location crosstab columns with the app crosstab columns per user
    userLocAppDfs[user] = userLocAppDfs[user].join(loDf, how='outer')
    # convert from just a "1" at each location change event, followed by zeros,
    # to a "1" continuing until the next location change
    userLocAppDfs[user] = userLocAppDfs[user].resample('15T').agg('max')
    userLocAppDfs[user]['user'].fillna(user, inplace=True)
    print("section2.5:", time.time() - start)

    start = time.time()
    # fill location NaNs
    for loc in locationDf[locationDf['user'] == user].location.unique():
        userLocAppDfs[user][loc] = userLocAppDfs[user][loc].replace(np.nan, 0)
    print("section3:", time.time() - start)

    start = time.time()
    # fill app NaNs
    for app in a2_df['app'].unique():
        userLocAppDfs[user][app].interpolate(method='linear', limit_area='inside', inplace=True)
        userLocAppDfs[user][app].fillna(value=0, inplace=True)
    print("section4:", time.time() - start)
Results:
section1: 41.67342448234558
section2.1: 11.441165685653687
section2.2: 0.020460128784179688
section2.3: 5.082422733306885
section2.4: 0.2675948143005371
section2.5: 40.296404123306274
section3: 0.0076410770416259766
section4: 0.0027387142181396484
section2.1: 11.567803621292114
section2.2: 0.02080368995666504
section2.3: 7.187351703643799
section2.4: 0.2625312805175781
section2.5: 40.669641733169556
section3: 0.0072269439697265625
section4: 0.00457453727722168
section2.1: 11.773712396621704
section2.2: 0.019629478454589844
section2.3: 6.996192693710327
section2.4: 0.2728455066680908
section2.5: 45.172399282455444
section3: 0.0071871280670166016
section4: 0.004514217376708984
Both of the "big" sections contain calls to resample and agg('max').
I found this question from 12 months ago: Pandas groupby + resample first is really slow - since version 0.22. It seems resample() within a groupby is currently broken.
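For reference, one possible workaround is to do a single grouped resample over all users with pd.Grouper rather than calling .resample() once per group; a rough, untested sketch against the frames above (whether it wins depends on the pandas version):

import pandas as pd

# bucket the 'date' column into 15-minute bins inside a single groupby,
# replacing the per-user .resample() loop with one pass over all users
binned = (appDf
          .groupby(['user', pd.Grouper(key='date', freq='15T')])
          .agg('max'))

# per-user slices are still available afterwards, e.g. binned.loc[some_user]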