Need to optimize/avoid pandas.resample inside a groupby call (currently >160s for 1.4k rows; needs to be <60s)


I've got time-series entries that need resampling. In a more extreme case, I imagine a user could generate 15 months of data, which tends to be about 1,300 records (roughly 5 location entries for every 2 metric entries). After resampling to 15-minute intervals, though, the full set is about 41,000 rows.

My data is currently under a couple dozen columns, so we're talking about 20 columns * 40k rows ≈ 800k values to compute. It seems like I should really be able to get this under 10 seconds. I've done an initial profile, and the bottleneck appears to be in a pair of pandas resample calls, which are surprisingly slow! Slow enough that I wonder whether something is wrong. Why is pandas resample this slow?
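
For anyone wanting to reproduce the profile, something along these lines works (a sketch only; run_pipeline is a hypothetical wrapper around the code further down):

import cProfile
import pstats

# Run the whole merge/resample pipeline under the profiler, then list
# the 20 entries with the largest cumulative time.
cProfile.run('run_pipeline()', 'resample_profile.stats')
pstats.Stats('resample_profile.stats').sort_stats('cumulative').print_stats(20)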

This causes a timeout in a Google Cloud Function, and that is what I need to avoid.

There are two sets of data: locations and metrics. Sample location data looks like this:

location                                                  bar  girlfriends  grocers  home  lunch  park  relatives  work
date                user
2018-01-01 00:00:01 0ce65715-4ec7-4ca2-aab0-323c57603277    0            0        0     1      0     0          0     0

Sample metric data looks like this:

   user                                  date                 app    app_id                                metric
0  4fb488bc-aea0-4f1e-9bc8-d7a8382263ef  2018-01-01 01:30:43  app_2  c2bfd6fb-44bb-499d-8e53-4d5af522ad17  0.02
1  6ca1a9ce-8501-49f5-b7d9-70ac66331fdc  2018-01-01 04:14:59  app_2  c2bfd6fb-44bb-499d-8e53-4d5af522ad17  0.10

I need to merge these two subsets into a single ledger, with a column per location name and a column per app. The app values are samples of a continuous quantity, so I need to "connect the dots" between them. The location values are location-change events, so I need to keep repeating the same value until the next change event (see the fill-semantics sketch after the table below). Altogether it comes out like this:

                     app_1     app_2     user                                  bar  grocers  home  lunch  park  relatives  work
date
2018-01-31 00:00:00  0.146250  0.256523  4fb488bc-aea0-4f1e-9bc8-d7a8382263ef    0        0     1      0     0          0     0
2018-01-31 00:15:00  0.146290  0.256562  4fb488bc-aea0-4f1e-9bc8-d7a8382263ef    0        0     0      0     0          0     1
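
To make the two fill semantics concrete, here is a tiny sketch with made-up values (it shows the intended behavior, not the actual code below): location columns repeat the value from the last change event, while app columns interpolate between samples.

import numpy as np
import pandas as pd

idx = pd.date_range('2018-01-31', periods=5, freq='15T')

# Location: a change event sets the value, which then repeats
# until the next change event (forward-fill).
home = pd.Series([1, np.nan, np.nan, 0, np.nan], index=idx).ffill()

# App metric: sparse samples of a continuous quantity, so
# "connect the dots" with linear interpolation.
app_2 = pd.Series([0.2565, np.nan, np.nan, 0.2580, np.nan], index=idx)
app_2 = app_2.interpolate(method='linear', limit_area='inside')

print(pd.DataFrame({'app_2': app_2, 'home': home}))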

Here's code that does this, but it needs optimizing. What is the weakest link here? I've added basic per-section profiling:

import time

import numpy as np
import pandas as pd

start = time.time()

locDf = locationDf.copy()
locDf.set_index('date', inplace=True)

# convert location data to "15 minute interval" rows
locDfs = {}
for user, user_loc_dc in locDf.groupby('user'):
    locDfs[user] = user_loc_dc.resample('15T').agg('max').bfill()

aDf = appDf.copy()
aDf.set_index('date', inplace=True)

print("section1:", time.time() - start)

userLocAppDfs = {}
for user, a2_df in aDf.groupby('user'):    
    start = time.time()

    # per user, convert app data to 15m interval
    userDf = a2_df.resample('15T').agg('max')

    print("section2.1:", time.time() - start)
    start = time.time()

    # assign metric for each app to an app column for each app, per user   
    userDf.reset_index(inplace=True)
    userDf = pd.crosstab(index=userDf['date'], columns=userDf['app'], values=userDf['metric'], aggfunc=np.mean).fillna(np.nan, downcast='infer')

    userDf['user'] = user

    userDf.reset_index(inplace=True)
    userDf.set_index('date', inplace=True)

    print("section2.2:", time.time() - start)
    start = time.time()

    # reapply 15m intervals now that we have new data per app
    userLocAppDfs[user] = userDf.resample('15T').agg('max')

    print("section2.3:", time.time() - start)
    start = time.time()

    # assign location data to location columns per location, creates a "1" at the 15m interval of the location change event in the location column created    
    loDf = locDfs[user]
    loDf.reset_index(inplace=True)
    loDf = pd.crosstab([loDf.date, loDf.user], loDf.location)
    loDf.reset_index(inplace=True)

    loDf.set_index('date', inplace=True)
    loDf.drop('user', axis=1, inplace=True)

    print("section2.4:", time.time() - start)
    start = time.time()

    # join the location crosstab columns with the app crosstab columns per user
    userLocAppDfs[user] = userLocAppDfs[user].join(loDf, how='outer')
    # convert from just "1" at each location change event followed by zeros, to "1" continuing until next location change
    userLocAppDfs[user] = userLocAppDfs[user].resample('15T').agg('max')
    userLocAppDfs[user]['user'].fillna(user, inplace=True)

    print("section2.5:", time.time() - start)
    start = time.time()

    for loc in locationDf[locationDf['user'] == user].location.unique():

        # fill location NaNs
        userLocAppDfs[user][loc] = userLocAppDfs[user][loc].replace(np.nan, 0)

    print("section3:", time.time() - start)
    start = time.time()


    # fill app NaNs
    for app in a2_df['app'].unique():
        userLocAppDfs[user][app].interpolate(method='linear', limit_area='inside', inplace=True)       
        userLocAppDfs[user][app].fillna(value=0, inplace=True)

    print("section4:", time.time() - start)

Results:

section1: 41.67342448234558
section2.1: 11.441165685653687
section2.2: 0.020460128784179688
section2.3: 5.082422733306885
section2.4: 0.2675948143005371
section2.5: 40.296404123306274
section3: 0.0076410770416259766
section4: 0.0027387142181396484
section2.1: 11.567803621292114
section2.2: 0.02080368995666504
section2.3: 7.187351703643799
section2.4: 0.2625312805175781
section2.5: 40.669641733169556
section3: 0.0072269439697265625
section4: 0.00457453727722168
section2.1: 11.773712396621704
section2.2: 0.019629478454589844
section2.3: 6.996192693710327
section2.4: 0.2728455066680908
section2.5: 45.172399282455444
section3: 0.0071871280670166016
section4: 0.004514217376708984

Both of the "big" sections contain calls to resample('15T').agg('max').

Notes:

I came across this question from 12 months ago: Pandas groupby + resample first is really slow - since version 0.22. It appears that resample() inside a groupby is currently broken.
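
A commonly suggested workaround for that kind of slowdown is to bin every user in a single pass with pd.Grouper instead of calling resample() once per group. A minimal sketch of the idea against locationDf (note that, unlike resample, a multi-key groupby only emits bins that actually contain rows, so filling out the full 15-minute grid would still take a per-user reindex or resample afterwards):

import pandas as pd

# One groupby over (user, 15-minute bin) replacing the per-user
# resample('15T').agg('max') loop from section 1.
binned = (locationDf
          .groupby(['user', pd.Grouper(key='date', freq='15T')])
          .max())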

