I'm working on an analysis that requires computing summary statistics and bootstrapped confidence intervals for every record in a 12-million-row dataset. The statistics for a given record must be based only on records that share its location_id, hour, and day values and whose creation timestamp (the two-hour bucket, two_hour_buckets_x here) is earlier than the timestamp of the record being computed.
I've tried several different approaches, but I'm struggling to find one that returns the required computations in a reasonable amount of time. Any suggestions for a more efficient approach would be greatly appreciated!
Here are the bootstrap functions I wrote:
import numpy as np

def bstrp_std(data):
    '95th percentile highest bootstrapped std estimate'
    n = len(data)
    # 1000 bootstrap resamples, each the same size as the input
    boot = np.random.choice(data, size=(1000, n))
    stat = np.sort(np.std(boot, 1))
    return stat[950]

def bstrp_avg(data):
    '95th percentile highest bootstrapped mean estimate'
    n = len(data)
    boot = np.random.choice(data, size=(1000, n))
    stat = np.sort(np.mean(boot, 1))
    return stat[950]
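The sort-then-index-950 step can also be done with np.percentile, which avoids fully sorting the 1000 replicate statistics; a minimal sketch (bstrp_std_pct is a hypothetical name, not from the question):

```python
import numpy as np

def bstrp_std_pct(data, n_boot=1000, q=95):
    """95th-percentile bootstrapped std estimate via np.percentile."""
    data = np.asarray(data)
    # Resample with replacement: one row per bootstrap replicate.
    boot = np.random.choice(data, size=(n_boot, len(data)))
    return np.percentile(np.std(boot, axis=1), q)
```

Note that np.percentile interpolates between order statistics, so its result can differ slightly from stat[950].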
The approaches I've tried:
Iterating over the DataFrame with .iterrows() (too slow):
for n, i in enumerate(clean.iterrows()):
    if n < 1:
        df = clean[(clean['location_id'] == i[1][1])
                   & (clean['day'] == i[1][8])
                   & (clean['hour'] == i[1][9])
                   & (clean['two_hour_buckets_x'] < i[1][0])].copy()
        df = df.groupby(['location_id', 'day', 'hour'])['outgoing_payment_amount'].agg(
            {'std_bstp': lambda x: bstrp_std(x),
             'mean_bstp': lambda x: bstrp_avg(x),
             'mean': np.mean,
             'std': np.std,
             'sample_size': np.size}).reset_index()
        df['two_hour_buckets'] = i[1][0]
    else:
        temp = clean[(clean['location_id'] == i[1][1])
                     & (clean['day'] == i[1][8])
                     & (clean['hour'] == i[1][9])
                     & (clean['two_hour_buckets_x'] < i[1][0])].copy()
        temp = temp.groupby(['location_id', 'day', 'hour'])['outgoing_payment_amount'].agg(
            {'std_bstp': lambda x: bstrp_std(x),
             'mean_bstp': lambda x: bstrp_avg(x),
             'mean': np.mean,
             'std': np.std,
             'sample_size': np.size}).reset_index()
        temp['two_hour_buckets'] = i[1][0]
        df = df.append(temp, ignore_index=True)
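The loop above rescans the full 12M-row frame once per record. Within each (location_id, day, hour) group, though, the "strictly earlier buckets" condition is just an expanding window over the bucket-sorted rows, so the plain mean/std/sample_size part can be computed with one sort and one pass per group. A hedged sketch of that idea on toy data (expanding_stats is a hypothetical helper; the bootstrap columns would still need a per-row resample):

```python
import numpy as np
import pandas as pd

def expanding_stats(g):
    """Per-row mean/std/size over all strictly earlier buckets in one group."""
    g = g.sort_values('two_hour_buckets_x')
    vals = g['outgoing_payment_amount']
    # shift(1) so each row sees only strictly earlier buckets, never itself
    return pd.DataFrame({
        'two_hour_buckets': g['two_hour_buckets_x'].to_numpy(),
        'mean': vals.expanding().mean().shift(1).to_numpy(),
        'std': vals.expanding().std(ddof=0).shift(1).to_numpy(),
        'sample_size': np.arange(len(g)),
    })

# Toy data with a single group; the real frame would have many groups.
df = pd.DataFrame({
    'location_id': [59] * 4,
    'day': ['Thursday'] * 4,
    'hour': [10] * 4,
    'two_hour_buckets_x': [1, 2, 3, 4],
    'outgoing_payment_amount': [0.0, 10.0, 20.0, 30.0],
})
pieces = [expanding_stats(g).assign(location_id=k[0], day=k[1], hour=k[2])
          for k, g in df.groupby(['location_id', 'day', 'hour'])]
res = pd.concat(pieces, ignore_index=True)
```

ddof=0 is used to match np.std's default; the first row of each group has no earlier records, so its mean/std come out NaN.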
Applying a function directly to the DataFrame (also too slow):
def brstrp_big(table):
    array = clean[(clean['location_id'] == table['location_id'])
                  & (clean['day'] == table['day'])
                  & (clean['hour'] == table['hour'])
                  & (clean['two_hour_buckets_x'] < table['two_hour_buckets_x'])]['outgoing_payment_amount']
    if len(array) < 2:
        return 0
    else:
        n = len(array)
        boot = np.random.choice(array, size=(1000, n))
        stat = np.sort(np.std(boot, 1))
        return stat[950]

clean['bstd_std'] = clean.apply(brstrp_big, axis=1)
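Part of the cost here is that every applied row triggers a fresh boolean scan of the whole frame. Grouping once up front and bootstrapping against the cached sub-frame avoids those rescans; a sketch under that assumption (add_boot_std is a hypothetical name):

```python
import numpy as np
import pandas as pd

def add_boot_std(clean, n_boot=1000):
    """Bootstrap std per row, reusing a one-time cache of each
    (location_id, day, hour) group instead of rescanning the frame."""
    cache = {key: g for key, g in clean.groupby(['location_id', 'day', 'hour'])}

    def boot_row(row):
        g = cache[(row['location_id'], row['day'], row['hour'])]
        vals = g.loc[g['two_hour_buckets_x'] < row['two_hour_buckets_x'],
                     'outgoing_payment_amount'].to_numpy()
        if len(vals) < 2:
            return 0.0
        boot = np.random.choice(vals, size=(n_boot, len(vals)))
        return np.percentile(np.std(boot, axis=1), 95)

    out = clean.copy()
    out['bstd_std'] = clean.apply(boot_row, axis=1)
    return out
```

The per-row bootstrap itself is unchanged, so this only removes the filtering overhead, not the resampling cost.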
Merging a copy of the DataFrame onto itself, filtering for merged rows whose timestamp is earlier than that of the original row with the same location_id, hour, and day, and then doing a groupby aggregation (very slow: the first operation generates 365M rows, although this looks like the most promising attempt):
eeee = clean1.merge(clean2, how='left', on=['location_id', 'day', 'hour'], copy=False)
eeee = eeee[eeee['two_hour_buckets_x_x'] > eeee['two_hour_buckets_x_y']].copy()
eeee = eeee.groupby(['location_id', 'two_hour_buckets_x_x'])['outgoing_payment_amount']
eeee = eeee.agg({'mean': np.mean,
                 'std': np.std,
                 'sample_size': np.size,
                 'std_bstp': lambda x: bstrp_std(x),
                 'mean_bstp': lambda x: bstrp_avg(x)}).reset_index()
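The 365M-row intermediate comes from materializing the whole self-merge at once. Doing the same merge-filter-aggregate one location_id at a time keeps the peak memory bounded by the largest single location; a minimal sketch, assuming clean has the columns shown in the input table (merge_stats_by_location is a hypothetical name, and the bootstrap columns are omitted for brevity):

```python
import pandas as pd

def merge_stats_by_location(clean):
    """Self-merge, filter, and aggregate one location_id at a time so the
    full cross-join never materializes at once."""
    pieces = []
    for _, grp in clean.groupby('location_id'):
        # Overlapping columns get the default _x/_y suffixes on merge.
        m = grp.merge(grp, on=['location_id', 'day', 'hour'])
        # Keep only strictly earlier records for each row.
        m = m[m['two_hour_buckets_x_x'] > m['two_hour_buckets_x_y']]
        agg = (m.groupby(['location_id', 'two_hour_buckets_x_x'])
                ['outgoing_payment_amount_y']
                .agg(['mean', 'std', 'size']))
        pieces.append(agg.reset_index())
    return pd.concat(pieces, ignore_index=True)
```

Chunking on location_id (or on larger key batches) trades one huge join for many small ones, which is usually the difference between swapping and finishing.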
Input:
two_hour_buckets_x location_id day hour outgoing_payment_amount
2000 1434650400 59 Thursday 10 0.00
2001 1434657600 59 Thursday 12 0.00
2002 1434664800 59 Thursday 14 0.00
2003 1434672000 59 Thursday 16 1017.46
2004 1434679200 59 Thursday 18 0.00
2005 1434686400 59 Thursday 20 0.00
2006 1434693600 59 Thursday 22 0.00
2007 1434700800 59 Friday 0 0.00
2008 1434708000 59 Friday 2 0.00
2009 1434715200 59 Friday 4 0.00
2010 1434722400 59 Friday 6 0.00
2011 1434729600 59 Friday 8 0.00
2012 1434736800 59 Friday 10 0.00
2013 1434744000 59 Friday 12 0.00
2014 1434751200 59 Friday 14 0.00
2015 1434758400 59 Friday 16 528.22
2016 1434765600 59 Friday 18 865.96
2017 1434772800 59 Friday 20 0.00
2018 1434780000 59 Friday 22 0.00
2019 1434787200 59 Saturday 0 0.00
Desired output:
location_id two_hour_buckets_x_x sample_size std_bstp std \
2000 59 1435255200 24 476.922804 350.986069
2001 59 1435262400 24 696.152358 449.504956
2002 59 1435269600 24 487.779153 383.545528
2003 59 1435276800 24 489.020190 401.858948
2004 59 1435284000 24 670.082177 535.158428
2005 59 1435291200 24 297.647022 183.711731
2006 59 1435298400 24 0.000000 0.000000
2007 59 1435305600 24 0.000000 0.000000
2008 59 1435312800 24 0.000000 0.000000
2009 59 1435320000 24 0.000000 0.000000
2010 59 1435327200 24 0.000000 0.000000
2011 59 1435334400 24 115.976509 71.582255
2012 59 1435341600 24 336.998549 251.685526
2013 59 1435348800 24 495.415309 384.295034
2014 59 1435356000 25 276.204290 221.158691
2015 59 1435363200 25 646.605050 465.187672
2016 59 1435370400 25 606.740824 501.532447
2017 59 1435377600 25 207.046545 153.245775
2018 59 1435384800 25 0.000000 0.000000
2019 59 1435392000 25 0.000000 0.000000
mean_bstp mean
2000 276.157500 150.517500
2001 302.775000 142.515000
2002 342.689167 197.455000
2003 382.813333 246.694167
2004 459.903333 290.807500
2005 112.500000 37.500000
2006 0.000000 0.000000
2007 0.000000 0.000000
2008 0.000000 0.000000
2009 0.000000 0.000000
2010 0.000000 0.000000
2011 43.835000 14.611667
2012 183.258333 95.989167
2013 333.573333 192.307500
2014 176.465600 102.017600
2015 411.064000 247.736800
2016 466.547200 290.933600
2017 105.095200 51.756000
2018 0.000000 0.000000
2019 0.000000 0.000000