What is the most efficient way to iteratively calculate aggregate/summary statistics for each row in a dataframe using only the preceding rows?

I'm working on an analysis that requires me to compute summary statistics and bootstrapped confidence intervals for every record in a 12-million-record dataset, based only on the records that share that record's location_id, hour, and day column values and whose creation timestamp (binned here into two-hour buckets) is less than the timestamp of the record the statistics are being computed for.

I've tried several different approaches but am having a hard time finding one that returns the required calculations in a reasonable amount of time. Any suggestions for a more efficient approach would be much appreciated!

Here are the bootstrap functions I wrote:

import numpy as np

def bstrp_std(data):
    '95th percentile of the bootstrapped std estimates'
    n = len(data)
    # 1000 bootstrap resamples, each the same size as the input
    boot = np.random.choice(data, size=(1000, n))
    stat = np.sort(np.std(boot, axis=1))
    return stat[950]

def bstrp_avg(data):
    '95th percentile of the bootstrapped mean estimates'
    n = len(data)
    boot = np.random.choice(data, size=(1000, n))
    stat = np.sort(np.mean(boot, axis=1))
    return stat[950]
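
As a quick, hypothetical sanity check, both functions can be exercised on synthetic data (the seed, distribution, and sample size below are arbitrary, for illustration only). Note that np.percentile(stat, 95) would be an equivalent, slightly more direct way to take the 95th percentile than sorting and indexing position 950:

import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed, illustration only
sample = rng.normal(100, 25, size=24)    # e.g. 24 prior two-hour buckets

print(bstrp_std(sample))   # 95th-percentile bootstrapped std
print(bstrp_avg(sample))   # 95th-percentile bootstrapped mean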

Approaches I've tried:

Iterating over the dataframe with .iterrows(). Too slow:

import pandas as pd

dfs = []
for _, row in clean.iterrows():
    # all records in the same (location_id, day, hour) cell with a
    # strictly earlier two-hour bucket than the current record
    temp = clean[(clean['location_id'] == row['location_id'])
                 & (clean['day'] == row['day'])
                 & (clean['hour'] == row['hour'])
                 & (clean['two_hour_buckets_x'] < row['two_hour_buckets_x'])]

    temp = (temp.groupby(['location_id', 'day', 'hour'])['outgoing_payment_amount']
                .agg(std_bstp=bstrp_std,
                     mean_bstp=bstrp_avg,
                     mean='mean',
                     std=lambda x: x.std(ddof=0),  # ddof=0 matches np.std
                     sample_size='size')
                .reset_index())
    temp['two_hour_buckets'] = row['two_hour_buckets_x']
    dfs.append(temp)

df = pd.concat(dfs, ignore_index=True)
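
Filtering the full 12M-row frame once per record is what makes this loop quadratic. A cheaper variant (a sketch only; it reuses clean, bstrp_std, and bstrp_avg from above and assumes the two-hour buckets are unique within each (location_id, day, hour) group) filters once per group and then slices the prior rows positionally:

import numpy as np
import pandas as pd

results = []
clean = clean.sort_values('two_hour_buckets_x')
for (loc, day, hour), g in clean.groupby(['location_id', 'day', 'hour']):
    vals = g['outgoing_payment_amount'].to_numpy()
    buckets = g['two_hour_buckets_x'].to_numpy()
    for i, ts in enumerate(buckets):
        prior = vals[:i]  # strictly earlier buckets within this group
        if len(prior) < 2:
            continue
        results.append({'location_id': loc, 'day': day, 'hour': hour,
                        'two_hour_buckets': ts,
                        'mean': prior.mean(), 'std': prior.std(),
                        'sample_size': len(prior),
                        'std_bstp': bstrp_std(prior),
                        'mean_bstp': bstrp_avg(prior)})

df = pd.DataFrame(results)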

Applying a function directly to the dataframe with .apply(). Also too slow:

def brstrp_big(row):
    'bootstrapped 95th percentile std over the prior buckets of one record'
    array = clean[(clean['location_id'] == row['location_id'])
                  & (clean['day'] == row['day'])
                  & (clean['hour'] == row['hour'])
                  & (clean['two_hour_buckets_x'] < row['two_hour_buckets_x'])]['outgoing_payment_amount']
    if len(array) < 2:
        return 0
    n = len(array)
    boot = np.random.choice(array, size=(1000, n))
    stat = np.sort(np.std(boot, axis=1))
    return stat[950]

clean['bstd_std'] = clean.apply(brstrp_big, axis=1)

Merging a copy of the dataframe onto itself, filtering to the merged rows whose timestamp is earlier than the timestamp of the original row with the same location_id, hour, and day, then performing the groupby aggregation. Very slow (the first operation generates 365M rows), although this looks like the best attempt so far:

# clean1 and clean2 are copies of clean; the self-merge pairs every record
# with all records sharing its (location_id, day, hour)
eeee = clean1.merge(clean2, how='left', on=['location_id', 'day', 'hour'], copy=False)
# keep only pairs where the paired record's bucket is strictly earlier
eeee = eeee[eeee['two_hour_buckets_x_x'] > eeee['two_hour_buckets_x_y']].copy()
eeee = eeee.groupby(['location_id', 'two_hour_buckets_x_x'])['outgoing_payment_amount']

eeee = eeee.agg(mean='mean',
                std=lambda x: x.std(ddof=0),  # ddof=0 matches np.std
                sample_size='size',
                std_bstp=bstrp_std,
                mean_bstp=bstrp_avg).reset_index()
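
The "all earlier buckets" window inside a (location_id, day, hour) group sorted by two_hour_buckets_x is just an expanding window shifted down one row, which makes the plain mean, std, and sample size fully vectorizable without the 365M-row intermediate. This is a minimal sketch, assuming clean is the frame from above; the bootstrap columns would still need a loop, but over groups rather than over all 12M rows:

clean = clean.sort_values('two_hour_buckets_x')
grp = clean.groupby(['location_id', 'day', 'hour'])['outgoing_payment_amount']

# the expanding statistic at row i-1, shifted down one row, covers exactly
# the strictly earlier buckets for row i within its group
clean['mean'] = grp.transform(lambda s: s.expanding().mean().shift())
clean['std'] = grp.transform(lambda s: s.expanding().std(ddof=0).shift())  # ddof=0 matches np.std
clean['sample_size'] = grp.transform(lambda s: s.expanding().count().shift())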

Input:

      two_hour_buckets_x  location_id       day  hour  outgoing_payment_amount
2000          1434650400           59  Thursday    10                     0.00
2001          1434657600           59  Thursday    12                     0.00
2002          1434664800           59  Thursday    14                     0.00
2003          1434672000           59  Thursday    16                  1017.46
2004          1434679200           59  Thursday    18                     0.00
2005          1434686400           59  Thursday    20                     0.00
2006          1434693600           59  Thursday    22                     0.00
2007          1434700800           59    Friday     0                     0.00
2008          1434708000           59    Friday     2                     0.00
2009          1434715200           59    Friday     4                     0.00
2010          1434722400           59    Friday     6                     0.00
2011          1434729600           59    Friday     8                     0.00
2012          1434736800           59    Friday    10                     0.00
2013          1434744000           59    Friday    12                     0.00
2014          1434751200           59    Friday    14                     0.00
2015          1434758400           59    Friday    16                   528.22
2016          1434765600           59    Friday    18                   865.96
2017          1434772800           59    Friday    20                     0.00
2018          1434780000           59    Friday    22                     0.00
2019          1434787200           59  Saturday     0                     0.00

Desired output:

      location_id  two_hour_buckets_x_x  sample_size    std_bstp         std   mean_bstp        mean
2000           59            1435255200           24  476.922804  350.986069  276.157500  150.517500
2001           59            1435262400           24  696.152358  449.504956  302.775000  142.515000
2002           59            1435269600           24  487.779153  383.545528  342.689167  197.455000
2003           59            1435276800           24  489.020190  401.858948  382.813333  246.694167
2004           59            1435284000           24  670.082177  535.158428  459.903333  290.807500
2005           59            1435291200           24  297.647022  183.711731  112.500000   37.500000
2006           59            1435298400           24    0.000000    0.000000    0.000000    0.000000
2007           59            1435305600           24    0.000000    0.000000    0.000000    0.000000
2008           59            1435312800           24    0.000000    0.000000    0.000000    0.000000
2009           59            1435320000           24    0.000000    0.000000    0.000000    0.000000
2010           59            1435327200           24    0.000000    0.000000    0.000000    0.000000
2011           59            1435334400           24  115.976509   71.582255   43.835000   14.611667
2012           59            1435341600           24  336.998549  251.685526  183.258333   95.989167
2013           59            1435348800           24  495.415309  384.295034  333.573333  192.307500
2014           59            1435356000           25  276.204290  221.158691  176.465600  102.017600
2015           59            1435363200           25  646.605050  465.187672  411.064000  247.736800
2016           59            1435370400           25  606.740824  501.532447  466.547200  290.933600
2017           59            1435377600           25  207.046545  153.245775  105.095200   51.756000
2018           59            1435384800           25    0.000000    0.000000    0.000000    0.000000
2019           59            1435392000           25    0.000000    0.000000    0.000000    0.000000
