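In this problem I have two dataframes, and I want to add a column to loan_df that is aggregated from a second, reference dataframe (recharge_df in the code below). For each loan, I want the borrower's mean recharge over the period before the loan was taken (in this case, the 90 days before). I will then add this new column to loan_df. My code below works, but it is very slow. Is there any way to make it super efficient?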
import datetime

import pandas as pd


def mean_rec_func(msisdn, date, advance_id, window, name):
    """Return the mean recharge within a given number of days before a loan was taken.

    Keyword arguments:
    msisdn -- APF_MSISDN for the loan (this is like a customer ID)
    date -- APF_DATE on which the loan was taken
    advance_id -- APF_ADVANCE_ID for the loan
    window -- number of days to look back (int)
    name -- name of the newly computed stat
    """
    # Filter the global recharge_df to this customer's recharges in [date - window, date)
    mean_rec = recharge_df.loc[
        (recharge_df['APF_MSISDN'] == msisdn)
        & (recharge_df['APF_DATE'] < date)
        & (recharge_df['APF_DATE'] >= date - datetime.timedelta(days=window)),
        'APF_AMOUNT',
    ].mean()
    return pd.Series([advance_id, msisdn, mean_rec],
                     index=['APF_ADVANCE_ID', 'APF_MSISDN', name])


# Mean recharge over the last 90 days
mean_recharge_90 = loan_df.apply(
    lambda row: mean_rec_func(row['APF_MSISDN'], row['APF_DATE'],
                              row['APF_ADVANCE_ID'],
                              window=90,
                              name="MEAN_RECHARGE_90"),
    axis=1)
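Consider a SQL solution, since your logic translates to a query with a correlated aggregate subquery (admittedly, this is also an expensive type of query, because the aggregate is run for every row of the outer query, much like the pandas apply loop). In pandas, you can use the pandasql module to run such a query against an in-memory SQLite instance.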
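To make the query shape concrete, here is a minimal sketch of a correlated aggregate subquery for the 90-day mean. It is not the answer's original code: it runs on the standard-library sqlite3 module with made-up toy rows so that it is self-contained, the table names loan and recharge are hypothetical, and dates are assumed to be day-granular ISO strings. With pandasql, the same SQL string would presumably be passed to sqldf with the dataframe names in place of the table names.

```python
import sqlite3

# Toy rows standing in for loan_df and recharge_df (all values are made up).
loans = [
    (1, "A", "2020-04-01"),
    (2, "B", "2020-04-01"),
]
recharges = [
    ("A", "2020-03-01", 10.0),   # inside the 90-day window of loan 1
    ("A", "2020-03-20", 30.0),   # inside the 90-day window of loan 1
    ("A", "2019-11-01", 500.0),  # older than 90 days, excluded
    ("A", "2020-04-01", 999.0),  # same day as the loan, excluded (strict <)
    ("B", "2020-04-05", 50.0),   # after the loan, excluded
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loan (APF_ADVANCE_ID INTEGER, APF_MSISDN TEXT, APF_DATE TEXT)"
)
conn.execute(
    "CREATE TABLE recharge (APF_MSISDN TEXT, APF_DATE TEXT, APF_AMOUNT REAL)"
)
conn.executemany("INSERT INTO loan VALUES (?, ?, ?)", loans)
conn.executemany("INSERT INTO recharge VALUES (?, ?, ?)", recharges)

# The inner SELECT is re-evaluated for every loan row (correlated on
# APF_MSISDN and APF_DATE), mirroring the row-wise apply in the question.
query = """
SELECT l.APF_ADVANCE_ID,
       l.APF_MSISDN,
       (SELECT AVG(r.APF_AMOUNT)
          FROM recharge AS r
         WHERE r.APF_MSISDN = l.APF_MSISDN
           AND r.APF_DATE < l.APF_DATE
           AND r.APF_DATE >= date(l.APF_DATE, '-90 days')
       ) AS MEAN_RECHARGE_90
  FROM loan AS l
 ORDER BY l.APF_ADVANCE_ID
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

A loan with no recharges in the window yields NULL (None in Python), where the pandas version in the question would yield NaN.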
What pandasql runs under the hood can also be written out in expanded form, using SQLAlchemy together with the pandas import/export calls read_sql and to_sql.
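A rough sketch of that expanded round trip is given here: export both dataframes with to_sql, run the query, and import the result with read_sql. Several things are assumptions rather than the answer's original code: it uses a standard-library sqlite3 connection instead of a SQLAlchemy engine (pandas accepts either for SQLite), the helper name add_mean_recharge and the table names are hypothetical, APF_ADVANCE_ID is assumed to be unique per loan, and dates are truncated to whole days before export.

```python
import sqlite3

import pandas as pd


def add_mean_recharge(loan_df, recharge_df, window=90, name="MEAN_RECHARGE_90"):
    """Return a copy of loan_df with a mean-recharge column computed in SQLite.

    Hypothetical helper; assumes APF_ADVANCE_ID uniquely identifies a loan.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Export both frames. Dates are stored as ISO 'YYYY-MM-DD' text so
        # that SQLite's date() and plain string comparison agree.
        for table, df in (("loan", loan_df), ("recharge", recharge_df)):
            out = df.copy()
            out["APF_DATE"] = pd.to_datetime(out["APF_DATE"]).dt.strftime("%Y-%m-%d")
            out.to_sql(table, conn, index=False)

        query = """
            SELECT l.APF_ADVANCE_ID,
                   (SELECT AVG(r.APF_AMOUNT)
                      FROM recharge AS r
                     WHERE r.APF_MSISDN = l.APF_MSISDN
                       AND r.APF_DATE < l.APF_DATE
                       AND r.APF_DATE >= date(l.APF_DATE, ?)
                   ) AS MEAN_RECHARGE
              FROM loan AS l
        """
        # Import the aggregated result back into pandas.
        stats = pd.read_sql(query, conn, params=(f"-{int(window)} days",))
    finally:
        conn.close()

    stats = stats.rename(columns={"MEAN_RECHARGE": name})
    return loan_df.merge(stats, on="APF_ADVANCE_ID", how="left")


# Toy data (made up) to exercise the helper.
loan_df = pd.DataFrame({
    "APF_ADVANCE_ID": [1, 2],
    "APF_MSISDN": ["A", "B"],
    "APF_DATE": pd.to_datetime(["2020-04-01", "2020-04-01"]),
})
recharge_df = pd.DataFrame({
    "APF_MSISDN": ["A", "A", "A", "B"],
    "APF_DATE": pd.to_datetime(
        ["2020-03-01", "2020-03-20", "2019-11-01", "2020-04-05"]
    ),
    "APF_AMOUNT": [10.0, 30.0, 500.0, 50.0],
})

result = add_mean_recharge(loan_df, recharge_df, window=90)
print(result)
```

If APF_DATE carries a time of day that matters, the truncation to whole days changes the window boundaries slightly compared with the datetime comparison in the question, so the export step would need adjusting.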