Rolling correlation calculation with Pandas or NumPy

Posted 2024-05-18 22:13:59


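I am trying to find an efficient way to solve the following problem:

I have two DataFrames, each containing the following data:
First: id, date, value

Sample data: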

id, date, value
f130,200701,0.016196
f130,200702,-0.027798
f130,200703,-0.014868
f130,200704,0.017801
f130,200705,-0.032700
f130,200706,0.049529
f130,200707,0.011610
f130,200708,-0.008145
f130,200709,-0.001493
f130,200710,0.009719
f130,200711,-0.007775
f130,200712,-0.007835
f131,200701,0.044754
f131,200702,0.004679
f131,200703,-0.011824
f131,200704,0.007252
f131,200705,0.029877
f131,200706,0.001748
f131,200707,0.001047
f131,200708,-0.003137
f131,200709,0.001748
f131,200710,0.006632
f131,200711,-0.012136
f131,200712,0.004914

Second: id_2, date, value

Sample data:

^{pr2}$

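What I need is the rolling-window correlation between the two value columns for every id & id_2 pair (rolling over the date column).
Basically, my output should be: "id vs id_2", date, corr. So, for d1 vs f130 at 200706, I calculate the correlation between d1 and f130 over the window ending at 200706. The same goes for all pairs. Expected output: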

id_pair, date, value
d1_f130,200706,-0.375238392
d1_f130,200707,-0.667154011
d1_f130,200708,-0.636064899
d1_f130,200709,-0.672029012
d1_f130,200710,-0.653719992
d1_f130,200711,-0.802893705
d1_f130,200712,-0.03120143
d1_f131,200706,0.870717009
d1_f131,200707,0.61076152
d1_f131,200708,0.400632396
d1_f131,200709,0.05064842
d1_f131,200710,0.087102168
d1_f131,200711,-0.012306865
d1_f131,200712,0.05170204
d2_f130,200706,-0.170979922
d2_f130,200707,-0.15363222
d2_f130,200708,-0.089709021
d2_f130,200709,-0.227564277
d2_f130,200710,-0.252391258
d2_f130,200711,0.94878745
d2_f130,200712,0.619029635
d2_f131,200706,0.358385975
d2_f131,200707,0.952074283
d2_f131,200708,0.930805345
d2_f131,200709,0.919101445
d2_f131,200710,0.904473885
d2_f131,200711,0.47080201
d2_f131,200712,0.640334152
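
To make the target computation concrete, here is a minimal sketch for a single pair. Because only the first sample table is available here, it correlates f130 with f131 rather than with a d1 or d2 series, so the numbers will not match the expected output above. The window of 6 is an assumption, inferred from the expected output starting at 200706 when the data starts at 200701.

```python
import numpy as np
import pandas as pd

dates = [200701 + i for i in range(12)]  # 200701 .. 200712

# Values copied from the first sample table above.
f130 = pd.Series([0.016196, -0.027798, -0.014868, 0.017801, -0.032700, 0.049529,
                  0.011610, -0.008145, -0.001493, 0.009719, -0.007775, -0.007835],
                 index=dates)
f131 = pd.Series([0.044754, 0.004679, -0.011824, 0.007252, 0.029877, 0.001748,
                  0.001047, -0.003137, 0.001748, 0.006632, -0.012136, 0.004914],
                 index=dates)

WINDOW = 6  # assumption: inferred from the expected output, not stated explicitly

# Each value is the Pearson correlation over the WINDOW rows ending at that date.
pair_corr = f130.rolling(WINDOW).corr(f131)

# The first WINDOW - 1 dates have no full window and come out as NaN.
first_valid = pair_corr.dropna()
print(first_valid)
```

The value at 200706 is simply the ordinary correlation of the 200701 to 200706 rows, which is the quantity asked for, repeated for every later date and every pair.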

Looping over the ids and dates with for loops takes days... (id ~15000, id_2 ~300, dates ~300)

Any ideas?


1 answer
User
#1 · Posted 2024-05-18 22:13:59

Suppose you have two DataFrames like these:

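import pandas as pd

# Column names changed to simplify the problem.
# id1, id2, d1, v1 and v2 stand for your own arrays of ids, dates and values.
df1 = pd.DataFrame({'id1': id1, 'date': d1, 'value1': v1})
df2 = pd.DataFrame({'id2': id2, 'date': d1, 'value2': v2})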

Then you can merge the two into a single df, like this:

^{pr2}$
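
The merge step can be sketched as follows, assuming the two frames are joined on their shared date column so that every (id1, id2) pair gets one row per date. The df2 values below are made up for illustration, and the exact call used in the answer may have differed.

```python
import pandas as pd

# Two dates of the first sample table; the df2 values are invented placeholders.
df1 = pd.DataFrame({'id1': ['f130', 'f130', 'f131', 'f131'],
                    'date': [200701, 200702, 200701, 200702],
                    'value1': [0.016196, -0.027798, 0.044754, 0.004679]})
df2 = pd.DataFrame({'id2': ['d1', 'd1', 'd2', 'd2'],
                    'date': [200701, 200702, 200701, 200702],
                    'value2': [0.1, 0.2, 0.3, 0.4]})

# Joining on 'date' pairs every id1 row with every id2 row of the same date.
df = df1.merge(df2, on='date')
print(df)
```

Such a join grows as ids × ids × dates: with the sizes from the question (about 15000 × 300 × 300) the merged frame would have roughly 1.35 billion rows, so it may not fit in memory at full scale.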

Now group by the ids and apply the rolling correlation:

dfs = []  # collect one result frame per (id1, id2) pair
for n, g in df.groupby(['id1', 'id2']):
    _df = pd.DataFrame({'ids': [n] * len(g.date), 'date': g.date})
    # rolling correlation between the two value series (window of 10 rows)
    _df['corr'] = g.value1.rolling(10).corr(g.value2)
    dfs.append(_df)

# concatenate the per-pair frames into a single DataFrame
final_df = pd.concat(dfs, ignore_index=True)


print(final_df)  # show the result, for example:
# Note that the first 9 rows of each ids pair are NaN because the rolling window is 10.
                   date     ids       corr
0    2008-01-01 13:34:00  (0, 0)       NaN
1    2008-01-01 13:34:00  (0, 0)       NaN
2    2008-01-01 13:35:00  (0, 0)       NaN
3    2008-01-01 13:37:00  (0, 0)       NaN
4    2008-01-01 13:37:00  (0, 0)       NaN
5    2008-01-01 13:37:00  (0, 0)       NaN
6    2008-01-01 13:38:00  (0, 0)       NaN
7    2008-01-01 13:38:00  (0, 0)       NaN
8    2008-01-01 13:40:00  (0, 0)       NaN
9    2008-01-01 13:41:00  (0, 0)  0.423877
10   2008-01-01 13:42:00  (0, 0)  0.555128
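
The leading NaN rows come from the window not being full yet. A small sketch of how the min_periods argument of rolling changes this, using made-up numbers and a window of 4 purely for illustration:

```python
import numpy as np
import pandas as pd

# Invented toy series, only to show where the NaN values appear.
x = pd.Series([1.0, 2.0, 4.0, 3.0, 5.0, 7.0])
y = pd.Series([2.0, 1.0, 3.0, 5.0, 4.0, 6.0])

# Default: a value is produced only once the window holds 4 observations.
strict = x.rolling(4).corr(y)

# With min_periods=2, partial windows of 2 or 3 rows are also evaluated.
relaxed = x.rolling(4, min_periods=2).corr(y)

print(pd.DataFrame({'strict': strict, 'relaxed': relaxed}))
```

Correlations over partial windows rest on very few points, so whether they are wanted depends on the use case; the expected output in the question only lists full windows.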

Update
You can rename the headers of your sample data to match the answer, as follows:

df1.columns  = ['id1','date','value1']
df2.columns  = ['id2','date','value2']

You can change ids to fit the expected output by replacing
'ids':[n]*len(g.date)
with:
'ids':['_'.join(n)]*len(g.date)
for example.
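
Since the question is about speed, here is a hedged alternative sketch that avoids both the large merge and the loop over every pair. It pivots each frame to a wide date × id layout and builds the correlation from rolling means (E[xy] − E[x]E[y], divided by the two rolling standard deviations), looping only over the id2 columns. The function name rolling_corr_all_pairs, the window of 6 and the random test data are assumptions for illustration. The sketch also assumes no missing values, and it has not been benchmarked at the sizes given in the question.

```python
import numpy as np
import pandas as pd

def rolling_corr_all_pairs(df1, df2, window):
    """Hypothetical helper: rolling Pearson correlation for all (id2, id1) pairs."""
    # Wide layout: one row per date, one column per id.
    wide1 = df1.pivot(index='date', columns='id1', values='value1')
    wide2 = df2.pivot(index='date', columns='id2', values='value2')
    wide1, wide2 = wide1.align(wide2, join='inner', axis=0)  # keep shared dates

    def roll_mean(obj):
        return obj.rolling(window).mean()

    # Moments of df1 are computed once for all id1 columns.
    mean1 = roll_mean(wide1)
    var1 = roll_mean(wide1 ** 2) - mean1 ** 2

    parts = []
    for id2, s in wide2.items():  # loop over id2 columns only
        mean2 = roll_mean(s)
        var2 = roll_mean(s ** 2) - mean2 ** 2
        cov = roll_mean(wide1.mul(s, axis=0)) - mean1.mul(mean2, axis=0)
        corr = cov / np.sqrt(var1.mul(var2, axis=0))
        # Label columns as id2_id1 to match the expected output (e.g. d1_f130).
        corr.columns = [f'{id2}_{id1}' for id1 in corr.columns]
        parts.append(corr)

    out = pd.concat(parts, axis=1).rename_axis('date').reset_index()
    out = out.melt(id_vars='date', var_name='id_pair', value_name='corr')
    return out.dropna(subset=['corr']).reset_index(drop=True)

# Invented random data with the same shape as the samples in the question.
rng = np.random.default_rng(0)
dates = [200701 + i for i in range(12)]
df1 = pd.DataFrame({'id1': np.repeat(['f130', 'f131'], 12),
                    'date': dates * 2,
                    'value1': rng.normal(0, 0.02, 24)})
df2 = pd.DataFrame({'id2': np.repeat(['d1', 'd2'], 12),
                    'date': dates * 2,
                    'value2': rng.normal(0, 0.02, 24)})

result = rolling_corr_all_pairs(df1, df2, window=6)
print(result.head())
```

Note that '_'.join(n) in the grouped version yields labels in id1_id2 order (for example f130_d1), while this sketch builds id2_id1 labels directly. The moment-based formula can lose precision when a window's variance is close to zero, so spot-checking a few pairs against Series.rolling(...).corr(...) is advisable.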
