如何在滚动主成分分析中有效地存储主成分系数?

2024-07-04 16:36:59 发布

您现在位置:Python中文网/ 问答频道 /正文

我试图将每个20天滚动窗口最后一天的前三个PCs系数、分数、特征值和残差(原始数据-重建)存储在20个不同到期日和108个交易日的CDS指数的每日收益数据集上。我试图在不借助scikit.learn的情况下自己编写整个算法,以达到教学目的。 不幸的是,我似乎无法将PCs的系数存储在我之前初始化的数据帧coeff1roll,coeff2roll,coeff3roll中。其他一切似乎都很好

请帮助我,并随时让我注意到我的程序中的任何其他错误

这是我的密码:

d = data
t = 20

scores1roll = np.zeros((n-t))  # scores
scores2roll = np.zeros((n-t))
scores3roll = np.zeros((n-t))

coeff1roll = pd.DataFrame(data = np.zeros((n-t, m)))  # loadings
coeff2roll = pd.DataFrame(data = np.zeros((n-t, m)))
coeff3roll = pd.DataFrame(data = np.zeros((n-t, m)))

eigvalroll1 = np.zeros((n-t))  # eigenvalues
eigvalroll2 = np.zeros((n-t))
eigvalroll3 = np.zeros((n-t))

res_roll = pd.DataFrame(data = np.zeros((n-t, m)))  # Residuals


# Loop with PCA step by step

for i in range(n-t):
    droll = d.iloc[i:i+t-1,:]  # window of data
    r,c = droll.shape
    meanroll = np.mean(droll, axis=0)  # mean by column
    stddevroll = np.sqrt(np.var(droll, axis=0))  # standard deviation by col
    dstdrollss = (droll - meanroll)/stddevroll
    # mean should be 0 and std dev 1
    cor_mat = (dstdrollss.T.dot(dstdrollss))/(r - 1)  # correlation matrix
    eigvalues, eigvectors = np.linalg.eig(cor_mat)
    eig_pairs = [(np.abs(eigval[i]), eigvec[:, i]) for i in 
range(len(eigval))]   # associating each eigvec to the correspondent eigval
    eig_pairs.sort(reverse=True)   # sorting according to abs val of eigenval
    loadingsroll = np.hstack((eig_pairs[0][1].reshape(20, 1),
                  eig_pairs[1][1].reshape(20, 1),
                  eig_pairs[2][1].reshape(20, 1)))  # selecting first 3
    scoresroll = dstdrollss.dot(loadingsroll)  # projections on new directions

    coeff1roll.iloc[i, :] = loadingsroll[:, 0] # renamed coeff  
    coeff2roll.iloc[i, :] = loadingsroll[:, 1]
    coeff3roll.iloc[i, :] = loadingsroll[:, 2]

    scores1roll[i] = scoresroll.iloc[0,0]
    scores2roll[i] = scoresroll.iloc[0,1]
    scores3roll[i] = scoresroll.iloc[0,2]

    eigvalroll1[i] = eigvalues[0]
    eigvalroll2[i] = eigvalues[1]
    eigvalroll3[i] = eigvalues[2]

    d_hatroll = scoresroll.dot(loadingsroll.T)  # back to raw normalized data
    d_rawroll = d_hatroll * stddevroll.values + meanroll.values # raw data
    c = dict(zip(d_rawroll.columns, droll.columns))
    resrolling = droll.subtract(d_rawroll.rename(columns=c),
                            axis='column')  # calculate residuals

    res_roll.iloc[i, :] = resrolling.iloc[0, :].values

我希望coeff1roll、coeff2roll、coeff3roll对每个数据点都有不同的值,但是循环返回的列值相等,正好是最近20天窗口中最近一天的系数值:

coeff1roll.head()

        0         1         2   ...        17        18       19
0  0.223961  0.223526  0.223531  ...  0.223876  0.223852  0.22368
1  0.223961  0.223526  0.223531  ...  0.223876  0.223852  0.22368
2  0.223961  0.223526  0.223531  ...  0.223876  0.223852  0.22368
3  0.223961  0.223526  0.223531  ...  0.223876  0.223852  0.22368
4  0.223961  0.223526  0.223531  ...  0.223876  0.223852  0.22368

[5 rows x 20 columns]

Tags: dataframedatanpzerospdpairsiloceig

热门问题