Cumulative explained variance for PCA in Python

Posted 2024-09-19 23:44:42


I have a simple R script that runs FactoMineR's PCA on a small data frame to find the cumulative percentage of variance explained by each component:

library(FactoMineR)
a <- c(1, 2, 3, 4, 5)
b <- c(4, 2, 9, 23, 3)
c <- c(9, 8, 7, 6, 6)
d <- c(45, 36, 74, 35, 29)

df <- data.frame(a, b, c, d)

df_pca <- PCA(df, ncp = 4, graph=F)
print(df_pca$eig$`cumulative percentage of variance`)

This returns:

[output not preserved in the source]

I tried to do the same thing in Python using scikit-learn's decomposition package, as follows:

import pandas as pd
from sklearn import decomposition

a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]

df = pd.DataFrame({'a': a,
                  'b': b,
                  'c': c, 
                  'd': d})

pca = decomposition.PCA(n_components = 4)
pca.fit(df)
transformed_pca = pca.transform(df)

# Running total of the explained-variance ratios, one entry per component
cum_explained_var = pca.explained_variance_ratio_.cumsum().tolist()
print(cum_explained_var)

But this gives:

[0.79987089715487936, 0.99224337624509307, 0.99997254568237226, 1.0]

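As you can see, both sets of values add up to 100%, but the contribution of each component appears to differ between the R and Python versions. Does anyone know where these differences come from, or how to correctly replicate the R results in Python?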

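Edit: Thanks to Vlo, I now know that the differences come from the FactoMineR PCA function scaling the data by default. After using the sklearn preprocessing package (pca_data = preprocessing.scale(df)) to scale the data before running PCA, my results match the R output.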
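To illustrate why scaling changes the picture, here is a minimal NumPy sketch (not part of the original post) that compares the per-column variances before and after standardization. The array `X` simply restates the columns a, b, c, d from the question; the names `col_var`, `share`, and `Z` are illustrative. The standardization uses the population standard deviation, which is my understanding of what `preprocessing.scale` does.

```python
import numpy as np

# Columns a, b, c, d from the question, one row per observation
X = np.array([
    [1, 4, 9, 45],
    [2, 2, 8, 36],
    [3, 9, 7, 74],
    [4, 23, 6, 35],
    [5, 3, 6, 29],
], dtype=float)

# Sample variance of each raw column: 2.5, 75.7, 1.7, 317.7
col_var = X.var(axis=0, ddof=1)

# Share of the total variance carried by each column; column d holds about 0.80
share = col_var / col_var.sum()
print(dict(zip("abcd", share.round(3))))

# Standardize: zero mean and unit (population) variance per column
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.var(axis=0))  # every column now contributes equally
```

On the raw data, column d alone accounts for roughly 80% of the total variance, so an unscaled PCA puts almost all of the first component on d. After standardization each column has unit variance, so no column dominates simply because of its units.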


1 answer
User
#1 · Posted 2024-09-19 23:44:42

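Thanks to Vlo, I learned that the difference between the FactoMineR PCA function and the sklearn PCA function is that the FactoMineR one scales the data by default. Simply adding a scaling step to the Python code reproduces the results.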

import pandas as pd
from sklearn import decomposition, preprocessing

a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]

df = pd.DataFrame({'a': a,
                  'b': b,
                  'c': c, 
                  'd': d})


pca_data = preprocessing.scale(df)

pca = decomposition.PCA(n_components = 4)
pca.fit(pca_data)
transformed_pca = pca.transform(pca_data)

# Running total of the explained-variance ratios, one entry per component
cum_explained_var = pca.explained_variance_ratio_.cumsum().tolist()

print(cum_explained_var)

Output:

[output not preserved in the source]
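As a cross-check that does not depend on scikit-learn, PCA on standardized data is equivalent to an eigendecomposition of the correlation matrix. The sketch below is an illustration, not the original poster's code: it computes the cumulative percentages both ways with NumPy. The 0 to 100 scale is chosen on the assumption that FactoMineR reports percentages rather than fractions, and all variable names are illustrative.

```python
import numpy as np

# Columns a, b, c, d from the question, one row per observation
X = np.array([
    [1, 4, 9, 45],
    [2, 2, 8, 36],
    [3, 9, 7, 74],
    [4, 23, 6, 35],
    [5, 3, 6, 29],
], dtype=float)

# Route 1: eigenvalues of the correlation matrix, largest first
corr = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]
cum_pct = 100 * np.cumsum(eigvals) / eigvals.sum()

# Route 2: singular values of the standardized data matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)
s = np.linalg.svd(Z, compute_uv=False)
cum_pct_svd = 100 * np.cumsum(s**2) / np.sum(s**2)

print(cum_pct)
print(cum_pct_svd)  # agrees with route 1 up to floating-point error
```

The eigenvalues of a correlation matrix sum to the number of variables (4 here), so dividing by their sum gives each component's share of the total variance. Multiplying `pca.explained_variance_ratio_` by 100 in the scikit-learn version should put it on the same scale.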
