Why do the "sklearn" and "statsmodels" implementations of OLS regression give different R^2?

Posted 2024-10-01 15:40:25


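I accidentally noticed that the OLS models implemented by sklearn and statsmodels yield different R^2 values when no intercept is fitted. Otherwise they seem to work fine. The following code: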

import numpy as np
import sklearn
import statsmodels
import sklearn.linear_model as sl
import statsmodels.api as sm

np.random.seed(42)

# Simulated data: Y = 2 * X + 4 + noise, with X centered at 1
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

# Fit each library with and without an intercept
sklearnIntercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklearnNoIntercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))
statsmodelsNoIntercept = sm.OLS(Y, X)

# Compare R^2: first line with intercept, second line without
print(sklearnIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
print(sklearnNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)

print(sklearn.__version__, statsmodels.__version__)

prints:

[output not preserved]

Where does the difference come from?

This question differs from "Different Linear Regression Coefficients with statsmodels and sklearn", because sklearn.linear_model.LinearRegression (with intercept) was fit to X prepared as for statsmodels.api.OLS.

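This question also differs from "Statsmodels: Calculate fitted values and R squared", because it addresses a difference between two Python packages (statsmodels and scikit-learn), whereas the linked question is about statsmodels and the common R^2 definition. Both questions have the same answer, but that issue has already been discussed here: "Does the same answer imply that the questions should be closed as duplicate?"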


1 Answer
User
Answer 1 · posted 2024-10-01 15:40:25

As @user333700 pointed out in the comments, the definition of R^2 for OLS in the statsmodels implementation differs from the one in scikit-learn.

From the documentation of the RegressionResults class (emphasis mine):

rsquared

R-squared of a model with an intercept. This is defined here as 1 - ssr/centered_tss if the constant is included in the model and 1 - ssr/uncentered_tss if the constant is omitted.
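To make the two denominators concrete, here is a minimal sketch in plain Python. It assumes a single regressor and a no-intercept least-squares fit; the helper `r_squared` and the toy data are illustrative only and are not part of either library.

```python
def r_squared(y_true, y_pred, centered=True):
    """Return 1 - ssr/tss, where tss is centered on mean(y_true) or on zero."""
    ssr = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    baseline = sum(y_true) / len(y_true) if centered else 0.0
    tss = sum((yt - baseline) ** 2 for yt in y_true)
    return 1.0 - ssr / tss


# Toy data with a large offset: y = 2 * x + 4 exactly
x = [0.0, 1.0, 2.0, 3.0]
y = [4.0, 6.0, 8.0, 10.0]

# No-intercept least squares with one regressor: beta = sum(x*y) / sum(x*x)
beta = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
y_pred = [beta * xi for xi in x]

r2_centered = r_squared(y, y_pred, centered=True)     # denominator: centered_tss
r2_uncentered = r_squared(y, y_pred, centered=False)  # denominator: uncentered_tss
print(r2_centered, r2_uncentered)
```

On this toy data the centered version is negative (-1/7) while the uncentered version is about 0.894, even though both are computed from the same residuals.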

From the documentation of LinearRegression.score:

score(X, y, sample_weight=None)

Returns the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
