Negative cross_val_score with a decision tree regression model

Published 2024-05-09 18:36:49


I am evaluating a decision tree regression model with cross_val_score. The problem is that the scores come out negative, and I really don't understand why.

Here is my code:

import numpy as np
from scipy.stats import sem
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

all_depths = []
all_mean_scores = []
for max_depth in range(1, 11):
    all_depths.append(max_depth)
    simple_tree = DecisionTreeRegressor(max_depth=max_depth)
    cv = KFold(n_splits=2, shuffle=True, random_state=13)
    # df holds my data; features are the columns 'system' through 'gwno'
    scores = cross_val_score(simple_tree, df.loc[:, 'system':'gwno'], df['gdp_growth'], cv=cv)
    mean_score = np.mean(scores)
    all_mean_scores.append(mean_score)
    print("max_depth = ", max_depth, scores, mean_score, sem(scores))

The results show negative scores for every max_depth (full output omitted).

My questions are the following:

1) Is the score returning MSE? If so, how can it be negative?

2) I have a small sample of ~40 observations and ~70 variables. Might this be the problem?

Thanks in advance.


2 Answers

This can happen. It has already been answered in this post!

The actual MSE is simply the positive version of the number you are getting.

The unified scoring API always maximizes the score, so scores that need to be minimized are negated in order for the unified scoring API to work correctly. The score that is returned is therefore negated when it is a score that should be minimized, and left positive when it is a score that should be maximized.
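This sign convention can be checked directly on a toy regression (a minimal sketch; the feature and target arrays below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X[:, 0] + 0.1 * rng.randn(100)

# With scoring='neg_mean_squared_error', every fold's score is the MSE
# with its sign flipped, so that "bigger is better" still holds.
neg_mse = cross_val_score(DecisionTreeRegressor(max_depth=3, random_state=0),
                          X, y, cv=3, scoring='neg_mean_squared_error')
mse = -neg_mse  # the actual MSE is just the positive version

print(neg_mse)  # all values <= 0
print(mse)      # all values >= 0
```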

TL;DR:

1) No, not unless you specify it explicitly, or unless it is the estimator's default .score method. Since you didn't, it defaults to DecisionTreeRegressor.score, which returns the coefficient of determination, i.e. R^2. That can be negative.

2) Yes, it is a problem. And it explains why you are getting a negative coefficient of determination.
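The TL;DR can be illustrated with sklearn.metrics.r2_score on tiny hand-made numbers (purely illustrative values):

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]

# A constant model predicting the mean of y_true scores exactly 0.0 ...
print(r2_score(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0

# ... while a model that does worse than the mean goes negative.
print(r2_score(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0
```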

Details:

You called the function like this:

scores = cross_val_score(simple_tree, df.loc[:,'system':'gwno'], df['gdp_growth'], cv=cv)

So you did not explicitly pass a "scoring" parameter. Let's look at the docs:

scoring : string, callable or None, optional, default: None

A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y).

So it isn't stated explicitly, but this likely implies that it uses the estimator's default .score method.
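One way to sanity-check that guess without reading the source is to compare the default against an explicit scoring='r2' on synthetic data (a sketch; the data here is made up):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(13)
X = rng.rand(60, 2)
y = X[:, 0] + 0.05 * rng.randn(60)

cv = KFold(n_splits=2, shuffle=True, random_state=13)
tree = DecisionTreeRegressor(max_depth=2, random_state=0)

default_scores = cross_val_score(tree, X, y, cv=cv)           # scoring=None
r2_scores = cross_val_score(tree, X, y, cv=cv, scoring='r2')  # explicit R^2

# If the default really is the estimator's .score (R^2), these match.
print(np.allclose(default_scores, r2_scores))  # True
```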

To confirm this hypothesis, let's dig into the source code. We can see that the scorer that is ultimately used is the following:

scorer = check_scoring(estimator, scoring=scoring)

Let's look at the source for check_scoring:

has_scoring = scoring is not None
if not hasattr(estimator, 'fit'):
    raise TypeError("estimator should be an estimator implementing "
                    "'fit' method, %r was passed" % estimator)
if isinstance(scoring, six.string_types):
    return get_scorer(scoring)
elif has_scoring:
    # Heuristic to ensure user has not passed a metric
    module = getattr(scoring, '__module__', None)
    if hasattr(module, 'startswith') and \
       module.startswith('sklearn.metrics.') and \
       not module.startswith('sklearn.metrics.scorer') and \
       not module.startswith('sklearn.metrics.tests.'):
        raise ValueError('scoring value %r looks like it is a metric '
                         'function rather than a scorer. A scorer should '
                         'require an estimator as its first parameter. '
                         'Please use `make_scorer` to convert a metric '
                         'to a scorer.' % scoring)
    return get_scorer(scoring)
elif hasattr(estimator, 'score'):
    return _passthrough_scorer
elif allow_none:
    return None
else:
    raise TypeError(
        "If no scoring is specified, the estimator passed should "
        "have a 'score' method. The estimator %r does not." % estimator)

So note that scoring=None has been passed along, so:

has_scoring = scoring is not None

implies that has_scoring == False. Furthermore, the estimator has a .score attribute, so we go down this branch:

elif hasattr(estimator, 'score'):
    return _passthrough_scorer

Which is simply:

def _passthrough_scorer(estimator, *args, **kwargs):
    """Function that wraps estimator.score"""
    return estimator.score(*args, **kwargs)
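In recent scikit-learn versions check_scoring is importable from sklearn.metrics (the answer quotes an older internal module), so this passthrough behaviour can be observed directly (a sketch on made-up data):

```python
import numpy as np
from sklearn.metrics import check_scoring
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = X.sum(axis=1)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# With scoring=None, check_scoring falls back to the estimator's own
# .score method, so the scorer and tree.score agree.
scorer = check_scoring(tree, scoring=None)
print(np.isclose(scorer(tree, X, y), tree.score(X, y)))  # True
```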

Finally, we now know that the scorer is your estimator's default score method. Let's check the docs for the estimator, which clearly state:

Returns the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.

So it appears your score is indeed the coefficient of determination. Basically, a negative R^2 means your model is performing very poorly: worse than if we simply predicted the expected value (i.e. the mean) for every input. That makes sense, because, as you say:

I have a small sample of ~40 observations and ~70 variables. Might this be the problem?

Yes, it is a problem. With only 40 observations, there is little hope of making meaningful predictions in a 70-dimensional problem space.
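To see how hopeless n=40 with p=70 is, here is a sketch with pure-noise data of the same shape as described (random data, not the asker's):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# 40 samples, 70 features, and a target with no real signal.
rng = np.random.RandomState(13)
X = rng.randn(40, 70)
y = rng.randn(40)

cv = KFold(n_splits=2, shuffle=True, random_state=13)
scores = cross_val_score(DecisionTreeRegressor(max_depth=3, random_state=0),
                         X, y, cv=cv)

# On held-out folds the tree does worse than predicting the mean,
# so the cross-validated R^2 comes out negative.
print(scores)
```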
