对模型和校准的ClassifiedRCV使用相同的验证集

2024-10-05 17:44:06 发布

您现在位置:Python中文网/ 问答频道 /正文

在校准分类RCV的documentation中说明

The samples that are used to train the calibrator should not be used to train the target classifier.

据我所知;如果我们用X_train来训练我们的model,那么我们就不应该用X_train来训练calibrator(因为我假设它只是把X_train映射到model.predict_proba?)

但是,使用用于model超参数优化的验证集X_val来校准calibrator是否公平


Tags: thetomodelthatdocumentation分类trainare
1条回答
网友
1楼 · 发布于 2024-10-05 17:44:06

CalibratedClassifierCV在数据样本上拟合分类器,然后查看预测的概率是否与不同数据集上的实际标签很好地对应。然后,它尝试调整(“校准”)proba,以便对于proba为0.7 eg的一组样本,将有大约0.7个带有“1”的真实标签(对于二进制分类)

注意,训练分类器和拟合校准器是在不同的数据集上进行的,以解释对未知数据的可能偏差。这可以通过两种不同的方式完成:

  1. CalibratedClassifierCV(..., cv=None, ensemble=True)。您可以根据提供的cv策略分割列车数据,例如cv=5。通常,在k-1折叠上训练基估计器(要校准的分类器),然后在第k个左外折叠上装配校准器。测试的结果预测将是k个安装校准器(“集合”)的平均值

  2. 如果ensemple=False,将使用cross_val_predict的单个配合来安装校准器

考虑到这一点:

would it be fair to use the validation set, X_val which has been used for hyper-parameter optimization for model to calibrate calibrator?

找到最佳的超参数,然后是的,使用相同的序列数据集,但使用所选的cv策略(集合真或假),以适合校准器。再次注意,您永远不会在同一数据上安装分类器和calibrartor

有趣的是,根据分类器和数据的大小,校准可能有助于提高概率预测的准确性,也可能没有:

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

np.random.seed(42)

# Create classifiers
lrc = LogisticRegression(n_jobs=-1)
gnb = GaussianNB()
svc = SVC(C=1.0, probability=True,)
rfc = RandomForestClassifier(n_estimators=300, max_depth=3,n_jobs=-1)
xgb = XGBClassifier(
    n_estimators=300,
    max_depth=3,
    objective="binary:logistic",
    eval_metric="logloss",
    use_label_encoder=False,
)
lgb = LGBMClassifier(n_estimators=300, objective="binary", max_depth=3)
cat = CatBoostClassifier(n_estimators=300, max_depth=3, objective="Logloss", verbose=0)

df = pd.DataFrame()

plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")

for clf, name in [
    (lrc, "Logistic"),
    (gnb, "Naive Bayes"),
#     (svc, "Support Vector Classification"),
    (rfc, "Random Forest"),
    (lgb, "Light GBM"),
    (xgb, "Xgboost"),
    (cat, "Catboost"),
]:
    print(name)
    for nsamples in [1000,10000,100000]:
        train_samples = 0.75 
        X, y = make_classification(
            n_samples=nsamples, n_features=20, n_informative=2, n_redundant=2
        )
        i = int(train_samples * nsamples)
        X_train = X[:i]
        X_test  = X[i:]
        y_train = y[:i]
        y_test  = y[i:]
        clf.fit(X_train, y_train)
        prob_pos = clf.predict_proba(X_test)[:, 1]
        fraction_of_positives, mean_predicted_value = calibration_curve(
            y_test, prob_pos, n_bins=10
        )

        if nsamples in [10000]:
            ax1.plot(
                mean_predicted_value,
                fraction_of_positives,
                "s-",
                label="%s" % (name + " nsamples " + str(nsamples),),
            )

            ax2.hist(
                prob_pos,
                bins=10,
                label="%s" % (name + " nsamples " + str(nsamples),),
                histtype="step",
                lw=2,
            )

        preds = clf.predict_proba(X_test)
        ll_before = log_loss(y_test, preds)
        preds = (
            CalibratedClassifierCV(clf, cv=5)
            .fit(X_train, y_train)
            .predict_proba(X_test)
        )
        ll_after = log_loss(y_test, preds)
        df = df.append(pd.DataFrame({
            "Samples": [nsamples],
            "Model": name,
            "LogLoss Before": round(ll_before,4),
            "LogLoss After": round(ll_after,4),
            "Gain": round(ll_before/ll_after,4)
        }))

ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title("Calibration plots  (reliability curve)")

ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)

plt.tight_layout()
print(df)

   Samples          Model  LogLoss Before  LogLoss After    Gain
0     1000       Logistic          0.3941         0.3854  1.0226
0    10000       Logistic          0.1340         0.1345  0.9959
0   100000       Logistic          0.1645         0.1645  0.9999
0     1000    Naive Bayes          0.3025         0.2291  1.3206
0    10000    Naive Bayes          0.4094         0.3055  1.3403
0   100000    Naive Bayes          0.4119         0.2594  1.5881
0     1000  Random Forest          0.4438         0.3137  1.4146
0    10000  Random Forest          0.3450         0.2776  1.2427
0   100000  Random Forest          0.3104         0.1642  1.8902
0     1000      Light GBM          0.2993         0.2219  1.3490
0    10000      Light GBM          0.2074         0.2182  0.9507
0   100000      Light GBM          0.2397         0.2534  0.9459
0     1000        Xgboost          0.1870         0.1638  1.1414
0    10000        Xgboost          0.3072         0.2967  1.0351
0   100000        Xgboost          0.1136         0.1186  0.9575
0     1000       Catboost          0.1834         0.1901  0.9649
0    10000       Catboost          0.1251         0.1377  0.9085
0   100000       Catboost          0.1600         0.1727  0.9264

enter image description here

相关问题 更多 >