What is the correct way to calibrate probabilities for a multiclass problem?

Posted 2024-10-02 02:41:16


I am training a model to predict a label (the target) based on loan status (e.g. 0, 1, 2, 3), so I have 4 classes. So far I have trained a model as follows:

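import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier
from HyperclassifierSearch import HyperclassifierSearch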

X = data.iloc[:, :-1]
y = data.label

    
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)
# Create a hold-out dataset to train the calibrated model, to prevent overfitting
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, stratify=y_train, test_size=0.2, random_state=42)

categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# strategy='constant' is needed for fill_value to take effect;
# with the default strategy ('mean') fill_value is ignored.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)),
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, cat_cols)])


# Then I use the HyperclassifierSearch library

models = {
    'xgb': Pipeline(steps=[('preprocessor', preprocessor),
                           ('clf', XGBClassifier(objective='multi:softprob'))]),
    'rf': Pipeline(steps=[('preprocessor', preprocessor),
                          ('clf', RandomForestClassifier(criterion='entropy', random_state=42))]),
}


search = HyperclassifierSearch(models, params)
best_grid = search.train_model(X_train, y_train, cv=3, n_jobs=-1, scoring='accuracy')
results = search.evaluate_model()
fitted_model = best_grid.best_estimator_

pred = fitted_model.predict_proba(X_test)
labels = fitted_model.predict(X_test)

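**Note: I have omitted the params dict and much of the surrounding setup (including many imports) because it is large, so only the HyperclassifierSearch part is shown.**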

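My pred is a matrix with 4 columns, one per loan class. In general, I know it is good practice to calibrate probabilities, especially those from tree-based algorithms, whose output is a score rather than a true probability. However, I am confused about how to calibrate these probabilities.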
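
Before calibrating, it can help to measure how far off the 4-column pred matrix is. The following is a minimal sketch, not the original pipeline: it assumes synthetic data from make_classification in place of the loan data and a plain RandomForestClassifier in place of the tuned model, and it checks each class separately as "this class vs. the rest".

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the loan data (an assumption).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
pred = model.predict_proba(X_test)  # one column per class, rows sum to 1

# Multiclass log loss: lower means the probabilities match the outcomes better.
print("log loss:", log_loss(y_test, pred, labels=model.classes_))

# One reliability curve per class, treating the problem as "class k vs. rest".
for k, cls in enumerate(model.classes_):
    is_cls = (y_test == cls).astype(int)
    frac_pos, mean_pred = calibration_curve(is_cls, pred[:, k], n_bins=5)
    gap = np.abs(frac_pos - mean_pred).max()
    print(f"class {cls}: largest gap between predicted and observed = {gap:.3f}")
```

A large gap for a class suggests its column of pred is over- or under-confident; the same two measurements can be repeated after calibration to see whether it helped.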

Normally I would use a hold-out validation set for calibration, but I am not sure how to do that in the multiclass case.
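
One possible way to use a hold-out set with several classes is scikit-learn's CalibratedClassifierCV wrapped around an already fitted model. The sketch below is an illustration under assumptions, not a definitive answer: the data is synthetic, a RandomForestClassifier stands in for best_grid.best_estimator_, and the try/except is there because newer scikit-learn releases (1.6 and later) provide FrozenEstimator for this purpose while older ones use cv="prefit".

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the loan data (an assumption).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, stratify=y_train, test_size=0.2, random_state=42)

# Stand-in for the tuned model; it only ever sees X_train.
fitted_model = RandomForestClassifier(n_estimators=100, random_state=42)
fitted_model.fit(X_train, y_train)

# Wrap the fitted model so that only the calibrators are trained below.
try:
    from sklearn.frozen import FrozenEstimator  # scikit-learn >= 1.6
    calibrated = CalibratedClassifierCV(FrozenEstimator(fitted_model),
                                        method="sigmoid")
except ImportError:
    calibrated = CalibratedClassifierCV(fitted_model, method="sigmoid",
                                        cv="prefit")

# Calibrators are fit per class (one-vs-rest) on the hold-out set only.
calibrated.fit(X_validation, y_validation)
calibrated_pred = calibrated.predict_proba(X_test)  # shape (n_samples, 4)
```

method="sigmoid" is chosen here because the hold-out set is small; method="isotonic" is the other built-in option and usually needs more calibration data.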

Update

Should I extend the XGBClassifier above by doing the following:

OneVsRestClassifier(CalibratedClassifierCV(XGBClassifier(objective='multi:softprob'), cv=10))

Source: Multiclass linear SVM in python that return probability
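
On the OneVsRestClassifier wrapper proposed in the update: CalibratedClassifierCV accepts a multiclass estimator directly, fits one calibrator per class in a one-vs-rest fashion and renormalizes the rows, so the outer wrapper may not be needed. A minimal sketch under assumptions follows: synthetic data, and RandomForestClassifier swapped in for XGBClassifier only to avoid the extra dependency.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the loan data (an assumption).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

# The multiclass estimator is passed in unfitted; with cv=5 each fold trains
# a copy of it and fits the per-class calibrators on the held-out part.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)  # averaged over the 5 fold models
print(proba.shape, proba.sum(axis=1)[:3])
```

With cross-validation inside the wrapper, a separate validation split is not required for calibration, at the cost of training the base model once per fold.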

My question is: what is the correct way to calibrate probabilities from a multiclass model?

