Hyperparameter tuning in random forest

    # 1. Import the model class
    from sklearn.ensemble import RandomForestRegressor

    # 2. Instantiate the estimator with default hyperparameters
    RFReg = RandomForestRegressor(random_state=1, n_jobs=-1)

    # 3. Fit the model to the training data
    RFReg.fit(X_train, y_train)

    # 4. Predict on the test set and on the training set
    y_pred = RFReg.predict(X_test)
    y_pred_train = RFReg.predict(X_train)

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV

    RFReg = RandomForestRegressor(n_estimators=500, random_state=1, n_jobs=-1)

    # Note: "auto" is only accepted by scikit-learn < 1.3; for regressors it
    # meant "use all features", i.e. the same as max_features=1.0.
    param_grid = {
        'max_features': ["auto", "sqrt", "log2"],
        'min_samples_split': np.linspace(0.1, 1.0, 10),
        'max_depth': list(range(1, 20)),
    }

    # Try 50 random combinations from param_grid with 10-fold cross-validation
    CV_rfc = RandomizedSearchCV(estimator=RFReg, param_distributions=param_grid,
                                n_jobs=-1, cv=10, n_iter=50)
    CV_rfc.fit(X_train, y_train)

    # 1. Import the model class
    from sklearn.ensemble import RandomForestRegressor

    # 2. Instantiate the estimator with the hyperparameters found by the search
    #    ('auto' is only accepted by scikit-learn < 1.3)
    RFReg = RandomForestRegressor(n_estimators=500, random_state=1, n_jobs=-1,
                                  min_samples_split=0.1, max_features='auto',
                                  max_depth=18)

    # 3. Fit the model to the training data
    RFReg.fit(X_train, y_train)

    # 4. Predict on the test set and on the training set
    y_pred = RFReg.predict(X_test)
    y_pred_train = RFReg.predict(X_train)

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    RFReg = RandomForestRegressor(n_estimators=500, random_state=1, n_jobs=-1)

    # "auto" is only accepted by scikit-learn < 1.3
    param_grid = {
        'max_features': ["auto", "sqrt", "log2"],
        'min_samples_split': np.linspace(0.1, 1.0, 10),
        'max_depth': list(range(1, 20)),
    }

    # Exhaustively evaluate every combination with 10-fold cross-validation
    CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv=10, n_jobs=-1)
    CV_rfc.fit(X_train, y_train)

    import numpy as np
    from sklearn.metrics import mean_squared_error

    def model_evaluate(y_train, y_test, y_pred, y_pred_train):
        """Return the test and train RMSE as a dict."""
        rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
        rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
        return {'RMSE Test': rmse_test, 'RMSE Train': rmse_train}

1 answer

User

#1 · Posted 2024-09-28 03:17:40

Why are the results of the tuned model worse than those of the model with default parameters, even though I am using RandomizedSearchCV and GridSearchCV? Ideally the model should give good results when tuned with cross-validation.

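Your second question answers your first, but I tried to reproduce your results on the Boston dataset. I got {'test_rmse':3.987, 'train_rmse':1.442} with the default parameters, {'test_rmse':3.98, 'train_rmse':3.426} with the "tuned" parameters from the random search, and {} with the grid search. I then used hyperopt with the following parameter space: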

    {'max_depth': hp.choice('max_depth', range(1, 100)),
     'max_features': hp.choice('max_features', range(1, x_train.shape[1])),
     'min_samples_split': hp.uniform('min_samples_split', 0.1, 1)}

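After roughly 200 trials the results were essentially as above, so I widened the space to 'min_samples_split', 0.01, 1. That gave me the best result, {'test_rmse':3.278, 'train_rmse':1.716}, with min_samples_split equal to 0.01. According to the documentation, a fractional min_samples_split is converted to a sample count as ceil(min_samples_split * n_samples), which in our case gives {}=34. That may be too large for such a small dataset.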
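To make the effect of a fractional min_samples_split concrete, here is a minimal sketch of the ceil rule quoted above. The helper name effective_min_samples_split is made up for illustration, and the training-set size of 339 is an assumption (about two thirds of the 506 Boston rows), picked only because it reproduces the value 34. The lower bound of 2 reflects how scikit-learn appears to clamp the result, so treat it as an assumption as well.

```python
import math


def effective_min_samples_split(fraction, n_samples):
    """Convert a fractional min_samples_split into a number of samples.

    Applies the documented rule ceil(fraction * n_samples). The max(2, ...)
    clamp is an assumption about scikit-learn's internal behaviour.
    """
    return max(2, math.ceil(fraction * n_samples))


# Assumed training-set size, not taken from the original post.
n_train = 339

for fraction in (0.1, 0.01):
    needed = effective_min_samples_split(fraction, n_train)
    print(f"min_samples_split={fraction}: a node needs {needed} samples to be split")
```

With the smallest value in the original grid (0.1), no node holding fewer than about 10% of the training rows can be split, which keeps the trees shallow regardless of max_depth. Lowering the bound to 0.01 lets the trees grow much further.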

I know that cross-validation will take place only for the combinations of values present in param_grid. There could be values which are good but not included in my param_grid. So how do I deal with this kind of situation?
How do I decide what range of values I should try for max_features, min_samples_split, max_depth, or for that matter any hyperparameter in a machine learning model, to increase its accuracy? (So that I can at least get a better tuned model than the model with default hyperparameters.)

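You can't know this in advance, so you have to research each algorithm to see what parameter spaces are usually searched (Kaggle is a good source for this, e.g. google "kaggle kernel random forest"), merge them, take your dataset's characteristics into account, and optimize over the result with some kind of Bayesian Optimization algorithm (there are multiple existing libraries for this), which tries to make the best choice of the next parameter values to evaluate.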
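As a rough illustration of searching ranges instead of a fixed grid, the sketch below draws candidate configurations from continuous and integer ranges using only the standard library. It is plain random sampling, not Bayesian optimization: a library such as hyperopt would replace the sampler with one that is guided by earlier results. The function name sample_params, the bounds, and n_features=13 (the number of Boston features) are assumptions chosen for illustration.

```python
import math
import random


def sample_params(rng, n_features):
    """Draw one random forest configuration from ranges rather than a grid."""
    # Log-uniform draw between 0.01 and 1.0, so that small fractions
    # (0.01 to 0.1) are explored as thoroughly as large ones (0.1 to 1.0).
    log_low, log_high = math.log(0.01), math.log(1.0)
    fraction = math.exp(rng.uniform(log_low, log_high))
    return {
        "max_depth": rng.randint(1, 99),             # both ends inclusive
        "max_features": rng.randint(1, n_features),  # features tried per split
        "min_samples_split": min(1.0, fraction),     # must stay within (0, 1]
    }


rng = random.Random(1)  # fixed seed so the candidates are reproducible
candidates = [sample_params(rng, n_features=13) for _ in range(5)]

for params in candidates:
    print(params)
```

Each dictionary can be passed to RandomForestRegressor(**params) and scored with cross-validation. Because min_samples_split is drawn from a range, values between or below the original grid points are no longer excluded.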
