Ridge regression RMSE on every subset is higher than on the total set

Posted 2024-10-03 02:47:37


I trained a model on one data set and am now trying to apply it to each of that set's subsets.

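Mathematically, the total RMSE and MAE (mean absolute error) should lie between the smallest and largest per-subset RMSE and MAE. Instead, every per-subset RMSE and MAE is higher than the total RMSE and MAE.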
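The bound follows because the pooled MSE is a size-weighted average of the subset MSEs (and likewise for MAE), provided the predictions for each row are the same in both evaluations. A minimal numerical check, using made-up residuals for three hypothetical groups rather than the data from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical residuals (y - y_pred) for three groups with different noise levels.
groups = np.repeat([0, 1, 2], [50, 200, 30])
residuals = rng.normal(scale=np.array([1.0, 5.0, 20.0])[groups])

def rmse(r):
    return np.sqrt(np.mean(r ** 2))

def mae(r):
    return np.mean(np.abs(r))

total_rmse, total_mae = rmse(residuals), mae(residuals)
group_rmse = [rmse(residuals[groups == g]) for g in np.unique(groups)]
group_mae = [mae(residuals[groups == g]) for g in np.unique(groups)]

# The pooled metric always falls between the best and worst group metric.
print(min(group_rmse) <= total_rmse <= max(group_rmse))
print(min(group_mae) <= total_mae <= max(group_mae))
```

If the per-subset numbers all exceed the total, the per-row predictions must have changed between the two evaluations.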

Here is what I did:

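%pyspark
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import RobustScaler

def preprocessing(features, attributes):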

    features_2 = features[attributes]
    y = features['y'].values
    x = features_2.values 

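    # Fit the scaler on the rows passed in and scale every column except the first
    robustScaler = RobustScaler(quantile_range=(25.0,75.0))
    xScaled = robustScaler.fit_transform(x[:,1:x.shape[1]])

    # Clip the scaled values to the range [-2, 2]
    xScaled[xScaled < -2.0] = -2.0 
    xScaled[xScaled > 2.0] = 2.0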
    xCustomers = x[:,0]
    xCustomers_reshaped = xCustomers.reshape((x[:,0].size, 1)) 
    x_TS = xScaled 
    x_T0 = xScaled[:,:] 
    x_T0_all = np.hstack((np.ones((x_T0.shape[0], 1)), x_T0, x_T0**2, x_T0**3)) 
    xCustR = xCustomers.reshape((x[:,0].size, 1)) 
    x_TS_all = np.hstack((xCustR*np.ones((x_TS.shape[0], 1)), xCustR*x_TS, xCustR*(x_TS**2), xCustR*(x_TS**3))) 
    x_all = np.hstack((x_T0_all, x_TS_all))
    variable_names = features_2.columns[1:].tolist() 
    return x_all, variable_names, y

def trainModel(features,attributes,optAlpha):
    x_all, variable_names, y = preprocessing(features, attributes)
    ridge = linear_model.Ridge(fit_intercept=False, copy_X=True, alpha=optAlpha, solver='auto')
    ridge.fit(x_all, y)
    return ridge

def useModel(features,ridge,attributes):
    x_all, variable_names, y = preprocessing(features, attributes)
    y_pred = ridge.predict(x_all)
    rmse = np.sqrt(mean_squared_error(y,y_pred))
    mae = mean_absolute_error(y, y_pred)    
    print("RMSE on test set: ", round(rmse,2))
    print("MAE on test set:  ", round(mae,2))
    return y_pred, y, rmse, mae

ridge = trainModel(df_features_train, attributes, optAlpha)
useModel(df_features_train,ridge,attributes)

RMSE on test set:  67.05
MAE on test set:   52.5

Now I call the useModel function for each organisation separately, which also runs the preprocessing separately for each one:

orgIDError = pd.DataFrame([],columns=['orgID','rmse','mae'])

for orgID in df_features['orgID'].unique():
    yPred, y, rmse, mae = useModel(df_features_train[df_features_train.orgID == orgID],ridge,attributes)
    df = pd.DataFrame([[orgID,rmse,mae]],columns=['orgID','rmse','mae'])
    orgIDError = pd.concat([orgIDError, df])
print(orgIDError)

   orgID       rmse          mae
0  615   194.848564   155.502885
0  577   101.156573    76.083797
0  957  1564.256952   814.316566
0  763   832.782755   501.865561
0  616  1337.456555   860.404253
0  968   526.207558   347.265139
0  954  1570.315284  1149.191017
0  874   241.254153   202.429037
0  554   402.013992   344.846957
0  950  1073.348186   673.874603

What is going wrong here?


Tags: df, np, all, variable, attributes, features, rmse, ridge
1 answer
User
#1 · Posted 2024-10-03 02:47:37

I found the cause myself.

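The RobustScaler in the preprocessing step is refitted on whatever data it receives, so it scales the full set and each subset differently.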
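A small illustration of that effect, using synthetic one-column data (not the data from the question): a RobustScaler fitted on the full set learns a different median and interquartile range than one fitted on a subset, so the same raw rows map to different scaled values.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic feature: two clusters; the subset contains only the second cluster.
x_full = np.concatenate([rng.normal(0, 1, 500), rng.normal(100, 10, 500)]).reshape(-1, 1)
x_subset = x_full[500:]

full_scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(x_full)
subset_scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(x_subset)

# center_ is the median and scale_ the interquartile range learned by each fit.
print(full_scaler.center_, full_scaler.scale_)
print(subset_scaler.center_, subset_scaler.scale_)

# The same raw rows are mapped to different values by the two scalers.
print(full_scaler.transform(x_subset[:5]).ravel())
print(subset_scaler.transform(x_subset[:5]).ravel())
```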

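As a result, the values in each subset are scaled differently from the training data and no longer match what the model was fitted on.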
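One possible way to avoid this, sketched under assumptions: fit the scaler once on the training set and only call transform when preprocessing subsets. The helper fit_scaler, the extra scaler argument, the column names and the synthetic data below are all hypothetical; only the feature construction mirrors the question's code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import RobustScaler

def fit_scaler(features, attributes):
    # Fit once on the training set; attributes[0] is the unscaled customer column.
    return RobustScaler(quantile_range=(25.0, 75.0)).fit(features[attributes[1:]].values)

def preprocessing(features, attributes, scaler):
    x = features[attributes].values
    # transform only: reuse the statistics learned from the training set
    x_scaled = np.clip(scaler.transform(x[:, 1:]), -2.0, 2.0)
    cust = x[:, [0]]
    poly = np.hstack((np.ones((len(x), 1)), x_scaled, x_scaled**2, x_scaled**3))
    return np.hstack((poly, cust * poly)), features['y'].values

# Synthetic stand-in for df_features_train (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    'orgID': rng.choice([615, 577, 957], size=n),
    'customers': rng.uniform(1, 10, size=n),
    'temp': rng.normal(size=n),
})
df['temp'] += df['orgID'] / 100.0  # each organisation gets its own feature range
df['y'] = 3 * df['customers'] * df['temp'] + rng.normal(size=n)
attributes = ['customers', 'temp']

scaler = fit_scaler(df, attributes)
x_all, y = preprocessing(df, attributes, scaler)
ridge = Ridge(fit_intercept=False, alpha=1.0).fit(x_all, y)

def rmse_of(frame):
    x, y_true = preprocessing(frame, attributes, scaler)
    return np.sqrt(mean_squared_error(y_true, ridge.predict(x)))

total_rmse = rmse_of(df)
subset_rmse = {org: rmse_of(df[df.orgID == org]) for org in df['orgID'].unique()}
print(total_rmse, subset_rmse)
```

With a shared scaler every row gets the same prediction whether it is evaluated alone or as part of the full set, so the total RMSE falls between the smallest and largest per-organisation RMSE again.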
