How to do regression with both categorical and non-categorical features

Published 2024-10-01 02:40:36


If I have multiple features, some categorical and some not, what is the correct way to run a regression with sklearn?

I tried `ColumnTransformer`, but I am not sure I am using it correctly:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, OneHotEncoder

features = df[['grad', 'oblast', 'tip',
               'parcela', 'bruto', 'neto', 'osnova',
               'neto/bruto', 'zauzetost', 'sipovi', 'garaza',
               'nadzemno', 'podzemno', 'tavanica', 'fasada']]

results = df[['ukupno gradjevinski din']]

trans = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['grad', 'oblast', 'tip', 'garaza', 'tavanica', 'fasada']),
                                        ('normalizer', Normalizer(), ['parcela', 'bruto', 'neto', 'osnova', 'neto/bruto', 'zauzetost', 'nadzemno'])],
                          remainder='passthrough')  # default remainder is 'drop'

features = trans.fit_transform(features)
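As a side note, `Normalizer` rescales each *sample* (row) to unit norm; it does not scale columns. If per-column scaling of the numeric features is what is intended, `StandardScaler` is the usual choice. A minimal, self-contained sketch (the column names and values here are made up, not the asker's data):

```python
# Sketch: ColumnTransformer with one-hot encoding for a categorical column
# and StandardScaler (per-column scaling) for numeric columns.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'grad':  ['BG', 'NS', 'BG', 'NI'],    # categorical: 3 distinct values
    'bruto': [120.0, 85.5, 230.0, 99.0],  # numeric
    'neto':  [100.0, 70.0, 200.0, 80.0],  # numeric
})

trans = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['grad']),
        ('scale', StandardScaler(), ['bruto', 'neto']),
    ],
    remainder='passthrough')

X = trans.fit_transform(df)
print(X.shape)  # 3 one-hot columns + 2 scaled columns -> (4, 5)
```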

When I print `corr()` for some of the features, I see strong correlations between them and the target:

print(df[['parcela', 'bruto', 'neto', 'osnova', 'ukupno gradjevinski din']].corr().to_string())

                          parcela     bruto      neto    osnova  ukupno gradjevinski din
parcela                  1.000000  0.929939  0.930039  0.987574                 0.911690
bruto                    0.929939  1.000000  0.998390  0.943996                 0.878914
neto                     0.930039  0.998390  1.000000  0.946102                 0.889850
osnova                   0.987574  0.943996  0.946102  1.000000                 0.937064
ukupno gradjevinski din  0.911690  0.878914  0.889850  0.937064                 1.000000

The problem is that I have stacked 7-8 regression models and am evaluating them with cross-validation, but I get scores ranging from -10 to -80, which does not look normal to me:

import numpy as np
from sklearn.ensemble import BaggingRegressor, ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

regressors = [
              ["Bagging Regressor TREE", BaggingRegressor(base_estimator = DecisionTreeRegressor(max_depth=15))],
              ["Bagging Regressor FOREST", BaggingRegressor(base_estimator = RandomForestRegressor(n_estimators = 100))],
              ["Bagging Regressor linear", BaggingRegressor(base_estimator = LinearRegression(normalize=True))],
              ["Bagging Regressor lasso", BaggingRegressor(base_estimator = Lasso(normalize=True))],
              ["Bagging Regressor SVR rbf", BaggingRegressor(base_estimator = SVR(kernel = 'rbf', C=10.0, gamma='scale'))],
              ["Extra Trees Regressor", ExtraTreesRegressor(n_estimators = 150)],
              ["K-Neighbors Regressor", KNeighborsRegressor(n_neighbors=1)]]


for reg in regressors:
    scores = cross_val_score(reg[1], features, results, cv=5, scoring='r2')
    scores = np.average(scores)
    print(reg[0], scores)
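One frequent pitfall with this setup is fitting the `ColumnTransformer` on the full dataset before calling `cross_val_score`: statistics from the validation fold then leak into training. Wrapping the transformer and the model in a `Pipeline` refits the preprocessing inside each fold. A runnable sketch on synthetic data (column names are illustrative only):

```python
# Sketch: preprocessing + model in one Pipeline, evaluated with cross_val_score,
# so the transformer is fitted only on each fold's training split.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'grad':  rng.choice(['BG', 'NS', 'NI'], size=100),  # categorical
    'bruto': rng.uniform(50, 300, size=100),            # numeric
})
y = X['bruto'] * 10 + rng.normal(0, 5, size=100)  # target almost linear in bruto

pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['grad']),
        ('scale', StandardScaler(), ['bruto']),
    ])),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(scores.mean())  # strongly positive, since y is a near-linear function of bruto
```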

Every run of "Bagging Regressor linear" gives me this error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Even if I run the regression models using only the features you see in the `corr()` output, I get the same result.

Can you tell me what might be going wrong here?
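That `ValueError` usually means the feature matrix really does contain NaN or infinite values (a ratio column such as `neto/bruto` could, for example, produce `inf` from a zero denominator, though that is only a guess). A quick diagnostic sketch, with made-up data, to locate the offending rows before fitting:

```python
# Sketch: find rows containing NaN or inf in a numeric feature matrix.
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.inf]])

bad_rows = ~np.isfinite(X).all(axis=1)       # True where a row has NaN or inf
print(np.where(bad_rows)[0])                  # row indices -> [1 2]
```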


Tags: df, base, features, estimator, scores, din, regressor, baggingregressor
1 Answer

Posted by a forum user on 2024-10-01 02:40:36

One way to combine categorical and non-categorical features in a regression model is to use one-hot encoding for the categorical features. To be concrete: if you have a categorical feature with 3 possible values, you create 3 columns and fill them with 0s and 1s according to the one-hot encoding of each value.
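The description above can be sketched with `pandas.get_dummies` (the column name and values here are made up for illustration):

```python
# Sketch: a categorical column with 3 distinct values expands
# into 3 binary indicator columns.
import pandas as pd

df = pd.DataFrame({'fasada': ['brick', 'glass', 'brick', 'stone']})
dummies = pd.get_dummies(df, columns=['fasada'])
print(dummies)  # columns: fasada_brick, fasada_glass, fasada_stone
```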

You can find a detailed explanation, examples, and an implementation in the section "One-Hot Encoding (Dummy Variables)" on page 213 of the book Introduction to Machine Learning with Python: A Guide for Data Scientists.
