Data preparation before RFECV or any other feature selection

import numpy as np
import numpy.random as nr
from sklearn import preprocessing, linear_model
from sklearn import model_selection as ms
from sklearn import feature_selection as fs

# x (feature DataFrame) and catlabel (labels) are assumed to be defined earlier.

def find_correlation(data, threshold=0.9, remove_negative=False):
    # Return the column names to drop because they correlate above the threshold.
    corr_mat = data.corr()
    if remove_negative:
        # Treat strong negative correlation the same as strong positive correlation.
        corr_mat = np.abs(corr_mat)
    # Keep only the strict lower triangle so each pair is examined once.
    corr_mat.loc[:, :] = np.tril(corr_mat, k=-1)
    already_in = set()
    result = []
    for col in corr_mat:
        perfect_corr = corr_mat[col][corr_mat[col] > threshold].index.tolist()
        if perfect_corr and col not in already_in:
            already_in.update(set(perfect_corr))
            perfect_corr.append(col)
            result.append(perfect_corr)
    # Keep the first member of each correlated group and drop the rest.
    select_nested = [f[1:] for f in result]
    select_flat = [i for j in select_nested for i in j]
    return select_flat

corrFeatList = find_correlation(x)
fpd = x.drop(corrFeatList, axis=1)
fpd['label'] = catlabel
fpd = fpd[fpd['label'].notnull()]

Features = np.array(fpd.iloc[:, :-1])
Labels = np.array(fpd.iloc[:, -1])
hpd = fpd.iloc[:, :-1]
headerName = hpd.columns

# Scale first (standardisation)
scaler = preprocessing.StandardScaler()
Features = scaler.fit_transform(Features)

# RFECV with logistic regression first
## Reshape the label array
Labels = Labels.reshape(Labels.shape[0],)

## Set folds for nested cross validation
nr.seed(988)
feature_folds = ms.KFold(n_splits=10, shuffle=True)

## Define the model
logistic_mod = linear_model.LogisticRegression(C=10, class_weight="balanced")

## Perform feature selection by CV with high variance features only
nr.seed(6677)
selector = fs.RFECV(estimator=logistic_mod, cv=feature_folds)
selector = selector.fit(Features, Labels)
Features = selector.transform(Features)
print('Best features :', headerName[selector.support_])

1 answer

User

#1 · Posted 2024-10-06 12:52:01

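RFECV simply takes the data you give it, cross-validates the model, and removes the least important features as ranked by the classifier or regressor you supply. It then repeats the same step recursively on the features that remain. So it is not explicitly aware of linear correlation between features.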
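To make that procedure concrete, here is a minimal sketch on synthetic data. The dataset, the near-duplicate column, and all parameter values are assumptions chosen for illustration, not taken from the question. RFECV is handed a column that is almost a copy of another one, and nothing in the call inspects pairwise correlation: elimination is driven only by the estimator's coefficients and the cross-validated score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data (assumed for illustration): 10 features, 3 of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)

# Append a near-duplicate of column 0, so two columns are almost perfectly correlated.
rng = np.random.default_rng(0)
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=X.shape[0])])

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,  # remove one feature per elimination round
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
selector.fit(X, y)

print("features kept:", selector.n_features_)
print("support mask :", selector.support_)
print("ranking      :", selector.ranking_)  # 1 marks a selected feature
```

Whether the near-duplicate column survives depends on the data and the estimator, so it is worth checking `selector.support_` rather than assuming either outcome.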

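At the same time, high correlation between two features does not mean that one of them is the best candidate for removal. A highly correlated feature can still carry useful information; for example, it may have a smaller variance than the feature that is retained.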
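One constructed illustration of this point, under assumptions that are entirely mine (the synthetic features `x1` and `x2`, the noise level, and the label rule are hypothetical): two features share almost all of their variance, yet the label depends only on their small difference. Dropping either one by a correlation threshold discards the signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000

# Two features built from a shared component plus small independent noise.
base = rng.normal(size=n)
x1 = base + 0.1 * rng.normal(size=n)
x2 = base + 0.1 * rng.normal(size=n)

# Hypothetical label rule: only the difference between the two features matters.
y = (x1 - x2 > 0).astype(int)

corr = np.corrcoef(x1, x2)[0, 1]
X_both = np.column_stack([x1, x2])
X_one = x1.reshape(-1, 1)  # what remains after dropping x2 as "redundant"

model = LogisticRegression(max_iter=1000)
acc_both = cross_val_score(model, X_both, y, cv=5).mean()
acc_one = cross_val_score(model, X_one, y, cv=5).mean()

print(f"correlation between x1 and x2: {corr:.3f}")
print(f"CV accuracy with both features: {acc_both:.3f}")
print(f"CV accuracy with x1 only:       {acc_one:.3f}")
```

This is a deliberately extreme case, but it shows why a fixed correlation cutoff before RFECV can remove exactly the feature the model needs.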

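In general, dimensionality reduction does not imply removing highly correlated features, although some linear methods such as PCA do this implicitly by projecting the data onto uncorrelated components.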
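A short sketch of the PCA remark, again on made-up data (the three columns and the 95% variance target are assumptions for illustration). Note that PCA does not pick a subset of the original columns; it builds new components that are linear combinations of them, and those components are uncorrelated on the data they were fitted to.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Columns 0 and 1 are almost copies of each other; column 2 is independent.
a = rng.normal(size=n)
X = np.column_stack([a, a + 0.1 * rng.normal(size=n), rng.normal(size=n)])

X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps enough components to explain that share of variance.
pca = PCA(n_components=0.95)
Z = pca.fit_transform(X_scaled)

corr_Z = np.corrcoef(Z, rowvar=False)

print("input correlation (col 0 vs col 1):", np.corrcoef(X[:, 0], X[:, 1])[0, 1])
print("components kept:", Z.shape[1])
print("explained variance ratio:", pca.explained_variance_ratio_)
print("correlation between components:", corr_Z[0, 1])
```

The two near-duplicate columns collapse into essentially one component, which is the sense in which PCA handles correlated inputs without an explicit removal step. The trade-off is that the resulting components no longer map one-to-one onto named features such as those printed from `headerName`.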
