How to remove redundant features using sklearn (chi-square or ANOVA)

1 answer

User

#1 · Posted 2024-09-28 01:31:40

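You can use SelectKBest to score the features with a supplied function (for example chi-square) and keep the N highest-scoring ones. For example, to keep the top 10 features: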

from sklearn.feature_selection import SelectKBest, chi2, f_classif

# chi-square
top_10_features = SelectKBest(chi2, k=10).fit_transform(X, y)

# or ANOVA
top_10_features = SelectKBest(f_classif, k=10).fit_transform(X, y)
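Since fit_transform returns a plain array, it can be useful to see which columns were kept. The sketch below is one way to do that, assuming X is a pandas DataFrame; the synthetic data, the column names f0..f5 and k=3 are made up for illustration. Note that chi2 only accepts non-negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic, non-negative data (chi2 rejects negative values); illustrative only
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = pd.DataFrame(rng.integers(0, 5, size=(200, 6)),
                 columns=[f"f{i}" for i in range(6)])
X["f0"] = X["f0"] + 4 * y  # make f0 clearly depend on the target

selector = SelectKBest(chi2, k=3).fit(X, y)

# get_support() returns a boolean mask over the input columns
selected_columns = X.columns[selector.get_support()].tolist()

# Per-feature scores, highest first, to inspect how close the cut-off was
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)

X_reduced = X[selected_columns]
```

Keeping the fitted selector around also lets you apply the same column choice to new data with selector.transform.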

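However, there are generally many methods and techniques for feature reduction. You usually have to decide which ones to use based on your data, the model you are training, and the output you want to predict. For example, even if you end up with 20 features, you should still check the correlation between each pair of features and, where a pair is highly correlated, drop one of the two.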

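The following function returns the most highly correlated feature pairs. You can use its output to further reduce your current list of variables: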

def get_feature_correlation(df, top_n=None, corr_method='spearman',
                            remove_duplicates=True, remove_self_correlations=True):
    """
    Compute the feature correlation and sort feature pairs based on their correlation

    :param df: The dataframe with the predictor variables
    :type df: pandas.core.frame.DataFrame
    :param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
    :param corr_method: Correlation computation method
    :type corr_method: str
    :param remove_duplicates: Indicates whether mirrored pairs, (A, B) and (B, A), are reported only once
    :type remove_duplicates: bool
    :param remove_self_correlations: Indicates whether self correlations will be removed
    :type remove_self_correlations: bool

    :return: pandas.core.frame.DataFrame
    """
    # Absolute values, so strong negative correlations rank as high as positive ones
    corr_matrix_abs = df.corr(method=corr_method).abs()

    # Flatten the matrix into one row per feature pair, strongest correlation first
    sorted_correlated_features = corr_matrix_abs.unstack() \
        .rename_axis(['Feature 1', 'Feature 2']) \
        .sort_values(ascending=False) \
        .reset_index(name='Correlation (abs)')

    # Remove comparisons of the same feature
    if remove_self_correlations:
        sorted_correlated_features = sorted_correlated_features[
            (sorted_correlated_features['Feature 1'] != sorted_correlated_features['Feature 2'])
        ]

    # Remove duplicates: keep a pair only in matrix column order, instead of
    # relying on mirrored pairs being adjacent after the sort
    if remove_duplicates:
        position = {name: i for i, name in enumerate(corr_matrix_abs.columns)}
        first = sorted_correlated_features['Feature 1'].map(position)
        second = sorted_correlated_features['Feature 2'].map(position)
        sorted_correlated_features = sorted_correlated_features[first <= second]

    if top_n:
        return sorted_correlated_features[:top_n]

    return sorted_correlated_features
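Once the highly correlated pairs are known, one member of each pair still has to be dropped. The sketch below shows one common greedy way to do that; it is not part of the answer's function. The helper name drop_highly_correlated, the 0.9 threshold and the rule "drop the later column" are all assumptions chosen for illustration, and in practice you may prefer to keep whichever feature is more useful to your model.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.9, corr_method="spearman"):
    """Drop the later feature of every pair whose absolute correlation exceeds threshold."""
    corr_abs = df.corr(method=corr_method).abs()
    # Keep the strict upper triangle so each pair is considered only once
    upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
    # A column is dropped if it correlates strongly with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Illustrative data: "a_copy" is almost a rescaled copy of "a"
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=300),
    "b": rng.normal(size=300),
})
reduced, dropped = drop_highly_correlated(demo, threshold=0.9)
```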

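Other options include:

Percentage of missing values
Correlation with the target variable
Including some random variables and checking whether they make it into the reduced variable list
Feature stability over time
And so on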
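Two of the options above, the missing-value percentage and the random-variable check, can be sketched as follows. Everything here is an assumption made for illustration: the synthetic data, the column names, the 0.5 missing-value cutoff, the number of random probe columns, and the use of f_classif as the scoring function.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "informative": y + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
    "mostly_missing": np.where(rng.random(n) < 0.8, np.nan, 1.0),
})

# 1) Fraction of missing values per feature; drop features above the cutoff
missing_fraction = X.isna().mean()
X = X.loc[:, missing_fraction <= 0.5]

# 2) Add random probe columns and score all columns with the ANOVA F-test
probes = pd.DataFrame(rng.normal(size=(n, 5)),
                      columns=[f"probe_{i}" for i in range(5)])
X_probed = pd.concat([X, probes], axis=1)
f_scores, _ = f_classif(X_probed, y)
scores = pd.Series(f_scores, index=X_probed.columns)

# Keep only the real features that score above the best random probe
best_probe_score = scores[probes.columns].max()
kept = [col for col in X.columns if scores[col] > best_probe_score]
```

A real feature that cannot beat pure noise is a reasonable candidate for removal, although with only a few probes the comparison is itself noisy.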

As I mentioned, it really depends on what you want to achieve.
