def get_feature_correlation(df, top_n=None, corr_method='spearman',
remove_duplicates=True, remove_self_correlations=True):
"""
Compute the feature correlation and sort feature pairs based on their correlation
:param df: The dataframe with the predictor variables
:type df: pandas.core.frame.DataFrame
:param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
:param corr_method: Correlation compuation method
:type corr_method: str
:param remove_duplicates: Indicates whether duplicate features must be removed
:type remove_duplicates: bool
:param remove_self_correlations: Indicates whether self correlations will be removed
:type remove_self_correlations: bool
:return: pandas.core.frame.DataFrame
"""
corr_matrix_abs = df.corr(method=corr_method).abs()
corr_matrix_abs_us = corr_matrix_abs.unstack()
sorted_correlated_features = corr_matrix_abs_us \
.sort_values(kind="quicksort", ascending=False) \
.reset_index()
# Remove comparisons of the same feature
if remove_self_correlations:
sorted_correlated_features = sorted_correlated_features[
(sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
]
# Remove duplicates
if remove_duplicates:
sorted_correlated_features = sorted_correlated_features.iloc[:-2:2]
# Create meaningful names for the columns
sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)']
if top_n:
return sorted_correlated_features[:top_n]
return sorted_correlated_features
您可以使用^{} 使用提供的函数(例如卡方)对特征进行评分,并获得N个得分最高的特征。例如,为了保留前10项功能,您可以使用以下功能:
然而,通常有许多方法和技术可用于特征约简。您通常需要根据您的数据、您正在训练的模型和您想要预测的输出来决定使用哪些方法。例如,即使最终得到20个特征,也需要检查每对特征之间的相关性,如果它们高度相关,则删除一个
以下函数将为您提供最高的相关特性。您可以使用此输出进一步减少当前变量列表:
其他选择可以是:
正如我提到的,这实际上取决于你想要实现什么
相关问题 更多 >
编程相关推荐