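Algorithms and functions for feature extraction, processing and interpretation for machine learning and data science.
feature_stuff: a Python machine learning library for advanced feature extraction, processing and interpretation.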
Latest Release | see on pypi.org |
Package Status | see on pypi.org |
License | see on github |
Build Status | see on travis |
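What is it
feature_stuff is a Python package providing fast and flexible algorithms and functions for extracting, processing and interpreting features:
Numeric feature extraction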
feature_stuff.add_interactions | generic function for adding interaction features to a data frame either by passing them as a list or by passing a boosted trees model to extract the interactions from. |
feature_stuff.target_encoding | target encoding of a feature column using exponential prior smoothing or mean prior smoothing |
feature_stuff.cv_target_encoding | target encoding of a feature column taking cross-validation folds as input |
feature_stuff.add_knn_values | creates a new feature with the K-nearest-neighbours of the values of a given feature |
feature_stuff.model_features_insights_extractions.add_group_values | generic and memory efficient enrichment of features dataframe with group values |
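Model feature insights extraction
feature_stuff.model_features_insights_extractions.get_xgboost_interactions | takes a trained xgboost model and returns a list of interactions between features, to the order of maximum depth of all trees. |
Installation
Binary installers for the latest released version are available at the Python package index:
# or PyPI
pip install feature_stuff
The source code is currently hosted on GitHub at: https://github.com/hiflyin/Feature-Stuff
Installation from sources
In the Feature-Stuff directory (the one created by cloning the git repo), execute:
python setup.py install
or, to install in development mode:
python setup.py develop
Alternatively, use pip if you want all the dependencies pulled in automatically (the -e option installs it in development mode):
pip install -e .
How to use it
Below are examples for some of the functions. For full documentation, see the API attached to each function/algorithm.
feature_stuff.add_interactions
Extracts interactions from a tree-based model and adds them to the dataset as new features.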
Inputs:
df: a pandas dataframe
model: boosted trees model (currently only xgboost is supported). Can be None, in which case the interactions must be provided
interactions: list in which each element is a list of features/columns in df, default: None
Output: df with the interaction features added to it
import feature_stuff as fs
import pandas as pd
import xgboost as xgb
data = pd.DataFrame({"x0": [0, 1, 0, 1], "x1": range(4), "x2": [1, 0, 1, 0]})
print(data)
   x0  x1  x2
0   0   0   1
1   1   1   0
2   0   2   1
3   1   3   0
target = data.x0 * data.x1 + data.x2 * data.x1
print(target.tolist())
[0, 1, 2, 3]
model = xgb.train({'max_depth': 4, "seed": 123}, xgb.DMatrix(data, label=target), num_boost_round=2)
fs.add_interactions(data, model)
# at least one of the interactions in target must have been discovered by xgboost
print(data)
   x0  x1  x2  inter_0
0   0   0   1        0
1   1   1   0        1
2   0   2   1        0
3   1   3   0        3
# if we want to inspect the interactions extracted
from feature_stuff import model_features_insights_extractions as insights
print(insights.get_xgboost_interactions(model))
[['x0', 'x1']]
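In the output above, the added column inter_0 equals the element-wise product x0 * x1. The following is a minimal sketch of that idea in plain pandas, assuming an interaction feature is the product of the listed columns. The helper name add_product_interactions is hypothetical and is not part of feature_stuff, whose internal implementation is not shown in this document.

```python
import pandas as pd


def add_product_interactions(df, interactions):
    """Return a copy of df with one product column per interaction.

    Hypothetical helper for illustration; not the feature_stuff implementation.
    """
    out = df.copy()
    for i, cols in enumerate(interactions):
        # Multiply the listed columns element-wise, row by row.
        out["inter_%d" % i] = out[list(cols)].prod(axis=1)
    return out


data = pd.DataFrame({"x0": [0, 1, 0, 1], "x1": range(4), "x2": [1, 0, 1, 0]})
result = add_product_interactions(data, [["x0", "x1"]])
print(result["inter_0"].tolist())
```

Unlike the library call in the example, this sketch returns a new data frame and leaves its input untouched, which makes it easier to compare the data before and after.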
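feature_stuff.target_encoding
An example of extracting target encodings from categorical features and adding them to the dataset as new features.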
Inputs:
df: a pandas dataframe containing the column for which to calculate target encoding (categ_col)
ref_df: a pandas dataframe containing the column for which to calculate target encoding and the target (y_col)
for example we might want to use train data as ref_df to encode test data
categ_col: the name of the categorical column for which to calculate target encoding
y_col: the name of the target column, or target variable to predict
smoothing_func: the name of the function to be used for calculating the weights of the corresponding target
value inside ref_df. Default: exponentialPriorSmoothing.
aggr_func: the statistic used to aggregate the target variable values inside each category of the categ_col
smoothing_prior_weight: a prior weight to put on each category. Default 1.
Output: df containing a new column called <categ_col + "_bayes_" + aggr_func> containing the encodings of categ_col
import feature_stuff as fs
import pandas as pd
train_data = pd.DataFrame({"x0":[0,1,0,1]})
test_data = pd.DataFrame({"x0":[1, 0, 0, 1]})
target = range(4)
train_data = fs.target_encoding(train_data, train_data, "x0", target, smoothing_func=fs.exponentialPriorSmoothing,
aggr_func="mean", smoothing_prior_weight=1)
test_data = fs.target_encoding(test_data, train_data, "x0", target, smoothing_func=fs.exponentialPriorSmoothing,
aggr_func="mean", smoothing_prior_weight=1)
#train data with target encoding of "x0"
print(train_data)
x0 y_xx g_xx x0_bayes_mean
0 0 0 0 1.134471
1 1 1 0 1.865529
2 0 2 0 1.134471
3 1 3 0 1.865529
#test data with target encoding of "x0"
print(test_data)
x0 x0_bayes_mean
0 1 1.865529
1 0 1.134471
2 0 1.134471
3 1 1.865529
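The encodings above are consistent with a smoothed blend of each category's target mean and the global target mean. The sketch below, in plain pandas and NumPy, assumes a sigmoid weight 1 / (1 + exp(-(n - prior_weight))), where n is the number of rows in the category. Both this formula and the helper name smoothed_mean_encoding are assumptions made for illustration; they are not the documented definition of exponentialPriorSmoothing. For this small data set the sketch yields the same numbers as the example.

```python
import numpy as np
import pandas as pd


def smoothed_mean_encoding(df, ref_df, categ_col, y_col, prior_weight=1.0):
    """Encode df[categ_col] using target statistics computed on ref_df.

    Hypothetical helper; the smoothing formula is an assumption.
    """
    global_mean = ref_df[y_col].mean()
    stats = ref_df.groupby(categ_col)[y_col].agg(["mean", "count"])
    # Categories with more rows get a weight closer to 1 (trust their own mean).
    weight = 1.0 / (1.0 + np.exp(-(stats["count"] - prior_weight)))
    encoding = weight * stats["mean"] + (1.0 - weight) * global_mean
    # Categories unseen in ref_df fall back to the global mean.
    return df[categ_col].map(encoding).fillna(global_mean)


train = pd.DataFrame({"x0": [0, 1, 0, 1], "y": [0, 1, 2, 3]})
test = pd.DataFrame({"x0": [1, 0, 0, 1]})
train_enc = smoothed_mean_encoding(train, train, "x0", "y")
test_enc = smoothed_mean_encoding(test, train, "x0", "y")
print(test_enc.round(6).tolist())
```

Computing the statistics on ref_df and applying them to df mirrors the train/test usage in the example: the test rows are encoded only from training targets.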
feature_stuff.cv_target_encoding
For a usage example, see the feature_stuff.target_encoding example above.
Inputs:
df: a pandas dataframe containing the column for which to calculate target encoding (categ_col) and the target
categ_cols: a list or array with the names of the categorical columns for which to calculate target encoding
y_col: a numpy array of the target variable to predict
cv_folds: a list with fold pairs as tuples of numpy arrays for cross-val target encoding
smoothing_func: the name of the function to be used for calculating the weights of the corresponding target
value inside ref_df. Default: exponentialPriorSmoothing.
aggr_func: the statistic used to aggregate the target variable values inside each category of the categ_col
smoothing_prior_weight: a prior weight to put on each category. Default 1.
verbosity: 0-none, 1-high_level, 2-detailed
Output: df containing a new column called <categ_col + "_bayes_" + aggr_func> containing the encodings of categ_col
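The cv_folds input described above can be pictured as a list of (train_indices, validation_indices) pairs. The sketch below builds such pairs with NumPy and computes an out-of-fold mean encoding, where each row is encoded only from rows outside its own validation fold, so its own target value never leaks into its encoding. The helpers make_folds and out_of_fold_mean_encoding are hypothetical names, the encoding is left unsmoothed for brevity, and none of this is taken from the feature_stuff source.

```python
import numpy as np
import pandas as pd


def make_folds(n_rows, n_folds, seed=0):
    """Build a list of (train_idx, val_idx) tuples of numpy arrays."""
    rng = np.random.RandomState(seed)
    parts = np.array_split(rng.permutation(n_rows), n_folds)
    folds = []
    for i, val_idx in enumerate(parts):
        train_idx = np.concatenate([p for j, p in enumerate(parts) if j != i])
        folds.append((train_idx, val_idx))
    return folds


def out_of_fold_mean_encoding(df, categ_col, y, cv_folds):
    """Encode each validation fold with category means from its training fold."""
    y = np.asarray(y, dtype=float)
    encoded = np.full(len(df), np.nan)
    categories = df[categ_col].to_numpy()
    for train_idx, val_idx in cv_folds:
        train_means = pd.Series(y[train_idx]).groupby(categories[train_idx]).mean()
        fallback = y[train_idx].mean()
        # Categories absent from the training fold fall back to its overall mean.
        encoded[val_idx] = (
            pd.Series(categories[val_idx]).map(train_means).fillna(fallback).to_numpy()
        )
    return encoded


df = pd.DataFrame({"x0": [0, 1, 0, 1, 0, 1]})
y = np.arange(6)
folds = make_folds(len(df), n_folds=3)
df["x0_oof_mean"] = out_of_fold_mean_encoding(df, "x0", y, folds)
print(df)
```

With explicit folds the effect is easy to check by hand: if rows 0 and 1 form one validation fold and rows 2 and 3 the other, each pair is encoded purely from the targets of the opposite pair.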
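Contributing to feature_stuff
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.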