Feature extraction, processing and interpretation algorithms and functions for machine learning and data science.

feature_stuff: a python machine learning library for advanced feature extraction, processing and interpretation.

Latest Release: see on pypi.org
Package Status: see on pypi.org
License: see on github
Build Status: see on travis

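What is it

feature_stuff is a Python package providing fast and flexible algorithms and functions for extracting, processing and interpreting features:

Numeric feature extraction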

feature_stuff.add_interactions generic function for adding interaction features to a data frame either by passing them as a list or by passing a boosted trees model to extract the interactions from.
feature_stuff.target_encoding target encoding of a feature column using exponential prior smoothing or mean prior smoothing
feature_stuff.cv_target_encoding target encoding of a feature column taking cross-validation folds as input
feature_stuff.add_knn_values creates a new feature with the K-nearest-neighbours of the values of a given feature
feature_stuff.model_features_insights_extractions.add_group_values generic and memory efficient enrichment of features dataframe with group values

Model feature insights extraction

get_xgboost_interactions takes a trained xgboost model and returns a list of interactions between features, to the order of maximum depth of all trees.

Installation

Binary installers for the latest released version are available at the Python package index.

# or PyPI
pip install feature_stuff

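The source code is currently hosted on GitHub at: https://github.com/hiflyin/Feature-Stuff

Installation from sources

In the Feature-Stuff directory (the one you get after cloning the git repo), execute: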

python setup.py install

or, to install in development mode:

python setup.py develop

Alternatively, you can use pip if you want all the dependencies pulled in automatically (the -e option installs it in development mode):

pip install -e .

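How to use it

Below are examples for some of the functions. For complete documentation, see the API attached to each function/algorithm.

feature_stuff.add_interactions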

Inputs:
    df: a pandas dataframe
    model: boosted trees model (currently xgboost supported only). Can be None in which case the interactions have
    to be provided
    interactions: list in which each element is a list of features/columns in df, default: None

Output: df with the interaction features added to it

Example on extracting interactions from tree based models and adding them as new features to your dataset.

import feature_stuff as fs
import pandas as pd
import xgboost as xgb

data = pd.DataFrame({"x0":[0,1,0,1], "x1":range(4), "x2":[1,0,1,0]})
print(data)
   x0  x1  x2
0   0   0   1
1   1   1   0
2   0   2   1
3   1   3   0

target = data.x0 * data.x1 + data.x2 * data.x1
print(target.tolist())
[0, 1, 2, 3]

model = xgb.train({'max_depth':4, "seed":123}, xgb.DMatrix(data, label=target), num_boost_round=2)
fs.addInteractions(data, model)

# at least one of the interactions in target must have been discovered by xgboost
print(data)
   x0  x1  x2  inter_0
0   0   0   1        0
1   1   1   0        1
2   0   2   1        0
3   1   3   0        3

# if we want to inspect the interactions extracted
from feature_stuff import model_features_insights_extractions as insights
print(insights.get_xgboost_interactions(model))
[['x0', 'x1']]
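
The inter_0 values in the example (0, 1, 0, 3) equal x0 * x1 row by row. The following is a minimal pandas-only sketch of that idea for the case where interactions are passed as a list rather than extracted from a model. It assumes an interaction feature is the product of its columns, which the printed output suggests but the documentation does not state; add_product_interactions is a hypothetical helper, not part of feature_stuff.

```python
import pandas as pd


def add_product_interactions(df, interactions):
    """Return a copy of df with one product column per interaction.

    Hypothetical helper for illustration only; it assumes an
    interaction is the row-wise product of the listed columns.
    """
    out = df.copy()
    for i, cols in enumerate(interactions):
        # Multiply the listed columns together, row by row.
        out["inter_%d" % i] = out[list(cols)].prod(axis=1)
    return out


data = pd.DataFrame({"x0": [0, 1, 0, 1], "x1": range(4), "x2": [1, 0, 1, 0]})
enriched = add_product_interactions(data, [["x0", "x1"]])
print(enriched)
```

Returning a copy rather than modifying df in place is a choice made for this sketch; the library call in the example appears to modify data directly.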

feature_stuff.target_encoding

Inputs:
    df: a pandas dataframe containing the column for which to calculate target encoding (categ_col)
    ref_df: a pandas dataframe containing the column for which to calculate target encoding and the target (y_col)
        for example we might want to use train data as ref_df to encode test data
    categ_col: the name of the categorical column for which to calculate target encoding
    y_col: the name of the target column, or target variable to predict
    smoothing_func: the name of the function to be used for calculating the weights of the corresponding target
        value inside ref_df. Default: exponentialPriorSmoothing.
    aggr_func: the statistic used to aggregate the target variable values inside each category of the categ_col
    smoothing_prior_weight: a prior weight to put on each category. Default 1.

Output: df containing a new column called <categ_col + "_bayes_" + aggr_func> containing the encodings of categ_col

Example on extracting target encodings from categorical features and adding them as new features to your dataset.

import feature_stuff as fs
import pandas as pd

train_data = pd.DataFrame({"x0":[0,1,0,1]})
test_data = pd.DataFrame({"x0":[1, 0, 0, 1]})
target = range(4)

train_data = fs.target_encoding(train_data, train_data, "x0", target, smoothing_func=fs.exponentialPriorSmoothing,
                                        aggr_func="mean", smoothing_prior_weight=1)
test_data = fs.target_encoding(test_data, train_data, "x0", target, smoothing_func=fs.exponentialPriorSmoothing,
                                        aggr_func="mean", smoothing_prior_weight=1)

#train data with target encoding of "x0"
print(train_data)
   x0  y_xx  g_xx  x0_bayes_mean
0   0     0     0       1.134471
1   1     1     0       1.865529
2   0     2     0       1.134471
3   1     3     0       1.865529

#test data with target encoding of "x0"
print(test_data)
   x0  x0_bayes_mean
0   1       1.865529
1   0       1.134471
2   0       1.134471
3   1       1.865529
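
To make the printed numbers easier to interpret, here is a small pandas sketch of smoothed mean target encoding. It assumes the category mean and the global mean are blended with a sigmoid weight 1 / (1 + exp(-(n - prior_weight))), where n is the category count. That form reproduces the values 1.134471 and 1.865529 shown above, but it is an assumption and may not be what exponentialPriorSmoothing actually computes. smoothed_target_encoding is a hypothetical name.

```python
import numpy as np
import pandas as pd


def smoothed_target_encoding(df, ref_df, categ_col, y, prior_weight=1.0):
    """Hypothetical sketch of smoothed mean target encoding (not library code)."""
    y = pd.Series(list(y), index=ref_df.index)
    prior = y.mean()  # global mean of the target in the reference data
    stats = y.groupby(ref_df[categ_col]).agg(["mean", "count"])
    # Assumed weight: categories with more rows trust their own mean more.
    weight = 1.0 / (1.0 + np.exp(-(stats["count"] - prior_weight)))
    encoding = weight * stats["mean"] + (1.0 - weight) * prior
    # Categories never seen in ref_df fall back to the global mean.
    return df[categ_col].map(encoding).fillna(prior)


train = pd.DataFrame({"x0": [0, 1, 0, 1]})
test = pd.DataFrame({"x0": [1, 0, 0, 1]})
target = range(4)

train_enc = smoothed_target_encoding(train, train, "x0", target)
test_enc = smoothed_target_encoding(test, train, "x0", target)
print(train_enc.round(6).tolist())
print(test_enc.round(6).tolist())
```

With two rows per category and a prior weight of 1, the weight is sigmoid(1), roughly 0.731, so each encoding sits about 73% of the way from the global mean (1.5) to the category mean (1 or 2).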


feature_stuff.cv_target_encoding

Inputs:
    df: a pandas dataframe containing the column for which to calculate target encoding (categ_col) and the target
    categ_cols: a list or array with the names of the categorical columns for which to calculate target encoding
    y_col: a numpy array of the target variable to predict
    cv_folds: a list with fold pairs as tuples of numpy arrays for cross-val target encoding
    smoothing_func: the name of the function to be used for calculating the weights of the corresponding target
        value inside ref_df. Default: exponentialPriorSmoothing.
    aggr_func: the statistic used to aggregate the target variable values inside each category of the categ_col
    smoothing_prior_weight: a prior weight to put on each category. Default 1.
    verbosity: 0-none, 1-high_level, 2-detailed

Output: df containing a new column called <categ_col + "_bayes_" + aggr_func> containing the encodings of categ_col

See the feature_stuff.target_encoding example above.
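
As a rough illustration of the cross-validation variant, the sketch below encodes each row using target statistics from the other fold only, so a row's own target never leaks into its encoding. It assumes each entry of cv_folds is a pair (indices used to compute the statistics, indices to encode); the source does not specify the order. Smoothing is left out for brevity, and cv_mean_encoding is a hypothetical helper, not the library function.

```python
import numpy as np
import pandas as pd


def cv_mean_encoding(df, categ_col, y, cv_folds):
    """Hypothetical out-of-fold mean target encoding (no smoothing)."""
    y = np.asarray(y, dtype=float)
    categories = df[categ_col].values
    encoded = np.full(len(df), np.nan)
    for fit_idx, apply_idx in cv_folds:
        # Category means computed only on the fitting part of the fold.
        fit_means = pd.Series(y[fit_idx]).groupby(categories[fit_idx]).mean()
        values = df[categ_col].iloc[apply_idx].map(fit_means)
        # Categories absent from the fitting part fall back to its overall mean.
        encoded[apply_idx] = values.fillna(y[fit_idx].mean()).values
    return encoded


df = pd.DataFrame({"x0": [0, 1, 0, 1]})
target = np.arange(4)
folds = [
    (np.array([2, 3]), np.array([0, 1])),
    (np.array([0, 1]), np.array([2, 3])),
]
cv_encoded = cv_mean_encoding(df, "x0", target, folds)
print(cv_encoded)
```

In practice the fold pairs could come from any splitter that yields index arrays; they are written out by hand here to keep the example free of extra dependencies.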

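Contributing to feature-stuff

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.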
