Python AutoMunge-pkg package - PyPI

A tool for automated data wrangling

Detailed description of the AutoMunge-pkg Python project

Automunge package

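Automunge is a tool for automating the preparation of structured (tabular) data prior to the application of machine learning. The automunge(.) function takes as input structured training data intended to train a machine learning model, with any corresponding labels included in the set if available, and also, if available, consistently formatted test data that can then be used to generate predictions from that trained model. When fed pandas dataframes or numpy arrays for these sets, the function returns a series of transformed numpy arrays or pandas dataframes (per selection) which are numerically encoded and suitable for the direct application of machine learning algorithms.

A user can choose between default feature engineering based on inferred properties of the data, with feature transformations such as z-score normalization, standard deviation bins for numerical sets, box-cox power law transform for all-positive numerical sets, one-hot encoding for categorical sets, and more (full documentation below); assigning specific column feature engineering methods using a built-in library of feature engineering transformations; or alternatively passing user-defined custom transformation functions incorporating simple data structures, while still making use of all of the built-in features of the tool (such as ML infill, feature importance, dimensionality reduction, and most importantly consistent processing of subsequently available data with just a single function call of the postmunge(.) function).

Missing data points in the sets can also be addressed, either by assigning distinct methods to each column or by the automated "ML infill" method, which predicts infill using machine learning models in a fully generalized and automated fashion. automunge(.) returns a python dictionary which can be used as input, along with a subsequent test data set, to the postmunge(.) function for consistent processing of test data that was not available at the initial address.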

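In addition to its use for feature engineering transformations, automunge(.) can also serve an evaluatory purpose by way of a feature importance evaluation, through the derivation of two metrics which provide an indication of the importance of original and derived features towards the accuracy of a predictive model.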

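If elected, a user can also use the tool to perform dimensionality reduction via principal component analysis (a type of entity embedding via unsupervised learning) of the data sets in the automunge(.) function, and then consistently for subsequent data with the postmunge(.) function.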

Automunge is now available as a free pip install for your open source python data wrangling:

pip install AutoMunge-pkg

Once installed, run this in a local session to initialize:

from AutoMunge_pkg import AutoMunge
am = AutoMunge.AutoMunge()

Where, e.g., for train/test set processing run:

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
testlabelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train, df_test, etc)

Or for subsequent consistent processing of test data, using the dictionary returned from the original application of automunge(.), run:

test, testID, testlabels, \
labelsencoding_dict, finalcolumns_test = \
am.postmunge(postprocess_dict, df_test)

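I find it helpful to pass these functions with the full range of arguments included for reference, so a user may simply copy and paste this form.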

#for automunge(.) function on original train and test data

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(df_train, df_test = False, labels_column = False, trainID_column = False, \
            testID_column = False, valpercent1=0.0, valpercent2 = 0.0, \
            shuffletrain = False, TrainLabelFreqLevel = False, powertransform = False, \
            binstransform = False, MLinfill = False, infilliterate=1, randomseed = 42, \
            numbercategoryheuristic = 15, pandasoutput = True, NArw_marker = True, \
            featureselection = False, featurepct = 1.0, featuremetric = .02, \
            featuremethod = 'pct', PCAn_components = None, PCAexcl = [], \
            ML_cmnd = {'MLinfill_type':'default', \
                       'MLinfill_cmnd':{'RandomForestClassifier':{}, 'RandomForestRegressor':{}}, \
                       'PCA_type':'default', \
                       'PCA_cmnd':{}}, \
            assigncat = {'mnmx':[], 'mnm2':[], 'mnm3':[], 'mnm4':[], 'mnm5':[], 'mnm6':[], \
                         'nmbr':[], 'nbr2':[], 'nbr3':[], 'MADn':[], 'MAD2':[], 'MAD3':[], \
                         'bins':[], 'bint':[], \
                         'bxcx':[], 'bxc2':[], 'bxc3':[], 'bxc4':[], \
                         'log0':[], 'log1':[], 'pwrs':[], \
                         'bnry':[], 'text':[], 'ordl':[], 'ord2':[], \
                         'date':[], 'dat2':[], 'wkdy':[], 'bshr':[], 'hldy':[], \
                         'excl':[], 'exc2':[], 'exc3':[], 'null':[], 'eval':[]}, \
            assigninfill = {'stdrdinfill':[], 'MLinfill':[], 'zeroinfill':[], 'oneinfill':[], \
                            'adjinfill':[], 'meaninfill':[], 'medianinfill':[]}, \
            transformdict = {}, processdict = {}, \
            printstatus = True)

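Please remember to save the postprocess_dict object returned by automunge(.), for example using the pickle library, so that it can later be passed to the postmunge(.) function to consistently process subsequently available data.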
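As a minimal sketch of that save/reload step, assuming the returned dictionary can be pickled as the text suggests: the stand-in dictionary contents and the file name below are hypothetical and only illustrate the round trip.

```python
import os
import pickle
import tempfile

# Stand-in for the dictionary returned by automunge(.); the real object
# holds normalization parameters and trained models.
postprocess_dict = {"example_key": "example_value"}

# Hypothetical file location chosen for illustration.
path = os.path.join(tempfile.gettempdir(), "postprocess_dict.pickle")

# Save right after the automunge(.) call ...
with open(path, "wb") as f:
    pickle.dump(postprocess_dict, f)

# ... and reload in a later session before calling postmunge(.).
with open(path, "rb") as f:
    restored_dict = pickle.load(f)
```

The reloaded object would then be passed as the first argument to am.postmunge(.).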

#for postmunge(.) function on subsequently available test data
#using the postprocess_dict object returned from original automunge(.) application

test, testID, testlabels, \
labelsencoding_dict, finalcolumns_test = \
am.postmunge(postprocess_dict, df_test, testID_column = False, \
             labelscolumn = False, pandasoutput=True, printstatus = True, \
             TrainLabelFreqLevel = False, featureeval = False)

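The functions depend on train and test data formatted as pandas dataframes, or as numpy arrays with a consistent order of columns. The functions return numpy arrays or pandas dataframes that are numerically encoded and normalized so as to be suitable for direct application to a machine learning model in the framework of the user's choice, including sets for the various activities of a generic machine learning project such as training, hyperparameter tuning validation (validation1), final validation (validation2), or data intended for generating predictions from the trained model (test set). The functions also return a few other sets such as labels, column headers, ID sets, etc. if elected - a full list of returned arrays is below.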

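When left to automation, the function works by inferring a category of data from the properties of each column in order to select the type of processing function to apply, for example whether a column is a numerical, categorical, binary, or time-series set. Alternatively, a user can pass column header IDs to assign specific processing functions to distinct columns - these processing functions may be pulled from the internal library of transformations or be user defined. Normalization parameters from the initial automunge application are saved to a returned dictionary so that test data that was not available at the initial address can subsequently be processed consistently with the postmunge(.) function.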

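The feature engineering transformations are recorded with a series of suffixes appended to the column header titles in the returned sets; for example, the application of z-score normalization returns a column with header origname + '_nmbr'. The function allows feature engineering methods to be assigned to training data, test data, and any column designated for labels if included with the sets.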
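To illustrate the suffix convention, here is a small sketch that splits a returned header back into a source name and its last suffix. The helper name split_suffix is hypothetical, and it assumes the suffix is whatever follows the final underscore; headers that went through several chained transforms carry several suffixes, and this only strips the last one.

```python
def split_suffix(column_header):
    """Split e.g. 'age_nmbr' into ('age', 'nmbr').

    Assumes the transformation suffix follows the final underscore.
    Headers without an underscore are returned with an empty suffix.
    """
    origname, separator, suffix = column_header.rpartition("_")
    if not separator:
        return column_header, ""
    return origname, suffix


# Source column names may themselves contain underscores.
print(split_suffix("age_nmbr"))
print(split_suffix("zip_code_text"))
```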

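In automation, for numerical data, the functions generate a series of derived transformations which result in multiple child columns. For numerical data, if the powertransform option is selected, distribution properties are evaluated for potential application of z-score normalization, min-max scaling, power law transform via the box-cox method, or mean absolute deviation scaling. Otherwise numerical data defaults to z-score normalization, with an option for standard deviation bins for values in the ranges <-2, -2 to -1, -1 to 0, 0 to 1, 1 to 2, and >2 standard deviations from the mean. For numerical sets with all positive values the functions can also optionally return a power-law transformed set using the box-cox method, along with a corresponding set with z-score normalization applied. For time-series data the model segregates the data by time scale (year, month, day, hour, minute, second) and returns the sets with z-score normalization applied. For binary categorical data the functions return a single column with 1/0 designation. For multimodal categorical data the functions return one-hot encoded sets using the naming convention origname + '_' + category. (I believe the one-hot encoding method is …) For all cases the functions generate a supplemental column (NArw) with a boolean identifier for cells that were subject to infill due to missing or improperly formatted data. (Please note that I don't consider the current methods of numerical set distribution evaluation highly sophisticated, and there is some work to do here.)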
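The z-score and standard deviation bin steps described above can be sketched in plain Python as follows. This is an illustration only, not the library's implementation: the function names are hypothetical, the population standard deviation is an assumption, and so is the treatment of values that fall exactly on a bin edge.

```python
import statistics

# Bin labels follow the ranges listed in the text.
BIN_LABELS = ["<-2", "-2 to -1", "-1 to 0", "0 to 1", "1 to 2", ">2"]
BIN_EDGES = [-2, -1, 0, 1, 2]


def zscore(values):
    """Normalize values to zero mean and unit standard deviation.

    Assumes a non-constant column (a zero deviation would divide by zero).
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population std deviation
    return [(v - mean) / std for v in values]


def std_bin(z):
    """Return the label of the bin holding a z-score (upper edge exclusive)."""
    for label, edge in zip(BIN_LABELS, BIN_EDGES):
        if z < edge:
            return label
    return BIN_LABELS[-1]


def one_hot_bins(values):
    """Return one row of boolean (0/1) bin indicators per input value."""
    rows = []
    for z in zscore(values):
        label = std_bin(z)
        rows.append([1 if label == name else 0 for name in BIN_LABELS])
    return rows


bins = one_hot_bins([1, 2, 3, 4, 5])
```

Each returned row has exactly one active indicator, which matches the idea of child columns with boolean identifiers per bin.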

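The functions also include a method we call "ML infill" which, if elected, predicts infill for missing values using machine learning models in a generalized and automated fashion. ML infill works by initially applying infill using traditional methods, such as the mean for a numerical set, the most common value for a binary set, and a boolean identifier for categorical sets. The functions then generate a column-specific set of training data, labels, and feature sets for the derivation of infill. The column's trained model is included in the returned dictionary so that the same model can be applied in the postmunge function. Alternatively, a user can pass column headers to assign different infill methods to distinct columns.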

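The automunge(.) function also includes a method for feature importance evaluation, in which metrics are derived to measure the impact on predictive accuracy of original source columns, as well as the relative importance of derived columns, using a permutation importance method. The permutation importance method was inspired by a fast.ai lecture, and more information can be found in the paper "Beware Default Random Forest Importances" by Terence Parr, Kerem Turgutlu, Christopher Csiszar, and Jeremy Howard. This method currently makes use of Scikit-Learn's Random Forest predictors.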
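The general idea of permutation importance can be sketched as below: shuffle one column, and measure how much a trained model's accuracy drops. This is a generic illustration with a toy threshold "model" standing in for a random forest; it is not how the library derives its two metrics, and all names here are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    correct = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return correct / len(labels)


def permutation_importance(predict, rows, labels, column, seed=42):
    """Accuracy drop after shuffling the values of a single column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    shuffled_values = [row[column] for row in rows]
    rng.shuffle(shuffled_values)
    permuted_rows = []
    for row, value in zip(rows, shuffled_values):
        new_row = list(row)
        new_row[column] = value
        permuted_rows.append(new_row)
    return baseline - accuracy(predict, permuted_rows, labels)


# Toy model that only looks at column 0; column 1 is pure noise.
def toy_predict(row):
    return int(row[0] > 0.5)


rows = [[0.9, 3], [0.1, 7], [0.8, 1], [0.2, 5]]
labels = [1, 0, 1, 0]

importance_informative = permutation_importance(toy_predict, rows, labels, column=0)
importance_noise = permutation_importance(toy_predict, rows, labels, column=1)
```

A column the model ignores shows no accuracy drop, while shuffling a column the model relies on can only hurt or leave accuracy unchanged.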

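The function also includes a method we call "TrainLabelFreqLevel" which, if elected, applies multiples of the feature sets associated with each label category in the returned training data, so as to enable oversampling of those labels which may be underrepresented in the training data. This method is available for categorical labels, and also for numerical labels when the label processing includes standard deviation bins. This method is expected to improve downstream model accuracy for training data with an uneven distribution of labels. For more on the class imbalance problem see "A systematic study of the class imbalance problem in convolutional neural networks" - Buda, Maki, Mazurowski.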
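A rough sketch of this kind of oversampling follows, assuming each row is simply repeated a whole number of times so that label frequencies become approximately level. The function name and the rounding rule are assumptions for illustration, not the library's exact procedure.

```python
from collections import Counter


def levelize_label_frequency(rows, labels):
    """Repeat rows of underrepresented labels to roughly level frequencies."""
    counts = Counter(labels)
    max_count = max(counts.values())
    out_rows, out_labels = [], []
    for row, label in zip(rows, labels):
        # Assumption: integer multiple nearest to the frequency ratio.
        multiple = max(1, round(max_count / counts[label]))
        out_rows.extend([row] * multiple)
        out_labels.extend([label] * multiple)
    return out_rows, out_labels


rows = list(range(8))
labels = ["a"] * 6 + ["b"] * 2
new_rows, new_labels = levelize_label_frequency(rows, labels)
```

Because only whole multiples are added, the result is levelized approximately rather than exactly when the frequency ratio is not an integer.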

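The function can also perform dimensionality reduction of the sets via principal component analysis (PCA). The function automatically performs the transformation when the number of features is more than 50% of the number of observations in the train set (this is a somewhat arbitrary heuristic). Alternatively, the user can pass a desired number of features along with their preference of type and parameters between linear PCA, Sparse PCA, or Kernel PCA - all currently implemented in Scikit-Learn.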

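Applying the automunge and postmunge functions requires assigning the function output to a series of named sets. We suggest using a consistent naming convention as follows: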

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train, ...)

The full set of arguments available to be passed is given here, with explanations provided below:

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(df_train, df_test = False, labels_column = False, trainID_column = False, \
            testID_column = False, valpercent1=0.0, valpercent2 = 0.0, \
            shuffletrain = False, TrainLabelFreqLevel = False, powertransform = False, \
            binstransform = False, MLinfill = False, infilliterate=1, randomseed = 42, \
            numbercategoryheuristic = 15, pandasoutput = True, NArw_marker = True, \
            featureselection = False, featurepct = 1.0, featuremetric = .02, \
            featuremethod = 'pct', PCAn_components = None, PCAexcl = [], \
            ML_cmnd = {'MLinfill_type':'default', \
                       'MLinfill_cmnd':{'RandomForestClassifier':{}, 'RandomForestRegressor':{}}, \
                       'PCA_type':'default', \
                       'PCA_cmnd':{}}, \
            assigncat = {'mnmx':[], 'mnm2':[], 'mnm3':[], 'mnm4':[], 'mnm5':[], 'mnm6':[], \
                         'nmbr':[], 'nbr2':[], 'nbr3':[], 'MADn':[], 'MAD2':[], 'MAD3':[], \
                         'bins':[], 'bint':[], \
                         'bxcx':[], 'bxc2':[], 'bxc3':[], 'bxc4':[], \
                         'log0':[], 'log1':[], 'pwrs':[], \
                         'bnry':[], 'text':[], 'ordl':[], 'ord2':[], \
                         'date':[], 'dat2':[], 'wkdy':[], 'bshr':[], 'hldy':[], \
                         'excl':[], 'exc2':[], 'exc3':[], 'null':[], 'eval':[]}, \
            assigninfill = {'stdrdinfill':[], 'MLinfill':[], 'zeroinfill':[], 'oneinfill':[], \
                            'adjinfill':[], 'meaninfill':[], 'medianinfill':[]}, \
            transformdict = {}, processdict = {}, \
            printstatus = True)

Or for the postmunge function:

#for postmunge(.) function on subsequently available test data
#using the postprocess_dict object returned from original automunge(.) application

test, testID, testlabels, \
labelsencoding_dict, finalcolumns_test = \
am.postmunge(postprocess_dict, df_test)

With the full set of arguments available to be passed as:

am.postmunge(postprocess_dict, df_test, testID_column = False, \
             labelscolumn = False, pandasoutput=True, printstatus = True, \
             TrainLabelFreqLevel = False)

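Note that the only required argument to the automunge function is the train set dataframe; the other arguments all have default values if nothing is passed. The postmunge function requires as a minimum the postprocess_dict object (a python dictionary returned from the application of automunge) and a dataframe test set formatted consistently with the sets that were originally applied to automunge.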

…

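Here now are descriptions of the sets returned from automunge, followed by descriptions of the arguments which can be passed to the function, followed by similar treatment for the postmunge returned sets and arguments.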

…

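automunge returned sets:

train: a numerically encoded set of data intended to be used to train a downstream machine learning model in the framework of the user's choice.
trainID: the set of ID values corresponding to the train set, if an ID column was passed to the function. This set may be useful if the shuffle option was applied.
labels: a set of numerically encoded labels corresponding to the train set, if a label column was passed. Note that the function assumes the label column is originally included in the train set. Note that if the labels set is a single column, a returned numpy array is flattened (e.g. [[1,2,3]] converted to [1,2,3]).
validation1: a set of training data carved out from the train set that is intended for use in hyperparameter tuning of a downstream model.
validationID1: the set of ID values corresponding to the validation1 set.
validationlabels1: the set of labels corresponding to the validation1 set.
validation2: a set of training data carved out from the train set that is intended for the final validation of a downstream model (this set should not be applied extensively for hyperparameter tuning).
validationID2: the set of ID values corresponding to the validation2 set.
validationlabels2: the set of labels corresponding to the validation2 set.
test: the set of features, processed consistently with the training data, that can be used to generate predictions from a downstream model trained with train. Note that if no test data is available during the initial address, this processing will take place in the postmunge(.) function.
testID: the set of ID values corresponding to the test set.
testlabels: a set of numerically encoded labels corresponding to the test set, if a label column was passed. Note that the function assumes the label column is originally included in the train set.
labelsencoding_dict: a dictionary that can be used to reverse encode predictions that were generated from a downstream model (such as to convert a one-hot encoded set back to a single categorical set).
finalcolumns_train: a list of the column headers corresponding to the training data. Note that the inclusion of suffix appenders is used to identify which feature engineering transformations were applied to each column.
finalcolumns_test: a list of the column headers corresponding to the test data. Note that the inclusion of suffix appenders is used to identify which feature engineering transformations were applied to each column. Note that this list should match the preceding one.
featureimportance: a dictionary containing a summary of feature importance ranking and metrics for each of the derived sets. Note that the metric value indicates the importance of the original source column, such that a larger value suggests greater importance, and the metric2 value indicates the relative importance of columns derived from the original source column, such that a smaller metric2 value suggests greater relative importance. One can print the values, for example with this code: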

#to inspect values returned in featureimportance object one could run
for keys,values in featureimportance.items():
    print(keys)
    print('metric = ', values['metric'])
    print('metric2 = ', values['metric2'])
    print()

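postprocess_dict: a returned python dictionary that includes the normalization parameters and trained machine learning models used to consistently process test data that was not available at the initial address of automunge. It is recommended that this dictionary be saved on each application used to train a downstream model, so that it may be passed to postmunge(.) to consistently process subsequently available test data.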

…

automunge(.) passed arguments

am.automunge(df_train, df_test = False, labels_column = False, trainID_column = False, \
            testID_column = False, valpercent1=0.0, valpercent2 = 0.0, \
            shuffletrain = False, TrainLabelFreqLevel = False, powertransform = False, \
            binstransform = False, MLinfill = False, infilliterate=1, randomseed = 42, \
            numbercategoryheuristic = 15, pandasoutput = True, NArw_marker = True, \
            featureselection = False, featurepct = 1.0, featuremetric = .02, \
            featuremethod = 'pct', PCAn_components = None, PCAexcl = [], \
            ML_cmnd = {'MLinfill_type':'default', \
                       'MLinfill_cmnd':{'RandomForestClassifier':{}, 'RandomForestRegressor':{}}, \
                       'PCA_type':'default', \
                       'PCA_cmnd':{}}, \
            assigncat = {'mnmx':[], 'mnm2':[], 'mnm3':[], 'mnm4':[], 'mnm5':[], 'mnm6':[], \
                         'nmbr':[], 'nbr2':[], 'nbr3':[], 'MADn':[], 'MAD2':[], 'MAD3':[], \
                         'bins':[], 'bint':[], \
                         'bxcx':[], 'bxc2':[], 'bxc3':[], 'bxc4':[], \
                         'log0':[], 'log1':[], 'pwrs':[], \
                         'bnry':[], 'text':[], 'ordl':[], 'ord2':[], \
                         'date':[], 'dat2':[], 'wkdy':[], 'bshr':[], 'hldy':[], \
                         'excl':[], 'exc2':[], 'exc3':[], 'null':[], 'eval':[]}, \
            assigninfill = {'stdrdinfill':[], 'MLinfill':[], 'zeroinfill':[], 'oneinfill':[], \
                            'adjinfill':[], 'meaninfill':[], 'medianinfill':[]}, \
            transformdict = {}, processdict = {}, \
            printstatus = True)

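df_train: a pandas dataframe or numpy array containing a structured dataset intended for use in subsequently training a machine learning model. The set should at a minimum be "tidy", meaning a single column per feature and a single row per observation. If desired the set may include a row ID column and a column intended to be used as labels for a downstream training operation. The tool supports the inclusion of a non-index-range column as index or a multicolumn index (requires named index columns). Such index types are added to the returned "ID" sets, which are shuffled and partitioned consistently with the train and test sets.
df_test: a pandas dataframe or numpy array containing a structured dataset intended for use in generating predictions from a downstream machine learning model trained on the automunge returned sets. The set must be formatted consistently with the train set, with consistent column labels and/or order of columns. (This set may optionally contain a labels column if one was included in the train set, although its inclusion is not required.) If desired the set may include a row ID column or a column intended for use as labels. If this set is not available, a user may pass False. The tool supports the inclusion of a non-index-range column as index or a multicolumn index (requires named index columns). Such index types are added to the returned "ID" sets, which are partitioned consistently with the train and test sets.
labels_column: the column title for the column from the df_train set intended for use as labels in training a downstream machine learning model. The function defaults to False for cases where the training set does not include a label column.
trainID_column: the column title for the column from the df_train set intended for use as a row identifier value (such as sequential numbers, for instance). The function defaults to False for cases where the training set does not include an ID column. A user can also pass a list of string column titles, such as to carve out multiple columns to be excluded from processing but consistently partitioned.
testID_column: the column title for the column from the df_test set intended for use as a row identifier value (such as sequential numbers, for instance). The function defaults to False for cases where the test set does not include an ID column. A user can also pass a list of string column titles, such as to carve out multiple columns to be excluded from processing but consistently partitioned.
valpercent1: a float value between 0 and 1 which designates the percent of the training data which will be set aside for the first validation set (generally used for hyperparameter tuning of a downstream model). This value defaults to 0. (Previously the default here was set at 0.20, but that is a fairly arbitrary value and a user may wish to deviate for sets of different size.) Note that this value may be left at 0 if no validation set is needed (such as may be the case for k-fold cross-validation).
valpercent2: a float value between 0 and 1 which designates the percent of the training data which will be set aside for the second validation set (generally used for final validation of a model prior to release). This value defaults to 0. (Previously the default was set at 0.10, but that is a fairly arbitrary value and a user may wish to deviate for sets of different size.)
shuffletrain: a boolean identifier (True/False) which indicates whether the rows in df_train will be shuffled prior to carving out the validation sets. Note that if this value is set to False then the validation sets will be pulled from the bottom x% sequential rows of the dataframe (where x% is the sum of the validation ratios). Note that if this value is set to False, although the validations will be pulled from sequential rows, the split between the validation1 and validation2 sets will be randomized. This value defaults to False.
TrainLabelFreqLevel: a boolean identifier (True/False) which indicates whether the TrainLabelFreqLevel method will be applied to oversample training data associated with underrepresented labels. The method adds multiples of the training data rows for those labels with lower frequency, resulting in an (approximately) levelized frequency. This defaults to False. Note that this feature may be applied to numerical label sets if the processing applied to the set includes standard deviation bins.
powertransform: a boolean identifier (True/False) which indicates whether an evaluation of distribution properties will be performed to select between box-cox, z-score, min-max scaling, or mean absolute deviation scaling normalization. Note that after application of a box-cox transform, child columns are generated for a subsequent z-score normalization as well as a set of bins associated with the number of standard deviations from the mean. Please note that I don't consider the current means of distribution property evaluation highly sophisticated, and we will continue to refine this method with further research going forward. This defaults to False.
binstransform: a boolean identifier (True/False) which indicates whether the numerical sets will receive bin processing, such as to generate child columns with boolean identifiers for the number of standard deviations from the mean, with groups for values <-2, -2 to -1, -1 to 0, 0 to 1, 1 to 2, and >2. Note that the bins and bint transformations are the same; the only difference is that the bint transform assumes the column has already been normalized while the bins transform does not. This value defaults to False.
MLinfill: a boolean identifier (True/False) which indicates whether the ML infill method will be applied by default to predict infill for missing or improperly formatted data, using machine learning models trained on the rest of the set. This defaults to False.
infilliterate: an integer indicating how many applications of the ML infill processing are to be performed for purposes of predicting infill. The assumption is that for sets with a high frequency of missing values, multiple applications of ML infill may improve accuracy, although note this is not an extensively tested hypothesis. This defaults to 1.
randomseed: a positive integer used as a seed for randomness in data set shuffling, ML infill, and feature importance algorithms. This defaults to 42, a nice round number.
forcetocategoricalcolumns: a list of string identifiers of column titles for those columns which are to be treated as categorical to allow one-hot encoding. This may be useful e.g. for numerically encoded categorical sets such as zip codes or phone numbers, which would otherwise be evaluated as numerical and subject to normalization. *Update: this item is no longer supported; a user can instead assign distinct methods to each column with assigncat per below, such as assigning a column to the categorical 'text' category.
numbercategoryheuristic: an integer used as a heuristic. When a categorical set has more unique values than this heuristic, it defaults to categorical treatment via ordinal processing. This defaults to 15.
pandasoutput: a selector for the format of returned sets. Defaults to False for returned numpy arrays. If set to True, returns pandas dataframes (note that the index is not preserved in the train/validation split; an ID column may be passed for index identification).
NArw_marker: a boolean identifier (True/False) which indicates whether the returned sets will include columns with markers for rows subject to infill (columns with suffix 'NArw'). This value defaults to True.
featureselection: a boolean identifier telling the function whether to perform a feature importance evaluation. If selected, automunge will return a summary of feature importance findings in the featureimportance returned dictionary. This also activates the trimming of derived sets if [featurepct < 1.0 and featuremethod = 'pct'] or if [featuremetric > 0.0 and featuremethod = 'metric']. Note this defaults to False because it cannot operate without a designated label column in the train set. Note that the user-specified size of the validation ratios, if passed, is used in this method.
featurepct: the percentage of derived sets that are kept in the output based on the feature importance evaluation. Note that NArw columns are excluded from the trimming for now (their inclusion may be added in a future expansion). This item is only used if featuremethod is passed as 'pct' (the default).
featuremetric: the feature importance metric below which derived sets are trimmed from the output. Note that this item is only used if featuremethod is passed as 'metric'.
featuremethod: can be passed as either 'pct' or 'metric' to select which feature importance method is used for trimming the derived sets.
PCAn_components: a user can pass an integer to define the number of PCA-derived features for purposes of dimensionality reduction, such integer to be less than the otherwise returned number of sets. The function will default to kernel PCA for all non-negative sets, or otherwise Sparse PCA. If this value is passed as a float, then linear PCA will be applied such that the returned number of sets is the minimum number that can reproduce that percent of the variance. Note this can also be passed in conjunction with an assigned PCA type or parameters in the ML_cmnd object.
PCAexcl: a list of column headers for columns that are to be excluded from any application of PCA.
ML_cmnd: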

ML_cmnd = {'MLinfill_type':'default', \
           'MLinfill_cmnd':{'RandomForestClassifier':{}, 'RandomForestRegressor':{}}, \
           'PCA_type':'default', \
           'PCA_cmnd':{}}, \

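The ML_cmnd allows a user to pass parameters to the predictive algorithms used for ML infill and feature importance evaluation. Currently the only option for 'MLinfill_type' is 'default', which uses Scikit-Learn's Random Forest implementation; the intent is to add other options in a future extension. For example, a user wishing to pass the custom parameter max_depth to the random forest algorithms could do so via: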

ML_cmnd = {'MLinfill_type':'default', \
           'MLinfill_cmnd':{'RandomForestClassifier':{'max_depth':4}, \
                            'RandomForestRegressor':{'max_depth':4}}, \
           'PCA_type':'default', \
           'PCA_cmnd':{}}, \

#(note that currently unable to pass RF parameters to criterion and n_jobs)

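A user can also assign specific methods for PCA transforms. The PCA_type values currently supported are 'PCA', 'SparsePCA', and 'KernelPCA', all via Scikit-Learn. Note that n_components is passed separately with the PCAn_components argument noted above. A user can also pass parameters to the PCA functions through PCA_cmnd; for example, one could pass a kernel type for KernelPCA as: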

ML_cmnd = {'MLinfill_type':'default', \
           'MLinfill_cmnd':{'RandomForestClassifier':{}, \
                            'RandomForestRegressor':{}}, \
           'PCA_type':'KernelPCA', \
           'PCA_cmnd':{'kernel':'sigmoid'}}, \

#Also note that SparsePCA currently doesn't have available
#n_jobs or normalize_components, and similarly KernelPCA 
#doesn't have available n_jobs.

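Note that by default PCA is applied automatically when the number of features exceeds 0.50 times the number of rows in the train set. A user can change this ratio by passing 'PCA_cmnd':{'col_row_ratio':0.22} for instance. A user can also turn off the default PCA transforms by passing 'PCA_cmnd':{'PCA_type':'off'}. A user can also exclude returned boolean (0/1) columns from any PCA application by passing 'PCA_cmnd':{'bool_PCA_excl':True}, or exclude returned boolean and ordinal columns from PCA application by passing 'PCA_cmnd':{'bool_ordl_PCAexcl':True}, which could potentially result in memory savings.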
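The default trigger described above can be sketched as a small check. This is an illustration of the stated heuristic only: the helper name default_pca_applies is hypothetical, and the way the dictionary keys are read here is an assumption based on the entries quoted in the text, not the library's internal logic.

```python
def default_pca_applies(n_rows, n_features, ML_cmnd):
    """Return True when the default heuristic would trigger PCA.

    Assumes 'col_row_ratio' and the 'off' switch live under 'PCA_cmnd',
    as quoted in the text, with a default ratio of 0.50.
    """
    pca_cmnd = ML_cmnd.get("PCA_cmnd", {})
    if pca_cmnd.get("PCA_type") == "off":
        return False
    ratio = pca_cmnd.get("col_row_ratio", 0.50)
    return n_features > ratio * n_rows


default_cmnd = {"MLinfill_type": "default",
                "MLinfill_cmnd": {"RandomForestClassifier": {},
                                  "RandomForestRegressor": {}},
                "PCA_type": "default",
                "PCA_cmnd": {}}

# Same settings with a lower ratio, and with default PCA switched off.
custom_ratio_cmnd = dict(default_cmnd, PCA_cmnd={"col_row_ratio": 0.22})
pca_off_cmnd = dict(default_cmnd, PCA_cmnd={"PCA_type": "off"})
```

For a train set of 100 rows, 60 features would trigger the default transform, 40 would not, and 30 would only do so under the lowered ratio.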

assigncat:

#Here are the current transformation options built into our library, which
#we are continuing to build out. A user may also define their own.

    assigncat = {'mnmx':[], 'mnm2':[], 'mnm3':[], 'mnm4':[], 'mnm5':[], 'mnm6':[], \
                 'nmbr':[], 'nbr2':[], 'nbr3':[], 'MADn':[], 'MAD2':[], 'MAD3':[], \
                 'bins':[], 'bint':[], \
                 'bxcx':[], 'bxc2':[], 'bxc3':[], 'bxc4':[], \
                 'log0':[], 'log1':[], 'pwrs':[], \
                 'bnry':[], 'text':[], 'ordl':[], 'ord2':[], \
                 'date':[], 'dat2':[], 'wkdy':[], 'bshr':[], 'hldy':[], \
                 'excl':[], 'exc2':[], 'exc3':[], 'null':[], 'eval':[]}

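A user may add column identifier strings to each of these lists to designate the column-specific processing approach. Note that this processing category will serve as the "root" of the tree of transforms as defined in the transformdict. Note that additional categories may be passed if they are defined in the passed transformdict and processdict. An example of usage: if a user wanted to process only the numerical columns 'nmbrcolumn1' and 'nmbrcolumn2' with z-score normalization instead of the full range of numerical derivations, they could pass assigncat = {'nbr2':['nmbrcolumn1'], …}. We provide details on each of the built-in library of transformations below.

assigninfill: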

#Here are the current infill options built into our library, which
#we are continuing to build out.
assigninfill = {'stdrdinfill':[], 'MLinfill':[], 'zeroinfill':[], 'oneinfill':[], \
                'adjinfill':[], 'meaninfill':[], 'medianinfill':[]}, \

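A user may add column identifier strings to each of these lists to designate the column-specific infill approach for missing or improperly formatted values. Note that this infill category defaults to MLinfill if nothing is assigned and the MLinfill argument to automunge is set to True. stdrdinfill means: mean for numerical sets, most common value for binary, and a new boolean column for categorical. zeroinfill means inserting the integer 0 into missing cells. oneinfill means inserting the integer 1. adjinfill means passing the value from the preceding row to missing cells. meaninfill means inserting the mean derived from the train set into numeric columns. medianinfill means inserting the median derived from the train set into numeric columns. (Note that currently boolean columns derived from numeric sets are not supported for mean/median infill, and those cases default to the infill from stdrdinfill.)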
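To make the simpler infill options concrete, here is a sketch operating on a plain list with None marking missing cells. The infill helper is hypothetical and only mirrors the descriptions above; in particular, leaving a missing first row unfilled under adjinfill is an assumption.

```python
import statistics


def infill(column, method, train_column=None):
    """Fill None cells in `column` using one of the named infill methods.

    Mean and median are derived from `train_column` when given, mirroring
    the note that these values come from the train set.
    """
    if method == "adjinfill":
        filled, previous = [], None
        for value in column:
            if value is None:
                value = previous  # stays None if there is no preceding row
            filled.append(value)
            previous = value
        return filled

    reference = column if train_column is None else train_column
    observed = [v for v in reference if v is not None]
    if method == "zeroinfill":
        fill = 0
    elif method == "oneinfill":
        fill = 1
    elif method == "meaninfill":
        fill = statistics.mean(observed)
    elif method == "medianinfill":
        fill = statistics.median(observed)
    else:
        raise ValueError("unsupported infill method: " + method)
    return [fill if v is None else v for v in column]


column = [1, None, 3, None, 8]
mean_filled = infill(column, "meaninfill")
adjacent_filled = infill(column, "adjinfill")
```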

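transformdict: allows a user to pass a custom tree of transformations. Note that a user may define their own 4-character string "root" identifiers for a series of processing steps using the categories of processing already defined in our library, and then assign columns in assigncat; for custom processing functions, this method should be combined with processdict, which is only slightly more complex. For example, a user wishing to define a new set of transformations for a numerical series, 'newt', that combines NArows, min-max, box-cox, z-score, and standard deviation bins could do so by passing a transformdict as: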

transformdict =  {'newt' : {'parents' : ['bxc4'], \
                            'siblings': [], \
                            'auntsuncles' : ['mnmx'], \
                            'cousins' : ['NArw'], \
                            'children' : [], \
                            'niecesnephews' : [], \
                            'coworkers' : [], \
                            'friends' : []}}

#Where since bxc4 is passed as a parent, this will result in pulling
#offspring keys from the bxc4 family tree, which has a nbr2 key as children.

#from automunge library:
    transform_dict.update({'bxc4' : {'parents' : ['bxcx'], \
                                     'siblings': [], \
                                     'auntsuncles' : [], \
                                     'cousins' : ['NArw'], \
                                     'children' : ['nbr2'], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

#note that 'nbr2' is passed as a children primitive meaning if nbr2 key
#has any offspring those will be produced as well.

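Basically, 'newt' here is the root key, and when a key is passed to one of the family primitives the corresponding processing function is applied; if it is passed to a family primitive with downstream offspring, then those offspring keys are pulled from that key's family tree. For example, here mnmx is passed as an auntsuncles, which means the mnmx processing function is applied with no downstream offspring. The bxc4 key is passed as a parent, which means its transform is applied coupled with any downstream transforms from that key's family tree, which we also show. Note the family primitives tree can be summarized as: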

'parents' :           upstream / first generation / replaces column / with offspring
'siblings':           upstream / first generation / supplements column / with offspring
'auntsuncles' :       upstream / first generation / replaces column / no offspring
'cousins' :           upstream / first generation / supplements column / no offspring
'children' :          downstream parents / offspring generations / replaces column / with offspring
'niecesnephews' :     downstream siblings / offspring generations / supplements column / with offspring
'coworkers' :         downstream auntsuncles / offspring generations / replaces column / no offspring
'friends' :           downstream cousins / offspring generations / supplements column / no offspring
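The way a root key expands through these primitives can be sketched as a small recursive walk. This is a simplified illustration of the rules in the table, not the library's code: the helper names are hypothetical, the underscore-joined labels only record the chain of keys (actual returned column suffixes may differ), and the trees for 'bxc4' and 'nbr2' are copied from the entries shown in this document.

```python
UPSTREAM = ("parents", "siblings", "auntsuncles", "cousins")
DOWNSTREAM = ("children", "niecesnephews", "coworkers", "friends")
# Primitives whose keys bring along offspring from their own family tree.
WITH_OFFSPRING = {"parents", "siblings", "children", "niecesnephews"}


def empty_tree(**entries):
    """Build a family tree with every primitive present, empty by default."""
    tree = {name: [] for name in UPSTREAM + DOWNSTREAM}
    tree.update(entries)
    return tree


def _descend(key, chain, transform_dict, applied):
    """Apply the downstream primitives found in `key`'s family tree."""
    for primitive in DOWNSTREAM:
        for child in transform_dict[key][primitive]:
            child_chain = chain + [child]
            applied.append("_".join(child_chain))
            if primitive in WITH_OFFSPRING:
                _descend(child, child_chain, transform_dict, applied)


def resolve(root, transform_dict):
    """List the chains of transform keys applied for a root category."""
    applied = []
    for primitive in UPSTREAM:
        for key in transform_dict[root][primitive]:
            applied.append(key)
            if primitive in WITH_OFFSPRING:
                _descend(key, [key], transform_dict, applied)
    return applied


transform_dict = {
    "newt": empty_tree(parents=["bxc4"], auntsuncles=["mnmx"], cousins=["NArw"]),
    "bxc4": empty_tree(parents=["bxcx"], cousins=["NArw"], children=["nbr2"]),
    "nbr2": empty_tree(auntsuncles=["nmbr"], cousins=["NArw"]),
}

print(resolve("newt", transform_dict))
# ['bxc4', 'bxc4_nbr2', 'mnmx', 'NArw']
```

Only keys reached through a primitive with offspring need an entry in the dictionary, which is why 'mnmx' and 'NArw' can be left out of this toy example.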

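Note that when we define a new transform such as 'newt' above, we also need to define a corresponding processdict entry for the new category, which we demonstrate here:

processdict: allows a user to define their own processing functions corresponding to new transformdict keys. We describe the entries here: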

#for example 
processdict =  {'newt' : {'dualprocess' : None, \
                          'singleprocess' : None, \
                          'postprocess' : None, \
                          'NArowtype' : 'numeric', \
                          'MLinfilltype' : 'numeric', \
                          'labelctgy' : 'mnmx'}}

#A user should pass either a pair of processing functions to both 
#dualprocess and postprocess, or alternatively just a single processing
#function to singleprocess, and pass None to those not used.
#For now, if just using the category as a root key and not as a family primitive, 
#can simply pass None to all the processing slots. We'll demonstrate their 
#composition and data structures for custom processing functions later in this 
#document.

#dualprocess: for passing a processing function in which normalization 
#             parameters are derived from properties of the training set
#             and jointly process the train set and if available test set

#singleprocess: for passing a processing function in which no normalization
#               parameters are needed from the train set to process the
#               test set, such that train and test sets processed separately

#postprocess: for passing a processing function in which normalization 
#             parameters originally derived from the train set are applied
#             to separately process a test set

#NArowtype: can be entries of either 'numeric', 'justNaN', or 'exclude' where
#           'numeric' refers to columns where non-numeric entries are subject
#                     to infill
#           'justNaN' refers to columns where only NaN entries are subject
#                     to infill
#           'exclude' refers to columns where no infill will be performed

#MLinfilltype: can be entries of 'numeric', 'singlct', 'multirt', 'multisp',
#              'exclude', or 'label' where
#              'numeric' refers to columns where predictive algorithms treat
#              as a regression for numeric sets
#              'singlct' refers to columns where category gives a single
#              column where predictive algorithms treat as a boolean classifier
#              'multirt' refers to category returning multiple columns where 
#              predictive algorithms treat as a multi modal classifier
#              'exclude' refers to categories excluded from predictive address
#              'multisp' tbh I think this is just a duplicate of multirt, a
#              future update may strike this one
#              'label' refers to categories specifically intended for label
#              processing

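printstatus: a user can pass True/False to indicate whether the function will print the status of processing during operation. Defaults to True.

OK, we'll demonstrate further below how to build custom processing functions; for now this just gives you sufficient tools to build sets of processing using the built-in sets in the library.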

…

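postmunge

The postmunge(.) function is intended to consistently process subsequently available, consistently formatted test data with just a single function call. It requires passing the postprocess_dict object returned from the original application of automunge, and that the passed test data have column header labeling consistent with the original train set.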


#for postmunge(.) function on subsequently available test data
#using the postprocess_dict object returned from original automunge(.) application

#Remember to initialize automunge
from AutoMunge_pkg import AutoMunge
am = AutoMunge.AutoMunge()


#Then we can run postmunge function as:

test, testID, testlabels, \
labelsencoding_dict, finalcolumns_test = \
am.postmunge(postprocess_dict, df_test, testID_column = False, \
             labelscolumn = False, pandasoutput=True, printstatus = True, \
             TrainLabelFreqLevel = False, featureeval = False)

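postmunge(.) returned sets:

Here now are descriptions of the sets returned from postmunge, followed by descriptions of the arguments which can be passed to the function.

test: the set of features, processed consistently with the training data, that can be used to generate predictions from a model trained with the train set returned from automunge.
testID: the set of ID values corresponding to the test set.
testlabels: a set of numerically encoded labels corresponding to the test set, if a label column was passed. Note that the function assumes the label column is originally included in the train set. Note that if the labels set is a single column, a returned numpy array is flattened (e.g. [[1,2,3]] converted to [1,2,3]).
labelsencoding_dict: this is the same labelsencoding_dict returned from automunge; it is used in case one wants to reverse encode predicted labels.
finalcolumns_test: a list of the column headers corresponding to the test data. Note that the inclusion of suffix appenders is used to identify which feature engineering transformations were applied to each column. Note that this list should match the one from automunge.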

…

postmunge(.) passed arguments


#for postmunge(.) function on subsequently available test data
#using the postprocess_dict object returned from original automunge(.) application

#Remember to initialize automunge
from AutoMunge_pkg import AutoMunge
am = AutoMunge.AutoMunge()


#Then we can run postmunge function as:

test, testID, testlabels, \
labelsencoding_dict, finalcolumns_test = \
am.postmunge(postprocess_dict, df_test, testID_column = False, \
             labelscolumn = False, pandasoutput=True, printstatus = True, \
             TrainLabelFreqLevel = False, featureeval = False)

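postprocess_dict: this is the dictionary returned from the initial application of automunge, which includes the normalization parameters needed to process test data consistently with the original processing of the train set. This requires a user to remember to save the dictionary at the original application of automunge; otherwise, if this dictionary is not available, a user can feed the subsequent test data to automunge along with the original train data, exactly as was used in the original automunge call.
df_test: a pandas dataframe or numpy array containing a structured dataset intended for use in generating predictions from a machine learning model trained on the automunge returned sets. The set must be formatted consistently with the train set, with consistent order of columns and, if labels are included, consistent labels. If desired the set may include an ID column. The tool supports the inclusion of a non-index-range column as index or a multicolumn index (requires named index columns). Such index types are added to the returned "ID" sets, which are shuffled and partitioned consistently with the train and test sets.
testID_column: the column title for the column from the df_test set intended for use as a row identifier value (such as sequential numbers, for instance). The function defaults to False for cases where the test set does not include an ID column. A user can also pass a list of string column titles, such as to carve out multiple columns to be excluded from processing but consistently partitioned.
labelscolumn: the default of False indicates that a labels column is not included in the test set passed to postmunge. A user can pass either True or the string ID of the labels column, noting that the labels column header string must be consistent with that of the original train set.
pandasoutput: a selector for the format of returned sets. Defaults to False for returned numpy arrays. If set to True, returns pandas dataframes (note that the index is not preserved; an ID column may be passed for index identification).
printstatus: a user can pass True/False to indicate whether the function will print the status of processing during operation. Defaults to True.
TrainLabelFreqLevel: a boolean identifier (True/False) which indicates whether the TrainLabelFreqLevel method will be applied to oversample test data associated with underrepresented labels. The method adds multiples of the test data rows for those labels with lower frequency, resulting in an (approximately) levelized frequency. This defaults to False. Note that this feature may be applied to numerical label sets if the processing applied to the set includes standard deviation bins.
featureeval: a boolean identifier (True/False) to activate a feature importance evaluation, comparable to the one performed in automunge but based on the test set passed to postmunge. Currently the results report is not returned as an object; the results are printed in the output (for backward compatibility).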

…

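Library of Transformations

Automunge has a built-in library of transformations that can be assigned to specific columns with assigncat. (A column left unassigned will defer to the automated default methods.) For example, a user can pass a min-max scaling method to a specific column 'col1' with: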

assigncat = {'mnmx':['col1']}

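When a user assigns a column to a specific category, that category is treated as the root category for the tree of transformations. Each key has an associated transformation function, and that transformation function is only applied if the root key is also found in the tree of family primitives. The tree of family primitives, as introduced earlier, first applies the transforms for keys found in the upstream primitives of the root key's tree, i.e. parents/siblings/auntsuncles/cousins. If a transform is applied for a primitive that includes downstream offspring, such as parents/siblings, then the family tree for that key with offspring is inspected to determine the downstream offspring categories; for example, if we have a parents key of 'mnmx', then any children/niecesnephews/coworkers/friends in the 'mnmx' family tree will be applied as parents/siblings/auntsuncles/cousins, respectively. Note that the designation of supplements/replaces refers purely to whether the column to which the transform is being applied is kept in place or removed. Please note that it is a quirk of the function that no original column can be left in place without some transformation being applied, so at least one replacement primitive must always be included. If a user does wish to leave a column in place unaltered, they can simply assign that column to the 'excl' root category.

Now we'll start by listing again the family tree primitives, followed by the family trees for the root categories built into the automunge library. After that we'll give a quick narrative for each of the associated transformation functions. Here again are the family tree primitives.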

'parents' :           upstream / first generation / replaces column / with offspring
'siblings':           upstream / first generation / supplements column / with offspring
'auntsuncles' :       upstream / first generation / replaces column / no offspring
'cousins' :           upstream / first generation / supplements column / no offspring
'children' :          downstream parents / offspring generations / replaces column / with offspring
'niecesnephews' :     downstream siblings / offspring generations / supplements column / with offspring
'coworkers' :         downstream auntsuncles / offspring generations / replaces column / no offspring
'friends' :           downstream cousins / offspring generations / supplements column / no offspring

Here are the family trees currently built into the internal library.

    transform_dict.update({'nmbr' : {'parents' : ['nmbr'], \
                                     'siblings': [], \
                                     'auntsuncles' : [], \
                                     'cousins' : ['NArw'], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : ['bint']}})

    transform_dict.update({'bnry' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['bnry'], \
                                     'cousins' : ['NArw'], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'text' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['text'], \
                                     'cousins' : ['NArw'], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'ordl' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['ordl'], \
                                     'cousins' : ['NArw'], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'ord2' : {'parents' : ['ord2'], \
                                     'siblings': [], \
                                     'auntsuncles' : [], \
                                     'cousins' : ['NArw'], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : ['mnmx'], \
                                     'friends' : []}})

    transform_dict.update({'null' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['null'], \
                                     'cousins' : [], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'NArw' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['NArw'], \
                                     'cousins' : [], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'rgrl' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['nmbr'], \
                                     'cousins' : [], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'nbr2' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['nmbr'], \
                                     'cousins' : ['NArw'], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'nbr3' : {'parents' : ['nmbr'], \
                                     'siblings': [], \
                                     'auntsuncles' : [], \
                                     'cousins' : ['NArw'], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : ['bint']}})

    transform_dict.update({'MADn' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['MADn'], \
                                     'cousins' : ['NArw'], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'MAD2' : {'parents' : ['MAD2'], \
                                     'siblings': [], \
                                     'auntsuncles' : [], \
                                     'cousins' : ['NArw'], \
                                     'children' : ['nmbr'], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'MAD3' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['MAD3'], \
                                     'cousins' : [NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'mnmx' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['mnmx'], \
                                     'cousins' : [NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'mnm2' : {'parents' : ['nmbr'], \
                                     'siblings': [], \
                                     'auntsuncles' : ['mnmx'], \
                                     'cousins' : [NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'mnm3' : {'parents' : ['nmbr'], \
                                     'siblings': [], \
                                     'auntsuncles' : ['mnm3'], \
                                     'cousins' : [NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'mnm4' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['mnm3'], \
                                     'cousins' : [NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'mnm5' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['mnmx'], \
                                     'cousins' : ['nmbr', NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'mnm6' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['mnm6'], \
                                     'cousins' : [NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'mnm7' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['mnmx', 'bins'], \
                                     'cousins' : [], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'date' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['date'], \
                                     'cousins' : [NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'dat2' : {'parents' : [], \
                                     'siblings': ['bshr', 'wkdy', 'hldy'], \
                                     'auntsuncles' : ['date'], \
                                     'cousins' : [NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'bxcx' : {'parents' : ['bxcx'], \
                                     'siblings': [], \
                                     'auntsuncles' : [], \
                                     'cousins' : [NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : ['nmbr'], \
                                     'friends' : []}})

    transform_dict.update({'bxc2' : {'parents' : ['bxc2'], \
                                     'siblings': ['nmbr'], \
                                     'auntsuncles' : [], \
                                     'cousins' : [NArw], \
                                     'children' : ['nmbr'], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'bxc3' : {'parents' : ['bxc3'], \
                                     'siblings': [], \
                                     'auntsuncles' : [], \
                                     'cousins' : [NArw], \
                                     'children' : ['nmbr'], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'bxc4' : {'parents' : ['bxc4'], \
                                     'siblings': [], \
                                     'auntsuncles' : [], \
                                     'cousins' : [NArw], \
                                     'children' : ['nbr2'], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'pwrs' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['pwrs'], \
                                     'cousins' : [NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'log0' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['log0'], \
                                     'cousins' : [NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'log1' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['log0', 'pwrs'], \
                                     'cousins' : [NArw], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'wkdy' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['wkdy'], \
                                     'cousins' : [], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'bshr' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['bshr'], \
                                     'cousins' : [], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'hldy' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['hldy'], \
                                     'cousins' : [], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'bins' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['bins'], \
                                     'cousins' : [], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'bint' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['bint'], \
                                     'cousins' : [], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'excl' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['excl'], \
                                     'cousins' : [], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})

    transform_dict.update({'exc2' : {'parents' : ['exc2'], \
                                     'siblings': [], \
                                     'auntsuncles' : [], \
                                     'cousins' : [], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : ['bins'], \
                                     'friends' : []}})

    transform_dict.update({'exc3' : {'parents' : [], \
                                     'siblings': [], \
                                     'auntsuncles' : ['exc2'], \
                                     'cousins' : [], \
                                     'children' : [], \
                                     'niecesnephews' : [], \
                                     'coworkers' : [], \
                                     'friends' : []}})
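
Every entry above carries the same eight family tree primitives, so a malformed entry can be caught mechanically before it is passed along. The following is a minimal sketch under stated assumptions: the helper name `check_family_tree` and the sample entry are made up for illustration and are not part of the Automunge API.

```python
# Hypothetical helper for illustration; not part of the Automunge API.
PRIMITIVES = ['parents', 'siblings', 'auntsuncles', 'cousins',
              'children', 'niecesnephews', 'coworkers', 'friends']

def check_family_tree(entry):
    """Return a list of problems found in one family tree entry."""
    problems = []
    for primitive in PRIMITIVES:
        if primitive not in entry:
            problems.append('missing primitive: ' + primitive)
        elif not isinstance(entry[primitive], list):
            problems.append('not a list: ' + primitive)
    for key in entry:
        if key not in PRIMITIVES:
            problems.append('unexpected key: ' + key)
    return problems

# Sample entry mirroring the 'bxcx' definition, with NArw written as a string
bxcx_entry = {'parents': ['bxcx'], 'siblings': [], 'auntsuncles': [],
              'cousins': ['NArw'], 'children': [], 'niecesnephews': [],
              'coworkers': ['nmbr'], 'friends': []}

print(check_family_tree(bxcx_entry))       # a well-formed entry
print(check_family_tree({'parents': []}))  # an entry missing primitives
```

A check like this only validates the shape of an entry; it makes no claim about how Automunge interprets each primitive.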

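Below is a quick description of the transformation functions associated with each key, any of which can be assigned to a family tree primitive (or used as a root key). We're continuing to build out this library of transformations.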

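NArw: generates a column of boolean identifiers for rows in the source column with missing or improperly formatted values
nmbr/nbr2/nbr3: z-score normalization
MADn/MAD2: mean absolute deviation normalization, subtracting the set mean
mnmx/mnm2/mnm5: vanilla min-max scaling
mnm3/mnm4: min-max scaling with outliers capped at the 0.01 and 0.99 quantiles
mnm6: min-max scaling with the test set capped at the min/max of the train set
bnry: converts sets with two values to boolean identifiers
text: converts categorical sets to a one-hot encoded set of boolean identifiers
ordl/ord2: converts categorical sets to an ordinally encoded set of integer identifiers
bxcx/bxc2/bxc3/bxc4: performs a Box-Cox power law transformation
log0/log1: performs a logarithmic transform (base 10)
pwrs: bins groupings by powers of 10
date/dat2: for datetime formatted data, segregates the data by time scale into multiple columns (year/month/day/hour/minute/second) and then performs z-score normalization
wkdy: boolean identifier indicating whether a datetime object is a weekday
bshr: boolean identifier indicating whether a datetime object falls in business hours (9-5, time zone unaware)
hldy: boolean identifier indicating whether a datetime object is a US federal holiday
bins: for numerical sets, outputs a set of 6 columns indicating where a value falls with respect to the number of standard deviations from the mean of the set (i.e. <-2, -2 to -1, -1 to 0, 0 to 1, 1 to 2, >2)
bint: comparable to bins, except that it assumes the source data is already normalized
null: deletes the source column
excl: passes the source column through unaltered
exc2: passes the source column through unaltered other than infill
eval: performs a distribution property evaluation on the designated column, consistent with the automunge 'powertransform' parameter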
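
To make a few of these descriptions concrete, here is a minimal pandas sketch of what nmbr, mnm3, and bins compute according to the descriptions above. These are illustrative reimplementations, not Automunge's internal code: the function names, the bin labels, the use of the sample standard deviation, and the right-inclusive bin edges are all assumptions.

```python
import numpy as np
import pandas as pd

def zscore(train, test):
    # nmbr-style: both sets use the mean and standard deviation of the train set
    mean, std = train.mean(), train.std()
    return (train - mean) / std, (test - mean) / std

def capped_minmax(train, test, low=0.01, high=0.99):
    # mnm3-style: cap outliers at train set quantiles, then min-max scale
    qmin, qmax = train.quantile(low), train.quantile(high)

    def scale(s):
        return (s.clip(lower=qmin, upper=qmax) - qmin) / (qmax - qmin)

    return scale(train), scale(test)

def stdev_bins(normalized):
    # bins/bint-style: six indicator columns by number of standard deviations
    edges = [-np.inf, -2, -1, 0, 1, 2, np.inf]
    labels = ['<-2', '-2to-1', '-1to0', '0to1', '1to2', '>2']  # assumed labels
    return pd.get_dummies(pd.cut(normalized, bins=edges, labels=labels))

train = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
test = pd.Series([0.0, 2.5, 500.0])

z_train, z_test = zscore(train, test)
m_train, m_test = capped_minmax(train, test)
bin_columns = stdev_bins(z_train)

print(m_test.tolist())
print(bin_columns.columns.tolist())
```

Note that in each case the parameters are derived from the train set only and then reused for the test set, which is the same pattern the custom transformation functions below have to follow.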

…

Custom Transformation Functions

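OK, final item on the agenda: we'll demonstrate how to create custom transformation functions, so that a user can customize feature engineering while building on all of Automunge's very useful built-in features, such as infill methods including ML infill, feature importance, dimensionality reduction, and perhaps most importantly the simplest possible means of consistently processing subsequently available data with just a single function call. The transformation functions need to be channeled through pandas and incorporate a handful of simple data structures, which we'll demonstrate below.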

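Let's say we want to recreate the mnm3 category, which caps outliers at the 0.01 and 0.99 quantiles, but with the 0.001 and 0.999 quantiles instead. We'll call this category mnm8. To pass a custom transformation function, we first need to define the new root category in a transformdict and a corresponding processdict.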

#Let's create a really simple family tree for the new root category mnm8 which
#simply creates a column identifying any rows subject to infill (NArw), performs 
#a z-score normalization, and separately performs a version of the new transform
#mnm8 which we'll define below.

transformdict = {'mnm8' : {'parents' : [], \
                           'siblings': [], \
                           'auntsuncles' : ['mnm8', 'nmbr'], \
                           'cousins' : ['NArw'], \
                           'children' : [], \
                           'niecesnephews' : [], \
                           'coworkers' : [], \
                           'friends' : []}}

#Note that since this mnm8 requires passing normalization parameters derived
#from the train set to process the test set, we'll need to create two separate 
#transformation functions, the first a "dualprocess" function that processes
#both the train and if available a test set simultaneously, and the second
#a "postprocess" that only processes the test set on its own.

#So what's being demonstrated here is that we're passing the functions under
#dualprocess and postprocess that we'll define below. (When running this as a
#script, define those functions before building processdict.)

processdict = {'mnm8' : {'dualprocess' : process_mnm8_class, \
                         'singleprocess' : None, \
                         'postprocess' : postprocess_mnm8_class, \
                         'NArowtype' : 'numeric', \
                         'MLinfilltype' : 'numeric', \
                         'labelctgy' : 'mnm8'}}
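
A processdict entry can be sanity-checked in the same spirit before it is handed to automunge. The sketch below is only an illustration: `check_processdict_entry` and the stand-in functions are hypothetical names, and the rule that an entry needs either a dualprocess/postprocess pair or a singleprocess function is my reading of the comments in this walkthrough, not a documented Automunge requirement.

```python
# Hypothetical helper for illustration; not part of the Automunge API.
REQUIRED_KEYS = ['dualprocess', 'singleprocess', 'postprocess',
                 'NArowtype', 'MLinfilltype', 'labelctgy']

def check_processdict_entry(entry):
    """Return a list of problems found in one processdict entry."""
    problems = ['missing key: ' + key for key in REQUIRED_KEYS if key not in entry]
    if problems:
        return problems
    has_pair = callable(entry['dualprocess']) and callable(entry['postprocess'])
    has_single = callable(entry['singleprocess'])
    if not (has_pair or has_single):
        problems.append('need dualprocess plus postprocess, or singleprocess')
    return problems

# Stand-in functions so the sketch runs on its own
def demo_dualprocess(mdf_train, mdf_test, column, category, postprocess_dict):
    return mdf_train, mdf_test, []

def demo_postprocess(mdf_test, column, postprocess_dict, columnkey):
    return mdf_test

demo_entry = {'dualprocess': demo_dualprocess,
              'singleprocess': None,
              'postprocess': demo_postprocess,
              'NArowtype': 'numeric',
              'MLinfilltype': 'numeric',
              'labelctgy': 'mnm8'}

print(check_processdict_entry(demo_entry))
print(check_processdict_entry(dict(demo_entry, postprocess=None)))
```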

#Now we have to define the custom processing functions which we are passing through
#the processdict to automunge.

#Instead of demonstrating the full functions, I'll just demonstrate the
#requirements


#Here we'll define a "dualprocess" function intended to process both a train and
#test set simultaneously. We'll also need to create a separate "postprocess"
#function intended to just process the test set.

#the functions below rely on pandas
import pandas as pd

#define the function
def process_mnm8_class(mdf_train, mdf_test, column, category, \
                       postprocess_dict):
  #where
  #mdf_train is the train data set (pandas dataframe)
  #mdf_test is the consistently formatted test dataset (if no test data 
  #set is available a dummy set will be passed in its place)
  #column is the string identifying the column header
  #category is the 4 character string category identifier, here it will be 'mnm8'
  #postprocess_dict is an object we pass to share data between functions if needed

  #create the new column, using the category key as a suffix identifier

  #copy source column into new column
  mdf_train[column + '_mnm8'] = mdf_train[column].copy()
  mdf_test[column + '_mnm8'] = mdf_test[column].copy()


  #first convert all values to either numeric or NaN. An initial infill is
  #applied further below using the mean as a plug; automunge will separately
  #perform an infill method per user specifications elsewhere
  mdf_train[column + '_mnm8'] = pd.to_numeric(mdf_train[column + '_mnm8'], errors='coerce')
  mdf_test[column + '_mnm8'] = pd.to_numeric(mdf_test[column + '_mnm8'], errors='coerce')



  #Now we do the specifics of the processing function, here we're demonstrating
  #the min-max scaling method capping values at 0.001 and 0.999 quantiles

  #get 0.999 quantile of training column
  quantilemax = mdf_train[column + '_mnm8'].quantile(.999)

  #get 0.001 quantile of training column
  quantilemin = mdf_train[column + '_mnm8'].quantile(.001)

  #replace values > quantilemax with quantilemax
  #(each set is masked by its own values, thresholds come from the train set)
  mdf_train.loc[mdf_train[column + '_mnm8'] > quantilemax, (column + '_mnm8')] \
  = quantilemax
  mdf_test.loc[mdf_test[column + '_mnm8'] > quantilemax, (column + '_mnm8')] \
  = quantilemax
  #replace values < quantilemin with quantilemin
  mdf_train.loc[mdf_train[column + '_mnm8'] < quantilemin, (column + '_mnm8')] \
  = quantilemin
  mdf_test.loc[mdf_test[column + '_mnm8'] < quantilemin, (column + '_mnm8')] \
  = quantilemin


  #note the infill method is now completed after the quantile evaluation / replacement
  #get mean of training data
  mean = mdf_train[column + '_mnm8'].mean()    
  #replace missing data with training set mean
  mdf_train[column + '_mnm8'] = mdf_train[column + '_mnm8'].fillna(mean)
  mdf_test[column + '_mnm8'] = mdf_test[column + '_mnm8'].fillna(mean)


  #perform min-max scaling to train and test sets using values from train
  mdf_train[column + '_mnm8'] = (mdf_train[column + '_mnm8'] - quantilemin) / \
                                (quantilemax - quantilemin)
  mdf_test[column + '_mnm8'] = (mdf_test[column + '_mnm8'] - quantilemin) / \
                               (quantilemax - quantilemin)


  #ok here's where we populate the data structures

  #create list of columns (here it will only be one column returned)
  nmbrcolumns = [column + '_mnm8']

  #The normalization dictionary is how we pass values between the "dualprocess"
  #function and the "postprocess" function

  #Here we populate the normalization dictionary with any values derived from
  #the train set that we'll need to process the test set.
  nmbrnormalization_dict = {column + '_mnm8' : {'quantilemin' : quantilemin, \
                                                'quantilemax' : quantilemax, \
                                                'mean' : mean}}

  #the column_dict_list is returned from the function call and supports the 
  #automunge methods. We populate it as follows:

  #initialize
  column_dict_list = []

  #where we're storing following
  #{'category' : 'mnm8', \ -> identifier of the category of transform applied
  # 'origcategory' : category, \ -> category of original column in train set, passed in function call
  # 'normalization_dict' : nmbrnormalization_dict, \ -> normalization parameters of train set
  # 'origcolumn' : column, \ -> ID of original column in train set
  # 'columnslist' : nmbrcolumns, \ -> a list of columns created in this transform, 
  #                                  later fleshed out to include all columns derived from same source column
  # 'categorylist' : [nc], \ -> a list of columns created in this transform
  # 'infillmodel' : False, \ -> populated elsewhere, for now enter False
  # 'infillcomplete' : False, \ -> populated elsewhere, for now enter False
  # 'deletecolumn' : False}} -> populated elsewhere, for now enter False

  for nc in nmbrcolumns:

    if nc[-5:] == '_mnm8':

      column_dict = { nc : {'category' : 'mnm8', \
                           'origcategory' : category, \
                           'normalization_dict' : nmbrnormalization_dict, \
                           'origcolumn' : column, \
                           'columnslist' : nmbrcolumns, \
                           'categorylist' : [nc], \
                           'infillmodel' : False, \
                           'infillcomplete' : False, \
                           'deletecolumn' : False}}

      column_dict_list.append(column_dict.copy())



  return mdf_train, mdf_test, column_dict_list

  #where mdf_train and mdf_test now have the new column incorporated
  #and column_dict_list carries the data structures supporting the operation 
  #of automunge. (If the original column was intended for replacement it 
  #will be stricken elsewhere)


#and then since this is a method that passes values between the train
#and test sets, we'll need to define a corresponding "postprocess" function
#intended for use on just the test set

def postprocess_mnm8_class(mdf_test, column, postprocess_dict, columnkey):
  #where mdf_test is a dataframe of the test set
  #column is the string of the column header
  #postprocess_dict is how we carry packets of data between the 
  #functions in automunge
  #columnkey is a key used to access stuff in postprocess_dict if needed


  #retrieve normalization parameters from postprocess_dict
  normkey = column + '_mnm8'

  mean = \
  postprocess_dict['column_dict'][normkey]['normalization_dict'][normkey]['mean']

  quantilemin = \
  postprocess_dict['column_dict'][normkey]['normalization_dict'][normkey]['quantilemin']

  quantilemax = \
  postprocess_dict['column_dict'][normkey]['normalization_dict'][normkey]['quantilemax']

  #copy original column for implementation
  mdf_test[column + '_mnm8'] = mdf_test[column].copy()


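  #convert all values to either numeric or NaN
  mdf_test[column + '_mnm8'] = pd.to_numeric(mdf_test[column + '_mnm8'], errors='coerce')

  #cap values at the train set quantiles, consistent with the dualprocess function
  mdf_test.loc[mdf_test[column + '_mnm8'] > quantilemax, (column + '_mnm8')] \
  = quantilemax
  mdf_test.loc[mdf_test[column + '_mnm8'] < quantilemin, (column + '_mnm8')] \
  = quantilemin

  #replace missing data with training set mean
  mdf_test[column + '_mnm8'] = mdf_test[column + '_mnm8'].fillna(mean)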

  #perform min-max scaling to test set using values from train
  mdf_test[column + '_mnm8'] = (mdf_test[column + '_mnm8'] - quantilemin) / \
                               (quantilemax - quantilemin)


  return mdf_test

#Voila

#One more demonstration, note that if we didn't need to pass any properties
#between the train and test set, we could have just processed one at a time,
#and in that case we wouldn't need to define separate functions for 
#dualprocess and postprocess, we could just define what we call a singleprocess 
#function incorporating similar data structures but with only a single dataframe 
#passed

#Such as:
def process_mnm4_class(df, column, category, postprocess_dict):

  #etc

  return df, column_dict_list

#For a full demonstration check out my essay 
#"Automunge 1.79: An Open Source Platform for Feature Engineering"
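
As a quick illustration of why the dualprocess/postprocess split matters, the sketch below runs a stripped-down version of the mnm8 logic on toy data and checks that a later, standalone pass over the test set reproduces the encoding produced alongside the train set. It is a simplified stand-in under stated assumptions: the names `apply_mnm8`, `dual_mnm8`, and `post_mnm8` are hypothetical, the parameters are passed as a plain dictionary, and Automunge's column_dict_list and postprocess_dict structures are left out.

```python
import pandas as pd

SUFFIX = '_mnm8'

def apply_mnm8(df, column, params):
    # Shared steps: coerce to numeric, cap at train quantiles,
    # fill missing values with the train mean, then min-max scale
    new = column + SUFFIX
    df[new] = pd.to_numeric(df[column], errors='coerce')
    df[new] = df[new].clip(lower=params['quantilemin'], upper=params['quantilemax'])
    df[new] = df[new].fillna(params['mean'])
    df[new] = (df[new] - params['quantilemin']) / \
              (params['quantilemax'] - params['quantilemin'])
    return df

def dual_mnm8(df_train, df_test, column):
    # Derive every parameter from the train set only
    numeric = pd.to_numeric(df_train[column], errors='coerce')
    qmin, qmax = numeric.quantile(.001), numeric.quantile(.999)
    params = {'quantilemin': qmin, 'quantilemax': qmax,
              'mean': numeric.clip(lower=qmin, upper=qmax).mean()}
    return apply_mnm8(df_train, column, params), \
           apply_mnm8(df_test, column, params), params

def post_mnm8(df_test, column, params):
    # Later pass over test data, reusing the stored train parameters
    return apply_mnm8(df_test, column, params)

df_train = pd.DataFrame({'x': [1, 2, 3, 'bad', 5]})
df_test = pd.DataFrame({'x': [0, 3, None, 10]})

train_out, test_out, params = dual_mnm8(df_train, df_test.copy(), 'x')
later_out = post_mnm8(df_test.copy(), 'x', params)

# Compare the encoding from the joint pass with the standalone pass
print(test_out['x_mnm8'].equals(later_out['x_mnm8']))
```

Routing both passes through one shared function is a design choice of this sketch; it makes it hard for the train-time and test-time logic to drift apart.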

And there you have it: you now have everything you need to wrangle data with the Automunge platform. Feedback is welcome.

…

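As a citation, please note that the Automunge package makes use of the Pandas, Scikit-learn, and NumPy libraries.

Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010) publisher link

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay. Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, 2825-2830 (2011) publisher link

Sorry, I wasn't sure which paper to cite for NumPy, but here is the NumPy website: https://www.numpy.org/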

…

Have fun munging!

…

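You can read more about the tool through the blog posts documenting its development on Medium here, or for more writing, I recently completed my first collection of essays, titled "From the Diaries of John Henry", which is also available here.

The Automunge website is at automunge.com.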

…

Patent pending

AutoMunge-pkg 2.42
