A wrapper toolbox that provides a compatibility layer between TPOT, auto-sklearn and OpenML
arbok (AutoML Wrapper toolbox for OpenML Compatibility) provides wrappers for TPOT and auto-sklearn that act as a compatibility layer between these tools and OpenML.
The wrappers extend sklearn's BaseSearchCV and expose the internal attributes that OpenML needs, such as cv_results_, best_index_, best_params_, best_score_ and classes_.
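To make the attribute contract concrete, here is an illustrative sketch (not arbok's actual implementation) of the attribute surface a BaseSearchCV-compatible wrapper is expected to expose after fitting; the class and values are made up for demonstration:

```python
# Hypothetical minimal stand-in for a BaseSearchCV-compatible wrapper.
class MinimalSearchCVLike:
    def fit(self, X, y):
        # A real wrapper would run TPOT or auto-sklearn here; we fake a
        # single evaluated candidate just to show the required attributes.
        self.cv_results_ = {
            "params": [{"max_depth": 3}],
            "mean_test_score": [0.9],
        }
        self.best_index_ = 0
        self.best_params_ = self.cv_results_["params"][self.best_index_]
        self.best_score_ = self.cv_results_["mean_test_score"][self.best_index_]
        self.classes_ = sorted(set(y))
        return self

clf = MinimalSearchCVLike().fit([[0], [1]], [0, 1])
print(clf.best_params_, clf.best_score_, clf.classes_)
# {'max_depth': 3} 0.9 [0, 1]
```

OpenML reads exactly these attributes from the fitted estimator, which is why the wrappers must provide them even though TPOT and auto-sklearn do not follow the BaseSearchCV interface themselves.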
Installation
```
pip install arbok
```
A simple example
```python
import openml
from arbok import AutoSklearnWrapper, TPOTWrapper

task = openml.tasks.get_task(31)
dataset = task.get_dataset()

# Get the AutoSklearn wrapper and pass parameters like you would to AutoSklearn
clf = AutoSklearnWrapper(time_left_for_this_task=3600, per_run_time_limit=360)

# Or get the TPOT wrapper and pass parameters like you would to TPOT
clf = TPOTWrapper(generations=100, population_size=100, verbosity=2)

# Execute the task
run = openml.runs.run_model_on_task(task, clf)
run.publish()
print('URL for run: %s/run/%d' % (openml.config.server, run.run_id))
```
Preprocessing the data
To make the wrappers more robust, we need to preprocess the data: we impute missing values and then one-hot encode the categorical features.
First, we get a mask that tells us whether or not each feature is categorical.
```python
dataset = task.get_dataset()
_, categorical = dataset.get_data(return_categorical_indicator=True)
categorical = categorical[:-1]  # Remove last index (which is the class)
```
Next, we set up a preprocessing pipeline. We use the ConditionalImputer, an imputer that can apply different strategies to categorical (nominal) and numerical data.
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from arbok import ConditionalImputer

preprocessor = make_pipeline(
    ConditionalImputer(
        categorical_features=categorical,
        strategy="mean",
        strategy_nominal="most_frequent"
    ),
    OneHotEncoder(
        categorical_features=categorical,
        handle_unknown="ignore",
        sparse=False
    )
)
```
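The idea behind conditional imputation can be sketched in plain NumPy: apply the mean strategy to numeric columns and the most-frequent strategy to nominal ones. This is an illustrative sketch only, not arbok's ConditionalImputer implementation:

```python
import numpy as np

def conditional_impute(X, categorical):
    """Fill missing cells: column mean for numeric, mode for nominal columns."""
    X = X.copy()
    for j, is_cat in enumerate(categorical):
        col = X[:, j]
        # Treat None and NaN as missing (NaN is the only value where v != v)
        missing = np.array([v is None or v != v for v in col])
        if not missing.any():
            continue
        if is_cat:
            values, counts = np.unique(col[~missing], return_counts=True)
            fill = values[np.argmax(counts)]  # most frequent value
        else:
            fill = np.mean(col[~missing].astype(float))  # column mean
        col[missing] = fill
    return X

X = np.array([[1.0, "a"],
              [np.nan, "b"],
              [3.0, None],
              [4.0, "b"]], dtype=object)
print(conditional_impute(X, categorical=[False, True]))
```

After imputation, the NaN in the numeric column becomes the mean of the remaining values, and the None in the nominal column becomes the most frequent label.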
Finally, we pass everything to the wrapper.
```python
clf = AutoSklearnWrapper(
    preprocessor=preprocessor,
    time_left_for_this_task=3600,
    per_run_time_limit=360
)
```
Limitations
- Only classifiers are implemented at the moment, so regression is not possible.
- For TPOT, the config_dict variable cannot be set, because it causes problems in the API.
Benchmarking
Installing the arbok package also installs the arbench CLI tool. We can generate a config JSON file like this:
```python
from arbok.bench import Benchmark

bench = Benchmark()

config_file = bench.create_config_file(
    # Wrapper parameters
    wrapper={"refit": True, "verbose": False, "retry_on_error": True},

    # TPOT parameters
    tpot={
        "max_time_mins": 6,      # Max total time in minutes
        "max_eval_time_mins": 1  # Max time per candidate in minutes
    },

    # Autosklearn parameters
    autosklearn={
        "time_left_for_this_task": 360,  # Max total time in seconds
        "per_run_time_limit": 60         # Max time per candidate in seconds
    }
)
```
Then we can call arbench like this:
```
arbench --classifier autosklearn --task-id 31 --config config.json
```
Or call arbok as a Python module:
```
python -m arbok --classifier autosklearn --task-id 31 --config config.json
```
Running benchmarks on a batch system
To run a large-scale benchmark, we can create a config file as above, generate jobs and submit them to a batch system, as follows.
```python
import openml

# We create a benchmark setup where we specify the headers, the interpreter we
# want to use, the directory where we store the jobs (.sh files), and we give
# it the config file we created earlier.
bench = Benchmark(
    headers="#PBS -lnodes=1:cpu3\n#PBS -lwalltime=1:30:00",
    python_interpreter="python3",  # Path to interpreter
    root="/path/to/project/",
    jobs_dir="jobs",
    config_file="config.json",
    log_file="log.json"
)

# Create the config file like we did in the section above
config_file = bench.create_config_file(
    # Wrapper parameters
    wrapper={"refit": True, "verbose": False, "retry_on_error": True},

    # TPOT parameters
    tpot={
        "max_time_mins": 6,      # Max total time in minutes
        "max_eval_time_mins": 1  # Max time per candidate in minutes
    },

    # Autosklearn parameters
    autosklearn={
        "time_left_for_this_task": 360,  # Max total time in seconds
        "per_run_time_limit": 60         # Max time per candidate in seconds
    }
)

# Next, we load the tasks we want to benchmark on from OpenML.
# In this case, we load a list of task ids from study 99.
tasks = openml.study.get_study(99).tasks

# Next, we create jobs for both tpot and autosklearn.
bench.create_jobs(tasks, classifiers=["tpot", "autosklearn"])

# And finally, we submit the jobs using qsub
bench.submit_jobs()
```
The ParamPreprocessor
```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from arbok import ParamPreprocessor

X = np.array([
    [1, 2, True, "foo", "one"],
    [1, 3, False, "bar", "two"],
    [np.nan, "bar", None, None, "three"],
    [1, 7, 0, "zip", "four"],
    [1, 9, 1, "foo", "five"],
    [1, 10, 0.1, "zip", "six"]
], dtype=object)

# Manually specify types, or use types="detect" to automatically detect types
types = ["numeric", "mixed", "bool", "nominal", "nominal"]

pipeline = make_pipeline(ParamPreprocessor(types="detect"), VarianceThreshold())
pipeline.fit_transform(X)
```
Output:
```
[[-0.4472136  -0.4472136   1.41421356 -0.70710678 -0.4472136  -0.4472136   2.23606798 -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136  -0.85226648  1.        ]
 [-0.4472136   2.23606798 -0.70710678 -0.70710678 -0.4472136  -0.4472136  -0.4472136  -0.4472136  -0.4472136   2.23606798  0.4472136  -0.4472136  -0.5831297  -1.        ]
 [ 2.23606798 -0.4472136  -0.70710678 -0.70710678 -0.4472136  -0.4472136  -0.4472136  -0.4472136   2.23606798 -0.4472136  -2.23606798  2.23606798 -1.39054004 -1.        ]
 [-0.4472136  -0.4472136  -0.70710678  1.41421356 -0.4472136   2.23606798 -0.4472136  -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136   0.49341743 -1.        ]
 [-0.4472136  -0.4472136   1.41421356 -0.70710678  2.23606798 -0.4472136  -0.4472136  -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136   1.031691    1.        ]
 [-0.4472136  -0.4472136  -0.70710678  1.41421356 -0.4472136  -0.4472136  -0.4472136   2.23606798 -0.4472136  -0.4472136   0.4472136  -0.4472136   1.30082778  1.        ]]
```
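The recurring values in this output come from standardizing one-hot columns, which the output suggests the preprocessor does after encoding: in a column with a single 1 among six rows, the 1 becomes √5 ≈ 2.2360680 and each 0 becomes -1/√5 ≈ -0.4472136 (with two 1s, the pair is √2 ≈ 1.4142136 and -1/√2 ≈ -0.7071068). A quick check of the single-1 case:

```python
import numpy as np

# Standardize (z-score with population std) a one-hot column with one 1 in six rows
col = np.array([0, 0, 1, 0, 0, 0], dtype=float)
z = (col - col.mean()) / col.std()
print(z)  # the 1 maps to sqrt(5), each 0 to -1/sqrt(5)
```

This explains why the matrix has fourteen columns (each nominal level becomes its own standardized column) even though the input only had five features.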