A wrapper toolbox that provides a compatibility layer between TPOT, auto-sklearn, and OpenML

Detailed description of the arbok Python project


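arbok (AutoML wrapper toolbox for OpenML compatibility) provides wrappers for TPOT and auto-sklearn, acting as a compatibility layer between these tools and OpenML.

The wrappers extend sklearn's BaseSearchCV and provide the internal attributes that OpenML needs, such as cv_results_, best_index_, best_params_, best_score_, and classes_.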
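
To make the role of these attributes concrete, here is a minimal sketch in plain Python of an object that records one score per evaluated candidate and derives the best_* attributes from those records. The class name TinySearchResult and its record method are hypothetical and exist only for illustration; this is not how arbok or scikit-learn implement them.

```python
class TinySearchResult:
    """Hypothetical stand-in that mimics the attribute names of a fitted search object."""

    def __init__(self):
        self._params = []
        self._scores = []

    def record(self, params, score):
        # Store one evaluated candidate: its parameters and its validation score.
        self._params.append(dict(params))
        self._scores.append(float(score))

    @property
    def cv_results_(self):
        # scikit-learn uses a dict of parallel sequences; two of its keys are mirrored here.
        return {"params": list(self._params), "mean_test_score": list(self._scores)}

    @property
    def best_index_(self):
        # Position of the highest-scoring candidate in cv_results_.
        return max(range(len(self._scores)), key=self._scores.__getitem__)

    @property
    def best_params_(self):
        return self._params[self.best_index_]

    @property
    def best_score_(self):
        return self._scores[self.best_index_]


result = TinySearchResult()
result.record({"n_estimators": 10}, 0.71)
result.record({"n_estimators": 100}, 0.78)
result.record({"n_estimators": 50}, 0.75)
print(result.best_index_, result.best_params_, result.best_score_)
```

The point of the sketch is that best_index_ is just a pointer into cv_results_, so the remaining best_* attributes can be derived from it.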

Installation

pip install arbok

Simple example

import openml
from arbok import AutoSklearnWrapper, TPOTWrapper

task = openml.tasks.get_task(31)
dataset = task.get_dataset()

# Get the AutoSklearn wrapper and pass parameters like you would to AutoSklearn
clf = AutoSklearnWrapper(time_left_for_this_task=3600, per_run_time_limit=360)

# Or get the TPOT wrapper and pass parameters like you would to TPOT
clf = TPOTWrapper(generations=100, population_size=100, verbosity=2)

# Execute the task
run = openml.runs.run_model_on_task(task, clf)
run.publish()
print('URL for run: %s/run/%d' % (openml.config.server, run.run_id))

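Preprocessing data

To make the wrappers more robust, we need to preprocess the data. We can fill in the missing values and one-hot encode the categorical data.

First, we get a mask that tells us whether each feature is categorical or not.

dataset = task.get_dataset()
_, categorical = dataset.get_data(return_categorical_indicator=True)
categorical = categorical[:-1]  # Remove last index (which is the class)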

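Next, we set up a pipeline for the preprocessing. We use a ConditionalImputer, which is an imputer that can apply different strategies to categorical (nominal) and numerical data.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from arbok import ConditionalImputer

# Note: OneHotEncoder's categorical_features argument was removed in
# scikit-learn 0.22, so this snippet targets older scikit-learn releases.
preprocessor = make_pipeline(
    ConditionalImputer(
        categorical_features=categorical,
        strategy="mean",
        strategy_nominal="most_frequent"
    ),
    OneHotEncoder(
        categorical_features=categorical, handle_unknown="ignore", sparse=False
    )
)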
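
The idea behind imputing conditionally on the column type can be sketched in plain Python. This is an illustration of the technique under simple assumptions (missing cells are None, every column has at least one observed value); the function conditional_impute is hypothetical and is not arbok's ConditionalImputer.

```python
from collections import Counter


def conditional_impute(rows, categorical, missing=None):
    """Fill missing cells: column mean for numeric columns, mode for nominal ones."""
    n_cols = len(categorical)
    fill_values = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not missing]
        if categorical[j]:
            # Nominal column: use the most frequent observed value.
            fill_values.append(Counter(observed).most_common(1)[0][0])
        else:
            # Numeric column: use the arithmetic mean of the observed values.
            fill_values.append(sum(observed) / len(observed))
    return [
        [fill_values[j] if row[j] is missing else row[j] for j in range(n_cols)]
        for row in rows
    ]


rows = [
    [1.0, "a"],
    [None, "b"],
    [3.0, None],
    [5.0, "b"],
]
categorical = [False, True]  # plays the same role as the mask from get_data
imputed = conditional_impute(rows, categorical)
print(imputed)
```

The boolean mask decides which strategy each column receives, which is why the class column has to be dropped from the mask before it is passed on.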

Finally, we put everything together in one of the wrappers.

clf=AutoSklearnWrapper(preprocessor=preprocessor,time_left_for_this_task=3600,per_run_time_limit=360)

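Limitations

  • Currently only classifiers are implemented; regression is therefore not possible.
  • For TPOT, the config_dict variable cannot be set, because doing so causes problems with the API.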

Benchmarking

Installing the arbok package also installs the arbench CLI tool. We can generate a JSON config file like this:

from arbok.bench import Benchmark

bench = Benchmark()
config_file = bench.create_config_file(

    # Wrapper parameters
    wrapper={"refit": True, "verbose": False, "retry_on_error": True},

    # TPOT parameters
    tpot={
        "max_time_mins": 6,              # Max total time in minutes
        "max_eval_time_mins": 1          # Max time per candidate in minutes
    },

    # Autosklearn parameters
    autosklearn={
        "time_left_for_this_task": 360,  # Max total time in seconds
        "per_run_time_limit": 60         # Max time per candidate in seconds
    }
)
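
This description does not show what the generated file contains. Assuming it simply stores the three keyword arguments as top-level JSON objects (an assumption, not confirmed by the source), an equivalent file could be written and read back with the standard library alone:

```python
import json
import os
import tempfile

# Assumed layout: one top-level key per keyword argument of create_config_file.
config = {
    "wrapper": {"refit": True, "verbose": False, "retry_on_error": True},
    "tpot": {"max_time_mins": 6, "max_eval_time_mins": 1},
    "autosklearn": {"time_left_for_this_task": 360, "per_run_time_limit": 60},
}

# Write to a temporary directory so the sketch does not overwrite a real config.json.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as fh:
    json.dump(config, fh, indent=2)

# Read the file back to confirm that the round trip preserves the settings.
with open(config_path) as fh:
    loaded = json.load(fh)

# TPOT budgets are given in minutes, auto-sklearn budgets in seconds.
tpot_total_seconds = loaded["tpot"]["max_time_mins"] * 60
print(tpot_total_seconds, loaded["autosklearn"]["time_left_for_this_task"])
```

Note the unit difference: with the values above, both tools get the same budget (6 minutes in total and 1 minute per candidate), even though the numbers look different.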

Then we can call arbench like this:

arbench --classifier autosklearn --task-id 31 --config config.json

Or call arbok as a Python module:

python -m arbok --classifier autosklearn --task-id 31 --config config.json

Running a benchmark on a batch system

To run a large-scale benchmark, we can create a config file as above, then generate jobs and submit them to a batch system as follows.

import openml
from arbok.bench import Benchmark

# We create a benchmark setup where we specify the headers, the interpreter we
# want to use, the directory to where we store the jobs (.sh-files), and we give
# it the config-file we created earlier.
bench = Benchmark(
    headers="#PBS -lnodes=1:cpu3\n#PBS -lwalltime=1:30:00",
    python_interpreter="python3",  # Path to interpreter
    root="/path/to/project/",
    jobs_dir="jobs",
    config_file="config.json",
    log_file="log.json"
)

# Create the config file like we did in the section above
config_file = bench.create_config_file(

    # Wrapper parameters
    wrapper={"refit": True, "verbose": False, "retry_on_error": True},

    # TPOT parameters
    tpot={
        "max_time_mins": 6,              # Max total time in minutes
        "max_eval_time_mins": 1          # Max time per candidate in minutes
    },

    # Autosklearn parameters
    autosklearn={
        "time_left_for_this_task": 360,  # Max total time in seconds
        "per_run_time_limit": 60         # Max time per candidate in seconds
    }
)

# Next, we load the tasks we want to benchmark on from OpenML.
# In this case, we load a list of task id's from study 99.
tasks = openml.study.get_study(99).tasks

# Next, we create jobs for both tpot and autosklearn.
bench.create_jobs(tasks, classifiers=["tpot", "autosklearn"])

# And finally, we submit the jobs using qsub
bench.submit_jobs()
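
To illustrate what generating jobs amounts to, the sketch below builds one shell script per (classifier, task) pair from the scheduler headers and the command line shown earlier. The helper build_job_script, the "#!/bin/bash" first line, and the task id 37 are assumptions made for illustration; the scripts arbok actually writes may look different.

```python
def build_job_script(headers, python_interpreter, classifier, task_id, config_file):
    """Return the text of a shell script that runs one benchmark (hypothetical helper)."""
    command = (
        f"{python_interpreter} -m arbok "
        f"--classifier {classifier} --task-id {task_id} --config {config_file}"
    )
    # Assumed script layout: shebang, scheduler headers, then the command itself.
    return "\n".join(["#!/bin/bash", headers, command]) + "\n"


headers = "#PBS -lnodes=1:cpu3\n#PBS -lwalltime=1:30:00"
task_ids = [31, 37]  # 31 appears above; 37 is an illustrative placeholder
scripts = {
    (clf, task_id): build_job_script(headers, "python3", clf, task_id, "config.json")
    for task_id in task_ids
    for clf in ["tpot", "autosklearn"]
}
print(scripts[("tpot", 31)])
```

Each script is self-contained, so a batch system can schedule the (classifier, task) combinations independently of one another.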

Preprocessing parameters

from arbok import ParamPreprocessor
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline

X = np.array([
    [1, 2, True, "foo", "one"],
    [1, 3, False, "bar", "two"],
    [np.nan, "bar", None, None, "three"],
    [1, 7, 0, "zip", "four"],
    [1, 9, 1, "foo", "five"],
    [1, 10, 0.1, "zip", "six"]
], dtype=object)

# Manually specify types, or use types="detect" to automatically detect types
types = ["numeric", "mixed", "bool", "nominal", "nominal"]

pipeline = make_pipeline(ParamPreprocessor(types="detect"), VarianceThreshold())

pipeline.fit_transform(X)

Output:

[[-0.4472136  -0.4472136   1.41421356 -0.70710678 -0.4472136  -0.4472136
   2.23606798 -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136
  -0.85226648  1.        ]
 [-0.4472136   2.23606798 -0.70710678 -0.70710678 -0.4472136  -0.4472136
  -0.4472136  -0.4472136  -0.4472136   2.23606798  0.4472136  -0.4472136
  -0.5831297  -1.        ]
 [ 2.23606798 -0.4472136  -0.70710678 -0.70710678 -0.4472136  -0.4472136
  -0.4472136  -0.4472136   2.23606798 -0.4472136  -2.23606798  2.23606798
  -1.39054004 -1.        ]
 [-0.4472136  -0.4472136  -0.70710678  1.41421356 -0.4472136   2.23606798
  -0.4472136  -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136
   0.49341743 -1.        ]
 [-0.4472136  -0.4472136   1.41421356 -0.70710678  2.23606798 -0.4472136
  -0.4472136  -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136
   1.031691    1.        ]
 [-0.4472136  -0.4472136  -0.70710678  1.41421356 -0.4472136  -0.4472136
  -0.4472136   2.23606798 -0.4472136  -0.4472136   0.4472136  -0.4472136
   1.30082778  1.        ]]
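
Many of the values above are what one gets by standardizing a 0/1 indicator column to zero mean and unit variance. This is an inference from the numbers, not a documented description of ParamPreprocessor; the helper standardize below is a hypothetical sketch that reproduces those values.

```python
import math


def standardize(column):
    """Scale a column to zero mean and unit (population) standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]


# An indicator that is 1 in one of six rows, e.g. a category that occurs once.
once = standardize([0, 1, 0, 0, 0, 0])
# An indicator that is 1 in two of six rows, e.g. "foo" in the first and fifth row.
twice = standardize([1, 0, 0, 0, 1, 0])

print([round(v, 8) for v in once])
print([round(v, 8) for v in twice])
```

An indicator set in one of six rows yields sqrt(5) and -1/sqrt(5), while one set in two of six rows yields sqrt(2) and -1/sqrt(2), which matches the recurring 2.23606798, -0.4472136, 1.41421356, and -0.70710678 entries in the output.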
