Best way to use a larger file dataset in azureml on Docker-based AmlCompute

Published 2024-05-20 08:46:21


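What is the recommended way to use a FileDataset on AmlCompute when submitting an Estimator-based run (with Docker enabled)?

My file dataset is about 1.5 GB and contains 1,000 images.
I also have a tabular dataset that references the images in that file dataset. Depending on the model I am trying to train, this tabular dataset contains either classes or references to other (mask) images.

So, to load an image into memory (as an np.array), I have to read it from the file location based on the file name in the tabular dataset.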
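
The lookup described above can be sketched roughly as follows. This is only a minimal sketch under assumptions: the tabular dataset is represented as a list of row dicts, and the column name `filename` and the helper `resolve_image_paths` are hypothetical names that do not come from the question. Decoding the image is left to whatever loader the training script already uses.

```python
import os


def resolve_image_paths(rows, image_root, filename_column="filename"):
    """Map each table row to the full path of its image under image_root.

    Returns (found, missing): found is a list of (row, path) pairs for files
    that exist on disk, missing is a list of file names that were not found.
    """
    found, missing = [], []
    for row in rows:
        name = row[filename_column]  # hypothetical column name
        path = os.path.join(image_root, name)
        if os.path.isfile(path):
            found.append((row, path))
        else:
            missing.append(name)
    return found, missing
```

Checking for missing files up front makes it easier to tell a slow mount apart from a wrong file name before any training starts.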

At this point I see two options, but neither is workable, because each takes more than an hour to complete:

Mounting the file dataset

image_dataset = ws.datasets['imagedata']
mounted_images = image_dataset.mount()
mounted_images.start()
print('Data set mounted', datetime.datetime.now())

load_image(mounted_images.mount_point + '/myfilename.png')
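
For reference, a mount can also be scoped with a `with` block so that it is released when the work is done. The sketch below assumes that `file_dataset.mount()` returns a context manager exposing `mount_point`, which is how the azureml-core `MountContext` is documented. The helper name `list_mounted_files` is hypothetical, and the sketch does not by itself make the mount any faster.

```python
import os


def list_mounted_files(file_dataset, limit=5):
    """Mount file_dataset, list up to `limit` relative file paths, then unmount."""
    # Assumption: mount() returns a context manager with a `mount_point`
    # attribute, as the azureml-core MountContext does.
    with file_dataset.mount() as mount_context:
        root = mount_context.mount_point
        names = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in sorted(filenames):
                full_path = os.path.join(dirpath, filename)
                names.append(os.path.relpath(full_path, root))
                if len(names) >= limit:
                    return names
        return names
```

Listing a handful of files right after mounting is a cheap way to check whether the delay comes from the mount itself or from reading the images later.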

Downloading the dataset

image_dataset = ws.datasets['chart-imagedata']
image_dataset.download(target_path='chartimages', overwrite=False)

I am looking for the fastest way to start an estimator on AmlCompute and to access the files as quickly and easily as possible.
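
One thing that might be worth trying, sketched here under assumptions: copy only the files that the tabular dataset references from the mount point to local disk, using several threads. The helper `copy_needed_files` and its parameters are hypothetical, and whether this is actually faster than a full download on this cluster has not been measured.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def copy_needed_files(file_names, source_root, target_root, max_workers=16):
    """Copy only the listed files from source_root to target_root in parallel."""
    os.makedirs(target_root, exist_ok=True)

    def copy_one(name):
        target = os.path.join(target_root, name)
        # Create sub-folders when the file name contains a relative path.
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        if not os.path.exists(target):  # skip files copied by an earlier call
            shutil.copyfile(os.path.join(source_root, name), target)
        return target

    # Threads suit this job because the work is I/O-bound, not CPU-bound.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(copy_one, file_names))
```

The returned list keeps the order of `file_names`, so it can be zipped back against the table rows.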

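I looked at this post on Stack Overflow, which suggests updating the azureml SDK packages from within the train.py script. I have applied that, but it made no difference.

Edit (more information):

The data source is Azure Blob Storage (the storage account has ADLS 2.0 enabled).
My compute target is of size STANDARD_D2_V2 (a cluster of 0-4 nodes, but only 1 node is used).

The train.py I am using (for reproduction purposes only):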

# Force latest prerelease version of certain packages
import subprocess
import sys

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "--pre", package])

install('azureml-core')
install('azureml-sdk')

# General references
import argparse
import os
import numpy as np
import pandas as pd
import datetime

from azureml.core import Workspace, Dataset, Datastore, Run, Experiment

import time

ws = Run.get_context().experiment.workspace

# Download file data set
print('Downloading data set', datetime.datetime.now())
image_dataset = ws.datasets['chart-imagedata']
image_dataset.download(target_path='chartimages', overwrite=False)
print('Data set downloaded', datetime.datetime.now())


# mount file data set
print('Mounting data set', datetime.datetime.now())
image_dataset = ws.datasets['chart-imagedata']
mounted_images = image_dataset.mount()
mounted_images.start()
print('Data set mounted', datetime.datetime.now())

print('Training finished')
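
The timestamps printed in the script above can also be collected as elapsed times, which makes the download and mount steps easier to compare across runs. This is a small sketch using only the standard library; the name `timed` and the optional `results` dict are hypothetical and not part of azureml.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took, optionally storing it in results."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.1f} s")
```

Usage would look like `with timed('download'): image_dataset.download(target_path='chartimages', overwrite=False)`.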

I am using the TensorFlow estimator:

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.dnn import TensorFlow


# Choose a name for your cluster (note: STANDARD_D2_V2 is a CPU-only VM size)
gpu_cluster_name = "g-train-cluster"

# Reuse the cluster if it already exists, otherwise create it
try:
    gpu_cluster = ComputeTarget(workspace=ws, name=gpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4, min_nodes=0)
    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, compute_config)
    print('Creating new cluster')

constructor_parameters = {
    'source_directory':training_name,
    'script_params':script_parameters,
    'compute_target':gpu_cluster,
    'entry_script':'train.py',
    'pip_requirements_file':'requirements.txt', 
    'use_gpu':True,
    'framework_version': '2.0',
    'use_docker':True}

estimator = TensorFlow(**constructor_parameters)
run = self.__experiment.submit(estimator)
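
An alternative to mounting or downloading inside train.py is to hand the dataset to the estimator as a named input and let the run make it available before the script starts. The sketch below assumes the azureml-core v1 calls `FileDataset.as_named_input`, `as_mount`, `as_download` and `Run.input_datasets`. The helper names and the input name `images` are hypothetical, and nothing here has been tested against this workspace or SDK version.

```python
def build_dataset_inputs(file_dataset, input_name="images", mode="mount"):
    """Return a list that could be passed as the estimator's `inputs` parameter."""
    # Assumption: file_dataset behaves like an azureml-core FileDataset.
    named = file_dataset.as_named_input(input_name)
    if mode == "mount":
        return [named.as_mount()]
    if mode == "download":
        return [named.as_download()]
    raise ValueError(f"unknown mode: {mode!r}")


def get_input_path(run, input_name="images"):
    """Inside train.py: local path where the named input was made available."""
    # Assumption: run behaves like the object from azureml.core.Run.get_context().
    return run.input_datasets[input_name]
```

On the submitting side this would be `constructor_parameters['inputs'] = build_dataset_inputs(ws.datasets['chart-imagedata'])`, and in train.py `image_root = get_input_path(Run.get_context())`.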


0 answers

No answers yet
