I'm trying to implement this distributed Keras Tuner example on Google Cloud Platform (GCP) ML Engine (a.k.a. AI Platform): https://github.com/keras-team/keras-tuner/blob/master/docs/templates/tutorials/distributed-tuning.md
Here is my ML training input YAML:
scaleTier: CUSTOM
masterType: standard
masterConfig:
  imageUri: tensorflow/tensorflow:2.1.0-gpu-py3
workerCount: 8
workerType: standard_gpu
workerConfig:
  imageUri: tensorflow/tensorflow:2.1.0-gpu-py3
At the top of the Python script, I added:
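import json
import os

# TF_CONFIG describes the cluster and this task's role in it.
tf_config = json.loads(os.environ['TF_CONFIG'])
cluster = tf_config['cluster']
task = tf_config['task']
# The master entry has the form 'host:port'; keep only the host part.
master_addr = cluster['master'][0].split(':')
os.environ['KERASTUNER_ORACLE_IP'] = master_addr[0]
os.environ['KERASTUNER_ORACLE_PORT'] = '8000'
# The master runs the oracle as 'chief'; every other task is a tuner.
if task['type'] == 'master':
    os.environ['KERASTUNER_TUNER_ID'] = 'chief'
else:
    os.environ['KERASTUNER_TUNER_ID'] = 'tuner{}'.format(task['index'])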
Unfortunately, this doesn't work. The master returns this error:
server_chttp2.cc:40] {"created":"@1580940408.588629852","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":395,"referenced_errors":[{"created":"@1580940408.588623412","description":"Unable to configure socket","fd":22,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":208,"referenced_errors":[{"created":"@1580940408.588609041","description":"Cannot assign requested address","errno":99,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Cannot assign requested address","syscall":"bind"}]}]}
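So it seems the master is unable to bind to the listening port.
I guess the real question is: how do I bind to a listening port on GCP ML Engine? Is that even allowed?
Any insight into how to run distributed Keras tuning on GCP ML Engine would be much appreciated.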
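KERASTUNER_ORACLE_IP needs an IP address, not a hostname.
This is the function I use in my project; see https://github.com/vlasenkoalexey/gcp_runner/blob/master/entry_point.ipynb
For TF 2.x, "master" in TF_CONFIG is replaced by "chief". You can pass use-chief-in-tf-config to update it. Confirmed to work on Google AI Platform and Kubernetes.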
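To illustrate the hostname-versus-IP point, here is a minimal sketch (not the function from the linked notebook). It assumes TF_CONFIG lists the oracle host under either "chief" or "master" as "host:port". The name oracle_ip_from_tf_config and the resolve parameter are hypothetical and only there for illustration.

```python
import json
import socket


def oracle_ip_from_tf_config(tf_config_json, resolve=socket.gethostbyname):
    """Return the chief's (or master's) address as an IP string.

    `resolve` is a hypothetical hook that maps a hostname to an IP; it
    defaults to a normal lookup via socket.gethostbyname.
    """
    cluster = json.loads(tf_config_json)["cluster"]
    # TF 2.x style configs use "chief"; older ones use "master".
    entries = cluster.get("chief") or cluster.get("master")
    if not entries:
        raise KeyError("TF_CONFIG has neither a 'chief' nor a 'master' entry")
    # Entries look like "host:port"; drop the port before resolving.
    host = entries[0].rsplit(":", 1)[0]
    return resolve(host)
```

In the training script this would be used roughly as os.environ['KERASTUNER_ORACLE_IP'] = oracle_ip_from_tf_config(os.environ['TF_CONFIG']), before the tuner is created.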
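I had a similar problem to the OP, as the error message indicates. I'm not sure what the real cause is, but for me the fix was to bind to 0.0.0.0 on the chief (i.e.
os.environ['KERASTUNER_ORACLE_IP'] = '0.0.0.0'
), while still using the chief's IP from TF_CONFIG.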
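A rough sketch of that workaround is below. It assumes that the non-chief tasks are the ones that keep pointing at the chief's IP, which is my reading of the answer rather than something it spells out. The helper tuner_env and its parameters are hypothetical names; only the KERASTUNER_* variables come from the thread.

```python
def tuner_env(tf_config, chief_ip, port="8000"):
    """Build the KERASTUNER_* variables for the task described by tf_config.

    tf_config is the parsed TF_CONFIG dict; chief_ip is the chief's IP
    address, already resolved from the cluster spec.
    """
    task = tf_config["task"]
    is_chief = task["type"] in ("chief", "master")
    if is_chief:
        tuner_id = "chief"
    else:
        tuner_id = "tuner{}".format(task["index"])
    return {
        # Assumption: the chief binds on all interfaces, while every
        # other task connects to the chief's IP.
        "KERASTUNER_ORACLE_IP": "0.0.0.0" if is_chief else chief_ip,
        "KERASTUNER_ORACLE_PORT": port,
        "KERASTUNER_TUNER_ID": tuner_id,
    }
```

The result can be applied with os.environ.update(tuner_env(tf_config, chief_ip)) at the top of the script, before any Keras Tuner object is constructed.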