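<p>I don't know where this comes from or why this error occurs:</p>
<p>The cluster launches from the YAML, but when I look at the logs I see this error.</p>
<p>Is it still working despite the error? How can I check the printed output from the Docker image?</p>
<p>Ray does not seem to have any "working" example to follow. I am trying to launch the simplest possible version of an AWS Docker cluster as a proof of concept.</p>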
<pre><code> ray exec /home/user/unit/aws_docker_simple/simple.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Fetched IP: xxxxxxxxx
Warning: Permanently added 'xxxxxxxxx' (ECDSA) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==
Error in sys.excepthook:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 854, in custom_excepthook
    worker_id = global_worker.worker_id
AttributeError: 'Worker' object has no attribute 'worker_id'
Original exception was:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 390, in <module>
    redis_password=args.redis_password)
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 111, in __init__
    self.load_metrics)
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 76, in __init__
    self.reset(errors_fatal=True)
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 490, in reset
    raise e
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 452, in reset
    self.available_node_types = self.config["available_node_types"]
KeyError: 'available_node_types'
==> /tmp/ray/session_latest/logs/monitor.log <==
==> /tmp/ray/session_latest/logs/monitor.out <==
Shared connection to 18.130.107.42 closed.
Error: Command failed:
ssh -tt -i /home/joe/.ssh/aws_ubuntu_test.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ff32489f9/8dbdda48fb/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@xxxxxxxx bash --login -c -i ''"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it my_simple_docker_container /bin/bash -c '"'"'"'"'"'"'"'"'bash --login -c -i '"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (tail -n 100 -f /tmp/ray/session_latest/logs/monitor*)'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"''"'"'"'"'"'"'"'"' )'"'"''
(base) xxxxx:~/RAY_AWS_DOCKER/3xxxxx/aws_docker_simple$ ray exec /home/xxxxxxxxx/unit/aws_docker_simple/simple.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: xxxxxx
Warning: Permanently added 'xxxxxxxx' (ECDSA) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==
Error in sys.excepthook:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 854, in custom_excepthook
    worker_id = global_worker.worker_id
AttributeError: 'Worker' object has no attribute 'worker_id'
Original exception was:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 390, in <module>
    redis_password=args.redis_password)
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 111, in __init__
    self.load_metrics)
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 76, in __init__
    self.reset(errors_fatal=True)
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 490, in reset
    raise e
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 452, in reset
    self.available_node_types = self.config["available_node_types"]
KeyError: 'available_node_types'
</code></pre>
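<p>To check that I am reading the traceback correctly, here is a minimal sketch of what the last frame appears to do: a plain dictionary lookup on the parsed config. The <code>config</code> dict is a hand-written stand-in holding only a few of the top-level keys from my YAML, and <code>missing_keys</code> is a hypothetical helper of my own, not part of Ray.</p>

```python
# Hand-written stand-in for the parsed cluster config (assumption: only a
# few top-level keys from simple.yaml are reproduced; this is not how Ray
# actually loads the file).
config = {
    "cluster_name": "simple",
    "min_workers": 0,
    "max_workers": 2,
    "head_node": {"InstanceType": "c5.2xlarge"},
    "worker_nodes": {"InstanceType": "c5.2xlarge"},
}


def missing_keys(cfg, required=("available_node_types",)):
    """Hypothetical helper: list required top-level keys absent from cfg."""
    return [key for key in required if key not in cfg]


# Same lookup pattern as the last frame of the traceback (autoscaler.py).
try:
    config["available_node_types"]
except KeyError as exc:
    print("KeyError:", exc)  # prints: KeyError: 'available_node_types'

print("missing:", missing_keys(config))
```

<p>If that reading is right, the autoscaler inside the container expects a top-level <code>available_node_types</code> entry that my YAML never defines.</p>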
<p>Dockerfile:</p>
<pre><code>FROM continuumio/miniconda3:4.7.10
CMD ["mkdir", "hello_folder"]
CMD ["echo", "Hello StackOverflow!"]
</code></pre>
<p>YAML:</p>
<pre><code>cluster_name: simple
min_workers: 0
max_workers: 2
docker:
    image: "xxxxxx/simple "
    container_name: "my_simple_docker_container"
    pull_before_run: True
idle_timeout_minutes: 5
initialization_commands:
    # - curl https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh --output anaconda.sh
    # - bash anaconda.sh
    # - conda install python=3.8
    - sudo apt-get update
    - sudo apt-get upgrade
    - sudo apt-get install -y python-setuptools
    - sudo apt-get install -y build-essential curl unzip psmisc
    - pip install --upgrade pip
    - pip install discord
    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f
provider:
    type: aws
    region: eu-west-2
    availability_zone: eu-west-2a
file_mounts_sync_continuously: False
auth:
    ssh_user: ubuntu
    ssh_private_key: /home/user/.ssh/aws_ubuntu_test.pem
head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxxfd2c
    KeyName: aws_ubuntu_test
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 200
worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxxfd2c
    KeyName: aws_ubuntu_test
    InstanceMarketOptions:
        MarketType: spot
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}
setup_commands:
    - conda install python=3.7
    - conda create --name ray
    - conda activate ray
    - conda install --name ray pip
    - pip install --upgrade pip
    - pip install discord
    - pip install ray
head_setup_commands:
    - pip install boto3==1.4.8
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
</code></pre>