Dask Kubernetes worker pod gives Error status

(base) [root@k8s-master example]# ls
dask_example.py  worker-spec.yml
(base) [root@k8s-master example]# nohup python dask_example.py &
[1] 3660
(base) [root@k8s-master example]# cat nohup.out
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://172.16.0.76:40119
distributed.scheduler - INFO - Receive client connection: Client-df4caa18-0bc8-11ea-8e4c-12bd5ffa93ff
distributed.core - INFO - Starting established connection
(base) [root@k8s-master example]# kubectl get pods -o wide --all-namespaces
NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
default       workerpod                            1/1     Running   0          70s     10.32.0.2      worker-node1   <none>           <none>
kube-system   coredns-5644d7b6d9-l4jsd             1/1     Running   0          8m19s   10.32.0.4      k8s-master     <none>           <none>
kube-system   coredns-5644d7b6d9-q679h             1/1     Running   0          8m19s   10.32.0.3      k8s-master     <none>           <none>
kube-system   etcd-k8s-master                      1/1     Running   0          7m16s   172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-apiserver-k8s-master            1/1     Running   0          7m1s    172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-controller-manager-k8s-master   1/1     Running   0          7m27s   172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-proxy-ctgj8                     1/1     Running   0          5m7s    172.16.0.114   worker-node2   <none>           <none>
kube-system   kube-proxy-f78bm                     1/1     Running   0          8m18s   172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-proxy-ksk59                     1/1     Running   0          5m15s   172.16.0.31    worker-node1   <none>           <none>
kube-system   kube-scheduler-k8s-master            1/1     Running   0          7m2s    172.16.0.76    k8s-master     <none>           <none>
kube-system   weave-net-q2zwn                      2/2     Running   0          6m22s   172.16.0.76    k8s-master     <none>           <none>
kube-system   weave-net-r9tzs                      2/2     Running   0          5m15s   172.16.0.31    worker-node1   <none>           <none>
kube-system   weave-net-tm8xx                      2/2     Running   0          5m7s    172.16.0.114   worker-node2   <none>           <none>
(base) [root@k8s-master example]# kubectl get pods -o wide --all-namespaces
NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
default       workerpod                            0/1     Error     0          4m23s   10.32.0.2      worker-node1   <none>           <none>
kube-system   coredns-5644d7b6d9-l4jsd             1/1     Running   0          11m     10.32.0.4      k8s-master     <none>           <none>
kube-system   coredns-5644d7b6d9-q679h             1/1     Running   0          11m     10.32.0.3      k8s-master     <none>           <none>
kube-system   etcd-k8s-master                      1/1     Running   0          10m     172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-apiserver-k8s-master            1/1     Running   0          10m     172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-controller-manager-k8s-master   1/1     Running   0          10m     172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-proxy-ctgj8                     1/1     Running   0          8m20s   172.16.0.114   worker-node2   <none>           <none>
kube-system   kube-proxy-f78bm                     1/1     Running   0          11m     172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-proxy-ksk59                     1/1     Running   0          8m28s   172.16.0.31    worker-node1   <none>           <none>
kube-system   kube-scheduler-k8s-master            1/1     Running   0          10m     172.16.0.76    k8s-master     <none>           <none>
kube-system   weave-net-q2zwn                      2/2     Running   0          9m35s   172.16.0.76    k8s-master     <none>           <none>
kube-system   weave-net-r9tzs                      2/2     Running   0          8m28s   172.16.0.31    worker-node1   <none>           <none>
kube-system   weave-net-tm8xx                      2/2     Running   0          8m20s   172.16.0.114   worker-node2   <none>           <none>
(base) [root@k8s-master example]# cat nohup.out
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://172.16.0.76:40119
distributed.scheduler - INFO - Receive client connection: Client-df4caa18-0bc8-11ea-8e4c-12bd5ffa93ff
distributed.core - INFO - Starting established connection
(base) [root@k8s-master example]# kubectl describe pod workerpod
Name:         workerpod
Namespace:    default
Priority:     0
Node:         worker-node1/172.16.0.31
Start Time:   Wed, 20 Nov 2019 19:06:36 +0000
Labels:       app=dask
              dask.org/cluster-name=dask-root-99dcf768-4
              dask.org/component=worker
              foo=bar
              user=root
Annotations:  <none>
Status:       Failed
IP:           10.32.0.2
IPs:
  IP:  10.32.0.2
Containers:
  dask:
    Container ID:  docker://578dc575fc263c4a3889a4f2cb5e06cd82a00e03cfc6acfd7a98fef703421390
    Image:         daskdev/dask:latest
    Image ID:      docker-pullable://daskdev/dask@sha256:0a936daa94c82cea371c19a2c90c695688ab4e1e7acc905f8b30dfd419adfb6f
    Port:          <none>
    Host Port:     <none>
    Args:
      dask-worker
      --nthreads
      2
      --no-bokeh
      --memory-limit
      6GB
      --death-timeout
      60
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 20 Nov 2019 19:06:38 +0000
      Finished:     Wed, 20 Nov 2019 19:08:20 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  6G
    Requests:
      cpu:     2
      memory:  6G
    Environment:
      EXTRA_PIP_PACKAGES:      fastparquet git+https://github.com/dask/distributed
      DASK_SCHEDULER_ADDRESS:  tcp://172.16.0.76:40119
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-p9f9v (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-p9f9v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-p9f9v
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     k8s.dask.org/dedicated=worker:NoSchedule
                 k8s.dask.org_dedicated=worker:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From                   Message
  ----    ------     ----   ----                   -------
  Normal  Scheduled  5m47s  default-scheduler      Successfully assigned default/workerpod to worker-node1
  Normal  Pulled     5m45s  kubelet, worker-node1  Container image "daskdev/dask:latest" already present on machine
  Normal  Created    5m45s  kubelet, worker-node1  Created container dask
  Normal  Started    5m45s  kubelet, worker-node1  Started container dask
(base) [root@k8s-master example]#
(base) [root@k8s-master example]# kubectl get events
LAST SEEN   TYPE     REASON                    OBJECT              MESSAGE
21m         Normal   Starting                  node/k8s-master     Starting kubelet.
21m         Normal   NodeHasSufficientMemory   node/k8s-master     Node k8s-master status is now: NodeHasSufficientMemory
21m         Normal   NodeHasNoDiskPressure     node/k8s-master     Node k8s-master status is now: NodeHasNoDiskPressure
21m         Normal   NodeHasSufficientPID      node/k8s-master     Node k8s-master status is now: NodeHasSufficientPID
21m         Normal   NodeAllocatableEnforced   node/k8s-master     Updated Node Allocatable limit across pods
21m         Normal   RegisteredNode            node/k8s-master     Node k8s-master event: Registered Node k8s-master in Controller
21m         Normal   Starting                  node/k8s-master     Starting kube-proxy.
18m         Normal   Starting                  node/worker-node1   Starting kubelet.
18m         Normal   NodeHasSufficientMemory   node/worker-node1   Node worker-node1 status is now: NodeHasSufficientMemory
18m         Normal   NodeHasNoDiskPressure     node/worker-node1   Node worker-node1 status is now: NodeHasNoDiskPressure
18m         Normal   NodeHasSufficientPID      node/worker-node1   Node worker-node1 status is now: NodeHasSufficientPID
18m         Normal   NodeAllocatableEnforced   node/worker-node1   Updated Node Allocatable limit across pods
18m         Normal   Starting                  node/worker-node1   Starting kube-proxy.
18m         Normal   RegisteredNode            node/worker-node1   Node worker-node1 event: Registered Node worker-node1 in Controller
17m         Normal   NodeReady                 node/worker-node1   Node worker-node1 status is now: NodeReady
18m         Normal   Starting                  node/worker-node2   Starting kubelet.
18m         Normal   NodeHasSufficientMemory   node/worker-node2   Node worker-node2 status is now: NodeHasSufficientMemory
18m         Normal   NodeHasNoDiskPressure     node/worker-node2   Node worker-node2 status is now: NodeHasNoDiskPressure
18m         Normal   NodeHasSufficientPID      node/worker-node2   Node worker-node2 status is now: NodeHasSufficientPID
18m         Normal   NodeAllocatableEnforced   node/worker-node2   Updated Node Allocatable limit across pods
18m         Normal   Starting                  node/worker-node2   Starting kube-proxy.
17m         Normal   RegisteredNode            node/worker-node2   Node worker-node2 event: Registered Node worker-node2 in Controller
17m         Normal   NodeReady                 node/worker-node2   Node worker-node2 status is now: NodeReady
14m         Normal   Scheduled                 pod/workerpod       Successfully assigned default/workerpod to worker-node1
14m         Normal   Pulled                    pod/workerpod       Container image "daskdev/dask:latest" already present on machine
14m         Normal   Created                   pod/workerpod       Created container dask
14m         Normal   Started                   pod/workerpod       Started container dask
(base) [root@k8s-master example]#

(base) [root@k8s-master example]# kubectl exec -ti workerpod -- nslookup kubernetes.default
OCI runtime exec failed: exec failed: container_linux.go:345: starting container process caused "exec: \"nslookup\": executable file not found in $PATH": unknown
command terminated with exit code 126

(base) [root@k8s-master example]# kubectl run dnsutils -it --rm=true --restart=Never --image=tutum/dnsutils cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
pod "dnsutils" deleted
(base) [root@k8s-master example]# kubectl run dnsutils -it --restart=Never --image=tutum/dnsutils nslookup github.com
If you don't see a command prompt, try pressing enter.
;; connection timed out; no servers could be reached
pod default/dnsutils terminated (Error)
(base) [root@k8s-master example]# kubectl logs dnsutils
;; connection timed out; no servers could be reached
(base) [root@k8s-master example]#

(base) [root@k8s-master example]# kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    10.96.0.10
Address 1: 10.96.0.10
nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1
(base) [root@k8s-master example]# kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-5644d7b6d9-l4jsd   1/1     Running   0          25h
coredns-5644d7b6d9-q679h   1/1     Running   0          25h
(base) [root@k8s-master example]# kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-5644d7b6d9-l4jsd   1/1     Running   0          25h
coredns-5644d7b6d9-q679h   1/1     Running   0          25h
(base) [root@k8s-master example]# for p in $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name); do kubectl logs --namespace=kube-system $p; done
.:53
2019-11-20T19:01:42.161Z [INFO] plugin/reload: Running configuration MD5 = f64cb9b977c7dfca58c4fab108535a76
2019-11-20T19:01:42.161Z [INFO] CoreDNS-1.6.2
2019-11-20T19:01:42.161Z [INFO] linux/amd64, go1.12.8, 795a3eb
CoreDNS-1.6.2
linux/amd64, go1.12.8, 795a3eb
.:53
2019-11-20T19:01:41.862Z [INFO] plugin/reload: Running configuration MD5 = f64cb9b977c7dfca58c4fab108535a76
2019-11-20T19:01:41.862Z [INFO] CoreDNS-1.6.2
2019-11-20T19:01:41.862Z [INFO] linux/amd64, go1.12.8, 795a3eb
CoreDNS-1.6.2
linux/amd64, go1.12.8, 795a3eb
(base) [root@k8s-master example]# kubectl get service
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   26h
(base) [root@k8s-master example]# kubectl get ep kube-dns --namespace=kube-system
NAME       ENDPOINTS                                            AGE
kube-dns   10.32.0.3:53,10.32.0.4:53,10.32.0.3:53 + 3 more...   26h
(base) [root@k8s-master example]# kubectl -n kube-system edit configmap coredns
Edit cancelled, no changes made.

[root@worker-node1 ec2-user]# nslookup 10.96.0.10
Server:         172.31.0.2
Address:        172.31.0.2#53
Non-authoritative answer:
10.0.96.10.in-addr.arpa name = ip-10-96-0-10.ec2.internal.
Authoritative answers can be found from:
[root@worker-node1 ec2-user]# nslookup 10.96.0.1
Server:         172.31.0.2
Address:        172.31.0.2#53
Non-authoritative answer:
1.0.96.10.in-addr.arpa name = ip-10-96-0-1.ec2.internal.
Authoritative answers can be found from:
[root@worker-node1 ec2-user]#

[root@worker-node2 ec2-user]# nslookup 10.96.0.10
Server:         172.31.0.2
Address:        172.31.0.2#53
Non-authoritative answer:
10.0.96.10.in-addr.arpa name = ip-10-96-0-10.ec2.internal.
Authoritative answers can be found from:
[root@worker-node2 ec2-user]# nslookup 10.96.0.1
Server:         172.31.0.2
Address:        172.31.0.2#53
Non-authoritative answer:
1.0.96.10.in-addr.arpa name = ip-10-96-0-1.ec2.internal.
Authoritative answers can be found from:

1 answer

User

#1 · Posted 2024-05-19 05:06:30

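Update:

As I can see from the comments, I also need to cover some basic concepts in the answer.

For Kubernetes to work well, the cluster has to meet a few requirements.

All Kubernetes nodes must have full network connectivity.
This means that any cluster node should be able to communicate with any other cluster node using any network protocol and any port (in the case of TCP/UDP), without NAT. Some cloud environments require additional custom firewall rules to achieve this (Calico example).
Kubernetes Pods should be able to communicate with Pods scheduled on other nodes.
This functionality is provided by a [CNI network add-on]. Most popular add-ons require an additional option in the Kubernetes control plane, usually set with the kubeadm init --pod-network-cidr=a.b.c.d/16 command-line option. Note that the default IP subnet differs between network add-ons.
If you want to use a custom Pod subnet with a particular network add-on, you must customize the add-on's deployment YAML file before applying it to the cluster.
Connectivity between Pods is easy to test by sending ICMP or curl requests from a node CLI or a Pod CLI to any IP address of a Pod scheduled on another node. Note that a Service ClusterIP does not respond to ICMP requests, because it is nothing more than a set of iptables forwarding rules. The complete list of pods with their node names can be shown with the following command: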
```
kubectl get pods --all-namespaces -o wide
```
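To pick ping or curl targets from that list, the pods running on other nodes can be filtered out of the table. The following is only a sketch: pods_on_other_nodes is a hypothetical helper name (not a kubectl feature), and it assumes the default -o wide column layout with no spaces inside any value that comes before the NODE column.

```shell
#!/bin/sh
# Hypothetical helper (not part of kubectl): reads the output of
#   kubectl get pods --all-namespaces -o wide
# on stdin and prints "POD_IP NODE POD_NAME" for each pod that has an IP
# and is scheduled on a node other than the one given as $1.
pods_on_other_nodes() {
  awk -v here="$1" '
    NR == 1 {
      # Locate the columns by header name; the first NODE is the node name
      # (the later "NOMINATED NODE" header also contains the word NODE).
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") name = i
        if ($i == "IP") ip = i
        if ($i == "NODE" && !node) node = i
      }
      next
    }
    $node != here && $ip != "<none>" { print $ip, $node, $name }
  '
}

# Example use on a node (needs kubectl access; the node name is typed by hand):
#   kubectl get pods --all-namespaces -o wide | pods_on_other_nodes k8s-master |
#     while read -r ip node pod; do
#       ping -c 1 -W 2 "$ip" >/dev/null && echo "ok   $pod" || echo "FAIL $pod"
#     done
```

Pods that use the host network report the node's own IP, so a reply from them only confirms node-to-node connectivity, not the pod network.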
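For service discovery, a DNS service must be running in the Kubernetes cluster.
Usually this is kube-dns on older Kubernetes versions and CoreDNS on newer clusters (CoreDNS was introduced as an option in v1.9 and became the kubeadm default in v1.11).
The Kubernetes DNS service usually consists of a Deployment with two replicas and a ClusterIP Service with the default IP address 10.96.0.10.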

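Looking at the data in the question, I suspect you may have a problem with the network add-on. I would test it with the following commands, which should return successful results on a healthy cluster: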

[commands missing in source]

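Troubleshooting a network add-on is too large a topic to cover in a single answer, so if you need to fix the network add-on, please search the existing answers and ask another question if you cannot find a suitable one.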

The following describes how to check the DNS service in a Kubernetes cluster:

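With dnsPolicy ClusterFirst (the default), any DNS query that does not match the configured cluster domain suffix, such as "www.kubernetes.io", is forwarded to the upstream nameserver inherited from the node.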
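As a rough illustration of that rule, the sketch below classifies a fully qualified name by its suffix. dns_route is a hypothetical helper, cluster.local is assumed as the cluster domain, and the model is deliberately simplified: it ignores the search-list expansion that a pod's resolver applies to short names (options ndots:5) and the reverse zones that CoreDNS also serves.

```shell
#!/bin/sh
# Hypothetical helper: prints "cluster" if the name falls under the cluster
# domain (assumed to be cluster.local unless given as $2), else "upstream".
dns_route() {
  name="${1%.}"                  # drop a trailing dot, if any
  domain="${2:-cluster.local}"
  case "$name" in
    "$domain" | *."$domain") echo cluster ;;
    *) echo upstream ;;
  esac
}

dns_route www.kubernetes.io                      # prints: upstream
dns_route kubernetes.default.svc.cluster.local   # prints: cluster
```

A short name such as kubernetes.default is first expanded with the search domains from the pod's /etc/resolv.conf, so it ends up in the cluster branch only after expansion.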

How to check the DNS client configuration on a cluster node:

$ cat /etc/resolv.conf
$ systemd-resolve --status
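When reading that file, it helps to see at a glance which nameservers are local stubs and which are real upstream servers. A minimal sketch, assuming a standard resolv.conf layout; list_nameservers is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: prints every nameserver from a resolv.conf-style file
# (default /etc/resolv.conf) and tags loopback addresses.
list_nameservers() {
  awk '$1 == "nameserver" {
         tag = ($2 ~ /^127\./ || $2 == "::1") ? "loopback" : "remote"
         print $2, tag
       }' "${1:-/etc/resolv.conf}"
}

# Example use:
#   list_nameservers                   # inspects /etc/resolv.conf
#   list_nameservers /path/to/file     # inspects another file
```

A loopback entry such as 127.0.0.53 points at a local stub resolver, in which case the real upstream servers are the ones reported by systemd-resolve --status.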

How to check that the DNS client on a node works:

$ nslookup github.com

Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
Name:   github.com
Address: 140.82.118.4

How to get the Kubernetes cluster DNS configuration:

$ kubectl get svc,pods -n kube-system -o wide | grep dns 

service/kube-dns        ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   185d   k8s-app=kube-dns

pod/coredns-fb8b8dccf-5jjv8                     1/1     Running   122        185d   10.244.0.16   kube-master2   <none>           <none>
pod/coredns-fb8b8dccf-5pbkg                     1/1     Running   122        185d   10.244.0.17   kube-master2   <none>           <none>

$ kubectl get configmap coredns -n kube-system -o yaml

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           upstream
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2019-05-20T16:10:42Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "1657005"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: d1598034-7b19-11e9-9137-42010a9c0004
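The line that matters for external names is the forward directive, which tells CoreDNS where to send queries it does not answer itself. A small sketch for pulling that value out of a Corefile; corefile_upstreams is a hypothetical helper, and it only looks at forward lines of the form shown above:

```shell
#!/bin/sh
# Hypothetical helper: reads a Corefile on stdin and prints every upstream
# listed on a "forward" line, e.g. "forward . /etc/resolv.conf".
corefile_upstreams() {
  awk '$1 == "forward" {
         # $2 is the zone (here "."); the remaining fields are the upstreams.
         for (i = 3; i <= NF; i++) if ($i != "{") print $i
       }'
}

# Example use (needs kubectl access):
#   kubectl get configmap coredns -n kube-system -o jsonpath='{.data.Corefile}' | corefile_upstreams
```

With the ConfigMap shown above this prints /etc/resolv.conf, meaning CoreDNS forwards to the nameservers in the resolv.conf it sees inside its own pod.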

How to check that the cluster DNS service (coredns) works:

$ nslookup github.com 10.96.0.10

Server:         10.96.0.10
Address:        10.96.0.10#53

Non-authoritative answer:
Name:   github.com
Address: 140.82.118.4

How to check that a regular pod can resolve a particular DNS name:

$ kubectl run dnsutils -it --rm=true --restart=Never --image=tutum/dnsutils cat /etc/resolv.conf

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local 
options ndots:5

$ kubectl run dnsutils -it --rm=true --restart=Never --image=tutum/dnsutils nslookup github.com

Server:         10.96.0.10
Address:        10.96.0.10#53

Non-authoritative answer:
Name:   github.com
Address: 140.82.118.3
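To repeat the same check for several names at once, for example one cluster-internal name and one external name, a small loop is enough. This is a sketch under assumptions: check_names is a hypothetical helper, the resolver command is passed in as a single string and relies on its exit status, and the busybox pod in the usage comment is assumed to exist already.

```shell
#!/bin/sh
# Hypothetical helper.
# Usage: check_names "RESOLVER COMMAND" NAME...
# Runs the resolver command once per name and reports ok/FAIL based on its
# exit status. Returns 1 if any lookup failed, 0 otherwise.
check_names() {
  resolver="$1"
  shift
  failed=0
  for n in "$@"; do
    # $resolver is left unquoted on purpose so that it splits into words.
    if $resolver "$n" >/dev/null 2>&1; then
      echo "ok   $n"
    else
      echo "FAIL $n"
      failed=1
    fi
  done
  return "$failed"
}

# Example use (needs kubectl access and a running pod named busybox):
#   check_names "kubectl exec busybox -- nslookup" kubernetes.default github.com
```

If the cluster-internal name resolves but the external one fails, the problem is more likely on the upstream side; if both fail, pod-to-DNS connectivity is the first thing to look at.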

For more details on DNS troubleshooting, see the official documentation.
