Failed to marshal the object to TFJob; the spec is invalid: failed to marshal the object to TFJob

Posted 2024-10-03 00:24:24


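I'm fairly new to both Kubernetes and TensorFlow, and I'm trying to run the basic Kubeflow distributed TensorFlow example from this link (https://github.com/learnk8s/distributed-tensorflow-on-k8s). I'm currently running a local bare-metal Kubernetes cluster with 2 nodes (1 master and 1 worker). When I run the example in minikube everything works (following the documentation): both training and serving run successfully. But running the job on the local cluster gives me this error!

Any help would be appreciated.

For this setup I created a pod for NFS storage, which the job will use. Because the local cluster does not have dynamic provisioning enabled, I created the persistent volume manually (the files used are attached).

NFS storage file: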

kind: Service
apiVersion: v1
metadata:
  name: nfs-service
spec:
  selector:
    role: nfs-service
  ports:
    # Open the ports required by the NFS server
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
---

kind: Pod
apiVersion: v1
metadata:
  name: nfs-server-pod
  labels:
    role: nfs-service
spec:
  containers:
    - name: nfs-server-container
      image: cpuguy83/nfs-server
      securityContext:
        privileged: true
      args:
        # Pass the paths to share to the Docker image
        - /exports

PersistentVolume and PVC file:

^{pr2}$
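
For orientation, a statically provisioned NFS volume and its claim usually look roughly like the sketch below. This is not the asker's manifest. The only values taken from the question are the claim name nfs (referenced by the TFJob) and the nfs-service Service; the PV name, size, server address, and export path are assumptions for illustration.

```yaml
# Sketch only, not the asker's manifest. Everything except the claim name "nfs"
# is an assumed, illustrative value.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs              # hypothetical PV name
spec:
  capacity:
    storage: 1Gi         # assumed size
  accessModes:
    - ReadWriteMany      # several TFJob replicas mount the same volume
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.10    # placeholder: an address at which the nodes can reach nfs-service
    path: /              # assumed export path; depends on how the NFS server exports /exports
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs              # must match claimName in the TFJob
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""   # empty string: bind to a pre-created PV, no dynamic provisioning
  resources:
    requests:
      storage: 1Gi
```

The empty storageClassName is what keeps the claim from waiting on a dynamic provisioner, which matters on a cluster that has none configured.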

TFJob file:

apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: tfjob1
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure

When I run the job, it gives me this error:

error: unable to recognize "kube/tfjob.yaml": no matches for kind "TFJob" in version "kubeflow.org/v1alpha1"

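After searching around, I found someone pointing out that "v1alpha1" may be outdated and that "v1beta1" should be used instead (strangely, "v1alpha1" works with my minikube setup, so I'm quite confused!). However, although the TFJob is now created, I don't see any new containers starting, unlike the minikube run, where new pods start and finish successfully. When I describe the TFJob, I see this error: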

 Type     Reason            Age   From         Message
  ----     ------            ----  ----         -------
  Warning  InvalidTFJobSpec  22s   tf-operator  Failed to marshal the object to TFJob; the spec is invalid: failed to marshal the object to TFJob"

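Since the only difference is the NFS storage, I suspect something may be wrong with my manual setup. Please let me know if I've messed something up because I don't have enough background!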


1 answer
User
#1 · Posted 2024-10-03 00:24:24

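I found the problem that caused this particular error. First, the API version had changed, so I had to move from v1alpha1 to {}. Second, the tutorial I followed uses Kubeflow v0.1.2 (quite old), and the syntax for defining a TFJob in the YAML file has changed since then (I'm not sure exactly which version introduced the change!). So, by looking at the latest examples in the Git repository, I was able to update the job spec!

Tutorial version: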

apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: tfjob1
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure

Updated version:

^{pr2}$
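
As a rough sketch of the syntax change described above: newer TFJob versions replace the replicaSpecs list (with a tfReplicaType field on each entry) with a tfReplicaSpecs map keyed by replica type, and move restartPolicy up to the replica level. The example below rewrites the tutorial job in that layout. It is not necessarily the exact spec the answerer ended up with. The apiVersion is the v1beta1 mentioned in the question, and the choice of Chief for the former MASTER replica is an assumption that depends on what the installed operator accepts.

```yaml
# Sketch only: the tutorial job in the newer tfReplicaSpecs layout.
# The apiVersion must be one that the installed TFJob CRD actually serves.
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: tfjob1
spec:
  tfReplicaSpecs:          # map keyed by replica type (was: replicaSpecs list)
    Chief:                 # assumed replacement for tfReplicaType: MASTER
      replicas: 1
      restartPolicy: OnFailure   # now set per replica type
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
```

If the operator still reports InvalidTFJobSpec with a spec shaped like this, comparing it against the examples shipped with the installed operator version is the safer reference.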
