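We use client-go to create Kubernetes Jobs and Deployments. Today, in one of our clusters (Kubernetes v1.18.19), I ran into the strange problem below.
The Pods of a Kubernetes Job stay in the Pending state for no apparent reason, and kubectl describe pod shows no events. Creating a Job from the host machine (via kubectl) works, and its pod eventually runs.
To my surprise, creating Deployments works fine and their pods eventually run; only Kubernetes Jobs are affected. Why does this happen, and how can I fix it? I have spent hours on this without making any progress.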
kubeconfig used by client-go:
Mounted from the host machine, path: /root/.kube/config
kubectl describe job shows:
Name: unittest
Namespace: default
Selector: controller-uid=f3cec901-c0f4-4098-86d7-f9a7d1fe6cd1
Labels: job-id=unittest
Annotations: <none>
Parallelism: 1
Completions: 1
Start Time: Sat, 19 Jun 2021 00:20:12 +0800
Pods Statuses: 1 Running / 0 Succeeded / 0 Failed
Pod Template:
Labels: controller-uid=f3cec901-c0f4-4098-86d7-f9a7d1fe6cd1
job-name=unittest
Containers:
unittest:
Image: ubuntu:18.04
Port: <none>
Host Port: <none>
Command:
echo hello
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 21m job-controller Created pod: unittest-tt5b2
kubectl describe on the target pod shows:
Name: unittest-tt5b2
Namespace: default
Priority: 0
Node: <none>
Labels: controller-uid=f3cec901-c0f4-4098-86d7-f9a7d1fe6cd1
job-name=unittest
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: Job/unittest
Containers:
unittest:
Image: ubuntu:18.04
Port: <none>
Host Port: <none>
Command:
echo hello
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-72g27 (ro)
Volumes:
default-token-72g27:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-72g27
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
kubectl get events shows:
55m Normal ScalingReplicaSet deployment/job-scheduler Scaled up replica set job-scheduler-76b7465d74 to 1
19m Normal ScalingReplicaSet deployment/job-scheduler Scaled up replica set job-scheduler-74f8896f48 to 1
58m Normal SuccessfulCreate job/unittest Created pod: unittest-pp665
49m Normal SuccessfulCreate job/unittest Created pod: unittest-xm6ck
17m Normal SuccessfulCreate job/unittest Created pod: unittest-tt5b2
Posted on 2021-06-20 08:51:58
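I fixed the issue.
We use a custom scheduler for NPU devices and the default scheduler for GPU devices. For GPU devices, the scheduler name is "default-scheduler", not "default". I was passing "default" as the scheduler name for these Kubernetes Jobs, so no scheduler ever picked up their pods and they stayed Pending.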
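One way to guard against this kind of mistake is to check the scheduler name on the client side before creating the Job. The following is only a minimal sketch using the Go standard library: `checkSchedulerName` and `knownSchedulers` are hypothetical names, and "npu-scheduler" is a placeholder because the post does not name the custom scheduler. With client-go, the checked value would be the one assigned to `job.Spec.Template.Spec.SchedulerName`.

```go
package main

import "fmt"

// knownSchedulers lists the scheduler names assumed to be running in the
// cluster. "default-scheduler" is the name of the stock kube-scheduler;
// "npu-scheduler" is a hypothetical placeholder for the custom NPU scheduler.
var knownSchedulers = map[string]bool{
	"default-scheduler": true,
	"npu-scheduler":     true,
}

// checkSchedulerName returns an error when name matches none of the known
// schedulers. An empty name is accepted because Kubernetes fills in
// "default-scheduler" when the pod spec leaves schedulerName unset.
func checkSchedulerName(name string) error {
	if name == "" {
		return nil
	}
	if !knownSchedulers[name] {
		return fmt.Errorf("no known scheduler named %q; pods using it would stay Pending", name)
	}
	return nil
}

func main() {
	// "default" is the value that caused the Pending pods in this post.
	for _, name := range []string{"default", "default-scheduler", ""} {
		if err := checkSchedulerName(name); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("accepted: %q\n", name)
	}
}
```

Leaving the field empty for GPU Jobs is probably the simplest option, since the API server then defaults it to "default-scheduler".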
https://stackoverflow.com/questions/68038670