About two days ago there was an incident in GKE's London data center (https://status.cloud.google.com/incident/compute/20013), and one of my nodes has been failing ever since. I have had to manually terminate many of the pods running on it, and I have been having problems with several sites; their liveness checks were failing intermittently, which I think may be related to the following issue with gke-metrics-agent.
Looking at the system pods, I can see that one instance of gke-metrics-agent is stuck in the Terminating state, and has been since last night:
kubectl get pods -n kube-system reports:
...
gke-metrics-agent-k47g8 0/1 Terminating 0 32d
gke-metrics-agent-knr9h 1/1 Running 0 31h
gke-metrics-agent-vqkpw 1/1 Running 0 32d
...I have looked at the pod's describe output, but I can't find anything that helps me understand what needs to be done:
kubectl describe pod gke-metrics-agent-k47g8 -n kube-system

Name: gke-metrics-agent-k47g8
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: <node-name>/<IP>
Start Time: Mon, 09 Nov 2020 03:41:14 +0000
Labels: component=gke-metrics-agent
controller-revision-hash=f8c5b8bfb
k8s-app=gke-metrics-agent
pod-template-generation=4
Annotations: components.gke.io/component-name: gke-metrics-agent
components.gke.io/component-version: 0.27.1
configHash: <config-hash>
Status: Terminating (lasts 15h)
Termination Grace Period: 30s
IP: <IP>
IPs:
IP: <IP>
Controlled By: DaemonSet/gke-metrics-agent
Containers:
gke-metrics-agent:
Container ID: docker://<id>
Image: gcr.io/gke-release/gke-metrics-agent:0.1.3-gke.0
Image ID: docker-pullable://gcr.io/gke-release/gke-metrics-agent@sha256:<hash>
Port: <none>
Host Port: <none>
Command:
/otelsvc
--config=/conf/gke-metrics-agent-config.yaml
--metrics-level=NONE
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 09 Nov 2020 03:41:17 +0000
Finished: Thu, 10 Dec 2020 21:16:50 +0000
Ready: False
Restart Count: 0
Limits:
memory: 50Mi
Requests:
cpu: 3m
memory: 50Mi
Environment:
NODE_NAME: (v1:spec.nodeName)
POD_NAME: gke-metrics-agent-k47g8 (v1:metadata.name)
POD_NAMESPACE: kube-system (v1:metadata.namespace)
KUBELET_HOST: 127.0.0.1
ARG1: ${1}
ARG2: ${2}
Mounts:
/conf from gke-metrics-agent-config-vol (rw)
/etc/ssl/certs from ssl-certs (ro)
/var/run/secrets/kubernetes.io/serviceaccount from gke-metrics-agent-token-cn6ss (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
gke-metrics-agent-config-vol:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: gke-metrics-agent-conf
Optional: false
ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs
HostPathType:
gke-metrics-agent-token-cn6ss:
Type: Secret (a volume populated by a Secret)
SecretName: gke-metrics-agent-token-cn6ss
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: :NoExecute
:NoSchedule
components.gke.io/gke-managed-components
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events: <none>

I'm not used to dealing with system pods, and in the past my experience has been that, when everything else fails, this kind of troubleshooting often ends with a forced delete:

kubectl delete pod <pod-name> -n <ns> --grace-period=0 --force

My concern is that I don't fully understand what this might do to a system pod, and I'm hoping someone with the expertise can suggest a sensible way forward.
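For anyone in a similar situation, a rough sketch of the checks I would run before resorting to a force delete (the pod name is from the output above; the node name is a placeholder, and all of this assumes a working kubectl context):

```shell
# A pod often stays in Terminating because the kubelet on its node is
# unreachable, or because finalizers are blocking the delete.

# 1. Check whether finalizers are holding the pod object:
kubectl get pod gke-metrics-agent-k47g8 -n kube-system \
  -o jsonpath='{.metadata.finalizers}'

# 2. Check whether the node it was scheduled on is still Ready:
kubectl get node <node-name>

# 3. Last resort: remove the pod object from the API server immediately.
# Since this pod is controlled by a DaemonSet, the controller should
# recreate it once the node is healthy again.
kubectl delete pod gke-metrics-agent-k47g8 -n kube-system \
  --grace-period=0 --force
```

Note that a force delete only removes the pod object from the API server; if the node is wedged, the underlying container may linger until the node itself is repaired or replaced.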
I'm also considering draining this node so that Kubernetes rebuilds everything on a new one. Would that be the simplest way forward?
Posted on 2020-12-11 20:00:21
Following on from this, I found that over time the node experiencing the gke-metrics-agent issue became more and more unstable.
As a result, I had to drain it. The resources it was running are now on a new node that works correctly, and all the system pods are running as expected (including gke-metrics-agent).
Before draining this node, I made sure Pod Disruption Budgets were in place, because many of my services run with only 1 or 2 replicas:
https://kubernetes.io/docs/tasks/run-application/configure-pdb/
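As a rough sketch of what such a budget looks like (the name, label, and minAvailable value here are hypothetical; adjust them to your own Deployments — and note that on clusters newer than roughly v1.21 the API group is policy/v1 rather than policy/v1beta1):

```shell
# Create a minimal PodDisruptionBudget that keeps at least one replica
# of a service alive during a voluntary disruption such as a drain:
kubectl apply -f - <<'EOF'
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-service
EOF

# Verify the budget is tracking pods:
kubectl get pdb my-service-pdb
```

With this in place, kubectl drain will refuse to evict a pod if doing so would drop the service below minAvailable, and will retry as replacement pods come up elsewhere.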
This meant I could run:

kubectl drain <node-name>

The Deployments then ensured there were enough live pods before the bad node went offline, and it seems that any downtime was avoided.
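In practice the drain usually needs a couple of extra flags on a node running DaemonSet pods like gke-metrics-agent; a sketch of the sequence I would use (node name is a placeholder, and the flags reflect kubectl as of late 2020):

```shell
# Stop new pods from being scheduled onto the bad node:
kubectl cordon <node-name>

# Evict the workloads, respecting PodDisruptionBudgets.
# --ignore-daemonsets: DaemonSet pods (e.g. gke-metrics-agent) are
#   managed per-node and cannot be rescheduled, so leave them to their
#   controller rather than failing the drain.
# --delete-local-data: allow eviction of pods using emptyDir volumes
#   (renamed --delete-emptydir-data in newer kubectl versions).
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
```

On GKE, once the node is drained you can delete the underlying VM (or let node auto-repair replace it), and the node pool brings up a fresh node with healthy system pods.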
https://stackoverflow.com/questions/65251513