我正在尝试解决我们的K8集群v1.19中的DNS问题。有3个节点(1个控制器,2个工作节点)都在运行vanilla Ubuntu20.04,使用Calico网络和Metallb实现入站负载均衡。这一切都是在内部托管的,并且可以完全访问互联网。它前面还有一个代理服务器(Traefik),负责处理K8集群和基础设施中的其他服务的SSL。
当我升级了连接到redis pod的pod的舵表时,这个问题就发生了,但是在过去的36天里,我很乐意运行它。
在其中一个pod的日志中,显示无法确定redis pod所在位置的错误:
2020-11-09 00:00:00 [1] [verbose]: [Cache] Attempting connection to redis.
2020-11-09 00:00:00 [1] [verbose]: [Cache] Successfully connected to redis.
2020-11-09 00:00:00 [1] [verbose]: [PubSub] Attempting connection to redis.
2020-11-09 00:00:00 [1] [verbose]: [PubSub] Successfully connected to redis.
2020-11-09 00:00:00 [1] [warn]: Secret key is weak. Please consider lengthening it for better security.
2020-11-09 00:00:00 [1] [verbose]: [Database] Connecting to database...
2020-11-09 00:00:00 [1] [info]: [Database] Successfully connected .
2020-11-09 00:00:00 [1] [verbose]: [Database] Ran 0 migration(s).
2020-11-09 00:00:00 [1] [verbose]: Sending request for public key.
Error: getaddrinfo EAI_AGAIN oct-2020-redis-master
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26) {
errno: -3001,
code: 'EAI_AGAIN',
syscall: 'getaddrinfo',
hostname: 'oct-2020-redis-master'
}
[ioredis] Unhandled error event: Error: getaddrinfo EAI_AGAIN oct-2020-redis-master
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26)
Error: connect ETIMEDOUT
at Socket.<anonymous> (/app/node_modules/ioredis/built/redis/index.js:307:37)
at Object.onceWrapper (events.js:421:28)
at Socket.emit (events.js:315:20)
at Socket.EventEmitter.emit (domain.js:486:12)
at Socket._onTimeout (net.js:483:8)
at listOnTimeout (internal/timers.js:554:17)
at processTimers (internal/timers.js:497:7) {
errorno: 'ETIMEDOUT',
code: 'ETIMEDOUT',
syscall: 'connect'
}我已经完成了https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/中概述的步骤
ubuntu@k8-01:~$ kubectl exec -i -t dnsutils -- nslookup kubernetes.default
;; connection timed out; no servers could be reached
command terminated with exit code 1
ubuntu@k8-01:~$ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
coredns-f9fd979d6-lfm5t 1/1 Running 17 37d
coredns-f9fd979d6-sw2qp 1/1 Running 18 37d
ubuntu@k8-01:~$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] 10.244.210.238:34288 - 28733 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.001300712s
[INFO] 10.244.210.238:44532 - 12032 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.001279312s
[INFO] 10.244.210.235:44595 - 65094 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.000163001s
[INFO] 10.244.210.235:55945 - 20758 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.000141202s
ubuntu@k8-01:~$ kubectl get services --all-namespaces
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default oct-2020-api ClusterIP 10.107.89.213 <none> 80/TCP 37d
default oct-2020-nginx-ingress-controller LoadBalancer 10.110.235.175 192.168.2.150 80:30194/TCP,443:31514/TCP 37d
default oct-2020-nginx-ingress-default-backend ClusterIP 10.98.147.246 <none> 80/TCP 37d
default oct-2020-redis-headless ClusterIP None <none> 6379/TCP 37d
default oct-2020-redis-master ClusterIP 10.109.58.236 <none> 6379/TCP 37d
default oct-2020-webclient ClusterIP 10.111.204.251 <none> 80/TCP 37d
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 37d
kube-system coredns NodePort 10.101.104.114 <none> 53:31245/UDP 15h
kube-system kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 37d当我进入pod时:
/app # grep "nameserver" /etc/resolv.conf
nameserver 10.96.0.10
/app # nslookup
BusyBox v1.31.1 () multi-call binary.
Usage: nslookup [-type=QUERY_TYPE] [-debug] HOST [DNS_SERVER]
Query DNS about HOST
QUERY_TYPE: soa,ns,a,aaaa,cname,mx,txt,ptr,any
/app # ping 10.96.0.10
PING 10.96.0.10 (10.96.0.10): 56 data bytes
^C
--- 10.96.0.10 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
/app # nslookup oct-20-redis-master
;; connection timed out; no servers could be reached任何关于故障排除的想法都将不胜感激。
发布于 2020-11-10 11:08:43
为了回答我自己的问题,我删除了DNS pod,然后它再次工作。命令如下:
kubectl delete pod coredns-f9fd979d6-sw2qp --namespace=kube-system这并没有涉及到为什么会发生这种情况的根本问题,或者为什么K8没有检测到这些pods有问题并重新创建它们。我将继续深入研究这一问题,并在DNS pod上添加更多工具,以查看导致此问题的实际原因。
发布于 2020-11-09 13:20:16
这就是我们测试dns的方法
在部署下创建
apiVersion: v1
kind: Service
metadata:
name: nginx
labels:
app: nginx
spec:
ports:
- port: 80
name: web
clusterIP: None
selector:
app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web
labels:
app: nginx
spec:
serviceName: "nginx"
replicas: 2
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: k8s.gcr.io/nginx-slim:0.8
ports:
- containerPort: 80
name: web
volumeMounts:
- name: www
mountPath: /usr/share/nginx/html
volumes:
- name: www
emptyDir:运行以下测试
master $ kubectl get po
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 1m
web-1 1/1 Running 0 1m
master $ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 35m
nginx ClusterIP None <none> 80/TCP 2m
master $ kubectl run -i --tty --image busybox:1.28 dns-test --restart=Never --rm
If you don't see a command prompt, try pressing enter.
/ # nslookup nginx
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: nginx
Address 1: 10.40.0.1 web-0.nginx.default.svc.cluster.local
Address 2: 10.40.0.2 web-1.nginx.default.svc.cluster.local
/ #
/ # nslookup web-0.nginx
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: web-0.nginx
Address 1: 10.40.0.1 web-0.nginx.default.svc.cluster.local
/ # nslookup web-0.nginx.default.svc.cluster.local
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: web-0.nginx.default.svc.cluster.local
Address 1: 10.40.0.1 web-0.nginx.default.svc.cluster.localhttps://stackoverflow.com/questions/64744499
复制相似问题