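I followed the instructions at https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/resource-providers/standalone/kubernetes/ to set up a session-mode Flink cluster, using the following commands from the section "Starting a Kubernetes Cluster (Session Mode)":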
# Configuration and service definition
$ kubectl create -f flink-configuration-configmap.yaml
$ kubectl create -f jobmanager-service.yaml
# Create the deployments for the cluster
$ kubectl create -f jobmanager-session-deployment.yaml
$ kubectl create -f taskmanager-session-deployment.yaml
The TaskManager pod keeps crashing. The TaskManager log shows:
2022-02-28 03:41:40,543 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@flink-jobmanager:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@flink-jobmanager:6123]] Caused by: [java.net.UnknownHostException: flink-jobmanager: Temporary failure in name resolution]
2022-02-28 03:41:40,555 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*.
2022-02-28 03:42:00,584 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*.
2022-02-28 03:42:10,594 INFO akka.remote.transport.ProtocolStateActor [] - No response from remote for outbound association. Associate timed out after [20000 ms].
2022-02-28 03:42:10,596 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@flink-jobmanager:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@flink-jobmanager:6123]] Caused by: [No response from remote for outbound association. Associate timed out after [20000 ms].]
2022-02-28 03:42:10,605 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*.
2022-02-28 03:42:30,644 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*.
2022-02-28 03:42:40,052 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Fatal error occurred in TaskExecutor akka.tcp://flink@10.244.205.198:6122/user/rpc/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1449) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$17(TaskExecutor.java:1434) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:455) ~[flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:455) ~[flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213) ~[flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163) ~[flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at akka.actor.Actor.aroundReceive(Actor.scala:537) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at akka.actor.Actor.aroundReceive$(Actor.scala:535) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at akka.actor.ActorCell.invoke(ActorCell.scala:548) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at akka.dispatch.Mailbox.run(Mailbox.scala:231) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at akka.dispatch.Mailbox.exec(Mailbox.scala:243) [flink-rpc-akka_dcfe8153-9945-448e-897a-6dec4f3d2704.jar:1.14.3]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_322]
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_322]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_322]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) [?:1.8.0_322]
2022-02-28 03:42:40,066 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Fatal error occurred while executing the TaskManager. Shutting it down...
Does this error mean that the TaskManager cannot resolve flink-jobmanager to the cluster IP of the flink-jobmanager service?
The flink-jobmanager service is up:
(base) ~/cloudmap3/cloudmap3-k8s/flink $ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
flink-jobmanager ClusterIP 10.111.160.112 <none> 6123/TCP,6124/TCP,8081/TCP 83m
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 32d
How can I debug this issue?
Some more information: the CoreDNS pod does not show anything obviously meaningful. The logs on the pod show:
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration MD5 = 08e2b174e0f0a30a2e82df9c995f4a34
CoreDNS-1.8.4
linux/amd64, go1.16.4, 053c4d5
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[WARNING] plugin/health: Local health request to "http://:8080/health" failed: Get "http://:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[WARNING] plugin/health: Local health request to "http://:8080/health" took more than 1s: 2.607635169s
[WARNING] plugin/health: Local health request to "http://:8080/health" failed: Get "http://:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[WARNING] plugin/health: Local health request to "http://:8080/health" failed: Get "http://:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[WARNING] plugin/health: Local health request to "http://:8080/health" failed: Get "http://:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[WARNING] plugin/health: Local health request to "http://:8080/health" took more than 1s: 1.885799651s
Answered on 2022-03-02 09:51:26
After restarting the CoreDNS pod, the [INFO] plugin/ready: Still waiting on: "kubernetes" messages from the CoreDNS pod disappeared, and the Flink TaskManager was able to register with the JobManager.
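The restart and a follow-up DNS check could be sketched roughly as below. This is only an illustration under assumptions that the original post does not state: that CoreDNS runs as a Deployment named coredns in the kube-system namespace (the kubeadm default), and that a throwaway pod named dns-test using the busybox:1.28 image is acceptable for the lookup. The helper names run, restart_coredns and check_dns are hypothetical. Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch only. Assumptions (not from the original post):
#   - CoreDNS is the "coredns" Deployment in the kube-system namespace
#   - a temporary pod named "dns-test" with image busybox:1.28 may be created

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Restart the CoreDNS pods and wait until the rollout has finished.
restart_coredns() {
  run kubectl -n kube-system rollout restart deployment coredns
  run kubectl -n kube-system rollout status deployment coredns
}

# Resolve the flink-jobmanager service name from inside the cluster.
check_dns() {
  run kubectl run dns-test --rm -i --restart=Never --image=busybox:1.28 -- nslookup flink-jobmanager
}
```

After sourcing the script, calling restart_coredns and then check_dns should show whether the service name now resolves. If the lookup still fails, the CoreDNS logs are the next place to look.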
https://stackoverflow.com/questions/71290534