
Kafka reassign-partitions tool stuck with no progress

Stack Overflow user
Asked on 2018-12-27 17:53:14

Running the reassign-partitions tool to expand a topic's partitions over 5 brokers instead of 3. Kafka 2.1, running on Docker.
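For reference, a reassignment like this is normally driven with the stock kafka-reassign-partitions.sh tool. A minimal sketch of the invocation (the file names, topic names, and ZooKeeper address below are illustrative, not taken from the original question):

# Ask the tool to propose a plan spreading the listed topics over brokers 1-5
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --topics-to-move-json-file topics.json \
  --broker-list "1,2,3,4,5" --generate

# topics.json (hypothetical): {"version":1,"topics":[{"topic":"events"},{"topic":"impressions"}]}

# Save the proposed assignment to reassignment.json, then apply it
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassignment.json --execute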

It got to a point where one of the nodes started misbehaving. The other (healthy) nodes began showing the following messages:

[2018-12-27 13:00:31,618] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=48303608, epoch=226826) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)
[2018-12-27 13:00:31,620] WARN [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={impressions-35=(offset=3931626, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[29]), impressions-26=(offset=4273048, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), impressions-86=(offset=3660830, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-93=(offset=2535787, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[26]), impressions-53=(offset=3683354, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), impressions-59=(offset=3696315, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[29]), impressions-11=(offset=3928338, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-69=(offset=2510463, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-72=(offset=2481181, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-75=(offset=2462527, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-126=(offset=2510344, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-63=(offset=2515896, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=48303608, epoch=226826)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
    at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
    at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
    at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:241)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
    at scala.Option.foreach(Option.scala:257)
    at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

15 minutes later, the healthy servers were showing the following message:

[2018-12-27 13:16:00,540] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition events-111 as the leader reported an error: UNKNOWN_SERVER_ERROR (kafka.server.ReplicaFetcherThread)

Later on, we see a lot of messages like this one:

[2018-12-27 17:20:21,132] WARN [ReplicaManager broker=1] While recording the replica LEO, the partition events-116 hasn't been created. (kafka.server.ReplicaManager)

And the most frequent set of them all:

[2018-12-27 17:20:21,138] WARN [ReplicaManager broker=1] Leader 1 failed to record follower 3's position 2517140 since the replica is not recognized to be one of the assigned replicas 1,4,6 for partition events-53. Empty records will be returned for this partition. (kafka.server.ReplicaManager)

The topic being reassigned has 128 partitions across 3 servers. In total, each server holds around 2000 partitions.

The reassignment has now been stuck for 6 hours, with 41% of the partitions showing as under-replicated. The topic had replication factor 3, although it now shows 5. I assume this is part of how the rebalance works under the hood: it first adds the extra replicas, and later removes the ones that are no longer needed?
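As a side note on gauging progress: the same tool can verify the plan partition by partition, and under-replicated partitions can be listed directly. A sketch, assuming the reassignment JSON used for --execute is still at hand:

# Reports, per partition, whether the reassignment completed, is in progress, or failed
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassignment.json --verify

# List partitions whose ISR is currently smaller than the assigned replica set
kafka-topics.sh --zookeeper zk:2181 --describe --under-replicated-partitions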

However, node 3 is showing the following messages:

[2018-12-27 17:10:05,509] WARN [RequestSendThread controllerId=3] Controller 3 epoch 14 fails to send request (type=LeaderAndIsRequest, controllerId=3, controllerEpoch=14, partitionStates={events-125=PartitionState(controllerEpoch=14, leader=1, leaderEpoch=25, isr=3,1,2, zkVersion=57, replicas=1,6,2,3, isNew=false)}, liveLeaders=(172.31.10.35:9092 (id: 1 rack: eu-west-1c))) to broker 172.31.27.111:9092 (id: 3 rack: eu-west-1a). Reconnecting to broker. (kafka.controller.RequestSendThread)

So, node "3" has a problem. How can I find out what is happening to it?
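One way to see what the cluster thinks of broker 3 is to inspect ZooKeeper directly: if the broker's ephemeral znode under /brokers/ids is gone while the process is still running, its ZK session was lost, which would match the symptoms here. A sketch (ZooKeeper address illustrative):

# Which brokers are currently registered (ephemeral znodes)?
zookeeper-shell.sh zk:2181 ls /brokers/ids

# Details for broker 3, if it is still registered
zookeeper-shell.sh zk:2181 get /brokers/ids/3

# Which broker currently holds the controller role?
zookeeper-shell.sh zk:2181 get /controller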

This has happened twice now, both times while trying to reassign partitions on topics with the same number of partitions. In the previous case, we brought up another machine as a new broker with the same id (restarting the container did not help) and it recovered. But how can we prevent this from happening?

What could be the root cause?


1 answer

Stack Overflow user

Accepted answer

Answered on 2019-03-27 15:05:53

It has been a while since this was written, but in case it helps anyone: the settings that helped (in our case) were increasing zookeeper.session.timeout.ms, zookeeper.connection.timeout.ms, and replica.lag.time.max.ms to 60000.
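In server.properties terms, that amounts to the following on every broker (values as described above; brokers need a restart to pick them up):

# Give brokers more slack before their ZooKeeper session is declared dead
zookeeper.session.timeout.ms=60000
zookeeper.connection.timeout.ms=60000

# Let followers lag longer before they are dropped from the ISR
replica.lag.time.max.ms=60000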

It has not happened again since. The idea behind it is that at some point a broker lost its ZooKeeper session, which created a mismatch between the brokers, which still considered that broker alive, and ZooKeeper, which did not. For some reason this state never got cleaned up. Increasing these settings allows for longer session stickiness. Beware that it also takes longer for genuinely dead brokers to expire.

Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/53948986