We have a 4-node cluster in production. We observed that one of the nodes ran into a state where it kept shrinking and expanding the ISR for over an hour and could not recover until the broker was bounced.
[2017-02-21 14:52:16,518] INFO Partition [skynet-large-stage,5] on broker 0: Shrinking ISR for partition [skynet-large-stage,5] from 2,0 to 0 (kafka.cluster.Partition)
[2017-02-21 14:52:16,543] INFO Partition [skynet-large-stage,37] on broker 0: Shrinking ISR for partition [skynet-large-stage,37] from 1,0 to 0 (kafka.cluster.Partition)
[2017-02-21 14:52:16,544] INFO Partition [skynet-large-stage,13] on broker 0: Shrinking ISR for partition [skynet-large-stage,13] from 1,0 to 0 (kafka.cluster.Partition)
[2017-02-21 14:52:16,545] INFO Partition [__consumer_offsets,46] on broker 0: Shrinking ISR for partition [__consumer_offsets,46] from 3,2,0 to 3,0 (kafka.cluster.Partition)
…

I would like to know what caused this problem, and why the failed broker was not kicked out of the ISR.
The Kafka version is 0.10.1.0.
Posted on 2018-10-16 01:12:56
The bug in KAFKA-4477 was fixed, but more generally I have seen the same problem whenever Kafka brokers time out talking to the ZooKeeper nodes (the default timeout is 6000 ms): on brief network glitches the brokers get kicked out of the cluster, partition leadership changes, clients have to rebalance, and so on. For high-volume clusters this is painful.
Simply increasing this timeout has helped me several times:

`zookeeper.session.timeout.ms` defaults to 6000 ms per the official documentation. I found that simply raising it to 15000 ms made the cluster rock solid.
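The change above is a single broker setting in `server.properties`. A minimal sketch of the relevant lines (15000 ms is the value that worked for me, not an official recommendation; tune it for your own network):

```properties
# server.properties (on each broker, then do a rolling restart)

# ZooKeeper session timeout; default is 6000 ms. Raising it makes brokers
# tolerate brief network glitches without being expired from the cluster.
zookeeper.session.timeout.ms=15000
```

The trade-off is that a genuinely dead broker is also detected more slowly, so leadership failover takes correspondingly longer.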
Kafka 0.11.0 documentation: https://kafka.apache.org/0110/documentation.html
https://stackoverflow.com/questions/42364331