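I have set up an Elasticsearch cluster in a Kubernetes cluster (EKS). The Elasticsearch cluster has 3 nodes, and I have provisioned an 8E disk for the nodes to store their data (so I don't expect to run into space problems any time soon).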
[root@es-cluster-0 elasticsearch]# curl -s -XGET http://localhost:9200/_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host         ip           node
    36       66.7gb   966.1gb   8191.9pb   8191.9pb            0 10.65.32.184 10.65.32.184 es-cluster-0
    33       82.6gb   966.1gb   8191.9pb   8191.9pb            0 10.65.32.202 10.65.32.202 es-cluster-2
    37         76gb   966.1gb   8191.9pb   8191.9pb            0 10.65.32.178 10.65.32.178 es-cluster-1
    14                                                                                      UNASSIGNED
The current health of the cluster is:
[root@es-cluster-0 elasticsearch]# curl -s -XGET http://localhost:9200/_cluster/health?pretty
{
"cluster_name" : "k8s-logs",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 56,
"active_shards" : 106,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 14,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 88.33333333333333
}
I can see that I have 14 "unassigned_shards", which exactly matches the last line of the /_cat/allocation output above.
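To see which indices those unassigned shards belong to and why they are unassigned, the `_cat/shards` output can be summarized per index. The following is only a minimal sketch: the helper name `count_unassigned` is made up for illustration, and it assumes the column order `index shard prirep state unassigned.reason` requested through the `h=` parameter.

```shell
#!/bin/sh
# count_unassigned (hypothetical helper): reads rows of
#   index shard prirep state unassigned.reason
# on stdin and prints "<index> <count> <reason>" for each index
# that has at least one shard in the UNASSIGNED state.
count_unassigned() {
  awk '$4 == "UNASSIGNED" { key = $1 " " $5; n[key]++ }
       END { for (k in n) { split(k, p, " "); print p[1], n[k], p[2] } }' | sort
}

# Intended usage against the cluster above (left as a comment, not executed here):
#   curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | count_unassigned
```

Rows for shards that are already assigned have an empty reason column and are skipped by the state filter.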
When I started digging into what was going on, I found:
[root@es-cluster-0 elasticsearch]# curl -s -XGET http://localhost:9200/_cluster/allocation/explain?pretty
{
"index" : "logstash-2022.01.22",
"shard" : 0,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2022-01-22T00:00:11.254Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [bf_GjmcUQGuCTk-_voh4Xw]: failed recovery, failure RecoveryFailedException[[logstash-2022.01.22][0]: Recovery failed from {es-cluster-0}{hYJ4ifx7R7yWJq6VFP3Drw}{jjAAtdcmQXeVpJXxj4DYcA}{10.65.32.184}{10.65.32.184:9300}{dilmrt}{ml.machine_memory=15878057984, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {es-cluster-1}{bf_GjmcUQGuCTk-_voh4Xw}{QNp4DD51TQa716D4TjMFPg}{10.65.32.178}{10.65.32.178:9300}{dilmrt}{ml.machine_memory=15878057984, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[es-cluster-0][10.65.32.184:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[es-cluster-1][10.65.32.178:9300][internal:index/shard/recovery/clean_files]]; nested: UncategorizedExecutionException[Failed execution]; nested: NotSerializableExceptionWrapper[execution_exception: java.io.IOException: Disk quota exceeded]; nested: IOException[Disk quota exceeded]; ",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions" : [
{
"node_id" : "7WHft5LVTYCEWvwKM64A-w",
"node_name" : "es-cluster-2",
"transport_address" : "10.65.32.202:9300",
"node_attributes" : {
"ml.machine_memory" : "15878057984",
"ml.max_open_jobs" : "20",
"xpack.installed" : "true",
"transform.node" : "true"
},
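--- TRUNCATED ---
I don't understand why it reports Disk quota exceeded. If the Elasticsearch cluster is correctly reporting its available capacity in /_cat/allocation, is there some additional configuration I need to set to tell Elasticsearch that there is enough space to use?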
Posted on 2022-01-28 15:39:46
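Check the EFS limits that can cause a disk quota error; this error is not related to disk size. In general, EFS does not support a fairly large ES stack. For example, Elasticsearch expects 64K file descriptors per data node instance, but EFS currently supports only 32K. If you look at your Elasticsearch logs, you may be able to find out which limit has been exceeded.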
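One way to check the file descriptor point is to compare each node's open descriptor count with the EFS figure mentioned above. This is a rough sketch, not a definitive diagnosis: `nodes_over_fd_limit` is a hypothetical helper, the value 32768 is an assumed reading of the "32K" limit, and the node's open descriptor count is only an approximation of the files it holds open on EFS. The input is expected in the column order `name file_desc.current file_desc.max`, which are columns of the `_cat/nodes` API.

```shell
#!/bin/sh
# nodes_over_fd_limit LIMIT (hypothetical helper): reads rows of
#   name file_desc.current file_desc.max
# on stdin and prints the name of every node whose current open
# file descriptor count is at or above LIMIT.
nodes_over_fd_limit() {
  awk -v limit="$1" '$2 ~ /^[0-9]+$/ && ($2 + 0) >= (limit + 0) { print $1 }'
}

# Intended usage (left as a comment, not executed here); 32768 is an assumption:
#   curl -s 'http://localhost:9200/_cat/nodes?h=name,file_desc.current,file_desc.max' | nodes_over_fd_limit 32768
```

Once the underlying limit is no longer being hit, shards that have used up their allocation retries (here `failed_allocation_attempts` is 5) can be retried with `POST /_cluster/reroute?retry_failed=true`.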
https://stackoverflow.com/questions/70894538