
Aerospike cluster nodes intermittently going down and recovering

Stack Overflow user
Asked on 2020-09-15 21:08:19
1 answer · 387 views · 0 followers · 0 votes

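I have an Aerospike cluster of 15 nodes. It performs quite well under the normal load of 10K TPS. Today I ran some tests at a higher load, raising it to 130K-150K TPS. I observed some nodes intermittently going down and recovering automatically within a few seconds. Because of these nodes going down, we get heartbeat timeouts and, as a result, read timeouts.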

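Per-node configuration: 8 cores and 120 GB of RAM. I am storing the data in memory. All nodes have plenty of free space: of the 1.2 TB (15*120) of total cluster capacity, only 275 GB is in use. Nothing looks unusual on the network side either. All of these machines are in a single data center and are high-bandwidth machines.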

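Some observations from monitoring with AMC:

  1. Some nodes (about 5-6) showed as inactive for a few seconds.
  2. A few of these nodes had an unusually large number of client connections. For example, all other nodes had 600-7000 client connections, while one node had an unusual 25000 client connections.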

Some error logs from the cluster nodes:

Sep 15 2020 17:00:43 GMT: WARNING (hb): (hb.c:4864) (repeated:5) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:00:43 GMT: WARNING (socket): (socket.c:808) (repeated:5) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:00:53 GMT: WARNING (socket): (socket.c:740) (repeated:3) Timeout while connecting
Sep 15 2020 17:00:53 GMT: WARNING (hb): (hb.c:4864) (repeated:3) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:00:53 GMT: WARNING (socket): (socket.c:808) (repeated:3) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:01:03 GMT: WARNING (socket): (socket.c:740) (repeated:1) Timeout while connecting
Sep 15 2020 17:01:03 GMT: WARNING (hb): (hb.c:4864) (repeated:1) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:01:03 GMT: WARNING (socket): (socket.c:808) (repeated:1) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:01:13 GMT: WARNING (socket): (socket.c:740) (repeated:2) Timeout while connecting
Sep 15 2020 17:01:13 GMT: WARNING (hb): (hb.c:4864) (repeated:2) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:01:13 GMT: WARNING (socket): (socket.c:808) (repeated:2) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:02:44 GMT: WARNING (socket): (socket.c:740) Timeout while connecting
Sep 15 2020 17:02:44 GMT: WARNING (socket): (socket.c:808) Error while connecting socket to 10.33.54.144:2057
Sep 15 2020 17:02:44 GMT: WARNING (hb): (hb.c:4864) could not create heartbeat connection to node {10.33.54.144:2057}
Sep 15 2020 17:02:53 GMT: WARNING (socket): (socket.c:740) (repeated:1) Timeout while connecting
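
The heartbeat warnings above name the peer that could not be reached, so tallying them per peer is a quick way to see which nodes were dropping out. The following is a minimal sketch, not an Aerospike tool: the function name `tally_heartbeat_failures` is hypothetical, the regular expression only assumes the message wording seen in this log excerpt, and the `(repeated:N)` prefix is ignored, so the counts are a lower bound.

```python
import re
from collections import Counter

# Matches the peer address in lines such as:
#   ... WARNING (hb): (hb.c:4864) could not create heartbeat connection to node {10.33.162.134:2057}
HB_FAIL = re.compile(
    r"could not create heartbeat connection to node \{(?P<peer>[\d.]+:\d+)\}"
)


def tally_heartbeat_failures(lines):
    """Count heartbeat-connection warning lines per peer address."""
    counts = Counter()
    for line in lines:
        match = HB_FAIL.search(line)
        if match:
            counts[match.group("peer")] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the log excerpt in the question.
    sample = [
        "Sep 15 2020 17:00:43 GMT: WARNING (hb): (hb.c:4864) (repeated:5) could not create heartbeat connection to node {10.33.162.134:2057}",
        "Sep 15 2020 17:00:53 GMT: WARNING (hb): (hb.c:4864) (repeated:3) could not create heartbeat connection to node {10.33.162.134:2057}",
        "Sep 15 2020 17:02:44 GMT: WARNING (hb): (hb.c:4864) could not create heartbeat connection to node {10.33.54.144:2057}",
    ]
    for peer, count in tally_heartbeat_failures(sample).most_common():
        print(peer, count)
```

In practice the input would be the lines of `/var/log/aerospike/aerospike.log` from each node; a peer that dominates the tally on many nodes at once is the one to inspect first.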

We also saw some of these error logs on the nodes that were going down:

Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9280f220a0102 on fd 4155 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9b676220a0102 on fd 4149 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9fbd6200a0102 on fd 42 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb96d3d220a0102 on fd 4444 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb99036210a0102 on fd 4278 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9f102220a0102 on fd 4143 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb91822210a0102 on fd 4515 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9e5ff200a0102 on fd 4173 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb93f65200a0102 on fd 38 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9132f220a0102 on fd 4414 failed : Connection reset by peer
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb939be210a0102 on fd 4567 failed : Connection reset by peer
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9b19a220a0102 on fd 4165 failed : Broken pipe

Attaching the aerospike.conf file here:

service {
    user root
    group root
    service-threads 12
    transaction-queues 12
    transaction-threads-per-queue 4
    proto-fd-max 50000
    migrate-threads 1
    pidfile /var/run/aerospike/asd.pid
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any info
        context migrate debug
    }
}

network {
    service {
        address any
        port 3000
    }

    heartbeat {
        mode mesh
        port 2057

        mesh-seed-address-port 10.34.154.177 2057
        mesh-seed-address-port 10.34.15.40 2057
        mesh-seed-address-port 10.32.255.229 2057
        mesh-seed-address-port 10.33.54.144 2057
        mesh-seed-address-port 10.32.190.157 2057
        mesh-seed-address-port 10.32.101.63 2057
        mesh-seed-address-port 10.34.2.241 2057
        mesh-seed-address-port 10.32.214.251 2057
        mesh-seed-address-port 10.34.30.114 2057
        mesh-seed-address-port 10.33.162.134 2057
        mesh-seed-address-port 10.33.190.57 2057
        mesh-seed-address-port 10.34.61.109 2057
        mesh-seed-address-port 10.34.47.19 2057
        mesh-seed-address-port 10.33.34.24 2057
        mesh-seed-address-port 10.34.118.182 2057
        
        interval 150
        timeout 20
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace PS1 {
    replication-factor 2
    memory-size 70G
    single-bin false
    data-in-index false
    storage-engine memory   
    stop-writes-pct 90
    high-water-memory-pct 75    
}

namespace LS1 {
    replication-factor 2
    memory-size 30G
    single-bin false
    data-in-index false
    storage-engine memory   
    stop-writes-pct 90
    high-water-memory-pct 75
}

Is there any possible explanation for this?

1 Answer

Stack Overflow user

Answered on 2020-09-15 22:34:29

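It seems the nodes are having network connectivity issues at such a high throughput. This can have different root causes, from a simple network-related bottleneck (bandwidth, packets per second) to something on the node itself getting in the way of properly interfacing with the network (a surge in soft interrupts, improper distribution of network queues, CPU thrashing). Any of these would prevent heartbeat connections and messages from going through, causing the node to leave the cluster until it recovers. If you are running in a cloud or virtualized environment, some hosts may also have noisier neighbors than others.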
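
To put a number on how long heartbeats have to be blocked before a node is dropped, here is a small sketch based on the `interval 150` and `timeout 20` values from the posted aerospike.conf. It assumes, as I understand the Aerospike configuration reference, that `interval` is in milliseconds and `timeout` is the number of consecutive missed heartbeat intervals; the function name is hypothetical.

```python
def heartbeat_window_ms(interval_ms: int, timeout_count: int) -> int:
    """Time without heartbeats after which a peer is considered gone.

    Assumes interval is in milliseconds and timeout is a count of
    consecutive missed intervals.
    """
    if interval_ms <= 0 or timeout_count <= 0:
        raise ValueError("interval and timeout must be positive")
    return interval_ms * timeout_count


if __name__ == "__main__":
    # Values taken from the heartbeat stanza in the question.
    window = heartbeat_window_ms(interval_ms=150, timeout_count=20)
    print(f"node is dropped after {window} ms without heartbeats")
```

Under those assumptions the window is 150 × 20 = 3000 ms, so a network stall of about three seconds on a node would be enough to produce the short departures described in the question.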

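The increase in the number of connections is a symptom: any slowdown on a node makes the client compensate by pushing more throughput, which increases the number of connections and can itself lead to a downward spiral.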
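
A connection spike like this can be flagged automatically by comparing each node against the cluster median. The sketch below is illustrative only: the node names and connection counts are hypothetical (only the 25000 figure and the `proto-fd-max 50000` limit come from the question), and the threshold factor of 3 is an arbitrary assumption.

```python
from statistics import median


def connection_outliers(conns, factor=3.0):
    """Return nodes whose client connection count exceeds factor x median."""
    typical = median(conns.values())
    return {node: n for node, n in conns.items() if n > factor * typical}


def fd_usage(connections: int, proto_fd_max: int) -> float:
    """Fraction of the configured proto-fd-max consumed by client connections."""
    return connections / proto_fd_max


if __name__ == "__main__":
    # Hypothetical per-node counts; only 25000 is taken from the question.
    sample = {"node-a": 6500, "node-b": 7000, "node-c": 6000, "node-d": 25000}
    for node, n in connection_outliers(sample).items():
        print(f"{node}: {n} connections, {fd_usage(n, 50000):.0%} of proto-fd-max")
```

A node flagged this way is more likely the slow one than the busy one, which matches the point above that the extra connections are a symptom rather than the cause.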

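Finally, a single node leaving or joining the cluster should not impact read transactions much. Check your policies and make sure socketTimeout, totalTimeout, maxRetries and related settings are configured so that a read can quickly retry against a different replica.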
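
To see why these settings have to be chosen together, here is a simplified model of a retry budget. It is a sketch under stated assumptions, not the client's actual algorithm: the `ReadPolicy` class and `attempts_that_fit` function are hypothetical, the field names only mirror the policy names mentioned above, every attempt is assumed to use up its full socket timeout, and the timeout values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPolicy:
    """Hypothetical container mirroring the policy names in the answer."""
    socket_timeout_ms: int          # per-attempt timeout
    total_timeout_ms: int           # budget for the whole transaction, 0 = no cap
    max_retries: int                # retries after the initial attempt
    sleep_between_retries_ms: int = 0


def attempts_that_fit(policy: ReadPolicy) -> int:
    """Count full attempts that fit in the total budget in the worst case,
    where every attempt runs until its socket timeout expires."""
    elapsed = 0
    attempts = 0
    for i in range(1 + policy.max_retries):
        if i > 0:
            elapsed += policy.sleep_between_retries_ms
        over_budget = (
            policy.total_timeout_ms > 0
            and elapsed + policy.socket_timeout_ms > policy.total_timeout_ms
        )
        if over_budget:
            break
        elapsed += policy.socket_timeout_ms
        attempts += 1
    return attempts


if __name__ == "__main__":
    # Illustrative values only, not recommendations.
    no_room = ReadPolicy(socket_timeout_ms=1000, total_timeout_ms=1000, max_retries=2)
    room = ReadPolicy(socket_timeout_ms=50, total_timeout_ms=200, max_retries=2,
                      sleep_between_retries_ms=10)
    print(attempts_that_fit(no_room))
    print(attempts_that_fit(room))
```

In this model the first policy never gets to retry, because a single stalled attempt consumes the whole budget, while the second leaves room for all three attempts. That is the practical meaning of letting a read retry quickly against a different replica: the per-attempt timeout has to be a fraction of the total timeout.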

This article can help with that last point: https://discuss.aerospike.com/t/understanding-timeout-and-retry-policies/2852/3

Votes: 2

Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/63909864