Galera cluster nodes all went down at once?

Database Administration user
Asked 2021-09-23 06:50:50
1 answer, 338 views, 0 followers, 0 votes

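Hello, my Galera cluster is currently having problems. It was running fine earlier, but suddenly all the nodes went down. I have 3 local nodes and 1 node in the cloud.

While it was going down, the logs showed the following: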

2021-09-23  8:24:24 1 [Note] WSREP: ================================================
View:
  id: d326832d-56e2-11eb-80c1-760504343273:9842007
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(3):
        0: 077cec27-1b75-11ec-842c-5f218e28b692, Strike
        1: 25c6e10d-1b75-11ec-9ff7-de60adb87197, unspecified
        2: 4e87b761-1b96-11ec-b561-272e3101cb38, Duel
=================================================
2021-09-23  8:24:24 1 [Note] WSREP: Non-primary view
2021-09-23  8:24:24 1 [Note] WSREP: Server status change connected -> connected
2021-09-23  8:24:24 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-09-23  8:24:24 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-09-23  8:24:25 0 [Note] WSREP: (077cec27-842c, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.10.10:4567 timed out, no messages seen in PT3S, socket stats: rtt: 1166000 rttvar: 583000 rto: 3498000 lost: 0 last_data_recv: 1835 cwnd: 10 last_queued_since: 1835197309 last_delivered_since: 1835197309 send_queue_length: 0 send_queue_bytes: 0
           **This happened for ~90 retries**
2021-09-23  8:29:42 0 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown
2021-09-23  8:29:42 0 [Note] WSREP: Shutdown replication
2021-09-23  8:29:42 0 [Note] WSREP: Server status change connected -> disconnecting
2021-09-23  8:29:42 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-09-23  8:29:42 0 [Note] WSREP: Closing send monitor...
2021-09-23  8:29:42 0 [Note] WSREP: Closed send monitor.
2021-09-23  8:29:42 0 [Note] WSREP: gcomm: terminating thread
2021-09-23  8:29:42 0 [Note] WSREP: gcomm: joining thread
2021-09-23  8:29:42 0 [Note] WSREP: gcomm: closing backend
2021-09-23  8:29:42 0 [Note] WSREP: view(view_id(NON_PRIM,077cec27-842c,167) memb {
        077cec27-842c,0
} joined {
} left {
} partitioned {
        25c6e10d-9ff7,0
        4e87b761-b561,0
        ed270aba-aedd,0
})
2021-09-23  8:29:42 0 [Note] WSREP: PC protocol downgrade 1 -> 0
2021-09-23  8:29:42 0 [Note] WSREP: view((empty))
2021-09-23  8:29:42 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.10.20:50904
2021-09-23  8:29:42 0 [Note] WSREP: gcomm: closed
2021-09-23  8:29:42 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2021-09-23  8:29:42 0 [Note] WSREP: Flow-control interval: [16, 16]
2021-09-23  8:29:42 0 [Note] WSREP: Received NON-PRIMARY.
2021-09-23  8:29:42 0 [Note] WSREP: New SELF-LEAVE.
2021-09-23  8:29:42 0 [Note] WSREP: Flow-control interval: [0, 0]
2021-09-23  8:29:42 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2021-09-23  8:29:42 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 9842007)
2021-09-23  8:29:42 0 [Note] WSREP: RECV thread exiting 0: Success
2021-09-23  8:29:42 6 [Note] WSREP: ================================================
View:
  id: d326832d-56e2-11eb-80c1-760504343273:9842007
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
        0: 077cec27-1b75-11ec-842c-5f218e28b692, Strike
=================================================

2021-09-23  8:29:42 6 [Note] WSREP: Non-primary view
2021-09-23  8:29:42 6 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-09-23  8:29:42 0 [Note] WSREP: recv_thread() joined.
2021-09-23  8:29:42 0 [Note] WSREP: Closing replication queue.
2021-09-23  8:29:42 0 [Note] WSREP: Closing slave action queue.
2021-09-23  8:29:42 6 [Note] WSREP: ================================================
View:
  id: d326832d-56e2-11eb-80c1-760504343273:9842007
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: yes
  own_index: -1
  members(0):
=================================================

The same thing happened on my second node:

2021-09-23  8:31:01 7 [Note] WSREP: ================================================
View:
  id: d326832d-56e2-11eb-80c1-760504343273:9842007
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
        0: 25c6e10d-1b75-11ec-9ff7-de60adb87197, Aegis
=================================================
2021-09-23  8:31:01 7 [Note] WSREP: Non-primary view
2021-09-23  8:31:01 7 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-09-23  8:31:01 0 [Note] WSREP: recv_thread() joined.
2021-09-23  8:31:01 0 [Note] WSREP: Closing replication queue.
2021-09-23  8:31:01 0 [Note] WSREP: Closing slave action queue.
2021-09-23  8:31:01 7 [Note] WSREP: ================================================
View:
  id: d326832d-56e2-11eb-80c1-760504343273:9842007
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: yes
  own_index: -1
  members(0):
=================================================

Third node:

2021-09-23  8:30:59 2 [Note] WSREP: ================================================
View:
  id: d326832d-56e2-11eb-80c1-760504343273:9842007
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
        0: 4e87b761-1b96-11ec-b561-272e3101cb38, Duel
=================================================
2021-09-23  8:30:59 2 [Note] WSREP: Non-primary view
2021-09-23  8:30:59 2 [Note] WSREP: Server status change connected -> connected
2021-09-23  8:30:59 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-09-23  8:30:59 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-09-23  8:31:03 0 [Note] WSREP: (4e87b761-b561, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.10.10:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 385031380 cwnd: 1 last_queued_since: 385331379302597 last_delivered_since: 385331379302597 send_queue_length: 0 send_queue_bytes: 0

2021-09-23  8:31:04 0 [Note] WSREP: (4e87b761-b561, 'tcp://0.0.0.0:4567') reconnecting to ed270aba-aedd (tcp://10.10.10.10:4567), attempt 90
2021-09-23  8:31:05 0 [Note] WSREP:  cleaning up 25c6e10d-9ff7 (tcp://10.10.10.20:4567)
2021-09-23  8:31:08 0 [Note] WSREP: (4e87b761-b561, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.10.41:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 385036380 cwnd: 1 last_queued_since: 385336379962010 last_delivered_since: 385336379962010 send_queue_length: 0 send_queue_bytes: 0


2021-09-23  8:31:49 2 [Note] WSREP: ================================================
    View:
      id: d326832d-56e2-11eb-80c1-760504343273:9842007
      status: non-primary
      protocol_version: 4
      capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
      final: yes
      own_index: -1
      members(0):
    =================================================

As for my cloud node, this is what happened:

2021-09-23  8:29:07 0 [Note] WSREP: (ed270aba-aedd, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.10.30:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 684592357 cwnd: 1 last_queued_since: 684892356898817 last_delivered_since: 684892356898817 send_queue_length: 0 send_queue_bytes: 0
2021-09-23  8:29:07 0 [Note] WSREP: (ed270aba-aedd, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.10.40:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 684592358 cwnd: 1 last_queued_since: 684892357027577 last_delivered_since: 684892357027577 send_queue_length: 0 send_queue_bytes: 0
2021-09-23  8:29:09 0 [Note] WSREP: (ed270aba-aedd, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.10.20:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 684593857 cwnd: 1 last_queued_since: 684893856941066 last_delivered_since: 684893856941066 send_queue_length: 0 send_queue_bytes: 0
2021-09-23  8:32:28 0 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown
2021-09-23  8:32:28 0 [Note] WSREP: Shutdown replication
2021-09-23  8:32:28 0 [Note] WSREP: Server status change connected -> disconnecting
2021-09-23  8:32:28 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-09-23  8:32:28 0 [Note] WSREP: Closing send monitor...
2021-09-23  8:32:28 0 [Note] WSREP: Closed send monitor.
2021-09-23  8:32:28 0 [Note] WSREP: gcomm: terminating thread
2021-09-23  8:32:28 0 [Note] WSREP: gcomm: joining thread
2021-09-23  8:32:28 0 [Note] WSREP: gcomm: closing backend
2021-09-23  8:32:28 0 [Note] WSREP: PC protocol downgrade 1 -> 0
2021-09-23  8:32:28 0 [Note] WSREP: view((empty))
2021-09-23  8:32:28 0 [Note] WSREP: gcomm: closed
2021-09-23  8:32:28 0 [Note] WSREP: New SELF-LEAVE.
2021-09-23  8:32:28 0 [Note] WSREP: Flow-control interval: [0, 0]
2021-09-23  8:32:28 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2021-09-23  8:32:28 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 9842007)
2021-09-23  8:32:28 0 [Note] WSREP: RECV thread exiting 0: Success
2021-09-23  8:32:28 0 [Note] WSREP: recv_thread() joined.
2021-09-23  8:32:28 0 [Note] WSREP: Closing replication queue.
2021-09-23  8:32:28 0 [Note] WSREP: Closing slave action queue.
2021-09-23  8:32:28 2 [Note] WSREP: ================================================
View:
  id: d326832d-56e2-11eb-80c1-760504343273:9842007
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: yes
  own_index: -1
  members(0):
=================================================

I have the following configuration:

Node 1:

[galera] # Mandatory settings
wsrep_on=ON
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so

#add your node ips here
wsrep_cluster_address="gcomm://strike,aegis,duel,clone"
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
#Cluster name
wsrep_cluster_name="cloud_test_cluster"
# Allow server to accept connections on all interfaces.

bind-address=0.0.0.0

# this server ip, change for each server
wsrep_node_address="strike"
# this server name, change for each server
wsrep_node_name="Strike"

wsrep_sst_method=rsync

wsrep_sst_donor="Aegis,Duel"

Node 2 is the same as above, except:

wsrep_sst_donor="Strike,Duel"

Node 3:

wsrep_sst_donor="Strike,Aegis"

And finally, the cloud node:

wsrep_sst_donor="Duel,Aegis,Strike"

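All of this happened at almost the same time. Did the nodes simply lose contact with each other? Did losing contact with 10.10.10 cause the crash? Why did the number of members go down? This happened both yesterday and today. I set this cluster up last weekend, and there were no problems until yesterday.

Can someone explain to me what happened?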


1 Answer

Database Administration user

Answered 2021-09-24 00:10:57

Your Strike node:

2021-09-23  8:29:42 0 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown

So it looks like the service was simply sent a SIGTERM to shut it down.

The log entries from the other nodes come after Strike shut down.
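One way to check that ordering is to line up one entry per node on a single timeline. The following is a minimal sketch, assuming the four log lines quoted in the question are representative; the node labels and the helper names `parse_timestamp` and `timeline` are illustrative, not part of MariaDB or Galera.

```python
from datetime import datetime

# Log lines copied from the question; the node labels are added for illustration.
EVENTS = [
    ("Strike", "2021-09-23  8:29:42 0 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown"),
    ("Aegis", "2021-09-23  8:31:01 7 [Note] WSREP: Non-primary view"),
    ("Duel", "2021-09-23  8:30:59 2 [Note] WSREP: Non-primary view"),
    ("cloud", "2021-09-23  8:32:28 0 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown"),
]


def parse_timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD H:MM:SS' stamp of an error-log line."""
    # split() collapses the double space that pads single-digit hours.
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")


def timeline(events):
    """Return (timestamp, node, message) tuples sorted by timestamp."""
    rows = []
    for node, line in events:
        message = line.split("[Note]", 1)[1].strip()
        rows.append((parse_timestamp(line), node, message))
    return sorted(rows)


if __name__ == "__main__":
    for stamp, node, message in timeline(EVENTS):
        print(stamp.time(), node, message)
```

Sorted this way, Strike's shutdown at 8:29:42 comes first, followed by the quoted entries from Duel, Aegis, and the cloud node.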

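Note that all of the nodes are in a non-primary state. If you are starting up after a power failure, see the documentation on restarting the cluster: one of the nodes has to be manually designated as the new primary.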
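A node reports non-primary when the group it can still reach does not hold quorum. As a rough illustration, here is a simplified sketch of the majority rule, assuming four equally weighted nodes as in the question's `wsrep_cluster_address`. The function name `has_quorum` is hypothetical, and Galera's real calculation also accounts for node weights and for nodes that left gracefully.

```python
def has_quorum(reachable_weight: int, previous_primary_weight: int) -> bool:
    """Simplified rule: a component stays primary only if it holds a strict
    majority of the weight of the previous primary component."""
    return 2 * reachable_weight > previous_primary_weight


if __name__ == "__main__":
    total = 4  # assumption: four nodes, each with the default weight of 1
    for reachable in range(1, total + 1):
        state = "primary" if has_quorum(reachable, total) else "non-primary"
        print(f"{reachable} of {total} nodes reachable: {state}")
```

Under this rule a node left on its own, or an even 2-2 split of four nodes, cannot form a primary component, so a cluster in that state stays non-primary until someone bootstraps it by hand.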

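This involves identifying the most up-to-date node (singular) and running galera_recovery on that node only. Then start the other nodes.

Perhaps Strike was terminated because it did not reach primary state quickly enough.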
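To compare how far each node got, one option is to read the `seqno` that Galera saves in the `grastate.dat` file in each node's data directory. The sketch below assumes the file contents have already been collected from every node. The `seqno` values are made up for illustration, and `parse_grastate` and `most_advanced` are hypothetical helper names.

```python
def parse_grastate(text: str) -> dict:
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    state = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def most_advanced(files: dict) -> str:
    """Return the name of the node whose saved seqno is highest."""
    seqnos = {}
    for node, text in files.items():
        seqno = int(parse_grastate(text)["seqno"])
        if seqno < 0:
            # -1 means no position was saved (unclean stop); recover it first.
            raise ValueError(f"{node}: seqno is {seqno}, recover the position first")
        seqnos[node] = seqno
    return max(seqnos, key=seqnos.get)


# Hypothetical file contents: the uuid is the one from the question's logs,
# the seqno values are invented for this example.
TEMPLATE = """# GALERA saved state
version: 2.1
uuid:    d326832d-56e2-11eb-80c1-760504343273
seqno:   {seqno}
safe_to_bootstrap: 0
"""

if __name__ == "__main__":
    files = {
        "Strike": TEMPLATE.format(seqno=9842007),
        "Aegis": TEMPLATE.format(seqno=9842005),
        "Duel": TEMPLATE.format(seqno=9842001),
    }
    print(most_advanced(files))
```

The sketch raises an error rather than guessing when a node reports `seqno: -1`, because that value means the position was not written out at shutdown and has to be recovered before the nodes can be compared.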

Votes: 2
Original page content provided by Database Administration.
Original link:

https://dba.stackexchange.com/questions/300022
