tl;博士
当启动由3个kubernetes荚组成的新percona集群时,grastate.dat seq_no设置为-1,不会改变。在删除一个吊舱并看着它重新启动,期望它重新加入集群时,它会将它的位置设置为00000000-0000-0000-0000-000000000000:-1,并尝试连接到自己(它是以前的ip),可能是因为它是集群中的第一个?然后,它在与自身的错误连接中超时:
2017-03-26T08:38:05.374058Z 0 [Note] WSREP: (b7571ff8, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S集群没有正确启动,我无法成功地重新启动集群中的豆荚。
全
当我从头开始集群的时候。有了空白的数据目录和一个新的etcd集群,一切似乎都出现了。然而,我看了看grastate.dat,我发现每个荚的seq_no是-1
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/grastate.dat
# GALERA saved state
version: 2.1
uuid: a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno: -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/grastate.dat
# GALERA saved state
version: 2.1
uuid: a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno: -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/grastate.dat
# GALERA saved state
version: 2.1
uuid: a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno: -1
safe_to_bootstrap: 0在这一点上,我可以做mysql -h percona -u wordpress -p和连接和wordpress也工作。
场景:我有3个percona吊舱
/ # jonathan@ubuntu:~/Projects/k8wp$ kubectl get pods
NAME READY STATUS RESTARTS AGE
etcd-0 1/1 Running 1 12h
etcd-1 1/1 Running 0 12h
etcd-2 1/1 Running 3 12h
etcd-3 1/1 Running 1 12h
percona-0 1/1 Running 0 8m
percona-1 1/1 Running 0 57m
percona-2 1/1 Running 0 57m当我尝试重新启动percona-0时,它会在重新启动时被踢出集群,percona-0的gvwstate.dat文件显示。
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/gvwstate.dat
my_uuid: b7571ff8-11f8-11e7-bd2d-8b50487e1523
#vwbeg
view_id: 3 b7571ff8-11f8-11e7-bd2d-8b50487e1523 3
bootstrap: 0
member: b7571ff8-11f8-11e7-bd2d-8b50487e1523 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend集群中的其他2个豆荚显示:
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/gvwstate.dat
my_uuid: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a
#vwbeg
view_id: 3 bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 4
bootstrap: 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/gvwstate.dat
my_uuid: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a
#vwbeg
view_id: 3 bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 4
bootstrap: 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend下面是我认为来自Percona-0的启动的相关错误:
2017-03-26T08:37:58.370605Z 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2017-03-26T08:37:58.372537Z 0 [Note] WSREP: gcomm: connecting to group 'wordpress-001', peer '10.52.0.26:'
2017-03-26T08:38:01.373345Z 0 [Note] WSREP: (b7571ff8, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
2017-03-26T08:38:01.373682Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-03-26T08:38:01.373750Z 0 [Note] WSREP: view(view_id(NON_PRIM,b7571ff8,5) memb {
b7571ff8,0
} joined {
} left {
} partitioned {
})
2017-03-26T08:38:01.373838Z 0 [Note] WSREP: gcomm: connected
2017-03-26T08:38:01.373872Z 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2017-03-26T08:38:01.373987Z 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2017-03-26T08:38:01.374012Z 0 [Note] WSREP: Opened channel 'wordpress-001'
2017-03-26T08:38:01.374108Z 0 [Note] WSREP: Waiting for SST to complete.
2017-03-26T08:38:01.374417Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2017-03-26T08:38:01.374469Z 0 [Note] WSREP: Flow-control interval: [16, 16]
2017-03-26T08:38:01.374491Z 0 [Note] WSREP: Received NON-PRIMARY.
2017-03-26T08:38:01.374560Z 1 [Note] WSREP: New cluster view: global state: :-1, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version -1它试图连接到10.52.0.26的ip实际上是以前的ip,下面是删除percona-0之前在etcd中列出的密钥列表
/ # etcdctl ls --recursive
/pxc-cluster
/pxc-cluster/wordpress
/pxc-cluster/queue
/pxc-cluster/queue/wordpress
/pxc-cluster/queue/wordpress-001
/pxc-cluster/wordpress-001
/pxc-cluster/wordpress-001/10.52.1.46
/pxc-cluster/wordpress-001/10.52.1.46/ipaddr
/pxc-cluster/wordpress-001/10.52.1.46/hostname
/pxc-cluster/wordpress-001/10.52.2.33
/pxc-cluster/wordpress-001/10.52.2.33/ipaddr
/pxc-cluster/wordpress-001/10.52.2.33/hostname
/pxc-cluster/wordpress-001/10.52.0.26
/pxc-cluster/wordpress-001/10.52.0.26/hostname
/pxc-cluster/wordpress-001/10.52.0.26/ipaddr在kubectl之后删除pods/percona-0:
/ # etcdctl ls --recursive
/pxc-cluster
/pxc-cluster/queue
/pxc-cluster/queue/wordpress
/pxc-cluster/queue/wordpress-001
/pxc-cluster/wordpress-001
/pxc-cluster/wordpress-001/10.52.1.46
/pxc-cluster/wordpress-001/10.52.1.46/ipaddr
/pxc-cluster/wordpress-001/10.52.1.46/hostname
/pxc-cluster/wordpress-001/10.52.2.33
/pxc-cluster/wordpress-001/10.52.2.33/ipaddr
/pxc-cluster/wordpress-001/10.52.2.33/hostname
/pxc-cluster/wordpress此外,在重新启动期间,percona-0试图注册到etcd:
{"action":"create","node":{"key":"/pxc-cluster/queue/wordpress-001/00000000000000009886","value":"10.52.0.27","expiration":"2017-03-26T08:38:57.980325718Z","ttl":60,"modifiedIndex":9886,"createdIndex":9886}}
{"action":"set","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27/ipaddr","value":"10.52.0.27","expiration":"2017-03-26T08:38:28.01814818Z","ttl":30,"modifiedIndex":9887,"createdIndex":9887}}
{"action":"set","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27/hostname","value":"percona-0","expiration":"2017-03-26T08:38:28.037188157Z","ttl":30,"modifiedIndex":9888,"createdIndex":9888}}
{"action":"update","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27","dir":true,"expiration":"2017-03-26T08:38:28.054726795Z","ttl":30,"modifiedIndex":9889,"createdIndex":9887},"prevNode":{"key":"/pxc-cluster/wordpress-001/10.52.0.27","dir":true,"modifiedIndex":9887,"createdIndex":9887}}这不起作用。
来自集群percona-1的第二个成员
2017-03-26T08:37:44.069583Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.52.0.26:4567
2017-03-26T08:37:45.069756Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') reconnecting to b7571ff8 (tcp://10.52.0.26:4567), attempt 0
2017-03-26T08:37:48.570332Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
2017-03-26T08:37:49.605089Z 0 [Note] WSREP: evs::proto(bd05a643, GATHER, view_id(REG,b7571ff8,3)) suspecting node: b7571ff8
2017-03-26T08:37:49.605276Z 0 [Note] WSREP: evs::proto(bd05a643, GATHER, view_id(REG,b7571ff8,3)) suspected node without join message, declaring inactive
2017-03-26T08:37:50.104676Z 0 [Note] WSREP: declaring c33d6a73 at tcp://10.52.2.33:4567 stable新信息:我又重新启动了percona-0,这一次它突然出现了!经过几次尝试,我意识到吊舱需要重新启动两次才能出现,即第一次删除后,就会产生上述错误,第二次删除后就可以了,并与其他成员同步。这是因为它是星系团中的第一个吊舱吗?
我测试过删除其他豆荚,但它们都恢复正常。
问题只在percona-0。
同时,如果我的节点要崩溃,那么所有的豆荚都会被一次取下来,这就是豆荚根本不恢复的情况!我怀疑这是因为没有将状态保存到grastate.dat,即seq_no保持-1,即使全局id可能更改,豆荚在mysqld关闭时退出,以及以下错误:
jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-2 | grep ERROR
2017-03-26T11:20:25.795085Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:25.795276Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:25.795544Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.2.36': -110 (Connection timed out)
2017-03-26T11:20:25.795618Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:25.795645Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.2.36) failed: 7
2017-03-26T11:20:25.795693Z 0 [ERROR] Aborting
jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-1 | grep ERROR
2017-03-26T11:20:27.093780Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:27.093977Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:27.094145Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.1.49': -110 (Connection timed out)
2017-03-26T11:20:27.094200Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:27.094227Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.1.49) failed: 7
2017-03-26T11:20:27.094247Z 0 [ERROR] Aborting
jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-0 | grep ERROR
2017-03-26T11:20:52.040214Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:52.040279Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:52.040385Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.2.36': -110 (Connection timed out)
2017-03-26T11:20:52.040437Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:52.040471Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.2.36) failed: 7
2017-03-26T11:20:52.040508Z 0 [ERROR] Aborting删除所有豆荚的grastate.dat:
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/grastate.dat
# GALERA saved state
version: 2.1
uuid: a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno: -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/grastate.dat
# GALERA saved state
version: 2.1
uuid: a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno: -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/grastate.dat
# GALERA saved state
version: 2.1
uuid: a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno: -1
safe_to_bootstrap: 0不,gvwstate.dat
发布于 2017-03-26 13:13:47
通过将容器中的入口点更改为以下脚本来修正它:
#!/bin/bash
sed -i \"s|safe_to_bootstrap.*:.*|safe_to_bootstrap:1|1\" /var/lib/mysql/grastate.dat;
/entrypoint.sh --wsrep-new-cluster;问题是,当从崩溃中重新启动3个吊舱时,它们都会遇到以下错误:
[ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)这意味着(从链接中总结),因为所有的荚都是向下的,第一个荚(荚由一个状态集管理)出现并试图重新连接到集群,但是没有找到它可以连接到的任何其他的荚,所以它会下降,下一个荚会尝试相同的事情,点击相同的错误,然后继续下去等等。
解决方案是,当新集群出现时,第一个吊舱启动一个新集群,然后所有后续的集群都会出现,并找到一个连接到的节点。它仍然会得到所有的数据。
因此,对于percona,码头容器的入口点如下所示:
exec mysqld --user=mysql --wsrep_cluster_name=$CLUSTER_NAME --wsrep_cluster_address="gcomm://$cluster_join" --wsrep_sst_method=xtrabackup-v2 --wsrep_sst_auth="xtrabackup:$XTRABACKUP_PASSWORD" --log-error=${DATADIR}error.log $CMDARG因此,要运行安装程序,我所要做的就是将前面的参数--wsrep-new-cluster传递给/entrypoint.sh文件,如下所示:
/entrypoint.sh --wsrep-new-clusterPS// --我一开始单独尝试了上面的内容,但是我遇到了一个错误,说明要强制一个新的集群并使用该节点引导,我必须在/var/lib/mysql/grastate.dat中将safe_to_bootstrap从0设置为1
https://stackoverflow.com/questions/43027043
复制相似问题