我们正在一个组复制多主安装程序中运行三个MySQL服务器.
下面是我对GR组的一个成员的配置:
# General replication settings
gtid_mode = ON
enforce_gtid_consistency = ON
master_info_repository = TABLE
relay_log_info_repository = TABLE
binlog_checksum = NONE
log_slave_updates = ON
log_bin = binlog
binlog_format = ROW
transaction_write_set_extraction = XXHASH64
loose-group_replication_bootstrap_group = OFF
loose-group_replication_start_on_boot = OFF
loose-group_replication_ssl_mode = REQUIRED
loose-group_replication_recovery_use_ssl = 1
# Shared replication group configuration
loose-group_replication_group_name = "3a3e0101-83a2-11ea-922f-000c29390e7a"
loose-group_replication_ip_whitelist = "10.0.1.XX,10.0.1.XX,10.0.1.XX"
loose-group_replication_group_seeds = "10.0.1.XX:33061,10.0.1.XX:33061,10.0.1.XX:33061"
# Single or Multi-primary mode? Uncomment these two lines
# for multi-primary mode, where any host can accept writes
loose-group_replication_single_primary_mode = OFF
loose-group_replication_enforce_update_everywhere_checks = ON(IPs混淆)
一切都很顺利直到..。我从另一个MySQL (不是GR成员)加载了一个smiple SQL文件到MySQL_1 (GR中的三个文件之一)。加载的文件是用'mysqlpump set-gtid-清除=Off‘创建的。
在这里,该SQL文件的内容:
-- Dump created by MySQL pump utility, version: 8.0.11, FreeBSD11.1 (amd64)
-- Dump start time: Tue Jun 9 12:18:03 2020
-- Server version: 8.0.11
SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE;
SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO";
SET @@SESSION.SQL_LOG_BIN= 0;
SET @OLD_TIME_ZONE=@@TIME_ZONE;
SET TIME_ZONE='+00:00';
SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT;
SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS;
SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION;
SET NAMES utf8mb4;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `massmail` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_bin */;
CREATE TABLE `massmail`.`emails` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`ext_id` bigint(20) DEFAULT NULL,
`email` varchar(256) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`sent` datetime DEFAULT NULL,
`bounced` datetime DEFAULT NULL,
`opened` datetime DEFAULT NULL,
`unsubscribed` datetime DEFAULT NULL,
`CTA` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
;
INSERT INTO `massmail`.`emails` VALUES (1,NULL,"user1@gmail.com","2020-06-08 16:54:56",NULL,"2020-06-08 17:55:23","2020-06-08 17:55:34","2020-06-08 17:55:47"),(2,NULL,"user2@gmail.com","2020-06-07 16:40:13",NULL,NULL,NULL,NULL),(3,NULL,"user3@gmail.com","2020-06-08 16:57:56",NULL,"2020-06-08 23:54:21","2020-06-08 18:01:01","2020-06-08 18:01:51"),(4,NULL,"user4@gmail.com","2020-06-07 19:04:13",NULL,"2020-06-08 12:05:25","2020-06-08 01:42:53",NULL),(5,NULL,"user5@gmail.com","2020-06-07 16:58:12",NULL,"2020-06-08 11:12:17","2020-06-08 11:03:38",NULL),(6,NULL,"user6@gmail.com","2020-06-07 17:04:13",NULL,"2020-06-08 11:01:03",NULL,NULL),(7,NULL,"notexistingmustbounce@nohererererer.com","2020-06-08 16:59:56","2020-06-08 17:59:59",NULL,NULL,NULL),(8,NULL,"user7@gmail.com","2020-06-07 17:16:12",NULL,"2020-06-08 00:56:23",NULL,NULL),(9,NULL,"user8@gmail.com","2020-06-08 18:03:56",NULL,NULL,NULL,NULL),(10,NULL,"user9@gmail.com","2020-06-07 17:28:12",NULL,NULL,NULL,NULL);
SET TIME_ZONE=@OLD_TIME_ZONE;
SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT;
SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS;
SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION;
SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS;
SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS;
SET SQL_MODE=@OLD_SQL_MODE;(真正的电子邮件被混淆了)
我还修改了mysql.db表,并插入了一个现有DB用户可以访问新表的行,并且不使用GRANT就刷新了特权(所有这些都是在MySQL_1上完成的)。
在此操作之后,MySQL_2和MySQL_3进入错误模式。
在日志中,我发现了以下内容:
2020-06-09T10:44:03.769050Z 996 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_recovery': Could not execute Update_rows event on table massmail.emails; Can't find record in 'emails', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000004, end_log_pos 162167168, Error_code: MY-001032
2020-06-09T10:44:03.769087Z 996 [Warning] [MY-010584] [Repl] Slave: Can't find record in 'emails' Error_code: MY-001032
2020-06-09T10:44:03.769140Z 996 [ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000004' position 162166781我研究了上面的错误,发现了以下线程:一些MySQL从错误?,但是它是用于主从复制的,我找不到类似于GR多主复制的东西。
这是马克斯·弗农写的,你可以阻止奴隶,跳过计数器。所以我查看了最新的二进制日志MySQL_1和MySQL_2,我可能找到了所有中断的地方.
现在我有几个问题:
mysql> stop slave;
mysql> set global sql_slave_skip_counter=1;
mysql> show slave status \G;例如,“显示从状态”给我一个空的响应集..。(因为我们使用GR,而不是主从复制).
这个解决方案不是我喜欢的,因为我想了解更多关于GR和如何解决多主GR设置中的问题。作为最后的手段,我将每天备份(对VM),我可以回滚,所以这个问题更多的是哲学性质.
谢谢你的支持,原谅我的愚蠢,GR对我来说是很新的( MySQL也是我生活中的一部分)。
斯托尼
发布于 2020-06-10 09:19:29
所以今天我做了更多的研究,可以解决这个问题。
所以我的问题的答案是:
以下是一个简短的如何:
确定问题发生的确切位置(必须在进入错误模式的主母版上执行):
less /var/db/mysql/MySQL_2.err这将给出失败的GTID:
2020-06-09T10:44:03.769050Z 996 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_recovery': Could not execute Update_rows event on table massmail.emails; Can't find record in 'emails', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000004, end_log_pos 162167168, Error_code: MY-001032
2020-06-09T10:44:03.769087Z 996 [Warning] [MY-010584] [Repl] Slave: Can't find record in 'emails' Error_code: MY-001032
2020-06-09T10:44:03.769140Z 996 [ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000004' position 162166781如您所见,错误发生在以下地方: binlog.000004,end_log_pos 162167168
在mysql中立即执行
SHOW BINLOG EVENTS IN 'binlog.000004';用pos 162167168搜索语句。(在我的案件中,这是一个承诺)。向上滚动并搜索最近的“Set@@SESSION.GTID_NEXT”语句。这将给您已失败的GTID。(在我的例子中是'3a3e0101-83a2-11ea-922f-000c29390e7a:46648')
在插入该语句的主母版(在我的示例中是MySQL_1)上,您可以在系统终端中执行以下操作:
mysqlbinlog -v --include-gtids='3a3e0101-83a2-11ea-922f-000c29390e7a:46648' /var/db/mysql/binlog.000004 这甚至会显示导致问题的查询。
您还可以在mysql中(再次在错误节点(MySQL_2)上)执行以下操作:
SHOW MASTER STATUS;产出:
root@localhost [(none)]> SHOW SESSION VARIABLES LIKE 'GTID_EXECUTED';
+---------------+----------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+----------------------------------------------------------------------------------------------------------------------+
| gtid_executed | 3a3e0101-83a2-11ea-922f-000c29390e7a:1-46647:1000007-1009647:2000007-2000010,
bbb7e668-8324-11ea-aada-000c29390e7a:1 |
+---------------+----------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)在这里,您还可以看到执行在46647停止,这与46648导致问题的信息是一致的。
现在再次在error节点(MySQL_2)上执行以下操作,使用该id创建一个有效的空GTID:
STOP GROUP_REPLICATION; # just to make sure it's down!
SET GTID_NEXT="3a3e0101-83a2-11ea-922f-000c29390e7a:46648";
BEGIN; COMMIT;
SET GTID_NEXT="AUTOMATIC";之后,我用'START GROUP_REPLICATION;‘重新启动了组复制,结果发现另一个行导致了问题。我需要再次重复这个分析和修复过程(也是在第三个坏掉的节点MySQL_3上),然后一切又恢复正常。
在损坏的GTID之间还有其他GTID(必须复制到所有实例的有效GTID),并且在对所有实例的修复后检查这些有效的(在我的情况下是插入)已经复制到所有服务器。所以我没有丢失任何数据。
我希望这能帮别人节省几个小时的工作..。
https://dba.stackexchange.com/questions/268801
复制相似问题