我们首先使用功能较弱的服务器来建立MariaDB复制,但是考虑到我们增加的数据库负载,我们不得不租用更强大的服务器,这时问题就开始了。
现在,主从之间的复制有时会延迟在0到1000秒之间(它正在缓慢上升到1000秒,大约是两秒钟),原因我们还无法确定。这种情况平均每天发生十几次。考虑到我公司业务的性质,我们根本承受不起任何滞后。
主处理器有6核Intel CPU,128 on内存,1TB SSD在raid 1配置中,平均负载略高于2,在Ubuntu20.04LTS上运行MariaDB 10.4.2x。
在过去的大约18个小时里,日志中只有两个与复制相关的错误/警告:
Oct 21 12:00:35 master.com mysqld[808]: 2022-10-21 12:00:35 5677386 [Warning] Aborted connection 5677386 to db: 'unconnected' user: 'replication' host: '1.2.3.4' (Got an error writing communication packets)
Oct 21 12:03:46 master.com mysqld[808]: 2022-10-21 12:03:46 5780035 [Warning] Aborted connection 5780035 to db: 'unconnected' user: 'replication' host: '1.2.3.4' (Got an error writing communication packets)从站具有完全相同的HW配置,负载平均值接近1。
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: master.com
Master_User: replication
Master_Port: 3306
Connect_Retry: 5
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 133718212
Relay_Log_Space: 133718872
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Seconds_Behind_Master: 0
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 105
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-105-9982390796
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 2226622服务器之间有1Gbit的连接,几乎是完美的。ping -f报告的丢失数据包不到0.002%:
root@master # ping -f slave.com
PING slave.com 56(84) bytes of data.
.^C
--- slave.com ping statistics ---
71492 packets transmitted, 71491 received, 0.00139876% packet loss, time 25747ms
rtt min/avg/max/mdev = 0.239/0.339/1.941/0.066 ms, ipg/ewma 0.360/0.335 ms我在互联网上搜索过这个问题,但在试图理解问题的根本原因或如何调试问题方面,我没有找到任何令人满意的答案。
现在,它正在按预期工作,但早些时候,当我们遇到这个问题时,我在主人身上看到了这一点:
$ SELECT * FROM information_schema.processlist WHERE User="replication"\G;
USER: replication
HOST: slave.com:47200
DB: NULL
COMMAND: Binlog Dump
TIME: 2041
STATE: Master has sent all binlog to slave; waiting for binlog to be updated我不知道该怎么做。为什么“它把所有的二进制日志都发送给奴隶了”?为什么这个时候的滞后还在增加呢?有什么坚持吗?为什么没有将主SQL查询发送给从服务器?
截至目前,世界协调时间2022-11-01时10:48,滞后正在缓慢增加:
MariaDB [(none)]> SHOW SLAVE STATUS\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: master.com
Master_User: replication
Master_Port: 3306
Connect_Retry: 5
Master_Log_File: mariadb-bin.002061
Read_Master_Log_Pos: 395606907
Relay_Log_File: mysqld-relay-bin.000253
Relay_Log_Pos: 340046809
Relay_Master_Log_File: mariadb-bin.002061
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 340046508
Relay_Log_Space: 395607567
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 389
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 105
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-105-10062038390
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Closing tables
Slave_DDL_Groups: 8
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 81798798
1 row in set (0.000 sec)
ERROR: No query specified
MariaDB [(none)]> SHOW FULL PROCESSLIST\G;
*************************** 1. row ***************************
Id: 1
User: system user
Host:
db: NULL
Command: Daemon
Time: NULL
State: InnoDB purge coordinator
Info: NULL
Progress: 0.000
*************************** 2. row ***************************
Id: 2
User: system user
Host:
db: NULL
Command: Daemon
Time: NULL
State: InnoDB purge worker
Info: NULL
Progress: 0.000
*************************** 3. row ***************************
Id: 3
User: system user
Host:
db: NULL
Command: Daemon
Time: NULL
State: InnoDB purge worker
Info: NULL
Progress: 0.000
*************************** 4. row ***************************
Id: 4
User: system user
Host:
db: NULL
Command: Daemon
Time: NULL
State: InnoDB purge worker
Info: NULL
Progress: 0.000
*************************** 5. row ***************************
Id: 5
User: system user
Host:
db: NULL
Command: Daemon
Time: NULL
State: InnoDB shutdown handler
Info: NULL
Progress: 0.000
*************************** 6. row ***************************
Id: 11
User: system user
Host:
db: NULL
Command: Slave_IO
Time: 945669
State: Waiting for master to send event
Info: NULL
Progress: 0.000
*************************** 7. row ***************************
Id: 12
User: system user
Host:
db: NULL
Command: Slave_SQL
Time: 0
State: Closing tables
Info: NULL
Progress: 0.000
*************************** 8. row ***************************
Id: 50
User: replication
Host: XX.XX.XX.XXX:52970
db: NULL
Command: Binlog Dump
Time: 945656
State: Master has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
Progress: 0.000
*************************** 9. row ***************************
Id: 1244933
User: replication
Host: XX.XX.XX.XX:54514
db: NULL
Command: Binlog Dump
Time: 83210
State: Master has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
Progress: 0.000
*************************** 10. row ***************************
Id: 1372686
User: root
Host: localhost
db: NULL
Command: Query
Time: 0
State: Init
Info: SHOW FULL PROCESSLIST
Progress: 0.000
10 rows in set (0.000 sec)
ERROR: No query specified
MariaDB [(none)]>
MySQL on localhost (10.4.26) load 1.08 1.31 1.29 1/388 427200 up 10+22:42:50 [10:47:22]
Queries: 34.0 qps: 0 Slow: 0.0 Se/In/Up/De(%): 00/00/00/00
Sorts: 0 qps now: 1 Slow qps: 0.0 Threads: 3 ( 1/ 1) 00/00/00/00
Key Efficiency: 100.0% Bps in/out: 0.0/ 0.2 Now in/out: 21.2/ 3.8k
Replication IO:Yes SQL:Yes Delay: 464 sec.
Id User Host/IP DB Time Cmd State Query
-- ---- ------- -- ---- --- ----- ----------
11 system us 945768 Slave_ Waiting
50 replicati slave2 945756 Binlog Master h
1244933 replicati slave3 83310 Binlog Master h
1 system us 0 Daemon InnoDB p
2 system us 0 Daemon InnoDB p
3 system us 0 Daemon InnoDB p
4 system us 0 Daemon InnoDB p
5 system us 0 Daemon InnoDB s
12 system us 0 Slave_ Closing
1372858 root localhost 0 Query Init show full processlist 请给我建议。
发布于 2023-01-05 18:55:39
原来这是MariaDB的一个已知的bug/特性。
https://mariadb.com/resources/blog/goodbye-replication-lag/
要解决这个问题,在所有充当从服务器的DB实例上,必须以这种方式将slave_parallel_threads变量从默认值1增加到更大的值:
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_threads = 4;
START SLAVE SQL_THREAD;
SELECT @@slave_parallel_threads;不幸的是,此变量仅为运行时,需要在每次重新启动/重新启动MariaDB服务器后重置。
发布于 2023-05-10 06:22:31
将其放入配置中以使其永久化:
slave_parallel_threads = 4https://dba.stackexchange.com/questions/319051
复制相似问题