大家都知道DBA就像是消(背)防(锅)员(侠),因为前端应用还有开发上线的新版本都会影响到位于最底层的数据库,前方稍微有些风吹草动,就能反应在数据库的性能上。但是有的时候SQL不仅能决定数据库的性能,还能决定数据库的生死,今天的案例是开发的一条SQL,引起的MySQL数据库主从复制报错,vip漂移问题。
今天下午有客户反馈发消息有点慢,手机收到几条报警:
nagios报警如下大致意思就是vip发生了漂移
**** Nagios ***** Notification Type: PROBLEM Service: check_ha_mysqld Host: prod-bjuc-mysql1-vip Address: 192.168.87.74 State: CRITICAL Date/Time: Wed Mar 28 14:39:27 CST 2018 Additional Info: CRITICAL: HA Failover from 192.168.87.72 to 192.168.87.123, 192.168.87.123 result:PROCS OK: 1 processes with command name /usr/sbin/mysqld, args VIP=192.168.87.74, 192.168.87.72:PROCS OK: 1 processes with command name /usr/sbin/mysqld
Slave_SQL线程停止工作 ***** Nagios ***** Notification Type: PROBLEM Service: check_mysql_health_slave-sql-running Host: bjuc-mysql1 Address: 192.168.87.72 State: CRITICAL Date/Time: Wed Mar 28 14:46:06 CST 2018 Additional Info: CRITICAL - Slave sql is not running
登录数据库服务器查看主从复制的状态,发现报错1205错
mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 192.168.87.123 Master_User: zhu Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.000481 Read_Master_Log_Pos: 218802975 Relay_Log_File: mysql-relay-bin.000428 Relay_Log_Pos: 2428110 Relay_Master_Log_File: mysql-bin.000481 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: performance_schema,information_schema,test,mysql Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: %.% Replicate_Wild_Ignore_Table: Last_Errno: 1205 Last_Error: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)' Skip_Counter: 0 Exec_Master_Log_Pos: 182088298 Relay_Log_Space: 39142986 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 1205 Last_SQL_Error: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)' Replicate_Ignore_Server_Ids: Master_Server_Id: 6 1 row in set (0.00 sec)
查看错误日志有一下报错信息:
错误日志中给出了报错原因和解决方法
tail -70 error.log
180328 14:36:05 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205 180328 14:36:21 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205 180328 14:36:38 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205 180328 14:36:56 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205 180328 14:37:15 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205 180328 14:37:35 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205 180328 14:37:56 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205 180328 14:38:17 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205 180328 14:38:38 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205 180328 14:38:59 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205 180328 14:39:20 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205 180328 14:39:20 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. 180328 14:39:20 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205 180328 14:39:20 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000481' position 182088298
查看参数结合着两个参数可以知道主从复制报错的原因为:
180328 14:39:20 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
180328 14:39:20 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
180328 14:39:20 [ERROR] Error running query, slave SQL thread aborted,Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000481' position 182088298
因为表notice_status_new产生了锁,阻塞了这条update语句
参数innodb_lock_wait_time 设置的为15秒,单个事务在等待15秒后开始报1025错:因为锁执行超时并重启事务
参数slave_transaction_retries 设置的为10次,如果事务重试次数超过10次,复制中断。(但是错误日志中尝试了11次)
mysql> show variables like '%innodb_lock_wait_timeout%'; +--------------------------+-------+ | Variable_name | Value | +--------------------------+-------+ | innodb_lock_wait_timeout | 15 | +--------------------------+-------+ 1 row in set (0.01 sec) mysql> show variables like '%slave_transaction_retries%'; +---------------------------+-------+ | Variable_name | Value | +---------------------------+-------+ | slave_transaction_retries | 10 | +---------------------------+-------+ 1 row in set (0.00 sec)
解决方法:重启slave后主从复制恢复,并开始追日志
mysql> stop slave; Query OK, 0 rows affected (0.02 sec) mysql> start slave; Query OK, 0 rows affected (0.00 sec) mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 192.168.87.123 Master_User: zhu Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.000481 Read_Master_Log_Pos: 451125456 Relay_Log_File: mysql-relay-bin.000428 Relay_Log_Pos: 9541548 Relay_Master_Log_File: mysql-bin.000481 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: performance_schema,information_schema,test,mysql Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: %.% Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 189201736 Relay_Log_Space: 271465769 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 5184 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 6 1 row in set (0.00 sec)
vip的问题等到晚上在切回来,线上架构为MM+heartbeat
---------------------------------我是分割线--------------------------
这次问题虽然解决了,但是发现问题我们一定要找到根音,这样问题才能算是闭环。
为什么会产生这个锁呢,因为记录锁信息的表都是实时的,现在查不出当时的锁信息,但是从存储引擎的运行状态中还是能发现一些问题,数据库出现了死锁!!显然在update这张表一定会等待的!!
发现了一条奇葩SQL,被锁的那张表也是notice_status_new
UPDATE notice_status_new SET count = count + 1, push_count = push_count + 1 WHERE conversation = 2609759 AND user_id != 7384662 AND user_id IN(7359498)
show engine innodb status\G 截取死锁的这部分 ------------------------ LATEST DETECTED DEADLOCK ------------------------ 180328 14:33:52 *** (1) TRANSACTION: TRANSACTION 1B285E9F9, ACTIVE 0 sec starting index read mysql tables in use 1, locked 1 LOCK WAIT 4 lock struct(s), heap size 1248, 3 row lock(s), undo log entries 1 MySQL thread id 3972036, OS thread handle 0x7fc0c3464700, query id 2525495433 192.168.87.42 dstatusnet Updating UPDATE notice_status_new SET count = count + 1, push_count = push_count + 1 WHERE conversation = 2609759 AND user_id != 7384662 AND user_id IN(7359498) *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 3535 page no 65360 n bits 448 index `conv_user` of table `statusnet`.`notice_status_new` trx id 1B285E9F9 lock_mode X locks rec but not gap waiting Record lock, heap no 241 PHYSICAL RECORD: n_fields 3; compact format; info bits 0 0: len 8; hex 000000000027d25f; asc ' _;; 1: len 4; hex 80704c0a; asc pL ;; 2: len 4; hex 80b1a0fe; asc ;; *** (2) TRANSACTION: TRANSACTION 1B285E9FD, ACTIVE 0 sec starting index read, thread declared inside InnoDB 500 mysql tables in use 1, locked 1 4 lock struct(s), heap size 1248, 3 row lock(s), undo log entries 1 MySQL thread id 3867908, OS thread handle 0x7fbfc6d4f700, query id 2525495437 192.168.87.42 dstatusnet Updating UPDATE notice_status_new SET count = count + 1, push_count = push_count + 1 WHERE conversation = 2609759 AND user_id != 7359498 AND user_id IN(7384662) *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 3535 page no 65360 n bits 448 index `conv_user` of table `statusnet`.`notice_status_new` trx id 1B285E9FD lock_mode X locks rec but not gap Record lock, heap no 241 PHYSICAL RECORD: n_fields 3; compact format; info bits 0 0: len 8; hex 000000000027d25f; asc ' _;; 1: len 4; hex 80704c0a; asc pL ;; 2: len 4; hex 80b1a0fe; asc ;; *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 3535 page no 65360 n bits 448 index `conv_user` of table `statusnet`.`notice_status_new` trx id 1B285E9FD lock_mode X locks rec but not gap waiting Record lock, heap no 242 PHYSICAL RECORD: n_fields 3; compact format; info bits 0 0: len 8; hex 000000000027d25f; asc ' _;; 1: len 4; hex 8070ae56; asc p V;; 2: len 4; hex 80b1a0fc; asc ;; *** WE ROLL BACK TRANSACTION (2)
问题找到了,找对该业务的开发,沟通后决定在周六上线新版本,彻底解决问题。
可以写一个实时监控锁的脚本,把内容输出打印,这样方便以后定位问题!
原文地址:http://blog.51cto.com/11784929/2092097