Hello everyone!
Analysis of the June 2 LOANDB high CPU load incident
At 10:05 a.m. on June 2, we found that the CPU load on the Loandb database was extremely high, with CPU idle close to zero. At first we assumed it was caused by back-office admin operations on the business side, but after contacting the developers and shutting down the admin application, the database CPU load remained just as high.
The root cause was later found in the MySQL InnoDB engine status log. Around 12:00, 耿会生 learned from the business department that marketing SMS messages had been pushed to borrowers whose loan disbursements had previously been backlogged; those borrowers then logged in to the site en masse to transact,
which drove the database load up sharply.
Log output
------------------------
LATEST DETECTED DEADLOCK
------------------------
2017-06-02 10:56:41 7f9dba9b7700
*** (1) TRANSACTION:
TRANSACTION 415399551, ACTIVE 7 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 5 lock struct(s), heap size 1184, 3 row lock(s), undo log entries 1
MySQL thread id 40456, OS thread handle 0x7f9dce1b7700, query id 369288 10.18.13.27 loanuser updating
UPDATE PUB_SEQUENCE SET seq_no = (CASE WHEN (POWER(10,seq_length)-seq_no=1) THEN POWER(10,seq_length-1) ELSE seq_no + 1 END) WHERE seq_type = NAME_CONST('in_type',_utf8'serialNo10' COLLATE 'utf8_general_ci')
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 229 page no 3 n bits 72 index `PRIMARY` of table `loandb`.`pub_sequence` trx table locks 2 total table locks 74 trx id 415399551 lock_mode X locks rec but not gap waiting lock hold time 7 wait time before grant 0
*** (2) TRANSACTION:
TRANSACTION 415387082, ACTIVE 46 sec inserting
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 2
MySQL thread id 32752, OS thread handle 0x7f9dba9b7700, query id 381564 10.18.13.27 loanuser update
insert into usr_contactperson ( UCF_CONTACTNO, UCF_PERSONNO, UCF_CONTACTTYPE, UCF_PERSONNAME, UCF_MOBILE, UCF_CREATETIME, UCF_USERNO, UCF_SOURCEORGANIZATIONNO, UCF_SOURCEPRODUCT, UCF_CHANNEL ) values ( 'xxx', 'xxx', 'xxx', 'xxx', 'xxx', 'xxx', 'xxx', 'xxx', 'xx', 'xx' )
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 229 page no 3 n bits 72 index `PRIMARY` of table `loandb`.`pub_sequence` trx table locks 2 total table locks 74 trx id 415387082 lock_mode X locks rec but not gap lock hold time 46 wait time before grant 46
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2156 page no 19447 n bits 168 index `UNIQUE_PERSONNO_CONTACTTYPE_INDEX` of table `loandb`.`usr_contactperson` trx table locks 2 total table locks 5 trx id 415387082 lock mode S waiting lock hold time 0 wait time before grant 0
*** WE ROLL BACK TRANSACTION (2)
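(For reference, the section above is the LATEST DETECTED DEADLOCK portion of the InnoDB engine status output, which can be pulled at any time with the standard command:

SHOW ENGINE INNODB STATUS\G
)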
2 Problem analysis
How the deadlock arose
From the log we can see clearly that transaction 415387082 first acquired a row-level exclusive lock on `loandb`.`pub_sequence` (HOLDS THE LOCK(S));
transaction 415399551 then happened to execute an UPDATE that needed that same lock (WAITING FOR THIS LOCK TO BE GRANTED);
transaction 415387082 held the lock without releasing it, because it was itself blocked mid-insert, waiting for a shared lock on usr_contactperson's UNIQUE_PERSONNO_CONTACTTYPE_INDEX (a duplicate-key check). Transaction 415399551, which judging by its 3 row locks and undo entry had most likely already modified usr_contactperson rows itself, therefore could not obtain the pub_sequence lock, the waits formed a cycle rather than a simple queue, and InnoDB's deadlock detector broke it by rolling back transaction (2).
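The interleaving can be sketched as two concurrent sessions. This is a minimal reconstruction based on the statements in the log, not the actual application code; the UPDATE is simplified and the 'xxx' values are the placeholders from the log itself:

-- Session B (trx 415387082): allocate a sequence number, then insert,
-- all in one transaction, so the pub_sequence row lock is held throughout.
START TRANSACTION;
UPDATE PUB_SEQUENCE SET seq_no = seq_no + 1   -- X record lock on the serialNo10 row
 WHERE seq_type = 'serialNo10';
INSERT INTO usr_contactperson
  (UCF_CONTACTNO, UCF_PERSONNO, UCF_CONTACTTYPE, UCF_PERSONNAME, UCF_MOBILE,
   UCF_CREATETIME, UCF_USERNO, UCF_SOURCEORGANIZATIONNO, UCF_SOURCEPRODUCT, UCF_CHANNEL)
VALUES ('xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xx','xx');
-- blocks here: waits for an S lock on UNIQUE_PERSONNO_CONTACTTYPE_INDEX
-- (duplicate-key check), while still holding the pub_sequence row lock.

-- Session A (trx 415399551): needs the same sequence row.
START TRANSACTION;
UPDATE PUB_SEQUENCE SET seq_no = seq_no + 1
 WHERE seq_type = 'serialNo10';
-- blocks behind session B's record lock; InnoDB's deadlock detector
-- eventually rolls back session B's transaction, i.e. (2) in the log.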
Locating the problem
From the UPDATE in the log we can see that transaction 415399551 was waiting for the row-level exclusive lock on the row where seq_type = NAME_CONST('in_type',_utf8'serialNo10' COLLATE 'utf8_general_ci'):
UPDATE PUB_SEQUENCE SET seq_no = (CASE WHEN (POWER(10,seq_length)-seq_no=1) THEN POWER(10,seq_length-1) ELSE seq_no + 1 END) WHERE seq_type = NAME_CONST('in_type',_utf8'serialNo10' COLLATE 'utf8_general_ci');
So the problem must lie in the business logic that uses seq_type 'serialNo10', and that is where the code needs to be reviewed and optimized.
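To make the CASE expression concrete: for serialNo10, seq_length is 10, so the counter wraps when it reaches POWER(10,10)-1 = 9999999999 and resets to POWER(10,9) = 1000000000; otherwise it simply increments. A standalone illustration (the literal 9999999999 is hypothetical, not a production value):

SELECT CASE
         WHEN (POWER(10,10) - 9999999999 = 1)  -- counter at its maximum
         THEN POWER(10,9)                      -- wrap to 1000000000
         ELSE 9999999999 + 1
       END AS next_seq_no;                     -- returns 1000000000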
PUB_SEQUENCE table contents
MariaDB [loandb]> select * from PUB_SEQUENCE;
+------------+----------------+------------+
| seq_type   | seq_no         | seq_length |
+------------+----------------+------------+
| serialNo10 |     1002153435 |         10 |
| serialNo14 | 10000004741308 |         14 |
| serialNo6  |         100114 |          6 |
+------------+----------------+------------+
3 rows in set (0.00 sec)
PUB_SEQUENCE index information
show index from PUB_SEQUENCE;
+--------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table        | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| pub_sequence |          0 | PRIMARY  |            1 | seq_type    | A         |           3 | NULL     | NULL   |      | BTREE      |         |               |
+--------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
High CPU load
Because the PUB_SEQUENCE row was locked, the large volume of requests trying to update PUB_SEQUENCE piled up behind it. These blocked threads accumulated rapidly in the system, driving the database server's CPU load up.
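While such a pile-up is happening, the blocked and blocking transactions can be listed directly with a diagnostic query against the standard information_schema InnoDB lock tables (available in this generation of MariaDB/MySQL):

SELECT r.trx_id               AS waiting_trx,
       r.trx_mysql_thread_id  AS waiting_thread,
       r.trx_query            AS waiting_query,
       b.trx_id               AS blocking_trx,
       b.trx_query            AS blocking_query   -- NULL if the blocker is idle in its transaction
  FROM information_schema.INNODB_LOCK_WAITS w
  JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
  JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;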
3 Resolution
After discussing with the development team, an emergency fix was deployed to reduce deadlocks: the UPDATE on pub_sequence was moved out of the enclosing business transaction into its own, sharply reducing the probability of deadlock. With the deadlocks gone, business requests could acquire and release locks promptly, and the database CPU load quickly fell back into a reasonable range.
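In sketch form the change looks like this. This is a simplified illustration of the fix as described above, not the actual application code; the wrap-around CASE is elided and the read-back SELECT is an assumed detail of how the new value is fetched:

-- Before: sequence allocation inside the business transaction;
-- the pub_sequence row lock is held until the final COMMIT.
START TRANSACTION;
UPDATE PUB_SEQUENCE SET seq_no = seq_no + 1 WHERE seq_type = 'serialNo10';
-- ... business inserts/updates (e.g. usr_contactperson) ...
COMMIT;

-- After: sequence allocation commits in its own short transaction,
-- releasing the row lock before the business work begins.
START TRANSACTION;
UPDATE PUB_SEQUENCE SET seq_no = seq_no + 1 WHERE seq_type = 'serialNo10';
SELECT seq_no FROM PUB_SEQUENCE WHERE seq_type = 'serialNo10';  -- read while still holding the lock
COMMIT;  -- row lock released here
START TRANSACTION;
-- ... business inserts/updates using the fetched seq_no ...
COMMIT;

The trade-off is that a number allocated this way is consumed even if the business transaction later rolls back, which is usually acceptable for serial numbers.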
CPU load after the adjustment:
Recommendations
Add a SQL review step to the release process;
Keep optimizing slow SQL statements;
Business departments should give advance notice of marketing campaigns so capacity can be prepared;
Hardware should have performance headroom.