一:背景介绍
MHA(Master HA)是一款开源的MySQL的高可用工具,能在MySQL主从复制的基础上,实现自动化主服务器故障转移。虽然MHA试图从宕机的主服务器上保存二进制日志,但并不是总是可行的。例如,如果主服务器硬件故障或无法通过ssh访问,MHA没法保存二进制日志,只进行故障转移而丢失最新数据。
MHA监控复制架构的主服务器,一旦检测到主服务器故障,就会自动进行故障转移。即使有些从服务器没有收到最新的relay log,MHA自动从最新的从服务器上识别差异的relay log并把这些日志应用到其他从服务器上,因此所有的从服务器保持一致性了。还可以设置优先级指定其中的一台slave作为master的候选,由于MHA在slaves之间修复一致性,因此可以将slave变成新的master,其他的slave都以其作为新master。
MHA有两部分组成,MHA Manager(管理节点)和MHA Node(数据节点):
1:MHA Manager可以单独部署在一台独立机器上管理多个master-slave集群,也可以部署在一台slave上。MHA Manager探测集群的node节点,当发现master出现故障的时它可以自动将具有最新数据的slave提升为新的master,然后将所有其它的slave导向新的master上。整个故障转移过程对应用程序是透明的。
2:MHA node运行在每台MySQL服务器上(master/slave/manager),它通过监控具备解析和清理logs功能的脚本来加快故障转移的。
优点: 1 找出同步最成功的一台从服务器(也就是与主服务器数据最接近的那台从服务器)自动切换成主服务器。 2 如果主机还能够访问,从主服务器上找回最新从机与主机间的数据差异。 3 在每一台从服务器上操作,确定他们缺少哪些events,并分别进行补充。 4 将最新的一台从服务器提升为主服务器后,将其它从服务器重新指向新的主服务器。 缺点: 1 当群集内的数据库进行故障转移时,对外提供服务的虚拟IP也进行转移。 2 MHA管理进程需要以后台守护进程的方式运行,并有监控机制保证MHA管理进程的正常运行。 3 有监控机制保证当主机出现故障时,MHA能确定进行成功的Failover。 4 当故障主机恢复后,能重新回到群集中,并成为新的Slave,自动实现重新同步。 5 由于主机和从机上备份策略不同,进行故障转移后,自动调整cron中的调度(例如全备份)。
二:下载
https://code.google.com/p/mysql-master-ha/downloads/list?q=label:Featured
wget https://mysql-master-ha.googlecode.com/files/mha4mysql-manager-0.55.tar.gz
wget https://mysql-master-ha.googlecode.com/files/mha4mysql-node-0.54.tar.gz
三:安装
https://code.google.com/p/mysql-master-ha/wiki/Installation#Installing_MHA_Manager
安装完成之后会产生一些相关工具:
Manager工具: - masterha_check_ssh : 检查MHA的SSH配置。 - masterha_check_repl : 检查MySQL复制。 - masterha_manager : 启动MHA。 - masterha_check_status : 检测当前MHA运行状态。 - masterha_master_monitor : 监测master是否宕机。 - masterha_master_switch : 控制故障转移(自动或手动)。 - masterha_conf_host : 添加或删除配置的server信息。 Node工具: - save_binary_logs : 保存和复制master的二进制日志。 - apply_diff_relay_logs : 识别差异的中继日志事件并应用于其它slave。 - filter_mysqlbinlog : 去除不必要的ROLLBACK事件(MHA已不再使用这个工具)。 - purge_relay_logs : 清除中继日志(不会阻塞SQL线程)。
【本文是通过源码安装】
1,MHA Manager 安装
## Install dependent Perl modules # Config::Tiny ## perl -MCPAN -e "install Config::Tiny" # Log::Dispatch ## perl -MCPAN -e "install Log::Dispatch" # Parallel::ForkManager ## perl -MCPAN -e "install Parallel::ForkManager" ## Installing MHA Manager $ tar -zxf mha4mysql-manager-X.Y.tar.gz $ perl Makefile.PL *** Module::AutoInstall version 1.03 *** Checking for Perl dependencies... [Core Features] - DBI ...loaded. (1.616) - DBD::mysql ...loaded. (4.020) - Time::HiRes ...loaded. (1.972101) - Config::Tiny ...loaded. (2.20) - Log::Dispatch ...loaded. (2.41) - Parallel::ForkManager ...loaded. (1.06) - MHA::NodeConst ...loaded. (0.54) 先装Node,这里才能通过。 *** Module::AutoInstall configuration finished. Checking if your kit is complete... Looks good Generating a Unix-style Makefile Writing Makefile for mha4mysql::manager Writing MYMETA.yml and MYMETA.json $ make $ sudo make install
2,MHA Node 安装
## Install DBD::mysql if not installed $ tar -zxf mha4mysql-node-X.Y.tar.gz $ perl Makefile.PL *** Module::AutoInstall version 1.03 *** Checking for Perl dependencies... [Core Features] - DBI ...loaded. (1.616) - DBD::mysql ...loaded. (4.020) *** Module::AutoInstall configuration finished. Checking if your kit is complete... Looks good Generating a Unix-style Makefile Writing Makefile for mha4mysql::node Writing MYMETA.yml and MYMETA.json $ make $ sudo make install
上面红色标明的不能出现miss,否则后面运行就会有问题。在安装时候要设置好locale,否则安装也会有问题:可以先运行:
export LC_ALL="zh_CN.UTF-8"
到此,MHA安装结束了。
三:测试
必要条件:
1)三台主机:201(M/node)、202(S/node)、204(S/manager)。
2)MHA按照上面的方法在三台上已经安装完毕,MySQL的Replication已经搭建完成,复制帐号三台都已经创建。
MySQL上需要设置:set global relay_log_purge=0,或则配置文件里修改。
--relay-log-purge={0|1} 禁用或启用不再需要中继日志时是否自动清空它们。默认值为1(启用)。这是一个全局变量,可以用SET GLOBAL Relay_log_purge动态更改。
3)MHA环境需要三台主机需要相互信任:实现三台主机之间相互免密钥登录。
[204] 用ssh-keygen创建公钥,一直默认回车,最后会在.ssh/下面生成id_rsa.pub ssh-keygen -t rsa 把id_rsa.pub 文件复制到另外2台机子并导入公钥: cat id_rsa.pub >> /root/.ssh/authorized_keys 当然也可以直接复制里面的内容再粘帖到authorized_keys中。 这样204可以直接访问到其他2台机子了 ssh 192.168.200.201 ssh 192.168.200.202 同理在另外2台机子上面操作,让他们之间实现无密码登入。 要是root登陆的话,需要修改ssh的配置文件:
vi /etc/ssh/sshd_config 中的 PermitRootLogin ,设置为YES,再重启SSH。
注意:Manager 要是装到某一台MySQL上,则需要自己和自己无密码登入:ssh 192.168.200.204,单独到一台服务器则不需要。
4)MHA的配置
Manager上的配置
mkdir -p /usr/local/masterha/app1 #创建数据文件目录 mkdir /etc/masterha #创建配置文件目录 安装时解压出来的samples中有2个文件: conf 是存放配置文件的模块 scripts是存放运行时的一些命令,可以把这几个名字移到/user/bin 中。用户下面灰色背景的几个参数中。
添加MHA的配置文件(Manager)
样本:(51-M,52-S,53-S,VIP:50)
[server default] user=zjy password=123# ssh_user=root repl_user=rep repl_password=rep123456 ping_interval=1 manager_workdir=/usr/local/masterha/mha_data manager_log=/usr/local/masterha/manager.log remote_workdir=/usr/local/masterha/mha_data master_ip_failover_script=/usr/bin/master_ip_failover #master_ip_online_change_script=/usr/bin/master_ip_online_change #shutdown_script=/usr/bin/power_manager report_script =/usr/bin/send_report.py #secondary_check_script= masterha_secondary_check -s remote_host1 -s remote_host2 [server1] hostname=192.168.200.51 master_binlog_dir=/var/log/mysql candidate_master=1 check_repl_delay=0 [server2] hostname=192.168.200.52 master_binlog_dir=/var/log/mysql #candidate_master=1 [server3] hostname=192.168.200.53 master_binlog_dir=/var/log/mysql #candidate_master=1
修改后的配置文件:[vi /etc/masterha/app1.cnf] [server default] user=root #mysql用戶名 password=111111 #mysql密码 ssh_user=root #ssh免密钥登录的帐号名 repl_user=rep #mysql复制帐号 repl_password=123456 #mysql复制密码 ping_interval=1 #ping间隔,用来检测master是否正常,默认是3秒,尝试三次没有回应的时候自动进行failover manager_workdir=/usr/local/masterha/app1 #数据目录,主要该目录的权限,需要有创建的权限 manager_log=/usr/local/masterha/manager.log #日志文件 remote_workdir=/usr/local/masterha/app1 #另外2台机子在运行时候需要创建的目录,注意ssh-keygen帐号的权限问题 #master_ip_failover_script=/usr/bin/master_ip_failover #master_ip_online_change_script=/usr/bin/master_ip_online_change #shutdown_script=/usr/bin/power_manager #report_script=/usr/bin/send_report
#secondary_check_script= masterha_secondary_check -s remote_host1 -s remote_host2
[server1] hostname=192.168.200.201 master_binlog_dir=/var/lib/mysql #binlog目录 candidate_master=1 #master机宕掉后,优先启用这台作为新master
check_repl_delay=0 #默认情况下如果一个slave落后master 100M的relay logs的话,MHA将不会选择该slave作为一个新的master,因为对于这个slave的恢复需要花费很长时间,通过设置check_repl_delay=0,MHA触发切换在选择一个新的master的时候将会忽略复制延时,这个参数对于设置了candidate_master=1的主机非常有用,因为这个候选主在切换的过程中一定是新的master
[server2] hostname=192.168.200.202 master_binlog_dir=/var/log/mysql candidate_master=1 [server3] hostname=192.168.200.204 master_binlog_dir=/var/log/mysql #candidate_master=1
进行测试:
◆ masterha_check_ssh工具验证ssh信任登录是否成功
[email protected]:/usr/local/masterha# masterha_check_ssh --conf=/etc/masterha/app1.cnf Wed Jun 25 17:08:40 2014 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Wed Jun 25 17:08:40 2014 - [info] Reading application default configurations from /etc/masterha/app1.cnf.. Wed Jun 25 17:08:40 2014 - [info] Reading server configurations from /etc/masterha/app1.cnf.. Wed Jun 25 17:08:40 2014 - [info] Starting SSH connection tests.. Wed Jun 25 17:08:41 2014 - [debug] Wed Jun 25 17:08:40 2014 - [debug] Connecting via SSH from [email protected]192.168.200.201(192.168.200.201:22) to [email protected]192.168.200.202(192.168.200.202:22).. Wed Jun 25 17:08:40 2014 - [debug] ok. Wed Jun 25 17:08:40 2014 - [debug] Connecting via SSH from [email protected]192.168.200.201(192.168.200.201:22) to [email protected]192.168.200.204(192.168.200.204:22).. Wed Jun 25 17:08:40 2014 - [debug] ok. Wed Jun 25 17:08:41 2014 - [debug] Wed Jun 25 17:08:40 2014 - [debug] Connecting via SSH from [email protected]192.168.200.202(192.168.200.202:22) to [email protected]192.168.200.201(192.168.200.201:22).. Wed Jun 25 17:08:40 2014 - [debug] ok. Wed Jun 25 17:08:40 2014 - [debug] Connecting via SSH from [email protected]192.168.200.202(192.168.200.202:22) to [email protected]192.168.200.204(192.168.200.204:22).. Wed Jun 25 17:08:41 2014 - [debug] ok. Wed Jun 25 17:08:41 2014 - [debug] Wed Jun 25 17:08:41 2014 - [debug] Connecting via SSH from [email protected]192.168.200.204(192.168.200.204:22) to [email protected]192.168.200.201(192.168.200.201:22).. Wed Jun 25 17:08:41 2014 - [debug] ok. Wed Jun 25 17:08:41 2014 - [debug] Connecting via SSH from [email protected]192.168.200.204(192.168.200.204:22) to [email protected]192.168.200.202(192.168.200.202:22).. Wed Jun 25 17:08:41 2014 - [debug] ok. Wed Jun 25 17:08:41 2014 - [info] All SSH connection tests passed successfully.
◆ masterha_check_repl工具验证mysql复制是否成功
[email protected]:/usr/local/masterha# masterha_check_repl --conf=/etc/masterha/app1.cnf
... ... 192.168.200.201 (current master) +--192.168.200.202 +--192.168.200.204 Wed Jun 25 17:27:17 2014 - [info] Checking replication health on 192.168.200.202.. Wed Jun 25 17:27:17 2014 - [info] ok. Wed Jun 25 17:27:17 2014 - [info] Checking replication health on 192.168.200.204.. Wed Jun 25 17:27:17 2014 - [info] ok. Wed Jun 25 17:27:17 2014 - [warning] master_ip_failover_script is not defined. Wed Jun 25 17:27:17 2014 - [warning] shutdown_script is not defined. Wed Jun 25 17:27:17 2014 - [info] Got exit code 0 (Not master dead). MySQL Replication Health is OK.
◆ masterha_manager 启动MHA manager
[email protected]:~# nohup masterha_manager --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log 2>&1 查看日志文件(/masterha/app1/manager.log) Cleaning up test file(s).. done. Wed Jun 25 17:36:07 2014 - [info] Slaves settings check done. Wed Jun 25 17:36:07 2014 - [info] 192.168.200.201 (current master) +--192.168.200.202 +--192.168.200.204 Wed Jun 25 17:36:07 2014 - [warning] master_ip_failover_script is not defined. Wed Jun 25 17:36:07 2014 - [warning] shutdown_script is not defined. Wed Jun 25 17:36:07 2014 - [info] Set master ping interval 1 seconds. Wed Jun 25 17:36:07 2014 - [info] Set secondary check script: masterha_secondary_check -s remote_host1 -s remote_host2 Wed Jun 25 17:36:07 2014 - [info] Starting ping health check on 192.168.200.201(192.168.200.201:3306).. Wed Jun 25 17:36:07 2014 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn‘t respond..
参数:
--ignore_last_failover:在缺省情况下,如果MHA检测到连续发生宕机,且两次宕机间隔不足8小时的话,则不会进行Failover,之所以这样限制是为了避免ping-pong效应。该参数代表忽略上次MHA触发切换产生的文件,默认情况下,MHA发生切换后会在日志目录,也就是上面我设置的/data产生app1.failover.complete文件,下次再次切换的时候如果发现该目录下存在该文件将不允许触发切换,除非在第一次切换后收到删除该文件
--remove_dead_master_conf: 该参数代表当发生主从切换后,老的主库将会从配置文件中移除,一般情况下不需要开启。即:
开启程序 nohup masterha_manager --ignore_last_failover --remove_dead_master_conf --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log 2>&1 +++ 故障切换 From: 192.168.200.202 (current master) +--192.168.200.201 +--192.168.200.204 To: 192.168.200.201 (new master) +--192.168.200.204 +++ 不先处理故障的机子,可直接运行程序,不加该参数会直接报错: [error][/usr/local/share/perl/5.14.2/MHA/ServerManager.pm, ln443] Server 192.168.200.201(192.168.200.201:3306) is dead, but must be alive! Check server settings. nohup masterha_manager --ignore_last_failover --remove_dead_master_conf --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log 2>&1 可以正常运行,但是配置文件的[server2]选项组被删除
◆ 测试master(201)宕机后,是否会自动切换?这里注意的是确保 /usr/local/masterha/app1/app1.failover.complete 没有这类文件存在。否则报:
remove /usr/local/masterha/app1/app1.failover.complete and run this script again,需要加上参数:--ignore_last_failover
nohup masterha_manager --ignore_last_failover --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log 2>&1
... ... Wed Jun 25 23:45:49 2014 - [info] Reading server configurations from /etc/masterha/app1.cnf.. Wed Jun 25 23:45:50 2014 - [info] Dead Servers: Wed Jun 25 23:45:50 2014 - [info] 192.168.200.201(192.168.200.201:3306) Wed Jun 25 23:45:50 2014 - [info] Alive Servers: Wed Jun 25 23:45:50 2014 - [info] 192.168.200.202(192.168.200.202:3306) Wed Jun 25 23:45:50 2014 - [info] 192.168.200.204(192.168.200.204:3306) Wed Jun 25 23:45:50 2014 - [info] Alive Slaves: Wed Jun 25 23:45:50 2014 - [info] 192.168.200.202(192.168.200.202:3306) Version=5.5.35-0ubuntu0.12.04.2-log (oldest major version between slaves) log-bin:enabled Wed Jun 25 23:45:50 2014 - [info] Replicating from 192.168.200.201(192.168.200.201:3306) .... .... Wed Jun 25 23:45:50 2014 - [info] Master is down! Wed Jun 25 23:45:50 2014 - [info] Starting master failover. .... .... 进行5个阶段 Creating /usr/local/masterha/app1 if not exists.. ok. ... Dumping binlog format description event, from position 0 to 107.. ok. Wed Jun 25 23:45:51 2014 - [info] scp from [email protected]192.168.200.201:/usr/local/masterha/app1/saved_master_binlog_from_192.168.200.201_3306_20140625234550.binlog to local:/usr/local/masterha/app1/saved_master_binlog_from_192.168.200.201_3306_20140625234550.binlog succeeded. ... Wed Jun 25 23:45:52 2014 - [info] New master is 192.168.200.202(192.168.200.202:3306) Wed Jun 25 23:45:52 2014 - [info] Starting master failover.. Wed Jun 25 23:45:52 2014 - [info] From: 192.168.200.201 (current master) +--192.168.200.202 +--192.168.200.204 To: 192.168.200.202 (new master) +--192.168.200.204 ... Applying log files succeeded. Wed Jun 25 23:45:52 2014 - [info] All relay logs were successfully applied. Wed Jun 25 23:45:52 2014 - [info] Getting new master‘s binlog name and position.. Wed Jun 25 23:45:52 2014 - [info] mysql-bin.000001:107 Wed Jun 25 23:45:52 2014 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=‘192.168.200.202‘, MASTER_PORT=3306, MASTER_LOG_FILE=‘mysql-bin.000001‘, MASTER_LOG_POS=107, MASTER_USER=‘rep‘, MASTER_PASSWORD=‘xxx‘; Wed Jun 25 23:45:52 2014 - [info] ** Finished master recovery successfully. ... Wed Jun 25 23:45:54 2014 - [info] Executed CHANGE MASTER. Wed Jun 25 23:45:54 2014 - [info] Slave started. Wed Jun 25 23:45:54 2014 - [info] All new slave servers recovered successfully. ... ----- Failover Report ----- app1: MySQL Master failover 192.168.200.201 to 192.168.200.202 succeeded Master 192.168.200.201 is down! Check MHA Manager logs at db4:/usr/local/masterha/manager.log for details. Started automated(non-interactive) failover. The latest slave 192.168.200.202(192.168.200.202:3306) has all relay logs for recovery. Selected 192.168.200.202 as a new master. 192.168.200.202: OK: Applying all logs succeeded. 192.168.200.204: This host has the latest relay log events. Generating relay diff files from the latest slave succeeded. 192.168.200.204: OK: Applying all logs succeeded. Slave started, replicating from 192.168.200.202. 192.168.200.202: Resetting slave info succeeded. Master failover to 192.168.200.202(192.168.200.202:3306) completed successfully.
上面只给出了一些比较重要的信息,更多的信息见配置文件指定的日志文件。大致的信息:
主(201)关闭后,查找最新的slave,通过scp把二进制日志应用过去,然后最近的从202执行了reset slave,成了新主,204从202开始同步。
老的主(201)开启之后,需要手动重新同步到新主。通过日志里的:
CHANGE MASTER TO MASTER_HOST=‘192.168.200.202‘, MASTER_PORT=3306, MASTER_LOG_FILE=‘mysql-bin.000001‘, MASTER_LOG_POS=107, MASTER_USER=‘rep‘, MASTER_PASSWORD=‘xxx‘;
再重新开启MHA程序,这里需要注意:
一旦发生切换管理进程(Manager)将会退出,无法进行再次测试,需将故障数据库解决掉之后,重新change加入到MHA环境中来或则在切换的时候就加上--remove_dead_master_conf参数,并且要保证app1.failover.complete不存在或则加上--ignore_last_failover参数忽略,才能再次开启管理进程。
◆ masterha_master_switch工具进行主从切换,在手动切换的同时需要保证没有启用MHA自动切换功能。
[email protected]:/usr/local/masterha# masterha_master_switch --master_state=alive --conf=/etc/masterha/app1.cnf Thu Jun 26 00:19:24 2014 - [info] MHA::MasterRotate version 0.55. Thu Jun 26 00:19:24 2014 - [info] Starting online master switch.. ... It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.200.202(192.168.200.202:3306)? (YES/no): yes ... From: 192.168.200.202 (current master) +--192.168.200.201 +--192.168.200.204 To:202消失 192.168.200.201 (new master) +--192.168.200.204 Starting master switch from 192.168.200.202(192.168.200.202:3306) to 192.168.200.201(192.168.200.201:3306)? (yes/NO): yes ... master_ip_online_change_script is not defined. If you do not disable writes on the current master manually, applications keep writing on the current master. Is it ok to proceed? (yes/NO): yes ... Thu Jun 26 00:19:55 2014 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=‘192.168.200.201‘, MASTER_PORT=3306, MASTER_LOG_FILE=‘mysql-bin-3306.000001‘, MASTER_LOG_POS=107, MASTER_USER=‘rep‘, MASTER_PASSWORD=‘xxx‘; ... Thu Jun 26 00:19:56 2014 - [info] All new slave servers switched successfully. Thu Jun 26 00:19:56 2014 - [info] 192.168.200.201: Resetting slave info succeeded. Thu Jun 26 00:19:56 2014 - [info] Switching master to 192.168.200.201(192.168.200.201:3306) completed successfully.
加粗部分说明主从切换后的架构变化,也会在命令行提示你是否CHANGE到新主。切换之后老主(202)将会被独立出来。想要把202重新加入到复制中,则只需要执行打印出来的CHANGE MASTER 语句。也可以指定一个IP作为主:
masterha_master_switch --master_state=alive --conf=/etc/masterha/app1.cnf --new_master_host=192.168.200.201
指定201为新主,切换完之后,老主被独立。
◆ masterha_check_status 工具 查看MHA程序的状态:
[email protected]:/usr/local/masterha# masterha_check_status --conf=/etc/masterha/app1.cnf app1 is stopped(2:NOT_RUNNING). [email protected]:/usr/local/masterha# masterha_check_status --conf=/etc/masterha/app1.cnf app1 monitoring program is now on initialization phase(10:INITIALIZING_MONITOR). Wait for a while and try checking again. [email protected]:/usr/local/masterha# masterha_check_status --conf=/etc/masterha/app1.cnf app1 (pid:29232) is running(0:PING_OK), master:192.168.200.201
通过上面的信息可以看到MHA程序是否正在运行。
◆ masterha_stop 工具关闭Manager进程。
[email protected]:/usr/local/masterha# masterha_stop --conf=/etc/masterha/app1.cnf Stopped app1 successfully.
◆ 配置VIP:master_ip_failover_script=/usr/bin/master_ip_failover
样本:(51-M,52-S,53-S,VIP:50)
#!/usr/bin/env perl # Copyright (C) 2011 DeNA Co.,Ltd. # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA ## Note: This is a sample script and is not complete. Modify the script based on your environment. use strict; use warnings FATAL => ‘all‘; use Getopt::Long; my ( $command, $ssh_user, $orig_master_host, $orig_master_ip, $orig_master_port, $new_master_host, $new_master_ip, $new_master_port ); my $vip = ‘192.168.200.50/24‘; my $key = ‘0‘; my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip"; my $ssh_stop_vip = "/sbin/ifconfig eth0:$key down"; my $ssh_Bcast_arp = "/usr/bin/arping -c 3 -A 192.168.200.50"; #ARP回复模式,更新邻居。要是不加则服务器会自动等到vip缓存失效,期间VIP会有一定时间的不可用。 GetOptions( ‘command=s‘ => \$command, ‘ssh_user=s‘ => \$ssh_user, ‘orig_master_host=s‘ => \$orig_master_host, ‘orig_master_ip=s‘ => \$orig_master_ip, ‘orig_master_port=i‘ => \$orig_master_port, ‘new_master_host=s‘ => \$new_master_host, ‘new_master_ip=s‘ => \$new_master_ip, ‘new_master_port=i‘ => \$new_master_port, ); exit &main(); sub main { print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n"; if ( $command eq "stop" || $command eq "stopssh" ) { my $exit_code = 1; eval { print "Disabling the VIP on old master: $orig_master_host \n"; &stop_vip(); $exit_code = 0; }; if ([email protected]) { warn "Got Error: [email protected]\n"; exit $exit_code; } exit $exit_code; } elsif ( $command eq "start" ) { my $exit_code = 10; eval { print "Enabling the VIP - $vip on the new master - $new_master_host \n"; &start_vip(); &start_arp(); $exit_code = 0; }; if ([email protected]) { warn [email protected]; exit $exit_code; } exit $exit_code; } elsif ( $command eq "status" ) { print "Checking the Status of the script.. OK \n"; exit 0; } else { &usage(); exit 1; } } sub start_vip() { `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`; } sub start_arp() { `ssh $ssh_user\@$new_master_host \" $ssh_Bcast_arp \"`; } sub stop_vip() { `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`; } sub usage { print "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n"; }
在使用上面的arping 时候,需要安装:
apt-get install iputils-arping hash -r 更新缓存路径
关于arping 请参考:http://blog.sina.com.cn/s/blog_88b1c39501011sr1.html
◆ 邮件、短信通知:report_script=/usr/bin/send_report
样本:(51-M,52-S,53-S,VIP:50)
#!/usr/bin/env python #-*- encoding:utf-8 -*- #------------------------------------------------------------------------------- # Name: send_report.py # Author: zhoujy #---------------------------------------------- import os import sys import time import datetime import smtplib import subprocess import fileinput import getopt from email.mime.text import MIMEText from email.mime.multipart import MIMEMultipart from email.Utils import COMMASPACE, formatdate reload(sys) sys.setdefaultencoding(‘utf8‘) def send_mail(to, subject, text, from_mail, server="localhost"): message = MIMEMultipart() message[‘From‘] = from_mail message[‘To‘] = COMMASPACE.join(to) message[‘Date‘] = formatdate(localtime=True) message[‘Subject‘] = subject message.attach(MIMEText(text,_charset=‘utf-8‘)) smtp = smtplib.SMTP(server) smtp.sendmail(from_mail, to, message.as_string()) smtp.close() if __name__ == "__main__": opts,args = getopt.getopt(sys.argv[1:],"h",["orig_master_host=","new_master_host=","new_slave_hosts=","conf=","subject=","body=","app_vip=","new_master_ssh_port=","ssh_user="]) # print opts,args for lines in opts: key,values = lines if key == ‘--orig_master_host‘: orig_master_host = values if key == ‘--new_master_host‘: new_master_host = values if key == ‘--new_slave_hosts‘: new_slave_hosts = values if key == ‘--subject‘: subject = values if key == ‘--body‘: body = values # text = sys.stdin.read() mail_list = [‘[email protected]‘] send_mail(mail_list, subject.encode("utf8"), body, "[email protected]", server="192.168.220.251")
关于getopt 参数请参考:http://blog.sina.com.cn/s/blog_88b1c39501011sr1.html
到此MySQL HA已经搭建完成,下面开始针对其进行崩溃转移测试。
◆ 测试:
Master:51,Slave:52,53,VIP:50
工具:sysbench
1:生成测试数据:
sysbench --test=oltp --mysql-table-engine=innodb --mysql-host=192.168.200.50 --mysql-db=mha_test --oltp-table-size=5000000 --mysql-user=zjy --mysql-password=123# prepare
2:在生成测试数据期间,在52上执行:关闭其IO线程,模拟延迟。53保持正常。
stop slave io_thread;
3:测试数据生成完毕之后执行:模拟主的并发等压力测试
sysbench --num-threads=16 --test=oltp --mysql-table-engine=innodb --mysql-host=192.168.200.50 --mysql-db=mha_test --oltp-table-size=5000000 --mysql-user=zjy --mysql-password=aaaaa# run
4:同时开启52的IO线程:继续同步M上的数据。
start slave io_thread;
5:关闭主库mysql进程,模拟主库发生故障,进行自动failover操作:
/etc/init.d/mysql stop
6:最后查看日志:manager.log
Wed Oct 15 16:23:11 2014 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away) Wed Oct 15 16:23:11 2014 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/var/log/mysql --output_file=/usr/local/masterha/mha_data/save_binary_logs_test --manager_version=0.55 --binlog_prefix=mysql-bin51 Wed Oct 15 16:23:11 2014 - [info] HealthCheck: SSH to 192.168.200.51 is reachable. Wed Oct 15 16:23:12 2014 - [warning] Got error on MySQL connect: 2003 (Can‘t connect to MySQL server on ‘192.168.200.51‘ (111)) Wed Oct 15 16:23:12 2014 - [warning] Connection failed 1 time(s).. Wed Oct 15 16:23:13 2014 - [warning] Got error on MySQL connect: 2003 (Can‘t connect to MySQL server on ‘192.168.200.51‘ (111)) Wed Oct 15 16:23:13 2014 - [warning] Connection failed 2 time(s).. Wed Oct 15 16:23:14 2014 - [warning] Got error on MySQL connect: 2003 (Can‘t connect to MySQL server on ‘192.168.200.51‘ (111)) Wed Oct 15 16:23:14 2014 - [warning] Connection failed 3 time(s).. Wed Oct 15 16:23:14 2014 - [warning] Master is not reachable from health checker! Wed Oct 15 16:23:14 2014 - [warning] Master 192.168.200.51(192.168.200.51:3306) is not reachable! Wed Oct 15 16:23:14 2014 - [warning] SSH is reachable. Wed Oct 15 16:23:14 2014 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/mha.cnf again, and trying to connect to all servers to check server status.. Wed Oct 15 16:23:14 2014 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Wed Oct 15 16:23:14 2014 - [info] Reading application default configurations from /etc/masterha/mha.cnf.. Wed Oct 15 16:23:14 2014 - [info] Reading server configurations from /etc/masterha/mha.cnf.. Wed Oct 15 16:23:14 2014 - [info] Dead Servers: Wed Oct 15 16:23:14 2014 - [info] 192.168.200.51(192.168.200.51:3306) Wed Oct 15 16:23:14 2014 - [info] Alive Servers: Wed Oct 15 16:23:14 2014 - [info] 192.168.200.52(192.168.200.52:3306) Wed Oct 15 16:23:14 2014 - [info] 192.168.200.53(192.168.200.53:3306) Wed Oct 15 16:23:14 2014 - [info] Alive Slaves: Wed Oct 15 16:23:14 2014 - [info] 192.168.200.52(192.168.200.52:3306) Version=5.5.38-0ubuntu0.12.04.1-log (oldest major version between slaves) log-bin:enabled Wed Oct 15 16:23:14 2014 - [info] Replicating from 192.168.200.51(192.168.200.51:3306) Wed Oct 15 16:23:14 2014 - [info] 192.168.200.53(192.168.200.53:3306) Version=5.5.38-0ubuntu0.12.04.1-log (oldest major version between slaves) log-bin:enabled Wed Oct 15 16:23:14 2014 - [info] Replicating from 192.168.200.51(192.168.200.51:3306) Wed Oct 15 16:23:14 2014 - [info] Checking slave configurations.. Wed Oct 15 16:23:14 2014 - [info] read_only=1 is not set on slave 192.168.200.52(192.168.200.52:3306). Wed Oct 15 16:23:14 2014 - [warning] relay_log_purge=0 is not set on slave 192.168.200.52(192.168.200.52:3306). Wed Oct 15 16:23:14 2014 - [info] read_only=1 is not set on slave 192.168.200.53(192.168.200.53:3306). Wed Oct 15 16:23:14 2014 - [warning] relay_log_purge=0 is not set on slave 192.168.200.53(192.168.200.53:3306). Wed Oct 15 16:23:14 2014 - [info] Checking replication filtering settings.. Wed Oct 15 16:23:14 2014 - [info] Replication filtering check ok. Wed Oct 15 16:23:14 2014 - [info] Master is down! Wed Oct 15 16:23:14 2014 - [info] Terminating monitoring script. Wed Oct 15 16:23:14 2014 - [info] Got exit code 20 (Master dead). Wed Oct 15 16:23:14 2014 - [info] MHA::MasterFailover version 0.55. Wed Oct 15 16:23:14 2014 - [info] Starting master failover. Wed Oct 15 16:23:14 2014 - [info] Wed Oct 15 16:23:14 2014 - [info] * Phase 1: Configuration Check Phase.. Wed Oct 15 16:23:14 2014 - [info] Wed Oct 15 16:23:14 2014 - [info] Dead Servers: Wed Oct 15 16:23:14 2014 - [info] 192.168.200.51(192.168.200.51:3306) Wed Oct 15 16:23:14 2014 - [info] Checking master reachability via mysql(double check).. Wed Oct 15 16:23:14 2014 - [info] ok. Wed Oct 15 16:23:14 2014 - [info] Alive Servers: Wed Oct 15 16:23:14 2014 - [info] 192.168.200.52(192.168.200.52:3306) Wed Oct 15 16:23:14 2014 - [info] 192.168.200.53(192.168.200.53:3306) Wed Oct 15 16:23:14 2014 - [info] Alive Slaves: Wed Oct 15 16:23:14 2014 - [info] 192.168.200.52(192.168.200.52:3306) Version=5.5.38-0ubuntu0.12.04.1-log (oldest major version between slaves) log-bin:enabled Wed Oct 15 16:23:14 2014 - [info] Replicating from 192.168.200.51(192.168.200.51:3306) Wed Oct 15 16:23:14 2014 - [info] 192.168.200.53(192.168.200.53:3306) Version=5.5.38-0ubuntu0.12.04.1-log (oldest major version between slaves) log-bin:enabled Wed Oct 15 16:23:14 2014 - [info] Replicating from 192.168.200.51(192.168.200.51:3306) Wed Oct 15 16:23:14 2014 - [info] ** Phase 1: Configuration Check Phase completed. Wed Oct 15 16:23:14 2014 - [info] Wed Oct 15 16:23:14 2014 - [info] * Phase 2: Dead Master Shutdown Phase.. Wed Oct 15 16:23:14 2014 - [info] Wed Oct 15 16:23:14 2014 - [info] Forcing shutdown so that applications never connect to the current master.. Wed Oct 15 16:23:14 2014 - [info] Executing master IP deactivatation script: Wed Oct 15 16:23:14 2014 - [info] /usr/bin/master_ip_failover --orig_master_host=192.168.200.51 --orig_master_ip=192.168.200.51 --orig_master_port=3306 --command=stopssh --ssh_user=root IN SCRIPT TEST====/sbin/ifconfig eth0:0 down==/sbin/ifconfig eth0:0 192.168.200.50/24=== Disabling the VIP on old master: 192.168.200.51 Wed Oct 15 16:23:15 2014 - [info] done. Wed Oct 15 16:23:15 2014 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. Wed Oct 15 16:23:15 2014 - [info] * Phase 2: Dead Master Shutdown Phase completed. Wed Oct 15 16:23:15 2014 - [info] Wed Oct 15 16:23:15 2014 - [info] * Phase 3: Master Recovery Phase.. Wed Oct 15 16:23:15 2014 - [info] Wed Oct 15 16:23:15 2014 - [info] * Phase 3.1: Getting Latest Slaves Phase.. Wed Oct 15 16:23:15 2014 - [info] Wed Oct 15 16:23:15 2014 - [info] The latest binary log file/position on all slaves is mysql-bin51.000080:102992428 Wed Oct 15 16:23:15 2014 - [info] Latest slaves (Slaves that received relay log files to the latest): Wed Oct 15 16:23:15 2014 - [info] 192.168.200.52(192.168.200.52:3306) Version=5.5.38-0ubuntu0.12.04.1-log (oldest major version between slaves) log-bin:enabled Wed Oct 15 16:23:15 2014 - [info] Replicating from 192.168.200.51(192.168.200.51:3306) Wed Oct 15 16:23:15 2014 - [info] 192.168.200.53(192.168.200.53:3306) Version=5.5.38-0ubuntu0.12.04.1-log (oldest major version between slaves) log-bin:enabled Wed Oct 15 16:23:15 2014 - [info] Replicating from 192.168.200.51(192.168.200.51:3306) Wed Oct 15 16:23:15 2014 - [info] The oldest binary log file/position on all slaves is mysql-bin51.000080:102992428 Wed Oct 15 16:23:15 2014 - [info] Oldest slaves: Wed Oct 15 16:23:15 2014 - [info] 192.168.200.52(192.168.200.52:3306) Version=5.5.38-0ubuntu0.12.04.1-log (oldest major version between slaves) log-bin:enabled Wed Oct 15 16:23:15 2014 - [info] Replicating from 192.168.200.51(192.168.200.51:3306) Wed Oct 15 16:23:15 2014 - [info] 192.168.200.53(192.168.200.53:3306) Version=5.5.38-0ubuntu0.12.04.1-log (oldest major version between slaves) log-bin:enabled Wed Oct 15 16:23:15 2014 - [info] Replicating from 192.168.200.51(192.168.200.51:3306) Wed Oct 15 16:23:15 2014 - [info] Wed Oct 15 16:23:15 2014 - [info] * Phase 3.2: Saving Dead Master‘s Binlog Phase.. Wed Oct 15 16:23:15 2014 - [info] Wed Oct 15 16:23:15 2014 - [info] Fetching dead master‘s binary logs.. Wed Oct 15 16:23:15 2014 - [info] Executing command on the dead master 192.168.200.51(192.168.200.51:3306): save_binary_logs --command=save --start_file=mysql-bin51.000080 --start_pos=102992428 --binlog_dir=/var/log/mysql --output_file=/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 Creating /usr/local/masterha/mha_data if not exists.. ok. Concat binary/relay logs from mysql-bin51.000080 pos 102992428 to mysql-bin51.000080 EOF into /usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog .. Dumping binlog format description event, from position 0 to 107.. ok. Dumping effective binlog data from /var/log/mysql/mysql-bin51.000080 position 102992428 to tail(102992447).. ok. Concat succeeded. Wed Oct 15 16:23:17 2014 - [info] scp from [email protected]192.168.200.51:/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog to local:/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog succeeded. Wed Oct 15 16:23:17 2014 - [info] HealthCheck: SSH to 192.168.200.52 is reachable. Wed Oct 15 16:23:17 2014 - [info] HealthCheck: SSH to 192.168.200.53 is reachable. Wed Oct 15 16:23:18 2014 - [info] Wed Oct 15 16:23:18 2014 - [info] * Phase 3.3: Determining New Master Phase.. Wed Oct 15 16:23:18 2014 - [info] Wed Oct 15 16:23:18 2014 - [info] Finding the latest slave that has all relay logs for recovering other slaves.. Wed Oct 15 16:23:18 2014 - [info] All slaves received relay logs to the same position. No need to resync each other. Wed Oct 15 16:23:18 2014 - [warning] Slave 192.168.200.52(192.168.200.52:3306) SQL Thread delays too much. Latest log file:mysql-bin51.000080:102992428, Current log file:mysql-bin51.000079:6712637. This server is not selected as a new master because recovery will take long time. Wed Oct 15 16:23:18 2014 - [info] Searching new master from slaves.. Wed Oct 15 16:23:18 2014 - [info] Candidate masters from the configuration file: Wed Oct 15 16:23:18 2014 - [info] Non-candidate masters: Wed Oct 15 16:23:18 2014 - [info] 192.168.200.52(192.168.200.52:3306) Version=5.5.38-0ubuntu0.12.04.1-log (oldest major version between slaves) log-bin:enabled Wed Oct 15 16:23:18 2014 - [info] Replicating from 192.168.200.51(192.168.200.51:3306) Wed Oct 15 16:23:18 2014 - [info] Searching from all slaves which have received the latest relay log events.. Wed Oct 15 16:23:18 2014 - [info] New master is 192.168.200.53(192.168.200.53:3306) Wed Oct 15 16:23:18 2014 - [info] Starting master failover.. Wed Oct 15 16:23:18 2014 - [info] From: 192.168.200.51 (current master) +--192.168.200.52 +--192.168.200.53 To: 192.168.200.53 (new master) +--192.168.200.52 Wed Oct 15 16:23:18 2014 - [info] Wed Oct 15 16:23:18 2014 - [info] * Phase 3.3: New Master Diff Log Generation Phase.. Wed Oct 15 16:23:18 2014 - [info] Wed Oct 15 16:23:18 2014 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Wed Oct 15 16:23:18 2014 - [info] Sending binlog.. Wed Oct 15 16:23:18 2014 - [info] scp from local:/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog to [email protected]192.168.200.53:/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog succeeded. Wed Oct 15 16:23:18 2014 - [info] Wed Oct 15 16:23:18 2014 - [info] * Phase 3.4: Master Log Apply Phase.. Wed Oct 15 16:23:18 2014 - [info] Wed Oct 15 16:23:18 2014 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed. Wed Oct 15 16:23:18 2014 - [info] Starting recovery on 192.168.200.53(192.168.200.53:3306).. Wed Oct 15 16:23:18 2014 - [info] Generating diffs succeeded. Wed Oct 15 16:23:18 2014 - [info] Waiting until all relay logs are applied. Wed Oct 15 16:23:18 2014 - [info] done. Wed Oct 15 16:23:18 2014 - [info] Getting slave status.. Wed Oct 15 16:23:18 2014 - [info] This slave(192.168.200.53)‘s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin51.000080:102992428). No need to recover from Exec_Master_Log_Pos. Wed Oct 15 16:23:18 2014 - [info] Connecting to the target slave host 192.168.200.53, running recover script.. Wed Oct 15 16:23:18 2014 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user=‘zjy‘ --slave_host=192.168.200.53 --slave_ip=192.168.200.53 --slave_port=3306 --apply_files=/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog --workdir=/usr/local/masterha/mha_data --target_version=5.5.38-0ubuntu0.12.04.1-log --timestamp=20141015162314 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx Wed Oct 15 16:23:18 2014 - [info] Applying differential binary/relay log files /usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog on 192.168.200.53:3306. This may take long time... Applying log files succeeded. Wed Oct 15 16:23:18 2014 - [info] All relay logs were successfully applied. Wed Oct 15 16:23:18 2014 - [info] Getting new master‘s binlog name and position.. Wed Oct 15 16:23:18 2014 - [info] mysql-bin53.000048:107 Wed Oct 15 16:23:18 2014 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=‘192.168.200.53‘, MASTER_PORT=3306, MASTER_LOG_FILE=‘mysql-bin53.000048‘, MASTER_LOG_POS=107, MASTER_USER=‘rep‘, MASTER_PASSWORD=‘xxx‘; Wed Oct 15 16:23:18 2014 - [info] Executing master IP activate script: Wed Oct 15 16:23:18 2014 - [info] /usr/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.200.51 --orig_master_ip=192.168.200.51 --orig_master_port=3306 --new_master_host=192.168.200.53 --new_master_ip=192.168.200.53 --new_master_port=3306 --new_master_user=‘zjy‘ --new_master_password=‘aaa#‘ Unknown option: new_master_user Unknown option: new_master_password IN SCRIPT TEST====/sbin/ifconfig eth0:0 down==/sbin/ifconfig eth0:0 192.168.200.50/24=== Enabling the VIP - 192.168.200.50/24 on the new master - 192.168.200.53 bash: /usr/bin/arping: No such file or directory Wed Oct 15 16:23:19 2014 - [info] OK. Wed Oct 15 16:23:19 2014 - [info] ** Finished master recovery successfully. Wed Oct 15 16:23:19 2014 - [info] * Phase 3: Master Recovery Phase completed. Wed Oct 15 16:23:19 2014 - [info] Wed Oct 15 16:23:19 2014 - [info] * Phase 4: Slaves Recovery Phase.. Wed Oct 15 16:23:19 2014 - [info] Wed Oct 15 16:23:19 2014 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. Wed Oct 15 16:23:19 2014 - [info] Wed Oct 15 16:23:19 2014 - [info] -- Slave diff file generation on host 192.168.200.52(192.168.200.52:3306) started, pid: 43858. Check tmp log /usr/local/masterha/mha_data/192.168.200.52_3306_20141015162314.log if it takes time.. Wed Oct 15 16:23:19 2014 - [info] Wed Oct 15 16:23:19 2014 - [info] Log messages from 192.168.200.52 ... Wed Oct 15 16:23:19 2014 - [info] Wed Oct 15 16:23:19 2014 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Wed Oct 15 16:23:19 2014 - [info] End of log messages from 192.168.200.52. Wed Oct 15 16:23:19 2014 - [info] -- 192.168.200.52(192.168.200.52:3306) has the latest relay log events. Wed Oct 15 16:23:19 2014 - [info] Generating relay diff files from the latest slave succeeded. Wed Oct 15 16:23:19 2014 - [info] Wed Oct 15 16:23:19 2014 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase.. Wed Oct 15 16:23:19 2014 - [info] Wed Oct 15 16:23:19 2014 - [info] -- Slave recovery on host 192.168.200.52(192.168.200.52:3306) started, pid: 43860. Check tmp log /usr/local/masterha/mha_data/192.168.200.52_3306_20141015162314.log if it takes time.. Wed Oct 15 16:24:24 2014 - [info] Wed Oct 15 16:24:24 2014 - [info] Log messages from 192.168.200.52 ... Wed Oct 15 16:24:24 2014 - [info] Wed Oct 15 16:23:19 2014 - [info] Sending binlog.. Wed Oct 15 16:23:19 2014 - [info] scp from local:/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog to [email protected]192.168.200.52:/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog succeeded. Wed Oct 15 16:23:19 2014 - [info] Starting recovery on 192.168.200.52(192.168.200.52:3306).. Wed Oct 15 16:23:19 2014 - [info] Generating diffs succeeded. Wed Oct 15 16:23:19 2014 - [info] Waiting until all relay logs are applied. Wed Oct 15 16:24:24 2014 - [info] done. Wed Oct 15 16:24:24 2014 - [info] Getting slave status.. Wed Oct 15 16:24:24 2014 - [info] This slave(192.168.200.52)‘s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin51.000080:102992428). No need to recover from Exec_Master_Log_Pos. Wed Oct 15 16:24:24 2014 - [info] Connecting to the target slave host 192.168.200.52, running recover script.. Wed Oct 15 16:24:24 2014 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user=‘zjy‘ --slave_host=192.168.200.52 --slave_ip=192.168.200.52 --slave_port=3306 --apply_files=/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog --workdir=/usr/local/masterha/mha_data --target_version=5.5.38-0ubuntu0.12.04.1-log --timestamp=20141015162314 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx Wed Oct 15 16:24:24 2014 - [info] Applying differential binary/relay log files /usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog on 192.168.200.52:3306. This may take long time... Applying log files succeeded. Wed Oct 15 16:24:24 2014 - [info] All relay logs were successfully applied. Wed Oct 15 16:24:24 2014 - [info] Resetting slave 192.168.200.52(192.168.200.52:3306) and starting replication from the new master 192.168.200.53(192.168.200.53:3306).. Wed Oct 15 16:24:24 2014 - [info] Executed CHANGE MASTER. Wed Oct 15 16:24:24 2014 - [info] Slave started. Wed Oct 15 16:24:24 2014 - [info] End of log messages from 192.168.200.52. Wed Oct 15 16:24:24 2014 - [info] -- Slave recovery on host 192.168.200.52(192.168.200.52:3306) succeeded. Wed Oct 15 16:24:24 2014 - [info] All new slave servers recovered successfully. Wed Oct 15 16:24:24 2014 - [info] Wed Oct 15 16:24:24 2014 - [info] * Phase 5: New master cleanup phase.. Wed Oct 15 16:24:24 2014 - [info] Wed Oct 15 16:24:24 2014 - [info] Resetting slave info on the new master.. Wed Oct 15 16:24:24 2014 - [info] 192.168.200.53: Resetting slave info succeeded. Wed Oct 15 16:24:24 2014 - [info] Master failover to 192.168.200.53(192.168.200.53:3306) completed successfully. Wed Oct 15 16:24:24 2014 - [info] ----- Failover Report ----- mha: MySQL Master failover 192.168.200.51 to 192.168.200.53 succeeded Master 192.168.200.51 is down! Check MHA Manager logs at ha1:/usr/local/masterha/manager.log for details. Started automated(non-interactive) failover. Invalidated master IP address on 192.168.200.51. The latest slave 192.168.200.52(192.168.200.52:3306) has all relay logs for recovery. Selected 192.168.200.53 as a new master. 192.168.200.53: OK: Applying all logs succeeded. 192.168.200.53: OK: Activated master IP address. 192.168.200.52: This host has the latest relay log events. Generating relay diff files from the latest slave succeeded. 192.168.200.52: OK: Applying all logs succeeded. Slave started, replicating from 192.168.200.53. 192.168.200.53: Resetting slave info succeeded. Master failover to 192.168.200.53(192.168.200.53:3306) completed successfully. Wed Oct 15 16:24:24 2014 - [info] Sending mail..
从上面的输出可以看出整个MHA的切换过程,共包括以下的步骤:
1.配置文件检查阶段,这个阶段会检查整个集群配置文件配置
2.宕机的master处理,这个阶段包括虚拟ip摘除操作,主机关机操作等
3.复制dead maste和最新slave相差的relay log,并保存到MHA Manger具体的目录下
4.识别含有最新更新的slave
5.应用从master保存的二进制日志事件(binlog events)
6.提升一个slave为新的master进行复制
7.使其他的slave连接新的master进行复制
可以通过Failover Report可以看到 Failover 成功,这里需要注意的是:在51宕机之后,52(延迟很大)进行追赶的时候,虚拟IP会暂时保留在原先的地方,只要追赶上去之后VIP就会切到最新的Slave上(53)。并且进行一次切换,masterha_manager 就自动停止,需要再次手动开启才能再进行监控。上面的测试说明:延迟越久,切换也越久。即使没有延迟也至少要3s以上的切换时间,因为在M发生宕机的时候需要3s的重连验证。
更多信息:
https://code.google.com/p/mysql-master-ha/w/list
http://www.cnblogs.com/gomysql/p/3675429.html