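III: Basic Preparation for the Heartbeat High-Availability Deployment
3.1 Set up virtual machines to simulate a real environment
We configure the hosts according to the host plan described earlier.
First, prepare two machines.
Configure the IP address, hostname, and /etc/hosts on each virtual machine
Assign IP addresses to the servers according to the host plan. If a machine needs two NICs, remember to add the second network device, preferably while the VM is powered off. After booting and logging in, run /etc/init.d/kudzu start to detect the new hardware (CentOS 6 no longer includes this command; start_udev can be used instead).
When this is done, reboot both hosts and then configure the network with the setup tool.
Note: there is no need to set a gateway or DNS here. Restart the network service with service network restart, and repeat the same steps on the other host.
I have already configured the hostnames and /etc/hosts, so I will not go over that again; just check connectivity with ping: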
Bash
[root@node01 ~]# ping node02.cn
PING node02.cn (172.10.25.27) 56(84) bytes of data.
64 bytes from node02.cn (172.10.25.27): icmp_seq=1 ttl=64 time=0.543 ms
64 bytes from node02.cn (172.10.25.27): icmp_seq=2 ttl=64 time=0.519 ms
64 bytes from node02.cn (172.10.25.27): icmp_seq=3 ttl=64 time=0.515 ms
Bash
[root@node02 ~]# ping node01.cn
PING node01.cn (172.10.25.26) 56(84) bytes of data.
64 bytes from node01.cn (172.10.25.26): icmp_seq=1 ttl=64 time=2.10 ms
64 bytes from node01.cn (172.10.25.26): icmp_seq=2 ttl=64 time=0.646 ms
64 bytes from node01.cn (172.10.25.26): icmp_seq=3 ttl=64 time=0.465 ms
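The hostname and /etc/hosts step is skipped above. As a minimal sketch, assuming the node names and eth0 addresses shown in the ping output (node01.cn = 172.10.25.26, node02.cn = 172.10.25.27), the entries could be added on each node with a small helper. hosts_entries and add_hosts_entries are hypothetical names; the helper only checks whether a hostname is already present and does not correct an existing entry with a wrong IP.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): add cluster hostnames to a hosts file.

# hosts_entries NAME=IP...   prints one "IP<TAB>NAME" line per argument
hosts_entries() {
  local pair
  for pair in "$@"; do
    printf '%s\t%s\n' "${pair#*=}" "${pair%%=*}"
  done
}

# add_hosts_entries FILE NAME=IP...
# Appends an entry only when its hostname does not already appear in FILE
# as a whole word; an existing entry with a different IP is left untouched.
add_hosts_entries() {
  local file=$1 ip name
  shift
  while IFS=$'\t' read -r ip name; do
    if ! grep -qwF -- "$name" "$file" 2>/dev/null; then
      printf '%s\t%s\n' "$ip" "$name" >> "$file"
    fi
  done < <(hosts_entries "$@")
}
```

Usage, as root on both nodes: `add_hosts_entries /etc/hosts node01.cn=172.10.25.26 node02.cn=172.10.25.27`. Running it a second time adds nothing.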
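3.2 Configure the heartbeat link between the servers
On each machine, add a host route so that when a machine checks its peer, the check goes over the dedicated heartbeat link.
On node01, add the route: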
Bash
/sbin/route add -host 10.25.25.17 dev eth1
echo '/sbin/route add -host 10.25.25.17 dev eth1' >>/etc/rc.local
10.25.25.17 is the eth1 IP address of node02.
On node02, add the route:
Bash
/sbin/route add -host 10.25.25.16 dev eth1
echo '/sbin/route add -host 10.25.25.16 dev eth1' >>/etc/rc.local
10.25.25.16 is the eth1 IP address of node01.
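Appending with echo ... >>/etc/rc.local adds another copy of the route every time the command is re-run. A minimal sketch of an idempotent alternative, assuming bash; append_once and persist_peer_route are hypothetical helper names, not part of heartbeat:

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): persist the heartbeat host route once.

# append_once FILE LINE   appends LINE only if FILE has no identical line
append_once() {
  local file=$1 line=$2
  # -x: whole-line match, -F: fixed string, -q: no output
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# persist_peer_route PEER_IP IFACE [RC_FILE]   default RC_FILE is /etc/rc.local
persist_peer_route() {
  local peer=$1 iface=$2 rc=${3:-/etc/rc.local}
  append_once "$rc" "/sbin/route add -host $peer dev $iface"
}
```

On node01 this would be `persist_peer_route 10.25.25.17 eth1`, and on node02 `persist_peer_route 10.25.25.16 eth1`; the route itself still has to be added once with /sbin/route for the current boot.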
Check node01's routing table:
Bash
[root@node01 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.25.25.17     0.0.0.0         255.255.255.255 UH    0      0        0 eth1
172.10.25.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
10.10.25.0      0.0.0.0         255.255.255.0   U     0      0        0 eth1
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1003   0        0 eth1
0.0.0.0         172.10.25.2     0.0.0.0         UG    0      0        0 eth0
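To check the host route from a script rather than by eye, the table can be filtered with awk. A sketch, assuming the net-tools `route -n` column layout shown above; has_host_route is a hypothetical helper that reads the table on stdin:

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): succeed if `route -n` output on stdin has a
# host route (/32 netmask, H flag) to PEER_IP through IFACE.

# has_host_route PEER_IP IFACE < route-n-output
has_host_route() {
  awk -v dst="$1" -v dev="$2" '
    # columns: Destination Gateway Genmask Flags Metric Ref Use Iface
    $1 == dst && $3 == "255.255.255.255" && $4 ~ /H/ && $8 == dev { found = 1 }
    END { exit !found }
  '
}
```

For example, on node01: `route -n | has_host_route 10.25.25.17 eth1 && echo "heartbeat route present"`.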
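IV: Heartbeat High-Availability Deployment
4.1 Install Heartbeat 2 on CentOS 5.x
Install with yum install heartbeat -y; the command has to be run twice.
Note: heartbeat does not serve clients directly and has no special performance requirements, so software of this kind is usually best installed with yum: deployment is simple and fast, and maintenance is easy.
4.2 Install Heartbeat 3 on CentOS 6.x
Configure the EPEL repository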
Bash
[root@node01 ~]# rpm -Uvh https://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
[root@node02 ~]# rpm -Uvh https://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
Install heartbeat (the pattern is quoted so the shell does not expand it against files in the current directory)
Bash
[root@node01 ~]# yum install 'heartbeat*' -y
[root@node02 ~]# yum install 'heartbeat*' -y
To keep the downloaded packages in the yum cache, do the following:
[root@node01 ~]# sed -i 's#keepcache=0#keepcache=1#g' /etc/yum.conf
[root@node01 ~]# grep keepcache /etc/yum.conf
keepcache=1        # package caching is now enabled
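Configure the ha.cf file
Bash
[root@node01 ~]# ls /usr/share/doc/heartbeat-3.0.4/        # directory holding the configuration templates
apphbd.cf AUTHORS COPYING ha.cf README
[root@node01 ~]# cd /usr/share/doc/heartbeat-3.0.4/
[root@node01 heartbeat-3.0.4]# cp ha.cf authkeys haresources /etc/ha.d/        # copy the configuration templates
ha.cf parameter reference
[root@node01 ~]# vim /etc/ha.d/ha.cf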
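debugfile /var/log/ha-debug        location of heartbeat's debug log
logfile /var/log/ha-log        location of heartbeat's log
logfacility local0        send log messages to syslog through the local0 facility
keepalive 2        heartbeat interval of 2 seconds (a heartbeat is sent on eth1 every 2 seconds)
deadtime 30        if the standby node receives no heartbeat from the primary for 30 seconds, it immediately takes over the primary's resources
warntime 10        if the standby node receives no heartbeat from the primary for 10 seconds, a warning is written to the log, but no failover happens
initdead 120        after heartbeat first starts, wait 120 seconds before starting the resources. This allows for the delay some systems have in bringing up the network after a reboot; the value should be at least twice deadtime. This long wait is also why binding the VIP is slow when one node starts on its own; that is normal
#bcast eth1        send heartbeats as Ethernet broadcasts on eth1; to carry heartbeats over two physical networks, use bcast eth0 eth1. We use multicast here, so this line stays disabled
mcast eth1 225.0.0.10 694 1 0        multicast heartbeat. The fields are: the interface, the multicast group address (between 224.0.0.0 and 239.255.255.255), the UDP port, the TTL (how many router hops the packets may cross), and whether loopback is enabled (whether the node also receives the packets it sends)
#ucast eth0 192.168.1.2        for unicast, give the network interface and the IP address to send heartbeats to (the peer's address)
auto_failback on        decides what happens when the node that owns the resources recovers: with on, the resources move back to their owner; with off, they keep running on the current node until that node fails
#stonith baytech /etc/ha.d/conf/stonith.baytech        in clusters with shared resources, use STONITH fencing to keep the data consistent
#watchdog /dev/watchdog        sets up a watchdog timer: if the node has had no heartbeat of its own for one minute, it is rebooted
#node ken3        cluster member nodes. Note: a node name must match the output of uname -n; it can also be an IP address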
node node01.cn
node node02.cn
The parameters that actually need to be set:
Bash
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
warntime 10
initdead 60
mcast eth1 225.0.0.10 694 1 0
auto_failback on
node node01.cn
node node02.cn
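The timing rules described above can be checked before heartbeat is started. A minimal sketch; check_hacf_timing and _hacf_get are hypothetical helpers. They assume plain integer seconds (values with an ms suffix, which ha.cf also accepts, are rejected here) and the ordering keepalive < warntime < deadtime, which follows from warntime being a warning that should come before the failover at deadtime.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): sanity-check ha.cf timing directives.

# _hacf_get DIRECTIVE FILE   prints the value from the first line whose first
# field is DIRECTIVE; commented-out lines start with "#" and never match
_hacf_get() {
  awk -v k="$1" '$1 == k { print $2; exit }' "$2"
}

# check_hacf_timing [FILE]   returns 0 if all rules hold, 1 if a rule is
# violated, 2 if a value is missing or not a plain integer
check_hacf_timing() {
  local cf=${1:-/etc/ha.d/ha.cf} rc=0 v
  local ka wt dt id
  ka=$(_hacf_get keepalive "$cf")
  wt=$(_hacf_get warntime "$cf")
  dt=$(_hacf_get deadtime "$cf")
  id=$(_hacf_get initdead "$cf")
  for v in "$ka" "$wt" "$dt" "$id"; do
    if ! [[ $v =~ ^[0-9]+$ ]]; then
      echo "missing or non-integer timing value: '$v'" >&2
      return 2
    fi
  done
  (( ka < wt )) || { echo "keepalive ($ka) should be below warntime ($wt)" >&2; rc=1; }
  (( wt < dt )) || { echo "warntime ($wt) should be below deadtime ($dt)" >&2; rc=1; }
  (( id >= 2 * dt )) || { echo "initdead ($id) should be at least 2 x deadtime ($dt)" >&2; rc=1; }
  return "$rc"
}
```

With the values above (keepalive 2, warntime 10, deadtime 30, initdead 60), `check_hacf_timing /etc/ha.d/ha.cf` passes all three checks.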
Configure the authkeys file
Bash
[root@node01 ~]# echo 123456 | openssl sha1
(stdin)= c4f9375f9834b4e7f0a528cc65c055702bf5f24a        # the generated hash
Bash
[root@node01 ~]# vim /etc/ha.d/authkeys
# Authentication file. Must be mode 600
# Available methods: crc sha1, md5. Crc doesn't need/want a key.
# (crc provides no real security; sha1 is the strongest of the three)
auth 1
# key 1 uses sha1 authentication
1 sha1 c4f9375f9834b4e7f0a528cc65c055702bf5f24a
[root@node01 ~]# chmod 600 /etc/ha.d/authkeys
Tip: both servers need this configuration.
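The authkeys steps above can be combined into one helper. A minimal sketch, assuming coreutils sha1sum is available: echo appends a newline, so echo "$pass" | sha1sum hashes the same bytes as the article's echo 123456 | openssl sha1. write_authkeys is a hypothetical name, and a passphrase as simple as 123456 is only suitable for a lab.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): write an authkeys file with one sha1 key.

# write_authkeys PASSPHRASE [FILE]   default FILE is /etc/ha.d/authkeys
write_authkeys() {
  local pass=$1 file=${2:-/etc/ha.d/authkeys} key
  # SHA-1 of "PASSPHRASE\n", same digest as: echo PASSPHRASE | openssl sha1
  key=$(echo "$pass" | sha1sum | awk '{print $1}')
  # a restrictive umask keeps a newly created file private while it is written
  ( umask 077 && printf 'auth 1\n1 sha1 %s\n' "$key" > "$file" )
  # heartbeat requires mode 600; chmod also covers a pre-existing file
  chmod 600 "$file"
}
```

Running `write_authkeys 'some-passphrase'` as root on node01 produces the file that is then copied to node02 together with ha.cf and haresources.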
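Configure the haresources file
Edit the heartbeat resource file /etc/ha.d/haresources:
Bash
[root@node01 ~]# vim /etc/ha.d/haresources
node01.cn IPaddr::172.10.25.18/24/eth0
node02.cn IPaddr::172.10.25.10/24/eth0
Notes:
node01.cn is a hostname: it means that IP 172.10.25.18 is initially bound on node01.cn.
IPaddr is heartbeat's default script for configuring IP addresses; the IP and everything after it are arguments to that script.
172.10.25.18/24/eth0 is the VIP through which the cluster serves clients, initially brought up on node01.cn; 24 is the netmask length, and eth0 is the physical NIC the IP is bound to, which serves as heartbeat's external service interface.
Another configuration example, Heartbeat + DRBD + MySQL: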
Bash
node01.cn IPaddr::172.10.25.18/24/eth0 drbddisk::data Filesystem::/dev/drbd0::/data::ext3 mysqld
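- drbddisk::data starts the DRBD resource named data; this is equivalent to running /etc/ha.d/resource.d/drbddisk data start/stop
- Filesystem::/dev/drbd0::/data::ext3 mounts the DRBD device /dev/drbd0 on the /data directory; this is equivalent to running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /data ext3 start/stop
- mysqld runs the MySQL init script, which must be located under /etc/init.d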
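The mapping in the list above can be spelled out mechanically. A minimal sketch, assuming bash; haresources_calls is a hypothetical helper that prints, for one haresources line, the script name and arguments each resource is called with. It assumes resources are started left to right and stopped right to left (an assumption worth confirming against the comments in heartbeat's haresources template), and it does not model where each script is looked up (/etc/ha.d/resource.d or /etc/init.d).

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): expand a haresources line into script calls.
# "name::arg1::arg2" becomes "name arg1 arg2 ACTION".

# haresources_calls start|stop "NODE RES1 RES2 ..."
haresources_calls() {
  local action=$1 res name args i
  local -a parts resources
  read -r -a parts <<< "$2"
  resources=("${parts[@]:1}")            # drop the preferred node name
  if [[ $action == stop ]]; then         # assumed: stop in reverse order
    local -a rev=()
    for (( i = ${#resources[@]} - 1; i >= 0; i-- )); do
      rev+=("${resources[i]}")
    done
    resources=("${rev[@]}")
  fi
  for res in "${resources[@]}"; do
    name=${res%%::*}
    if [[ $res == *::* ]]; then
      args=${res#*::}
      args=${args//::/ }                 # every "::" separates two arguments
      echo "$name $args $action"
    else
      echo "$name $action"
    fi
  done
}
```

For the DRBD example line, `haresources_calls start "node01.cn IPaddr::172.10.25.18/24/eth0 drbddisk::data Filesystem::/dev/drbd0::/data::ext3 mysqld"` prints one call per resource, ending with `mysqld start`.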
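Copy the three configured files straight to the other server:
Bash
[root@node01 ~]# cd /etc/ha.d/
[root@node01 ha.d]# scp ha.cf authkeys haresources 172.10.25.27:/etc/ha.d/
At this point the deployment is essentially complete.
V: Testing Heartbeat High Availability
5.1 Start the heartbeat service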
Bash
[root@node01 ~]# /etc/init.d/heartbeat start        # start the service
[root@node01 ~]# ip add | grep '172.10'
inet 172.10.25.26/24 brd 172.10.25.255 scope global eth0
inet 172.10.25.10/24 brd 172.10.25.255 scope global secondary eth0
inet 172.10.25.18/24 brd 172.10.25.255 scope global secondary eth0        # the peer's service is not running yet, so this host has taken over all resources and holds both VIPs
[root@node01 ~]# ps -ef | grep heartbeat
root 14637 1 0 15:26 ? 00:00:01 heartbeat: master control process
root 14643 14637 0 15:26 ? 00:00:00 heartbeat: FIFO reader
root 14644 14637 0 15:26 ? 00:00:00 heartbeat: write: mcast eth1
root 14645 14637 0 15:26 ? 00:00:00 heartbeat: read: mcast eth1
root 15476 13298 0 16:17 pts/3 00:00:00 grep heartbeat
[root@node01 ~]# tail -f /var/log/ha-debug        # view the debug log
Bash
[root@node02 ~]# /etc/init.d/heartbeat start
[root@node02 ~]# ip add | grep '172.10'
inet 172.10.25.27/24 brd 172.10.25.255 scope global eth0
inet 172.10.25.18/24 brd 172.10.25.255 scope global secondary eth0
inet 172.10.25.10/24 brd 172.10.25.255 scope global secondary eth0
Both hosts now hold both VIPs at the same time: this is split brain.
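Spotting split brain comes down to seeing the same VIP on both nodes. A minimal sketch; vips_held is a hypothetical helper that reads `ip addr` output on stdin and prints which of the given VIPs this host currently holds:

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): list which of the given VIPs appear as
# "inet" addresses in `ip addr` output read from stdin.

# vips_held VIP... < ip-addr-output
vips_held() {
  awk -v list="$*" '
    BEGIN { n = split(list, want, " "); for (i = 1; i <= n; i++) w[want[i]] = 1 }
    # $2 looks like 172.10.25.18/24; compare only the address part
    $1 == "inet" { split($2, a, "/"); if (a[1] in w) print a[1] }
  '
}
```

Run `ip addr | vips_held 172.10.25.18 172.10.25.10` on both nodes; if the same address is printed on both while heartbeat runs on both, the cluster is in split brain.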
5.2 Troubleshooting the split brain
First check whether the hosts can ping each other:
Bash
[root@node01 ~]# ping node02.cn
PING node02.cn (172.10.25.27) 56(84) bytes of data.
64 bytes from node02.cn (172.10.25.27): icmp_seq=1 ttl=64 time=0.302 ms
64 bytes from node02.cn (172.10.25.27): icmp_seq=2 ttl=64 time=0.554 ms
Bash
[root@node02 ~]# ping node02.cn
PING node02.cn (172.10.25.27) 56(84) bytes of data.
64 bytes from node02.cn (172.10.25.27): icmp_seq=1 ttl=64 time=0.141 ms
64 bytes from node02.cn (172.10.25.27): icmp_seq=2 ttl=64 time=0.057 ms
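The hosts can ping each other.
Next, check whether the iptables firewall is interfering. If it is, either stop the iptables service or let port 694 (the UDP port in the mcast line) through: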
Bash
[root@node01 ~]# service iptables stop
[root@node02 ~]# service iptables stop
[root@node01 ~]# service heartbeat stop
[root@node01 ~]# service heartbeat start
[root@node02 ~]# service heartbeat stop
[root@node02 ~]# service heartbeat start
[root@node01 ~]# ip add | grep '172.10'
inet 172.10.25.26/24 brd 172.10.25.255 scope global eth0
inet 172.10.25.18/24 brd 172.10.25.255 scope global secondary eth0
[root@node02 ~]# ip add | grep '172.10'
inet 172.10.25.27/24 brd 172.10.25.255 scope global eth0
inet 172.10.25.10/24 brd 172.10.25.255 scope global secondary eth0
With iptables stopped and heartbeat restarted, each node now holds only its own VIP, so the split brain is resolved.
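Stopping iptables works in a lab, but the other option mentioned above, letting heartbeat's port through, keeps the firewall in place. A sketch, assuming the CentOS 6 iptables service and the eth1/694 values from the mcast line; hb_fw_rules is a hypothetical helper that only prints the commands so they can be reviewed before running them as root:

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): print iptables commands that accept heartbeat
# traffic on the heartbeat interface instead of stopping the firewall.
# Defaults come from the mcast line in ha.cf: interface eth1, UDP port 694.

# hb_fw_rules [IFACE] [PORT]
hb_fw_rules() {
  local iface=${1:-eth1} port=${2:-694}
  # insert at the top of INPUT so it is evaluated before any REJECT rule
  echo "iptables -I INPUT -i $iface -p udp --dport $port -j ACCEPT"
  # CentOS 6 init script: write the running rules to /etc/sysconfig/iptables
  echo "service iptables save"
}
```

Start the firewall again first (service iptables start) and then apply the printed commands on both nodes, for example with `hb_fw_rules | sh`; if the service is stopped when the rules are saved, the saved file may end up containing only the new rule.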
Check the cluster's heartbeat link information:
Bash
[root@node01 ~]# cl_status listhblinks node01.cn        # list the heartbeat links a node uses
eth1
[root@node02 ~]# cl_status listhblinks node02.cn
eth1
[root@node01 ~]# cl_status hblinkstatus node02.cn eth1        # show the status of node02.cn's eth1 heartbeat link
up
[root@node02 ~]# cl_status hblinkstatus node01.cn eth1
up
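The per-node cl_status checks above can be looped over all nodes. A minimal sketch, assuming the cl_status subcommands used above (listhblinks, hblinkstatus); check_hb_links is a hypothetical helper and the node names are passed in by the caller:

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): report every heartbeat link of the given
# nodes via cl_status and fail if any link is not "up".

# check_hb_links NODE...   prints "NODE LINK STATUS" lines
check_hb_links() {
  local node link status rc=0
  for node in "$@"; do
    # listhblinks may print several interfaces; rely on word splitting
    for link in $(cl_status listhblinks "$node"); do
      status=$(cl_status hblinkstatus "$node" "$link")
      echo "$node $link $status"
      [[ $status == up ]] || rc=1
    done
  done
  return "$rc"
}
```

For example: `check_hb_links node01.cn node02.cn || echo "a heartbeat link is down"`.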
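5.3 Manual resource switchover and failure recovery in heartbeat
To test a switchover we have to simulate a failure. Common cases are a broken NIC or a downed network, a system crash, and the heartbeat service stopping; a switchover can also be triggered with the hb_standby script.
Bash
[root@node01 ~]# /usr/share/heartbeat/hb_standby        # release all resources
Going standby [all].
[root@node01 ~]# ip addr | grep '172.10'
inet 172.10.25.26/24 brd 172.10.25.255 scope global eth0
[root@node01 ~]# /usr/share/heartbeat/hb_takeover        # take over all resources
[root@node01 ~]# ip addr | grep '172.10'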
inet 172.10.25.26/24 brd 172.10.25.255 scope global eth0
inet 172.10.25.18/24 brd 172.10.25.255 scope global secondary eth0
inet 172.10.25.10/24 brd 172.10.25.255 scope global secondary eth0
Here we simulate the heartbeat service dying on node01:
Bash
[root@node01 ~]# /etc/init.d/heartbeat stop
[root@node02 ~]# ip add | grep '172.10'        # node02 takes over immediately
inet 172.10.25.27/24 brd 172.10.25.255 scope global eth0
inet 172.10.25.10/24 brd 172.10.25.255 scope global secondary eth0
inet 172.10.25.18/24 brd 172.10.25.255 scope global secondary eth0
[root@node01 ~]# /etc/init.d/heartbeat start        # bring the service back
[root@node01 ~]# ip add | grep '172.10'        # node01 takes its resource back (auto_failback on)
inet 172.10.25.26/24 brd 172.10.25.255 scope global eth0
inet 172.10.25.18/24 brd 172.10.25.255 scope global secondary eth0
[root@node02 ~]# ip add | grep '172.10'
inet 172.10.25.27/24 brd 172.10.25.255 scope global eth0
inet 172.10.25.10/24 brd 172.10.25.255 scope global secondary eth0
5.4 Analyzing the resource takeover through the heartbeat logs
First stop the service on both nodes and clear the logs:
Bash
[root@node01 ~]# /etc/init.d/heartbeat stop
[root@node02 ~]# /etc/init.d/heartbeat stop
[root@node01 ~]# >/var/log/ha-log
[root@node01 ~]# >/var/log/ha-debug
[root@node02 ~]# >/var/log/ha-log
[root@node02 ~]# >/var/log/ha-debug
Then open another terminal and follow the log as it is written:
Bash
[root@node01 ~]# tail -f /var/log/ha-debug        # terminal 2: follow the debug log
[root@node01 ~]# /etc/init.d/heartbeat start        # terminal 1: start the service
Feb 27 15:29:07 node01.cn heartbeat: [11460]: info: Pacemaker support: false
Feb 27 15:29:07 node01.cn heartbeat: [11460]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Feb 27 15:29:07 node01.cn heartbeat: [11460]: info: **************************
Feb 27 15:29:07 node01.cn heartbeat: [11460]: info: Configuration validated. Starting heartbeat 3.0.4
Feb 27 15:29:07 node01.cn heartbeat: [11461]: info: heartbeat: version 3.0.4
Feb 27 15:29:07 node01.cn heartbeat: [11461]: info: Heartbeat generation: 1456298790
Feb 27 15:29:07 node01.cn heartbeat: [11461]: info: glib: UDP multicast heartbeat started for group 225.0.0.10 port 694 interface eth1 (ttl=1 loop=0)
Feb 27 15:29:07 node01.cn heartbeat: [11461]: info: G_main_add_TriggerHandler: Added signal manual handler
Feb 27 15:29:07 node01.cn heartbeat: [11461]: info: G_main_add_TriggerHandler: Added signal manual handler
Feb 27 15:29:07 node01.cn heartbeat: [11461]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Feb 27 15:29:08 node01.cn heartbeat: [11461]: info: Local status now set to: 'up'        # startup then waits for a while (here 60 s, matching initdead 60)
Feb 27 15:30:08 node01.cn heartbeat: [11461]: WARN: node node02.cn: is dead        # heartbeat is not running on node02.cn yet, so it is reported as dead
After heartbeat is started on node02.cn:
Feb 27 15:33:19 node01.cn heartbeat: [11461]: info: Link node02.cn:eth1 up.
Feb 27 15:33:19 node01.cn heartbeat: [11461]: info: Status update for node node02.cn: status init
Feb 27 15:33:19 node01.cn heartbeat: [11461]: info: Status update for node node02.cn: status up
Feb 27 15:33:19 node01.cn heartbeat: [11461]: debug: StartNextRemoteRscReq(): child count 1
Feb 27 15:33:19 node01.cn heartbeat: [12151]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc(default)[12151]: 2016/02/27_15:33:19 info: Running /etc/ha.d//rc.d/status status
Feb 27 15:33:19 node01.cn heartbeat: [12169]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc(default)[12169]: 2016/02/27_15:33:19 info: Running /etc/ha.d//rc.d/status status
Feb 27 15:33:20 node01.cn heartbeat: [11461]: debug: get_delnodelist: delnodelist=
Feb 27 15:33:20 node01.cn heartbeat: [11461]: info: Status update for node node02.cn: status active
Feb 27 15:33:20 node01.cn heartbeat: [12186]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc(default)[12186]: 2016/02/27_15:33:21 info: Running /etc/ha.d//rc.d/status status
Feb 27 15:33:21 node01.cn heartbeat: [11461]: info: remote resource transition completed.
Feb 27 15:33:21 node01.cn heartbeat: [11461]: info: node01.cn wants to go standby [foreign]
Feb 27 15:33:21 node01.cn heartbeat: [11461]: info: standby: node02.cn can take our foreign resources
Feb 27 15:33:21 node01.cn heartbeat: [12203]: info: give up foreign HA resources (standby).
ResourceManager(default)[12216]: 2016/02/27_15:33:22 info: Releasing resource group: node02.cn IPaddr::172.10.25.10/24/eth0
ResourceManager(default)[12216]: 2016/02/27_15:33:22 info: Running /etc/ha.d/resource.d/IPaddr 172.10.25.10/24/eth0 stop
IPaddr(IPaddr_172.10.25.10)[12279]: 2016/02/27_15:33:22 INFO: IP status = ok, IP_CIP=
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.10.25.10)[12253]: 2016/02/27_15:33:22 INFO: Success
INFO: Success
Feb 27 15:33:22 node01.cn heartbeat: [12203]: info: foreign HA resource release completed (standby).
Feb 27 15:33:22 node01.cn heartbeat: [11461]: info: Local standby process completed [foreign].
Feb 27 15:33:22 node01.cn heartbeat: [11461]: WARN: 1 lost packet(s) for [node02.cn] [10:12]
Feb 27 15:33:22 node01.cn heartbeat: [11461]: info: remote resource transition completed.
Feb 27 15:33:22 node01.cn heartbeat: [11461]: info: No pkts missing from node02.cn!
Feb 27 15:33:22 node01.cn heartbeat: [11461]: info: Other node completed standby takeover of foreign resources
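Filtering the log down to the membership and resource messages makes the takeover sequence easier to follow. A sketch; the grep patterns are copied from messages in the excerpt above, hb_transitions is a hypothetical helper, and other heartbeat versions may word their messages differently:

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): keep only node-state and resource-transition
# lines from a heartbeat log. Patterns come from the log excerpt above.

# hb_transitions [LOGFILE]   reads stdin when no file is given
hb_transitions() {
  grep -E 'is dead|Status update for node|standby|Releasing resource group|Running /etc/ha\.d/resource\.d/|transition completed' "${1:--}"
}
```

For example: `hb_transitions /var/log/ha-debug`, or `tail -f /var/log/ha-debug | hb_transitions` to watch a switchover live.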