To ensure high availability of an HBase cluster, HBase supports configuring multiple Backup Masters. When the active master goes down, a Backup Master automatically takes over the cluster.
The configuration is very simple:
Create a file named backup-masters under $HBASE_HOME/conf/ and add the hostnames of the nodes that should act as Backup Masters, one per line. For example:
[hbase@master conf]$ cat backup-masters
node1
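If you want more than one backup master, list one hostname per line in the same file, and make sure the file is present on the node where you run start-hbase.sh, since that script reads it to decide where to start backup masters over SSH. A minimal sketch (using node2 as a second backup master is only an example for this cluster):

[hbase@master conf]$ printf 'node1\nnode2\n' > backup-masters
[hbase@master conf]$ scp backup-masters node1:$HBASE_HOME/conf/
[hbase@master conf]$ scp backup-masters node2:$HBASE_HOME/conf/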
After that, start the whole cluster. You will find that an HMaster process is running on both master and node1:
[hbase@master conf]$ jps
25188 NameNode
3319 QuorumPeerMain
31725 Jps
25595 ResourceManager
31077 HMaster
25711 NodeManager
25303 DataNode
31617 Main
31220 HRegionServer
[hbase@node1 root]$ jps
11560 DataNode
11762 NodeManager
20769 Jps
415 QuorumPeerMain
11675 SecondaryNameNode
20394 HRegionServer
20507 HMaster
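Each backup master also registers itself under the /hbase/backup-masters znode in ZooKeeper (the log below shows node1 doing exactly that). One way to confirm the registration is the zkcli wrapper that ships with HBase (a sketch; the ZooKeeper connection chatter is omitted):

[hbase@master conf]$ hbase zkcli ls /hbase/backup-masters
...
[node1,60000,1444455307700]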
Now look at the master log on node1, where you will see messages like these:
[hbase@node1 logs]$ tail -f hbase-hbase-master-node1.log
2015-10-10 05:35:09,609 INFO [main] mortbay.log: Started SelectChannelConnector@0.0.0.0:60010
2015-10-10 05:35:09,613 INFO [main] master.HMaster: hbase.rootdir=hdfs://master:9000/hbase, hbase.cluster.distributed=true
2015-10-10 05:35:09,631 INFO [main] master.HMaster: Adding backup master ZNode /hbase/backup-masters/node1,60000,1444455307700
2015-10-10 05:35:09,806 INFO [node1:60000.activeMasterManager] master.ActiveMasterManager: Another master is the active master, master,60000,1444455305852; waiting to become the next active master
2015-10-10 05:35:09,858 INFO [master/node1/10.0.52.145:60000] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x10135dbc connecting to ZooKeeper ensemble=master:2181,node1:2181,node2:2181
2015-10-10 05:35:09,858 INFO [master/node1/10.0.52.145:60000] zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181,node1:2181,node2:2181 sessionTimeout=90000 watcher=hconnection-0x10135dbc0x0, quorum=master:2181,node1:2181,node2:2181, baseZNode=/hbase
2015-10-10 05:35:09,859 INFO [master/node1/10.0.52.145:60000-SendThread(node2:2181)] zookeeper.ClientCnxn: Opening socket connection to server node2/10.0.52.146:2181. Will not attempt to authenticate using SASL (unknown error)
2015-10-10 05:35:09,860 INFO [master/node1/10.0.52.145:60000-SendThread(node2:2181)] zookeeper.ClientCnxn: Socket connection established to node2/10.0.52.146:2181, initiating session
2015-10-10 05:35:09,885 INFO [master/node1/10.0.52.145:60000-SendThread(node2:2181)] zookeeper.ClientCnxn: Session establishment complete on server node2/10.0.52.146:2181, sessionid = 0x350463058c10017, negotiated timeout = 40000
2015-10-10 05:35:09,920 INFO [master/node1/10.0.52.145:60000] regionserver.HRegionServer: ClusterId : c309a039-eb35-400c-bb13-0b6ed939cc5e
These messages show that the cluster already has an active master (the node named master), so node1 waits. Only when the HMaster on master goes down will node1 become the new active master.
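Behind the scenes this is a ZooKeeper election: the active master holds the ephemeral /hbase/master znode, and the backups wait for it to disappear. You can ask the HBase shell who currently holds which role with zk_dump (a sketch; output trimmed to the relevant lines):

hbase(main):001:0> zk_dump
HBase is rooted at /hbase
Active master address: master,60000,1444455305852
Backup master addresses:
 node1,60000,1444455307700
...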
Now, kill the HMaster process on the master node directly.
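A sketch of the kill, using the HMaster PID from the jps output on master above (31077 in this run; yours will differ). kill -9 simulates an abrupt crash; a plain kill would also trigger failover, just via a clean shutdown:

[hbase@master ~]$ jps | grep HMaster
31077 HMaster
[hbase@master ~]$ kill -9 31077

Tailing the master log on node1, we then see: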
2015-10-10 05:42:17,173 INFO [node1:60000.activeMasterManager] master.ActiveMasterManager: Deleting ZNode for /hbase/backup-masters/node1,60000,1444455307700 from backup master directory
2015-10-10 05:42:17,194 INFO [node1:60000.activeMasterManager] master.ActiveMasterManager: Registered Active Master=node1,60000,1444455307700
2015-10-10 05:42:17,758 INFO [node1:60000.activeMasterManager] fs.HFileSystem: Added intercepting call to namenode#getBlockLocations so can do block reordering using class class org.apache.hadoop.hbase.fs.HFileSystem$ReorderWALBlocks
2015-10-10 05:42:17,776 INFO [node1:60000.activeMasterManager] coordination.SplitLogManagerCoordination: Found 0 orphan tasks and 0 rescan nodes
2015-10-10 05:42:17,880 INFO [node1:60000.activeMasterManager] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x29d405f7 connecting to ZooKeeper ensemble=master:2181,node1:2181,node2:2181
2015-10-10 05:42:17,880 INFO [node1:60000.activeMasterManager] zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181,node1:2181,node2:2181 sessionTimeout=90000 watcher=hconnection-0x29d405f70x0, quorum=master:2181,node1:2181,node2:2181, baseZNode=/hbase
2015-10-10 05:42:17,883 INFO [node1:60000.activeMasterManager-SendThread(node2:2181)] zookeeper.ClientCnxn: Opening socket connection to server node2/10.0.52.146:2181. Will not attempt to authenticate using SASL (unknown error)
2015-10-10 05:42:17,884 INFO [node1:60000.activeMasterManager-SendThread(node2:2181)] zookeeper.ClientCnxn: Socket connection established to node2/10.0.52.146:2181, initiating session
2015-10-10 05:42:17,904 INFO [node1:60000.activeMasterManager-SendThread(node2:2181)] zookeeper.ClientCnxn: Session establishment complete on server node2/10.0.52.146:2181, sessionid = 0x350463058c1001b, negotiated timeout = 40000
2015-10-10 05:42:17,942 INFO [node1:60000.activeMasterManager] balancer.StochasticLoadBalancer: loading config
2015-10-10 05:42:18,061 INFO [node1:60000.activeMasterManager] master.HMaster: Server active/primary master=node1,60000,1444455307700, sessionid=0x150463058ac001a, setting cluster-up flag (Was=true)
2015-10-10 05:42:18,154 INFO [node1:60000.activeMasterManager] procedure.ZKProcedureUtil: Clearing all procedure znodes: /hbase/online-snapshot/acquired /hbase/online-snapshot/reached /hbase/online-snapshot/abort
2015-10-10 05:42:18,184 INFO [node1:60000.activeMasterManager] procedure.ZKProcedureUtil: Clearing all procedure znodes: /hbase/flush-table-proc/acquired /hbase/flush-table-proc/reached /hbase/flush-table-proc/abort
2015-10-10 05:42:18,256 INFO [node1:60000.activeMasterManager] master.MasterCoprocessorHost: System coprocessor loading is enabled
2015-10-10 05:42:18,286 INFO [node1:60000.activeMasterManager] procedure2.ProcedureExecutor: Starting procedure executor threads=5
2015-10-10 05:42:18,288 INFO [node1:60000.activeMasterManager] wal.WALProcedureStore: Starting WAL Procedure Store lease recovery
2015-10-10 05:42:18,296 INFO [node1:60000.activeMasterManager] util.FSHDFSUtils: Recovering lease on dfs file hdfs://master:9000/hbase/MasterProcWALs/state-00000000000000000027.log
2015-10-10 05:42:18,307 INFO [node1:60000.activeMasterManager] util.FSHDFSUtils: recoverLease=true, attempt=0 on file=hdfs://master:9000/hbase/MasterProcWALs/state-00000000000000000027.log after 9ms
2015-10-10 05:42:18,324 WARN [node1:60000.activeMasterManager] wal.WALProcedureStore: Unable to read tracker for hdfs://master:9000/hbase/MasterProcWALs/state-00000000000000000027.log - Missing trailer: size=9 startPos=9
2015-10-10 05:42:18,373 INFO [node1:60000.activeMasterManager] wal.WALProcedureStore: Lease acquired for flushLogId: 28
2015-10-10 05:42:18,383 WARN [node1:60000.activeMasterManager] wal.ProcedureWALFormatReader: nothing left to decode. exiting with missing EOF
2015-10-10 05:42:18,383 INFO [node1:60000.activeMasterManager] wal.ProcedureWALFormatReader: No active entry found in state log hdfs://master:9000/hbase/MasterProcWALs/state-00000000000000000027.log. removing it
2015-10-10 05:42:18,405 INFO [node1:60000.activeMasterManager] zookeeper.RecoverableZooKeeper: Process identifier=replicationLogCleaner connecting to ZooKeeper ensemble=master:2181,node1:2181,node2:2181
2015-10-10 05:42:18,405 INFO [node1:60000.activeMasterManager] zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181,node1:2181,node2:2181 sessionTimeout=90000 watcher=replicationLogCleaner0x0, quorum=master:2181,node1:2181,node2:2181, baseZNode=/hbase
2015-10-10 05:42:18,407 INFO [node1:60000.activeMasterManager-SendThread(node1:2181)] zookeeper.ClientCnxn: Opening socket connection to server node1/10.0.52.145:2181. Will not attempt to authenticate using SASL (unknown error)
2015-10-10 05:42:18,408 INFO [node1:60000.activeMasterManager-SendThread(node1:2181)] zookeeper.ClientCnxn: Socket connection established to node1/10.0.52.145:2181, initiating session
2015-10-10 05:42:18,426 INFO [node1:60000.activeMasterManager-SendThread(node1:2181)] zookeeper.ClientCnxn: Session establishment complete on server node1/10.0.52.145:2181, sessionid = 0x250463058780018, negotiated timeout = 40000
2015-10-10 05:42:18,464 INFO [node1:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2015-10-10 05:42:19,970 INFO [node1:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2015-10-10 05:42:21,475 INFO [node1:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 3011 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2015-10-10 05:42:22,980 INFO [node1:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 4516 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2015-10-10 05:42:23,058 INFO [PriorityRpcServer.handler=3,queue=1,port=60000] master.ServerManager: Registering server=node1,16020,1444455306545
2015-10-10 05:42:23,059 INFO [PriorityRpcServer.handler=5,queue=1,port=60000] master.ServerManager: Registering server=master,16020,1444455306763
2015-10-10 05:42:23,060 INFO [PriorityRpcServer.handler=1,queue=1,port=60000] master.ServerManager: Registering server=node2,16020,1444455305886
2015-10-10 05:42:23,081 INFO [node1:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 3, slept for 4617 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2015-10-10 05:42:24,586 INFO [node1:60000.activeMasterManager] master.ServerManager: Finished waiting for region servers count to settle; checked in 3, slept for 6122 ms, expecting minimum of 1, maximum of 2147483647, master is running
2015-10-10 05:42:24,610 INFO [node1:60000.activeMasterManager] master.MasterFileSystem: Log folder hdfs://master:9000/hbase/WALs/master,16020,1444455306763 belongs to an existing region server
2015-10-10 05:42:24,619 INFO [node1:60000.activeMasterManager] master.MasterFileSystem: Log folder hdfs://master:9000/hbase/WALs/node1,16020,1444455306545 belongs to an existing region server
2015-10-10 05:42:24,625 INFO [node1:60000.activeMasterManager] master.MasterFileSystem: Log folder hdfs://master:9000/hbase/WALs/node2,16020,1444455305886 belongs to an existing region server
2015-10-10 05:42:24,757 INFO [node1:60000.activeMasterManager] master.RegionStates: Transition {1588230740 state=OFFLINE, ts=1444455744651, server=null} to {1588230740 state=OPEN, ts=1444455744756, server=node2,16020,1444455305886}
2015-10-10 05:42:24,757 INFO [node1:60000.activeMasterManager] master.ServerManager: AssignmentManager hasn't finished failover cleanup; waiting
2015-10-10 05:42:24,760 INFO [node1:60000.activeMasterManager] master.HMaster: hbase:meta with replicaId 0 assigned=0, rit=false, location=node2,16020,1444455305886
2015-10-10 05:42:24,895 INFO [node1:60000.activeMasterManager] hbase.MetaMigrationConvertingToPB: META already up-to date with PB serialization
2015-10-10 05:42:24,985 INFO [node1:60000.activeMasterManager] master.AssignmentManager: Found regions out on cluster or in RIT; presuming failover
2015-10-10 05:42:25,000 INFO [node1:60000.activeMasterManager] master.AssignmentManager: Joined the cluster in 104ms, failover=true
2015-10-10 05:42:25,216 INFO [node1:60000.activeMasterManager] master.HMaster: Master has completed initialization
2015-10-10 05:42:25,234 INFO [node1:60000.activeMasterManager] quotas.MasterQuotaManager: Quota support disabled
As we can see, the Backup Master on node1 has taken over and become the active HMaster.
Now restart the HMaster on the master node:
[hbase@master bin]$ ./hbase-daemon.sh start master
starting master, logging to /usr/local/hbase//logs/hbase-hbase-master-master.out
[hbase@master bin]$ jps
25188 NameNode
32351 Jps
3319 QuorumPeerMain
32265 HMaster
25595 ResourceManager
25711 NodeManager
25303 DataNode
31220 HRegionServer
Checking the log on the master node, we find that it has become a backup master:
[hbase@master logs]$ tail -f hbase-hbase-master-master.log
2015-10-10 05:53:15,329 INFO [main] mortbay.log: Started SelectChannelConnector@0.0.0.0:60010
2015-10-10 05:53:15,333 INFO [main] master.HMaster: hbase.rootdir=hdfs://master:9000/hbase, hbase.cluster.distributed=true
2015-10-10 05:53:15,348 INFO [main] master.HMaster: Adding backup master ZNode /hbase/backup-masters/master,60000,1444456393819
2015-10-10 05:53:15,488 INFO [master:60000.activeMasterManager] master.ActiveMasterManager: Another master is the active master, node1,60000,1444455307700; waiting to become the next active master
2015-10-10 05:53:15,522 INFO [master/master/10.0.52.144:60000] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x323b7deb connecting to ZooKeeper ensemble=master:2181,node1:2181,node2:2181
2015-10-10 05:53:15,522 INFO [master/master/10.0.52.144:60000] zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181,node1:2181,node2:2181 sessionTimeout=90000 watcher=hconnection-0x323b7deb0x0, quorum=master:2181,node1:2181,node2:2181, baseZNode=/hbase
2015-10-10 05:53:15,524 INFO [master/master/10.0.52.144:60000-SendThread(master:2181)] zookeeper.ClientCnxn: Opening socket connection to server master/10.0.52.144:2181. Will not attempt to authenticate using SASL (unknown error)
2015-10-10 05:53:15,525 INFO [master/master/10.0.52.144:60000-SendThread(master:2181)] zookeeper.ClientCnxn: Socket connection established to master/10.0.52.144:2181, initiating session
2015-10-10 05:53:15,536 INFO [master/master/10.0.52.144:60000-SendThread(master:2181)] zookeeper.ClientCnxn: Session establishment complete on server master/10.0.52.144:2181, sessionid = 0x150463058ac001c, negotiated timeout = 40000
2015-10-10 05:53:15,567 INFO [master/master/10.0.52.144:60000] regionserver.HRegionServer: ClusterId : c309a039-eb35-400c-bb13-0b6ed939cc5e
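The roles are fully symmetric, so the cluster now survives the loss of either node's HMaster. To deliberately hand the active role back to master, stop and restart the HMaster on node1 with the same daemon script (a sketch; run on node1):

[hbase@node1 bin]$ ./hbase-daemon.sh stop master    # node1 steps down; the backup on master takes over
[hbase@node1 bin]$ ./hbase-daemon.sh start master   # node1 rejoins as a backup master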