ZooKeeper fails to start because it cannot load its database

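Our cluster had been in an abnormal state recently, and it turned out that ZooKeeper (zk) failed every time we tried to start it. The logs were painful to read: the process died after 5 seconds. If this were a team match, I would have been reported for going AFK within minutes. At first I really could not see why starting zk was so hard, or why it failed so consistently at around the 5-second mark. The timing does tell us one thing: the process never finished starting up, and within that window it was at most still in its initialization phase.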

[[email protected] deployer]# systemctl status zookeeper
● zookeeper.service - ZooKeeper Service
   Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-03-23 11:29:21 CST; 4s ago
     Docs: http://zookeeper.apache.org
  Process: 31011 ExecStop=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh stop /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
  Process: 31129 ExecStart=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh start /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
 Main PID: 31138 (java)
   CGroup: /system.slice/zookeeper.service
           └─31138 java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp /opt/zookeeper/zookeeper-prod/bin/../build/classes:/opt/zookeeper/zookeeper-prod/bin/../build/lib/*.jar:/opt/zookeeper/zoo...

Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Starting ZooKeeper Service...
Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: ZooKeeper JMX enabled by default
Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg
Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Starting zookeeper ... STARTED
Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service.
[[email protected] deployer]# systemctl status zookeeper
● zookeeper.service - ZooKeeper Service
   Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-03-23 11:29:21 CST; 5s ago
     Docs: http://zookeeper.apache.org
  Process: 31011 ExecStop=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh stop /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
  Process: 31129 ExecStart=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh start /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
 Main PID: 31138 (java)
   CGroup: /system.slice/zookeeper.service
           └─31138 java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp /opt/zookeeper/zookeeper-prod/bin/../build/classes:/opt/zookeeper/zookeeper-prod/bin/../build/lib/*.jar:/opt/zookeeper/zoo...

Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Starting ZooKeeper Service...
Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: ZooKeeper JMX enabled by default
Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg
Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Starting zookeeper ... STARTED
Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service.
[[email protected] deployer]# systemctl status zookeeper
● zookeeper.service - ZooKeeper Service
   Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2020-03-23 11:29:26 CST; 706ms ago
     Docs: http://zookeeper.apache.org
  Process: 31225 ExecStop=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh stop /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
  Process: 31129 ExecStart=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh start /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
 Main PID: 31138 (code=exited, status=1/FAILURE)

Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg
Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Starting zookeeper ... STARTED
Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service.
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 systemd[1]: zookeeper.service: main process exited, code=exited, status=1/FAILURE
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31225]: ZooKeeper JMX enabled by default
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31225]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31225]: Stopping zookeeper ... /opt/zookeeper/zookeeper-prod/bin/zkServer.sh: line 182: kill: (31138) - No such process
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31225]: STOPPED
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Unit zookeeper.service entered failed state.
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 systemd[1]: zookeeper.service failed.
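
The roughly five-second lifetime can be read straight off the journal lines above. Here is a minimal Python sketch of that arithmetic, assuming the syslog-style timestamps shown in the output. The journal does not print the year, so 2020 is supplied by hand, and the function names are my own rather than part of any tool.

```python
from datetime import datetime

def parse_journal_time(line, year=2020):
    """Parse the leading 'Mar 23 11:29:21' timestamp of a journal line.

    The year is an assumption passed in by the caller, since the
    journal output above does not include it.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def seconds_alive(lines):
    """Seconds between 'Started ZooKeeper Service.' and the main process exiting."""
    started = exited = None
    for line in lines:
        if "Started ZooKeeper Service." in line:
            started = parse_journal_time(line)
        elif "main process exited" in line:
            exited = parse_journal_time(line)
    if started is None or exited is None:
        return None
    return (exited - started).total_seconds()

journal = [
    "Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service.",
    "Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 systemd[1]: zookeeper.service: main process exited, code=exited, status=1/FAILURE",
]
print(seconds_alive(journal))  # 5.0
```

A process that exits this quickly and this consistently is usually failing during startup rather than under load, which is why the next step is to look for a startup log.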

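The log mentions /opt/zookeeper/zookeeper-prod/conf/zoo.cfg, so I went to that installation directory to see whether there was anything worth digging into. conf is presumably the configuration folder, so by rights there should also be a log or output folder holding the runtime logs, the error log in particular, and that gave me a direction for troubleshooting.
It turned out that /opt/zookeeper/zookeeper-prod/bin contains a file named zookeeper.out, which records the details of each run. A quick cat of that file made the problem obvious.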

2020-03-23 11:36:58,799 [myid:] - INFO  [main:[email protected]] - Reading configuration from: /opt/zookeeper/zookeeper-prod/bin/../conf/zoo.cfg
2020-03-23 11:36:58,814 [myid:] - INFO  [main:[email protected]] - Resolved hostname: 10.153.115.26 to address: /10.153.115.26
2020-03-23 11:36:58,815 [myid:] - INFO  [main:[email protected]] - Resolved hostname: 10.153.115.25 to address: /10.153.115.25
2020-03-23 11:36:58,816 [myid:] - INFO  [main:[email protected]] - Resolved hostname: 10.153.115.24 to address: /10.153.115.24
2020-03-23 11:36:58,816 [myid:] - INFO  [main:[email protected]] - Resolved hostname: 10.153.115.29 to address: /10.153.115.29
2020-03-23 11:36:58,816 [myid:] - INFO  [main:[email protected]] - Resolved hostname: 10.153.115.28 to address: /10.153.115.28
2020-03-23 11:36:58,816 [myid:] - INFO  [main:[email protected]] - Resolved hostname: 10.153.115.27 to address: /10.153.115.27
2020-03-23 11:36:58,816 [myid:] - WARN  [main:[email protected]] - Non-optimial configuration, consider an odd number of servers.
2020-03-23 11:36:58,816 [myid:] - INFO  [main:[email protected]] - Defaulting to majority quorums
2020-03-23 11:36:58,821 [myid:5] - INFO  [main:[email protected]] - autopurge.snapRetainCount set to 3
2020-03-23 11:36:58,821 [myid:5] - INFO  [main:[email protected]] - autopurge.purgeInterval set to 24
2020-03-23 11:36:58,822 [myid:5] - INFO  [PurgeTask:[email protected]] - Purge task started.
2020-03-23 11:36:58,837 [myid:5] - INFO  [PurgeTask:[email protected]] - Purge task completed.
2020-03-23 11:36:58,839 [myid:5] - INFO  [main:[email protected]] - Starting quorum peer
2020-03-23 11:36:58,849 [myid:5] - INFO  [main:[email protected]] - Using org.apache.zookeeper.server.NIOServerCnxnFactory as server connection factory
2020-03-23 11:36:58,856 [myid:5] - INFO  [main:[email protected]] - binding to port 0.0.0.0/0.0.0.0:2181
2020-03-23 11:36:58,861 [myid:5] - INFO  [main:[email protected]] - tickTime set to 2000
2020-03-23 11:36:58,861 [myid:5] - INFO  [main:[email protected]] - initLimit set to 10
2020-03-23 11:36:58,861 [myid:5] - INFO  [main:[email protected]] - minSessionTimeout set to -1
2020-03-23 11:36:58,862 [myid:5] - INFO  [main:[email protected]] - maxSessionTimeout set to -1
2020-03-23 11:36:58,871 [myid:5] - INFO  [main:[email protected]] - QuorumPeer communication is not secured!
2020-03-23 11:36:58,871 [myid:5] - INFO  [main:[email protected]] - quorum.cnxn.threads.size set to 20
2020-03-23 11:36:58,872 [myid:5] - INFO  [main:[email protected]] - Reading snapshot /data/zookeeper/data/version-2/snapshot.b91d0000003c
2020-03-23 11:36:59,290 [myid:5] - ERROR [main:[email protected]] - Unable to load database on disk
java.io.IOException: The accepted epoch, ba86 is less than the current epoch, ba87
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:689)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:635)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
2020-03-23 11:36:59,292 [myid:5] - ERROR [main:[email protected]] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:693)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:635)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
Caused by: java.io.IOException: The accepted epoch, ba86 is less than the current epoch, ba87
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:689)
    ... 4 more
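
The exception message carries the two values that matter. Below is a small Python sketch for pulling them out of zookeeper.out, assuming the message format is exactly the one shown above and that both epochs are printed in hexadecimal (they contain the letters a and b, so they cannot be decimal). The function name is hypothetical.

```python
import re

# Pattern copied from the exception text in the log above.
EPOCH_ERROR = re.compile(
    r"The accepted epoch, ([0-9a-f]+) is less than the current epoch, ([0-9a-f]+)"
)

def find_epoch_mismatch(log_text):
    """Return (accepted_epoch, current_epoch) as ints, or None if the error is absent."""
    match = EPOCH_ERROR.search(log_text)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)

sample = "java.io.IOException: The accepted epoch, ba86 is less than the current epoch, ba87"
accepted, current = find_epoch_mismatch(sample)
print(hex(accepted), hex(current), current - accepted)  # 0xba86 0xba87 1
```

In this case the accepted epoch is exactly one behind the current epoch, which is consistent with on-disk state that was only partially updated.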

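The log shows ZooKeeper reading its snapshot and then failing immediately with "Unable to load database on disk". The exception text itself says the accepted epoch (ba86) is behind the current epoch (ba87), which suggests the on-disk state was left inconsistent. So I simply deleted the snapshot, snapshot.b91d0000003c, and let ZooKeeper regenerate its snapshot files on its own. Done.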

2020-03-23 11:36:58,872 [myid:5] - INFO  [main:[email protected]] - Reading snapshot /data/zookeeper/data/version-2/snapshot.b91d0000003c
2020-03-23 11:36:59,290 [myid:5] - ERROR [main:[email protected]] - Unable to load database on disk
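
The snapshot file name carries information too: the suffix after "snapshot." is a zxid written in hexadecimal, with the epoch in the high 32 bits and a per-epoch counter in the low 32 bits. Here is a short Python sketch that decodes it; the helper name is my own.

```python
def split_zxid(snapshot_name):
    """Split 'snapshot.<hex zxid>' into (epoch, counter).

    The zxid is a 64-bit number: the high 32 bits are the epoch,
    the low 32 bits count transactions within that epoch.
    """
    zxid = int(snapshot_name.rsplit(".", 1)[1], 16)
    return zxid >> 32, zxid & 0xFFFFFFFF

epoch, counter = split_zxid("snapshot.b91d0000003c")
print(hex(epoch), counter)  # 0xb91d 60
```

The snapshot being read was taken in epoch b91d, while the error refers to epochs ba86 and ba87.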

So satisfying.
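
Deleting the snapshot outright, as was done here, cannot be undone. A more cautious variant, which is not what the author did, is to move the file aside while the service is stopped so it can be put back if needed. The following is a rough Python sketch under that assumption; the function name and the backup directory are hypothetical.

```python
import shutil
import time
from pathlib import Path

def quarantine_snapshot(snapshot_path, backup_dir):
    """Move a suspect snapshot into backup_dir instead of deleting it.

    Returns the new location. A timestamp suffix keeps repeated
    runs from overwriting an earlier backup taken at another time.
    """
    src = Path(snapshot_path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{src.name}.{int(time.time())}.bak"
    shutil.move(str(src), str(dest))
    return dest

# Example call (the backup directory is an assumption, not from the original post):
# quarantine_snapshot(
#     "/data/zookeeper/data/version-2/snapshot.b91d0000003c",
#     "/data/zookeeper/quarantine",
# )
```

Keeping the backup outside the version-2 directory matters in this sketch, so that ZooKeeper does not pick the moved file up again on the next start.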

[[email protected] deployer]# systemctl status zookeeper
● zookeeper.service - ZooKeeper Service
   Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-03-23 12:12:08 CST; 5min ago
     Docs: http://zookeeper.apache.org
  Process: 25348 ExecStop=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh stop /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
  Process: 25658 ExecStart=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh start /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
 Main PID: 25667 (java)
   CGroup: /system.slice/zookeeper.service
           └─25667 java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp /opt/zookeeper/zookeeper-prod/bin/../build/classes:/opt/zookeeper/zookeeper-prod/bin/../build/lib/*.jar:/opt/zookeeper/zoo...

Mar 23 12:12:07 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Starting ZooKeeper Service...
Mar 23 12:12:07 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[25658]: ZooKeeper JMX enabled by default
Mar 23 12:12:07 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[25658]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg
Mar 23 12:12:08 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[25658]: Starting zookeeper ... STARTED
Mar 23 12:12:08 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service.

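Every time zk runs it keeps a snapshot file, which it uses to restore its state. This host's disk had filled up earlier, so zk could not write its data in time, and we then rebooted the machine; the snapshot write most likely failed at that moment. This is not a universal fix, though, and each case has to be judged on its own: in a cluster, once the snapshot file is deleted, restoring the database state later becomes difficult. Fortunately our zk ensemble has 6 nodes and the other five were healthy, so the restarted node could presumably resync from its peers, and this operation was acceptable here.
That's a wrap :)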
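
The startup log above reports "Defaulting to majority quorums" and warns that an odd number of servers is preferable. The arithmetic behind both points, and behind why one broken node out of six was survivable, can be sketched in a few lines of Python (the function names are my own):

```python
def majority(ensemble_size):
    """Smallest number of servers that forms a majority quorum."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many servers can be down while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)

for n in (5, 6, 7):
    print(n, majority(n), tolerated_failures(n))
# 5 3 2
# 6 4 2
# 7 4 3
```

With 6 nodes a quorum needs 4, so the five healthy nodes were more than enough. It also shows why the warning appears: a sixth node tolerates no more failures than five nodes do.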

Original article: https://blog.51cto.com/yerikyu/2481123

Time: 2024-10-06 13:29:52
