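Symptom: most of the cluster was failing and the services could not be started through Ambari. Troubleshoot one service at a time, in this order: HDFS -> MapReduce -> YARN -> Hive -> others
Under HDFS, neither the NameNode nor the DataNodes would start
The NameNode reported the following error: 【namenode.NameNode: Failed to start namenode. java.io.IOException: Gap in tra】
Solution: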
step1: run /usr/hdp/current/hadoop-hdfs-namenode/bin/hdfs namenode in the foreground so that the error is printed to the console
step2: format the NameNode: /usr/hdp/current/hadoop-hdfs-namenode/bin/hdfs namenode -format (this discards the existing NameNode metadata and generates a new clusterID)
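step3: compare clusterID: check current/VERSION under the master NameNode against current/VERSION under each DataNode (multiple machines). After a format the DataNodes still carry the old ID and will not register, so manually edit each DataNode's
clusterID to match the NameNode's, e.g. [CID-e341356d-7657-48eb-b22e-3ab1f6771cd1]
/mnt/hadoop/hdfs/namenode/current/VERSION
/mnt/hadoop/hdfs/data/current/VERSION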
step4: manually restart the NameNode and the DataNodes from Ambari
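The clusterID comparison in step3 can be scripted instead of edited by hand. The following is only a minimal sketch: the helper names `get_cluster_id`, `set_cluster_id` and `sync_cluster_id` are hypothetical, and it assumes the VERSION files store the ID on a line of the form `clusterID=CID-...` and that the NameNode's VERSION file (or a copy of it) is readable on the DataNode host.

```shell
# Sketch: align a DataNode's clusterID with the NameNode's.
# Run on each DataNode host while the DataNode is stopped.

# Print the clusterID value stored in a VERSION file.
get_cluster_id() {
    sed -n 's/^clusterID=//p' "$1"
}

# Rewrite the clusterID line of a VERSION file, keeping a .bak copy.
# Writing back into the existing file keeps its owner and permissions.
set_cluster_id() {
    cp -p "$1" "$1.bak" || return 1
    sed "s/^clusterID=.*/clusterID=$2/" "$1.bak" > "$1"
}

# Compare two VERSION files and update the DataNode one only if they differ.
#   $1 = NameNode VERSION file (or a copy), $2 = DataNode VERSION file
sync_cluster_id() {
    nn_id="$(get_cluster_id "$1")"
    dn_id="$(get_cluster_id "$2")"
    if [ -z "$nn_id" ]; then
        echo "no clusterID found in $1" >&2
        return 1
    fi
    if [ "$nn_id" = "$dn_id" ]; then
        echo "clusterID already matches: $nn_id"
    else
        set_cluster_id "$2" "$nn_id" || return 1
        echo "updated $2: $dn_id -> $nn_id"
    fi
}

# Example call with the paths from these notes (not executed here):
#   sync_cluster_id /tmp/namenode-VERSION /mnt/hadoop/hdfs/data/current/VERSION
```

Here /tmp/namenode-VERSION stands for a copy of /mnt/hadoop/hdfs/namenode/current/VERSION fetched from the NameNode host; that location is an assumption made for the example.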
----------------divider---------
Common commands and manual restarts
【Set the Hive execution engine】
set hive.execution.engine=tez;
【Hive debug mode】
hive --hiveconf hive.root.logger=DEBUG,console
【Kill an application on YARN】
yarn application -kill application_1478856791630_0002
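When the application ID is not known in advance, it can be looked up before killing. A minimal sketch, assuming the `yarn` client is on the PATH; `extract_app_ids` and `running_app_ids` are hypothetical helper names, not part of YARN:

```shell
# Print application IDs that appear as the first field of each input line.
# Header and summary lines are skipped because they do not match the pattern.
extract_app_ids() {
    awk '$1 ~ /^application_[0-9]+_[0-9]+$/ { print $1 }'
}

# List the IDs of applications currently in the RUNNING state.
running_app_ids() {
    yarn application -list -appStates RUNNING 2>/dev/null | extract_app_ids
}

# Usage (not executed here): review the list, then kill one specific ID.
#   running_app_ids
#   yarn application -kill application_1478856791630_0002
```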
【ResourceManager manual stop/start】
/usr/hdp/current/hadoop-yarn-resourcemanager/sbin/yarn-daemon.sh stop resourcemanager
/usr/hdp/current/hadoop-yarn-resourcemanager/sbin/yarn-daemon.sh start resourcemanager
【NodeManager manual stop/start】
/usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh stop nodemanager
/usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh start nodemanager
【MapReduce JobHistory Server restart】
/usr/hdp/current/hadoop-mapreduce-historyserver/sbin/mr-jobhistory-daemon.sh stop historyserver
/usr/hdp/current/hadoop-mapreduce-historyserver/sbin/mr-jobhistory-daemon.sh start historyserver
【YARN HA: switching active/standby state】
yarn rmadmin -getServiceState rm1
yarn rmadmin -transitionToStandby rm1 --forcemanual
yarn rmadmin -transitionToActive rm2 --forcemanual
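Before forcing a transition it helps to confirm which ResourceManager is currently active. A minimal sketch built on the `-getServiceState` call above; `find_active_rm` is a hypothetical helper, and the IDs rm1 and rm2 are the ones used in these notes:

```shell
# Print the ID of the first ResourceManager that reports "active".
# Returns non-zero if none of the given IDs is active.
find_active_rm() {
    for rm_id in "$@"; do
        state="$(yarn rmadmin -getServiceState "$rm_id" 2>/dev/null)"
        if [ "$state" = "active" ]; then
            echo "$rm_id"
            return 0
        fi
    done
    return 1
}

# Usage (not executed here):
#   find_active_rm rm1 rm2 || echo "no active ResourceManager"
```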
【ZooKeeper manual stop/start】
/usr/hdp/current/zookeeper-server/bin/zkServer.sh stop
/usr/hdp/current/zookeeper-server/bin/zkServer.sh start
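After a restart, the same script's `status` subcommand reports the server's role. A small sketch that extracts just that role; `zk_mode` and the `ZK_SERVER_SH` variable are hypothetical names introduced for this example, and the parsing assumes the usual `Mode: ...` line in the status output:

```shell
# Path taken from the commands above; override ZK_SERVER_SH if it differs.
ZK_SERVER_SH="${ZK_SERVER_SH:-/usr/hdp/current/zookeeper-server/bin/zkServer.sh}"

# Print the server's role as reported on the "Mode:" line
# (leader, follower or standalone); prints nothing if no such line appears,
# for example when the server is not running.
zk_mode() {
    "$ZK_SERVER_SH" status 2>&1 | sed -n 's/^Mode: //p'
}

# Usage (not executed here):
#   zk_mode
```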
【NameNode manual start/stop (runs in the foreground; Ctrl+C stops it)】
/usr/hdp/current/hadoop-hdfs-namenode/bin/hdfs namenode
【DataNode manual start/stop (runs in the foreground; Ctrl+C stops it)】
/usr/hdp/current/hadoop-hdfs-datanode/bin/hdfs datanode