ResourceManager Restart

https://hadoop.apache.org/docs/r2.5.2/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html

Overview

ResourceManager is the central authority that manages resources and schedules applications running atop YARN. Hence, it is potentially a single point of failure in an Apache YARN cluster.

This document gives an overview of ResourceManager Restart, a feature that enhances ResourceManager to keep functioning across restarts and makes ResourceManager down-time invisible to end-users.

The ResourceManager Restart feature is divided into two phases:

ResourceManager Restart Phase 1: Enhance RM to persist application/attempt state and other credentials information in a pluggable state-store. RM will reload this information from the state-store upon restart and re-kick the previously running applications. Users are not required to re-submit the applications.

ResourceManager Restart Phase 2: Focus on re-constructing the running state of ResourceManager by reading back the container statuses from NodeManagers and the container requests from ApplicationMasters upon restart. The key difference from Phase 1 is that previously running applications will not be killed after RM restarts, so applications won't lose their work because of an RM outage.

As of the Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented; it is described below.

Feature

The overall concept is that RM persists the application metadata (i.e. ApplicationSubmissionContext) in a pluggable state-store when a client submits an application, and also saves the final status of the application, such as the completion state (failed, killed, finished) and diagnostics, when the application completes. In addition, RM saves credentials such as security keys and tokens so that it can work in a secure environment. Whenever RM shuts down, as long as the required information (i.e. application metadata, plus the accompanying credentials if running in a secure environment) is available in the state-store, RM can pick up the application metadata from the state-store when it restarts and re-submit the application. RM won't re-submit applications that had already completed (i.e. failed, killed, finished) before RM went down.

During RM down-time, NodeManagers and clients keep polling RM until it comes back up. When RM becomes alive, it sends a re-sync command, via heartbeats, to all the NodeManagers and ApplicationMasters it was talking to. Today, NodeManagers and ApplicationMasters handle this command as follows: each NM kills all of its managed containers and re-registers with RM, so from RM's perspective a re-registered NodeManager looks like a newly joining NM. AMs (e.g. the MapReduce AM) are currently expected to shut down when they receive the re-sync command.

After RM restarts, loads all the application metadata and credentials from the state-store, and populates them into memory, it creates a new attempt (i.e. ApplicationMaster) for each application that had not yet completed and re-kicks that application as usual. As described before, the work of previously running applications is lost in this manner, since they are essentially killed by RM via the re-sync command on restart.

Configurations

This section describes the configuration needed to enable the RM Restart feature.

  • Enable ResourceManager Restart functionality.

    To enable RM Restart functionality, set the following property in conf/yarn-site.xml to true:

    Property: yarn.resourcemanager.recovery.enabled
    Value: true
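
    As a minimal sketch, the setting above could look like this in conf/yarn-site.xml. The property name and value come from the text; the enclosing configuration element is shown only to keep the fragment well-formed, and the property would normally be merged into the existing file rather than replace it.

    ```xml
    <configuration>
      <!-- Turn on RM recovery so that application state is reloaded after a restart -->
      <property>
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
      </property>
    </configuration>
    ```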
  • Configure the state-store that is used to persist the RM state.

    Property: yarn.resourcemanager.store.class
    Description: The class name of the state-store to be used for saving application/attempt state and the credentials. The available state-store implementations are org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore, a ZooKeeper-based implementation, and org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore, an implementation based on a Hadoop FileSystem such as HDFS. The default value is org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.
    • Configurations when using the Hadoop FileSystem based state-store implementation.

      Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.

      Property: yarn.resourcemanager.fs.state-store.uri
      Description: URI pointing to the location of the FileSystem path where RM state will be stored (e.g. hdfs://localhost:9000/rmstore). Default value is ${hadoop.tmp.dir}/yarn/system/rmstore. If the FileSystem name is not provided, fs.default.name specified in conf/core-site.xml will be used.

      Configure the retry policy that the state-store client uses to connect to the Hadoop FileSystem.

      Property: yarn.resourcemanager.fs.state-store.retry-policy-spec
      Description: Hadoop FileSystem client retry policy specification. Hadoop FileSystem client retry is always enabled. The policy is specified as pairs of sleep-time and number-of-retries, i.e. (t0, n0), (t1, n1), ...: the first n0 retries sleep t0 milliseconds on average, the following n1 retries sleep t1 milliseconds on average, and so on. Default value is (2000, 500).
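
      The FileSystem-based settings above can be sketched together as follows. The URI is the example value from the description, not a recommendation, and the retry policy repeats the stated default. Writing the pair without parentheses is an assumption about the accepted value syntax, so it is worth checking against the yarn-default.xml shipped with your release.

      ```xml
      <configuration>
        <!-- Use the Hadoop FileSystem based state-store (also the stated default) -->
        <property>
          <name>yarn.resourcemanager.store.class</name>
          <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
        </property>
        <!-- Example location taken from the description above -->
        <property>
          <name>yarn.resourcemanager.fs.state-store.uri</name>
          <value>hdfs://localhost:9000/rmstore</value>
        </property>
        <!-- Assumed syntax: sleep 2000 ms on average for each of up to 500 retries -->
        <property>
          <name>yarn.resourcemanager.fs.state-store.retry-policy-spec</name>
          <value>2000, 500</value>
        </property>
      </configuration>
      ```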
    • Configurations when using the ZooKeeper based state-store implementation.

      Configure the ZooKeeper server address and the root path where the RM state is stored.

      Property: yarn.resourcemanager.zk-address
      Description: Comma-separated list of Host:Port pairs. Each corresponds to a ZooKeeper server (e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002") to be used by the RM for storing RM state.

      Property: yarn.resourcemanager.zk-state-store.parent-path
      Description: The full path of the root znode where RM state will be stored. Default value is /rmstore.

      Configure the retry policy that the state-store client uses to connect to the ZooKeeper server.

      Property: yarn.resourcemanager.zk-num-retries
      Description: Number of times RM tries to connect to the ZooKeeper server if the connection is lost. Default value is 500.

      Property: yarn.resourcemanager.zk-retry-interval-ms
      Description: The interval in milliseconds between retries when connecting to a ZooKeeper server. Default value is 2000 (2 seconds).

      Property: yarn.resourcemanager.zk-timeout-ms
      Description: ZooKeeper session timeout in milliseconds. This configuration is used by the ZooKeeper server to determine when the session expires. Session expiration happens when the server does not hear from the client (i.e. no heartbeat) within the session timeout period specified by this configuration. Default value is 10000 (10 seconds).

      Configure the ACLs to be used for setting permissions on ZooKeeper znodes.

      Property: yarn.resourcemanager.zk-acl
      Description: ACLs to be used for setting permissions on ZooKeeper znodes. Default value is world:anyone:rwcda.
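
      A ZooKeeper-based setup might be sketched like this. The server list is the example from the description and would be replaced with a real ensemble. The remaining values restate the defaults listed above, so in practice they could be omitted; they are spelled out only to show where each property goes.

      ```xml
      <configuration>
        <!-- Switch the state-store to the ZooKeeper based implementation -->
        <property>
          <name>yarn.resourcemanager.store.class</name>
          <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
        </property>
        <!-- Example ensemble from the description; substitute your own hosts -->
        <property>
          <name>yarn.resourcemanager.zk-address</name>
          <value>127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002</value>
        </property>
        <!-- Root znode under which RM state is stored (stated default) -->
        <property>
          <name>yarn.resourcemanager.zk-state-store.parent-path</name>
          <value>/rmstore</value>
        </property>
        <!-- Retry policy: 500 attempts, 2000 ms apart (stated defaults) -->
        <property>
          <name>yarn.resourcemanager.zk-num-retries</name>
          <value>500</value>
        </property>
        <property>
          <name>yarn.resourcemanager.zk-retry-interval-ms</name>
          <value>2000</value>
        </property>
        <!-- Session timeout of 10 seconds, expressed in milliseconds -->
        <property>
          <name>yarn.resourcemanager.zk-timeout-ms</name>
          <value>10000</value>
        </property>
        <!-- Stated default ACL; it leaves the znodes open to any client -->
        <property>
          <name>yarn.resourcemanager.zk-acl</name>
          <value>world:anyone:rwcda</value>
        </property>
      </configuration>
      ```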
  • Configure the max number of application attempt retries.

    Property: yarn.resourcemanager.am.max-attempts
    Description: The maximum number of application attempts. It's a global setting for all application masters. Each application master can specify its individual maximum number of application attempts via the API, but the individual number cannot be more than the global upper bound. If it is, the RM will override it. The default number is 2, to allow at least one retry for the AM.

    This configuration's impact in fact extends beyond RM restart. It controls the maximum number of attempts an application can have. In RM Restart Phase 1, this configuration is needed because, as described earlier, each time RM restarts it kills the previously running attempt (i.e. ApplicationMaster) and creates a new attempt. Therefore, each occurrence of RM restart increases the attempt count by 1. In RM Restart Phase 2, this configuration is not needed, since the previously running ApplicationMaster will not be killed and the AM will simply re-sync with RM after RM restarts.
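
    As an illustration of the point above, a cluster that expects occasional RM restarts under Phase 1 might raise the global limit so that a restart does not use up an application's only retry. The value 4 below is a hypothetical choice for this sketch, not a recommendation from the documentation.

    ```xml
    <configuration>
      <!-- Hypothetical value: leaves room for a few RM restarts on top of ordinary AM failures -->
      <property>
        <name>yarn.resourcemanager.am.max-attempts</name>
        <value>4</value>
      </property>
    </configuration>
    ```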

Time: 2024-10-05 04:45:05
