YARN: Apache Hadoop's Next-Generation MapReduce

My earlier Hadoop project was built on version 0.20.2; after checking some references, I realized that what I had been learning was the original map/reduce model.

The official release status:

1.1.X - current stable version, 1.1 release

1.2.X - current beta version, 1.2 release

2.X.X - current alpha version

0.23.X - similar to 2.X.X but missing NN HA.

0.22.X - does not include security

0.20.203.X - old legacy stable version

0.20.X - old legacy version

Notes:

The 0.20/0.22/1.1/CDH3 series use the original map/reduce model and form the stable line.

The 0.23/2.X/CDH4 series use the YARN model and form the new line.

I opened the Hadoop official site once more, planning to translate the section that introduces YARN, to learn the material and practice my English at the same time.

YARN: Apache Hadoop's Next-Generation MapReduce

As of hadoop-0.23, MapReduce has undergone a complete overhaul, and the result is what we now call MapReduce 2.0 (MRv2), or YARN.

The fundamental idea of MRv2 is to split the two major responsibilities of the JobTracker, resource management on one hand and job scheduling/monitoring on the other, into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical Map-Reduce sense or a DAG of jobs.

The ResourceManager, together with the per-node slave, the NodeManager (NM), forms the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.

The per-application ApplicationMaster is, in effect, a framework-specific library, tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

The ResourceManager has two main components: the Scheduler and the ApplicationsManager.

The Scheduler is responsible for allocating resources to the various running applications, subject to the familiar constraints of capacities, queues, and so on. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status, and it offers no guarantees about restarting tasks that fail due to application errors or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so using the abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk, and network. In the first version, only memory is supported.
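
To make the Container idea concrete, here is a minimal sketch of how an application might phrase such a resource request. It uses the Hadoop 2.x client classes (AMRMClient, Resource, Priority), which are an assumption here since the 0.23-era API differed slightly; remember that early YARN honors only the memory dimension of the request.

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class ContainerRequestSketch {
    public static AMRMClient.ContainerRequest buildRequest() {
        // A Container is described by its capability: memory in MB plus,
        // in later releases, virtual cores. The first version enforces
        // only the memory figure.
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);
        // Passing null for the node and rack lists means "run anywhere";
        // the Scheduler simply matches this request against free capacity.
        return new AMRMClient.ContainerRequest(capability, null, null, priority);
    }
}
```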

The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications, and so on. The current Map-Reduce schedulers, such as the CapacityScheduler and the FairScheduler, are typical examples of such plug-ins.
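
The plug-in is chosen through configuration; the yarn-site.xml property below is the Hadoop 2.x setting, and the FairScheduler class could be substituted in the value.

```xml
<!-- yarn-site.xml: select the Scheduler policy plug-in -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```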

The CapacityScheduler supports hierarchical queues to allow for more predictable sharing of cluster resources.
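
As a sketch of such a hierarchy, the capacity-scheduler.xml fragment below defines two hypothetical child queues, prod and dev, splitting the cluster 70/30 under the root queue; the queue names and percentages are illustrative only.

```xml
<!-- capacity-scheduler.xml: a two-queue hierarchy under root -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>
```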

The ApplicationsManager is responsible for accepting job submissions, negotiating the first container in which the application-specific ApplicationMaster executes, and providing the service for restarting the ApplicationMaster container when it fails for some reason.
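
A rough client-side sketch of that submission path, written against the Hadoop 2.x YarnClient API (again an assumption relative to 0.23): the ApplicationsManager accepts the submission and then negotiates the first container, in which the ApplicationMaster itself runs.

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app"); // hypothetical application name
        // Resource for the first container, i.e. the ApplicationMaster itself.
        ctx.setResource(Resource.newInstance(512, 1));
        // ctx.setAMContainerSpec(...) would describe how to launch the AM
        // process; omitted to keep the sketch short.
        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted application " + appId);
        yarnClient.stop();
    }
}
```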

The NodeManager is the per-machine framework agent. It is responsible for containers, monitors their resource usage (CPU, memory, disk, network), and reports the same to the ResourceManager/Scheduler.
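
How much a NodeManager offers to containers in the first place is configured per machine; the property below is the Hadoop 2.x setting, shown with an assumed 8 GB worker.

```xml
<!-- yarn-site.xml on each worker: memory the NodeManager advertises -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
```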

The per-application ApplicationMaster is responsible for negotiating appropriate resource containers from the Scheduler, tracking their status, and monitoring their progress.
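
A condensed sketch of that ApplicationMaster loop, once more assuming the Hadoop 2.x AMRMClient: register with the ResourceManager, queue a request, then poll allocate() heartbeats. Tracking the granted containers is the AM's job, not the Scheduler's.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmLoopSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<AMRMClient.ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();
        rm.registerApplicationMaster("", 0, ""); // host/port/tracking URL unused here

        // Ask the Scheduler for one 1 GB container, anywhere in the cluster.
        rm.addContainerRequest(new AMRMClient.ContainerRequest(
                Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

        List<Container> granted = null;
        while (granted == null || granted.isEmpty()) {
            // allocate() doubles as the heartbeat; the float argument is the
            // progress figure the AM reports back to the ResourceManager.
            AllocateResponse response = rm.allocate(0.0f);
            granted = response.getAllocatedContainers();
            Thread.sleep(1000);
        }
        // From here the AM would launch work in each granted Container (via
        // NMClient) and keep monitoring progress itself.
        System.out.println("Granted " + granted.size() + " container(s)");
    }
}
```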

The next-generation MapReduce, MRv2, maintains API compatibility with the previous stable release, hadoop-1.x. This means that all existing Map-Reduce jobs should run unchanged on top of MRv2 with just a recompile.

The original text from the official site follows; if anything in my translation is off, please let me know. Thank you!

Apache Hadoop NextGen MapReduce (YARN)

MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have, what we call, MapReduce 2.0 (MRv2) or YARN.

The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to
have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

The ResourceManager and per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources
among all the applications in the system.

The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s)
to execute and monitor the tasks.

The ResourceManager has two main components: Scheduler and ApplicationsManager.

The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler
in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees about restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on
the resource requirements of the applications; it does so based on the abstract notion of a resource Container which incorporates elements such as memory, cpu, disk, network etc. In the first version, only memory is supported.

The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. The current Map-Reduce schedulers
such as the CapacityScheduler and the FairScheduler would be some examples of the plug-in.

The CapacityScheduler supports hierarchical queues to allow for more predictable sharing of cluster resources

The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service
for restarting the ApplicationMaster container on failure.

The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.

MRv2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that all Map-Reduce jobs should still run unchanged on top of MRv2 with just
a recompile.
