前言
当一个应用向YARN集群提交作业后,此作业的多个任务由于负载不均衡、资源分布不均等原因都会导致各个任务运行完成的时间不一致,甚至会出现一个任务明显慢于同一作业的其它任务的情况。如果对这种情况不加优化,最慢的任务最终会拖慢整个作业的整体执行进度。好在mapreduce框架提供了任务推断执行机制,当有必要时就启动一个备份任务。最终会采用备份任务和原任务中率先执行完的结果作为最终结果。
由于具体分析推断执行机制,篇幅很长,所以我会分成几篇内容陆续介绍。
推断执行测试
本文在我自己搭建的集群(集群搭建可以参阅《Linux下Hadoop2.6.0集群环境的搭建》一文)上,执行wordcount例子,来验证mapreduce框架的推断机制。我们输入以下命令:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount -D mapreduce.input.fileinputformat.split.maxsize=19 /wordcount/input/ /wordcount/output/result1
任务划分的信息如下:
我们看到map任务被划分为10:
这一次测试并没有发生推断执行的情况,我们可以再执行一次上面厄命令,最后看到的信息如下:
其中看到执行的map数量多了1个,成了11个,而且还出现了Killed map tasks=1的信息,这表示这次执行最终发生了推断执行。其中一个map任务增加了一个备份任务,当备份的map任务和原有的map任务中有一个率先完成了,那么会将另一个慢的map任务杀死。reduce任务也是类似,只不过Hadoop2.6.0的子项目hadoop-mapreduce-examples中自带的wordcount例子中,使用Job.setNumReduceTasks(int)这个API将reduce任务的数量控制为1个。在这里我们看到推断执行有时候会发生,而有时候却不会,这是为什么呢?
在《Hadoop2.6.0运行mapreduce之Uber模式验证》一文中我还简短提到,如果启用了Uber运行模式,推断执行会被取消。关于这些内部的实现原理需要我们从架构设计和源码角度进行剖析,因为我们还需要知道所以然。
mapreduce推断执行设计架构
在mapreduce中设计了Speculator接口作为推断执行的统一规范,DefaultSpeculator作为一种服务在实现了Speculator的同时继承了AbstractService,DefaultSpeculator是mapreduce的默认实现。DefaultSpeculator负责处理SpeculatorEvent事件,目前主要包括四种事件,分别是:
- JOB_CREATE:作业刚刚被创建时触发的事件,并处理一些初始化工作。
- ATTEMPT_START:一个任务实例TaskAttemptImpl启动时触发的事件,DefaultSpeculator将会使用内部的推断估算器(默认是LegacyTaskRuntimeEstimator)开启对此任务实例的监控。
- ATTEMPT_STATUS_UPDATE:当任务实例的状态更新时触发的事件,DefaultSpeculator将会更新推断估算器对任务的监控信息;更新正在运行中的任务(维护在runningTasks中);任务的统计信息(这些统计信息用于跟踪长时间未汇报心跳的任务,并积极主动的进行推断执行,而不是等待任务超时)
- TASK_CONTAINER_NEED_UPDATE:任务Container数量发生变化时触发的事件。
TaskRuntimeEstimator接口为推断执行提供了计算模型的规范,默认的实现类是LegacyTaskRuntimeEstimator,此外还有ExponentiallySmoothedTaskRuntimeEstimator。这里暂时不对其内容进行深入介绍,在后面会陆续展开。
Speculator的初始化和启动伴随着MRAppMaster的初始化与启动。
接下来我们以Speculator接口的默认实现DefaultSpeculator为例,逐步分析其初始化、启动、推断执行等内容的工作原理。
Speculator的初始化
Speculator是MRAppMaster的子组件、自服务,所以也需要初始化。有经验的Hadoop工程师,想必知道当mapreduce作业提交给ResourceManager后,由RM负责向NodeManger通信启动一个Container用于执行MRAppMaster。启动MRAppMaster实际也是通过调用其main方法,其中会调用MRAppMaster实例的serviceInit方法,其中与Speculator有关的代码实现见代码清单1。
代码清单1 MRAppMaster的serviceInit方法中创建Speculator的代码
if (conf.getBoolean(MRJobConfig.MAP_SPECULATIVE, false) || conf.getBoolean(MRJobConfig.REDUCE_SPECULATIVE, false)) { //optional service to speculate on task attempts‘ progress speculator = createSpeculator(conf, context); addIfService(speculator); } speculatorEventDispatcher = new SpeculatorEventDispatcher(conf); dispatcher.register(Speculator.EventType.class, speculatorEventDispatcher);
代码清单1所示代码的执行步骤如下:
- 当启用map任务推断(这里的MRJobConfig.MAP_SPECULATIVE实际由参数mapreduce.map.speculative控制,默认是true)或者启用reduce任务推断(这里的MRJobConfig.REDUCE_SPECULATIVE实际由参数mapreduce.reduce.speculative控制,默认是true)时调用createSpeculator方法创建推断服务。最后将Speculator添加为MRAppMaster的子服务。
- 向调度器dispatcher注册推断事件与推断事件的处理器SpeculatorEventDispatcher,以便触发了推断事件后交由SpeculatorEventDispatcher作进一步处理。
createSpeculator方法(见代码清单2)创建的推断服务的实现类默认是DefaultSpeculator,用户也可以通过参数yarn.app.mapreduce.am.job.speculator.class(即MRJobConfig.MR_AM_JOB_SPECULATOR)指定推断服务的实现类。
代码清单2 创建推断器
protected Speculator createSpeculator(Configuration conf, final AppContext context) { return callWithJobClassLoader(conf, new Action<Speculator>() { public Speculator call(Configuration conf) { Class<? extends Speculator> speculatorClass; try { speculatorClass // "yarn.mapreduce.job.speculator.class" = conf.getClass(MRJobConfig.MR_AM_JOB_SPECULATOR, DefaultSpeculator.class, Speculator.class); Constructor<? extends Speculator> speculatorConstructor = speculatorClass.getConstructor (Configuration.class, AppContext.class); Speculator result = speculatorConstructor.newInstance(conf, context); return result; } catch (InstantiationException ex) { LOG.error("Can‘t make a speculator -- check " + MRJobConfig.MR_AM_JOB_SPECULATOR, ex); throw new YarnRuntimeException(ex); } catch (IllegalAccessException ex) { LOG.error("Can‘t make a speculator -- check " + MRJobConfig.MR_AM_JOB_SPECULATOR, ex); throw new YarnRuntimeException(ex); } catch (InvocationTargetException ex) { LOG.error("Can‘t make a speculator -- check " + MRJobConfig.MR_AM_JOB_SPECULATOR, ex); throw new YarnRuntimeException(ex); } catch (NoSuchMethodException ex) { LOG.error("Can‘t make a speculator -- check " + MRJobConfig.MR_AM_JOB_SPECULATOR, ex); throw new YarnRuntimeException(ex); } } }); }
根据代码清单2,我们知道createSpeculator方法通过反射调用了DefaultSpeculator的构造器来实例化任务推断器。DefaultSpeculator的构造器如下:
public DefaultSpeculator(Configuration conf, AppContext context) { this(conf, context, context.getClock()); } public DefaultSpeculator(Configuration conf, AppContext context, Clock clock) { this(conf, context, getEstimator(conf, context), clock); }
上述第一个构造器调用了第二个构造器,而第二个构造器中首先调用了getEstimator方法(见代码清单3)用于获取推断估算器。
代码清单3 获取推断估算器
static private TaskRuntimeEstimator getEstimator (Configuration conf, AppContext context) { TaskRuntimeEstimator estimator; try { // "yarn.mapreduce.job.task.runtime.estimator.class" Class<? extends TaskRuntimeEstimator> estimatorClass = conf.getClass(MRJobConfig.MR_AM_TASK_ESTIMATOR, LegacyTaskRuntimeEstimator.class, TaskRuntimeEstimator.class); Constructor<? extends TaskRuntimeEstimator> estimatorConstructor = estimatorClass.getConstructor(); estimator = estimatorConstructor.newInstance(); estimator.contextualize(conf, context); } catch (InstantiationException ex) { LOG.error("Can‘t make a speculation runtime estimator", ex); throw new YarnRuntimeException(ex); } catch (IllegalAccessException ex) { LOG.error("Can‘t make a speculation runtime estimator", ex); throw new YarnRuntimeException(ex); } catch (InvocationTargetException ex) { LOG.error("Can‘t make a speculation runtime estimator", ex); throw new YarnRuntimeException(ex); } catch (NoSuchMethodException ex) { LOG.error("Can‘t make a speculation runtime estimator", ex); throw new YarnRuntimeException(ex); } return estimator; }
根据代码清单3可以看出推断估算器可以通过参数yarn.app.mapreduce.am.job.task.estimator.class(即MRJobConfig.MR_AM_TASK_ESTIMATOR)进行指定,如果没有指定,则默认使用LegacyTaskRuntimeEstimator。实例化LegacyTaskRuntimeEstimator后,还调用其父类StartEndTimesBase的contextualize方法(见代码清单4)进行上下文的初始化,实际就是将当前作业添加到map任务统计列表、reduce任务统计列表,并设置任务与其慢任务阈值(mapreduce.job.speculative.slowtaskthreshold)之间的映射关系。
代码清单4 StartEndTimesBase的初始化
@Override public void contextualize(Configuration conf, AppContext context) { this.context = context; Map<JobId, Job> allJobs = context.getAllJobs(); for (Map.Entry<JobId, Job> entry : allJobs.entrySet()) { final Job job = entry.getValue(); mapperStatistics.put(job, new DataStatistics()); reducerStatistics.put(job, new DataStatistics()); slowTaskRelativeTresholds.put (job, conf.getFloat(MRJobConfig.SPECULATIVE_SLOWTASK_THRESHOLD,1.0f)); } }
创建并初始化推断估算器后,会再次调用DefaultSpeculator的最后一个构造器,实现如下:
public DefaultSpeculator (Configuration conf, AppContext context, TaskRuntimeEstimator estimator, Clock clock) { super(DefaultSpeculator.class.getName()); this.conf = conf; this.context = context; this.estimator = estimator; this.clock = clock; this.eventHandler = context.getEventHandler(); }
至此,我们介绍完了DefaultSpeculator的初始化过程。
Speculator的启动
MRAppMaster在启动的时候,会调用其serviceStart方法,其中涉及启动Speculator的部分见代码清单5。
代码清单5 启动MRAppMaster时涉及Speculator的部分
if (job.isUber()) { speculatorEventDispatcher.disableSpeculation(); LOG.info("MRAppMaster uberizing job " + job.getID() + " in local container (\"uber-AM\") on node " + nmHost + ":" + nmPort + "."); } else { // send init to speculator only for non-uber jobs. // This won‘t yet start as dispatcher isn‘t started yet. dispatcher.getEventHandler().handle( new SpeculatorEvent(job.getID(), clock.getTime())); LOG.info("MRAppMaster launching normal, non-uberized, multi-container " + "job " + job.getID() + "."); }
分析代码清单5,其中与任务推断有关的逻辑如下:
- 如果采用了Uber运行模式,则会调用SpeculatorEventDispatcher的disableSpeculation方法(见代码清单6),使得任务推断失效。
- 如果未采用Uber运行模式,则会向调度器主动发送一个SpeculatorEvent事件。此处构造SpeculatorEvent事件的代码如下;
public SpeculatorEvent(JobId jobID, long timestamp) { super(Speculator.EventType.JOB_CREATE, timestamp); this.jobID = jobID; }
由此可见,在启动MRAppMaster的阶段,创建的SpeculatorEvent事件的类型是Speculator.EventType.JOB_CREATE。SpeculatorEventDispatcher的handle方法会被调度器执行,用以处理SpeculatorEvent事件,其代码实现见代码清单6。
代码清单6 SpeculatorEventDispatcher的实现
private class SpeculatorEventDispatcher implements EventHandler<SpeculatorEvent> { private final Configuration conf; private volatile boolean disabled; public SpeculatorEventDispatcher(Configuration config) { this.conf = config; } @Override public void handle(final SpeculatorEvent event) { if (disabled) { return; } TaskId tId = event.getTaskID(); TaskType tType = null; /* event‘s TaskId will be null if the event type is JOB_CREATE or * ATTEMPT_STATUS_UPDATE */ if (tId != null) { tType = tId.getTaskType(); } boolean shouldMapSpec = conf.getBoolean(MRJobConfig.MAP_SPECULATIVE, false); boolean shouldReduceSpec = conf.getBoolean(MRJobConfig.REDUCE_SPECULATIVE, false); /* The point of the following is to allow the MAP and REDUCE speculative * config values to be independent: * IF spec-exec is turned on for maps AND the task is a map task * OR IF spec-exec is turned on for reduces AND the task is a reduce task * THEN call the speculator to handle the event. */ if ( (shouldMapSpec && (tType == null || tType == TaskType.MAP)) || (shouldReduceSpec && (tType == null || tType == TaskType.REDUCE))) { // Speculator IS enabled, direct the event to there. callWithJobClassLoader(conf, new Action<Void>() { public Void call(Configuration conf) { speculator.handle(event); return null; } }); } } public void disableSpeculation() { disabled = true; } }
SpeculatorEventDispatcher的实现告诉我们当启用map或者reduce任务推断时,将异步调用Speculator的handle方法处理SpeculatorEvent事件。以默认的DefaultSpeculator的handle方法为例,来看看其实现,代码如下:
@Override public void handle(SpeculatorEvent event) { processSpeculatorEvent(event); }
上述代码实际代理执行了processSpeculatorEvent方法(见代码清单7)
代码清单7 DefaultSpeculator处理SpeculatorEvent事件的实现
private synchronized void processSpeculatorEvent(SpeculatorEvent event) { switch (event.getType()) { case ATTEMPT_STATUS_UPDATE: statusUpdate(event.getReportedStatus(), event.getTimestamp()); break; case TASK_CONTAINER_NEED_UPDATE: { AtomicInteger need = containerNeed(event.getTaskID()); need.addAndGet(event.containersNeededChange()); break; } case ATTEMPT_START: { LOG.info("ATTEMPT_START " + event.getTaskID()); estimator.enrollAttempt (event.getReportedStatus(), event.getTimestamp()); break; } case JOB_CREATE: { LOG.info("JOB_CREATE " + event.getJobID()); estimator.contextualize(getConfig(), context); break; } } }
当DefaultSpeculator收到类型为JOB_CREATE的SpeculatorEvent事件时会匹配执行以下代码:
case JOB_CREATE: { LOG.info("JOB_CREATE " + event.getJobID()); estimator.contextualize(getConfig(), context); break; }
这里实际也调用了StartEndTimesBase的contextualize方法(见代码清单4),不再赘述。
由于DefaultSpeculator也是MRAppMaster的子组件之一,所以在启动MRAppMaster(调用MRAppMaster的serviceStart)的过程中,也会调用DefaultSpeculator的serviceStart方法(见代码清单8)启动DefaultSpeculator。
代码清单8 启动DefaultSpeculator
protected void serviceStart() throws Exception { Runnable speculationBackgroundCore = new Runnable() { @Override public void run() { while (!stopped && !Thread.currentThread().isInterrupted()) { long backgroundRunStartTime = clock.getTime(); try { int speculations = computeSpeculations(); long mininumRecomp = speculations > 0 ? SOONEST_RETRY_AFTER_SPECULATE : SOONEST_RETRY_AFTER_NO_SPECULATE; long wait = Math.max(mininumRecomp, clock.getTime() - backgroundRunStartTime); if (speculations > 0) { LOG.info("We launched " + speculations + " speculations. Sleeping " + wait + " milliseconds."); } Object pollResult = scanControl.poll(wait, TimeUnit.MILLISECONDS); } catch (InterruptedException e) { if (!stopped) { LOG.error("Background thread returning, interrupted", e); } return; } } } }; speculationBackgroundThread = new Thread (speculationBackgroundCore, "DefaultSpeculator background processing"); speculationBackgroundThread.start(); super.serviceStart(); }
启动DefaultSpeculator的主要目的是启动一个线程不断推断执行进行估算,步骤如下:
- 创建了匿名的实现类speculationBackgroundCore,用于在单独的线程中对推断执行进行估算。
- 创建Thread并启动线程。
speculationBackgroundCore中调用的computeSpeculations方法用于计算推断调度执行的map和reduce任务数量,其实现如下:
private int computeSpeculations() { // We‘ll try to issue one map and one reduce speculation per job per run return maybeScheduleAMapSpeculation() + maybeScheduleAReduceSpeculation(); }
computeSpeculations方法返回的结果等于maybeScheduleAMapSpeculation方法(用于推断需要重新分配Container的map任务数量)和maybeScheduleAReduceSpeculation方法(用于推断需要重新分配Container的reduce任务数量)返回值的和。maybeScheduleAMapSpeculation的实现如下:
private int maybeScheduleAMapSpeculation() { return maybeScheduleASpeculation(TaskType.MAP); } private int maybeScheduleAReduceSpeculation() { return maybeScheduleASpeculation(TaskType.REDUCE); }
maybeScheduleAMapSpeculation和maybeScheduleAReduceSpeculation实际都调用了maybeScheduleASpeculation方法,其实现见代码清单9。
代码清单9 maybeScheduleASpeculation用于计算map或者reduce任务推断调度的可能性
private int maybeScheduleASpeculation(TaskType type) { int successes = 0; long now = clock.getTime(); ConcurrentMap<JobId, AtomicInteger> containerNeeds = type == TaskType.MAP ? mapContainerNeeds : reduceContainerNeeds; for (ConcurrentMap.Entry<JobId, AtomicInteger> jobEntry : containerNeeds.entrySet()) { // This race conditon is okay. If we skip a speculation attempt we // should have tried because the event that lowers the number of // containers needed to zero hasn‘t come through, it will next time. // Also, if we miss the fact that the number of containers needed was // zero but increased due to a failure it‘s not too bad to launch one // container prematurely. if (jobEntry.getValue().get() > 0) { continue; } int numberSpeculationsAlready = 0; int numberRunningTasks = 0; // loop through the tasks of the kind Job job = context.getJob(jobEntry.getKey()); Map<TaskId, Task> tasks = job.getTasks(type); int numberAllowedSpeculativeTasks = (int) Math.max(MINIMUM_ALLOWED_SPECULATIVE_TASKS, PROPORTION_TOTAL_TASKS_SPECULATABLE * tasks.size()); TaskId bestTaskID = null; long bestSpeculationValue = -1L; // this loop is potentially pricey. // TODO track the tasks that are potentially worth looking at for (Map.Entry<TaskId, Task> taskEntry : tasks.entrySet()) { long mySpeculationValue = speculationValue(taskEntry.getKey(), now); if (mySpeculationValue == ALREADY_SPECULATING) { ++numberSpeculationsAlready; } if (mySpeculationValue != NOT_RUNNING) { ++numberRunningTasks; } if (mySpeculationValue > bestSpeculationValue) { bestTaskID = taskEntry.getKey(); bestSpeculationValue = mySpeculationValue; } } numberAllowedSpeculativeTasks = (int) Math.max(numberAllowedSpeculativeTasks, PROPORTION_RUNNING_TASKS_SPECULATABLE * numberRunningTasks); // If we found a speculation target, fire it off if (bestTaskID != null && numberAllowedSpeculativeTasks > numberSpeculationsAlready) { addSpeculativeAttempt(bestTaskID); ++successes; } } return successes; }
maybeScheduleASpeculation方法首先根据当前Task的类型(map或reduce)获取相应类型任务的需要分配Container数量的缓存containerNeeds,然后遍历containerNeeds。
遍历containerNeeds的执行步骤如下:
- 如果当前Job依然有未分配Container的Task,那么跳过当前循环,继续下一次循环。这说明如果当前Job的某一类型的Task依然存在未分配Container的,则不会进行任务推断;
- 从当前应用的上下文AppContext中获取Job,并获取此Job的所有的Task(map或者reduce);
- 计算允许执行推断的Task数量numberAllowedSpeculativeTasks(map或者reduce)。其中MINIMUM_ALLOWED_SPECULATIVE_TASKS的值是10,PROPORTION_TOTAL_TASKS_SPECULATABLE的值是0.01。numberAllowedSpeculativeTasks取MINIMUM_ALLOWED_SPECULATIVE_TASKS与PROPORTION_TOTAL_TASKS_SPECULATABLE*任务数量之积之间的最大值。因此我们知道,当Job的某一类型(map或者reduce)的Task的数量小于1100时,计算得到的numberAllowedSpeculativeTasks等于10,如果Job的某一类型(map或者reduce)的Task的数量大于等于1100时,numberAllowedSpeculativeTasks才会大于10。numberAllowedSpeculativeTasks变量可以有效防止大量任务同时启动备份任务所造成的资源浪费。
- 遍历Job对应的map任务或者reduce任务集合,调用speculationValue方法获取每一个Task的推断值。并在迭代完所有的map任务或者reduce任务后,获取这一任务集合中的推断值bestSpeculationValue最大的任务ID。
- 再次计算numberAllowedSpeculativeTasks,其中PROPORTION_RUNNING_TASKS_SPECULATABLE的值等于0.1,numberRunningTasks是处于运行中的Task。numberAllowedSpeculativeTasks取numberAllowedSpeculativeTasks与PROPORTION_RUNNING_TASKS_SPECULATABLE*numberRunningTasks之积之间的最大值。
- 如果numberAllowedSpeculativeTasks大于numberSpeculationsAlready(已经推断执行过的Task数量),则调用addSpeculativeAttempt方法(见代码清单10)将第4步中选出的任务的任务ID添加到推断尝试中。
代码清单10 添加推断执行的尝试
//Add attempt to a given Task. protected void addSpeculativeAttempt(TaskId taskID) { LOG.info ("DefaultSpeculator.addSpeculativeAttempt -- we are speculating " + taskID); eventHandler.handle(new TaskEvent(taskID, TaskEventType.T_ADD_SPEC_ATTEMPT)); mayHaveSpeculated.add(taskID); }
根据代码清单10,我们看到推断执行尝试是通过发送类型为TaskEventType.T_ADD_SPEC_ATTEMPT的TaskEvent事件完成的。
估算任务的推断值
在分析代码清单9时,我故意跳过了speculationValue方法的分析。speculationValue方法(见代码清单11)主要用于估算每个任务的推断值。
代码清单11 估算任务的推断值
private long speculationValue(TaskId taskID, long now) { Job job = context.getJob(taskID.getJobId()); Task task = job.getTask(taskID); Map<TaskAttemptId, TaskAttempt> attempts = task.getAttempts(); long acceptableRuntime = Long.MIN_VALUE; long result = Long.MIN_VALUE; if (!mayHaveSpeculated.contains(taskID)) { acceptableRuntime = estimator.thresholdRuntime(taskID); if (acceptableRuntime == Long.MAX_VALUE) { return ON_SCHEDULE; } } TaskAttemptId runningTaskAttemptID = null; int numberRunningAttempts = 0; for (TaskAttempt taskAttempt : attempts.values()) { if (taskAttempt.getState() == TaskAttemptState.RUNNING || taskAttempt.getState() == TaskAttemptState.STARTING) { if (++numberRunningAttempts > 1) { return ALREADY_SPECULATING; } runningTaskAttemptID = taskAttempt.getID(); long estimatedRunTime = estimator.estimatedRuntime(runningTaskAttemptID); long taskAttemptStartTime = estimator.attemptEnrolledTime(runningTaskAttemptID); if (taskAttemptStartTime > now) { // This background process ran before we could process the task // attempt status change that chronicles the attempt start return TOO_NEW; } long estimatedEndTime = estimatedRunTime + taskAttemptStartTime; long estimatedReplacementEndTime = now + estimator.estimatedNewAttemptRuntime(taskID); float progress = taskAttempt.getProgress(); TaskAttemptHistoryStatistics data = runningTaskAttemptStatistics.get(runningTaskAttemptID); if (data == null) { runningTaskAttemptStatistics.put(runningTaskAttemptID, new TaskAttemptHistoryStatistics(estimatedRunTime, progress, now)); } else { if (estimatedRunTime == data.getEstimatedRunTime() && progress == data.getProgress()) { // Previous stats are same as same stats if (data.notHeartbeatedInAWhile(now)) { // Stats have stagnated for a while, simulate heart-beat. TaskAttemptStatus taskAttemptStatus = new TaskAttemptStatus(); taskAttemptStatus.id = runningTaskAttemptID; taskAttemptStatus.progress = progress; taskAttemptStatus.taskState = taskAttempt.getState(); // Now simulate the heart-beat handleAttempt(taskAttemptStatus); } } else { // Stats have changed - update our data structure data.setEstimatedRunTime(estimatedRunTime); data.setProgress(progress); data.resetHeartBeatTime(now); } } if (estimatedEndTime < now) { return PROGRESS_IS_GOOD; } if (estimatedReplacementEndTime >= estimatedEndTime) { return TOO_LATE_TO_SPECULATE; } result = estimatedEndTime - estimatedReplacementEndTime; } } // If we are here, there‘s at most one task attempt. if (numberRunningAttempts == 0) { return NOT_RUNNING; } if (acceptableRuntime == Long.MIN_VALUE) { acceptableRuntime = estimator.thresholdRuntime(taskID); if (acceptableRuntime == Long.MAX_VALUE) { return ON_SCHEDULE; } } return result; }
speculationValue方法的执行步骤如下:
- 如果任务还没有被推断执行,那么调用estimator的thresholdRuntime方法获取任务可以接受的运行时长acceptableRuntime。如果acceptableRuntime等于Long.MAX_VALUE,则将ON_SCHEDULE作为返回值,ON_SCHEDULE的值是Long.MIN_VALUE,以此表示当前任务的推断值很小,即被推断尝试的可能最小。
- 如果任务的运行实例数大于1,则说明此任务已经发生了推断执行,因此返回ALREADY_SPECULATING。ALREADY_SPECULATING等于Long.MIN_VALUE + 1。
- 调用estimator的estimatedRuntime方法获取任务运行实例的估算运行时长estimatedRunTime。
- 调用estimator的attemptEnrolledTime方法获取任务实例开始运行的时间,此时间即为startTimes中缓存的start。这个值是在任务实例启动时导致DefaultSpeculator的processSpeculatorEvent方法处理Speculator.EventType.ATTEMPT_START类型的SpeculatorEvent事件时保存的。
- estimatedEndTime表示估算任务实例的运行结束时间,estimatedEndTime = estimatedRunTime + taskAttemptStartTime。
- 调用estimator的estimatedNewAttemptRuntime方法估算如果此时重新为任务启动一个实例,此实例运行结束的时间estimatedReplacementEndTime。
- 如果缓存中没有任务实例的历史统计信息,那么将estimatedRunTime、任务实例进度progress,当前时间封装为历史统计信息缓存起来。
- 如果缓存中存在任务实例的历史统计信息,如果缓存的estimatedRunTime和本次估算的estimatedRunTime一样并且缓存的实例进度progress和本次获取的任务实例进度progress一样,当有一段时间没有收到心跳了,则模拟一次心跳。如果缓存的estimatedRunTime和本次估算的estimatedRunTime不一样或者缓存的实例进度progress和本次获取的任务实例进度progress不一样,那么将estimatedRunTime、任务实例进度progress,当前时间更新到任务实例的历史统计信息中。
- 如果estimatedEndTime小于当前时间,则说明任务实例的进度良好,返回PROGRESS_IS_GOOD,PROGRESS_IS_GOOD等于Long.MIN_VALUE + 3。
- 如果estimatedReplacementEndTime大于等于estimatedEndTime,则说明即便启动备份任务实例也无济于事,因为它的结束时间达不到节省作业总运行时长的作用。
- 计算本次估算的结果值result,它等于estimatedEndTime - estimatedReplacementEndTime,当这个差值越大表示备份任务实例运行后比原任务实例的结束时间就越早,因此调度执行的价值越大。
- 如果numberRunningAttempts等于0,则表示当前任务还没有启动任务实例,返回NOT_RUNNING,NOT_RUNNING等于Long.MIN_VALUE + 4。
- 重新计算acceptableRuntime,处理方式与第1步相同。
- 返回result。
总结
根据源码分析,我们知道DefaultSpeculator启动的线程会不时去计算作业的各个任务的推断值,即speculationValue方法计算的结果。从所有任务的推断值中选择值最大,也就是说价值最高的,为其配备一个备份任务。这里有个问题,Estimator推算用的各种统计和监控数据是从哪里来的呢?我将另写一文详细说明。
后记:个人总结整理的《深入理解Spark:核心思想与源码分析》一书现在已经正式出版上市,目前京东、当当、天猫等网站均有销售,欢迎感兴趣的同学购买。
京东:http://item.jd.com/11846120.html
当当:http://product.dangdang.com/23838168.html