Spark源码阅读笔记之Broadcast(一)

Spark源码阅读笔记之Broadcast(一)

Spark会序列化在各个任务上使用到的变量,然后传递到Executor中,由于Executor中得到的只是变量的拷贝,因此对变量的改变只在该Executor有效。序列化后的任务的大小是有限制的(由spark.akka.frameSize决定,值为其减去200K,默认为10M-200K),spark会进行检查,超出该限制的任务会被抛弃。因此,对于需要共享比较大的数据时,需要使用Broadcast。

Spark实现了两种传输Broadcast的机制:Http和Torrent(由参数spark.broadcast.factory确定,默认为org.apache.spark.broadcast.TorrentBroadcastFactory,可以修改为org.apache.spark.broadcast.HttpBroadcastFactory)。

Http的传输机制是在Driver中启动Http服务,然后将需要传输的变量存储为Http服务根目录下的一个文件,当Executor中需要使用时便向Http服务请求,下载该文件,然后读取。这种机制下,所有Executor都需要向Driver请求下载,Driver的网络通信会成为瓶颈。

Torrent则是一种类似于BitTorrent的实现机制。当要共享变量时,Driver先将序列化后的变量分片,然后存储到BlockManager中,当Executor需要使用时,则会以重新洗牌后的顺序向BlockManager请求该变量的各个分片,然后重新组合成完整的变量。在Executor请求变量的分片的过程中,每当该Executor取得一个分片,则马上存储到BlockManager中,这样其他的Executor若需要该分片时也可以向已经取得分片的Executor获取,而不需向Driver获取,同时由于各个Executor都是按随机洗牌的顺序来请求各个分片的,因此一般不会出现所有的Executor同时请求相同的分片情况,从而造成Driver网络开销过大。因此该机制能够防止Driver的网络通信成为瓶颈。

A BitTorrent-like implementation.The mechanism is as follows:

The driver divides the serialized object into small chunks and stores those chunks in the BlockManager of the driver.

On each executor, the executor first attempts to fetch the object from its BlockManager. If it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or other executors if available. Once it gets the chunks, it puts the chunks in its own BlockManager, ready for other executors to fetch from.

This prevents the driver from being the bottleneck in sending out multiple copies of the

broadcast data (one per executor) as done by the [[org.apache.spark.broadcast.HttpBroadcast]].

Spark统一通过BroadcastManager来创建Broadcast(SparkContext中的broadcast函数调用了BroadcastManager的newBroadcast函数,BroadcastManager则在SparkEnv中被创建),BroadcastManager则封装了BroadcastFactoryBroadcastManagerinitializenewBroadcastunbroadcaststop四个函数:initialize函数根据spark.broadcast.factory的配置创建BroadcastFactory,并初始化;newBroadcast调用BroadcastFactorynewBroadcast函数;unbroadcast调用BroadcastFactoryunbroadcast函数;stop函数调用BroadcastFactorystop函数。

BroadcastManager代码

private[spark] class BroadcastManager(
    val isDriver: Boolean,
    conf: SparkConf,
    securityManager: SecurityManager)
  extends Logging {

  private var initialized = false
  private var broadcastFactory: BroadcastFactory = null

  initialize()

  // Called by SparkContext or Executor before using Broadcast
  private def initialize() {
    synchronized {
      if (!initialized) {
        val broadcastFactoryClass =
          conf.get("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")

        broadcastFactory =
          Class.forName(broadcastFactoryClass).newInstance.asInstanceOf[BroadcastFactory]

        // Initialize appropriate BroadcastFactory and BroadcastObject
        broadcastFactory.initialize(isDriver, conf, securityManager)

        initialized = true
      }
    }
  }

  def stop() {
    broadcastFactory.stop()
  }

  private val nextBroadcastId = new AtomicLong(0)

  def newBroadcast[T: ClassTag](value_ : T, isLocal: Boolean) = {
    broadcastFactory.newBroadcast[T](value_, isLocal, nextBroadcastId.getAndIncrement())
  }

  def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean) {
    broadcastFactory.unbroadcast(id, removeFromDriver, blocking)
  }
}

BroadcastFactory是一个接口(特质),有initializenewBroadcastunbroadcaststop四个函数来供BroadcastManager调用。BroadcastFactory有两种实现:HttpBroadcastFactoryTorrentBroadcastFactory,对应Broadcast的两种传输机制:Http和Torrent。

An interface for all the broadcast implementations in Spark (to allow multiple broadcast implementations). SparkContext uses a user-specified BroadcastFactory implementation to instantiate a particular broadcast for the entire Spark job.

BroadcastFactory代码

trait BroadcastFactory {

  def initialize(isDriver: Boolean, conf: SparkConf, securityMgr: SecurityManager): Unit

  /**
   * Creates a new broadcast variable.
   *
   * @param value value to broadcast
   * @param isLocal whether we are in local mode (single JVM process)
   * @param id unique id representing this broadcast variable
   */
  def newBroadcast[T: ClassTag](value: T, isLocal: Boolean, id: Long): Broadcast[T]

  def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean): Unit

  def stop(): Unit
}

BroadcastFactorynewBroadcast生成BroadcastBroadcast是一个实现Serializable接口的抽象类,主要有三个抽象方法:getValuedoUnpersistdoDestroy,其他的方法底层都会调用这三个方法,因此子类需要实现这三个方法和序列化机制。

Broadcast代码

abstract class Broadcast[T: ClassTag](val id: Long) extends Serializable with Logging {

  /**
   * Flag signifying whether the broadcast variable is valid
   * (that is, not already destroyed) or not.
   */
  @volatile private var _isValid = true

  private var _destroySite = ""

  /** Get the broadcasted value. */
  def value: T = {
    assertValid()
    getValue()
  }

  /**
   * Asynchronously delete cached copies of this broadcast on the executors.
   * If the broadcast is used after this is called, it will need to be re-sent to each executor.
   */
  def unpersist() {
    unpersist(blocking = false)
  }

  /**
   * Delete cached copies of this broadcast on the executors. If the broadcast is used after
   * this is called, it will need to be re-sent to each executor.
   * @param blocking Whether to block until unpersisting has completed
   */
  def unpersist(blocking: Boolean) {
    assertValid()
    doUnpersist(blocking)
  }

  /**
   * Destroy all data and metadata related to this broadcast variable. Use this with caution;
   * once a broadcast variable has been destroyed, it cannot be used again.
   * This method blocks until destroy has completed
   */
  def destroy() {
    destroy(blocking = true)
  }

  /**
   * Destroy all data and metadata related to this broadcast variable. Use this with caution;
   * once a broadcast variable has been destroyed, it cannot be used again.
   * @param blocking Whether to block until destroy has completed
   */
  private[spark] def destroy(blocking: Boolean) {
    assertValid()
    _isValid = false
    _destroySite = Utils.getCallSite().shortForm
    logInfo("Destroying %s (from %s)".format(toString, _destroySite))
    doDestroy(blocking)
  }

  /**
   * Whether this Broadcast is actually usable. This should be false once persisted state is
   * removed from the driver.
   */
  private[spark] def isValid: Boolean = {
    _isValid
  }

  /**
   * Actually get the broadcasted value. Concrete implementations of Broadcast class must
   * define their own way to get the value.
   */
  protected def getValue(): T

  /**
   * Actually unpersist the broadcasted value on the executors. Concrete implementations of
   * Broadcast class must define their own logic to unpersist their own data.
   */
  protected def doUnpersist(blocking: Boolean)

  /**
   * Actually destroy all data and metadata related to this broadcast variable.
   * Implementation of Broadcast class must define their own logic to destroy their own
   * state.
   */
  protected def doDestroy(blocking: Boolean)

  /** Check if this broadcast is valid. If not valid, exception is thrown. */
  protected def assertValid() {
    if (!_isValid) {
      throw new SparkException(
        "Attempted to use %s after it was destroyed (%s) ".format(toString, _destroySite))
    }
  }

  override def toString = "Broadcast(" + id + ")"
}

版权声明:本文为博主原创文章,未经博主允许不得转载。

时间: 2024-08-02 15:14:13

Spark源码阅读笔记之Broadcast(一)的相关文章

第3课 Scala函数式编程彻底精通及Spark源码阅读笔记

本课内容: 1:scala中函数式编程彻底详解 2:Spark源码中的scala函数式编程 3:案例和作业 函数式编程开始: def fun1(name: String){ println(name) } //将函数名赋值给一个变量,那么这个变量就是一个函数了. val fun1_v = fun1_ 访问 fun1_v("Scala") 结果:Scala 匿名函数:参数名称用 => 指向函数体 val fun2=(content: String) => println(co

spark源码阅读笔记RDD(七) RDD的创建、读取和保存

Spark支持很多输入和输出源,同时还支持内建RDD.Spark本身是基于Hadoop的生态圈,它可以通过 Hadoop MapReduce所使用的InpoutFormat和OutputFormat接口访问数据.而且大部分的文件格式和存储系统 (HDFS,Hbase,S3等)都支持这种接口.Spark常见的数据源如下: (1) 文件格式和文件系统,也就是我们经常用的TXT,JSON,CSV等這些文件格式 (2)SparkSQL中的结构化数据源 (3)数据库与键值存储(Hbase和JDBC源) 当

Apache Storm源码阅读笔记

欢迎转载,转载请注明出处. 楔子 自从建了Spark交流的QQ群之后,热情加入的同学不少,大家不仅对Spark很热衷对于Storm也是充满好奇.大家都提到一个问题就是有关storm内部实现机理的资料比较少,理解起来非常费劲. 尽管自己也陆续对storm的源码走读发表了一些博文,当时写的时候比较匆忙,有时候衔接的不是太好,此番做了一些整理,主要是针对TridentTopology部分,修改过的内容采用pdf格式发布,方便打印. 文章中有些内容的理解得益于徐明明和fxjwind两位的指点,非常感谢.

CI框架源码阅读笔记3 全局函数Common.php

从本篇开始,将深入CI框架的内部,一步步去探索这个框架的实现.结构和设计. Common.php文件定义了一系列的全局函数(一般来说,全局函数具有最高的加载优先权,因此大多数的框架中BootStrap引导文件都会最先引入全局函数,以便于之后的处理工作). 打开Common.php中,第一行代码就非常诡异: if ( ! defined('BASEPATH')) exit('No direct script access allowed'); 上一篇(CI框架源码阅读笔记2 一切的入口 index

源码阅读笔记 - 1 MSVC2015中的std::sort

大约寒假开始的时候我就已经把std::sort的源码阅读完毕并理解其中的做法了,到了寒假结尾,姑且把它写出来 这是我的第一篇源码阅读笔记,以后会发更多的,包括算法和库实现,源码会按照我自己的代码风格格式化,去掉或者展开用于条件编译或者debug检查的宏,依重要程度重新排序函数,但是不会改变命名方式(虽然MSVC的STL命名实在是我不能接受的那种),对于代码块的解释会在代码块前(上面)用注释标明. template<class _RanIt, class _Diff, class _Pr> in

CI框架源码阅读笔记5 基准测试 BenchMark.php

上一篇博客(CI框架源码阅读笔记4 引导文件CodeIgniter.php)中,我们已经看到:CI中核心流程的核心功能都是由不同的组件来完成的.这些组件类似于一个一个单独的模块,不同的模块完成不同的功能,各模块之间可以相互调用,共同构成了CI的核心骨架. 从本篇开始,将进一步去分析各组件的实现细节,深入CI核心的黑盒内部(研究之后,其实就应该是白盒了,仅仅对于应用来说,它应该算是黑盒),从而更好的去认识.把握这个框架. 按照惯例,在开始之前,我们贴上CI中不完全的核心组件图: 由于BenchMa

CI框架源码阅读笔记2 一切的入口 index.php

上一节(CI框架源码阅读笔记1 - 环境准备.基本术语和框架流程)中,我们提到了CI框架的基本流程,这里这次贴出流程图,以备参考: 作为CI框架的入口文件,源码阅读,自然由此开始.在源码阅读的过程中,我们并不会逐行进行解释,而只解释核心的功能和实现. 1.       设置应用程序环境 define('ENVIRONMENT', 'development'); 这里的development可以是任何你喜欢的环境名称(比如dev,再如test),相对应的,你要在下面的switch case代码块中

CI框架源码阅读笔记4 引导文件CodeIgniter.php

到了这里,终于进入CI框架的核心了.既然是"引导"文件,那么就是对用户的请求.参数等做相应的导向,让用户请求和数据流按照正确的线路各就各位.例如,用户的请求url: http://you.host.com/usr/reg 经过引导文件,实际上会交给Application中的UsrController控制器的reg方法去处理. 这之中,CodeIgniter.php做了哪些工作?我们一步步来看. 1.    导入预定义常量.框架环境初始化 之前的一篇博客(CI框架源码阅读笔记2 一切的入

IOS测试框架之:athrun的InstrumentDriver源码阅读笔记

athrun的InstrumentDriver源码阅读笔记 作者:唯一 athrun是淘宝的开源测试项目,InstrumentDriver是ios端的实现,之前在公司项目中用过这个框架,没有深入了解,现在回来记录下. 官方介绍:http://code.taobao.org/p/athrun/wiki/instrumentDriver/ 优点:这个框架是对UIAutomation的java实现,在代码提示.用例维护方面比UIAutomation强多了,借junit4的光,我们可以通过junit4的