【Spark】RDD操作具体解释2——值型Transformation算子 / 憋错料

处理数据类型为Value型的Transformation算子能够依据RDD变换算子的输入分区与输出分区关系分为下面几种类型:

1）输入分区与输出分区一对一型

2）输入分区与输出分区多对一型

3）输入分区与输出分区多对多型

4）输出分区为输入分区子集型

5）另一种特殊的输入与输出分区一对一的算子类型：Cache型。 Cache算子对RDD分区进行缓存

输入分区与输出分区一对一型

（1）map

将原来RDD的每一个数据项通过map中的用户自己定义函数f映射转变为一个新的元素。

源代码中的map算子相当于初始化一个RDD。新RDD叫作MapPartitionsRDD(this,sc.clean(f))。

图中，每一个方框表示一个RDD分区，左側的分区经过用户自己定义函数f：T->U映射为右側的新的RDD分区。可是实际仅仅有等到Action算子触发后。这个f函数才会和其它函数在一个Stage中对数据进行运算。 V1输入f转换输出V’ 1。

源代码：

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

（2）flatMap

将原来RDD中的每一个元素通过函数f转换为新的元素，并将生成的RDD的每一个集合中的元素合并为一个集合。内部创建FlatMappedRDD（this，sc.clean（f））。

图中。小方框表示RDD的一个分区，对分区进行flatMap函数操作，flatMap中传入的函数为f:T->U，T和U能够是随意的数据类型。

将分区中的数据通过用户自己定义函数f转换为新的数据。外部慷慨框能够觉得是一个RDD分区，小方框代表一个集合。 V1、 V2、 V3在一个集合作为RDD的一个数据项，转换为V’ 1、 V’ 2、 V’ 3后，将结合拆散。形成为RDD中的数据项。

源代码：

  /**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

（3）mapPartitions

mapPartitions函数获取到每一个分区的迭代器，在函数中通过这个分区总体的迭代器对整个分区的元素进行操作。内部实现是生成MapPartitionsRDD。

图中，用户通过函数f(iter) => iter.filter(_>=3)对分区中的全部数据进行过滤，>=3的数据保留。一个方块代表一个RDD分区，含有1、 2、 3的分区过滤仅仅剩下元素3。

源代码：

  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn‘t modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = {
    val func = (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter)
    new MapPartitionsRDD(this, sc.clean(func), preservesPartitioning)
  }S

（4）glom

glom函数将每一个分区形成一个数组，内部实现是返回的RDD[Array[T]]。

图中，方框代表一个分区。

该图表示含有V1、 V2、 V3的分区通过函数glom形成一个数组Array[(V1),(V2),(V3)]。

源代码：

  /**S
   * Return an RDD created by coalescing all elements within each partition into an array.
   */
  def glom(): RDD[Array[T]] = {
    new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))
  }

输入分区与输出分区多对一型

（1）union

使用union函数时须要保证两个RDD元素的数据类型同样，返回的RDD数据类型和被合并的RDD元素数据类型同样。并不进行去重操作，保存全部元素。

假设想去重。能够使用distinct（）。

++符号相当于uion函数操作。

图中，左側的慷慨框代表两个RDD，慷慨框内的小方框代表RDD的分区。右側慷慨框代表合并后的RDD，慷慨框内的小方框代表分区。含有V1，V2…U4的RDD和含有V1，V8…U8的RDD合并全部元素形成一个RDD。V1、V1、V2、V8形成一个分区，其它元素同理进行合并。

源代码：

  /**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def union(other: RDD[T]): RDD[T] = {
    if (partitioner.isDefined && other.partitioner == partitioner) {
      new PartitionerAwareUnionRDD(sc, Array(this, other))
    } else {
      new UnionRDD(sc, Array(this, other))
    }
  }

（2）certesian

对两个RDD内的全部元素进行笛卡尔积操作。

操作后，内部实现返回CartesianRDD。

左側的慷慨框代表两个RDD。慷慨框内的小方框代表RDD的分区。右側慷慨框代表合并后的RDD。慷慨框内的小方框代表分区。

慷慨框代表RDD，慷慨框中的小方框代表RDD分区。

比如，V1和另一个RDD中的W1、 W2、 Q5进行笛卡尔积运算形成（V1，W1）、（V1。W2）、（V1。Q5）。

源代码：

  /**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = new CartesianRDD(sc, this, other)

输入分区与输出分区多对多型

groupBy

将元素通过函数生成对应的Key，数据就转化为Key-Value格式，之后将Key同样的元素分为一组。

val cleanF = sc.clean(f)中sc.clean函数将用户函数预处理。

this.map(t => (cleanF(t), t)).groupByKey(p)对数据map进行函数操作，再对groupByKey进行分组操作。当中。p中确定了分区个数和分区函数，也就决定了并行化的程度。

图中。方框代表一个RDD分区，同样key的元素合并到一个组。

比如。V1，V2合并为一个Key-Value对，当中key为“ V” ，Value为“ V1。V2” ，形成V，Seq（V1。V2）。

源代码：

  /**
   * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * Note: This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   */
  def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] =
    groupBy[K](f, defaultPartitioner(this))

  /**
   * Return an RDD of grouped elements. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * Note: This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   */
  def groupBy[K](f: T => K, numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] =
    groupBy(f, new HashPartitioner(numPartitions))

  /**
   * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * Note: This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   */
  def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }

输出分区为输入分区子集型

（1）filter

filter的功能是对元素进行过滤，对每一个元素应用f函数，返回值为true的元素在RDD中保留，返回为false的将过滤掉。内部实现相当于生成FilteredRDD(this，sc.clean(f))。

图中。每一个方框代表一个RDD分区。

T能够是随意的类型。

通过用户自己定义的过滤函数f，对每一个数据项进行操作，将满足条件，返回结果为true的数据项保留。比如，过滤掉V2、 V3保留了V1。将区分命名为V1’。

源代码：

  /**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T] = {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (context, pid, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }

（2）distinct

distinct将RDD中的元素进行去重操作。

图中。每一个方框代表一个分区。通过distinct函数。将数据去重。

比如。反复数据V1、 V1去重后仅仅保留一份V1。

源代码：

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): RDD[T] = distinct(partitions.size)

（3）subtract

subtract相当于进行集合的差操作，RDD 1去除RDD 1和RDD 2交集中的全部元素。

图中。左側的慷慨框代表两个RDD，慷慨框内的小方框代表RDD的分区。右側慷慨框代表合并后的RDD。慷慨框内的小方框代表分区。V1在两个RDD中均有，依据差集运算规则。新RDD不保留。V2在第一个RDD有，第二个RDD没有。则在新RDD元素中包括V2。

源代码：

  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be &lt;= us.
   */
  def subtract(other: RDD[T]): RDD[T] =
    subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.size)))

  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   */
  def subtract(other: RDD[T], numPartitions: Int): RDD[T] =
    subtract(other, new HashPartitioner(numPartitions))

  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   */
  def subtract(other: RDD[T], p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = {
    if (partitioner == Some(p)) {
      // Our partitioner knows how to handle T (which, since we have a partitioner, is
      // really (K, V)) so make a new Partitioner that will de-tuple our fake tuples
      val p2 = new Partitioner() {
        override def numPartitions = p.numPartitions
        override def getPartition(k: Any) = p.getPartition(k.asInstanceOf[(Any, _)]._1)
      }
      // Unfortunately, since we‘re making a new p2, we‘ll get ShuffleDependencies
      // anyway, and when calling .keys, will not have a partitioner set, even though
      // the SubtractedRDD will, thanks to p2‘s de-tupled partitioning, already be
      // partitioned by the right/real keys (e.g. p).
      this.map(x => (x, null)).subtractByKey(other.map((_, null)), p2).keys
    } else {
      this.map(x => (x, null)).subtractByKey(other.map((_, null)), p).keys
    }
  }

（4）sample

sample将RDD这个集合内的元素进行採样。获取全部元素的子集。

用户能够设定是否有放回的抽样、百分比、随机种子。进而决定採样方式。

參数说明：

withReplacement=true，表示有放回的抽样；

withReplacement=false。表示无放回的抽样。

每一个方框是一个RDD分区。通过sample函数，採样50%的数据。V1、V2、U1、U2、U3、U4採样出数据V1和U1、U2，形成新的RDD。

源代码：

  /**
   * Return a sampled subset of this RDD.
   */
  def sample(withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T] = {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }

（5）takeSample

takeSample()函数和上面的sample函数是一个原理。可是不使用相对照例採样。而是按设定的採样个数进行採样。同一时候返回结果不再是RDD，而是相当于对採样后的数据进行collect()，返回结果的集合为单机的数组。

图中，左側的方框代表分布式的各个节点上的分区。右側方框代表单机上返回的结果数组。通过takeSample对数据採样。设置为採样一份数据，返回结果为V1。

源代码：

  /**
   * Return a fixed-size sampled subset of this RDD in an array
   *
   * @param withReplacement whether sampling is done with replacement
   * @param num size of the returned sample
   * @param seed seed for the random number generator
   * @return sample of specified size in an array
   */
  def takeSample(withReplacement: Boolean,
      num: Int,
      seed: Long = Utils.random.nextLong): Array[T] = {
    val numStDev =  10.0

    if (num < 0) {
      throw new IllegalArgumentException("Negative number of elements requested")
    } else if (num == 0) {
      return new Array[T](0)
    }

    val initialCount = this.count()
    if (initialCount == 0) {
      return new Array[T](0)
    }

    val maxSampleSize = Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt
    if (num > maxSampleSize) {
      throw new IllegalArgumentException("Cannot support a sample size > Int.MaxValue - " +
        s"$numStDev * math.sqrt(Int.MaxValue)")
    }

    val rand = new Random(seed)
    if (!withReplacement && num >= initialCount) {
      return Utils.randomizeInPlace(this.collect(), rand)
    }

    val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
      withReplacement)

    var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()

    // If the first sample didn‘t turn out large enough, keep trying to take samples;
    // this shouldn‘t happen often because we use a big multiplier for the initial size
    var numIters = 0
    while (samples.length < num) {
      logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
      samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
      numIters += 1
    }

    Utils.randomizeInPlace(samples, rand).take(num)
  }

Cache型

（1）cache

cache将RDD元素从磁盘缓存到内存，相当于persist（MEMORY_ONLY）函数的功能。

图中。每一个方框代表一个RDD分区。左側相当于数据分区都存储在磁盘。通过cache算子将数据缓存在内存。

源代码：

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def cache(): this.type = persist()

（2）persist

persist函数对RDD进行缓存操作。

数据缓存在哪里由StorageLevel枚举类型确定。

有几种类型的组合，DISK代表磁盘，MEMORY代表内存，SER代表数据是否进行序列化存储。

StorageLevel是枚举类型，代表存储模式，如。MEMORY_AND_DISK_SER代表数据能够存储在内存和磁盘，而且以序列化的方式存储。

其它同理。

图中，方框代表RDD分区。

disk代表存储在磁盘。mem代表存储在内存。

数据最初全部存储在磁盘，通过persist（MEMORY_AND_DISK）将数据缓存到内存，可是有的分区无法容纳在内存。比如：图3-18中将含有V1，V2，V3的RDD存储到磁盘，将含有U1。U2的RDD仍旧存储在内存。

源代码：

  /**
   * Set this RDD‘s storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet..
   */
  def persist(newLevel: StorageLevel): this.type = {
    // TODO: Handle changes of StorageLevel
    if (storageLevel != StorageLevel.NONE && newLevel != storageLevel) {
      throw new UnsupportedOperationException(
        "Cannot change storage level of an RDD after it was already assigned a level")
    }
    sc.persistRDD(this)
    // Register the RDD with the ContextCleaner for automatic GC-based cleanup
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    storageLevel = newLevel
    this
  }

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

转载请注明作者Jason Ding及其出处

GitCafe博客主页(http://jasonding1354.gitcafe.io/)

Github博客主页(http://jasonding1354.github.io/)

CSDN博客(http://blog.csdn.net/jasonding1354)

简书主页(http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)

Google搜索jasonding1354进入我的博客主页

时间： 2024-10-10 14:20:58

【Spark】RDD操作具体解释2——值型Transformation算子

输入分区与输出分区一对一型

（1）map

（2）flatMap

（3）mapPartitions

（4）glom

输入分区与输出分区多对一型

（1）union

（2）certesian

输入分区与输出分区多对多型

groupBy

输出分区为输入分区子集型

（1）filter

（2）distinct

（3）subtract

（4）sample

（5）takeSample

Cache型

（1）cache

（2）persist

【Spark】RDD操作具体解释2——值型Transformation算子的相关文章

【Spark】RDD操作详解2——值型Transformation算子

【Spark】RDD操作详解3——键值型Transformation算子

【Spark】RDD操作具体解释4——Action算子

Spark RDD操作(1)

Spark RDD操作记录(总结)

Spark RDD 操作实战之文件读取

RDD之五：Key-Value型Transformation算子

Spark RDD常用算子操作（八）键值对关联操作 subtractByKey, join,fullOuterJoin, rightOuterJoin, leftOuterJoin

【Spark】RDD操作详解1——Transformation和Actions概况