Spark RDD Action 简单用例(一) / 憋错料

collectAsMap(): Map[K, V]

返回key-value对，key是唯一的，如果rdd元素中同一个key对应多个value，则只会保留一个。/** * Return the key-value pairs in this RDD to the master as a Map. * * Warning: this doesn‘t return a multimap (so if you have multiple values to the same key, only *          one value per key is preserved in the map returned) * * @note this method should only be used if the resulting data is expected to be small, as * all the data is loaded into the driver‘s memory. */def collectAsMap(): Map[K, V]

scala> val rdd = sc.parallelize(List(("A",1),("A",2),("A",3),("B",1),("B",2),("C",3)),3)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.collectAsMap
res0: scala.collection.Map[String,Int] = Map(A -> 3, C -> 3, B -> 2)

countByKey(): Map[K, Long]

计算有多少个不同的key./** * Count the number of elements for each key, collecting the results to a local Map. * * Note that this method should only be used if the resulting map is expected to be small, as * the whole thing is loaded into the driver‘s memory. * To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which * returns an RDD[T, Long] instead of a map. */def countByKey(): Map[K, Long] = self.withScope {  self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap}

scala> val rdd = sc.parallelize(List((1,1),(1,2),(1,3),(2,1),(2,2),(2,3)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> rdd.countByKey
res5: scala.collection.Map[Int,Long] = Map(1 -> 3, 2 -> 3)

countByValue()

计算不同的value个数，该函数首先通过map将每个元素转成(value,null)的key-value（value为null）对，然后调用countByKey进行统计。

/** * Return the count of each unique value in this RDD as a local map of (value, count) pairs. * * Note that this method should only be used if the resulting map is expected to be small, as * the whole thing is loaded into the driver‘s memory. * To handle very large results, consider using rdd.map(x =&gt; (x, 1L)).reduceByKey(_ + _), which * returns an RDD[T, Long] instead of a map. */def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {  map(value => (value, null)).countByKey()}

scala> val rdd = sc.parallelize(List(1,2,3,4,5,4,4,3,2,1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at parallelize at <console>:24

scala> rdd.countByValue
res12: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 2, 2 -> 2, 3 -> 2, 4 -> 3)

lookup(key: K)

根据key值搜索所有的value./** * Return the list of values in the RDD for key `key`. This operation is done efficiently if the * RDD has a known partitioner by only searching the partition that the key maps to. */def lookup(key: K): Seq[V]

scala> val rdd = sc.parallelize(List(("A",1),("A",2),("A",3),("B",1),("B",2),("C",3)),3)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.lookup("A")
res2: Seq[Int] = WrappedArray(1, 2, 3)

checkpoint()

将RDD数据根据设置的checkpoint目录保存至硬盘中。

/** * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint * directory set with `SparkContext#setCheckpointDir` and all references to its parent * RDDs will be removed. This function must be called before any job has been * executed on this RDD. It is strongly recommended that this RDD is persisted in * memory, otherwise saving it on a file will require recomputation. */def checkpoint(): Unit

/*通过linux命令创建/home/check目录后，设置checkpoint directory*/
scala> sc.setCheckpointDir("/home/check")

scala> val rdd = sc.parallelize(List(("A",1),("A",2),("A",3),("B",1),("B",2),("C",3)),3)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[6] at parallelize at <console>:24

/*
*执行下面的代码会在/home/check目录下创建一个空的目录/home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49
*/
scala> rdd.checkpoint

/*
执行count后会在上述目录下创建一个rdd目录，rdd目录下是数据文件
*/
scala> rdd.count
res5: Long = 6

[[email protected] ~]# ll -a /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49/
total 8
drwxr-xr-x. 2 root root 4096 Sep  4 10:29 .
drwxr-xr-x. 3 root root 4096 Sep  4 10:29 ..
[[email protected] ~]# ll -a /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49/
total 12
drwxr-xr-x. 3 root root 4096 Sep  4 10:30 .
drwxr-xr-x. 3 root root 4096 Sep  4 10:29 ..
drwxr-xr-x. 2 root root 4096 Sep  4 10:30 rdd-6
[[email protected] ~]# ll -a /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49/rdd-6/
total 32
drwxr-xr-x. 2 root root 4096 Sep  4 10:30 .
drwxr-xr-x. 3 root root 4096 Sep  4 10:30 ..
-rw-r--r--. 1 root root  171 Sep  4 10:30 part-00000
-rw-r--r--. 1 root root   12 Sep  4 10:30 .part-00000.crc
-rw-r--r--. 1 root root  170 Sep  4 10:30 part-00001
-rw-r--r--. 1 root root   12 Sep  4 10:30 .part-00001.crc
-rw-r--r--. 1 root root  170 Sep  4 10:30 part-00002
-rw-r--r--. 1 root root   12 Sep  4 10:30 .part-00002.crc

collect()

返回RDD所有元素的数组。/** * Return an array that contains all of the elements in this RDD. * * @note this method should only be used if the resulting array is expected to be small, as * all the data is loaded into the driver‘s memory. */def collect(): Array[T]

scala> val rdd = sc.parallelize(1 to 10,3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

scala> rdd.collect
res8: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

toLocalIterator: Iterator[T]

返回一个包含所有算的迭代器。/** * Return an iterator that contains all of the elements in this RDD. * * The iterator will consume as much memory as the largest partition in this RDD. * * Note: this results in multiple Spark jobs, and if the input RDD is the result * of a wide transformation (e.g. join with different partitioners), to avoid * recomputing the input RDD should be cached first. */def toLocalIterator: Iterator[T]

scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val it = rdd.toLocalIterator
it: Iterator[Int] = non-empty iterator

scala> while(it.hasNext){
     | println(it.next)
     | }
1
2
3
4
5
6
7
8
9
10

count()

返回RDD中元素的数量。/** * Return the number of elements in the RDD. */def count(): Long

scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.count
res1: Long = 10

dependencies

返回该RDD的依赖RDD的地址。/** * Get the list of dependencies of this RDD, taking into account whether the * RDD is checkpointed or not. */final def dependencies: Seq[Dependency[_]]

scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd1 = rdd.filter(_>3)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at filter at <console>:26

scala> val rdd2 = rdd1.filter(_<6)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:28

scala> rdd2.dependencies
res2: Seq[org.apache.spark.Dependency[_]] = List([email protected])

partitions

以数组形式返回RDD各分区地址/** * Get the array of partitions of this RDD, taking into account whether the * RDD is checkpointed or not. */final def partitions: Array[Partition]

scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.partitions
res4: Array[org.apache.spark.Partition] = Array([email protected], [email protected])

first()

返回RDD的第一个元素。/** * Return the first element in this RDD. */def first(): T

scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.first
res5: Int = 1

fold(zeroValue: T)(op: (T, T) => T)

使用zeroValue和每个分区的元素进行聚合运算，最后各分区结果和zeroValue再进行一次聚合运算。/** * @param zeroValue the initial value for the accumulated result of each partition for the `op` *                  operator, and also the initial value for the combine results from different *                  partitions for the `op` operator - this will typically be the neutral *                  element (e.g. `Nil` for list concatenation or `0` for summation) * @param op an operator used to both accumulate results within a partition and combine results *                  from different partitions */def fold(zeroValue: T)(op: (T, T) => T): T

scala> val rdd = sc.parallelize(1 to 5)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> rdd.fold(10)(_+_)
res13: Int = 35

时间： 2024-10-16 01:54:40

Spark RDD Action 简单用例(一)

collectAsMap(): Map[K, V]

countByKey(): Map[K, Long]

countByValue()

lookup(key: K)

checkpoint()

collect()

toLocalIterator: Iterator[T]

count()

dependencies

partitions

first()

fold(zeroValue: T)(op: (T, T) => T)

Spark RDD Action 简单用例(一)的相关文章

Spark RDD Action 简单用例(二)

Spark RDD Transformation 简单用例（三）

Spark RDD Transformation 简单用例（一）

Spark RDD Transformation 简单用例（二）

Spark RDD Action操作

spark RDD transformation与action函数巩固 (未完)

Apache Spark RDD（Resilient Distributed Datasets）论文

Spark RDD使用详解1--RDD原理

Apache Spark 2.2.0 中文文档 - Spark RDD（Resilient Distributed Datasets）