Checkpoint彻底解密：Checkpoint的运行原理和源码实现彻底详解(DT大数据梦工厂)

内容：

1、Checkpoint重大价值；

2、Checkpoint运行原理图；

3、Checkpoint源码解析；

机器学习、图计算稍微复杂迭代算法的时候都有Checkpoint的身影，作用不亚于persist

==========Checkpoint到底是什么============

1、Spark 在生产环境下经常会面临transformation的RDD非常多（例如一个Job中包含1万个RDD）或者具体transformation的RDD本身计算特别复杂或者耗时（例如计算时长超过1个小时），这个时候就要考虑对计算结果数据的持久化；

2、Spark是擅长多步骤迭代的，同时擅长基于Job的复用，这个时候如果能够对曾经计算的过程产生的数据进行复用，就可以极大的提升效率；

3、如果采用persist把数据放在内存中，虽然是快速的，但是也是最不可靠的；如果把数据放在磁盘上，也不是完全可靠的！例如磁盘会损坏，系统管理员可能清空磁盘。

4、Checkpoint的产生就是为了相对而言更加可靠的持久化数据，在Checkpoint的时候可以指定把数据放在本地，并且是多副本的方式，但是在生产环境下是放在HDFS上，这就天然的借助了HDFS高容错、高可靠的特征来完成了最大化的可靠的持久化数据的方式；

5、Checkpoint是为了最大程度保证绝对可靠的复用RDD计算数据的Spark高级功能，通过checkpoint我们通常把数据持久化到HDFS来保证数据最大程度的安全性；

6、Checkpoint就是针对整个RDD计算链条中特别需要数据持久化的环节（后面会反复使用当前环节的RDD）开始基于HDFS等的数据持久化复用策略，通过对RDD启动checkpoint机制来实现容错和高可用；

加入进行一个1万个步骤，在9000个步骤的时候persist，数据还是有可能丢失的，但是如果checkpoint，数据丢失的概率几乎为0。

==========Checkpoint原理机制============

1、在SparkContext中设置进行Checkpoint操作的RDD，把数据放在哪里，在生产集群中运行，必须是HDFS的路径，同时为了提高效率，在进行Checkpoint的时候，可以指定很多目录；

/**
* Set the directory under which RDDs are going to be checkpointed. The directory must
* be a HDFS path if running on a cluster.
*/
def setCheckpointDir(directory: String) {

// If we are running on a cluster, log a warning if the directory is local.
// Otherwise, the driver may attempt to reconstruct the checkpointed RDD from
// its own local file system, which is incorrect because the checkpoint files
// are actually on the executor machines.
if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
logWarning("Checkpoint directory must be non-local " +
"if Spark is running on a cluster: " + directory)
}

checkpointDir = Option(directory).map { dir =>
val path = new Path(dir, UUID.randomUUID().toString)
val fs = path.getFileSystem(hadoopConfiguration)
fs.mkdirs(path)
fs.getFileStatus(path).getPath.toString
}
}

2、RDD中的checkpoint，在进行这个操作的时候，其所依赖的所有的RDD都会从计算联调中清空掉，保存在之前设置的路径下，且所有parent级别的RDD都会被清空掉；

/**
* Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
* directory set with `SparkContext#setCheckpointDir` and all references to its parent
* RDDs will be removed. This function must be called before any job has been
* executed on this RDD. It is strongly recommended that this RDD is persisted in
* memory, otherwise saving it on a file will require recomputation.
*/
def checkpoint(): Unit = RDDCheckpointData.synchronized {
// NOTE: we use a global lock here due to complexities downstream with ensuring
// children RDD partitions point to the correct parent partitions. In the future
// we should revisit this consideration.
if (context.checkpointDir.isEmpty) {
throw new SparkException("Checkpoint directory has not been set in the SparkContext")
} else if (checkpointData.isEmpty) {
checkpointData = Some(new ReliableRDDCheckpointData(this))
}
}

3、作为最佳实践，一般在进行checkpoint方法，调用前通常都要进行persist来把当前RDD的数据持久化到内存或者磁盘上（上面代码的注释说只会存在内存中说法不准确），因为checkpoint是lazy级别的，必须要有Job的执行，且在Job执行完成之后才会从后往前回溯哪个RDD进行了checkpoint标记，然后对该标记了要进行Checkpoint的RDD新启动一个Job执行具体的Checkpoint的过程；

4、因为会从计算链条清空，所以Checkpoint改变了RDD的Lineage（血统关系）；

5、当我们调用了checkpoint对RDD进行checkpoint操作的话，此时框架会自动生成RDDCheckpointData，当RDD上运行过一个Job后，就会立即触发RDDCheckpoint中的checkpoint方法，在其内部会调用doCheckpoint，实际上在生产环境下会调用ReliableRDDCheckpointData的doCheckpoint，在生产环境下会导致ReliableCheckpointRDD的writeRDDToCheckpointDirectory的调用，而在writeRDDToCheckpointDirectory方法内部会触发runJob来执行把当前的RDD中的数据写到checkpoint的目录中，同时会产生ReliableCheckpointRDD实例返回；

/**
* This class contains all the information related to RDD checkpointing. Each instance of this
* class is associated with a RDD. It manages process of checkpointing of the associated RDD,
* as well as, manages the post-checkpoint state by providing the updated partitions,
* iterator and preferred locations of the checkpointed RDD.
*/
private[spark] abstract class RDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
extends Serializable {

import CheckpointState._

// The checkpoint state of the associated RDD.
protected var cpState = Initialized

// The RDD that contains our checkpointed data
private var cpRDD: Option[CheckpointRDD[T]] = None

// TODO: are we sure we need to use a global lock in the following methods?

/**
* Return whether the checkpoint data for this RDD is already persisted.
*/
def isCheckpointed: Boolean = RDDCheckpointData.synchronized {
cpState == Checkpointed
}

/**
* Materialize this RDD and persist its content.
* This is called immediately after the first action invoked on this RDD has completed.
*/
final def checkpoint(): Unit = {
// Guard against multiple threads checkpointing the same RDD by
// atomically flipping the state of this RDDCheckpointData
RDDCheckpointData.synchronized {
if (cpState == Initialized) {
cpState = CheckpointingInProgress
} else {
return
}
}

val newRDD = doCheckpoint()

// Update our state and truncate the RDD lineage
RDDCheckpointData.synchronized {
cpRDD = Some(newRDD)
cpState = Checkpointed
rdd.markCheckpointed()
}
}

/**
* Materialize this RDD and write its content to a reliable DFS.
* This is called immediately after the first action invoked on this RDD has completed.
*/
protected override def doCheckpoint(): CheckpointRDD[T] = {
val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

// Optionally clean our checkpoint files if the reference is out of scope
if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
rdd.context.cleaner.foreach { cleaner =>
cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
}
}

logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
newRDD
}

/**
* Write RDD to checkpoint files and return a ReliableCheckpointRDD representing the RDD.
*/
def writeRDDToCheckpointDirectory[T: ClassTag](
originalRDD: RDD[T],
checkpointDir: String,
blockSize: Int = -1): ReliableCheckpointRDD[T] = {

val sc = originalRDD.sparkContext

// Create the output path for the checkpoint
val checkpointDirPath = new Path(checkpointDir)
val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
if (!fs.mkdirs(checkpointDirPath)) {
throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
}

// Save to file, and reload it as an RDD
val broadcastedConf = sc.broadcast(
new SerializableConfiguration(sc.hadoopConfiguration))
// TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
sc.runJob(originalRDD,
writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

if (originalRDD.partitioner.nonEmpty) {
writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
}

val newRDD = new ReliableCheckpointRDD[T](
sc, checkpointDirPath.toString, originalRDD.partitioner)
if (newRDD.partitions.length != originalRDD.partitions.length) {
throw new SparkException(
s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
}
newRDD
}

王家林老师名片：

中国Spark第一人

新浪微博：http://weibo.com/ilovepains

微信公众号：DT_Spark

博客：http://blog.sina.com.cn/ilovepains

手机：18610086859

QQ：1740415547

邮箱：[email protected]

时间： 2024-10-06 15:54:57

Checkpoint彻底解密：Checkpoint的运行原理和源码实现彻底详解(DT大数据梦工厂)

Checkpoint彻底解密：Checkpoint的运行原理和源码实现彻底详解(DT大数据梦工厂)的相关文章

CacheManager彻底解密：CacheManager运行原理流程图和源码详解(DT大数据梦工厂)

[Spark內核] 第41课：Checkpoint彻底解密：Checkpoint的运行原理和源码实现彻底详解

DT大数据梦工厂第三十五课 Spark系统运行循环流程

Spark内核架构解密(DT大数据梦工厂)

SparkRDD解密(DT大数据梦工厂)

Spark运行原理和RDD解析(DT大数据梦工厂)

HA下Spark集群工作原理(DT大数据梦工厂)

Spark on Yarn彻底解密(DT大数据梦工厂)

Master HA彻底解密(DT大数据梦工厂)