class SparkContext extends Logging with ExecutorAllocationClient
Main entry point for Spark functionality.
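As a minimal sketch of how this entry point is usually constructed (the local master URL and app name below are illustrative, not prescribed by the docs):

import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch: build a SparkContext against a local master.
// The master URL and app name are illustrative choices.
val conf = new SparkConf().setMaster("local[*]").setAppName("docs-example")
val sc = new SparkContext(conf)

The sc created here is assumed by the later examples in this section.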
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
Distribute a local Scala collection to form an RDD.
- Note: Parallelize acts lazily. If seq is a mutable collection and is altered after the call to parallelize and before the first action on the RDD, the resultant RDD will reflect the modified collection, and the result may be unpredictable. Pass a copy of the argument to avoid this.
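A minimal sketch of the caveat in the note, assuming the SparkContext sc from above; the collect result in the comment follows directly from the note's description:

import scala.collection.mutable.ArrayBuffer

// Parallelize is lazy, so mutations before the first action are visible.
val data = ArrayBuffer(1, 2, 3)
val rdd = sc.parallelize(data) // no copy of the collection is taken here
data += 4 // mutate before the first action...
rdd.collect() // ...and, per the note, the RDD reflects it: Array(1, 2, 3, 4)

// Passing a copy decouples the RDD from later mutations:
val safe = sc.parallelize(data.toList)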
Python: checkpoint(self)
Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir() and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
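The signature above is the Python (PySpark) form; the Scala RDD API exposes an equivalent checkpoint() method, so here is a minimal Scala sketch of the flow just described (the checkpoint directory is illustrative):

// Checkpointing an RDD, assuming the SparkContext sc from above.
sc.setCheckpointDir("/tmp/spark-checkpoints") // illustrative local path

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.persist() // recommended above: otherwise the checkpoint write recomputes the lineage
rdd.checkpoint() // must be called before any job has run on this RDD
rdd.count() // the first action runs a job and writes the checkpoint files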
Scala:
def setCheckpointDir(directory: String): Unit
Set the directory under which RDDs are going to be checkpointed. The directory must be an HDFS path if running on a cluster.
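On a cluster the same call takes a path visible to every executor; a brief sketch with a hypothetical HDFS URI:

// The namenode host, port, and path below are hypothetical examples.
sc.setCheckpointDir("hdfs://namenode:8020/user/spark/checkpoints")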