spark 体验点滴-client 与 cluster 部署

一. 部署模式原理

When run SparkSubmit --class [mainClass], SparkSubmit will call a childMainClass which is

1. client mode, childMainClass = mainClass

2. standalone cluster mde, childMainClass = org.apache.spark.deploy.Client

3. yarn cluster mode, childMainClass = org.apache.spark.deploy.yarn.Client

The childMainClass is a wrapper of mainClass. The childMainClass will be called in SparkSubmit, and if cluster mode, the childMainClass will talk to the the cluster and launch a process on one woker to run the mainClass.

ps. use "spark-submit -v" to print debug infos.

Yarn client: spark-submit -v --class "org.apache.spark.examples.JavaWordCount" --master yarn JavaWordCount.jar

childMainclass: org.apache.spark.examples.JavaWordCount

Yarn cluster: spark-submit -v --class "org.apache.spark.examples.JavaWordCount" --master yarn-cluster JavaWordCount.jar

childMainclass: org.apache.spark.deploy.yarn.Client

Standalone client: spark-submit -v --class "org.apache.spark.examples.JavaWordCount" --master spark://aa01:7077 JavaWordCount.jar

childMainclass: org.apache.spark.examples.JavaWordCount

Stanalone cluster: spark-submit -v --class "org.apache.spark.examples.JavaWordCount" --master spark://aa01:7077 --deploy-mode cluster JavaWordCount.jar

childMainclass: org.apache.spark.deploy.rest.RestSubmissionClient (if rest, else org.apache.spark.deploy.Client)

Taking standalone spark as example, here is the client mode workflow. The mainclass run in the driver application which could be reside out of the cluster.

On cluster mode showed as below, SparkSubmit will register driver in the cluster, and a driver process launched in one work running the main class.

There are also two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

Cluster deploy mode is not applicable to Spark shells.

二. 部署注意事项

Bundling Your Application’s Dependencies

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, to create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar.

For Python, you can use the --py-files argument of spark-submit to add .py.zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.

备注:

1.必须将项目打包成assembly jars 的形式

2.可以使用maven assembly插件进行打包

3.如果是使用idea开发,可以使用idea的buid方式进行打包

Launching Applications with spark-submit

Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports:

./bin/spark-submit   --class <main-class>
  --master <master-url>   --deploy-mode <deploy-mode>   --conf <key>=<value>   ... # other options
  <application-jar>   [application-arguments]

Some of the commonly used options are:

  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  • --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
  • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)*
  • --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).
  • application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
  • application-arguments: Arguments passed to the main method of your main class, if any

*A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the client spark-submit process, with the input and output of the application attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).

Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to usecluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or python applications.

备注:

1.cluster 模式是支持程序自动重启的.

2.重要的事说三遍,重要的事说三遍,重要的事说三遍: cluster 模式不支持standalone clusters, Mesos clusters, or python applications模式.

三.关于checkpoint

1. spark streaming 的checkpoint 有点坑,如果程序有升级,代码结构有变化,重新部署的时候需要删除checkpoint文件夹,不然会报错。但是删除了checkpoint 文件夹,程序里的rdd状态会丢失。

2.spark streaming checkpoint的升级方案是使用structed streaming 的checkpoint ,structed streaming 的checkpoint 支持程序升级。

时间: 2024-11-01 18:47:04

spark 体验点滴-client 与 cluster 部署的相关文章

Apache Spark技术实战之6 --Standalone部署模式下的临时文件清理

问题导读 1.在Standalone部署模式下,Spark运行过程中会创建哪些临时性目录及文件? 2.在Standalone部署模式下分为几种模式? 3.在client模式和cluster模式下有什么不同? 概要 在Standalone部署模式下,Spark运行过程中会创建哪些临时性目录及文件,这些临时目录和文件又是在什么时候被清理,本文将就这些问题做深入细致的解答. 从资源使用的方面来看,一个进程运行期间会利用到这四个方面的资源,分别是CPU,内存,磁盘和网络.进程退出之后,CPU,内存和网络

关于org.apache.spark.deploy.yarn.Client类

目录 前言 主要方法 submitApplication createApplicationSubmissionContext getApplicationReport getClientToken verifyClusterResources copyFileToRemote prepareLocalResources distribute setupLaunchEnv createContainerLaunchContext monitorApplication run @(关于org.ap

MariaDB Galera Cluster 部署(如何快速部署 MariaDB 集群)

MariaDB Galera Cluster 部署(如何快速部署 MariaDB 集群)  OneAPM蓝海讯通7月3日 发布 推荐 4 推荐 收藏 14 收藏,1.1k 浏览 MariaDB 作为 Mysql 的一个分支,在开源项目中已经广泛使用,例如大热的 openstack,所以,为了保证服务的高可用性,同时提高系统的负载能力,集群部署是必不可少的. MariaDB Galera Cluster 介绍 MariaDB 集群是 MariaDB 同步多主机集群.它仅支持 XtraDB/ Inn

Spark cluster 部署

Spark 框架 Spark与Storm的对比对于Storm来说:1.建议在那种需要纯实时,不能忍受1秒以上延迟的场景下使用,比如实时金融系统,要求纯实时进行金融交易和分析2.此外,如果对于实时计算的功能中,要求可靠的事务机制和可靠性机制,即数据的处理完全精准,一条也不能多,一条也不能少,也可以考虑使用Storm3.如果还需要针对高峰低峰时间段,动态调整实时计算程序的并行度,以最大限度利用集群资源(通常是在小型公司,集群资源紧张的情况),也可以考虑用Storm4.如果一个大数据应用系统,它就是纯

MySQL分片高可用集群之MySQL Cluster部署使用

MySQL Cluster 是MySQL官方出品的分布式数据库解决方案,使用的数据库引擎为NDB,跟单机下的MyISAM和Innodb引擎有所不同,操作界面之一就是MySQL,此外提供原生API,可以节省资源并加快执行速度.该方案比业界其他MySQL集群方案在数据量大时有更大优势,开发者使用上跟单库操作几乎无差异,原先使用MySQL的话几乎可以无缝迁移,就可以享受集群带来的力量.当然也有个明显的缺点:内存开销非常大,如果要选择该方案,需要足够的硬件内存资源.下面我们详细地讲述MySQL Clus

Spark、Shark集群安装部署及遇到的问题解决

1.部署环境 OS:Red Hat Enterprise Linux Server release 6.4 (Santiago) Hadoop:Hadoop 2.4.1 Hive:0.11.0 JDK:1.7.0_60 Python:2.6.6(spark集群需要python2.6以上,否则无法在spark集群上运行py) Spark:0.9.1(最新版是1.0.2) Shark:0.9.1(目前最新的版本,但是只能够兼容到spark-0.9.1,见shark 0.9.1 release) Zo

Spark源码分析:多种部署方式之间的区别与联系(1)

从官方的文档我们可以知道,Spark的部署方式有很多种:local.Standalone.Mesos.YARN.....不同部署方式的后台处理进程是不一样的,但是如果我们从代码的角度来看,其实流程都差不多. 从代码中,我们可以得知其实Spark的部署方式其实比官方文档中介绍的还要多,这里我来列举一下: 1.local:这种方式是在本地启动一个线程来运行作业: 2.local[N]:也是本地模式,但是启动了N个线程: 3.local[*]:还是本地模式,但是用了系统中所有的核: 4.local[N

Apache Spark探秘:三种分布式部署方式比较

目前Apache Spark支持三种分布式部署方式,分别是standalone.spark on mesos和 spark on YARN,其中,第一种类似于MapReduce 1.0所采用的模式,内部实现了容错性和资源管理,后两种则是未来发展的趋势,部分容错性和资源管理交由统一的资源管理系统完成:让Spark运行在一个通用的资源管理系统之上,这样可以与其他计算框架,比如MapReduce,公用一个集群资源,最大的好处是降低运维成本和提高资源利用率(资源按需分配).本文将介绍这三种部署方式,并比

MariaDB Galera Cluster 部署

原文  http://code.oneapm.com/database/2015/07/02/mariadb-galera-cluster/MariaDB作为Mysql的一个分支,在开源项目中已经广泛使用,例如大热的openstack,所以,为了保证服务的高可用性,同时提高系统的负载能力,集群部署是必不可少的. MariaDB Galera Cluster 介绍 MariaDB集群是MariaDB同步多主机集群.它仅支持XtraDB/ InnoDB存储引擎(虽然有对MyISAM实验支持 - 看w