Spark Mastery (Advanced) — Spark from Beginner to Expert: Section 9, Spark SQL Execution Flow Explained

1. Overall execution flow

The following code is used to analyze the Spark SQL execution flow, so that you can see the different states a LogicalPlan goes through and understand how Spark SQL executes a query end to end.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

(1) Inspect the schema of teenagers

scala> teenagers.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)

(2) Inspect the execution flow

scala> teenagers.queryExecution
res3: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('name),unresolvedalias('age)]
 'Filter (('age >= 13) && ('age <= 19))
  'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 Filter ((age#1 >= 13) && (age#1 <= 19))
  Subquery people
   LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 Scan PhysicalRDD[name#0,age#1]

Code Generation: true

QueryExecution captures the entire Spark SQL execution flow. As the output above shows, a SQL statement has to go through the following stages before it can be executed (a short sketch for inspecting each stage individually follows the listing):

== (1)Parsed Logical Plan ==
'Project [unresolvedalias('name),unresolvedalias('age)]
 'Filter (('age >= 13) && ('age <= 19))
  'UnresolvedRelation [people], None

== (2)Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 Filter ((age#1 >= 13) && (age#1 <= 19))
  Subquery people
   LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== (3)Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== (4)Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 Scan PhysicalRDD[name#0,age#1]

//Dynamic bytecode generation (code generation, CG) is enabled to speed up query evaluation
Code Generation: true
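Each of the stages above can also be inspected on its own. Below is a minimal sketch (assuming the teenagers DataFrame from above is still in scope); in Spark 1.5 the QueryExecution object exposes one field per stage:

// Each field corresponds to one stage of the flow listed above.
val qe = teenagers.queryExecution
println(qe.logical)       // parsed (unresolved) logical plan
println(qe.analyzed)      // analyzed logical plan: attributes and relations resolved
println(qe.optimizedPlan) // optimized logical plan produced by the Catalyst optimizer
println(qe.sparkPlan)     // selected physical plan
println(qe.executedPlan)  // physical plan prepared for execution (exchanges inserted, etc.)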

2. Full-table query execution flow

Statement executed:

val all= sqlContext.sql("SELECT * FROM people")

Execution flow:

scala> all.queryExecution
res9: org.apache.spark.sql.SQLContext#QueryExecution =
//Note that * is parsed as unresolvedalias(*)
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
//unresolvedalias(*) is analyzed into all the fields of the schema
//UnresolvedRelation [people] is analyzed into Subquery people
name: string, age: int
Project [name#0,age#1]
 Subquery people
  LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Optimized Logical Plan ==
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Physical Plan ==
Scan PhysicalRDD[name#0,age#1]

Code Generation: true
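Instead of reading queryExecution directly, the same plans can be printed through DataFrame.explain. A minimal sketch (assuming the all DataFrame defined above):

// explain(true) prints the parsed, analyzed, optimized and physical plans;
// explain() prints only the physical plan.
all.explain(true)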

3. Filter query execution flow

Statement executed:

scala> val filterQuery= sqlContext.sql("SELECT * FROM people WHERE age >= 13 AND age <= 19")
filterQuery: org.apache.spark.sql.DataFrame = [name: string, age: int]

Execution flow:

scala> filterQuery.queryExecution
res0: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'Filter (('age >= 13) && ('age <= 19))
  'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 //Note the additional Filter node (the same holds for the plans below)
 Filter ((age#1 >= 13) && (age#1 <= 19))
  Subquery people
   LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:20

== Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:20

== Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 Scan PhysicalRDD[name#0,age#1]

Code Generation: true
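The same plan can be produced without writing SQL at all. A minimal sketch using the DataFrame API (assuming the people DataFrame and the sqlContext.implicits._ import from section 1):

// Equivalent to "SELECT * FROM people WHERE age >= 13 AND age <= 19";
// the Column-based filter goes through the same Catalyst pipeline.
val teenDF = people.filter($"age" >= 13 && $"age" <= 19)
println(teenDF.queryExecution)  // should print a very similar Filter + Scan plan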

4. Join query execution flow

Statement executed:

val joinQuery= sqlContext.sql("SELECT * FROM people a, people b where a.age=b.age")

Inspect the overall execution flow:

scala> joinQuery.queryExecution
res0: org.apache.spark.sql.SQLContext#QueryExecution =
//Note the Filter node
//and the Join Inner node
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'Filter ('a.age = 'b.age)
  'Join Inner, None
   'UnresolvedRelation [people], Some(a)
   'UnresolvedRelation [people], Some(b)

== Analyzed Logical Plan ==
name: string, age: int, name: string, age: int
Project [name#0,age#1,name#2,age#3]
 Filter (age#1 = age#3)
  Join Inner, None
   Subquery a
    Subquery people
     LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22
   Subquery b
    Subquery people
     LogicalRDD [name#2,age#3], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Optimized Logical Plan ==
Project [name#0,age#1,name#2,age#3]
 Join Inner, Some((age#1 = age#3))
  LogicalRDD [name#0,age#1], MapPartitionsRDD[4]...

//Inspect its physical plan
scala> joinQuery.queryExecution.sparkPlan
res16: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [name#0,age#1,name#2,age#3]
 SortMergeJoin [age#1], [age#3]
  Scan PhysicalRDD[name#0,age#1]
  Scan PhysicalRDD[name#2,age#3]
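Note that sparkPlan is the physical plan before the execution-preparation rules run. To see the fully prepared plan, which should additionally contain the shuffle and sort operators that SortMergeJoin needs, a minimal sketch (assuming the joinQuery DataFrame above):

// executedPlan = sparkPlan with the preparation rules applied
// (e.g. inserting the exchange/sort operators required by SortMergeJoin).
println(joinQuery.queryExecution.executedPlan)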

The previous example is equivalent to the one below; only the way the join is written and run differs slightly. Statement executed:

scala> val innerQuery= sqlContext.sql("SELECT * FROM people a inner join people b on a.age=b.age")
innerQuery: org.apache.spark.sql.DataFrame = [name: string, age: int, name: string, age: int]

Inspect the overall execution flow:

scala> innerQuery.queryExecution
res2: org.apache.spark.sql.SQLContext#QueryExecution =
//Note the Join Inner node
//and that there is no Filter this time
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'Join Inner, Some(('a.age = 'b.age))
  'UnresolvedRelation [people], Some(a)
  'UnresolvedRelation [people], Some(b)

== Analyzed Logical Plan ==
name: string, age: int, name: string, age: int
Project [name#0,age#1,name#4,age#5]
 Join Inner, Some((age#1 = age#5))
  Subquery a
   Subquery people
    LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22
  Subquery b
   Subquery people
    LogicalRDD [name#4,age#5], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

//Note that the Optimized Logical Plan and the Analyzed Logical Plan
//are essentially the same; no special optimization happens here. This is
//highlighted for comparison with the subquery example later, where the
//Analyzed and Optimized plans do differ.
== Optimized Logical Plan ==
Project [name#0,age#1,name#4,age#5]
 Join Inner, Some((age#1 = age#5))
  LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder ...

//Inspect its physical plan
scala> innerQuery.queryExecution.sparkPlan
res14: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [name#0,age#1,name#6,age#7]
 SortMergeJoin [age#1], [age#7]
  Scan PhysicalRDD[name#0,age#1]
  Scan PhysicalRDD[name#6,age#7]
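The same inner join can also be expressed with the DataFrame API. A minimal sketch (assuming the people DataFrame from section 1):

// Self-join on age; as("a") / as("b") play the role of the SQL table aliases.
val dfJoin = people.as("a").join(people.as("b"), $"a.age" === $"b.age")
println(dfJoin.queryExecution.sparkPlan)  // should again select a SortMergeJoin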

5. Subquery execution flow

Statement executed:

scala> val subQuery=sqlContext.sql("SELECT * FROM (SELECT * FROM people WHERE age >= 13)a where a.age <= 19")
subQuery: org.apache.spark.sql.DataFrame = [name: string, age: int]

Inspect the overall execution flow:


scala> subQuery.queryExecution
res4: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'Filter ('a.age <= 19)
  'Subquery a
   'Project [unresolvedalias(*)]
    'Filter ('age >= 13)
     'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 Filter (age#1 <= 19)
  Subquery a
   Project [name#0,age#1]
    Filter (age#1 >= 13)
     Subquery people
      LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

//Note the difference between the Optimized and the Analyzed plan here:
//the two Filters have been merged into a single, optimized Filter
== Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 Scan PhysicalRDD[name#0,age#1]

Code Generation: true
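The same filter merging can be observed with the DataFrame API: two chained filters end up as a single Filter node after optimization. A minimal sketch (assuming the people DataFrame from section 1):

// Catalyst's optimizer merges the two adjacent Filter nodes into one conjunctive predicate.
val chained = people.filter($"age" >= 13).filter($"age" <= 19)
println(chained.queryExecution.analyzed)       // two separate Filter nodes
println(chained.queryExecution.optimizedPlan)  // a single Filter ((age >= 13) && (age <= 19))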

6. Aggregate SQL execution flow

Statement executed:

scala> val aggregateQuery=sqlContext.sql("SELECT a.name,sum(a.age) FROM (SELECT * FROM people WHERE age >= 13)a where a.age <= 19 group by a.name")
aggregateQuery: org.apache.spark.sql.DataFrame = [name: string, _c1: bigint]

Inspect the execution flow:


scala> aggregateQuery.queryExecution
res6: org.apache.spark.sql.SQLContext#QueryExecution =
//Note 'Aggregate ['a.name], [unresolvedalias('a.name),unresolvedalias('sum('a.age))],
//i.e. group by a.name is parsed into unresolvedalias('a.name)
== Parsed Logical Plan ==
'Aggregate ['a.name], [unresolvedalias('a.name),unresolvedalias('sum('a.age))]
 'Filter ('a.age <= 19)
  'Subquery a
   'Project [unresolvedalias(*)]
    'Filter ('age >= 13)
     'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
name: string, _c1: bigint
Aggregate [name#0], [name#0,sum(cast(age#1 as bigint)) AS _c1#9L]
 Filter (age#1 <= 19)
  Subquery a
   Project [name#0,age#1]
    Filter (age#1 >= 13)
     Subquery people
      LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Optimized Logical Plan ==
Aggregate [name#0], [name#0,sum(cast(age#1 as bigint)) AS _c1#9L]
 Filter ((age#1 >= 13) && (age#1 <= 19))
  LogicalRDD [name#0,age#1], MapPartitions...

//Inspect its physical plan
scala> aggregateQuery.queryExecution.sparkPlan
res10: org.apache.spark.sql.execution.SparkPlan =
TungstenAggregate(key=[name#0], functions=[(sum(cast(age#1 as bigint)),mode=Final,isDistinct=false)], output=[name#0,_c1#14L])
 TungstenAggregate(key=[name#0], functions=[(sum(cast(age#1 as bigint)),mode=Partial,isDistinct=false)], output=[name#0,currentSum#17L])
  Filter ((age#1 >= 13) && (age#1 <= 19))
   Scan PhysicalRDD[name#0,age#1]
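The aggregate query also has a DataFrame-API counterpart. A minimal sketch (assuming the people DataFrame from section 1; sum comes from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.sum

// Equivalent of the SQL above: filter on age, group by name, sum the ages.
val agg = people.filter($"age" >= 13 && $"age" <= 19).groupBy("name").agg(sum("age"))
println(agg.queryExecution.sparkPlan)  // should show the same two-stage TungstenAggregate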

You can inspect the execution flow of other SQL statements in the same way to get a grasp of the basic ideas behind Spark SQL's implementation.
