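Spark's Distributed SQL Engine in Practice

I. Overview

Besides the interactive environment started with the spark-sql command, Spark SQL can also run distributed queries through JDBC/ODBC or a command-line interface. In this mode, end users and applications issue SQL queries to Spark SQL directly, without writing any Scala code.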

II. Using the Thrift JDBC server

Spark version: 1.4.0

YARN version: CDH 5.4.0

1. Preparation

Copy or symlink hive-site.xml into $SPARK_HOME/conf.
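The preparation step can be sketched as a small shell helper. This is only an illustration: the function name link_hive_site is made up here, and /etc/hive/conf in the example call is an assumed location for Hive's configuration, so adjust it to wherever hive-site.xml lives on your cluster.

```shell
# Hypothetical helper: symlink hive-site.xml into Spark's conf directory.
# $1 = directory that contains hive-site.xml, $2 = Spark conf directory.
link_hive_site() {
  src="$1/hive-site.xml"
  if [ ! -f "$src" ]; then
    echo "hive-site.xml not found in $1" >&2
    return 1
  fi
  # -s: create a symbolic link, -f: replace a stale link if one exists
  ln -sf "$src" "$2/hive-site.xml"
}

# Example call (the source path is an assumption, not from this article):
# link_hive_site /etc/hive/conf "$SPARK_HOME/conf"
```

A symlink, as opposed to a copy, keeps Spark in sync when the Hive configuration changes later.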

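2. Start the Hive Thrift server with the script in the Spark installation directory. With no arguments it starts in local mode and occupies a single JVM process on the local machine.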


sbin/start-thriftserver.sh

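3. Start in yarn-client mode. In this setup the server listens on port 10001 (the stock default is 10000; the port is controlled by the hive.server2.thrift.port property).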


sbin/start-thriftserver.sh --master yarn
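The listening port can also be set explicitly on the command line instead of relying on whatever hive-site.xml provides. The sketch below passes hive.server2.thrift.port through --hiveconf; the wrapper name start_thriftserver_on_port is hypothetical, and the call is left commented out because it needs a real Spark installation.

```shell
# Hypothetical wrapper: start the Thrift server on YARN with an explicit port.
# $1 = port to listen on. Must be run from $SPARK_HOME.
start_thriftserver_on_port() {
  sbin/start-thriftserver.sh \
    --master yarn \
    --hiveconf hive.server2.thrift.port="$1"
}

# Example: listen on 10001, the port used in this article
# start_thriftserver_on_port 10001
```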

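Next, look at the YARN web UI: 25 containers have been started.

Why does starting a single JDBC service take up so many resources? Because conf/spark-env.sh sets SPARK_EXECUTOR_INSTANCES to 24, and YARN allocates one more container for the application itself. In yarn-client mode the driver runs inside the Thrift server process on the client machine, so the 25th container is the YARN ApplicationMaster.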


export SPARK_EXECUTOR_INSTANCES=24
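The container count follows from simple arithmetic: one container per executor, plus one additional container that YARN allocates for the application (in yarn-client mode, the ApplicationMaster). The helper name expected_containers is made up for this sketch.

```shell
# Estimate how many YARN containers a yarn-client application occupies.
# $1 = number of executors; one more container hosts the ApplicationMaster.
expected_containers() {
  echo $(( $1 + 1 ))
}

expected_containers 24   # prints 25, matching the container count in the YARN UI
```

Lowering SPARK_EXECUTOR_INSTANCES shrinks the footprint of the Thrift server accordingly.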

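Looking at the processes on the YARN NodeManager nodes, the Thrift server keeps resident processes named org.apache.spark.executor.CoarseGrainedExecutorBackend, ready to launch tasks for subsequent SQL jobs. The benefit is that Spark SQL queries avoid the time cost of starting containers. The cost is that while the Thrift server is idle, these containers stay allocated and are not released to other Spark or MapReduce jobs.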
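One way to soften this trade-off is Spark's dynamic allocation, which lets idle executors be returned to YARN. The sketch below is an assumption-laden illustration, not something verified against Spark 1.4.0 on CDH 5.4.0: it presumes the external shuffle service is already configured on the NodeManagers, and SPARK_EXECUTOR_INSTANCES would most likely have to be unset, since a fixed executor count conflicts with dynamic allocation. The wrapper name start_thriftserver_dynamic and the min/max values are illustrative.

```shell
# Hypothetical wrapper: start the Thrift server with dynamic allocation so
# that idle executors can be handed back to YARN.
# $1 = minimum executors, $2 = maximum executors. Run from $SPARK_HOME.
start_thriftserver_dynamic() {
  sbin/start-thriftserver.sh \
    --master yarn \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.dynamicAllocation.minExecutors="$1" \
    --conf spark.dynamicAllocation.maxExecutors="$2"
}

# Example: keep at least 1 executor warm, grow to at most 24
# start_thriftserver_dynamic 1 24
```

With this setup the first query after an idle period pays the container start-up cost again, which is the latency the always-on executors were avoiding.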

4. Connect to the Spark SQL interactive engine with beeline


bin/beeline -u jdbc:hive2://localhost:10001 -n root -p root

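Note: in non-secure Hadoop mode, use the current system user as the user name; the password may be empty or any value. In Kerberos-secured Hadoop mode, a valid Kerberos principal is required to log in through beeline.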
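For the Kerberos case, the server's service principal is normally appended to the JDBC URL in the HiveServer2 style. The sketch below only builds such a URL; the helper name kerberos_jdbc_url, the user alice and the realm EXAMPLE.COM are placeholders, and the principal must match whatever the server is actually configured with.

```shell
# Hypothetical helper: build a HiveServer2-style JDBC URL that carries the
# server's Kerberos service principal.
# $1 = host, $2 = port, $3 = service principal of the server
kerberos_jdbc_url() {
  echo "jdbc:hive2://$1:$2/default;principal=$3"
}

# Typical flow on a secured cluster (all names are placeholders):
#   kinit alice@EXAMPLE.COM
#   bin/beeline -u "$(kerberos_jdbc_url localhost 10001 hive/_HOST@EXAMPLE.COM)"
kerberos_jdbc_url localhost 10001 "hive/_HOST@EXAMPLE.COM"
```

Quoting the URL matters because it contains a semicolon, which the shell would otherwise treat as a command separator.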

III. Command-line help

1. Thrift server


sbin/start-thriftserver.sh accepts all bin/spark-submit command-line options, plus a --hiveconf option for specifying Hive properties. Run sbin/start-thriftserver.sh --help for the complete list of options.

2. beeline


   -u <database url>               the JDBC URL to connect to
   -n <username>                   the username to connect as
   -p <password>                   the password to connect as
   -d <driver class>               the driver class to use
   -e <query>                      query that should be executed
   -f <file>                       script file that should be executed
   --hiveconf property=value       Use value for given property
   --hivevar name=value            hive variable name and value
                                   This is Hive specific settings in which variables
                                   can be set at session level and referenced in Hive
                                   commands or queries.
   --color=[true/false]            control whether color is used for display
   --showHeader=[true/false]       show column names in query results
   --headerInterval=ROWS;          the interval between which heades are displayed
   --fastConnect=[true/false]      skip building table/column list for tab-completion
   --autoCommit=[true/false]       enable/disable automatic transaction commit
   --verbose=[true/false]          show verbose error messages and debug info
   --showWarnings=[true/false]     display connection warnings
   --showNestedErrs=[true/false]   display nested errors
   --numberFormat=[pattern]        format numbers using DecimalFormat pattern
   --force=[true/false]            continue running script even after errors
   --maxWidth=MAXWIDTH             the maximum width of the terminal
   --maxColumnWidth=MAXCOLWIDTH    the maximum width to use when displaying columns
   --silent=[true/false]           be more silent
   --autosave=[true/false]         automatically save preferences
   --outputformat=[table/vertical/csv/tsv]   format mode for result display
   --isolation=LEVEL               set the transaction isolation level
   --nullemptystring=[true/false]  set to true to get historic behavior of printing null as empty string
   --help                          display this message
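Several of the options listed above combine naturally for scripted, non-interactive use. The sketch below runs a single statement and prints the result as CSV, using only flags from the help text (-u, -n, -e, --silent, --outputformat). The wrapper name run_sql is made up, and the example call is commented out because it needs a running Thrift server.

```shell
# Hypothetical wrapper: run one SQL statement through beeline and print CSV.
# $1 = JDBC URL, $2 = user name, $3 = SQL statement. Run from $SPARK_HOME.
run_sql() {
  bin/beeline -u "$1" -n "$2" --silent=true --outputformat=csv -e "$3"
}

# Example against the server started earlier in this article:
# run_sql jdbc:hive2://localhost:10001 root "show tables"
```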

Date: 2024-12-13 05:27:27
