Spark SQL: Accessing the Hive Warehouse with Beeline

1. Add hive-site.xml

Add a hive-site.xml configuration file under $SPARK_HOME/conf so that Spark SQL can access the Hive metastore:

vim hive-site.xml
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.1.201:3306/hiveDB?createDatabaseIfNotExist=true</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
    </property>

    <!-- Print column headers in query output -->
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>

    <!-- Show the current database name in the prompt -->
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
</configuration>

Note: Hive itself does not need to be deployed on this node; all that matters is that the node can reach the Hive metastore.
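
If Hive is already installed elsewhere in the cluster, the quickest way to get a working file is to copy it from that installation. A sketch (the Hive path below is an assumption based on this environment; adjust it to yours):

[hadoop@hadoop003 spark]$ cp /home/hadoop/app/hive/conf/hive-site.xml $SPARK_HOME/conf/
# The MySQL JDBC driver jar must also be visible to Spark; here it is passed later via --jars.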

2. Start the Thrift server

[hadoop@hadoop003 spark]$ ./sbin/start-thriftserver.sh --jars ~/softwares/mysql-connector-java-5.1.47.jar
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2,
logging to /home/hadoop/app/spark/logs/spark-hadoop-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hadoop003.out
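
start-thriftserver.sh accepts the same options as spark-submit, and HiveServer2 properties can be overridden with --hiveconf. A sketch (the port and bind host below are only illustrative):

[hadoop@hadoop003 spark]$ ./sbin/start-thriftserver.sh \
    --jars ~/softwares/mysql-connector-java-5.1.47.jar \
    --hiveconf hive.server2.thrift.port=10001 \
    --hiveconf hive.server2.thrift.bind.host=hadoop003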

Check the log to confirm the Thrift server started correctly:

[hadoop@hadoop003 spark]$ tail -50f /home/hadoop/app/spark/logs/spark-hadoop-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hadoop003.out

19/05/21 09:39:14 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
19/05/21 09:39:15 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
19/05/21 09:39:15 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
19/05/21 09:39:15 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
19/05/21 09:39:15 INFO metastore.ObjectStore: Initialized ObjectStore
19/05/21 09:39:15 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
19/05/21 09:39:15 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
19/05/21 09:39:15 INFO metastore.HiveMetaStore: Added admin role in metastore
19/05/21 09:39:15 INFO metastore.HiveMetaStore: Added public role in metastore
19/05/21 09:39:15 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
19/05/21 09:39:15 INFO metastore.HiveMetaStore: 0: get_all_databases
19/05/21 09:39:15 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=get_all_databases
19/05/21 09:39:15 INFO metastore.HiveMetaStore: 0: get_functions: db=default pat=*
19/05/21 09:39:15 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=get_functions: db=default pat=*
19/05/21 09:39:15 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
19/05/21 09:39:16 INFO session.SessionState: Created local directory: /tmp/73df82dd-1fd3-4dd5-97f1-680d53bd44bc_resources
19/05/21 09:39:16 INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/73df82dd-1fd3-4dd5-97f1-680d53bd44bc
19/05/21 09:39:16 INFO session.SessionState: Created local directory: /tmp/hadoop/73df82dd-1fd3-4dd5-97f1-680d53bd44bc
19/05/21 09:39:16 INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/73df82dd-1fd3-4dd5-97f1-680d53bd44bc/_tmp_space.db
19/05/21 09:39:16 INFO client.HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is file:/home/hadoop/app/spark-2.4.2-bin-hadoop-2.6.0-cdh5.7.0/spark-warehouse
19/05/21 09:39:16 INFO session.SessionManager: Operation log root directory is created: /tmp/hadoop/operation_logs
19/05/21 09:39:16 INFO session.SessionManager: HiveServer2: Background operation thread pool size: 100
19/05/21 09:39:16 INFO session.SessionManager: HiveServer2: Background operation thread wait queue size: 100
19/05/21 09:39:16 INFO session.SessionManager: HiveServer2: Background operation thread keepalive time: 10 seconds
19/05/21 09:39:16 INFO service.AbstractService: Service:OperationManager is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service:SessionManager is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service: CLIService is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service:ThriftBinaryCLIService is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service: HiveServer2 is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service:OperationManager is started.
19/05/21 09:39:16 INFO service.AbstractService: Service:SessionManager is started.
19/05/21 09:39:16 INFO service.AbstractService: Service:CLIService is started.
19/05/21 09:39:16 INFO metastore.ObjectStore: ObjectStore, initialize called
19/05/21 09:39:16 INFO DataNucleus.Query: Reading in results for query "[email protected]" since the connection used is closing
19/05/21 09:39:16 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
19/05/21 09:39:16 INFO metastore.ObjectStore: Initialized ObjectStore
19/05/21 09:39:16 INFO metastore.HiveMetaStore: 0: get_databases: default
19/05/21 09:39:16 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=get_databases: default
19/05/21 09:39:16 INFO metastore.HiveMetaStore: 0: Shutting down the object store...
19/05/21 09:39:16 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=Shutting down the object store...
19/05/21 09:39:16 INFO metastore.HiveMetaStore: 0: Metastore shutdown complete.
19/05/21 09:39:16 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=Metastore shutdown complete.
19/05/21 09:39:16 INFO service.AbstractService: Service:ThriftBinaryCLIService is started.
19/05/21 09:39:16 INFO service.AbstractService: Service:HiveServer2 is started.
19/05/21 09:39:16 INFO thriftserver.HiveThriftServer2: HiveThriftServer2 started
19/05/21 09:39:16 INFO handler.ContextHandler: Started [email protected]{/sqlserver,null,AVAILABLE,@Spark}
19/05/21 09:39:16 INFO handler.ContextHandler: Started [email protected]{/sqlserver/json,null,AVAILABLE,@Spark}
19/05/21 09:39:16 INFO handler.ContextHandler: Started [email protected]{/sqlserver/session,null,AVAILABLE,@Spark}
19/05/21 09:39:16 INFO handler.ContextHandler: Started [email protected]{/sqlserver/session/json,null,AVAILABLE,@Spark}
19/05/21 09:39:16 INFO thrift.ThriftCLIService:
Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads    # this line indicates the server started successfully
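
Besides reading the log, you can confirm that something is listening on port 10000 (a quick check; netstat options vary slightly between systems):

[hadoop@hadoop003 spark]$ netstat -nltp 2>/dev/null | grep 10000
# A LISTEN entry on port 10000 owned by a java process confirms the Thrift server is up.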

3. Start Beeline

[hadoop@hadoop003 spark]$ ./bin/beeline -u jdbc:hive2://localhost:10000 -n hadoop
Connecting to jdbc:hive2://localhost:10000
19/05/21 09:46:19 INFO jdbc.Utils: Supplied authorities: localhost:10000
19/05/21 09:46:19 INFO jdbc.Utils: Resolved authority: localhost:10000
19/05/21 09:46:19 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.4.2)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:10000> select * from student.student limit 5;
+---------+-----------+-----------------+--------------------------------------------+--+
| stu_id  | stu_name  |  stu_phone_num  |                 stu_email                  |
+---------+-----------+-----------------+--------------------------------------------+--+
| 1       | Burke     | 1-300-746-8446  | [email protected]  |
| 2       | Kamal     | 1-668-571-5046  | [email protected]          |
| 3       | Olga      | 1-956-311-1686  | [email protected]     |
| 4       | Belle     | 1-246-894-6340  | [email protected]              |
| 5       | Trevor    | 1-300-527-4967  | [email protected]             |
+---------+-----------+-----------------+--------------------------------------------+--+
5 rows selected (3.275 seconds)
0: jdbc:hive2://localhost:10000> 

Beeline connected successfully.
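
Beeline can also run statements non-interactively with -e (or a script file with -f), which is handy for quick checks. A sketch reusing the table queried above:

[hadoop@hadoop003 spark]$ ./bin/beeline -u jdbc:hive2://localhost:10000 -n hadoop \
    -e "select count(*) from student.student;"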

4. Notes

1. It is best to launch Beeline from the spark/bin directory.
If the machine where you launch Beeline also has Hive deployed, and Hive's bin directory happens to come before Spark's in your PATH, a bare beeline command will most likely start Hive's Beeline instead.
For example:

[hadoop@hadoop003 spark]$ beeline
ls: cannot access /home/hadoop/app/spark/lib/spark-assembly-*.jar: No such file or directory
which: no hbase in (/home/hadoop/app/hive/bin:/home/hadoop/app/spark/bin:/home/hadoop/app/hadoop-2.6.0-cdh5.7.0//bin:/home/hadoop/app/hadoop-2.6.0-cdh5.7.0//sbin:/home/hadoop/app/zookeeper/bin:/usr/java/jdk1.8.0_131/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
Beeline version 1.1.0-cdh5.7.0 by Apache Hive  # this is Hive's Beeline, not Spark's
beeline> 

At this point, check your environment variables:

[hadoop@hadoop003 spark]$ cat ~/.bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/bin

export PATH
#####JAVA_HOME#####
export JAVA_HOME=/usr/java/jdk1.8.0_131

####ZOOKEEPER_HOME####
export ZOOKEEPER_HOME=/home/hadoop/app/zookeeper

#####HADOOP_HOME######
export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/

export SPARK_HOME=/home/hadoop/app/spark

#####HIVE_HOME#####
export HIVE_HOME=/home/hadoop/app/hive
export PATH=$HIVE_HOME/bin:$SPARK_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$JAVA_HOME/bin:$PATH

Sure enough: without an explicit path, the shell resolves beeline to Hive's binary first.
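
Two easy workarounds (a sketch based on the .bash_profile above; adjust paths to your own environment): always call Beeline by its full path, or reorder PATH so that $SPARK_HOME/bin comes before $HIVE_HOME/bin.

# Option 1: be explicit about which beeline to run
[hadoop@hadoop003 spark]$ $SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -n hadoop

# Option 2: in ~/.bash_profile, put Spark's bin ahead of Hive's
export PATH=$SPARK_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$JAVA_HOME/bin:$PATH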

Original article: https://blog.51cto.com/14309075/2398256
