The Hive Table-Generating Function explode, Explained

Table-generating functions (UDTFs) in Hive take zero or more inputs and produce multiple columns or multiple rows of output.

1. The explode function

The explode function takes an array as input, iterates over the array's elements, and returns multiple rows: one row per array element.

The ARRAY function turns a list of input values into a single array.

hive (jimdb)> SELECT ARRAY(1,2,3) FROM dual;
OK
_c0
[1,2,3]
Time taken: 0.448 seconds, Fetched: 1 row(s)

SELECT explode(array(1,2,3)) AS element;

hive (jimdb)> SELECT explode(array(1,2,3)) AS element;
OK
element
1
2
3
Time taken: 0.327 seconds, Fetched: 3 row(s)
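
explode is not limited to arrays: given a map, it emits one row per key/value pair, with two output columns that must both be aliased. A minimal sketch (the map literal here is made up for illustration):

SELECT explode(map('a',1,'b',2)) AS (k, v);
-- k  v
-- a  1
-- b  2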

Create a test table:

CREATE TABLE udtf_test(name STRING,subordinates ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';
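
For reference, here is one way the table could have been populated; the file path and its contents are hypothetical, matching the tab/comma delimiters declared above:

-- Hypothetical file /tmp/udtf_test.txt, one row per line, e.g. jim<TAB>james,datacloase
LOAD DATA LOCAL INPATH '/tmp/udtf_test.txt' INTO TABLE udtf_test;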

hive (jimdb)> select * from udtf_test;
OK
udtf_test.name udtf_test.subordinates
jim5 ["james","datacloase"]
jim4 ["james","datacloase"]
jim3 ["james","datacloase"]
jim2 ["james","datacloase"]
jim ["james","datacloase"]
Time taken: 0.348 seconds, Fetched: 5 row(s)

I then ran the following statement, hoping to split the subordinates field apart into a new column, but it failed:

select name,explode(subordinates) from udtf_test;

hive (jimdb)> select name,explode(subordinates) from udtf_test;
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

A Hive table-generating function only generates its own presentation of the data; it cannot, by itself, produce an extra column alongside the table's other columns in the SELECT list.

This is where LATERAL VIEW comes in: LATERAL VIEW treats the rows generated by explode as a view that can be queried together with the source table.

SELECT name, sub 
FROM udtf_test
LATERAL VIEW explode(subordinates) subView AS sub;

Here LATERAL VIEW converts the explode output into a view named subView, whose single column is given the name sub; referencing that column name in the query then returns the exploded values.

hive (jimdb)> SELECT name, sub 
> FROM udtf_test
> LATERAL VIEW explode(subordinates) subView AS sub;
OK
name sub
jim5 james
jim5 datacloase
jim4 james
jim4 datacloase
jim3 james
jim3 datacloase
jim2 james
jim2 datacloase
jim james
jim datacloase
Time taken: 0.399 seconds, Fetched: 10 row(s)
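
One caveat: a plain LATERAL VIEW silently drops source rows whose array is NULL or empty. If those rows should be kept, with a NULL in the exploded column, Hive provides the OUTER keyword; a sketch against the same table:

SELECT name, sub
FROM udtf_test
LATERAL VIEW OUTER explode(subordinates) subView AS sub;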

Create another test table:

drop table test1;
create table test1(name string,phonenumber string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

-- Requirement: filter out any phone number in which some digit 0-9 appears six or more times, and return the remaining (normal) numbers.

hive (jimdb)> select * from test1;
OK
test1.name test1.phonenumber
'jim he' '18191512076'
'xiaosong' '18392988059'
'jingxianghua' '18118818818'
'donghualing' '17191919999'

The query is as follows:

SELECT c.name,c.phonenumber
FROM 
(SELECT dd.name,dd.phonenumber,MAX(dd.cn) 
FROM (SELECT d.name,d.phonenumber,d.m, COUNT(*) cn
FROM (SELECT name,phonenumber,m FROM test1 LATERAL VIEW explode(split(phonenumber,'')) n AS m) d 
GROUP BY d.name,d.phonenumber,d.m) dd
GROUP BY dd.name,dd.phonenumber HAVING MAX(dd.cn) <=5) c;

hive (jimdb)> SELECT c.name,c.phonenumber
> FROM 
> (SELECT dd.name,dd.phonenumber,MAX(dd.cn) 
> FROM (SELECT d.name,d.phonenumber,d.m, COUNT(*) cn
> FROM (SELECT name,phonenumber,m FROM test1 LATERAL VIEW explode(split(phonenumber,'')) n AS m) d 
> GROUP BY d.name,d.phonenumber,d.m) dd
> GROUP BY dd.name,dd.phonenumber HAVING MAX(dd.cn) <=5) c;
Query ID = hadoop_20180611200632_14d3d30b-e64f-4aee-a7ca-fffa66049890
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2018-06-11 20:06:35,732 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1118441439_0004
MapReduce Jobs Launched: 
Stage-Stage-1: HDFS Read: 3004 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
c.name c.phonenumber
'jim he' '18191512076'
'xiaosong' '18392988059'
Time taken: 2.872 seconds, Fetched: 2 row(s)
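
To see how the query works, read it from the inside out; each layer below is just a fragment of the full statement above:

-- Layer 1: explode each phone number into one row per digit
SELECT name, phonenumber, m
FROM test1 LATERAL VIEW explode(split(phonenumber,'')) n AS m;

-- Layer 2: count how often each digit appears in each number
SELECT d.name, d.phonenumber, d.m, COUNT(*) AS cn
FROM (SELECT name, phonenumber, m
FROM test1 LATERAL VIEW explode(split(phonenumber,'')) n AS m) d
GROUP BY d.name, d.phonenumber, d.m;

-- Layer 3 (the outer query) then keeps only the numbers whose most
-- frequent digit appears at most five times: HAVING MAX(dd.cn) <= 5.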

Original article: https://www.cnblogs.com/nanshanjushi/p/9175607.html
