hive 之 Cube, Rollup介绍

1. GROUPING SETS

GROUPING SETS作为GROUP BY的子句,允许开发人员在GROUP BY语句后面指定多个统维度,可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起来。

为方便理解,以testdb.test_1为例:

hive> use testdb;
hive> desc test_1;

user_id        string ? ? ?id ? ? ? ? ? ? ? ?
device_id ? ? ?string ? ? ?设备类型:手机、平板? ? ? ? ? ? ?
os_id ? ? ? ? ?string ? ? ?操作系统类型:ios、android ? ? ? ? ? ?
app_id ? ? ? ? string? ? ? 手机app_id? ? ? ? ? ? ?
client_v ?     string ? ? ?客户端版本? ? ? ? ? ? ?
channel ? ? ? ?string ? ? ?渠道
grouping sets语句 等价hive语句
select device_id,os_id,app_id,count(user_id) from ?test_1 group by device_id,os_id,app_id grouping sets((device_id))? SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id
select device_id,os_id,app_id,count(user_id) from ?test_1 group by device_id,os_id,app_id grouping sets((device_id,os_id)) SELECT device_id,os_id,null,count(user_id) FROM test_1 group by?device_id,os_id
select device_id,os_id,app_id,count(user_id) from ?test_1 group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id)) SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id?UNION ALL?SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id
select device_id,os_id,app_id,count(user_id) from ?test_1 group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),()) SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id?UNION ALL?SELECT null,os_id,null,count(user_id) FROM test_1 group by os_id?UNION ALL?SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id??UNION ALL?SELECT null,null,null,count(user_id) FROM test_1

2. CUBE函数

cube简称数据魔方,可以实现hive多个任意维度的查询,cube(a,b,c)则首先会对(a,b,c)进行group by,然后依次是(a,b),(a,c),(a),(b,c),(b),(c),最后在对全表进行group by,cube会统计所选列中值的所有组合的聚合

select device_id,os_id,app_id,client_v,channel,count(user_id)?
from test_1?
group by device_id,os_id,app_id,client_v,channel with cube;

等价于:

SELECT device_id,null,null,null,null ,count(user_id) FROM test_1 group by device_id
UNION ALL
SELECT null,os_id,null,null,null ,count(user_id) FROM test_1 group by os_id
UNION ALL
SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_1 group by device_id,os_id
UNION ALL
SELECT null,null,app_id,null,null ,count(user_id) FROM test_1 group by app_id
UNION ALL
SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_1 group by device_id,app_id
UNION ALL
SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_1 group by os_id,app_id
UNION ALL
SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_1 group by device_id,os_id,app_id
UNION ALL
SELECT null,null,null,client_v,null ,count(user_id) FROM test_1 group by client_v
UNION ALL
SELECT device_id,null,null,client_v,null ,count(user_id) FROM test_1 group by device_id,client_v
UNION ALL
SELECT null,os_id,null,client_v,null ,count(user_id) FROM test_1 group by os_id,client_v
UNION ALL
SELECT device_id,os_id,null,client_v,null ,count(user_id) FROM test_1 group by device_id,os_id,client_v
UNION ALL
SELECT null,null,app_id,client_v,null ,count(user_id) FROM test_1 group by app_id,client_v
UNION ALL
SELECT device_id,null,app_id,client_v,null ,count(user_id) FROM test_1 group by device_id,app_id,client_v
UNION ALL
SELECT null,os_id,app_id,client_v,null ,count(user_id) FROM test_1 group by os_id,app_id,client_v
UNION ALL
SELECT device_id,os_id,app_id,client_v,null ,count(user_id) FROM test_1 group by device_id,os_id,app_id,client_v
UNION ALL
SELECT null,null,null,null,channel ,count(user_id) FROM test_1 group by channel
UNION ALL
SELECT device_id,null,null,null,channel ,count(user_id) FROM test_1 group by device_id,channel
UNION ALL
SELECT null,os_id,null,null,channel ,count(user_id) FROM test_1 group by os_id,channel
UNION ALL
SELECT device_id,os_id,null,null,channel ,count(user_id) FROM test_1 group by device_id,os_id,channel
UNION ALL
SELECT null,null,app_id,null,channel ,count(user_id) FROM test_1 group by app_id,channel
UNION ALL
SELECT device_id,null,app_id,null,channel ,count(user_id) FROM test_1 group by device_id,app_id,channel
UNION ALL
SELECT null,os_id,app_id,null,channel ,count(user_id) FROM test_1 group by os_id,app_id,channel
UNION ALL
SELECT device_id,os_id,app_id,null,channel ,count(user_id) FROM test_1 group by device_id,os_id,app_id,channel
UNION ALL
SELECT null,null,null,client_v,channel ,count(user_id) FROM test_1 group by client_v,channel
UNION ALL
SELECT device_id,null,null,client_v,channel ,count(user_id) FROM test_1 group by device_id,client_v,channel
UNION ALL
SELECT null,os_id,null,client_v,channel ,count(user_id) FROM test_1 group by os_id,client_v,channel
UNION ALL
SELECT device_id,os_id,null,client_v,channel ,count(user_id) FROM test_1 group by device_id,os_id,client_v,channel
UNION ALL
SELECT null,null,app_id,client_v,channel ,count(user_id) FROM test_1 group by app_id,client_v,channel
UNION ALL
SELECT device_id,null,app_id,client_v,channel ,count(user_id) FROM test_1 group by device_id,app_id,client_v,channel
UNION ALL
SELECT null,os_id,app_id,client_v,channel ,count(user_id) FROM test_1 group by os_id,app_id,client_v,channel
UNION ALL
SELECT device_id,os_id,app_id,client_v,channel ,count(user_id) FROM test_1 group by device_id,os_id,app_id,client_v,channel
UNION ALL
SELECT null,null,null,null,null ,count(user_id) FROM test_1

3. ROLL UP函数

rollup可以实现从右到左递减多级的统计,显示统计某一层次结构的聚合

select device_id,os_id,app_id,client_v,channel,count(user_id)?
from test_1?
group by device_id,os_id,app_id,client_v,channel with rollup;

等价于:

select device_id,os_id,app_id,client_v,channel,count(user_id)?
from test_1?
group by device_id,os_id,app_id,client_v,channel?
grouping sets ((device_id,os_id,app_id,client_v,channel),(device_id,os_id,app_id,client_v),(device_id,os_id,app_id),(device_id,os_id),(device_id),());

4.Grouping_ID函数

当我们没有统计某一列时,它的值显示为null,这可能与列本身就有null值冲突,这就需要一种方法区分是没有统计还是值本来就是null。(写一个排列组合的算法,就马上理解了,grouping_id其实就是所统计各列二进制和)

例子如下:

Column1 (key) Column2 (value)
1 NULL
1 1
2 2
3 3
3 NULL
4 5

hql统计:

 ?SELECT key, value, GROUPING_ID, count(*) from T1 GROUP BY key, value WITH ROLLUP

结果如下:

?key value GROUPING_ID? count(*)?
NULL NULL 0 ? ? 00 6
1 NULL 1 ? ? 10 2
1 NULL 3 ? ? 11 1
1 1 3 ? ? 11 1
2 NULL 1 ? ? 10 1
2 2 3 ? ? 11 1
3 NULL 1 ? ? 10 2
3 NULL 3 ? ? 11 1
3 3 3 ? ? 11 1
4 NULL 1 ? ? 10 1
4 5 3 ? ? 11 1

GROUPING_ID转变为二进制,如果对应位上有值为null,说明这列本身值就是null。(通过类DataFilterNull.py 扫描,可以筛选过滤掉列中null、“”统计结果),

5. 窗口函数

hive窗口函数,感觉大部分都是在模仿oracle,有对oracle熟悉的,应该看下就知道怎么用。

具体参见:http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html

参考文章

  1. https://blog.csdn.net/gua___gua/article/details/52523698

原文地址:https://www.cnblogs.com/remainsu/p/11060580.html

时间: 2024-10-26 11:29:29

hive 之 Cube, Rollup介绍的相关文章

Hive新功能 Cube, Rollup介绍

说明:Hive之cube.rollup,还有窗口函数,在传统关系型数据(Oracle.sqlserver)中都是有的,用法都很相似. GROUPING SETS GROUPING SETS作为GROUP BY的子句,允许开发人员在GROUP BY语句后面指定多个统计选项,可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起来,下面是几个实例可以帮助我们了解, 以acorn_3g.test_xinyan_reg为例: [[email protected] xjob]

Hive分析窗口函数(五) GROUPING SETS,GROUPING__ID,CUBE,ROLLUP

1.GROUPING SETS与另外哪种方式等价? 2.根据GROUP BY的维度的所有组合进行聚合由哪个关键字完成? 3.ROLLUP与ROLLUP关系是什么? GROUPING SETS,GROUPING__ID,CUBE,ROLLUP 这几个分析函数通常用于OLAP中,不能累加,而且需要根据不同维度上钻和下钻的指标统计,比如,分小时.天.月的UV数. Hive版本为 apache-hive-0.13.1 数据准备: 2015-03,2015-03-10,cookie1 2015-03,20

分组 cube rollup NVL (expr1, expr2)

cube  rollup NVL (expr1, expr2)->expr1为NULL,返回expr2:不为NULL,返回expr1.注意两者的类型要一致 NVL2 (expr1, expr2, expr3) ->expr1不为NULL,返回expr2:为NULL,返回expr3.expr2和expr3类型不同的话,expr3会转换为expr2的类型 NULLIF (expr1, expr2) ->相等返回NULL,不等返回expr1

SQL Server ->> GROUPING SETS, CUBE, ROLLUP, GROUPING, GROUPING_ID

在我们制作报表的时候常常需要分组聚合.多组聚合和总合.如果通过另外的T-SQL语句来聚合难免性能太差.如果通过报表工具的聚合功能虽说比使用额外的T-SQL语句性能上要好很多,不过不够干脆,还是需要先生成整个结果集然后再聚合,而且最最重要的时很多情况下报表的聚合功能可能没办法达到我们需要的效果.GROUPING SETS, CUBE, ROLLUP, GROUPING, GROUPING_ID这几个聚合函数的作用就是在原始语句的基础上完成很多像财务报表需要的聚合功能. GROUPING SETS相

grouping sets,cube,rollup,grouping__id,group by

例1: hive -e" select type ,status ,count(1) from usr_info where pt='2015-09-14' group by type,status grouping sets ((type,status),( type),()); ">one.txt Grouping sets按照各种指定聚类汇总方式,如group by type,status grouping sets ((type,status),( type),()) 表

hive的语法命令介绍

1.hive的基本语法: create databases mydb #创建数据库 show databases #查看所有的库 use mydb #切换数据库 create table t_user(id int ,name string,age int) #创建表 create table t_user(id int ,name string,age int) row format delimited fields terminated by '分隔符' #指定分隔符的建表语句 insert

Hive入门笔记-----架构以及应用介绍

Hive这个框架在Hadoop的生态体系结构中占有及其重要的地位,在实际的业务当中用的也非常多,可以说Hadoop之所以这么流行在很大程度上是因为Hive的存在.那么Hive究竟是什么,为什么在Hadoop家族中占有这么重要的地位,本篇文章将围绕Hive的体系结构(架构).Hive的操作.Hive与Hbase的区别等对Hive进行全方面的阐述. 在此之前,先给大家介绍一个业务场景,让大家感受一下为什么Hive如此的受欢迎: 业务描述:统计业务表consumer.txt中北京的客户有多少位?下面是

GROUPING SETS、CUBE、ROLLUP

其实还是写一个Demo 比较好 USE tempdb IF OBJECT_ID( 'dbo.T1' , 'U' )IS NOT NULL BEGIN DROP TABLE dbo.T1; END; GO CREATE TABLE dbo.T1 ( id INT , productName VARCHAR(200) , price MONEY , num INT , amount INT , operatedate DATETIME ) GO DECLARE @i INT DECLARE @ran

SQLSERVER中的ALL、PERCENT、CUBE关键字、ROLLUP关键字和GROUPING函数

原文:SQLSERVER中的ALL.PERCENT.CUBE关键字.ROLLUP关键字和GROUPING函数 SQLSERVER中的ALL.PERCENT.CUBE关键字.ROLLUP关键字和GROUPING函数 先来创建一个测试表 1 USE [tempdb] 2 GO 3 4 CREATE TABLE #temptb(id INT ,NAME VARCHAR(200)) 5 GO 6 7 INSERT INTO [#temptb] ( [id], [NAME] ) 8 SELECT 1,'中