hive 之 Cube, Rollup介绍

1. GROUPING SETS

GROUPING SETS作为GROUP BY的子句，允许开发人员在GROUP BY语句后面指定多个统维度，可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起来。

为方便理解，以testdb.test_1为例：

hive> use testdb;
hive> desc test_1;

user_id        string ? ? ?id ? ? ? ? ? ? ? ?
device_id ? ? ?string ? ? ?设备类型：手机、平板? ? ? ? ? ? ?
os_id ? ? ? ? ?string ? ? ?操作系统类型：ios、android ? ? ? ? ? ?
app_id ? ? ? ? string? ? ? 手机app_id? ? ? ? ? ? ?
client_v ?     string ? ? ?客户端版本? ? ? ? ? ? ?
channel ? ? ? ?string ? ? ?渠道

grouping sets语句	等价hive语句
select device_id,os_id,app_id,count(user_id) from ?test_1 group by device_id,os_id,app_id grouping sets((device_id))?	SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id
select device_id,os_id,app_id,count(user_id) from ?test_1 group by device_id,os_id,app_id grouping sets((device_id,os_id))	SELECT device_id,os_id,null,count(user_id) FROM test_1 group by?device_id,os_id
select device_id,os_id,app_id,count(user_id) from ?test_1 group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id))	SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id?UNION ALL?SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id
select device_id,os_id,app_id,count(user_id) from ?test_1 group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),())	SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id?UNION ALL?SELECT null,os_id,null,count(user_id) FROM test_1 group by os_id?UNION ALL?SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id??UNION ALL?SELECT null,null,null,count(user_id) FROM test_1

2. CUBE函数

cube简称数据魔方，可以实现hive多个任意维度的查询，cube(a,b,c)则首先会对(a,b,c)进行group by，然后依次是(a,b),(a,c),(a),(b,c),(b),(c)，最后在对全表进行group by，cube会统计所选列中值的所有组合的聚合

select device_id,os_id,app_id,client_v,channel,count(user_id)?
from test_1?
group by device_id,os_id,app_id,client_v,channel with cube;

等价于：

SELECT device_id,null,null,null,null ,count(user_id) FROM test_1 group by device_id
UNION ALL
SELECT null,os_id,null,null,null ,count(user_id) FROM test_1 group by os_id
UNION ALL
SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_1 group by device_id,os_id
UNION ALL
SELECT null,null,app_id,null,null ,count(user_id) FROM test_1 group by app_id
UNION ALL
SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_1 group by device_id,app_id
UNION ALL
SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_1 group by os_id,app_id
UNION ALL
SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_1 group by device_id,os_id,app_id
UNION ALL
SELECT null,null,null,client_v,null ,count(user_id) FROM test_1 group by client_v
UNION ALL
SELECT device_id,null,null,client_v,null ,count(user_id) FROM test_1 group by device_id,client_v
UNION ALL
SELECT null,os_id,null,client_v,null ,count(user_id) FROM test_1 group by os_id,client_v
UNION ALL
SELECT device_id,os_id,null,client_v,null ,count(user_id) FROM test_1 group by device_id,os_id,client_v
UNION ALL
SELECT null,null,app_id,client_v,null ,count(user_id) FROM test_1 group by app_id,client_v
UNION ALL
SELECT device_id,null,app_id,client_v,null ,count(user_id) FROM test_1 group by device_id,app_id,client_v
UNION ALL
SELECT null,os_id,app_id,client_v,null ,count(user_id) FROM test_1 group by os_id,app_id,client_v
UNION ALL
SELECT device_id,os_id,app_id,client_v,null ,count(user_id) FROM test_1 group by device_id,os_id,app_id,client_v
UNION ALL
SELECT null,null,null,null,channel ,count(user_id) FROM test_1 group by channel
UNION ALL
SELECT device_id,null,null,null,channel ,count(user_id) FROM test_1 group by device_id,channel
UNION ALL
SELECT null,os_id,null,null,channel ,count(user_id) FROM test_1 group by os_id,channel
UNION ALL
SELECT device_id,os_id,null,null,channel ,count(user_id) FROM test_1 group by device_id,os_id,channel
UNION ALL
SELECT null,null,app_id,null,channel ,count(user_id) FROM test_1 group by app_id,channel
UNION ALL
SELECT device_id,null,app_id,null,channel ,count(user_id) FROM test_1 group by device_id,app_id,channel
UNION ALL
SELECT null,os_id,app_id,null,channel ,count(user_id) FROM test_1 group by os_id,app_id,channel
UNION ALL
SELECT device_id,os_id,app_id,null,channel ,count(user_id) FROM test_1 group by device_id,os_id,app_id,channel
UNION ALL
SELECT null,null,null,client_v,channel ,count(user_id) FROM test_1 group by client_v,channel
UNION ALL
SELECT device_id,null,null,client_v,channel ,count(user_id) FROM test_1 group by device_id,client_v,channel
UNION ALL
SELECT null,os_id,null,client_v,channel ,count(user_id) FROM test_1 group by os_id,client_v,channel
UNION ALL
SELECT device_id,os_id,null,client_v,channel ,count(user_id) FROM test_1 group by device_id,os_id,client_v,channel
UNION ALL
SELECT null,null,app_id,client_v,channel ,count(user_id) FROM test_1 group by app_id,client_v,channel
UNION ALL
SELECT device_id,null,app_id,client_v,channel ,count(user_id) FROM test_1 group by device_id,app_id,client_v,channel
UNION ALL
SELECT null,os_id,app_id,client_v,channel ,count(user_id) FROM test_1 group by os_id,app_id,client_v,channel
UNION ALL
SELECT device_id,os_id,app_id,client_v,channel ,count(user_id) FROM test_1 group by device_id,os_id,app_id,client_v,channel
UNION ALL
SELECT null,null,null,null,null ,count(user_id) FROM test_1

3. ROLL UP函数

rollup可以实现从右到左递减多级的统计，显示统计某一层次结构的聚合

select device_id,os_id,app_id,client_v,channel,count(user_id)?
from test_1?
group by device_id,os_id,app_id,client_v,channel with rollup;

等价于：

select device_id,os_id,app_id,client_v,channel,count(user_id)?
from test_1?
group by device_id,os_id,app_id,client_v,channel?
grouping sets ((device_id,os_id,app_id,client_v,channel),(device_id,os_id,app_id,client_v),(device_id,os_id,app_id),(device_id,os_id),(device_id),());

4.Grouping_ID函数

当我们没有统计某一列时，它的值显示为null，这可能与列本身就有null值冲突，这就需要一种方法区分是没有统计还是值本来就是null。（写一个排列组合的算法，就马上理解了，grouping_id其实就是所统计各列二进制和）

例子如下：

Column1 (key)	Column2 (value)
1	NULL
1	1
2	2
3	3
3	NULL
4	5

hql统计：

 ?SELECT key, value, GROUPING_ID, count(*) from T1 GROUP BY key, value WITH ROLLUP

结果如下：

?key	value	GROUPING_ID?	count(*)?
NULL	NULL	0 ? ? 00	6
1	NULL	1 ? ? 10	2
1	NULL	3 ? ? 11	1
1	1	3 ? ? 11	1
2	NULL	1 ? ? 10	1
2	2	3 ? ? 11	1
3	NULL	1 ? ? 10	2
3	NULL	3 ? ? 11	1
3	3	3 ? ? 11	1
4	NULL	1 ? ? 10	1
4	5	3 ? ? 11	1

GROUPING_ID转变为二进制，如果对应位上有值为null，说明这列本身值就是null。（通过类DataFilterNull.py 扫描，可以筛选过滤掉列中null、“”统计结果),

5. 窗口函数

hive窗口函数，感觉大部分都是在模仿oracle，有对oracle熟悉的，应该看下就知道怎么用。

具体参见：http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html

参考文章

https://blog.csdn.net/gua___gua/article/details/52523698

原文地址：https://www.cnblogs.com/remainsu/p/11060580.html

时间： 2024-10-26 11:29:29

hive 之 Cube, Rollup介绍

1. GROUPING SETS

2. CUBE函数

3. ROLL UP函数

4.Grouping_ID函数

5. 窗口函数

参考文章

hive 之 Cube, Rollup介绍的相关文章

Hive新功能 Cube, Rollup介绍

Hive分析窗口函数(五) GROUPING SETS,GROUPING__ID,CUBE,ROLLUP

分组 cube rollup NVL (expr1, expr2)

SQL Server ->> GROUPING SETS, CUBE, ROLLUP, GROUPING, GROUPING_ID

grouping sets,cube,rollup,grouping__id,group by

hive的语法命令介绍

Hive入门笔记-----架构以及应用介绍

GROUPING SETS、CUBE、ROLLUP

SQLSERVER中的ALL、PERCENT、CUBE关键字、ROLLUP关键字和GROUPING函数