Hive functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE

Adapted from 大数据田地: http://lxw1234.com/archives/2015/04/190.htm

Preparing the test data:

create external table test_data (
cookieid string,
createtime string,  -- page visit time
url string       -- visited page
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile location '/user/jc_rc_ftp/test_data';

 select * from test_data l;
+-------------+----------------------+---------+--+
| l.cookieid  |     l.createtime     |  l.url  |
+-------------+----------------------+---------+--+
| cookie1     | 2015-04-10 10:00:02  | url2    |
| cookie1     | 2015-04-10 10:00:00  | url1    |
| cookie1     | 2015-04-10 10:03:04  | 1url3   |
| cookie1     | 2015-04-10 10:50:05  | url6    |
| cookie1     | 2015-04-10 11:00:00  | url7    |
| cookie1     | 2015-04-10 10:10:00  | url4    |
| cookie1     | 2015-04-10 10:50:01  | url5    |
| cookie2     | 2015-04-10 10:00:02  | url22   |
| cookie2     | 2015-04-10 10:00:00  | url11   |
| cookie2     | 2015-04-10 10:03:04  | 1url33  |
| cookie2     | 2015-04-10 10:50:05  | url66   |
| cookie2     | 2015-04-10 11:00:00  | url77   |
| cookie2     | 2015-04-10 10:10:00  | url44   |
| cookie2     | 2015-04-10 10:50:01  | url55   |
+-------------+----------------------+---------+--+

LAG
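LAG(col,n,DEFAULT) returns the value of col from the row n rows above the current row within the window.

The first argument is the column name. The second argument is the offset n, counted upward from the current row (optional, defaults to 1). The third argument is the default value, returned when there is no row n rows above the current one in the partition; if it is not specified, NULL is returned in that case.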

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid  |      createtime      |   url   | rn  |     last_1_time      |     last_2_time      |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1   | 2015-04-10 10:00:00  | url1    | 1   | 1970-01-01 00:00:00  | NULL                 |
| cookie1   | 2015-04-10 10:00:02  | url2    | 2   | 2015-04-10 10:00:00  | NULL                 |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 2015-04-10 10:00:02  | 2015-04-10 10:00:00  |
| cookie1   | 2015-04-10 10:10:00  | url4    | 4   | 2015-04-10 10:03:04  | 2015-04-10 10:00:02  |
| cookie1   | 2015-04-10 10:50:01  | url5    | 5   | 2015-04-10 10:10:00  | 2015-04-10 10:03:04  |
| cookie1   | 2015-04-10 10:50:05  | url6    | 6   | 2015-04-10 10:50:01  | 2015-04-10 10:10:00  |
| cookie1   | 2015-04-10 11:00:00  | url7    | 7   | 2015-04-10 10:50:05  | 2015-04-10 10:50:01  |
| cookie2   | 2015-04-10 10:00:00  | url11   | 1   | 1970-01-01 00:00:00  | NULL                 |
| cookie2   | 2015-04-10 10:00:02  | url22   | 2   | 2015-04-10 10:00:00  | NULL                 |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 2015-04-10 10:00:02  | 2015-04-10 10:00:00  |
| cookie2   | 2015-04-10 10:10:00  | url44   | 4   | 2015-04-10 10:03:04  | 2015-04-10 10:00:02  |
| cookie2   | 2015-04-10 10:50:01  | url55   | 5   | 2015-04-10 10:10:00  | 2015-04-10 10:03:04  |
| cookie2   | 2015-04-10 10:50:05  | url66   | 6   | 2015-04-10 10:50:01  | 2015-04-10 10:10:00  |
| cookie2   | 2015-04-10 11:00:00  | url77   | 7   | 2015-04-10 10:50:05  | 2015-04-10 10:50:01  |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
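
A common use of LAG is measuring the gap between consecutive page views. The following is a minimal sketch against the test_data table above, not part of the original article; the names prev_time and seconds_since_prev are illustrative. It assumes createtime is always in the yyyy-MM-dd HH:mm:ss format, which unix_timestamp(string) parses by default.

```sql
-- Sketch: seconds elapsed since the previous page view of the same cookie.
-- prev_time and seconds_since_prev are illustrative names.
SELECT cookieid,
       createtime,
       url,
       prev_time,
       -- NULL for the first view of each cookie, because prev_time is NULL there
       unix_timestamp(createtime) - unix_timestamp(prev_time) AS seconds_since_prev
FROM (
  SELECT cookieid,
         createtime,
         url,
         -- offset omitted, so it defaults to 1; no default value, so NULL at the start
         LAG(createtime) OVER (PARTITION BY cookieid ORDER BY createtime) AS prev_time
  FROM test_data
) t;
```

The window function is computed in a subquery so that the outer query only does plain arithmetic on an ordinary column.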

LEAD

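The opposite of LAG.
LEAD(col,n,DEFAULT) returns the value of col from the row n rows below the current row within the window.
The first argument is the column name. The second argument is the offset n, counted downward from the current row (optional, defaults to 1). The third argument is the default value, returned when there is no row n rows below the current one in the partition; if it is not specified, NULL is returned in that case.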

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid  |      createtime      |   url   | rn  |     next_1_time      |     next_2_time      |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1   | 2015-04-10 10:00:00  | url1    | 1   | 2015-04-10 10:00:02  | 2015-04-10 10:03:04  |
| cookie1   | 2015-04-10 10:00:02  | url2    | 2   | 2015-04-10 10:03:04  | 2015-04-10 10:10:00  |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 2015-04-10 10:10:00  | 2015-04-10 10:50:01  |
| cookie1   | 2015-04-10 10:10:00  | url4    | 4   | 2015-04-10 10:50:01  | 2015-04-10 10:50:05  |
| cookie1   | 2015-04-10 10:50:01  | url5    | 5   | 2015-04-10 10:50:05  | 2015-04-10 11:00:00  |
| cookie1   | 2015-04-10 10:50:05  | url6    | 6   | 2015-04-10 11:00:00  | NULL                 |
| cookie1   | 2015-04-10 11:00:00  | url7    | 7   | 1970-01-01 00:00:00  | NULL                 |
| cookie2   | 2015-04-10 10:00:00  | url11   | 1   | 2015-04-10 10:00:02  | 2015-04-10 10:03:04  |
| cookie2   | 2015-04-10 10:00:02  | url22   | 2   | 2015-04-10 10:03:04  | 2015-04-10 10:10:00  |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 2015-04-10 10:10:00  | 2015-04-10 10:50:01  |
| cookie2   | 2015-04-10 10:10:00  | url44   | 4   | 2015-04-10 10:50:01  | 2015-04-10 10:50:05  |
| cookie2   | 2015-04-10 10:50:01  | url55   | 5   | 2015-04-10 10:50:05  | 2015-04-10 11:00:00  |
| cookie2   | 2015-04-10 10:50:05  | url66   | 6   | 2015-04-10 11:00:00  | NULL                 |
| cookie2   | 2015-04-10 11:00:00  | url77   | 7   | 1970-01-01 00:00:00  | NULL                 |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
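
LEAD works on any column, not only the sort column. As a hedged illustration (not from the original article), the sketch below pairs each page with the next page visited by the same cookie. The aliases from_url and to_url and the placeholder value 'EXIT' are assumptions chosen for this example.

```sql
-- Sketch: page-to-page transitions per cookie.
-- 'EXIT' is an illustrative default for the last page view of each cookie,
-- where no following row exists in the partition.
SELECT cookieid,
       createtime,
       url AS from_url,
       LEAD(url, 1, 'EXIT') OVER (PARTITION BY cookieid ORDER BY createtime) AS to_url
FROM test_data;
```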

FIRST_VALUE

Returns the first value within the partition, after sorting, up to the current row.

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM test_data;

+-----------+----------------------+---------+-----+---------+--+
| cookieid  |      createtime      |   url   | rn  | first1  |
+-----------+----------------------+---------+-----+---------+--+
| cookie1   | 2015-04-10 10:00:00  | url1    | 1   | url1    |
| cookie1   | 2015-04-10 10:00:02  | url2    | 2   | url1    |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | url1    |
| cookie1   | 2015-04-10 10:10:00  | url4    | 4   | url1    |
| cookie1   | 2015-04-10 10:50:01  | url5    | 5   | url1    |
| cookie1   | 2015-04-10 10:50:05  | url6    | 6   | url1    |
| cookie1   | 2015-04-10 11:00:00  | url7    | 7   | url1    |
| cookie2   | 2015-04-10 10:00:00  | url11   | 1   | url11   |
| cookie2   | 2015-04-10 10:00:02  | url22   | 2   | url11   |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | url11   |
| cookie2   | 2015-04-10 10:10:00  | url44   | 4   | url11   |
| cookie2   | 2015-04-10 10:50:01  | url55   | 5   | url11   |
| cookie2   | 2015-04-10 10:50:05  | url66   | 6   | url11   |
| cookie2   | 2015-04-10 11:00:00  | url77   | 7   | url11   |
+-----------+----------------------+---------+-----+---------+--+
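
"Up to the current row" describes the window frame. As a sketch, the same query can be written with the frame spelled out. For this data, where createtime is unique within each cookieid, the ROWS frame below should behave the same as leaving the frame unspecified. The alias first1_explicit is illustrative.

```sql
-- Sketch: FIRST_VALUE with the window frame written explicitly.
-- The frame runs from the first row of the partition to the current row.
SELECT cookieid,
       createtime,
       url,
       FIRST_VALUE(url) OVER (
         PARTITION BY cookieid
         ORDER BY createtime
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS first1_explicit
FROM test_data;
```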

LAST_VALUE

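Returns the last value within the partition, after sorting, up to the current row. Because the window ends at the current row, the result is the current row's own value whenever the ORDER BY values are unique, as in the output below.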

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid  |      createtime      |   url   | rn  |  last1  |
+-----------+----------------------+---------+-----+---------+--+
| cookie1   | 2015-04-10 10:00:00  | url1    | 1   | url1    |
| cookie1   | 2015-04-10 10:00:02  | url2    | 2   | url2    |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 1url3   |
| cookie1   | 2015-04-10 10:10:00  | url4    | 4   | url4    |
| cookie1   | 2015-04-10 10:50:01  | url5    | 5   | url5    |
| cookie1   | 2015-04-10 10:50:05  | url6    | 6   | url6    |
| cookie1   | 2015-04-10 11:00:00  | url7    | 7   | url7    |
| cookie2   | 2015-04-10 10:00:00  | url11   | 1   | url11   |
| cookie2   | 2015-04-10 10:00:02  | url22   | 2   | url22   |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 1url33  |
| cookie2   | 2015-04-10 10:10:00  | url44   | 4   | url44   |
| cookie2   | 2015-04-10 10:50:01  | url55   | 5   | url55   |
| cookie2   | 2015-04-10 10:50:05  | url66   | 6   | url66   |
| cookie2   | 2015-04-10 11:00:00  | url77   | 7   | url77   |
+-----------+----------------------+---------+-----+---------+--+

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid  |      createtime      |   url   | rn  |  last1  |
+-----------+----------------------+---------+-----+---------+--+
| cookie1   | 2015-04-10 11:00:00  | url7    | 7   | url7    |
| cookie1   | 2015-04-10 10:50:05  | url6    | 6   | url6    |
| cookie1   | 2015-04-10 10:50:01  | url5    | 5   | url5    |
| cookie1   | 2015-04-10 10:10:00  | url4    | 4   | url4    |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 1url3   |
| cookie1   | 2015-04-10 10:00:02  | url2    | 2   | url2    |
| cookie1   | 2015-04-10 10:00:00  | url1    | 1   | url1    |
| cookie2   | 2015-04-10 11:00:00  | url77   | 7   | url77   |
| cookie2   | 2015-04-10 10:50:05  | url66   | 6   | url66   |
| cookie2   | 2015-04-10 10:50:01  | url55   | 5   | url55   |
| cookie2   | 2015-04-10 10:10:00  | url44   | 4   | url44   |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 1url33  |
| cookie2   | 2015-04-10 10:00:02  | url22   | 2   | url22   |
| cookie2   | 2015-04-10 10:00:00  | url11   | 1   | url11   |
+-----------+----------------------+---------+-----+---------+--+

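If ORDER BY is not specified, the order of rows within each partition is not guaranteed (it depends on how the records are read and shuffled, not on createtime), so the results can be wrong: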

SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM test_data;
+-----------+----------------------+---------+---------+--+
| cookieid  |      createtime      |   url   | first2  |
+-----------+----------------------+---------+---------+--+
| cookie1   | 2015-04-10 10:00:02  | url2    | url2    |
| cookie1   | 2015-04-10 10:50:01  | url5    | url2    |
| cookie1   | 2015-04-10 10:10:00  | url4    | url2    |
| cookie1   | 2015-04-10 11:00:00  | url7    | url2    |
| cookie1   | 2015-04-10 10:50:05  | url6    | url2    |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | url2    |
| cookie1   | 2015-04-10 10:00:00  | url1    | url2    |
| cookie2   | 2015-04-10 10:50:01  | url55   | url55   |
| cookie2   | 2015-04-10 10:10:00  | url44   | url55   |
| cookie2   | 2015-04-10 11:00:00  | url77   | url55   |
| cookie2   | 2015-04-10 10:50:05  | url66   | url55   |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | url55   |
| cookie2   | 2015-04-10 10:00:00  | url11   | url55   |
| cookie2   | 2015-04-10 10:00:02  | url22   | url55   |
+-----------+----------------------+---------+---------+--+

SELECT cookieid,
createtime,
url,
LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2
FROM test_data;
+-----------+----------------------+---------+--------+--+
| cookieid  |      createtime      |   url   | last2  |
+-----------+----------------------+---------+--------+--+
| cookie1   | 2015-04-10 10:00:02  | url2    | url1   |
| cookie1   | 2015-04-10 10:50:01  | url5    | url1   |
| cookie1   | 2015-04-10 10:10:00  | url4    | url1   |
| cookie1   | 2015-04-10 11:00:00  | url7    | url1   |
| cookie1   | 2015-04-10 10:50:05  | url6    | url1   |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | url1   |
| cookie1   | 2015-04-10 10:00:00  | url1    | url1   |
| cookie2   | 2015-04-10 10:50:01  | url55   | url22  |
| cookie2   | 2015-04-10 10:10:00  | url44   | url22  |
| cookie2   | 2015-04-10 11:00:00  | url77   | url22  |
| cookie2   | 2015-04-10 10:50:05  | url66   | url22  |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | url22  |
| cookie2   | 2015-04-10 10:00:00  | url11   | url22  |
| cookie2   | 2015-04-10 10:00:02  | url22   | url22  |
+-----------+----------------------+---------+--------+--+
14 rows selected (78.058 seconds)

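To get the last value of the whole partition after sorting, a workaround is needed: use FIRST_VALUE with the sort order reversed (column last2 below).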

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM test_data
ORDER BY cookieid,createtime;
+-----------+----------------------+---------+-----+---------+--------+--+
| cookieid  |      createtime      |   url   | rn  |  last1  | last2  |
+-----------+----------------------+---------+-----+---------+--------+--+
| cookie1   | 2015-04-10 10:00:00  | url1    | 1   | url1    | url7   |
| cookie1   | 2015-04-10 10:00:02  | url2    | 2   | url2    | url7   |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 1url3   | url7   |
| cookie1   | 2015-04-10 10:10:00  | url4    | 4   | url4    | url7   |
| cookie1   | 2015-04-10 10:50:01  | url5    | 5   | url5    | url7   |
| cookie1   | 2015-04-10 10:50:05  | url6    | 6   | url6    | url7   |
| cookie1   | 2015-04-10 11:00:00  | url7    | 7   | url7    | url7   |
| cookie2   | 2015-04-10 10:00:00  | url11   | 1   | url11   | url77  |
| cookie2   | 2015-04-10 10:00:02  | url22   | 2   | url22   | url77  |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 1url33  | url77  |
| cookie2   | 2015-04-10 10:10:00  | url44   | 4   | url44   | url77  |
| cookie2   | 2015-04-10 10:50:01  | url55   | 5   | url55   | url77  |
| cookie2   | 2015-04-10 10:50:05  | url66   | 6   | url66   | url77  |
| cookie2   | 2015-04-10 11:00:00  | url77   | 7   | url77   | url77  |
+-----------+----------------------+---------+-----+---------+--------+--+
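
The workaround above is needed because, according to the Hive documentation, a window with ORDER BY and no explicit frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so LAST_VALUE never looks past the current row. An alternative sketch, not from the original article, is to widen the frame to the whole partition; the alias last3 is illustrative.

```sql
-- Sketch: LAST_VALUE over the whole partition via an explicit frame.
-- With this frame every row of a cookie sees the same last url.
SELECT cookieid,
       createtime,
       url,
       LAST_VALUE(url) OVER (
         PARTITION BY cookieid
         ORDER BY createtime
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS last3
FROM test_data
ORDER BY cookieid, createtime;
```

Compared with reversing the sort order, this keeps a single ORDER BY direction, which may be easier to read when FIRST_VALUE and LAST_VALUE appear in the same query.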

Original article: https://www.cnblogs.com/yy3b2007com/p/8582831.html

Time: 2024-10-08 03:12:25
