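A Comprehensive Hive Case Study: A Simple Recommendation System

Key points:

1. Using Hive's complex data type map together with LATERAL VIEW

  map, str_to_map, map_keys, map_values, and combining map with lateral view

2. Simple data protection with translate

  Using Hive's translate function to mask data and help keep enterprise application information secure

3. An introduction to Hive's windowing and analytic functions

  row_number, rank, dense_rank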

Create the orders table:

CREATE EXTERNAL TABLE f_orders (
    user_id   STRING
  , ts        STRING
  , order_id  STRING
  , items     map<STRING,BIGINT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
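
To make the three delimiters in the DDL concrete, here is a minimal Python sketch (a model of the parsing, not Hive code) of how one row of the text file would be split. The raw line is an assumption rebuilt from the first row of the query output, since the contents of f_orders.txt are not shown in this post, and parse_order_line is a hypothetical helper.

```python
def parse_order_line(line):
    """Split one raw text row the way the f_orders delimiters describe."""
    # FIELDS TERMINATED BY '\t': four tab-separated columns
    user_id, ts, order_id, raw_items = line.rstrip("\n").split("\t")
    items = {}
    # COLLECTION ITEMS TERMINATED BY '|': one map entry per segment
    for entry in raw_items.split("|"):
        # MAP KEYS TERMINATED BY ':': key and value inside one entry
        sku, qty = entry.split(":")
        items[sku] = int(qty)  # items is declared as map<STRING,BIGINT>
    return {"user_id": user_id, "ts": ts, "order_id": order_id, "items": items}


# Assumed raw line (hypothetical; rebuilt from the first row of the output)
sample = "11\t2014-05-01 06:01:12.334+01\t10703007267488\titem8:2|item1:1"
row = parse_order_line(sample)
print(row["items"])
```

The nesting order matters: fields are split first, then collection items, then map keys, which is why the timestamp column can contain ':' without confusing the map parsing.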

Load the data:

load data local inpath '/home/spark/software/data/f_orders.txt' overwrite into table f_orders;

Query the data:

select * from f_orders;
11      2014-05-01 06:01:12.334+01      10703007267488  {"item8":2,"item1":1}
22      2014-05-01 07:28:12.342+01      10101043505096  {"item6":3,"item3":2}
33      2014-05-01 07:50:12.33+01       10103043509747  {"item7":7}
11      2014-05-01 09:27:12.33+01       10103043501575  {"item5":5,"item1":1,"item4":1,"item9":1}
22      2014-05-01 09:03:12.324+01      10104043514061  {"item1":3}
33      2014-05-02 19:10:12.343+01      11003002067594  {"item4":2,"item1":1}
11      2014-05-02 09:07:12.344+01      10101043497459  {"item9":1}
35      2014-05-03 11:07:12.339+01      10203019269975  {"item5":1,"item1":1}
789     2014-05-03 12:59:12.743+01      10401003346256  {"item7":3,"item8":2,"item9":1}
77      2014-05-03 18:04:12.355+01      10203019262235  {"item5":2,"item1":1}
99      2014-05-04 00:36:39.713+01      10103044681799  {"item9":3,"item1":1}
33      2014-05-04 19:10:12.343+01      12345678901234  {"item5":1,"item1":1}
11      2014-05-05 09:07:12.344+01      12345678901235  {"item6":1,"item1":1}
35      2014-05-05 11:07:12.339+01      12345678901236  {"item5":2,"item1":1}
22      2014-05-05 12:59:12.743+01      12345678901237  {"item9":3,"item1":1}
77      2014-05-05 18:04:12.355+01      12345678901238  {"item8":3,"item1":1}
99      2014-05-05 20:36:39.713+01      12345678901239  {"item9":3,"item1":1}

Extract the keys and values of a map with map_keys and map_values:

select map_keys(items), map_values(items) from f_orders where user_id = '35';
["item5","item1"]       [1,1]
["item5","item1"]       [2,1]

Find the orders whose items include item8:

select * from f_orders where array_contains(map_keys(items), 'item8');
11      2014-05-01 06:01:12.334+01      10703007267488  {"item1":1,"item8":2}
789     2014-05-03 12:59:12.743+01      10401003346256  {"item7":3,"item8":2,"item9":1}
77      2014-05-05 18:04:12.355+01      12345678901238  {"item1":1,"item8":3}

Expand the items column of f_orders into one row per map entry with a lateral view:

select user_id, order_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount;
11      10703007267488  item8   2
11      10703007267488  item1   1
22      10101043505096  item6   3
22      10101043505096  item3   2
33      10103043509747  item7   7
11      10103043501575  item5   5
11      10103043501575  item1   1
11      10103043501575  item4   1
11      10103043501575  item9   1
22      10104043514061  item1   3
33      11003002067594  item4   2
33      11003002067594  item1   1
11      10101043497459  item9   1
35      10203019269975  item5   1
35      10203019269975  item1   1
789     10401003346256  item7   3
789     10401003346256  item8   2
789     10401003346256  item9   1
77      10203019262235  item5   2
77      10203019262235  item1   1
99      10103044681799  item9   3
99      10103044681799  item1   1
33      12345678901234  item5   1
33      12345678901234  item1   1
11      12345678901235  item6   1
11      12345678901235  item1   1
35      12345678901236  item5   2
35      12345678901236  item1   1
22      12345678901237  item9   3
22      12345678901237  item1   1
77      12345678901238  item8   3
77      12345678901238  item1   1
99      12345678901239  item9   3
99      12345678901239  item1   1
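
What LATERAL VIEW explode does to a map column can be modeled in a few lines of Python. This is only an illustration of the row-multiplying behavior, not Hive code; explode_map is a hypothetical name, and the two sample orders are copied from the f_orders output.

```python
def explode_map(rows):
    """Yield one output row per (key, value) entry of each row's items map."""
    for user_id, order_id, items in rows:
        for item, amount in items.items():
            yield (user_id, order_id, item, amount)


# Two orders copied from the f_orders output
orders = [
    ("11", "10703007267488", {"item8": 2, "item1": 1}),
    ("33", "10103043509747", {"item7": 7}),
]
exploded = list(explode_map(orders))
for r in exploded:
    print(*r, sep="\t")
```

An order with an empty map produces no rows in this model, which mirrors a plain LATERAL VIEW; Hive's LATERAL VIEW OUTER would keep such a row with NULL columns instead.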

Create the order items table:

CREATE EXTERNAL TABLE d_items (
  item_sku  STRING,
  price     DOUBLE,
  catalogs  array<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '|';

Load the data:

load data local inpath '/home/spark/software/data/d_items.txt' overwrite into table d_items;

Query the data:

select * from d_items;
item1   100.2   ["catalogA","catalogD","catalogX"]
item2   200.3   ["catalogA"]
item3   300.4   ["catalogA","catalogX"]
item4   400.5   ["catalogB"]
item5   500.6   ["catalogB","catalogX"]
item6   600.7   ["catalogB"]
item7   700.8   ["catalogC"]
item8   800.9   ["catalogC","catalogD"]
item9   899.99  ["catalogC","catalogA"]

Compute the total amount of each order for each user:

select orders.user_id, orders.order_id, round(sum(d.price*orders.amount), 2) as order_price
from (
  select user_id, order_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount
) orders
join d_items d
on (orders.item = d.item_sku)
group by orders.user_id, orders.order_id;

11      10101043497459  899.99
11      10103043501575  3903.69
11      10703007267488  1702.0
11      12345678901235  700.9
22      10101043505096  2402.9
22      10104043514061  300.6
22      12345678901237  2800.17
33      10103043509747  4905.6
33      11003002067594  901.2
33      12345678901234  600.8
35      10203019269975  600.8
35      12345678901236  1101.4
77      10203019262235  1101.4
77      12345678901238  2502.9
789     10401003346256  4604.19
99      10103044681799  2800.17
99      12345678901239  2800.17
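
As a sanity check on the first order of user 11, the same arithmetic can be reproduced outside Hive. The sketch below is plain Python, not Hive code; order_total is a hypothetical helper, and the prices are copied from the d_items output.

```python
# Prices copied from the d_items output (only the SKUs needed here)
prices = {"item1": 100.2, "item5": 500.6, "item8": 800.9}


def order_total(items, prices):
    """Sum price * quantity over the items of one order, rounded to 2 places."""
    return round(sum(prices[sku] * qty for sku, qty in items.items()), 2)


# Order 10703007267488 of user 11: two of item8 and one of item1
total = order_total({"item8": 2, "item1": 1}, prices)
print(total)
```

One difference from the query: the inner join silently drops an item whose SKU is missing from d_items, whereas this sketch raises a KeyError for it.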

List each user together with the items ordered and their quantities:

select user_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount;
11      item8   2
11      item1   1
22      item6   3
22      item3   2
33      item7   7
11      item5   5
11      item1   1
11      item4   1
11      item9   1
22      item1   3
33      item4   2
33      item1   1
11      item9   1
35      item5   1
35      item1   1
789     item7   3
789     item8   2
789     item9   1
77      item5   2
77      item1   1
99      item9   3
99      item1   1
33      item5   1
33      item1   1
11      item6   1
11      item1   1
35      item5   2
35      item1   1
22      item9   3
22      item1   1
77      item8   3
77      item1   1
99      item9   3
99      item1   1

Relationship between items and categories (with the catalogs array exploded):

select item_sku, catalog from d_items LATERAL VIEW explode(catalogs) t AS catalog;
item1   catalogA
item1   catalogD
item1   catalogX
item2   catalogA
item3   catalogA
item3   catalogX
item4   catalogB
item5   catalogB
item5   catalogX
item6   catalogB
item7   catalogC
item8   catalogC
item8   catalogD
item9   catalogC
item9   catalogA

Relationship between users, items, item quantities, and categories (with the catalogs array exploded):

select orders.user_id, orders.item, orders.amount, catalogs.catalog
from (
  select user_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount
) orders
join (
  select item_sku, catalog from d_items LATERAL VIEW explode(catalogs) t AS catalog
) catalogs
on (orders.item = catalogs.item_sku)
;
11      item8   2       catalogC
11      item8   2       catalogD
11      item1   1       catalogA
11      item1   1       catalogD
11      item1   1       catalogX
22      item6   3       catalogB
22      item3   2       catalogA
22      item3   2       catalogX
33      item7   7       catalogC
11      item5   5       catalogB
11      item5   5       catalogX
11      item1   1       catalogA
11      item1   1       catalogD
11      item1   1       catalogX
11      item4   1       catalogB
11      item9   1       catalogC
11      item9   1       catalogA
22      item1   3       catalogA
22      item1   3       catalogD
22      item1   3       catalogX
33      item4   2       catalogB
33      item1   1       catalogA
33      item1   1       catalogD
33      item1   1       catalogX
11      item9   1       catalogC
11      item9   1       catalogA
35      item5   1       catalogB
35      item5   1       catalogX
35      item1   1       catalogA
35      item1   1       catalogD
35      item1   1       catalogX
789     item7   3       catalogC
789     item8   2       catalogC
789     item8   2       catalogD
789     item9   1       catalogC
789     item9   1       catalogA
77      item5   2       catalogB
77      item5   2       catalogX
77      item1   1       catalogA
77      item1   1       catalogD
77      item1   1       catalogX
99      item9   3       catalogC
99      item9   3       catalogA
99      item1   1       catalogA
99      item1   1       catalogD
99      item1   1       catalogX
33      item5   1       catalogB
33      item5   1       catalogX
33      item1   1       catalogA
33      item1   1       catalogD
33      item1   1       catalogX
11      item6   1       catalogB
11      item1   1       catalogA
11      item1   1       catalogD
11      item1   1       catalogX
35      item5   2       catalogB
35      item5   2       catalogX
35      item1   1       catalogA
35      item1   1       catalogD
35      item1   1       catalogX
22      item9   3       catalogC
22      item9   3       catalogA
22      item1   1       catalogA
22      item1   1       catalogD
22      item1   1       catalogX
77      item8   3       catalogC
77      item8   3       catalogD
77      item1   1       catalogA
77      item1   1       catalogD
77      item1   1       catalogX
99      item9   3       catalogC
99      item9   3       catalogA
99      item1   1       catalogA
99      item1   1       catalogD
99      item1   1       catalogX

Write the result to the usr_cat_weight table:

create table usr_cat_weight as
select orders.user_id, catalogs.catalog, sum(orders.amount) as weight
from (
  select user_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount
) orders
join (
  select item_sku, catalog from d_items LATERAL VIEW explode(catalogs) t AS catalog
) catalogs
on (orders.item = catalogs.item_sku)
group by orders.user_id, catalogs.catalog
order by user_id, weight desc;
select * from usr_cat_weight;
11      catalogX        8
11      catalogB        7
11      catalogD        5
11      catalogA        5
11      catalogC        4
22      catalogA        9
22      catalogX        6
22      catalogD        4
22      catalogB        3
22      catalogC        3
33      catalogC        7
33      catalogX        3
33      catalogB        3
33      catalogA        2
33      catalogD        2
35      catalogX        5
35      catalogB        3
35      catalogA        2
35      catalogD        2
77      catalogD        5
77      catalogX        4
77      catalogC        3
77      catalogA        2
77      catalogB        2
789     catalogC        6
789     catalogD        2
789     catalogA        1
99      catalogA        8
99      catalogC        6
99      catalogD        2
99      catalogX        2
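
The weight of a category for a user is the sum of the quantities of all purchased items that belong to it. A minimal Python sketch of that aggregation for user 789 follows; it is not Hive code, category_weights is a hypothetical helper, and the inputs are copied from the f_orders and d_items outputs.

```python
from collections import defaultdict

# Category arrays copied from the d_items output (only the SKUs used here)
catalogs = {
    "item7": ["catalogC"],
    "item8": ["catalogC", "catalogD"],
    "item9": ["catalogC", "catalogA"],
}


def category_weights(orders, catalogs):
    """Sum item quantities per (user_id, catalog) pair."""
    weights = defaultdict(int)
    for user_id, items in orders:
        for sku, qty in items.items():         # models explode(items)
            for cat in catalogs.get(sku, []):  # models explode(catalogs) plus the inner join
                weights[(user_id, cat)] += qty
    return dict(weights)


# User 789 has a single order in f_orders
w = category_weights([("789", {"item7": 3, "item8": 2, "item9": 1})], catalogs)
print(w)
```

Because an item can sit in several categories, its quantity is counted once per category; the weights for a user therefore add up to more than the number of units bought.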

row_number: a sequential row number within each partition (tied rows still get distinct numbers)

select user_id, catalog, weight, row_number() OVER (PARTITION BY user_id ORDER BY weight DESC) as row_num FROM usr_cat_weight where user_id < '33';
11      catalogX        8       1
11      catalogB        7       2
11      catalogA        5       3
11      catalogD        5       4
11      catalogC        4       5
22      catalogA        9       1
22      catalogX        6       2
22      catalogD        4       3
22      catalogC        3       4
22      catalogB        3       5

rank: equal values receive the same rank, and the rank numbers after a tie are skipped

select user_id, catalog, weight, rank() OVER (PARTITION BY user_id ORDER BY weight DESC) as rnk FROM usr_cat_weight where user_id < '33';
11      catalogX        8       1
11      catalogB        7       2
11      catalogA        5       3
11      catalogD        5       3
11      catalogC        4       5
22      catalogA        9       1
22      catalogX        6       2
22      catalogD        4       3
22      catalogC        3       4
22      catalogB        3       4

dense_rank: equal values receive the same rank, and no rank numbers are skipped after a tie

select user_id, catalog, weight, dense_rank() OVER (PARTITION BY user_id ORDER BY weight DESC) as drnk FROM usr_cat_weight where user_id < '33';
11      catalogX        8       1
11      catalogB        7       2
11      catalogA        5       3
11      catalogD        5       3
11      catalogC        4       4
22      catalogA        9       1
22      catalogX        6       2
22      catalogD        4       3
22      catalogC        3       4
22      catalogB        3       4
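
The difference between the three functions is easiest to see side by side on one partition. The following Python sketch (a model of the semantics, not Hive code; the function names simply mirror the Hive ones) applies them to the weights of user 11, already sorted by weight DESC.

```python
def row_number(values):
    """1, 2, 3, ... in sorted order; ties still get distinct numbers."""
    return list(range(1, len(values) + 1))


def rank(values):
    """Ties share a rank; the rank after a tie skips ahead to the row position."""
    out = []
    for i, v in enumerate(values):
        out.append(out[-1] if i > 0 and v == values[i - 1] else i + 1)
    return out


def dense_rank(values):
    """Ties share a rank; the rank after a tie is just the previous rank plus one."""
    out = []
    for i, v in enumerate(values):
        if i == 0:
            out.append(1)
        elif v == values[i - 1]:
            out.append(out[-1])
        else:
            out.append(out[-1] + 1)
    return out


weights = [8, 7, 5, 5, 4]  # user 11, sorted by weight DESC
print(row_number(weights))
print(rank(weights))
print(dense_rank(weights))
```

With row_number, which of two tied rows gets the lower number is not determined by ORDER BY weight alone; adding a second sort key such as catalog would make the result reproducible.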

Store each user's category ranking in the usr_cat table:

CREATE TABLE usr_cat AS
select user_id, catalog, row_number() OVER (PARTITION BY user_id ORDER BY weight DESC) as row_num
FROM (
select orders.user_id, catalogs.catalog, sum(orders.amount) as weight
from (
  select user_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount
) orders
join (
  select item_sku, catalog from d_items LATERAL VIEW explode(catalogs) t AS catalog
) catalogs
on (orders.item = catalogs.item_sku)
group by orders.user_id, catalogs.catalog
order by user_id, weight
) x
ORDER BY user_id, row_num;
select * from usr_cat;
11      catalogX        1
11      catalogB        2
11      catalogA        3
11      catalogD        4
11      catalogC        5
22      catalogA        1
22      catalogX        2
22      catalogD        3
22      catalogC        4
22      catalogB        5
33      catalogC        1
33      catalogB        2
33      catalogX        3
33      catalogD        4
33      catalogA        5
35      catalogX        1
35      catalogB        2
35      catalogA        3
35      catalogD        4
77      catalogD        1
77      catalogX        2
77      catalogC        3
77      catalogA        4
77      catalogB        5
789     catalogC        1
789     catalogD        2
789     catalogA        3
99      catalogA        1
99      catalogC        2
99      catalogD        3
99      catalogX        4

Create the users table:

CREATE EXTERNAL TABLE d_users (
    user_id  STRING
  , gender   STRING
  , birthday STRING
  , email    STRING
  , regday   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073';

Load the data:

load data local inpath '/home/spark/software/data/d_users.txt' overwrite into table d_users;

Query:

select * from d_users;
11      m       1981-01-01      张三@gmail.com        2014-04-21
22      w       1982-01-01      user22@abcn.net 2014-04-22
33      m       1983-01-01      user33@fxlive.de        2014-04-23
77      w       1977-01-01      user77@fxlive.fr        2014-05-01
88      m       1988-01-01      user88@fxlive.eu        2014-05-02
99      w       1999-01-01      user99@abcn.net 2014-05-03
789     m       2008-01-01      admin@abcn.net  2014-05-03

Simple data protection with Hive's translate function

select user_id, birthday, translate(birthday, '0123456789', '1234567890'), email, translate(email, 'userfxgmail1234567890', '1234567890userfxgmail') from d_users;
11      1981-01-01      2092-12-12      user11@gmail.com        1234ss@7890u.co8
22      1982-01-01      2093-12-12      user22@abcn.net 1234ee@9bcn.n3t
33      1983-01-01      2094-12-12      user33@fxlive.de        1234rr@56u0v3.d3
77      1977-01-01      2088-12-12      user77@fxlive.fr        1234mm@56u0v3.54
88      1988-01-01      2099-12-12      user88@fxlive.eu        1234aa@56u0v3.31
99      1999-01-01      2000-12-12      user99@abcn.net 1234ii@9bcn.n3t
789     2008-01-01      3119-12-12      admin@abcn.net  9d80n@9bcn.n3t
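
translate is a character-by-character substitution: each character found in the from string is replaced by the character at the same position in the to string. The Python sketch below models that behavior (it is not Hive code, and hive_like_translate is a hypothetical name). Deleting characters when the to string is shorter, as PostgreSQL's translate does, is an assumption of this sketch.

```python
def hive_like_translate(text, from_chars, to_chars):
    """Replace each character of text found in from_chars with its counterpart in to_chars."""
    mapping = {}
    for i, ch in enumerate(from_chars):
        if ch in mapping:
            continue  # first occurrence wins if from_chars repeats a character
        # Assumption: characters with no counterpart in to_chars are deleted
        mapping[ch] = to_chars[i] if i < len(to_chars) else None
    out = []
    for ch in text:
        if ch not in mapping:
            out.append(ch)           # unmapped characters pass through unchanged
        elif mapping[ch] is not None:
            out.append(mapping[ch])  # mapped characters are substituted
    return "".join(out)


masked_birthday = hive_like_translate("1981-01-01", "0123456789", "1234567890")
masked_email = hive_like_translate(
    "user22@abcn.net", "userfxgmail1234567890", "1234567890userfxgmail"
)
print(masked_birthday)
print(masked_email)
```

Swapping the two mapping strings reverses the substitution, so anyone who knows or guesses the mapping can recover the original values. This is obfuscation for casual viewing rather than encryption.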
Time: 2024-10-10 10:21:44
