Using CASE and IF in Hive: a region statistics job (syntax of the Hive conditional functions CASE, IF, and COALESCE: CONDITIONAL FUNCTIONS IN HIVE)

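Preface: lessons learned from designing Hive QL
1. When a query gets complicated, process it in steps: split one complex piece of logic into several simple sub-steps.
2. But merge whatever can be merged. For example, several CONCAT calls at the same level belong in one SELECT.

In other words, if fields are processed in parallel at the same level, put them in one Hive QL statement. If fields depend on one another (a check, a fill-in value, a calculation), run the work in steps: prepare each field first, then apply the simple logic step by step.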

1. Scenario: processing region data in the logs (country, province, city)
select city_id,province_id,country_id
from wizad_mdm_cleaned_hdfs
where city_id = '' or country_id = '' or province_id = ''
group by city_id,province_id,country_id
2. The log turns out to contain empty values (columns in SELECT order: city_id, province_id, country_id):
38              1
        73      1
        75      1
64      81
        76      1
                      (all empty)
        77         
3. Define the filtering logic
if country_id = ''
         if province_id != '' then
                   if city_id = '' then CONCAT('region_','1','_',province_id)
                   else CONCAT('region_','1','_',province_id,'_',city_id)
         else
                   if city_id != '' then CONCAT('region_','1','_',parent_region_id,'_',city_id)
else
         if province_id = ''
                   if city_id != '' then CONCAT('region_',country_id,'_',parent_region_id,'_',city_id)
4. Hive QL implementation
SET mapred.queue.names=queue3;
SET mapred.reduce.tasks=14;
DROP TABLE IF EXISTS test_lmj_mdm_tmp1;
CREATE TABLE test_lmj_mdm_tmp1 AS
SELECT
guid,
-- region: an empty country_id becomes '1'; an empty province_id is replaced by parent_region_id when city_id is present
(CASE country_id
WHEN '' THEN (CASE WHEN province_id = ''
    THEN IF(city_id = '', '', CONCAT('region_','1','_',parent_region_id,'_',city_id))
    ELSE IF(city_id = '', CONCAT('region_','1','_',province_id), CONCAT('region_','1','_',province_id,'_',city_id)) END)
ELSE (CASE WHEN province_id = ''
    THEN IF(city_id = '', CONCAT('region_',country_id), CONCAT('region_',country_id,'_',parent_region_id,'_',city_id))
    ELSE IF(city_id = '', CONCAT('region_',country_id,'_',province_id), CONCAT('region_',country_id,'_',province_id,'_',city_id)) END)
END) AS region,
-- carrier: connection_type '2' is reported as wifi
(CASE connection_type WHEN '2' THEN CONCAT('carrier_','wifi') ELSE CONCAT('carrier_',c.element_id) END) AS carrier,
SUM(CASE WHEN logtype = '1' THEN 1 ELSE 0 END) AS imp_pv,
SUM(CASE WHEN logtype = '2' THEN 1 ELSE 0 END) AS clk_pv
FROM wizad_mdm_cleaned_hdfs a
left outer join wizad_mdm_dev_lmj_ad_campaign_industry_brand b
ON (a.wizad_ad_id = b.ad_id)
left outer join (SELECT * FROM wizad_mdm_dev_lmj_mapping_table_analytics WHERE TYPE = '7') c
ON (a.adn_id = c.ad_network_id AND a.carrier_id = c.mapping_id)
left outer join wizad_mdm_dev_lmj_app_category_analytics d
ON (a.app_category_id = d.adn_category)
left outer join (select region_template_id,parent_region_id from wizad_mdm_dev_lmj_region_template) e
ON (a.city_id = e.region_template_id)
WHERE a.day = '2015-01-01'
-- GROUP BY repeats the SELECT expressions verbatim, because aliases cannot be used here
GROUP BY guid,
(CASE country_id
WHEN '' THEN (CASE WHEN province_id = ''
    THEN IF(city_id = '', '', CONCAT('region_','1','_',parent_region_id,'_',city_id))
    ELSE IF(city_id = '', CONCAT('region_','1','_',province_id), CONCAT('region_','1','_',province_id,'_',city_id)) END)
ELSE (CASE WHEN province_id = ''
    THEN IF(city_id = '', CONCAT('region_',country_id), CONCAT('region_',country_id,'_',parent_region_id,'_',city_id))
    ELSE IF(city_id = '', CONCAT('region_',country_id,'_',province_id), CONCAT('region_',country_id,'_',province_id,'_',city_id)) END)
END),
(CASE connection_type WHEN '2' THEN CONCAT('carrier_','wifi') ELSE CONCAT('carrier_',c.element_id) END);
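5. Analysis of the Hive QL statement

The example above uses CASE and IF; for the syntax, see section 7, CONDITIONAL FUNCTIONS IN HIVE, at the end.

Notes:

1. A special form of CASE: the value after CASE can be omitted and a condition placed after each WHEN instead, e.g. case when a=1 then true else false end;

2. Every transformed field in the SELECT list must also appear in GROUP BY, as with the CASE expression for carrier (otherwise it cannot be selected). GROUP BY cannot take an alias (AS), although SELECT can. The same applies when substring is used on the time: the expression appears in both SELECT and GROUP BY.

3. Use a subquery in the LEFT OUTER JOIN to reduce the memory needed for the join.

4. Whether IF can be used depends on the Hive version.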
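
Notes 1 and 2 can be sketched as follows. This is a minimal illustration only: the table `demo_log` and its columns are hypothetical and are not part of the job above. The first query repeats the CASE expression in GROUP BY, as note 2 requires. The second computes the expression once in a subquery, so the outer query can group by the resulting column name.

```sql
-- Hypothetical table for illustration:
--   demo_log(connection_type STRING, element_id STRING, logtype STRING)

-- Variant A: simple CASE (a value follows CASE), repeated verbatim in GROUP BY.
SELECT
  CASE connection_type
    WHEN '2' THEN 'carrier_wifi'
    ELSE CONCAT('carrier_', element_id)
  END AS carrier,
  -- Searched CASE: nothing follows CASE, each WHEN holds a full condition.
  SUM(CASE WHEN logtype = '1' THEN 1 ELSE 0 END) AS imp_pv
FROM demo_log
GROUP BY
  CASE connection_type
    WHEN '2' THEN 'carrier_wifi'
    ELSE CONCAT('carrier_', element_id)
  END;

-- Variant B: the expression is written once in a subquery,
-- and the outer query groups by the column it produces.
SELECT carrier, SUM(imp) AS imp_pv
FROM (
  SELECT
    CASE connection_type
      WHEN '2' THEN 'carrier_wifi'
      ELSE CONCAT('carrier_', element_id)
    END AS carrier,
    CASE WHEN logtype = '1' THEN 1 ELSE 0 END AS imp
  FROM demo_log
) t
GROUP BY carrier;
```

Variant B avoids the risk of the SELECT and GROUP BY copies drifting apart, at the cost of one extra query level.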

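6. Refactoring the Hive QL design
Beginners like me tend to design complicated logic and monstrous statements.
In practice, an experienced person faced with overly complex logic splits the work into steps. A colleague who is strong in SQL refactored the example above into two steps:
 - 1) First fill in a reasonable value for each field (fill in what can be filled, set the rest to empty)
 - 2) Then, when building region, simply filter out the records with invalid values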
6.1 Step 1 statement
DROP TABLE IF EXISTS test_lmj_mdm_tmp;
CREATE TABLE test_lmj_mdm_tmp AS
SELECT
guid,
CONCAT('adn_',adn_id) AS adn,
CONCAT('time_',substr(createtime,12,2)) AS hour,
CONCAT('os_',os_id) AS os,
-- country_id: all three fields empty -> ''; only country empty -> default '1'
case when (country_id = '' or country_id = 'NULL' or country_id is null)
            and (province_id = '' or province_id = 'NULL' or province_id is null)
            and (city_id = '' or city_id = 'NULL' or city_id is null)
        then ''
     when (country_id = '' or country_id = 'NULL' or country_id is null)
            and ((province_id <> '' and province_id <> 'NULL' and province_id is not null)
              or (city_id <> '' and city_id <> 'NULL' and city_id is not null))
        then '1'
     else country_id end as country_id,
-- province_id: when empty, fall back to the city's parent_region_id if one exists
case when (province_id = '' or province_id = 'NULL' or province_id is null)
            and e.parent_region_id <> '' and e.parent_region_id <> 'NULL' and e.parent_region_id is not null
        then e.parent_region_id
     else province_id end as province_id,
city_id,
CONCAT('campaign_',b.campaign_id) AS campaign,
CONCAT('interest_',b.industry_id) AS interest,
CONCAT('brand_',b.brand_id) AS brand,
(CASE connection_type WHEN '2' THEN CONCAT('carrier_','wifi') ELSE CONCAT('carrier_',c.element_id) END) AS carrier,
CONCAT('appcategory_',d.wizad_category) AS appcategory,
uid,
SUM(CASE WHEN logtype = '1' THEN 1 ELSE 0 END) AS imp_pv,
SUM(CASE WHEN logtype = '2' THEN 1 ELSE 0 END) AS clk_pv
FROM ${clean_log_table} a
left outer join wizad_mdm_dev_lmj_ad_campaign_industry_brand b
ON (a.wizad_ad_id = b.ad_id)
left outer join (SELECT * FROM wizad_mdm_dev_lmj_mapping_table_analytics WHERE TYPE = '7') c
ON (a.adn_id = c.ad_network_id AND a.carrier_id = c.mapping_id)
left outer join wizad_mdm_dev_lmj_app_category_analytics d
ON (a.app_category_id = d.adn_category)
left outer join (select region_template_id, parent_region_id from wizad_mdm_dev_lmj_region_template) e
ON (a.city_id = e.region_template_id)
WHERE a.day < '${pt}' and a.day >= '${time_span}'
-- GROUP BY repeats every non-aggregated SELECT expression, without aliases
GROUP BY guid,
CONCAT('adn_',adn_id),
CONCAT('time_',substr(createtime,12,2)),
CONCAT('os_',os_id),
case when (country_id = '' or country_id = 'NULL' or country_id is null)
          and (province_id = '' or province_id = 'NULL' or province_id is null)
          and (city_id = '' or city_id = 'NULL' or city_id is null)
          then ''
     when (country_id = '' or country_id = 'NULL' or country_id is null)
          and ((province_id <> '' and province_id <> 'NULL' and province_id is not null)
            or (city_id <> '' and city_id <> 'NULL' and city_id is not null))
          then '1'
     else country_id end,
case when (province_id = '' or province_id = 'NULL' or province_id is null)
          and e.parent_region_id <> '' and e.parent_region_id <> 'NULL' and e.parent_region_id is not null
          then e.parent_region_id
     else province_id end,
city_id,
CONCAT('campaign_',b.campaign_id),
CONCAT('interest_',b.industry_id),
CONCAT('brand_',b.brand_id),
(CASE connection_type WHEN '2' THEN CONCAT('carrier_','wifi') ELSE CONCAT('carrier_',c.element_id) END),
CONCAT('appcategory_',d.wizad_category),
UID;
6.2 Step 2 statement
SELECT guid,
CONCAT('region_',country_id,'_',province_id,(case when city_id <> '' and city_id <> 'NULL' and city_id is not null then concat('_',city_id) else '' end)) AS fixeddim,
UID,
SUM(imp_pv) AS pv
FROM test_lmj_mdm_tmp
where imp_pv > 0
and country_id <> ''
and country_id <> 'NULL'
and country_id is not null
and province_id <> ''
and province_id <> 'NULL'
and province_id is not null
GROUP BY guid,
CONCAT('region_',country_id,'_',province_id,(case when city_id <> '' and city_id <> 'NULL' and city_id is not null then concat('_',city_id) else '' end)),
UID
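
The three-way emptiness test (the empty string, the literal string 'NULL', and SQL NULL) is repeated many times across the two steps. One possible way to shorten it, shown here only as a sketch and not as part of the colleague's refactoring, is to map all three forms to SQL NULL once in a subquery, so that later steps need only `IS NOT NULL`. The table `demo_region` and its columns are hypothetical.

```sql
-- Hypothetical table for illustration:
--   demo_region(guid STRING, country_id STRING, province_id STRING,
--               city_id STRING, imp_pv BIGINT)
SELECT
  guid,
  CONCAT('region_', country_id, '_', province_id,
         CASE WHEN city_id IS NOT NULL THEN CONCAT('_', city_id) ELSE '' END) AS fixeddim,
  SUM(imp_pv) AS pv
FROM (
  SELECT
    guid,
    -- A CASE with no ELSE yields NULL when its condition is not true,
    -- so '', 'NULL' and SQL NULL all come out as SQL NULL.
    CASE WHEN country_id  NOT IN ('', 'NULL') THEN country_id  END AS country_id,
    CASE WHEN province_id NOT IN ('', 'NULL') THEN province_id END AS province_id,
    CASE WHEN city_id     NOT IN ('', 'NULL') THEN city_id     END AS city_id,
    imp_pv
  FROM demo_region
) t
WHERE imp_pv > 0
  AND country_id IS NOT NULL
  AND province_id IS NOT NULL
GROUP BY
  guid,
  CONCAT('region_', country_id, '_', province_id,
         CASE WHEN city_id IS NOT NULL THEN CONCAT('_', city_id) ELSE '' END);
```

When a column is SQL NULL, `NOT IN ('', 'NULL')` evaluates to NULL rather than true, so the CASE falls through to its implicit NULL result, which is the behavior this sketch relies on.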

The following is quoted from the web.

7. CONDITIONAL FUNCTIONS IN HIVE

Hive supports three types of conditional functions. These functions are listed below:

IF( Test Condition, True Value, False Value )

The IF condition evaluates the "Test Condition" and if the "Test Condition" is true, then it returns the "True Value". Otherwise, it returns the False Value. Example: IF(1=1, 'working', 'not working') returns 'working'

COALESCE( value1, value2, ... )

The COALESCE function returns the first non-NULL value from the list of values. If all the values in the list are NULL, then it returns NULL.

Example: COALESCE(NULL,NULL,5,NULL,4) returns 5
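
The two examples above can be tried as a single query. This is a minimal sketch that assumes a Hive version accepting SELECT without a FROM clause; on older versions, select the same expressions from any one-row table. The third column is an added illustration of why the region job tests for '' and 'NULL' explicitly: COALESCE skips only SQL NULL, not empty strings.

```sql
SELECT
  IF(1 = 1, 'working', 'not working') AS if_demo,        -- 'working'
  COALESCE(NULL, NULL, 5, NULL, 4)    AS coalesce_demo,  -- 5
  -- An empty string is not NULL, so COALESCE returns it unchanged.
  COALESCE('', 'fallback')            AS empty_is_kept;  -- ''
```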

CASE Statement

The syntax for the case statement is:

    CASE [ expression ]
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    ...
    WHEN conditionn THEN resultn
    ELSE result END

Here expression is optional. It is the value that you are comparing to the list of conditions (i.e. condition1, condition2, ... conditionn). All the conditions must be of the same datatype. Conditions are evaluated in the order listed. Once a condition is found to be true, the case statement will return the result and not evaluate the conditions any further.

All the results must be of the same datatype. This is the value returned once a condition is found to be true.

If no condition is found to be true, then the case statement will return the value in the ELSE clause. If the ELSE clause is omitted and no condition is found to be true, then the case statement will return NULL.

Example:

    CASE   Fruit
        WHEN 'APPLE' THEN 'The owner is APPLE'
        WHEN 'ORANGE' THEN 'The owner is ORANGE'
        ELSE 'It is another Fruit'
    END

The other form of CASE is

    CASE
         WHEN Fruit = 'APPLE' THEN 'The owner is APPLE'
         WHEN Fruit = 'ORANGE' THEN 'The owner is ORANGE'
         ELSE 'It is another Fruit'
    END

Source: http://www.folkstalk.com/2011/11/conditional-functions-in-hive.html
Time: 2025-01-15 19:51:01
