hive on tez sql 优化

select t.id end_usr_id,

t.end_user_name end_usr_name,

t.end_user_real_name end_usr_nknm,

t.end_user_password end_usr_pswd,

t.end_user_birthday end_usr_bthd,

t.mobile mobl_num,

t.phone phone_num,

t.valid_mobile_phone_num bindg_mobl_num,

t.end_user_email end_usr_email,

substr(t.end_user_email,

instr(t.end_user_email, ‘@‘),

instr(t.end_user_email, ‘.‘) - instr(t.end_user_email, ‘@‘)) email_class,

t.end_user_create_time creat_time,

t.end_user_last_login_date end_usr_last_login_date,

t.end_user_last_bought_date end_usr_last_buy_date,

t.end_user_login_times end_usr_login_times,

t.end_user_bought_times end_usr_buy_times,

t.end_user_bought_amount end_usr_buy_amt,

t.end_user_type end_usr_type,

t.ip ip,

t.end_user_points end_usr_intgl,

t.id_card id_card,

t.end_user_sex sex,

t.mc_site_id mc_site_id,

case

when t.is_email_activate = 1 then

1

else

0

end as email_verfd_flg,

case

when t.valid_mobile_phone_num is not null then

1

else

0

end as mobl_bding_flg,

case

when t.end_user_name like ‘%@sina%‘ then

1

when t.end_user_name like ‘%@pingan%‘ then

2

when t.end_user_name like ‘%@alipay%‘ and

not (t2.user_grade in (1, 2) and t2.user_grade_type = 1) then

3

when t.end_user_name like ‘%@alipay%‘ and t2.user_grade in (1, 2) and

t2.user_grade_type = 1 then

4

when t.end_user_name like ‘%@163%‘ then

5

when t.end_user_name like ‘%@kaixin001%‘ then

6

when t.end_user_name like ‘%@139%‘ then

7

when t.end_user_name like ‘%@msn.com%‘ then

8

when t.end_user_name like ‘%@anyue%‘ then

9

when t.end_user_name like ‘%@qq%‘ then

10

else

0

end as commn_login_type,

case

when t.end_user_name like ‘%@b2b%‘ or t.end_user_name = ‘yuxiaolan‘ then

1

else

0

end as b2b_flg,

case

when t3.send_to is not null then

1

else

0

end as email_ret_flg,

case

when t4.email is not null then

1

else

0

end as email_blacklist_flg,

case

when t5.car = ‘1‘ then

1

else

0

end as car_flg,

t5.shopping_habit shppg_prefr,

t6.mail_level email_lvl,

‘‘ whsl_flg,

‘‘ xh_flg,

‘‘ exptn_flg,

‘‘ b2b_vip_flg,

t.member_grade end_usr_grd_id,

case

when t.co_code is null then

null

when t.co_code is not null and t7.usr_union_logon_id is null then

-999999

else

t7.usr_union_logon_id

end usr_union_logon_id,

t.co_code usr_union_logon_code,

t8.exp end_usr_grow_val

from tandem.end_user t

left outer join tandem.alipay_user t2

on t.id = t2.end_user_id

left outer join (select distinct lower(send_to) as send_to

from tandem.yhd_send_message_blacklist) t3

on lower(t.end_user_email) = t3.send_to

left outer join (select distinct lower(email) as email

from tandem.edm_error_email) t4

on lower(t.end_user_email) = t4.email

left outer join tmp_end_usr_info t5

on t.id = t5.end_usr_id

left outer join tandem.crm_user_quality t6

on t.end_user_email = t6.email

and t6.user_id is not null

left outer join dw.usr_union_logon t7

on t.co_code = t7.usr_union_logon_code

and t7.usr_union_logon_code is not null

left outer join tandem.member t8

on t.id = t8.end_user_id

and to_date(t8.update_time) = ‘2015-05-04‘

where t.update_time >= ‘2015-05-04‘

or length(t8.end_user_id)>0;

运行速度慢

执行计划如下:

STAGE DEPENDENCIES:

Stage-1 is a root stage

Stage-0 depends on stages: Stage-1

STAGE PLANS:

Stage: Stage-1

Tez

Edges:

Reducer 2 <- Map 1 (SIMPLE_EDGE)

Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE), Reducer 8 (SIMPLE_EDGE)

Reducer 4 <- Map 11 (SIMPLE_EDGE), Map 12 (BROADCAST_EDGE), Reducer 3 (SIMPLE_EDGE)

Reducer 6 <- Map 5 (SIMPLE_EDGE)

Reducer 8 <- Map 10 (SIMPLE_EDGE), Map 13 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE), Map 9 (SIMPLE_EDGE)

DagName: zhoushugang_20150505143131_1821ced5-3723-4094-b6a3-c2aae136cd32:2

Vertices:

Map 1

Map Operator Tree:

TableScan

alias: yhd_send_message_blacklist

Statistics: Num rows: 870860 Data size: 46859629 Basic stats: COMPLETE Column stats: NONE

Select Operator

expressions: send_to (type: string)

outputColumnNames: send_to

Statistics: Num rows: 870860 Data size: 46859629 Basic stats: COMPLETE Column stats: NONE

Group By Operator

keys: lower(send_to) (type: string)

mode: hash

outputColumnNames: _col0

Statistics: Num rows: 870860 Data size: 46859629 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col0 (type: string)

sort order: +

Map-reduce partition columns: _col0 (type: string)

Statistics: Num rows: 870860 Data size: 46859629 Basic stats: COMPLETE Column stats: NONE

Map 10

Map Operator Tree:

TableScan

alias: t5

Statistics: Num rows: 4414357 Data size: 67807378 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: end_usr_id (type: bigint)

sort order: +

Map-reduce partition columns: end_usr_id (type: bigint)

Statistics: Num rows: 4414357 Data size: 67807378 Basic stats: COMPLETE Column stats: NONE

value expressions: car (type: string), shopping_habit (type: string)

Map 11

Map Operator Tree:

TableScan

alias: t6

Statistics: Num rows: 38930100 Data size: 4515891712 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: user_id is not null (type: boolean)

Statistics: Num rows: 19465050 Data size: 2257945856 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: email (type: string)

sort order: +

Map-reduce partition columns: email (type: string)

Statistics: Num rows: 19465050 Data size: 2257945856 Basic stats: COMPLETE Column stats: NONE

value expressions: mail_level (type: double)

Map 12

Map Operator Tree:

TableScan

alias: t7

Statistics: Num rows: 16 Data size: 1752 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: usr_union_logon_code is not null (type: boolean)

Statistics: Num rows: 8 Data size: 876 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: usr_union_logon_code (type: string)

sort order: +

Map-reduce partition columns: usr_union_logon_code (type: string)

Statistics: Num rows: 8 Data size: 876 Basic stats: COMPLETE Column stats: NONE

value expressions: usr_union_logon_id (type: bigint)

Map 13

Map Operator Tree:

TableScan

alias: t8

Statistics: Num rows: 25660729 Data size: 3178607508 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: (to_date(update_time) = ‘2015-05-04‘) (type: boolean)

Statistics: Num rows: 12830364 Data size: 1589303692 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: end_user_id (type: bigint)

sort order: +

Map-reduce partition columns: end_user_id (type: bigint)

Statistics: Num rows: 12830364 Data size: 1589303692 Basic stats: COMPLETE Column stats: NONE

value expressions: exp (type: int)

Map 5

Map Operator Tree:

TableScan

alias: edm_error_email

Statistics: Num rows: 5137907 Data size: 513790784 Basic stats: COMPLETE Column stats: NONE

Select Operator

expressions: email (type: string)

outputColumnNames: email

Statistics: Num rows: 5137907 Data size: 513790784 Basic stats: COMPLETE Column stats: NONE

Group By Operator

keys: lower(email) (type: string)

mode: hash

outputColumnNames: _col0

Statistics: Num rows: 5137907 Data size: 513790784 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col0 (type: string)

sort order: +

Map-reduce partition columns: _col0 (type: string)

Statistics: Num rows: 5137907 Data size: 513790784 Basic stats: COMPLETE Column stats: NONE

Map 7

Map Operator Tree:

TableScan

alias: t

Statistics: Num rows: 108659208 Data size: 49700768672 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: id (type: bigint)

sort order: +

Map-reduce partition columns: id (type: bigint)

Statistics: Num rows: 108659208 Data size: 49700768672 Basic stats: COMPLETE Column stats: NONE

value expressions: end_user_name (type: string), end_user_password (type: string), end_user_real_name (type: string), end_user_birthday (type: string), end_user_last_login_date (type: string), end_user_last_bought_date (type: string), end_user_login_times (type: double), end_user_bought_amount (type: double), end_user_bought_times (type: double), end_user_sex (type: double), end_user_create_time (type: string), end_user_type (type: double), ip (type: string), end_user_points (type: double), co_code (type: string), is_email_activate (type: double), mc_site_id (type: bigint), update_time (type: string), member_grade (type: double), end_user_email (type: string), mobile (type: string), phone (type: string), valid_mobile_phone_num (type: int), id_card (type: string)

Map 9

Map Operator Tree:

TableScan

alias: t2

Statistics: Num rows: 11871989 Data size: 422420127 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: end_user_id (type: bigint)

sort order: +

Map-reduce partition columns: end_user_id (type: bigint)

Statistics: Num rows: 11871989 Data size: 422420127 Basic stats: COMPLETE Column stats: NONE

value expressions: user_grade (type: double), user_grade_type (type: double)

Reducer 2

Reduce Operator Tree:

Group By Operator

keys: KEY._col0 (type: string)

mode: mergepartial

outputColumnNames: _col0

Statistics: Num rows: 435430 Data size: 23429814 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col0 (type: string)

sort order: +

Map-reduce partition columns: _col0 (type: string)

Statistics: Num rows: 435430 Data size: 23429814 Basic stats: COMPLETE Column stats: NONE

Reducer 3

Reduce Operator Tree:

Merge Join Operator

condition map:

Left Outer Join0 to 1

Left Outer Join0 to 2

keys:

0 lower(_col107) (type: string)

1 _col0 (type: string)

2 _col0 (type: string)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col22, _col24, _col30, _col31, _col32, _col34, _col44, _col67, _col68, _col69, _col72, _col78, _col83, _col84, _col103, _col107, _col108, _col109, _col110, _col111, _col117, _col118, _col125, _col126, _col131, _col133, _col145, _col146

Statistics: Num rows: 788865883 Data size: 360827596199 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col107 (type: string)

sort order: +

Map-reduce partition columns: _col107 (type: string)

Statistics: Num rows: 788865883 Data size: 360827596199 Basic stats: COMPLETE Column stats: NONE

value expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col22 (type: string), _col24 (type: string), _col30 (type: double), _col31 (type: double), _col32 (type: double), _col34 (type: double), _col44 (type: string), _col67 (type: double), _col68 (type: string), _col69 (type: double), _col72 (type: string), _col78 (type: double), _col83 (type: bigint), _col84 (type: string), _col103 (type: double), _col108 (type: string), _col109 (type: string), _col110 (type: int), _col111 (type: string), _col117 (type: double), _col118 (type: double), _col125 (type: string), _col126 (type: string), _col131 (type: bigint), _col133 (type: int), _col145 (type: string), _col146 (type: string)

Reducer 4 

            Reduce Operator Tree:

              Merge Join Operator

                condition map:

                     Left Outer Join0 to 1

                keys:

                  0 _col107 (type: string)

                  1 email (type: string)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col22, _col24, _col30, _col31, _col32, _col34, _col44, _col67, _col68, _col69, _col72, _col78, _col83, _col84, _col103, _col107, _col108, _col109, _col110, _col111, _col117, _col118, _col125, _col126, _col131, _col133, _col145, _col146, _col157

Statistics: Num rows: 867752490 Data size: 396910364421 Basic stats: COMPLETE Column stats: NONE

Map Join Operator

                  condition map:

                       Left Outer Join0 to 1

                  keys:

                    0 _col72 (type: string)

                    1 usr_union_logon_code (type: string)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col22, _col24, _col30, _col31, _col32, _col34, _col44, _col67, _col68, _col69, _col72, _col78, _col83, _col84, _col103, _col107, _col108, _col109, _col110, _col111, _col117, _col118, _col125, _col126, _col131, _col133, _col145, _col146, _col157, _col163

input vertices:

1 Map 12

Statistics: Num rows: 954527759 Data size: 436601410326 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: ((_col84 >= ‘2015-05-04‘) or (length(_col131) > 0)) (type: boolean)

Statistics: Num rows: 636351838 Data size: 291067606274 Basic stats: COMPLETE Column stats: NONE

Select Operator

expressions: _col0 (type: bigint), _col1 (type: string), _col3 (type: string), _col2 (type: string), _col4 (type: string), _col108 (type: string), _col109 (type: string), _col110 (type: int), _col107 (type: string), substr(_col107, instr(_col107‘@‘), (instr(_col107‘.‘) - instr(_col107‘@‘))) (type: string), _col44 (type: string), _col22 (type: string), _col24 (type: string), _col30 (type: double), _col32 (type: double), _col31 (type: double), _col67 (type: double), _col68 (type: string), _col69 (type: double), _col111 (type: string), _col34 (type: double), _col83 (type: bigint), CASE WHEN ((_col78 = 1.0)) THEN (1) ELSE (0) END (type: int), CASE WHEN (_col110 is not null) THEN (1) ELSE (0) END (type: int), CASE WHEN ((_col1 like ‘%@sina%‘)) THEN (1) WHEN ((_col1 like ‘%@pingan%‘)) THEN (2) WHEN (((_col1 like ‘%@alipay%‘) and (not ((_col117) IN (1, 2) and (_col118 = 1.0))))) THEN (3) WHEN ((((_col1 like ‘%@alipay%‘) and (_col117) IN (1, 2)) and (_col118 = 1.0))) THEN (4) WHEN ((_col1 like ‘%@163%‘)) THEN (5) WHEN ((_col1 like ‘%@kaixin001%‘)) THEN (6) WHEN ((_col1 like ‘%@139%‘)) THEN (7) WHEN ((_col1 like ‘%@msn.com%‘)) THEN (8) WHEN ((_col1 like ‘%@anyue%‘)) THEN (9) WHEN ((_col1 like ‘%@qq%‘)) THEN (10) ELSE (0) END (type: int), CASE WHEN (((_col1 like ‘%@b2b%‘) or (_col1 = ‘yuxiaolan‘))) THEN (1) ELSE (0) END (type: int), CASE WHEN (_col145 is not null) THEN (1) ELSE (0) END (type: int), CASE WHEN (_col146 is not null) THEN (1) ELSE (0) END (type: int), CASE WHEN ((_col125 = ‘1‘)) THEN (1) ELSE (0) END (type: int), _col126 (type: string), _col157 (type: double), ‘‘ (type: string), ‘‘ (type: string), ‘‘ (type: string), ‘‘ (type: string), _col103 (type: double), CASE WHEN (_col72 is null) THEN (null) WHEN ((_col72 is not null and _col163 is null)) THEN (-999999) ELSE (_col163) END (type: bigint), _col72 (type: string), _col133 (type: int)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38

Statistics: Num rows: 636351838 Data size: 291067606274 Basic stats: COMPLETE Column stats: NONE

File Output Operator

compressed: false

Statistics: Num rows: 636351838 Data size: 291067606274 Basic stats: COMPLETE Column stats: NONE

table:

input format: org.apache.hadoop.mapred.TextInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Reducer 6

Reduce Operator Tree:

Group By Operator

keys: KEY._col0 (type: string)

mode: mergepartial

outputColumnNames: _col0

Statistics: Num rows: 2568953 Data size: 256895341 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col0 (type: string)

sort order: +

Map-reduce partition columns: _col0 (type: string)

Statistics: Num rows: 2568953 Data size: 256895341 Basic stats: COMPLETE Column stats: NONE

Reducer 8 

            Reduce Operator Tree:

              Merge Join Operator

                condition map:

                     Left Outer Join0 to 1

                     Left Outer Join0 to 2

                     Left Outer Join0 to 3

                keys:

                  0 id (type: bigint)

                  1 end_user_id (type: bigint)

                  2 end_usr_id (type: bigint)

                  3 end_user_id (type: bigint)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col22, _col24, _col30, _col31, _col32, _col34, _col44, _col67, _col68, _col69, _col72, _col78, _col83, _col84, _col103, _col107, _col108, _col109, _col110, _col111, _col117, _col118, _col125, _col126, _col131, _col133

Statistics: Num rows: 358575394 Data size: 164012540172 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: lower(_col107) (type: string)

sort order: +

Map-reduce partition columns: lower(_col107) (type: string)

Statistics: Num rows: 358575394 Data size: 164012540172 Basic stats: COMPLETE Column stats: NONE

value expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col22 (type: string), _col24 (type: string), _col30 (type: double), _col31 (type: double), _col32 (type: double), _col34 (type: double), _col44 (type: string), _col67 (type: double), _col68 (type: string), _col69 (type: double), _col72 (type: string), _col78 (type: double), _col83 (type: bigint), _col84 (type: string), _col103 (type: double), _col107 (type: string), _col108 (type: string), _col109 (type: string), _col110 (type: int), _col111 (type: string), _col117 (type: double), _col118 (type: double), _col125 (type: string), _col126 (type: string), _col131 (type: bigint), _col133 (type: int)

Stage: Stage-0

Fetch Operator

limit: -1

Processor Tree:

ListSink

Tez 合并了多个join给一个reduce,避免了启动过多的job(见粗体部分)已经有所优化,但是为啥此sql执行慢呢?

-------------------------------------------------------------------------------- 
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
-------------------------------------------------------------------------------- 
Map 1 .......... SUCCEEDED 1 1 0 0 0 0 
Map 10 ......... SUCCEEDED 3 3 0 0 0 0 
Map 11 ......... SUCCEEDED 151 151 0 0 0 0 
Map 12 ......... SUCCEEDED 1 1 0 0 0 0 
Map 13 ......... SUCCEEDED 76 76 0 0 0 0 
Map 5 .......... SUCCEEDED 11 11 0 0 0 0 
Map 7 .......... SUCCEEDED 156 156 0 0 0 0 
Map 9 .......... SUCCEEDED 10 10 0 0 0 0 
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0 
Reducer 3 ..... RUNNING 642 641 1 0 0 0 
Reducer 4 RUNNING 1009 0 89 920 0 0 
Reducer 6 ...... SUCCEEDED 3 3 0 0 0 0 
Reducer 8 ...... SUCCEEDED 203 203 0 0 0 0 
-------------------------------------------------------------------------------- 
VERTICES: 11/13 [==============>>------------] 55% ELAPSED TIME: 307.54 s

Reduce 3 中有1个任务始终RUNNING

查看SQL中最后的where 条件发现

where t.update_time>= ‘2015-05-04‘   or t8.end_user_id>0;  导致无法提前过滤掉t.update_time< ‘2015-05-04‘的数据,导致后续join产生的reduce数据过大(见上面执行计划蓝色部分),可能产生了数据倾斜,导致任务时长拉长。

去掉or t8.end_user_id>0的时候,end_user   TS返回36219736 行

Map 7

MapOperator Tree:

TableScan

alias: t

Statistics: Num rows: 108659208 Data size: 49700768672 Basic stats: COMPLETEColumn stats: NONE

Filter Operator

predicate: (update_time >=‘2015-05-04‘) (type: boolean)

Statistics: Num rows: 36219736 Data size: 16566922890 Basic stats: COMPLETEColumn stats: NONE

Reduce Output Operator

key expressions: id (type: bigint)

sort order: +

Map-reduce partition columns: id (type: bigint)

Statistics: Num rows: 36219736 Data size: 16566922890 Basic stats: COMPLETEColumn stats: NONE

value expressions: end_user_name (type: string),

没有去掉or t8.end_user_id>0的时候,end_user   TS返回108659208行

Map 7

MapOperator Tree:

TableScan

alias: t

Statistics: Num rows: 108659208 Data size: 49700768672 Basic stats: COMPLETEColumn stats: NONE

Reduce Output Operator

key expressions: id (type: bigint)

sort order: +

Map-reduce partition columns: id (type: bigint)

Statistics: Num rows: 108659208 Data size: 49700768672 Basic stats: COMPLETEColumn stats: NONE

value expressions: end_user_name (type: string),end_user_password (type: string), end_user_real_name (type: string),end_user_birthday (type: string), end_user_last_login_date (type: string),end_user_last_bought_date (type: string), end_user_login_times (type: double),end_user_bought_amount (type: double), end_user_bought_times (type: double),end_user_sex (type: double), end_user_create_time (type: string), end_user_type(type: double), ip (type: string), end_user_points (type: double), co_code(type: string), is_email_activate (type: double), mc_site_id (type: bigint),update_time (type: string), member_grade (type: double), end_user_email (type:string), mobile (type: string), phone (type: string), valid_mobile_phone_num(type: int), id_card (type: string)

业务部门说可以去掉or t8.end_user_id>0 这样就大大减少了 reduce数据的结果集。

最后附上去掉or t8.end_user_id>0的执行计划:

STAGE DEPENDENCIES:

Stage-1 is a root stage

Stage-0 depends on stages: Stage-1

STAGE PLANS:

Stage: Stage-1

Tez

Edges:

Reducer 2 <- Map 1 (SIMPLE_EDGE)

Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE), Reducer 8 (SIMPLE_EDGE)

Reducer 4 <- Map 11 (SIMPLE_EDGE), Map 12 (BROADCAST_EDGE), Reducer 3 (SIMPLE_EDGE)

Reducer 6 <- Map 5 (SIMPLE_EDGE)

Reducer 8 <- Map 10 (SIMPLE_EDGE), Map 13 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE), Map 9 (SIMPLE_EDGE)

DagName: zhoushugang_20150505143030_fda7333f-2147-4a65-a63f-7c08c8a3630b:1

Vertices:

Map 1

Map Operator Tree:

TableScan

alias: yhd_send_message_blacklist

Statistics: Num rows: 870860 Data size: 46859629 Basic stats: COMPLETE Column stats: NONE

Select Operator

expressions: send_to (type: string)

outputColumnNames: send_to

Statistics: Num rows: 870860 Data size: 46859629 Basic stats: COMPLETE Column stats: NONE

Group By Operator

keys: lower(send_to) (type: string)

mode: hash

outputColumnNames: _col0

Statistics: Num rows: 870860 Data size: 46859629 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col0 (type: string)

sort order: +

Map-reduce partition columns: _col0 (type: string)

Statistics: Num rows: 870860 Data size: 46859629 Basic stats: COMPLETE Column stats: NONE

Map 10

Map Operator Tree:

TableScan

alias: t5

Statistics: Num rows: 4414357 Data size: 67807378 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: end_usr_id (type: bigint)

sort order: +

Map-reduce partition columns: end_usr_id (type: bigint)

Statistics: Num rows: 4414357 Data size: 67807378 Basic stats: COMPLETE Column stats: NONE

value expressions: car (type: string), shopping_habit (type: string)

Map 11

Map Operator Tree:

TableScan

alias: t6

Statistics: Num rows: 38930100 Data size: 4515891712 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: user_id is not null (type: boolean)

Statistics: Num rows: 19465050 Data size: 2257945856 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: email (type: string)

sort order: +

Map-reduce partition columns: email (type: string)

Statistics: Num rows: 19465050 Data size: 2257945856 Basic stats: COMPLETE Column stats: NONE

value expressions: mail_level (type: double)

Map 12

Map Operator Tree:

TableScan

alias: t7

Statistics: Num rows: 16 Data size: 1752 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: usr_union_logon_code is not null (type: boolean)

Statistics: Num rows: 8 Data size: 876 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: usr_union_logon_code (type: string)

sort order: +

Map-reduce partition columns: usr_union_logon_code (type: string)

Statistics: Num rows: 8 Data size: 876 Basic stats: COMPLETE Column stats: NONE

value expressions: usr_union_logon_id (type: bigint)

Map 13

Map Operator Tree:

TableScan

alias: t8

Statistics: Num rows: 25660729 Data size: 3178607508 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: (to_date(update_time) = ‘2015-05-04‘) (type: boolean)

Statistics: Num rows: 12830364 Data size: 1589303692 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: end_user_id (type: bigint)

sort order: +

Map-reduce partition columns: end_user_id (type: bigint)

Statistics: Num rows: 12830364 Data size: 1589303692 Basic stats: COMPLETE Column stats: NONE

value expressions: exp (type: int)

Map 5

Map Operator Tree:

TableScan

alias: edm_error_email

Statistics: Num rows: 5137907 Data size: 513790784 Basic stats: COMPLETE Column stats: NONE

Select Operator

expressions: email (type: string)

outputColumnNames: email

Statistics: Num rows: 5137907 Data size: 513790784 Basic stats: COMPLETE Column stats: NONE

Group By Operator

keys: lower(email) (type: string)

mode: hash

outputColumnNames: _col0

Statistics: Num rows: 5137907 Data size: 513790784 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col0 (type: string)

sort order: +

Map-reduce partition columns: _col0 (type: string)

Statistics: Num rows: 5137907 Data size: 513790784 Basic stats: COMPLETE Column stats: NONE

Map 7

Map Operator Tree:

TableScan

alias: t

Statistics: Num rows: 108659208 Data size: 49700768672 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: (update_time >= ‘2015-05-04‘) (type: boolean)

Statistics: Num rows: 36219736 Data size: 16566922890 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: id (type: bigint)

sort order: +

Map-reduce partition columns: id (type: bigint)

Statistics: Num rows: 36219736 Data size: 16566922890 Basic stats: COMPLETE Column stats: NONE

value expressions: end_user_name (type: string), end_user_password (type: string), end_user_real_name (type: string), end_user_birthday (type: string), end_user_last_login_date (type: string), end_user_last_bought_date (type: string), end_user_login_times (type: double), end_user_bought_amount (type: double), end_user_bought_times (type: double), end_user_sex (type: double), end_user_create_time (type: string), end_user_type (type: double), ip (type: string), end_user_points (type: double), co_code (type: string), is_email_activate (type: double), mc_site_id (type: bigint), member_grade (type: double), end_user_email (type: string), mobile (type: string), phone (type: string), valid_mobile_phone_num (type: int), id_card (type: string)

Map 9

Map Operator Tree:

TableScan

alias: t2

Statistics: Num rows: 11871989 Data size: 422420127 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: end_user_id (type: bigint)

sort order: +

Map-reduce partition columns: end_user_id (type: bigint)

Statistics: Num rows: 11871989 Data size: 422420127 Basic stats: COMPLETE Column stats: NONE

value expressions: user_grade (type: double), user_grade_type (type: double)

Reducer 2

Reduce Operator Tree:

Group By Operator

keys: KEY._col0 (type: string)

mode: mergepartial

outputColumnNames: _col0

Statistics: Num rows: 435430 Data size: 23429814 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col0 (type: string)

sort order: +

Map-reduce partition columns: _col0 (type: string)

Statistics: Num rows: 435430 Data size: 23429814 Basic stats: COMPLETE Column stats: NONE

Reducer 3

Reduce Operator Tree:

Merge Join Operator

condition map:

Left Outer Join0 to 1

Left Outer Join0 to 2

keys:

0 lower(_col107) (type: string)

1 _col0 (type: string)

2 _col0 (type: string)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col22, _col24, _col30, _col31, _col32, _col34, _col44, _col67, _col68, _col69, _col72, _col78, _col83, _col103, _col107, _col108, _col109, _col110, _col111, _col117, _col118, _col125, _col126, _col133, _col145, _col146

Statistics: Num rows: 262955293 Data size: 120275865393 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col107 (type: string)

sort order: +

Map-reduce partition columns: _col107 (type: string)

Statistics: Num rows: 262955293 Data size: 120275865393 Basic stats: COMPLETE Column stats: NONE

value expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col22 (type: string), _col24 (type: string), _col30 (type: double), _col31 (type: double), _col32 (type: double), _col34 (type: double), _col44 (type: string), _col67 (type: double), _col68 (type: string), _col69 (type: double), _col72 (type: string), _col78 (type: double), _col83 (type: bigint), _col103 (type: double), _col108 (type: string), _col109 (type: string), _col110 (type: int), _col111 (type: string), _col117 (type: double), _col118 (type: double), _col125 (type: string), _col126 (type: string), _col133 (type: int), _col145 (type: string), _col146 (type: string)

Reducer 4

Reduce Operator Tree:

Merge Join Operator

condition map:

Left Outer Join0 to 1

keys:

0 _col107 (type: string)

1 email (type: string)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col22, _col24, _col30, _col31, _col32, _col34, _col44, _col67, _col68, _col69, _col72, _col78, _col83, _col103, _col107, _col108, _col109, _col110, _col111, _col117, _col118, _col125, _col126, _col133, _col145, _col146, _col157

Statistics: Num rows: 289250828 Data size: 132303454799 Basic stats: COMPLETE Column stats: NONE

Map Join Operator

condition map:

Left Outer Join0 to 1

keys:

0 _col72 (type: string)

1 usr_union_logon_code (type: string)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col22, _col24, _col30, _col31, _col32, _col34, _col44, _col67, _col68, _col69, _col72, _col78, _col83, _col103, _col107, _col108, _col109, _col110, _col111, _col117, _col118, _col125, _col126, _col133, _col145, _col146, _col157, _col163

input vertices:

1 Map 12

Statistics: Num rows: 318175917 Data size: 145533803433 Basic stats: COMPLETE Column stats: NONE

Select Operator

expressions: _col0 (type: bigint), _col1 (type: string), _col3 (type: string), _col2 (type: string), _col4 (type: string), _col108 (type: string), _col109 (type: string), _col110 (type: int), _col107 (type: string), substr(_col107, instr(_col107‘@‘), (instr(_col107‘.‘) - instr(_col107‘@‘))) (type: string), _col44 (type: string), _col22 (type: string), _col24 (type: string), _col30 (type: double), _col32 (type: double), _col31 (type: double), _col67 (type: double), _col68 (type: string), _col69 (type: double), _col111 (type: string), _col34 (type: double), _col83 (type: bigint), CASE WHEN ((_col78 = 1.0)) THEN (1) ELSE (0) END (type: int), CASE WHEN (_col110 is not null) THEN (1) ELSE (0) END (type: int), CASE WHEN ((_col1 like ‘%@sina%‘)) THEN (1) WHEN ((_col1 like ‘%@pingan%‘)) THEN (2) WHEN (((_col1 like ‘%@alipay%‘) and (not ((_col117) IN (1, 2) and (_col118 = 1.0))))) THEN (3) WHEN ((((_col1 like ‘%@alipay%‘) and (_col117) IN (1, 2)) and (_col118 = 1.0))) THEN (4) WHEN ((_col1 like ‘%@163%‘)) THEN (5) WHEN ((_col1 like ‘%@kaixin001%‘)) THEN (6) WHEN ((_col1 like ‘%@139%‘)) THEN (7) WHEN ((_col1 like ‘%@msn.com%‘)) THEN (8) WHEN ((_col1 like ‘%@anyue%‘)) THEN (9) WHEN ((_col1 like ‘%@qq%‘)) THEN (10) ELSE (0) END (type: int), CASE WHEN (((_col1 like ‘%@b2b%‘) or (_col1 = ‘yuxiaolan‘))) THEN (1) ELSE (0) END (type: int), CASE WHEN (_col145 is not null) THEN (1) ELSE (0) END (type: int), CASE WHEN (_col146 is not null) THEN (1) ELSE (0) END (type: int), CASE WHEN ((_col125 = ‘1‘)) THEN (1) ELSE (0) END (type: int), _col126 (type: string), _col157 (type: double), ‘‘ (type: string), ‘‘ (type: string), ‘‘ (type: string), ‘‘ (type: string), _col103 (type: double), CASE WHEN (_col72 is null) THEN (null) WHEN ((_col72 is not null and _col163 is null)) THEN (-999999) ELSE (_col163) END (type: bigint), _col72 (type: string), _col133 (type: int)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38

Statistics: Num rows: 318175917 Data size: 145533803433 Basic stats: COMPLETE Column stats: NONE

File Output Operator

compressed: false

Statistics: Num rows: 318175917 Data size: 145533803433 Basic stats: COMPLETE Column stats: NONE

table:

input format: org.apache.hadoop.mapred.TextInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Reducer 6

Reduce Operator Tree:

Group By Operator

keys: KEY._col0 (type: string)

mode: mergepartial

outputColumnNames: _col0

Statistics: Num rows: 2568953 Data size: 256895341 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col0 (type: string)

sort order: +

Map-reduce partition columns: _col0 (type: string)

Statistics: Num rows: 2568953 Data size: 256895341 Basic stats: COMPLETE Column stats: NONE

Reducer 8

Reduce Operator Tree:

Merge Join Operator

condition map:

Left Outer Join0 to 1

Left Outer Join0 to 2

Left Outer Join0 to 3

keys:

0 id (type: bigint)

1 end_user_id (type: bigint)

2 end_usr_id (type: bigint)

3 end_user_id (type: bigint)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col22, _col24, _col30, _col31, _col32, _col34, _col44, _col67, _col68, _col69, _col72, _col78, _col83, _col103, _col107, _col108, _col109, _col110, _col111, _col117, _col118, _col125, _col126, _col133

Statistics: Num rows: 119525131 Data size: 54670846721 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: lower(_col107) (type: string)

sort order: +

Map-reduce partition columns: lower(_col107) (type: string)

Statistics: Num rows: 119525131 Data size: 54670846721 Basic stats: COMPLETE Column stats: NONE

value expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col22 (type: string), _col24 (type: string), _col30 (type: double), _col31 (type: double), _col32 (type: double), _col34 (type: double), _col44 (type: string), _col67 (type: double), _col68 (type: string), _col69 (type: double), _col72 (type: string), _col78 (type: double), _col83 (type: bigint), _col103 (type: double), _col107 (type: string), _col108 (type: string), _col109 (type: string), _col110 (type: int), _col111 (type: string), _col117 (type: double), _col118 (type: double), _col125 (type: string), _col126 (type: string), _col133 (type: int)

Stage: Stage-0

Fetch Operator

limit: -1

Processor Tree:

ListSink

时间: 2024-11-18 04:26:25

hive on tez sql 优化的相关文章

hive sql 优化 数据倾斜

此脚本运行速度慢,主要是reduce端数据倾斜导致的,了解到dw.fct_traffic_navpage_path_detl表是用来收集用户点击数据的,那么最终 购物车和下单的点击肯定极少,所以此表ordr_code字段为空和cart_prod_id字段为NULL的数据量极大,如下所示: select ordr_code,count(*) as a from dw.fct_traffic_navpage_path_detl  where ds = '2015-05-10'  group by o

hive sql 优化

sql优化: ---------------------------------------------------------------- 数据倾斜的处理方式: -- Q: 活动数据 和 对应的维表进行关联,其中某个活动特别的大. A: 1) 给关联健加入一个随机的 1-10的值 2)将维度表 的关联健, 每个加上 1-10的值,将维度表扩充十倍. 3)然后将2个表进行join,从而来消除数据倾斜. -- 尽量不使用count distinct 1) 通过select子查询优化 2) 通过建

hive查询注意及优化tips

Hive是将符合SQL语法的字符串解析生成可以在Hadoop上执行的MapReduce的工具.使用Hive尽量按照分布式计算的一些特点来设计sql,和传统关系型数据库有区别, 所以需要去掉原有关系型数据库下开发的一些固有思维. 基本原则: 1:尽量尽早地过滤数据,减少每个阶段的数据量,对于分区表要加分区,同时只选择需要使用到的字段 select ... from A join B on A.key = B.key where A.userid>10 and B.userid<10 and A.

hive on tez配置

1.Tez简介 Tez是Hontonworks开源的支持DAG作业的计算框架,它可以将多个有依赖的作业转换为一个作业从而大幅提升MapReduce作业的性能.Tez并不直接面向最终用户--事实上它允许开发者为最终用户构建性能更快.扩展性更好的应用程序 2.编译tez 本文记录Tez 0.8.5的编译过程,之前的Tez版本都是源码包,最新的版本虽然提供了编译后的tar包,但是大部分情况下是针对特定的Hadoop版本,如果和我们的Hadoop版本不一致,可能某个时刻会出现一些未知的问题,所以为了稳定

hive高级操作(优化,数据倾斜优化)

2019/2/21 星期四 hive高级操作(优化,数据倾斜优化) 分区表/桶表应用,skew,map-join //见hive的基本语法行列转换 hive 优化hive 优化思想Explain 的使用经典案例(distinct count) 数据倾斜的原因操作:关键词 情形 后果1.Join 其中一个表较小,但是key 集中分发到某一个或几个Reduce 上的数据远高于平均值 :2.大表与大表,但是分桶的判断字段0 值或空值过多这些空值都由一个reduce 处理,非常慢:3.group by

配置 Hive On Tez

配置 Hive On Tez 标签(空格分隔): hive Tez 部署底层应用 简单介绍 介绍:tez 是基于hive 之上,可以将sql翻译解析成DAG计算的引擎.基于DAG 与mr 架构本身的优缺点,tez 本身经过测试一般小任务在hive mr 的2-3倍速度左右,大任务7-10倍左右,根据情况不同可能不一样. 对于 Tez-0.9.0 以及更高版本, Tez 需要 Apache Hadoop 版本为 2.7.0 或更高 安装 Apache Hadoop 2.7.0 或更高版本,这里选取

你真的了解“SQL”吗?《SQL优化最佳实践》作者带你重新了解SQL

一.SQL :一种熟悉又陌生的编程语言 这里有几个关键词:"熟悉"."陌生"."编程语言". 说它"熟悉",是因为它是DBA和广大开发人员,操作数据库的主要手段,几乎每天都在使用.说它"陌生",是很多人只是简单的使用它,至于它是怎么工作的?如何才能让它更高效的工作?却从来没有考虑过. 这里把SQL归结为一种"编程语言",可能跟很多人对它的认知不同.让我们看看它的简单定义(以下内容摘自百度

sql优化

1.all: 全表扫描,遍历全表找到匹配的行 index:索引全扫描,遍历整个索引来查询匹配的行 range:索引范围扫描,常见于<,>,>=,between等操作符 ref: 使用非唯一索引扫描或唯一索引的前缀扫描,返回匹配某个单独值的记录行 eq_ref:类似ref,区别就是使用的索引是唯一索引,对于每个索引键值,表中只有一条记录匹配.简单来说,就是多表连接中使用primary key或者unique index 作为关联条件 const/system:单表中最多有一个匹配行,查询起

sql优化(oracle)- 第三部分 &#160;sql优化总结

第三部分  sql优化总结        1. 优化一般原则        2. 具体注意事项 1. SQL优化一般性原则 1)目标:减少服务器资源消耗(主要是磁盘IO) 2)设计: 1. 尽量依赖oracle优化器 2. 合适的索引(数据重复量大的列不要简历二叉树索引,可以使用位图索引: 对应数据操作频繁的表,索引需要定期重建,减少失效的索引和碎片) 3)编码: 1. 利用索引 2. 合理利用临时表 3. 避免写过于复杂的sql: 4. 尽量减小事务的粒度 2. 具体注意事项 1)查询时尽量使