count(distinct key)案例
explain select count(distinct session_id) from trackinfo where ds=‘ 2013-07-21‘ ; STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: trackinfo TableScan alias: trackinfo Filter Operator predicate: expr: (ds = ‘ 2013-07-21‘) type: boolean Filter Operator predicate: expr: (ds = ‘ 2013-07-21‘) type: boolean Select Operator expressions: expr: session_id type: string outputColumnNames: session_id Group By Operator aggregations: expr: count(DISTINCT session_id) bucketGroup: true keys: expr: session_id type: string mode: hash outputColumnNames: _col0, _col1 Reduce Output Operator key expressions: expr: _col0 type: string sort order: + tag: -1 value expressions: expr: _col1 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(DISTINCT KEY._col0:0._col0) bucketGroup: false mode: mergepartial outputColumnNames: _col0 Select Operator expressions: expr: _col0 type: bigint outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
group by案例
explain select max(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id; STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: trackinfo TableScan alias: trackinfo Filter Operator predicate: expr: (ds = ‘2013-07-21‘) type: boolean Select Operator expressions: expr: city_id type: string expr: session_id type: string outputColumnNames: city_id, session_id Group By Operator aggregations: expr: max(session_id) bucketGroup: false keys: expr: city_id type: string mode: hash outputColumnNames: _col0, _col1 Reduce Output Operator key expressions: expr: _col0 type: string sort order: + Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col1 type: string Reduce Operator Tree: Group By Operator aggregations: expr: max(VALUE._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1 Select Operator expressions: expr: _col1 type: string outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
count(distinct key)联合group by案例
explain select count(distinct session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id; STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: trackinfo TableScan alias: trackinfo Filter Operator predicate: expr: (ds = ‘2013-07-21‘) type: boolean Select Operator expressions: expr: city_id type: string expr: session_id type: string outputColumnNames: city_id, session_id Group By Operator aggregations: expr: count(DISTINCT session_id) bucketGroup: false keys: expr: city_id type: string expr: session_id type: string mode: hash outputColumnNames: _col0, _col1, _col2 Reduce Output Operator key expressions: expr: _col0 type: string expr: _col1 type: string sort order: ++ Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col2 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(DISTINCT KEY._col1:0._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1 Select Operator expressions: expr: _col1 type: bigint outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
max(key)联合group by案例
explain select count(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id; STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: trackinfo TableScan alias: trackinfo Filter Operator predicate: expr: (ds = ‘2013-07-21‘) type: boolean Select Operator expressions: expr: city_id type: string expr: session_id type: string outputColumnNames: city_id, session_id Group By Operator aggregations: expr: count(session_id) bucketGroup: false keys: expr: city_id type: string mode: hash outputColumnNames: _col0, _col1 Reduce Output Operator key expressions: expr: _col0 type: string sort order: + Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col1 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(VALUE._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1 Select Operator expressions: expr: _col1 type: bigint outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
执行计划总结
select count(distinct session_id) from trackinfo where ds=‘2013-11-01‘ ;
分发的是:session_id
select max(session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;
分发的是:city_id
select count(distinct session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;
分发的是:session_id和city_id
select count(session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;
分发的是:city_id
得出数据倾斜的结论:
join、group by、 count(distinct key)容易出现数据倾斜;
max、count等聚合函数并不会导致数据倾斜。
案例中的trackinfo建表语句
create table trackinfo ( id bigint , url string , referer string , keyword string , type int , gu_id string , page_id string , module_id string , link_id string , attached_info string , session_id string , tracker_u string , tracker_type int , ip string , tracker_src string , cookie string , order_code string , track_time string , end_user_id bigint , first_link string , session_view_no int , product_id string , merchant_id bigint , province_id string , city_id string , fee string , edm_activity string , edm_email string , edm_jobid string , ie_version string , platform string , internal_keyword string , result_sum string , currentpage string , link_position string , button_position string , ext_field1 string , ext_field2 string , ext_field3 string , ext_field4 string , ext_field5 string , adgroupkeywordid string , ext_field6 string , ext_field7 string , ext_field8 string , ext_field9 string , ext_field10 string , url_page_id int , url_page_value string , refer_page_id int , refer_page_value string ) partitioned by(ds string);
Hive语法层面优化之五分析执行计划追踪导致数据倾斜的原因
时间: 2024-10-13 02:35:33