Hive Syntax-Level Optimization, Part 5: Analyzing the Execution Plan to Trace the Causes of Data Skew

count(distinct key) example

explain select count(distinct session_id) from trackinfo where ds='2013-07-21';

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        trackinfo
          TableScan
            alias: trackinfo
            Filter Operator
              predicate:
                  expr: (ds = '2013-07-21')
                  type: boolean
              Filter Operator
                predicate:
                    expr: (ds = '2013-07-21')
                    type: boolean
                Select Operator
                  expressions:
                        expr: session_id
                        type: string
                  outputColumnNames: session_id
                  Group By Operator
                    aggregations:
                          expr: count(DISTINCT session_id)
                    bucketGroup: true
                    keys:
                          expr: session_id
                          type: string
                    mode: hash
                    outputColumnNames: _col0, _col1
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: string
                      sort order: +
                      tag: -1
                      value expressions:
                            expr: _col1
                            type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(DISTINCT KEY._col0:0._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
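
In this plan the Reduce Output Operator has no "Map-reduce partition columns" entry and the final count is produced by a single reduce-side Group By, so every session_id is shuffled to one reducer. A common manual rewrite (a sketch, not from the original article) splits the work into two jobs so the deduplication can run on many reducers:

```sql
-- Job 1: deduplicate session_id with a parallel group by;
-- Job 2: count the already-distinct rows.
select count(*)
from (
  select session_id
  from trackinfo
  where ds = '2013-07-21'
  group by session_id
) t;
```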

group by example

explain select max(session_id) from trackinfo where ds='2013-07-21' group by city_id;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        trackinfo
          TableScan
            alias: trackinfo
            Filter Operator
              predicate:
                  expr: (ds = '2013-07-21')
                  type: boolean
              Select Operator
                expressions:
                      expr: city_id
                      type: string
                      expr: session_id
                      type: string
                outputColumnNames: city_id, session_id
                Group By Operator
                  aggregations:
                        expr: max(session_id)
                  bucketGroup: false
                  keys:
                        expr: city_id
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: string
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: max(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col1
                  type: string
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
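
Here the "Map-reduce partition columns" entry shows _col0 (city_id), so all rows for one city land on the same reducer, and a hot city makes that reducer slow. Hive ships a built-in mitigation for this pattern; the parameter below is a real Hive setting (default false), shown as a hedged sketch of how it would be applied:

```sql
-- Plan the group-by as two jobs: the first distributes rows randomly
-- and computes partial aggregates, the second merges them by the real key.
set hive.groupby.skewindata=true;
select max(session_id)
from trackinfo
where ds = '2013-07-21'
group by city_id;
```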

count(distinct key) combined with group by example

explain select count(distinct session_id) from trackinfo where ds='2013-07-21' group by city_id;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        trackinfo
          TableScan
            alias: trackinfo
            Filter Operator
              predicate:
                  expr: (ds = '2013-07-21')
                  type: boolean
              Select Operator
                expressions:
                      expr: city_id
                      type: string
                      expr: session_id
                      type: string
                outputColumnNames: city_id, session_id
                Group By Operator
                  aggregations:
                        expr: count(DISTINCT session_id)
                  bucketGroup: false
                  keys:
                        expr: city_id
                        type: string
                        expr: session_id
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: string
                    sort order: ++
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col2
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(DISTINCT KEY._col1:0._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
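
Note that the key expressions are (_col0, _col1), i.e. (city_id, session_id), but "Map-reduce partition columns" lists only _col0: the distinct still forces every row of a hot city onto one reducer. The usual manual rewrite (a sketch, not from the original article) separates deduplication from counting so the first job can partition on both columns:

```sql
-- Job 1 groups by both columns, so rows spread across reducers
-- even for a hot city; Job 2 counts the distinct pairs per city.
select city_id, count(*) as distinct_sessions
from (
  select city_id, session_id
  from trackinfo
  where ds = '2013-07-21'
  group by city_id, session_id
) t
group by city_id;
```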

count(key) combined with group by example

explain select count(session_id) from trackinfo where ds='2013-07-21' group by city_id;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        trackinfo
          TableScan
            alias: trackinfo
            Filter Operator
              predicate:
                  expr: (ds = '2013-07-21')
                  type: boolean
              Select Operator
                expressions:
                      expr: city_id
                      type: string
                      expr: session_id
                      type: string
                outputColumnNames: city_id, session_id
                Group By Operator
                  aggregations:
                        expr: count(session_id)
                  bucketGroup: false
                  keys:
                        expr: city_id
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
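
The hash-mode Group By Operator on the map side is the reason a plain count behaves well: each mapper emits at most one partial count per city_id, so reducers receive small inputs even for hot keys. Map-side aggregation is controlled by the settings below (both are real Hive parameters; the values shown are the common defaults):

```sql
-- Enable map-side partial aggregation (the hash-mode Group By in the plan).
set hive.map.aggr=true;
-- Number of input rows to process before checking whether the
-- in-memory aggregation hash map is reducing the data enough to keep.
set hive.groupby.mapaggr.checkinterval=100000;
```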

Execution plan summary

select count(distinct session_id) from trackinfo where ds='2013-11-01';

Key shuffled to the reducers: session_id (no partition column; a single reducer receives everything)

select max(session_id) from trackinfo where ds='2013-11-01' group by city_id;

Key shuffled to the reducers: city_id

select count(distinct session_id) from trackinfo where ds='2013-11-01' group by city_id;

Keys shuffled to the reducers: session_id and city_id (both appear in the reduce key, but partitioning is on city_id alone)

select count(session_id) from trackinfo where ds='2013-11-01' group by city_id;

Key shuffled to the reducers: city_id

Conclusions about data skew:

join, group by, and count(distinct key) are prone to data skew;

plain aggregate functions such as max and count do not by themselves cause skew, because they can be partially aggregated on the map side.
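
For the join case above, a common fix when one side is a small dimension table is a map join, which replicates the small table to every mapper and skips the shuffle that causes the skew. This is a sketch only; the city_dim table and city_name column are hypothetical names used for illustration:

```sql
-- Let Hive convert eligible joins to map joins automatically,
-- or force one with the MAPJOIN hint.
set hive.auto.convert.join=true;
select /*+ MAPJOIN(c) */ t.session_id, c.city_name  -- city_dim/city_name are made-up names
from trackinfo t
join city_dim c on t.city_id = c.city_id
where t.ds = '2013-07-21';
```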

The CREATE TABLE statement for the trackinfo table used in these examples

create table trackinfo (
  id                bigint,
  url               string,
  referer           string,
  keyword           string,
  type              int,
  gu_id             string,
  page_id           string,
  module_id         string,
  link_id           string,
  attached_info     string,
  session_id        string,
  tracker_u         string,
  tracker_type      int,
  ip                string,
  tracker_src       string,
  cookie            string,
  order_code        string,
  track_time        string,
  end_user_id       bigint,
  first_link        string,
  session_view_no   int,
  product_id        string,
  merchant_id       bigint,
  province_id       string,
  city_id           string,
  fee               string,
  edm_activity      string,
  edm_email         string,
  edm_jobid         string,
  ie_version        string,
  platform          string,
  internal_keyword  string,
  result_sum        string,
  currentpage       string,
  link_position     string,
  button_position   string,
  ext_field1        string,
  ext_field2        string,
  ext_field3        string,
  ext_field4        string,
  ext_field5        string,
  adgroupkeywordid  string,
  ext_field6        string,
  ext_field7        string,
  ext_field8        string,
  ext_field9        string,
  ext_field10       string,
  url_page_id       int,
  url_page_value    string,
  refer_page_id     int,
  refer_page_value  string
) partitioned by (ds string);

Date: 2024-10-13 02:35:33
