Big Data -- Hive Dynamic Partition Tuning

1. Create a regular (non-partitioned) table and load data

------------------------------------------------

hive (default)> create table person(id int,name string,location string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.415 seconds

-----------------------------------------------

hive (default)> load data local inpath '/root/hivetest/partition/stu' into table person;
Loading data to table default.person
Table default.person stats: [numFiles=1, totalSize=128]
OK
Time taken: 1.036 seconds

----------------------------------------------------

hive (default)> select * from person;
OK
person.id person.name person.location
1001 zhangsan jiangsu
1002 lisi jiangsu
1003 wangwu shanghai
1004 heiliu shanghai
1005 xiaoliuzi zhejiang
1006 xiaohei zhejiang
Time taken: 0.356 seconds, Fetched: 6 row(s)
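
For reference, the source file /root/hivetest/partition/stu is presumably a tab-separated file matching the table definition; reconstructed from the query output above (an assumption, not the original file), it would look like:

1001	zhangsan	jiangsu
1002	lisi	jiangsu
1003	wangwu	shanghai
1004	heiliu	shanghai
1005	xiaoliuzi	zhejiang
1006	xiaohei	zhejiang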

----------------------------------------------------

2. Create a partitioned table and load data

------------------------------------------------

hive (default)> create table person_partition1(id int,name string)
> partitioned by(location string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.055 seconds

----------------------------------------------------

hive (default)> load data local inpath '/root/hivetest/partition/stu_par' into table person_partition1 partition(location='jiangsu');
Loading data to table default.person_partition1 partition (location=jiangsu)
Partition default.person_partition1{location=jiangsu} stats: [numFiles=1, numRows=0, totalSize=48, rawDataSize=0]
OK
Time taken: 0.719 seconds

------------------------------------------------------

hive (default)> select * from person_partition1 where location = 'jiangsu';
OK
person_partition1.id person_partition1.name person_partition1.location
1001 zhangsan jiangsu
1002 lisi jiangsu
1003 wangwu jiangsu
1004 heiliu jiangsu
Time taken: 0.27 seconds, Fetched: 4 row(s)
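
Note that when loading into a static partition, the data file typically contains only the non-partition columns (here id and name); the partition value comes from the PARTITION clause, not from the file. Reconstructed from the query output above (again an assumption), /root/hivetest/partition/stu_par would look like:

1001	zhangsan
1002	lisi
1003	wangwu
1004	heiliu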

-------------------------------------------------------------

3. Create a target partitioned table

--------------------------------------------------------------

hive (default)> create table target_partition(id int,name string)
> partitioned by(location string)
> row format delimited fields terminated by ‘\t‘;
OK
Time taken: 0.076 seconds

---------------------------------------------------------------

4. Configure the dynamic partition settings

-----------------------------------------------------------

(1) Enable dynamic partitioning (default true, i.e. already enabled)

hive (default)> set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=true

(2) Set the mode to nonstrict (this is the dynamic partition mode; the default is strict, which requires at least one partition column to be specified statically, while nonstrict allows every partition column to be dynamic; see the sketch after the commands below)

hive (default)> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=strict
hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (default)> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=nonstrict
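
To make the distinction concrete, here is a minimal sketch using a hypothetical table logs partitioned by both day and hour (all table and column names here are invented for illustration):

-- allowed even under strict mode: day is static, hour is dynamic
insert overwrite table logs partition(day='2019-10-04', hour)
select id, msg, hour from logs_staging;

-- requires nonstrict mode: every partition column is dynamic
insert overwrite table logs partition(day, hour)
select id, msg, day, hour from logs_staging;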

(3) Maximum total number of dynamic partitions that may be created across all nodes running MR (default 1000)

hive (default)> set hive.exec.max.dynamic.partitions;
hive.exec.max.dynamic.partitions=1000

(4) Maximum number of dynamic partitions that may be created on each node running MR. This parameter must be sized to the actual data. For example, if the source data covers a full year, the day field has 365 distinct values, so the parameter must be set above 365; with the default of 100 the job would fail (see the example below).

hive (default)> set hive.exec.max.dynamic.partitions.pernode;
hive.exec.max.dynamic.partitions.pernode=100
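
For the one-year example above, the limit would have to be raised before running the insert; 366 here is just an illustrative value above 365:

hive (default)> set hive.exec.max.dynamic.partitions.pernode=366;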

(5) Maximum number of HDFS files that may be created by the whole MR job (default 100000)

hive (default)> set hive.exec.max.created.files;
hive.exec.max.created.files=100000

(6) Whether to raise an error when an empty partition is generated; this usually does not need changing (default false)

hive (default)> set hive.error.on.empty.partition;
hive.error.on.empty.partition=false
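
These set commands are session-scoped. To apply a setting at launch instead, the CLI also accepts --hiveconf, e.g. (a sketch):

hive --hiveconf hive.exec.dynamic.partition.mode=nonstrict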

-------------------------------------------------------------------------------------------------

5. The source is the partitioned table person_partition1; query it and load the results into another partitioned table, target_partition

---------------------------------------------------------------------

hive (default)> insert overwrite table target_partition partition(location) select id,name,location from person_partition1;
Query ID = root_20191004121759_d0af4f33-c1aa-4ef8-93b7-836f260660be
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1570160651182_0001, Tracking URL = http://bigdata112:8088/proxy/application_1570160651182_0001/
Kill Command = /opt/module/hadoop-2.8.4/bin/hadoop job -kill job_1570160651182_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-10-04 12:18:10,599 Stage-1 map = 0%, reduce = 0%
2019-10-04 12:18:18,063 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.7 sec
MapReduce Total cumulative CPU time: 700 msec
Ended Job = job_1570160651182_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://mycluster/user/hive/warehouse/target_partition/.hive-staging_hive_2019-10-04_12-17-59_480_7824706292755053566-1/-ext-10000
Loading data to table default.target_partition partition (location=null)
Time taken for load dynamic partitions : 128
Loading partition {location=jiangsu}
Time taken for adding to write entity : 1
Partition default.target_partition{location=jiangsu} stats: [numFiles=1, numRows=4, totalSize=48, rawDataSize=44]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 0.7 sec HDFS Read: 4045 HDFS Write: 145 SUCCESS
Total MapReduce CPU Time Spent: 700 msec
OK
id name location
Time taken: 20.136 seconds

hive (default)> show partitions target_partition;
OK
partition
location=jiangsu
Time taken: 0.065 seconds, Fetched: 1 row(s)

hive (default)> select * from target_partition;
OK
target_partition.id target_partition.name target_partition.location
1001 zhangsan jiangsu
1002 lisi jiangsu
1003 wangwu jiangsu
1004 heiliu jiangsu
Time taken: 0.12 seconds, Fetched: 4 row(s)
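
One detail worth noting: Hive binds dynamic partition columns by position, not by name, taking them from the trailing columns of the SELECT. The statement below (a sketch) would partition exactly as above even though the last column carries an arbitrary alias; what matters is that it comes last:

hive (default)> insert overwrite table target_partition partition(location)
              > select id, name, location as loc from person_partition1;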

--------------------------------------------------------------------------

6. The source is the regular table person; query it and load the results into the partitioned table target_partition

----------------------------------------------------------------------------------------

hive (default)> insert overwrite table target_partition partition(location) select id,name,location from person;
Query ID = root_20191004122151_2c6376a5-b764-4ffd-be69-4f981c00b951
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1570160651182_0002, Tracking URL = http://bigdata112:8088/proxy/application_1570160651182_0002/
Kill Command = /opt/module/hadoop-2.8.4/bin/hadoop job -kill job_1570160651182_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-10-04 12:21:59,322 Stage-1 map = 0%, reduce = 0%
2019-10-04 12:22:05,702 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.75 sec
MapReduce Total cumulative CPU time: 750 msec
Ended Job = job_1570160651182_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://mycluster/user/hive/warehouse/target_partition/.hive-staging_hive_2019-10-04_12-21-51_068_5031819755791838456-1/-ext-10000
Loading data to table default.target_partition partition (location=null)
Time taken for load dynamic partitions : 357
Loading partition {location=zhejiang}
Loading partition {location=shanghai}
Loading partition {location=jiangsu}
Time taken for adding to write entity : 0
Partition default.target_partition{location=jiangsu} stats: [numFiles=1, numRows=2, totalSize=24, rawDataSize=22]
Partition default.target_partition{location=shanghai} stats: [numFiles=1, numRows=2, totalSize=24, rawDataSize=22]
Partition default.target_partition{location=zhejiang} stats: [numFiles=1, numRows=2, totalSize=28, rawDataSize=26]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 0.75 sec HDFS Read: 3817 HDFS Write: 295 SUCCESS
Total MapReduce CPU Time Spent: 750 msec
OK
id name location
Time taken: 17.561 seconds

hive (default)> show partitions target_partition;
OK
partition
location=jiangsu
location=shanghai
location=zhejiang
Time taken: 0.046 seconds, Fetched: 3 row(s)

hive (default)> select * from target_partition;
OK
target_partition.id target_partition.name target_partition.location
1001 zhangsan jiangsu
1002 lisi jiangsu
1003 wangwu shanghai
1004 heiliu shanghai
1005 xiaoliuzi zhejiang
1006 xiaohei zhejiang
Time taken: 0.112 seconds, Fetched: 6 row(s)
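
To confirm the physical layout, the partition directories can be listed from inside the CLI with the dfs command (path taken from the job log above); each partition appears as a subdirectory named location=<value>:

hive (default)> dfs -ls /user/hive/warehouse/target_partition;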

Original article: https://www.cnblogs.com/jeff190812/p/11621832.html

