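Hive Tutorial (4): Partitioned Tables and Bucketed Tables

Partitioned tables are used all the time in Hive, while bucketed tables are less common, so this article focuses on partitioned tables.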

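Concepts

Partitioned tables

A Hive table can be partitioned. Each partition of a Hive table corresponds to a directory on HDFS.

Nesting directories gives multi-level partitions.

In short, partitioning separates data by directory.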

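Bucketed tables

Each bucket of a bucketed table corresponds to a file on HDFS.

In short, bucketing separates data by file.

At query time, a where clause can restrict the query to specific partitions (or buckets), which improves query efficiency.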
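To make the "one bucket, one file" idea concrete, here is a minimal Python sketch of how rows could be spread over bucket files. It assumes the usual hash-then-modulo scheme and simplifies by treating an integer id as its own hash value. The names bucket_for and split_into_buckets, the bucket count of 4, and the sample rows are all made up for illustration; real Hive versions may hash differently.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count chosen for this sketch


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Return the bucket index for an integer key (simplified: hash == value)."""
    return key % num_buckets


def split_into_buckets(rows, num_buckets=NUM_BUCKETS):
    """Group rows by bucket index; each group stands for one file on HDFS."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[0], num_buckets)].append(row)
    return dict(buckets)


# Illustrative rows of (id, name); not taken from a real table.
rows = [(1, "zhangsan"), (2, "lisi"), (3, "wangwu"), (5, "zhaoliu")]
buckets = split_into_buckets(rows)
for index in sorted(buckets):
    print(index, buckets[index])
```

Rows whose ids leave the same remainder land in the same group, which is why a bucket can hold several different key values.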

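Basic partitioned table operations

1. Create a partitioned table

partitioned by declares the partitions. It is followed by the partition column names and their types. Several columns may be listed: an earlier column becomes the parent directory and a later one the child directory.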

create table student_p(id int,name string,sexex string,age int,dept string)
partitioned by(part string)
row format delimited fields terminated by ','
stored as textfile;

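Partitioning a table is like adding one more column to it and then assigning that column different values. Each value corresponds to one partition, and the value appears in the name of the matching directory on HDFS.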
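The mapping from partition values to directory names can be sketched in a few lines of Python. This is only an illustration: partition_path is a hypothetical helper, the table location is the one reported by desc formatted later in this article, and the sketch ignores the escaping Hive applies to special characters in partition values.

```python
from posixpath import join


def partition_path(table_location, spec):
    """Build a partition directory from ordered (column, value) pairs.

    Earlier pairs become parent directories, later pairs child directories.
    """
    parts = ["{}={}".format(column, value) for column, value in spec]
    return join(table_location, *parts)


table = "/usr/hive_test/student_p"
print(partition_path(table, [("part", "321")]))
print(partition_path(table, [("part", "456")]))
```

Passing more than one pair, for example [("month", "11"), ("day", "2")], yields a nested path, which is how multi-level partitions are laid out.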

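2. Load data

Sample data:

1, zhangsan, f, 30, a,
2, lisi, f, 39, b,
3, wangwu, m, 26, c,

Load the file twice, each time into a different partition. (The query output in step 3 shows different age and dept values, so it appears to come from a different version of this file.)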

load data local inpath '/usr/lib/hive2.3.6/2.csv' into table student_p partition(part=321);
load data local inpath '/usr/lib/hive2.3.6/2.csv' into table student_p partition(part=456);

3. Check the loaded data

hive> select * from student_p;
OK
1    zhangsan    f    20    henan    321
2    lisi    f    30    shanghai    321
3    wangwu    m    40    beijing    321
1    zhangsan    f    20    henan    456
2    lisi    f    30    shanghai    456
3    wangwu    m    40    beijing    456
Time taken: 0.287 seconds, Fetched: 6 row(s)

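4. Check the layout on HDFS

Under the table directory (/usr/hive_test/student_p in this setup) there is one subdirectory per partition: part=321 and part=456.

5. Query a single partition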

hive> select * from student_p where part=321;

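6. Check the metadata in the metastore database

Partition information is stored in the PARTITIONS table.

There are other tables related to PARTITIONS as well; they are worth a look.

Summary: each partition corresponds to a directory, and that directory must also be recorded in the metastore.

In other words, if a directory is not in the metastore, then even though it exists it is not a partition of the table, and its data cannot be queried through the table.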

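Adding partitions

Loading data adds the partition automatically. A partition can also be created on its own, without loading any data.

Add one partition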

hive> alter table student_p add partition(part=999);

Add several partitions

hive> alter table student_p add partition(part=555) partition(part=666);

Dropping partitions

Drop one partition

hive> alter table student_p drop partition(part=555);
Dropped the partition part=555
OK
Time taken: 0.675 seconds

Drop several partitions

hive> alter table student_p drop partition(part=666), partition(part=999);
Dropped the partition part=666
Dropped the partition part=999
OK
Time taken: 0.464 seconds

Listing partitions

hive> show partitions student_p;
OK
part=321
part=456
Time taken: 0.28 seconds, Fetched: 2 row(s)

Viewing the structure of a partitioned table

hive> desc formatted student_p;
OK
# col_name                data_type               comment             

id                      int
name                    string
sexex                   string
age                     int
dept                    string                                      

# Partition Information
# col_name                data_type               comment             

part                    string                                      

# Detailed Table Information
Database:               hive1101
Owner:                  root
CreateTime:             Fri Nov 01 02:00:25 PDT 2019
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://hadoop10:9000/usr/hive_test/student_p
Table Type:             MANAGED_TABLE

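Associating data with a table

Earlier we saw that if a directory sits under the table directory but is not in the metastore, its data cannot be read through the table.

So how can such data be read through the table? The directory has to be associated with the table, and there are three ways to do that.

Upload the data, then repair the table

1. Upload the data straight to HDFS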

hive> dfs -mkdir -p /usr/hive_test/student_p/part=888;
hive> dfs -put /usr/lib/hive2.3.6/2.csv /usr/hive_test/student_p/part=888;

This creates a directory directly on HDFS, inside the table directory, and uploads a file into it.

2. Query that partition: no rows come back

3. Repair the table

hive> msck repair table student_p;
OK
Partitions not in metastore:    student_p:part=888
Repair: Added partition to metastore student_p:part=888
Time taken: 0.502 seconds, Fetched: 2 row(s)

The repair simply adds the partition to the metastore.

4. Query again: the data is now returned
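The effect of the repair in step 3 can be pictured as a set difference between the directories that exist under the table and the partitions the metastore knows about. The Python sketch below is only a model of that idea, not Hive's implementation; find_missing_partitions and repair are hypothetical names, and the two collections stand in for HDFS and the metastore.

```python
def find_missing_partitions(directories, metastore):
    """Return partition directories that exist on disk but are not registered."""
    return sorted(set(directories) - set(metastore))


def repair(directories, metastore):
    """Register every unregistered directory and report what was added."""
    added = find_missing_partitions(directories, metastore)
    metastore.update(added)
    return added


on_hdfs = ["part=321", "part=456", "part=888"]  # directories under the table
registered = {"part=321", "part=456"}           # partitions in the metastore

print(find_missing_partitions(on_hdfs, registered))
print(repair(on_hdfs, registered))
print(sorted(registered))
```

Before the repair, part=888 is on disk but invisible to queries; afterwards it is registered like any other partition.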

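Upload the data, then add the partition

First carry out steps 1 and 2 above.

Then add a partition to the table so that the new directory is registered as one of its partitions.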

hive> alter table student_p add partition(part=888);

Create the directory, then load data into the partition

As shown earlier, load creates the partition automatically, so this approach works too.

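Creating two-level partitions

Two-level partitioning is a case of multi-level partitioning, which simply means multiple directory levels.

Create a multi-level partitioned table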

create table student1102(id int,name string,sexex string,age int,dept string)
partitioned by(month string, day int)
row format delimited fields terminated by ','
stored as textfile;

month is the first level, and day is the level below month.

Load data

load data local inpath '/usr/lib/hive2.3.6/2.csv' into table student1102 partition(month='11', day=2);

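A look at HDFS makes the layout clear: the file ends up in the month=11/day=2 directory under the table directory.

Query the data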

hive> select * from student1102 where month='11' and day=2;

Just join the partition conditions with and.
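Why filtering on partition columns is cheap can be sketched as follows: each partition is described by its column values, and a query only needs the partitions that match every condition. This is a simplified Python model rather than Hive's planner; prune is a hypothetical helper, and the list of partitions is invented for illustration (the article itself only loads month=11, day=2).

```python
def prune(partitions, **wanted):
    """Keep only partitions whose values match every column=value filter."""
    return [
        p for p in partitions
        if all(p.get(column) == value for column, value in wanted.items())
    ]


# Hypothetical partitions of a table partitioned by (month string, day int).
partitions = [
    {"month": "10", "day": 31},
    {"month": "11", "day": 1},
    {"month": "11", "day": 2},
]

print(prune(partitions, month="11", day=2))
print(prune(partitions, month="11"))
```

Filtering on month alone still narrows the scan to the directories under that month; adding day narrows it to a single directory.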

Loading data

load data inpath '/user/tuoming/test/test' into table part_test_3 partition(month_id='201805',day_id='20180509'); -- append
load data inpath '/user/tuoming/test/test' overwrite into table part_test_3 partition(month_id='201805',day_id='20180509'); -- overwrite

insert overwrite table part_test_3 partition(month_id='201805',day_id='20180509') select * from part_test_temp; -- overwrite
insert into part_test_3 partition(month_id='201805',day_id='20180509') select * from part_test_temp; -- append
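The difference between appending and overwriting can be modeled by treating a partition as a list of files. The Python sketch below is only an illustration of the semantics; load_into and the file names are hypothetical.

```python
def load_into(partition_files, new_file, overwrite=False):
    """Return the partition's file list after a load.

    With overwrite the existing files are discarded first;
    without it the new file is added next to them.
    """
    if overwrite:
        return [new_file]
    return list(partition_files) + [new_file]


existing = ["old_1", "old_2"]  # hypothetical files already in the partition
print(load_into(existing, "test"))
print(load_into(existing, "test", overwrite=True))
```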

Dynamic partitions

See the reference below.

References:

https://blog.csdn.net/afafawfaf/article/details/80249974

Original article: https://www.cnblogs.com/yanshw/p/11778127.html

Time: 2024-12-13 21:24:33
