hive学习笔记-数据操作

hive数据操作

hive命令行操作

hive -d --define <key=value> 定义一个key-value可以在命令行中使用

hive -d database <databasename> 指定使用的数据库

hive -e “hql” 不需要进入cli执行hql语句，可以在脚本中使用

hive -f fileName 将hql放到一个file文件中执行，sql语句来自file文件

hive -h hostname 访问主机，通过主机的地址

hive -H --help 打印帮助信息

hive -H --hiveconf <property=value> 使用的信息可以在这里定义property的value

hive -H --hivevar <key-value> 使用可变的命令，将一个命令重新赋值使用

hive -i filename hive的初始化文件，可以将hive的一些初始化信息放到这个文件中，比如使用自定义函数的时候可以将相应的jar包目录写进去

hive -S --silent 在shell下进入安静模式，不需要打印一些输出信息

hive -v --verbose 打印执行的详细信息，比如执行的SQL语句

实例：查询test表，并且将打印的结果放到/home/data/select_result.txt中

hive -S -e "select * from test" > /home/data/select_result.txt

含有执行的SQL语句

hive -v -e "select * from test" > /home/data/select_result.txt

执行放在文件中的SQL

hive -f /home/colin/hive-1.2.1/select_test

hive cli中使用list查看分布式缓存中的file|jar|archive（比如通过add jar添加进去的，可以通过list jar查看添加到分布式缓存中的jar包）

hive cli中使用source执行指定目录下的文件，比如执行指定目录下的一个sql文件 source /home/colin/hive-1.2.1/select_test

hive操作变量

配置变量

set val_Name=val_Value；

${hiveconf:val}

查看linux下的环境变量

${env:变量名称},env查看所有的环境变量

实例定义变量val_test,设置为yang，作为查询语句条件

set val_test=yang;

select * from test2 where name=‘${hiveconf:val_test}‘;

查看

HIVE_HOME的环境变量

select ‘${env:HIVE_HOME}‘ from test;

注：test表中有多少条记录，打印多少次路径

hive数据加载

内表数据加载

创建表时加载

create table tableName as select col_1,col_2... from tableName2;

创建表的时候指定数据位置

create table tablename(col_name typye comment ...) location ‘path‘;(path为hdfs中的路径，注意这个path是文件上层的目录，也就是说指定文件到上层目录，目录下的数据都会作为该表的数据。并且这种方式不会在hive/warehouse下创建该表的目录，因为他会把hdfs中指定的path作为该表目录操作 )

注：这种指定方式，在内表中会将数据的拥有权给当前表，当表删除的时候数据也会删除(连同上层目录)

本地数据加载

load data local inpath ‘localpath‘ [overwrite] into table tableName;

加载HDFS中数据

load data inpath ‘hdfspath‘ [overwrite] into table tableName;

注意：这种方式，是将hdfs中指定位置的数据移动到表的目录下

使用Hadoop命令拷贝数据到指定位置(hive中shell执行和Linux中shell执行)

hdfs dfs -copyFromLocal /home/data /data

hive shell中 dfs -copyFromLocal /home/data data（hadoop命令直接可以在hive中执行，同样hive也可以执行linux命令，但是需要在命令前加上！）

由查询语句加载数据

insert [overwrite|into] table tableName select col1,col2... from tablenName2 where ...

from tableNable2 insert [overwrite|into] table tableName select col1,col2... where ...

select col1,col2.. from tableName2 where ... insert [overwrite|into] table tableName;

注：可以select的字段名字可以和table中不同，hive在数据加载时候不会进行字段检测和类型检测。只有在查询的时候检测

外表数据加载

创建表的时候指定数据位置(因为外表对数据没有控制权)

create external table (col_Name type comment...) location ‘path‘;

通过insert语句，和内表一样

通过hadoop命令，和内表一样

hive分区表数据加载

内部分区表和内表数据加载类似

外部分区表和外表数据加载类似

不同之处是指定分区；在外部分区表中数据存放的层次要表的分区一致，如果分区表下没有新增分区，即使目录下有数据也是查不到的,当满足目录结构对应的时候需要添加分区 alter table tableName add partition (dt=20150820)。

load data local inpath ‘path‘ [overwrie] into table tableNmme partition(pName=‘..‘);

insert [overwrite|into] table tableName partition(pName=‘..‘) select col1,col2.. from tableName2 where ...

注意：row format分隔符如果设定多个字符起分割作用，只有第一个字符有作用

load数据的时候，字段类型不能相互转换，否则会加载为NULL

插入数据时候如果selct后的类型也不能相互转换，否则插入为NULL;

在HDFS中NULL是以\N来显示的

Hive数据导出

导出方式：

Hadoop命令的方式

get

text

通过Insert....DIRECTOR

insert overwrite [local] directory ‘path‘ [row format delimited fields terminated by ‘\t‘ lines terminated by ‘\n‘] select col1,col2.. from tableName

注：如果使用local是导到本地，否则是HDFS中，row format只对导到本地起作用(在1.2.1hive中已经能够在HDFS中使用row format了

)。

通过Shell命令加管道

通过第三方工具

实例：

hdfs dfs -get /user/hive/warehouse/test4/* ./data/newdata

hdfs dfs -text /user/hive/warehouse/test5/* > ./data/newdata(可以对多种格式进行输出，压缩、序列化等)

hive -S -e "select * from test4" | grep yang > ./data/newdata

hive动态分区

分区不确定，需要从查询结果中查看。不需要为每个分区都使用alter table添加

使用动态分区需要配置的参数：

set hive.exec.dynamic.partition=true;//使用动态分区

set hive.exec.dynamic.partition.model=nonstrick;//分区有两种方式：一种是strick有限制分区，需要有一个静态分区，且放在最前面。一种就是nostrick无限制模式

set hive.exec.max.dynamic.partitions.pernode=10000;//每个节点生成动态分区的最大个数

set hive.exec.max.dynamic.partitions=100000;//生成动态分区的最大个数

set hive.exec.max.created.fiels=150000;//一个任务最多可以创建的文件数目

set dfs.datanode.max.xcievers=8192;//限定一次最多打开的文件数

insert overwrie table test7 partition(dt) select name,time as dt from test6;

表属性操作

修改表名：

alter table tableName rename to newName

修改列明:

alter table tableName change column old_col new_col newType comment ‘....‘ after colName(如果要为第一列则将aftercolName 改为first)

增加列：

alter table tableName add columns (c1 type comment ‘..‘,c2 type comment ‘...‘)

修改表属性

查看表属性

desc formatted tablename

这个是可以要修改的表的属性信息

Table Parameters:

COLUMN_STATS_ACCURATE false

last_modified_by colin

last_modified_time 1440154819

numFiles 0

numRows -1

rawDataSize -1

totalSize 0

transient_lastDdlTime 1440154819

修改属性：

alter table tableName set tblproerties(‘propertiesName‘=‘.....‘);

比如修改comment

alter table tableName set tblproperties("comment"="xxxxx");

修改序列化信息：

无分区表

alter table tableName set serdepropertie(‘fields.delim‘=‘\t‘);

有分区表

alter table tableName partition(dt=‘xxxx‘) set serdeproperties(‘fields.delim‘=‘\t‘);

修改Location：

alter table tableName [partition(..)] set localtion ‘path‘;

内部表外部表转换：

alter table tableName set tblproperties (‘EXTERNAL‘=‘TRUE|FALSE‘);必须大写EXTERNAL

更多属性操作查看:https://cwiki.apache.org/confluence/display/Hive/Home

时间： 2024-10-11 18:35:43

hive学习笔记-数据操作

hive学习笔记-数据操作的相关文章

MongoDB学习笔记(数据操作)

hive学习笔记-表操作

memcached学习笔记5--socke操作memcached 缓存系统

树莓派学习笔记——SQLite操作简述

memcached学习笔记3--telnet操作memcached

hive 学习笔记精简

Blender学习笔记 | 02 | 操作

[Python] Python 学习 - 可视化数据操作（一）

计算机操作系统学习笔记_1_操作系统概述