Basic Operations on a Hive Data Warehouse

[Author]: kwu

1. Database layout:

default: the default database, used for testing. Path: /hdfs/hive/default

stage: the staging database. Path: /hdfs/dw/stage

ods: the production database. Path: /hdfs/dw/ods
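
As a rough sketch, the stage and ods databases could be created as follows. The LOCATION values are an assumption: they drop the "/hdfs" prefix from the paths listed above, following the same convention the table example in the next section uses, and may differ on a real cluster.

```sql
-- Hypothetical locations: the listed paths with the "/hdfs" prefix removed (assumption).
create database if not exists stage location '/dw/stage';
create database if not exists ods location '/dw/ods';

-- List the databases and inspect where one of them is stored.
show databases;
describe database stage;
```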

2. Creating a table

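create EXTERNAL table test_kwu (

dateday string comment "Date, e.g. 2015-01-01",

datetime string comment "Time, e.g. 11:30:01:123",

ip string comment "IP: the user's local IP, or the outbound router IP of the user's network segment",

cookieid string comment "User cookie: a unique identifier generated on the client side",

userid string comment "Hexun user ID: the user's registered ID on Hexun",

logserverip string comment "Log server IP: the IP of the log collection server",

referer string comment "Referrer: the HTTP referer of the page the user viewed",

requesturl string comment "Request URL: the URL currently being visited",

remark1 string comment "[Currently unused]: meaningless data; added early on and cannot be removed for now",

remark2 string comment "[Currently unused]: meaningless data; added early on and cannot be removed for now",

alexaflag string comment "ALEXA flag: also added early on; 1 if the user has the Alexa tool installed, otherwise 0. It no longer appears to serve any purpose.",

ua string comment "UA: the user's browser user agent",

wirelessflag string comment "Wireless channel flag: used only by the wireless channel; a single word indicating which Hexun channel the article belongs to"

)

comment "Browsing track log"

partitioned by(day string comment "Daily partition column")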

ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '

STORED AS TEXTFILE

location '/hive/default/test_kwu'; -- Note: this path is the database path with the "/hdfs" prefix removed, followed by the table name.
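
Once the statement above has run, the table definition can be checked before any data is loaded. The following is a minimal sketch using standard Hive commands against the test_kwu table defined above.

```sql
-- Show columns, comments, the table type (EXTERNAL) and the storage location.
describe formatted test_kwu;

-- List the partitions registered so far (none until data is loaded).
show partitions test_kwu;

-- Print the DDL as Hive has stored it.
show create table test_kwu;
```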

3. Loading data

load data local inpath '/home/kwu/data/20150512.dat' overwrite into table test_kwu partition (day='20150512');

insert into table test_kwu PARTITION (day='20150507') select dateday, datetime, ip, cookieid, userid, logserverip, referer,

requesturl, remark1, remark2, alexaflag, ua, wirelessflag from test_kwu;
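
Because the select above has no filter, it reads every partition of test_kwu. The sketch below shows one way to restrict the source to a single partition and then check the row counts per partition; using day='20150512' as the source is an assumption made for illustration.

```sql
-- Copy only the rows of one source partition (the source day is an illustrative assumption).
insert into table test_kwu partition (day='20150507')
select dateday, datetime, ip, cookieid, userid, logserverip, referer,
       requesturl, remark1, remark2, alexaflag, ua, wirelessflag
from test_kwu
where day = '20150512';

-- Check how many rows each partition now holds.
select day, count(*) as cnt
from test_kwu
group by day;
```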

4. Compression

set hive.enforce.bucketing=true;

set hive.exec.compress.output=true;

set mapred.output.compress=true;

set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

insert overwrite table test_kwu PARTITION (day='20150507') select dateday, datetime, ip, cookieid, userid, logserverip, referer,

requesturl, remark1, remark2, alexaflag, ua, wirelessflag from test_kwu;
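
To confirm that the settings took effect, the current values and the files written for the partition can be inspected from the Hive CLI. This is a sketch only: the partition directory assumes the table location given in section 2, and with GzipCodec the output files would typically carry a .gz extension.

```sql
-- Print the current value of a setting (a "set" with no value only displays it).
set hive.exec.compress.output;
set mapred.output.compression.codec;

-- List the files of the rewritten partition (path assumes the table location from section 2).
dfs -ls /hive/default/test_kwu/day=20150507;
```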

5. Basic queries

Daily page views (PV):

select dateday,count(*) from tracklog group by dateday;

Avoid aggregate functions over the whole table where possible:

select count(*) as cnt from tracklog group by cookieid having cnt=1 ;

A subquery can be used instead, here to count the cookies that appear exactly once:

select count(t.cookieid) from (select count(cookieid) as cnt,cookieid from tracklog group by cookieid having cnt=1 ) t;
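
A full-table scan can also be reduced by filtering on the partition column. The sketch below assumes that tracklog is partitioned by a day column in the same way as test_kwu; the source does not show the tracklog schema, so the column name and the date are illustrative.

```sql
-- Page views for a single partition (assumes tracklog has a "day" partition column).
select dateday, count(*) as pv
from tracklog
where day = '20150512'
group by dateday;

-- Number of cookies seen exactly once within that partition.
select count(*) as single_visit_cookies
from (
  select cookieid
  from tracklog
  where day = '20150512'
  group by cookieid
  having count(*) = 1
) t;
```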

Time: 2024-07-28 17:53:34
