Big Data Module Development: ETL

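At its core, ETL extracts data from the various data sources, transforms it, and finally loads it into the dimensionally modeled tables of the data warehouse. The ETL job is complete only once these dimension and fact tables have been populated.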

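In this project the analysis runs on a Hadoop cluster, mainly using the Hive data warehouse tool. The collected and preprocessed data therefore has to be loaded into the Hive warehouse before the subsequent analysis can proceed.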

1. Create the ODS-layer tables

1.1. Raw log table

drop table if exists ods_weblog_origin;
create table ods_weblog_origin(
  valid           string, -- validity flag
  remote_addr     string, -- source IP
  remote_user     string, -- user identifier
  time_local      string, -- full access timestamp
  request         string, -- requested URL
  status          string, -- response code
  body_bytes_sent string, -- bytes sent
  http_referer    string, -- referring URL
  http_user_agent string) -- client user agent
partitioned by (datestr string)
row format delimited
fields terminated by '\001';

1.2. Clickstream pageviews model table

drop table if exists ods_click_pageviews;
create table ods_click_pageviews(
  session         string,
  remote_addr     string,
  remote_user     string,
  time_local      string,
  request         string,
  visit_step      string,
  page_staylong   string,
  http_referer    string,
  http_user_agent string,
  body_bytes_sent string,
  status          string)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';

1.3. Clickstream visit model table

drop table if exists ods_click_stream_visit;
create table ods_click_stream_visit(
  session     string,
  remote_addr string,
  inTime      string,
  outTime     string,
  inPage      string,
  outPage     string,
  referal     string,
  pageVisits  int)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';

2. Load data into the ODS layer

-- load the preprocessed data into the target partition
load data inpath '/weblog/preprocessed/' overwrite into table
ods_weblog_origin partition(datestr='20130918');
-- list the partitions
show partitions ods_weblog_origin;
-- count the imported rows
select count(*) from ods_weblog_origin;
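Once more than one day has been loaded, the unfiltered count above covers every partition. A minimal sketch of a per-partition check, assuming the partition value 20130918 used in the load statement:

```sql
-- row count for the partition that was just loaded
select count(*) from ods_weblog_origin where datestr = '20130918';

-- row count per partition, useful after several days have been loaded
select datestr, count(*) as cnt
from ods_weblog_origin
group by datestr
order by datestr;
```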

The two clickstream model tables are loaded in the same way.

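Note: in production, the load commands should be written into a script and scheduled to run periodically in azkaban. Pay attention to the run time: the job must start only after the preprocessing step has finished.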
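One possible shape for such a script is an HQL file that takes the partition date as a variable, so the scheduler can pass in a different day on each run. This is only a sketch: the file name load_ods.hql and the variable name day are hypothetical, and the input path is the one from the example above.

```sql
-- load_ods.hql (hypothetical file name)
-- Could be invoked by the scheduled job as, for example:
--   hive --hiveconf day=20130918 -f load_ods.hql
-- ${hiveconf:day} is replaced by Hive's variable substitution before execution.
load data inpath '/weblog/preprocessed/'
overwrite into table ods_weblog_origin
partition (datestr = '${hiveconf:day}');
```

Keeping the date out of the script body means the same file can be reused for every daily run.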

3. Build the ODS-layer detail wide table

3.1. Requirements

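The whole analysis proceeds layer by layer through the data warehouse. Broadly, intermediate tables are first derived from the raw ODS data. For example, to simplify later analysis, unstructured values such as the time and the URL are parsed into structured fields, breaking each one down into finer-grained columns to form a detail table. The various metrics are then computed on top of these intermediate tables.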
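As an illustration of the second stage, a metric query on top of the detail table might look like the sketch below. It assumes the ods_weblog_detail table defined in section 3.2 has been populated for the 2013-09-18 partition; the choice of metric (the ten most requested URLs) is only an example and not part of the original project.

```sql
-- top 10 requested URLs for one day, computed from the detail table
select request, count(*) as pvs
from ods_weblog_detail
where datestr = '2013-09-18'
group by request
order by pvs desc
limit 10;
```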

3.2. ETL implementation

- Create the detail table ods_weblog_detail:

drop table if exists ods_weblog_detail;
create table ods_weblog_detail(
  valid           string, -- validity flag
  remote_addr     string, -- source IP
  remote_user     string, -- user identifier
  time_local      string, -- full access timestamp
  daystr          string, -- access date
  timestr         string, -- access time
  month           string, -- access month
  day             string, -- access day
  hour            string, -- access hour
  request         string, -- requested URL
  status          string, -- response code
  body_bytes_sent string, -- bytes sent
  http_referer    string, -- referring URL
  ref_host        string, -- referrer host
  ref_path        string, -- referrer path
  ref_query       string, -- referrer query string
  ref_query_id    string, -- value of the id parameter in the referrer query
  http_user_agent string  -- client user agent
)
partitioned by(datestr string);

- Populate the detail wide table ods_weblog_detail through queries

1) Extract the referer URL into the intermediate table t_ods_tmp_referurl

That is, split the referring URL into host, path, query, and query id.

drop table if exists t_ods_tmp_referurl;
create table t_ods_tmp_referurl as
SELECT a.*, b.*
FROM ods_weblog_origin a
LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id;
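To see what parse_url_tuple returns before running it over the whole table, it can be tried on a literal. The sketch below uses a made-up URL (www.example.com is not from the project data) and assumes a Hive version that allows a subquery without a FROM clause (0.13 or later).

```sql
select b.host, b.path, b.query, b.query_id
from (select 'http://www.example.com/path/a.html?id=42&k=v' as url) a
lateral view parse_url_tuple(a.url, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b
  as host, path, query, query_id;
-- expected: www.example.com   /path/a.html   id=42&k=v   42
```

'QUERY:id' extracts the value of the id parameter only; a referer without that parameter yields NULL in that column.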

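Note: lateral view is used together with UDTFs such as explode (often fed by split, which turns a string into an array). It joins each input row with the rows the UDTF generates, so a single column value can be expanded into multiple rows. parse_url_tuple is also a UDTF: for each input URL it returns one row with several columns.

A UDTF (User-Defined Table-Generating Function) handles the case where one input row produces multiple output rows (one-to-many mapping). explode is one such function: explode(ARRAY) emits one row for each element of the array.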
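A small self-contained illustration of lateral view with explode, using an inline literal instead of a real table (the column names id, s, and word are made up for the example, and a Hive version that allows a subquery without FROM is assumed):

```sql
select t.id, w.word
from (select 1 as id, 'a,b,c' as s) t
lateral view explode(split(t.s, ',')) w as word;
-- expected rows: (1, a), (1, b), (1, c)
```

split turns the string into an array, explode emits one row per element, and lateral view attaches the original row's columns to each generated row.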

2) Extract and transform the time_local field into the intermediate detail table t_ods_tmp_detail

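drop table if exists t_ods_tmp_detail;
create table t_ods_tmp_detail as
-- offsets assume time_local looks like 'yyyy-MM-dd HH:mm:ss'; Hive substring positions are 1-based
select b.*,
  substring(time_local,1,10) as daystr,
  substring(time_local,12) as tmstr,
  substring(time_local,6,2) as month,
  substring(time_local,9,2) as day,
  substring(time_local,12,2) as hour -- position 11 is the separator, so the hour starts at 12
from t_ods_tmp_referurl b;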
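The substring offsets are easy to get wrong, so it can help to check them against a literal first. The sketch below assumes time_local has the form 'yyyy-MM-dd HH:mm:ss', which is what the date, month, and day offsets imply but which the article does not state explicitly; the aliases are chosen only for the example.

```sql
select substring(t, 1, 10) as daystr,  -- 2013-09-18
       substring(t, 12)    as tmstr,   -- 06:49:18
       substring(t, 6, 2)  as mm,      -- 09
       substring(t, 9, 2)  as dd,      -- 18
       substring(t, 12, 2) as hh       -- 06
from (select '2013-09-18 06:49:18' as t) x;
```

With this format, position 11 is the space between date and time, so substring(t, 11, 3) would return ' 06' with a leading space rather than '06'.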

3) The statements above can be combined into a single statement

-- offsets assume time_local looks like 'yyyy-MM-dd HH:mm:ss'
insert into table shizhan.ods_weblog_detail partition(datestr='2013-09-18')
select c.valid, c.remote_addr, c.remote_user, c.time_local,
  substring(c.time_local,1,10) as daystr,
  substring(c.time_local,12) as tmstr,
  substring(c.time_local,6,2) as month,
  substring(c.time_local,9,2) as day,
  substring(c.time_local,12,2) as hour, -- position 11 is the separator
  c.request, c.status, c.body_bytes_sent, c.http_referer,
  c.ref_host, c.ref_path, c.ref_query, c.ref_query_id, c.http_user_agent
from
(SELECT
  a.valid, a.remote_addr, a.remote_user, a.time_local,
  a.request, a.status, a.body_bytes_sent, a.http_referer, a.http_user_agent,
  b.ref_host, b.ref_path, b.ref_query, b.ref_query_id
FROM shizhan.ods_weblog_origin a LATERAL VIEW
parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as ref_host, ref_path, ref_query,
ref_query_id) c;
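After the insert finishes, a quick look at the new partition helps confirm that the derived columns were filled as intended. A minimal sketch, assuming the shizhan database and the 2013-09-18 partition used in the statement above:

```sql
-- confirm the partition exists
show partitions shizhan.ods_weblog_detail;

-- spot-check a few rows of the derived columns
select time_local, daystr, timestr, ref_host, ref_path, ref_query_id
from shizhan.ods_weblog_detail
where datestr = '2013-09-18'
limit 10;
```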

Original source: https://blog.51cto.com/14473726/2432523

Time: 2024-10-01 03:37:43
