Hive Case Study: Detecting Abnormal Order Statuses

Requirements

An order goes through five statuses: created, picked, shipped, received, and cancelled.

Goal: flag abnormal orders, i.e. orders where the gap between creation and picking exceeds 2 hours, the gap between creation and shipping exceeds 4 hours, or the gap between creation and receipt exceeds 48 hours.

Key Points

1) External tables

2) Using desc formatted

3) Virtual columns

4) ALTER FILEFORMAT

5) Using COALESCE and unix_timestamp

6) Parquet

Implementation

External Table

Order created table:

CREATE EXTERNAL TABLE order_created (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data into the table:

load data local inpath '/home/spark/software/data/order_created.txt' overwrite into table order_created;

Another way to import data: specify the directory that holds the data files with LOCATION when creating the table.

1) The directory specified by LOCATION already contains data files when the table is created:

CREATE TABLE test_load1 (
    id STRING
  , name  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/student';

2) The directory specified by LOCATION does not contain any files yet when the table is created:

CREATE TABLE test_load2 (
    id STRING
  , name  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/load';

Upload the data file into that directory:

hadoop fs -put /home/spark/software/data/student.txt /load/
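Once the file sits in the LOCATION directory, the table should see it without any further load step. A quick check (a sketch; assumes student.txt contains tab-separated id/name rows):

select * from test_load2 limit 10;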

Query the data:

select * from order_created;
10703007267488  2014-05-01 06:01:12.334+01
10101043505096  2014-05-01 07:28:12.342+01
10103043509747  2014-05-01 07:50:12.33+01
10103043501575  2014-05-01 09:27:12.33+01
10104043514061  2014-05-01 09:03:12.324+01

Static Partitioned Table

CREATE TABLE order_created_partition (
    orderNumber STRING
  , event_time  STRING
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Loading data into the partitioned table, method 1:

load data local inpath '/home/spark/software/data/order_created.txt' overwrite into table order_created_partition PARTITION(event_month='2014-05');

Query the data:

select * from order_created_partition where event_month='2014-05';   -- no MapReduce job is launched
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05

Loading data into the partitioned table, method 2:

Step 1: create the directory event_month=2014-06 under /user/hive/warehouse/order_created_partition on HDFS:

hadoop fs -mkdir /user/hive/warehouse/order_created_partition/event_month=2014-06

Step 2: copy the data file into the newly created directory:

hadoop fs -put /home/spark/software/data/order_created.txt /user/hive/warehouse/order_created_partition/event_month=2014-06

Step 3: register the new partition in the metastore:

msck repair table order_created_partition;

Execution log:

Partitions not in metastore:    order_created_partition:event_month=2014-06
Repair: Added partition to metastore order_created_partition:event_month=2014-06
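As an alternative to msck repair, a single partition can also be registered explicitly; a minimal sketch, assuming the directory created above:

ALTER TABLE order_created_partition ADD IF NOT EXISTS PARTITION (event_month='2014-06');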

Query the data in partition event_month=2014-06:

select * from order_created_partition where event_month='2014-06';
10703007267488  2014-05-01 06:01:12.334+01      2014-06
10101043505096  2014-05-01 07:28:12.342+01      2014-06
10103043509747  2014-05-01 07:50:12.33+01       2014-06
10103043501575  2014-05-01 09:27:12.33+01       2014-06
10104043514061  2014-05-01 09:03:12.324+01      2014-06

List all partitions of the table:

show partitions order_created_partition;

Show a specific partition of the table:

SHOW PARTITIONS order_created_partition PARTITION(event_month='2014-06');

Inspect the table metadata:

desc order_created_partition;
desc extended order_created_partition;
desc formatted order_created_partition;
desc formatted order_created_partition partition(event_month='2014-05');

Dynamic Partitioned Table

CREATE TABLE order_created_dynamic_partition (
    orderNumber STRING
  , event_time  STRING
)
PARTITIONED BY (event_month string);

Load data:

insert into table order_created_dynamic_partition PARTITION (event_month)
select orderNumber, event_time, substr(event_time, 1, 7) as event_month from order_created;

This fails with:

FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column.
To turn this off set hive.exec.dynamic.partition.mode=nonstrict

Solution:

set hive.exec.dynamic.partition.mode=nonstrict;
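Depending on the Hive version and its defaults, dynamic partitioning itself may also need to be switched on; the usual pair of settings (the first line is an assumption about this environment, it is not shown in the original run):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;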

Re-run the insert:

insert into table order_created_dynamic_partition PARTITION (event_month)
select orderNumber, event_time, substr(event_time, 1, 7) as event_month from order_created;

Query the data:

select * from order_created_dynamic_partition;
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05

Parquet Table and ALTER FILEFORMAT

Create a partitioned table stored as Parquet:

CREATE TABLE order_created_dynamic_partition_parquet (
    orderNumber STRING
  , event_time  STRING
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet;

Inspect the table metadata; pay particular attention to how the SerDe Library, InputFormat, and OutputFormat properties differ between the two tables:

desc formatted order_created_dynamic_partition_parquet;
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

desc formatted order_created_dynamic_partition;
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Insert data:

insert into table order_created_dynamic_partition_parquet PARTITION (event_month)
select orderNumber, event_time, substr(event_time, 1, 7) as event_month from order_created;

Query the data:

select * from order_created_dynamic_partition_parquet;
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05

Look at the file stored on HDFS:

hadoop fs -text /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-05/000000_0

The output is unreadable: Parquet is a binary format, so the file content cannot be inspected directly with hadoop fs -text.
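If a parquet-tools jar happens to be available on the machine (an assumption; the jar name below is only illustrative), the Parquet file can be inspected in a readable form, for example:

hadoop jar parquet-tools.jar head /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-05/000000_0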

Note: the following steps copy a textfile-format data file into the Parquet table.

Create event_month=2014-06 under /user/hive/warehouse/order_created_dynamic_partition_parquet on HDFS:

hadoop fs -mkdir /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06

Copy the data file into the newly created directory:

hadoop fs -put /home/spark/software/data/order_created.txt /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06

Register the new partition in the metastore:

msck repair table order_created_dynamic_partition_parquet;

Query the data:

select * from order_created_dynamic_partition_parquet;

This fails with:

Failed with exception java.io.IOException:java.lang.RuntimeException:
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt
is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [52, 43, 48, 49]

Cause: the file under event_month=2014-06 is a plain text file, not a Parquet file.

Reading a partition that actually holds Parquet data still works, for example:

select * from order_created_dynamic_partition_parquet where event_month='2014-05';

Solution:

For Hive 0.12 and earlier:

ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT textfile;

Both the SERDE and the FILEFORMAT must be changed; changing only one of them has no effect.

After these manual settings, queries against the partition work normally again.
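For example, a quick check against the repaired partition (the same query pattern as before, shown here only as a sanity check):

select * from order_created_dynamic_partition_parquet where event_month='2014-06';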

Note the resulting SerDe Library, InputFormat, and OutputFormat values:

Storage Information of partition event_month='2014-06':

desc formatted order_created_dynamic_partition_parquet partition(event_month='2014-06');
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat 

Storage Information of partition event_month='2014-05':

desc formatted order_created_dynamic_partition_parquet partition(event_month='2014-05');
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

Storage Information of the table itself:

desc formatted order_created_dynamic_partition_parquet;
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

Conclusions from the above:

1) Different partitions of order_created_dynamic_partition_parquet can use different storage formats;

2) The table-level storage format remains the one specified when the table was created.

For Hive 0.13 and later:

ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT parquet;

Inserting data into the Parquet table

insert into table order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') select orderNumber, event_time from order_created;

Listing the files under the order_created_dynamic_partition_parquet directory shows that a single partition now holds files in two different formats:

hadoop fs -ls /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06
/user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/000000_0      # Parquet format
/user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt   # TEXTFILE format

This is because event_month='2014-06' had been manually switched to TEXTFILE, while the insert wrote Parquet data (an insert writes in the storage format specified when the table was created).

Running select * from order_created_dynamic_partition_parquet; now fails again.

Check the partition metadata:

desc formatted order_created_dynamic_partition_parquet partition(event_month='2014-06');
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat 

The SerDe Library is back to Parquet: the earlier manual settings were overwritten and the partition reverted to the storage format specified at table creation. They have to be applied again:

ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT textfile;

select * from order_created_dynamic_partition_parquet;

Only part of the data comes back; the rows stored in Parquet format are not displayed. This may be a Hive bug.
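One way to bring the partition back to a consistent, all-Parquet state is to remove the stray text file and restore the Parquet format; a sketch, assuming Hive 0.13+ and the file layout shown above:

hadoop fs -rm /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT parquet;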

Hive Virtual Columns

INPUT__FILE__NAME: the name of the input file that the row comes from.

BLOCK__OFFSET__INSIDE__FILE: the byte offset of the row within that file.

select INPUT__FILE__NAME, ordernumber, event_time, BLOCK__OFFSET__INSIDE__FILE  from order_created_dynamic_partition;
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10703007267488 2014-05-01 06:01:12.334+01      0
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10101043505096 2014-05-01 07:28:12.342+01      42
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043509747 2014-05-01 07:50:12.33+01       84
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043501575 2014-05-01 09:27:12.33+01       125
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10104043514061 2014-05-01 09:03:12.324+01      166

Estimate the line number of each row within its file (each line occupies length(ordernumber) + length(event_time) + 2 bytes, the extra 2 being the tab delimiter and the newline):

select INPUT__FILE__NAME, ordernumber, event_time, BLOCK__OFFSET__INSIDE__FILE / (length(ordernumber) + length(event_time) + 2) + 1 from order_created_dynamic_partition;
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10703007267488 2014-05-01 06:01:12.334+01      1
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10101043505096 2014-05-01 07:28:12.342+01      2
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043509747 2014-05-01 07:50:12.33+01       3.0487804878
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043501575 2014-05-01 09:27:12.33+01       4.0487804878
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10104043514061 2014-05-01 09:03:12.324+01      4.95238095238
select INPUT__FILE__NAME, ordernumber, event_time, round(BLOCK__OFFSET__INSIDE__FILE / (length(ordernumber) + length(event_time) + 2) + 1) from order_created_dynamic_partition;
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10703007267488 2014-05-01 06:01:12.334+01      1
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10101043505096 2014-05-01 07:28:12.342+01      2
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043509747 2014-05-01 07:50:12.33+01       3
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043501575 2014-05-01 09:27:12.33+01       4
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10104043514061 2014-05-01 09:03:12.324+01      5

Order Picked Table

Create the table:

CREATE TABLE order_picked (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data:

load data local inpath '/home/spark/software/data/order_picked.txt' overwrite into table order_picked;

Query the data:

select * from order_picked;
10703007267488  2014-05-01 07:02:12.334+01
10101043505096  2014-05-01 08:29:12.342+01
10103043509747  2014-05-01 10:55:12.33+01

Order Shipped Table

Create the table:

CREATE TABLE order_shipped (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data:

load data local inpath '/home/spark/software/data/order_shipped.txt' overwrite into table order_shipped;

Query the data:

select * from order_shipped;
10703007267488  2014-05-01 10:00:12.334+01
10101043505096  2014-05-01 18:39:12.342+01

Order Received Table

Create the table:

CREATE TABLE order_received (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data:

load data local inpath '/home/spark/software/data/order_received.txt' overwrite into table order_received;

Query the data:

select * from order_received;
10703007267488  2014-05-02 12:12:12.334+01

Order Cancelled Table

Create the table:

CREATE TABLE order_cancelled (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data:

load data local inpath '/home/spark/software/data/order_cancelled.txt' overwrite into table order_cancelled;

Query the data:

select * from order_cancelled;
10103043501575  2014-05-01 12:12:12.334+01

Pivoting rows to columns:

Method 1: UNION ALL

CREATE TABLE order_tracking AS
SELECT orderNumber
     , max(CASE WHEN type_id="order_created"   THEN event_time ELSE '0' END) AS order_created_ts
     , max(CASE WHEN type_id="order_picked"    THEN event_time ELSE '0' END) AS order_picked_ts
     , max(CASE WHEN type_id="order_shipped"   THEN event_time ELSE '0' END) AS order_shipped_ts
     , max(CASE WHEN type_id="order_received"  THEN event_time ELSE '0' END) AS order_received_ts
     , max(CASE WHEN type_id="order_cancelled" THEN event_time ELSE '0' END) AS order_cancelled_ts
FROM (
    select orderNumber, "order_created"   as type_id, event_time FROM order_created
  UNION ALL
    select orderNumber, "order_picked"    as type_id, event_time FROM order_picked
  UNION ALL
    select orderNumber, "order_shipped"   as type_id, event_time FROM order_shipped
  UNION ALL
    select orderNumber, "order_received"  as type_id, event_time FROM order_received
  UNION ALL
    select orderNumber, "order_cancelled" as type_id, event_time FROM order_cancelled
) u
group by orderNumber;
select * from order_tracking order by order_created_ts limit 5;
10703007267488  2014-05-01 06:01:12.334+01      2014-05-01 07:02:12.334+01      2014-05-01 10:00:12.334+01    2014-05-02 12:12:12.334+01       0
10101043505096  2014-05-01 07:28:12.342+01      2014-05-01 08:29:12.342+01      2014-05-01 18:39:12.342+01    0
10103043509747  2014-05-01 07:50:12.33+01       2014-05-01 10:55:12.33+01       0       0       0
10104043514061  2014-05-01 09:03:12.324+01      0       0       0       0
10103043501575  2014-05-01 09:27:12.33+01       0       0       0       2014-05-01 12:12:12.334+01

Method 2: JOIN

CREATE TABLE order_tracking_join AS
select t1.orderNumber
     , t1.event_time as order_created_ts
     , t2.event_time as order_picked_ts
     , t3.event_time as order_shipped_ts
     , t4.event_time as order_received_ts
     , t5.event_time as order_cancelled_ts
from (
  select ordernumber, max(event_time) as event_time from order_created group by ordernumber
) t1
left outer join (
  select ordernumber, max(event_time) as event_time from order_picked group by ordernumber
) t2
on t1.ordernumber = t2.ordernumber
left outer join (
  select ordernumber, max(event_time) as event_time from order_shipped group by ordernumber
) t3
on t1.ordernumber = t3.ordernumber
left outer join (
  select ordernumber, max(event_time) as event_time from order_received group by ordernumber
) t4
on t1.ordernumber = t4.ordernumber
left outer join (
  select ordernumber, max(event_time) as event_time from order_cancelled group by ordernumber
) t5
on t1.ordernumber = t5.ordernumber;
select * from order_tracking_join order by order_created_ts limit 5;
10703007267488  2014-05-01 06:01:12.334+01      2014-05-01 07:02:12.334+01      2014-05-01 10:00:12.334+01    2014-05-02 12:12:12.334+01       NULL
10101043505096  2014-05-01 07:28:12.342+01      2014-05-01 08:29:12.342+01      2014-05-01 18:39:12.342+01    NULL     NULL
10103043509747  2014-05-01 07:50:12.33+01       2014-05-01 10:55:12.33+01       NULL    NULL    NULL
10104043514061  2014-05-01 09:03:12.324+01      NULL    NULL    NULL    NULL
10103043501575  2014-05-01 09:27:12.33+01       NULL    NULL    NULL    2014-05-01 12:12:12.334+01

Final statistics:

How COALESCE(unix_timestamp(order_picked_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) works: COALESCE returns its first non-NULL argument, so when unix_timestamp returns NULL the value 0 is used instead.
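A minimal illustration of the two functions (stand-alone statements; assumes a Hive version that supports SELECT without FROM, and the literal values are made up for demonstration):

select unix_timestamp('2014-05-01 07:28:12.342', 'yyyy-MM-dd HH:mm:ss.S');   -- converts the timestamp string to epoch seconds
select COALESCE(cast(null as bigint), 0);                                    -- first argument is NULL, so 0 is returned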

Method 1: against the UNION ALL result table

select orderNumber
     , order_created_ts
     , order_picked_ts
     , order_shipped_ts
     , order_received_ts
     , order_cancelled_ts
  from order_tracking
 WHERE order_created_ts != '0' AND order_cancelled_ts = '0'
   AND (
    COALESCE(unix_timestamp(order_picked_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 2 * 60 * 60
    OR
    COALESCE(unix_timestamp(order_shipped_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 4 * 60 * 60
    OR
    COALESCE(unix_timestamp(order_received_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 48 * 60 * 60
   )
;
10101043505096  2014-05-01 07:28:12.342+01      2014-05-01 08:29:12.342+01      2014-05-01 18:39:12.342+01    0
10103043509747  2014-05-01 07:50:12.33+01       2014-05-01 10:55:12.33+01       0       0       0

Method 2: against the JOIN result table

select orderNumber
     , order_created_ts
     , order_picked_ts
     , order_shipped_ts
     , order_received_ts
     , order_cancelled_ts
  from order_tracking_join
 WHERE order_created_ts IS NOT NULL AND order_cancelled_ts IS NULL
   AND (
    COALESCE(unix_timestamp(order_picked_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 2 * 60 * 60
    OR
    COALESCE(unix_timestamp(order_shipped_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 4 * 60 * 60
    OR
    COALESCE(unix_timestamp(order_received_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 48 * 60 * 60
   )
;
10101043505096  2014-05-01 07:28:12.342+01      2014-05-01 08:29:12.342+01      2014-05-01 18:39:12.342+01    NULL     NULL
10103043509747  2014-05-01 07:50:12.33+01       2014-05-01 10:55:12.33+01       NULL    NULL    NULL