十七、分段维度
本节说明分段维度的实现技术。分段维度包含连续值的分段。例如,年度销售订单分段维度可能包含有叫做“低”、“中”、“高”的三档;各档定义分别为0.01到15000、15000.01到30000.00、30000.01到99999999.99。如果一个客户的年度销售订单金额为10000,则被归为“低”档。
分段维度可以存储多个分段集合。例如,可能有一个用于促销分析的分段集合,另一个用于市场细分,可能还有一个用于销售区域计划。分段一般由用户定义,而且很少能从交易源数据直接获得。
1. 年度销售订单星型模式
本小节说明如何实现一个年度订单分段维度。需要两个新的星型模式,如下图所示。星型模式的事实表使用(关联到)已有的customer_dim和一个新的year_dim表。年维度是日期维度的子集。annual_customer_segment_fact是唯一用到annual_order_segment_dim表的表。annual_order_segement_dim是分段维度表。
annual_order_segment_dim表存储多个分段集合。在下面的例子里将两个分段集合“project alpha”和“grid”导入annual_order_segment_dim表。这两种分段集合都是按照用户的年度销售订单金额将其分类。project alpha分六段,grid分三段。下表显示了这个分段的例子。
Segment Name |
Band Name |
Start Value |
End Value |
PROJECT ALPHA |
Bottom |
0.01 |
2500.00 |
PROJECT ALPHA |
Low |
2500.01 |
3000.00 |
PROJECT ALPHA |
Mid-low |
3000.01 |
4000.00 |
PROJECT ALPHA |
Mid |
4000.01 |
5500.00 |
PROJECT ALPHA |
Mid-high |
5500.01 |
6500.00 |
PROJECT ALPHA |
Top |
6500.01 |
99999999.99 |
Grid |
LOW |
0.01 |
3000.00 |
Grid |
MED |
3000.01 |
6000.00 |
Grid |
HIGH |
6000.01 |
99999999.99 |
每一分段有一个开始值和一个结束值。 分段的粒度就是本段和下段之间的间隙。粒度必须是度量的最小可能值,在销售订单金额的示例中是0.01。最后一个分段的结束值是销售订单金额可能的最大值。下面的脚本用于建立分段维度数据仓库模式。
use dw; create table annual_order_segment_dim ( segment_sk int, segment_name varchar(30), band_name varchar(50), band_start_amount decimal(10,2), band_end_amount decimal(10,2), version int, effective_date date, expiry_date date ) clustered by (segment_sk) into 8 buckets stored as orc tblproperties (‘transactional‘=‘true‘); insert into annual_order_segment_dim values (1, ‘project alpha‘, ‘bottom‘, 0.01, 2500.00, 1, ‘1900-01-01‘, ‘2200-01-01‘); insert into annual_order_segment_dim values (2, ‘project alpha‘, ‘low‘, 2500.01, 3000.00, 1, ‘1900-01-01‘, ‘2200-01-01‘); insert into annual_order_segment_dim values (3, ‘project alpha‘, ‘mid-low‘, 3000.01, 4000.00, 1, ‘1900-01-01‘, ‘2200-01-01‘); insert into annual_order_segment_dim values (4, ‘project alpha‘, ‘mid‘, 4000.01, 5500.00, 1, ‘1900-01-01‘, ‘2200-01-01‘); insert into annual_order_segment_dim values (5, ‘project alpha‘, ‘mid_high‘, 5500.01, 6500.00, 1, ‘1900-01-01‘, ‘2200-01-01‘); insert into annual_order_segment_dim values (6, ‘project alpha‘, ‘top‘, 6500.01, 99999999.99, 1, ‘ 1900-01-01‘, ‘2200-01-01‘); insert into annual_order_segment_dim values (7, ‘grid‘, ‘low‘, 0.01, 3000, 1, ‘1900-01-01‘, ‘2200-01-01‘); insert into annual_order_segment_dim values (8, ‘grid‘, ‘med‘, 3000.01, 6000.00, 1, ‘ 1900-01-01‘, ‘2200-01-01‘); insert into annual_order_segment_dim values (9, ‘grid‘, ‘high‘, 6000.01, 99999999.99, 1, ‘1900-01-01‘, ‘2200-01-01‘); create table year_dim ( year_sk int, year int ); create table annual_sales_order_fact ( customer_sk int, year_sk int, annual_order_amount decimal(10, 2) ); create table annual_customer_segment_fact ( segment_sk int, customer_sk int, year_sk int );
2. 初始装载
本小节说明初始装载并进行测试。下面的初始装载脚本将order_date维度表(date_dim表的一个视图)里的数据导入year_dim表,将sales_order_fact表里的数据导入annual_sales_order_fact表,将annual_sales_order_fact表里的数据导入annual_customer_segment_fact表。此脚本装载所有历史数据。
use dw; insert into year_dim select row_number() over (order by t1.year) + t2.sk_max, year from (select distinct year year from order_date_dim) t1 cross join (select coalesce(max(year_sk),0) sk_max from year_dim) t2; insert into annual_sales_order_fact select a.customer_sk, year_sk, sum(order_amount) from sales_order_fact a, year_dim c, order_date_dim d where a.order_date_sk = d.order_date_sk and c.year = d.year and d.year < 2017 group by a.customer_sk, c.year_sk; insert into annual_customer_segment_fact select d.segment_sk, a.customer_sk, a.year_sk from annual_sales_order_fact a, annual_order_segment_dim d where annual_order_amount >= band_start_amount and annual_order_amount <= band_end_amount;
执行初始装载脚本,查询annual_customer_segment_fact表确认初始装载是成功的。
select a.customer_sk csk, a.year_sk ysk, annual_order_amount amt, segment_name sn, band_name bn from annual_customer_segment_fact a, annual_order_segment_dim b, year_dim c, annual_sales_order_fact d where a.segment_sk = b.segment_sk and a.year_sk = c.year_sk and a.customer_sk = d.customer_sk and a.year_sk = d.year_sk cluster by csk, ysk, sn, bn;
查询结果如下图所示。
注意,这里是按客户代理键customer_sk分组求和来判断分段,实际情况可能是以customer_number进行分组的,因为无论客户的scd属性如何变化,一般还是认为是一个客户。
3. 定期装载
本小节说明定期装载脚本和如何测试它。除了无需装载year_dim表以外,定期装载与初始装载类似。annual_sales_order_fact表里的数据被导入annual_customer_segment_fact表。每年调度执行下面的定期装载脚本,此脚本装载前一年的销售数据。
use dw; insert into annual_sales_order_fact select a.customer_sk, year_sk, sum(order_amount) from sales_order_fact a, year_dim c, order_date_dim d where a.order_date_sk = d.order_date_sk and c.year = d.year and d.year = year(current_date) - 1 group by a.customer_sk, c.year_sk; insert into annual_customer_segment_fact select d.segment_sk, a.customer_sk, c.year_sk from annual_sales_order_fact a, year_dim c, annual_order_segment_dim d where a.year_sk = c.year_sk and c.year = year(current_date) - 1 and annual_order_amount >= band_start_amount and annual_order_amount <= band_end_amount;