Section 2. Website Clickstream Project (Part 2): 7. Cascading Sums in Hive

I. A simple example of a cascading sum in Hive:

create table t_salary_detail(username string,month string,salary int)
row format delimited fields terminated by ',';

load data local inpath '/export/servers/hivedatas/accumulate/t_salary_detail.dat' into table t_salary_detail;

User, month, tip amount received
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,7
A,2015-03,9
B,2015-03,11
B,2015-03,6

Requirement: compute the total tips each user received in each month

select t.month,t.username,sum(salary) as salSum
from t_salary_detail t
group by t.username,t.month;

+----------+-------------+---------+--+
| t.month | t.username | salsum |
+----------+-------------+---------+--+
| 2015-01 | A | 33 |
| 2015-02 | A | 10 |
| 2015-03 | A | 16 |
| 2015-01 | B | 30 |
| 2015-02 | B | 15 |
| 2015-03 | B | 17 |
+----------+-------------+---------+--+
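The grouping above can be cross-checked outside Hive. The following is a minimal Python sketch, not part of the original walkthrough, that replays the sample rows and reproduces the monthly totals; the variable names (`rows`, `monthly`) are illustrative assumptions.

```python
from collections import defaultdict

# Sample rows copied from t_salary_detail.dat: (username, month, tip amount)
rows = [
    ("A", "2015-01", 5), ("A", "2015-01", 15), ("B", "2015-01", 5),
    ("A", "2015-01", 8), ("B", "2015-01", 25), ("A", "2015-01", 5),
    ("A", "2015-02", 4), ("A", "2015-02", 6), ("B", "2015-02", 10),
    ("B", "2015-02", 5), ("A", "2015-03", 7), ("A", "2015-03", 9),
    ("B", "2015-03", 11), ("B", "2015-03", 6),
]

# Plays the role of: group by username, month with sum(salary)
monthly = defaultdict(int)
for username, month, salary in rows:
    monthly[(username, month)] += salary

for (username, month), sal_sum in sorted(monthly.items()):
    print(month, username, sal_sum)
```

The printed totals should match the salsum column of the result table above.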

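Requirement: compute each user's cumulative (running) tip total, month by month
A plain group by on username returns only one overall total per user, which is not yet the running total we want:
select t.username,sum(salary) as salSum
from t_salary_detail t
group by t.username;

The target result adds a cumulative-tips column to the monthly totals:
+----------+-------------+---------+------------------+--+
| t.month | t.username | salsum | cumulative tips |
+----------+-------------+---------+------------------+--+
| 2015-01 | A | 33 | 33 |
| 2015-02 | A | 10 | 43 |
| 2015-03 | A | 16 | 59 |
| 2015-01 | B | 30 | 30 |
| 2015-02 | B | 15 | 45 |
| 2015-03 | B | 17 | 62 |
+----------+-------------+---------+------------------+--+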

Step 1: compute each user's tip total for each month

select t.month,t.username,sum(salary) as salSum
from t_salary_detail t
group by t.username,t.month;

+----------+-------------+---------+--+
| t.month | t.username | salsum |
+----------+-------------+---------+--+
| 2015-01 | A | 33 |
| 2015-02 | A | 10 |
| 2015-03 | A | 16 |
| 2015-01 | B | 30 |
| 2015-02 | B | 15 |
| 2015-03 | B | 17 |
+----------+-------------+---------+--+

Step 2: use an inner join to join the step 1 result with itself

select
A.* ,B.*
from
(select t.month,t.username,sum(salary) as salSum
from t_salary_detail t
group by t.username,t.month) A
inner join
(select t.month,t.username,sum(salary) as salSum
from t_salary_detail t
group by t.username,t.month) B
on A.username = B.username;

+----------+-------------+-----------+----------+-------------+-----------+--+
| a.month | a.username | a.salsum | b.month | b.username | b.salsum |
+----------+-------------+-----------+----------+-------------+-----------+--+
Keep this one row as the group for 2015-01
| 2015-01 | A | 33 | 2015-01 | A | 33 |

| 2015-01 | A | 33 | 2015-02 | A | 10 |
| 2015-01 | A | 33 | 2015-03 | A | 16 |
Keep these two rows as the group for 2015-02
| 2015-02 | A | 10 | 2015-01 | A | 33 |
| 2015-02 | A | 10 | 2015-02 | A | 10 |

| 2015-02 | A | 10 | 2015-03 | A | 16 |
Keep these three rows as the group for 2015-03
| 2015-03 | A | 16 | 2015-01 | A | 33 |
| 2015-03 | A | 16 | 2015-02 | A | 10 |
| 2015-03 | A | 16 | 2015-03 | A | 16 |

| 2015-01 | B | 30 | 2015-01 | B | 30 |
| 2015-01 | B | 30 | 2015-02 | B | 15 |
| 2015-01 | B | 30 | 2015-03 | B | 17 |
| 2015-02 | B | 15 | 2015-01 | B | 30 |
| 2015-02 | B | 15 | 2015-02 | B | 15 |
| 2015-02 | B | 15 | 2015-03 | B | 17 |
| 2015-03 | B | 17 | 2015-01 | B | 30 |
| 2015-03 | B | 17 | 2015-02 | B | 15 |
| 2015-03 | B | 17 | 2015-03 | B | 17 |
+----------+-------------+-----------+----------+-------------+-----------+--+

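Each step builds on the result of the previous step.

Add a filter condition to narrow the join result, keeping only the rows of B whose month is not later than A's month: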
select
A.* ,B.*
from
(select t.month,t.username,sum(salary) as salSum
from t_salary_detail t
group by t.username,t.month) A
inner join
(select t.month,t.username,sum(salary) as salSum
from t_salary_detail t
group by t.username,t.month) B
on A.username = B.username
where B.month <= A.month;

+----------+-------------+-----------+----------+-------------+-----------+--+
| a.month | a.username | a.salsum | b.month | b.username | b.salsum |
+----------+-------------+-----------+----------+-------------+-----------+--+
| 2015-01 | A | 33 | 2015-01 | A | 33 | 33

| 2015-02 | A | 10 | 2015-01 | A | 33 | 43
| 2015-02 | A | 10 | 2015-02 | A | 10 |

| 2015-03 | A | 16 | 2015-01 | A | 33 | 59
| 2015-03 | A | 16 | 2015-02 | A | 10 |
| 2015-03 | A | 16 | 2015-03 | A | 16 |

| 2015-01 | B | 30 | 2015-01 | B | 30 | 30

| 2015-02 | B | 15 | 2015-01 | B | 30 | 45
| 2015-02 | B | 15 | 2015-02 | B | 15 |

| 2015-03 | B | 17 | 2015-01 | B | 30 | 62
| 2015-03 | B | 17 | 2015-02 | B | 15 |
| 2015-03 | B | 17 | 2015-03 | B | 17 |
+----------+-------------+-----------+----------+-------------+-----------+--+
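The filter `B.month <= A.month` compares the month values as strings. This gives calendar order here because the months are zero-padded in `YYYY-MM` form. The short Python sketch below illustrates the same lexicographic behaviour; the unpadded values are a hypothetical counterexample that does not appear in the sample data.

```python
# Zero-padded 'YYYY-MM' strings sort in calendar order.
padded = ["2015-10", "2015-02", "2015-01"]
print(sorted(padded))

# Hypothetical unpadded months (not in the article's data): the order breaks,
# because "2015-10" sorts before "2015-2" character by character.
unpadded = ["2015-10", "2015-2", "2015-1"]
print(sorted(unpadded))
```

If the month column were stored without zero padding, the comparison would need a conversion to a date or a number first.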

Step 3: group the step 2 result by a.month and a.username, then sum b.salsum within each group to get the running total

select
A.username,A.month,max(A.salSum),sum(B.salSum) as accumulate
from
(select t.month,t.username,sum(salary) as salSum from t_salary_detail t group by t.username,t.month) A
inner join
(select t.month,t.username,sum(salary) as salSum from t_salary_detail t group by t.username,t.month) B
on A.username = B.username
where B.month <= A.month
group by A.username,A.month
order by A.username,A.month;

Result after summing the cumulative tips:
+-------------+----------+------+-------------+--+
| a.username | a.month | _c2 | accumulate |
+-------------+----------+------+-------------+--+
| A | 2015-01 | 33 | 33 |
| A | 2015-02 | 10 | 43 |
| A | 2015-03 | 16 | 59 |
| B | 2015-01 | 30 | 30 |
| B | 2015-02 | 15 | 45 |
| B | 2015-03 | 17 | 62 |
+-------------+----------+------+-------------+--+
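As a sanity check, the step 3 query can be replayed against an in-memory SQLite database. This is only a sketch under the assumption that SQLite evaluates this particular query the same way Hive does; the `text` column types and the `salSum` alias on `max(A.salSum)` are adaptations for the sketch, not part of the original statement.

```python
import sqlite3

# Sample rows copied from t_salary_detail.dat: (username, month, tip amount)
rows = [
    ("A", "2015-01", 5), ("A", "2015-01", 15), ("B", "2015-01", 5),
    ("A", "2015-01", 8), ("B", "2015-01", 25), ("A", "2015-01", 5),
    ("A", "2015-02", 4), ("A", "2015-02", 6), ("B", "2015-02", 10),
    ("B", "2015-02", 5), ("A", "2015-03", 7), ("A", "2015-03", 9),
    ("B", "2015-03", 11), ("B", "2015-03", 6),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table t_salary_detail(username text, month text, salary int)")
conn.executemany("insert into t_salary_detail values (?, ?, ?)", rows)

# Self-join of the monthly totals, filtered to earlier-or-equal months,
# then grouped again to accumulate B.salSum per (username, month).
query = """
select A.username, A.month, max(A.salSum) as salSum, sum(B.salSum) as accumulate
from (select t.month, t.username, sum(salary) as salSum
      from t_salary_detail t group by t.username, t.month) A
inner join
     (select t.month, t.username, sum(salary) as salSum
      from t_salary_detail t group by t.username, t.month) B
on A.username = B.username
where B.month <= A.month
group by A.username, A.month
order by A.username, A.month
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
conn.close()
```

Hive also provides windowing functions, and an expression of the form `sum(salSum) over (partition by username order by month)` can often produce the same running total without a self-join. That is a different technique from the self-join shown in this section.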

II. Path conversion (funnel model)

0: jdbc:hive2://node03:10000> select * from ods_click_pageviews limit 10;
+---------------------------------------+----------------------------------+----------------------------------+---------------------------------+------------------------------+---------------------------------+------------------------------------+----------------------------------------------------+----------------------------------------------------+--------------------------------------+-----------------------------+------------------------------+--+
| ods_click_pageviews.session | ods_click_pageviews.remote_addr | ods_click_pageviews.remote_user | ods_click_pageviews.time_local | ods_click_pageviews.request | ods_click_pageviews.visit_step | ods_click_pageviews.page_staylong | ods_click_pageviews.http_referer | ods_click_pageviews.http_user_agent | ods_click_pageviews.body_bytes_sent | ods_click_pageviews.status | ods_click_pageviews.datestr |
+---------------------------------------+----------------------------------+----------------------------------+---------------------------------+------------------------------+---------------------------------+------------------------------------+----------------------------------------------------+----------------------------------------------------+--------------------------------------+-----------------------------+------------------------------+--+
| 9ff03c4c-25f4-43fc-b3e0-08ea88b38fcc | 1.80.249.223 | - | 2013-09-18 07:57:33 | /hadoop-hive-intro/ | 1 | 60 | "http://www.google.com.hk/url?sa=t&rct=j&q=hive%E7%9A%84%E5%AE%89%E8%A3%85&source=web&cd=2&ved=0CC4QFjAB&url=%68%74%74%70%3a%2f%2f%62%6c%6f%67%2e%66%65%6e%73%2e%6d%65%2f%68%61%64%6f%6f%70%2d%68%69%76%65%2d%69%6e%74%72%6f%2f&ei=5lw5Uo-2NpGZiQfCwoG4BA&usg=AFQjCNF8EFxPuCMrm7CvqVgzcBUzrJZStQ&bvm=bv.52164340,d.aGc&cad=rjt" | "Mozilla/5.0(WindowsNT5.2;rv:23.0)Gecko/20100101Firefox/23.0" | 14764 | 200 | 20130918 |
| ba46b150-ca63-47d1-9cdd-e278df01f5d3 | 101.226.167.201 | - | 2013-09-18 09:30:36 | /hadoop-mahout-roadmap/ | 1 | 60 | "http://blog.fens.me/hadoop-mahout-roadmap/" | "Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.1;Trident/4.0;SLCC2;.NETCLR2.0.50727;.NETCLR3.5.30729;.NETCLR3.0.30729;MediaCenterPC6.0;MDDR;.NET4.0C;.NET4.0E;.NETCLR1.1.4322;TabletPC2.0);360Spider" | 10335 | 200 | 20130918 |
| 30e183c4-e03d-4a5a-b5ba-55fff2ea1be1 | 101.226.167.205 | - | 2013-09-18 09:30:32 | /hadoop-family-roadmap/ | 1 | 60 | "http://blog.fens.me/hadoop-family-roadmap/" | "Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.1;Trident/4.0;SLCC2;.NETCLR2.0.50727;.NETCLR3.5.30729;.NETCLR3.0.30729;MediaCenterPC6.0;MDDR;.NET4.0C;.NET4.0E;.NETCLR1.1.4322;TabletPC2.0);360Spider" | 11715 | 200 | 20130918 |
| ea77f279-451d-4efa-8a7f-3d321675ad4d | 101.226.169.215 | - | 2013-09-18 10:07:31 | /about | 1 | 60 | "http://blog.fens.me/about" | "Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.1;Trident/4.0;SLCC2;.NETCLR2.0.50727;.NETCLR3.5.30729;.NETCLR3.0.30729;MediaCenterPC6.0;MDDR;.NET4.0C;.NET4.0E;.NETCLR1.1.4322;TabletPC2.0);360Spider" | 5 | 301 | 20130918 |
| bbe902cb-9496-46fe-b201-2065996373c3 | 110.211.10.14 | - | 2013-09-18 13:31:10 | /hadoop-mahout-roadmap/ | 1 | 60 | "http://f.dataguru.cn/forum.php?mod=viewthread&tid=175501" | "Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.1;WOW64;Trident/4.0;SLCC2;.NETCLR2.0.50727;.NETCLR3.5.30729;.NETCLR3.0.30729;MALN;InfoPath.2;.NET4.0C;MediaCenterPC6.0)" | 10335 | 200 | 20130918 |
| 1646b21a-a2d6-40ef-ab7a-56496ba0e493 | 111.161.17.104 | - | 2013-09-18 12:17:25 | /hadoop-hive-intro/ | 1 | 60 | "http://blog.fens.me/series-hadoop-cloud/" | "Mozilla/5.0(WindowsNT6.2;WOW64)AppleWebKit/537.36(KHTML,likeGecko)Chrome/29.0.1547.66Safari/537.36" | 14763 | 200 | 20130918 |
| d8261f93-be31-45a9-82d9-094657157468 | 111.193.224.9 | - | 2013-09-18 07:17:25 | /hadoop-family-roadmap/ | 1 | 60 | "https://www.google.com.hk/" | "Mozilla/5.0(Macintosh;IntelMacOSX10_8_5)AppleWebKit/537.36(KHTML,likeGecko)Chrome/29.0.1547.57Safari/537.36" | 11715 | 200 | 20130918 |
| a854f511-3ca9-406b-8a34-5c9cc6a61782 | 112.65.193.16 | - | 2013-09-18 08:48:31 | /hadoop-mahout-roadmap/ | 1 | 60 | "-" | "Mozilla/4.0" | 38590 | 200 | 20130918 |
| 01c87dee-e91c-4ad7-b3cc-0c121bc03806 | 113.107.237.31 | - | 2013-09-18 09:06:46 | /finance-rhive-repurchase/ | 1 | 60 | "-" | "-" | 45271 | 200 | 20130918 |
| f8660978-9b1e-414b-a075-c04f1cb7197a | 113.90.232.163 | - | 2013-09-19 00:58:00 | /hadoop-mahout-roadmap/ | 1 | 60 | "http://h2w.iask.cn/jump.php?url=http%3A%2F%2Fblog.fens.me%2Fhadoop-mahout-roadmap%2F" | "Mozilla/5.0(iPhone;CPUiPhoneOS6_0_1likeMacOSX)AppleWebKit/536.26(KHTML,likeGecko)Mobile/10A523" | 10321 | 200 | 20130918 |
+---------------------------------------+----------------------------------+----------------------------------+---------------------------------+------------------------------+---------------------------------+------------------------------------+----------------------------------------------------+----------------------------------------------------+--------------------------------------+-----------------------------+------------------------------+--+

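We want two metrics:
Metric 1: the conversion rate of each step relative to the first step
Metric 2: the conversion rate of each step relative to the previous step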

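-- The model-generated data is sufficient for computing the conversion rates.
-- Load it either from HDFS:
load data inpath '/weblog/clickstream/pageviews/click-part-r-00000' overwrite into table ods_click_pageviews partition(datestr='20130920');
-- or from the local file system (both statements overwrite the same partition, so run only one of them):
load data local inpath '/export/servers/hivedatas/click-part-r-00000' overwrite into table ods_click_pageviews partition(datestr='20130920');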

----------------------------------------------------------
---1. Query the total number of visitors at each step

Step    Page        Visitors   vs. previous step   vs. first step   Cumulative
Step1   /item       1000       -                   -                1000
Step2   /category   800        0.8                 0.8              1800
Step3   /index      500        0.625               0.5              2300
Step4   /order      100        0.2                 0.1              2400
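The arithmetic behind the figures above can be sketched in a few lines of Python. The list name `steps` and the derived variable names are illustrative assumptions; the page names and visitor counts are the ones listed above.

```python
from itertools import accumulate

# (page, visitors) for each funnel step, in order
steps = [("/item", 1000), ("/category", 800), ("/index", 500), ("/order", 100)]
counts = [n for _, n in steps]

# Ratio of each step to the step before it (undefined for the first step)
vs_previous = [None] + [cur / prev for prev, cur in zip(counts, counts[1:])]

# Ratio of each step to the first step (left undefined for the first step,
# matching the figures above)
vs_first = [None] + [n / counts[0] for n in counts[1:]]

# Cascading (running) sum of visitors
cumulative = list(accumulate(counts))

for (page, n), p, f, c in zip(steps, vs_previous, vs_first, cumulative):
    print(page, n, p, f, c)
```

The same three derived columns are what the self-join queries in the rest of this section compute in Hive.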

create table dw_oute_numbs as
select 'step1' as step,count(distinct remote_addr) as numbs from ods_click_pageviews
where datestr='20130920'
and request like '/item%'
union all
select 'step2' as step,count(distinct remote_addr) as numbs from ods_click_pageviews
where datestr='20130920'
and request like '/category%'
union all
select 'step3' as step,count(distinct remote_addr) as numbs from ods_click_pageviews
where datestr='20130920'
and request like '/order%'
union all
select 'step4' as step,count(distinct remote_addr) as numbs from ods_click_pageviews
where datestr='20130920'
and request like '/index%';

+---------------------+----------------------+--+
| dw_oute_numbs.step | dw_oute_numbs.numbs |
+---------------------+----------------------+--+
| step1 | 1029 |
| step2 | 1029 |
| step3 | 1028 |
| step4 | 1018 |
+---------------------+----------------------+--+

----------------------------------------------------------------------------
--2. Query the ratio of each step's visitor count to the count at the start of the path
--Cascading query: join the table with itself

select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
from dw_oute_numbs rn
inner join
dw_oute_numbs rr;

The result of the self-join is shown below.
It is the basis for the conversion rate of each step relative to the first step.

+---------+----------+---------+----------+--+
| rnstep | rnnumbs | rrstep | rrnumbs |
+---------+----------+---------+----------+--+
| step1 | 1029 | step1 | 1029 |
| step2 | 1029 | step1 | 1029 |
| step3 | 1028 | step1 | 1029 |
| step4 | 1018 | step1 | 1029 |
| step1 | 1029 | step2 | 1029 |
| step2 | 1029 | step2 | 1029 |
| step3 | 1028 | step2 | 1029 |
| step4 | 1018 | step2 | 1029 |
| step1 | 1029 | step3 | 1028 |
| step2 | 1029 | step3 | 1028 |
| step3 | 1028 | step3 | 1028 |
| step4 | 1018 | step3 | 1028 |
| step1 | 1029 | step4 | 1018 |
| step2 | 1029 | step4 | 1018 |
| step3 | 1028 | step4 | 1018 |
| step4 | 1018 | step4 | 1018 |
+---------+----------+---------+----------+--+

Filter the join so that only the rows paired with step1 remain:
select tempTab.rnnumbs/tempTab.rrnumbs from (
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
from dw_oute_numbs rn
inner join
dw_oute_numbs rr where rr.step = 'step1'
) tempTab;

--Visitors at each step / visitors at the first step = ratio of each step to the starting point
select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio
from(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where tmp.rrstep='step1';

Simplified SQL statement:
select a.step,a.numbs/b.numbs compareFirst from dw_oute_numbs a,dw_oute_numbs b
where b.step='step1'
order by a.step;
+---------+---------------------+--+
| a.step | comparefirst |
+---------+---------------------+--+
| step1 | 1.0 |
| step2 | 1.0 |
| step3 | 0.9990281827016521 |
| step4 | 0.989310009718173 |
+---------+---------------------+--+
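The ratios in this result can be reproduced by hand from the counts stored in dw_oute_numbs. A minimal Python sketch, with the dictionary `numbs` standing in for the table contents shown earlier (the variable names are assumptions):

```python
# Visitor counts per step, as stored in dw_oute_numbs
numbs = {"step1": 1029, "step2": 1029, "step3": 1028, "step4": 1018}

# Every step is divided by the step1 count, like the join against b.step='step1'
first = numbs["step1"]
compare_first = {step: n / first for step, n in sorted(numbs.items())}

for step, ratio in compare_first.items():
    print(step, ratio)
```

Because the four counts are nearly identical, every ratio stays close to 1.0, which is why the funnel barely narrows in this dataset.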

--------------------------------------------------------------------------------
--3. Query the leakage rate of each step relative to the previous step
--First, use a self-join to pair each step with the step before it

select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
from dw_oute_numbs rn
inner join
dw_oute_numbs rr
where cast(substr(rn.step,5,1) as int)=cast(substr(rr.step,5,1) as int)-1;

select newTable.rnnumbs/newTable.rrnumbs from (
select * from (
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
from dw_oute_numbs rn
inner join
dw_oute_numbs rr
) tmpTable
where cast(substr(tmpTable.rrstep,5,1) as int ) = cast(substr(tmpTable.rnstep,5,1) as int )-1
) newTable;

-- Variant filter (pseudocode): where substring(tmpTable.rrstep) >= substring(tmpTable.rnstep)

Note: cast is a built-in Hive function, used mainly for type conversion.
Examples:
select cast(1 as float);
select cast('2018-06-22' as date);

+---------+----------+---------+----------+--+
| rnstep | rnnumbs | rrstep | rrnumbs |
+---------+----------+---------+----------+--+
| step1 | 1029 | step2 | 1029 |
| step2 | 1029 | step3 | 1028 |
| step3 | 1028 | step4 | 1018 |
+---------+----------+---------+----------+--+

--The leakage rate of each step relative to the previous step can then be computed easily
select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as leakage_rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
from dw_oute_numbs rn
inner join
dw_oute_numbs rr
) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1;

My approach:
-- cast('123' as float): explicitly converts a string to a floating-point number
select a.step,a.numbs,b.step,b.numbs,case when a.numbs is null then 'relative to previous step' else b.numbs/a.numbs end compareLast
from dw_oute_numbs a
right outer join dw_oute_numbs b
on regexp_replace(b.step,'step','')=regexp_replace(a.step,'step','')+1 -- regexp_replace could be replaced by substr(b.step,5,1)=substr(a.step,5,1)+1, which avoids the regular expression and can be more efficient
order by b.step;
+---------+----------+---------+----------+----------------------------+--+
| a.step | a.numbs | b.step | b.numbs | comparelast |
+---------+----------+---------+----------+----------------------------+--+
| NULL | NULL | step1 | 1029 | relative to previous step |
| step1 | 1029 | step2 | 1029 | 1.0 |
| step2 | 1029 | step3 | 1028 | 0.9990281827016521 |
| step3 | 1028 | step4 | 1018 | 0.9902723735408561 |
+---------+----------+---------+----------+---------------------+--+

-----------------------------------------------------------------------------------
--4. Combine the two metrics above
select abs.step,abs.numbs,abs.rate as abs_ratio,rel.rate as leakage_rate
from
(
select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where tmp.rrstep='step1'
) abs
left outer join
(
select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
) rel
on abs.step=rel.step;
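The combined query can be mirrored in plain Python to see what each output column holds. This is a sketch under stated assumptions: `step_no` is a hypothetical helper, and `None` stands in for the NULL that the left outer join yields for step1, which has no previous step.

```python
# Visitor counts per step, as stored in dw_oute_numbs
numbs = [("step1", 1029), ("step2", 1029), ("step3", 1028), ("step4", 1018)]


def step_no(step):
    """Return the numeric suffix of a step label such as 'step3'."""
    return int(step[len("step"):])


by_no = {step_no(step): n for step, n in numbs}

summary = []
for step, n in numbs:
    k = step_no(step)
    abs_ratio = n / by_no[1]          # relative to the first step
    prev = by_no.get(k - 1)           # None when there is no previous step
    leakage_rate = n / prev if prev is not None else None
    summary.append((step, n, abs_ratio, leakage_rate))

for row in summary:
    print(row)
```

One design difference is worth noting: `step_no` parses the whole suffix, whereas `substr(step,5,1)` in the queries reads a single character, so the SQL as written would need adjusting for a funnel with ten or more steps.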

Original source: https://www.cnblogs.com/mediocreWorld/p/11108484.html

Time: 2024-09-29 02:34:32
