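Task goals:
Goal 1: how many teachers have taught each student
Method 1: DISTINCT first, then count
- DISTINCT removes duplicates across the whole relation
Method 2: group first
- nested FOREACH
- use DISTINCT inside the nested block
First, create a data source file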
[[email protected] ~]$ cat score.txt
James,Network,Tiger,100
James,Database,Tiger,99
James,PDE,Yao,95
Vincent,Network,Tiger,95
Vincent,PDE,Yao,98
Vincent,PDE,
NocWei,PDE,Yao,100
[[email protected] ~]$ hadoop fs -put score.txt /
Method 1 in the grunt shell:
[[email protected] ~]$ pig
grunt> A = LOAD '/score.txt' USING PigStorage(',') AS (student,course,teacher,score:int);
grunt> DESCRIBE A;
grunt> B = FOREACH A GENERATE student, teacher;   -- keep only student and teacher, drop the rest
grunt> DESCRIBE B;   -- inspect B's schema; it now has only two fields
grunt> C = DISTINCT B;   -- remove duplicate rows from B
grunt> D = FOREACH ( GROUP C BY student ) GENERATE group AS student , COUNT(C);
grunt> DUMP D;   -- result
(James,2)
(NocWei,1)
(Vincent,3)
grunt>
Method 2 in the grunt shell:
grunt> E = group B by student;
grunt> F = foreach E
>> {
>>     T = B.teacher;
>>     uniq = DISTINCT T;
>>     generate group as student,COUNT(uniq) as cnt;
>> }
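The two methods can disagree on the row with a missing teacher. Pig's COUNT skips a tuple whose first field is null. In Method 1 the first field of C is student, so the pair with an empty teacher is still counted, which is why the result shows (Vincent,3). In Method 2 the bag uniq holds only the teacher field, so that null should be skipped. The following is a minimal sketch, assuming the empty teacher field is loaded as null and should not count as a teacher; the aliases B_clean, C_clean and D_clean are made up for illustration.

```pig
-- Same load as in the session above.
A = LOAD '/score.txt' USING PigStorage(',') AS (student, course, teacher, score:int);
B = FOREACH A GENERATE student, teacher;

-- Assumption: a missing teacher is null and must not be counted.
B_clean = FILTER B BY teacher IS NOT NULL;

-- Method 1 on the cleaned pairs: deduplicate, then count per student.
C_clean = DISTINCT B_clean;
D_clean = FOREACH (GROUP C_clean BY student) GENERATE group AS student, COUNT(C_clean) AS cnt;
DUMP D_clean;
```

Filtering before DISTINCT keeps the counting rule explicit instead of relying on which field happens to come first in the tuple.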
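Goal 2: find the top two students in each course
Step 1: GROUP BY
- group the rows so they can be processed in a nested block
Step 2: ORDER BY
- inside a nested FOREACH
Step 3: LIMIT
- used together with ORDER BY
Step 4: FLATTEN
- un-nests the bag (removes the brackets)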
grunt> A = LOAD '/score.txt' USING PigStorage(',') as (student,course,teacher,score:int);
grunt> dump A;
(James,Network,Tiger,100)
(James,Database,Tiger,99)
(James,PDE,Yao,95)
(Vincent,Network,Tiger,95)
(Vincent,PDE,Yao,98)
(Vincent,PDE,,)
(NocWei,PDE,Yao,100)
grunt> B = FOREACH A GENERATE student,course,score;
grunt> dump B;
(James,Network,100)
(James,Database,99)
(James,PDE,95)
(Vincent,Network,95)
(Vincent,PDE,98)
(Vincent,PDE,)
(NocWei,PDE,100)
grunt> C = group B by course;
grunt> dump C;
(PDE,{(NocWei,PDE,100),(Vincent,PDE,),(Vincent,PDE,98),(James,PDE,95)})
(Network,{(Vincent,Network,95),(James,Network,100)})
(Database,{(James,Database,99)})
grunt> D = FOREACH C
>> {
>>     sorted = ORDER B BY score DESC;
>>     top = LIMIT sorted 2;
>>     GENERATE group AS course, top AS top;
>> }
grunt> dump D;
(Database,{(James,Database,99)})
(Network,{(James,Network,100),(Vincent,Network,95)})
(PDE,{(NocWei,PDE,100),(Vincent,PDE,98)})
grunt> E = FOREACH D GENERATE course,FLATTEN(top);   -- un-nest the bag in the output
grunt> dump E;
(Database,James,Database,99)
(Network,James,Network,100)
(Network,Vincent,Network,95)
(PDE,NocWei,PDE,100)
(PDE,Vincent,PDE,98)
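In E the course shows up twice in every row: once as the group key and once inside the flattened tuple. One possible variant, assuming only the student and the score are wanted next to the course, is to project those two fields out of the bag before flattening. The alias E2 is hypothetical and not part of the original session.

```pig
-- Same pipeline as the session above, written as a script.
A = LOAD '/score.txt' USING PigStorage(',') AS (student, course, teacher, score:int);
B = FOREACH A GENERATE student, course, score;
C = GROUP B BY course;

-- Per course: sort the grouped rows by score, keep the first two.
D = FOREACH C
{
    sorted = ORDER B BY score DESC;
    top = LIMIT sorted 2;
    GENERATE group AS course, top AS top;
}

-- Project student and score out of the bag, then un-nest it,
-- so each output row has the shape (course, student, score).
E2 = FOREACH D GENERATE course, FLATTEN(top.(student, score));
DUMP E2;
```

The projection top.(student, score) still yields a bag, so FLATTEN produces one row per student as before, just without the repeated course column.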
Time: 2025-01-04 15:19:22