1. def cume_dist(): Column
–CUME_DIST 小于等于当前值的行数/分组内总行数
–比如,统计小于等于当前薪水的人数,所占总人数的比例
d1,user1,1000 d1,user2,2000 d1,user3,3000 d2,user4,4000 d2,user5,5000 df.withColumn("rn1",cume_dist().over(Window.partitionBy(col("dept")).orderBy(col("sal")))).show() dept userid sal rn2 ------------------------------------------- d1 user1 1000 0.3333333333333333 d1 user2 2000 0.6666666666666666 d1 user3 3000 1.0 d2 user4 4000 0.5 d2 user5 5000 1.0 rn1: 按照部门分组,dpet=d1的行数为3, 第二行:小于等于2000的行数为2,因此,2/3=0.6666666666666666
2.def percent_rank(): Column
–PERCENT_RANK 分组内当前行的RANK值-1/分组内总行数-1
应用场景不了解,可能在一些特殊算法的实现中可以用到吧。
d1,user1,1000 d1,user2,2000 d1,user3,3000 d2,user4,4000 d2,user5,5000 df.withColumn("rn1",percent_rank().over(Window.partitionBy(col("dept")).orderBy(col("sal")))).show() dept userid sal rn2 ----------------------------- d1 user1 1000 0.0 d1 user2 2000 0.5 d1 user3 3000 1.0 d2 user4 4000 0.0 d2 user5 5000 1.0
3.def ntile(n: Int): Column
NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值
NTILE不支持ROWS BETWEEN,比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
如果切片不均匀,默认增加第一个切片的分布
cookie1 2015-04-10 1 cookie1 2015-04-11 5 cookie1 2015-04-12 7 cookie1 2015-04-13 3 cookie1 2015-04-14 2 cookie1 2015-04-15 4 cookie1 2015-04-16 4 cookie2 2015-04-10 2 cookie2 2015-04-11 3 cookie2 2015-04-12 5 cookie2 2015-04-13 6 cookie2 2015-04-14 3 cookie2 2015-04-15 9 cookie2 2015-04-16 7 df.withColumn("rn3",ntile(3).over(Window.partitionBy(col("cookieid")).orderBy(col("createtime")))).show() 比如,统计一个cookie,pv数最多的前1/3的天 --rn = 1 的记录,就是我们想要的结果 cookieid day pv rn ---------------------------------- cookie1 2015-04-12 7 1 cookie1 2015-04-11 5 1 cookie1 2015-04-15 4 1 cookie1 2015-04-16 4 2 cookie1 2015-04-13 3 2 cookie1 2015-04-14 2 3 cookie1 2015-04-10 1 3 cookie2 2015-04-15 9 1 cookie2 2015-04-16 7 1 cookie2 2015-04-13 6 1 cookie2 2015-04-12 5 2 cookie2 2015-04-14 3 2 cookie2 2015-04-11 3 3 cookie2 2015-04-10 2 3
4. def row_number(): Column
ROW_NUMBER() –从1开始,按照顺序,生成分组内记录的序列
row_number(): 不考虑数据的重复性 按照顺序一次打上标号 : 如: 1,2,3,4
–比如,按照pv降序排列,生成分组内每天的pv名次
ROW_NUMBER() 的应用场景非常多,再比如,获取分组内排序第一的记录;获取一个session中的第一条refer等。
cookie1 2015-04-10 1 cookie1 2015-04-11 5 cookie1 2015-04-12 7 cookie1 2015-04-13 3 cookie1 2015-04-14 2 cookie1 2015-04-15 4 cookie1 2015-04-16 4 cookie2 2015-04-10 2 cookie2 2015-04-11 3 cookie2 2015-04-12 5 cookie2 2015-04-13 6 cookie2 2015-04-14 3 cookie2 2015-04-15 9 cookie2 2015-04-16 7 df.withColumn("rn3",ROW_NUMBER().over(Window.partitionBy(col("cookieid")).orderBy(col("pv").desc))).show() cookieid day pv rn ------------------------------------------- cookie1 2015-04-12 7 1 cookie1 2015-04-11 5 2 cookie1 2015-04-15 4 3 cookie1 2015-04-16 4 4 cookie1 2015-04-13 3 5 cookie1 2015-04-14 2 6 cookie1 2015-04-10 1 7 cookie2 2015-04-15 9 1 cookie2 2015-04-16 7 2 cookie2 2015-04-13 6 3 cookie2 2015-04-12 5 4 cookie2 2015-04-14 3 5 cookie2 2015-04-11 3 6 cookie2 2015-04-10 2 7
5. def rank(): Column
—RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位
rank() :考虑数据的重复性,挤占坑位,如:1,2,2,4
cookie1 2015-04-10 1 cookie1 2015-04-11 5 cookie1 2015-04-12 7 cookie1 2015-04-13 3 cookie1 2015-04-14 2 cookie1 2015-04-15 4 cookie1 2015-04-16 4 cookie2 2015-04-10 2 cookie2 2015-04-11 3 cookie2 2015-04-12 5 cookie2 2015-04-13 6 cookie2 2015-04-14 3 cookie2 2015-04-15 9 cookie2 2015-04-16 7 df.withColumn("rn3",RANK().over(Window.partitionBy(col("cookieid")).orderBy(col("pv").desc))).show() cookieid day pv rn1 -------------------------------------- cookie1 2015-04-12 7 1 cookie1 2015-04-11 5 2 cookie1 2015-04-15 4 3 cookie1 2015-04-16 4 3 cookie1 2015-04-13 3 5 cookie1 2015-04-14 2 6 cookie1 2015-04-10 1 7 rn1: 15号和16号并列第3, 13号排第5
6. def dense_rank() : Column
DENSE_RANK() 生成数据项在分组中的排名,排名相等会在名次中不会留下空位
dense_rank() : 考虑数据重复性 不挤占坑位,如:1,2,2,3
cookie1 2015-04-10 1 cookie1 2015-04-11 5 cookie1 2015-04-12 7 cookie1 2015-04-13 3 cookie1 2015-04-14 2 cookie1 2015-04-15 4 cookie1 2015-04-16 4 cookie2 2015-04-10 2 cookie2 2015-04-11 3 cookie2 2015-04-12 5 cookie2 2015-04-13 6 cookie2 2015-04-14 3 cookie2 2015-04-15 9 cookie2 2015-04-16 7 df.withColumn("rn1",DENSE_RANK().over(Window.partitionBy(col("cookieid")).orderBy(col("pv").desc))).show() cookieid day pv rn1 -------------------------------------------- cookie1 2015-04-12 7 1 cookie1 2015-04-11 5 2 cookie1 2015-04-15 4 3 cookie1 2015-04-16 4 3 cookie1 2015-04-13 3 4 cookie1 2015-04-14 2 5 cookie1 2015-04-10 1 6 rn2: 15号和16号并列第3, 13号排第4
原文地址:https://www.cnblogs.com/yyy-blog/p/12642840.html