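The group statement collects records that share the same key. It differs fundamentally from GROUP BY in SQL: in SQL, the groups created by a GROUP BY clause must feed directly into one or more aggregate functions, whereas in Pig Latin there is no direct connection between group and aggregate functions.
The group keyword does exactly what its name says: it packs all records that have the same key value into a bag. You can then pass that result to an aggregate function or process it in some other way.
A group statement triggers a reduce phase.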
The data file contains the following:
    $ cat orders.data
    1 apple 30 x
    2 apple 50 x
    3 banana 30 y
    4 pear 20 y
    5 banana 10 y
Load the data and group it
    data = load '/orders.data' as (orderid:int, fruit:chararray, amount:int);
    grpd = group data by fruit;
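Inspect the schema of the grouped data
After grouping, the relation has only two fields: group (the grouping key) and a data field, whose name is the alias of the relation that was grouped and whose value is a bag holding all the records of that group.
    describe grpd;
    grpd: {group: chararray,data: {(orderid: int,fruit: chararray,amount: int)}}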
View the grouped data
    dump grpd;
    (pear,{(4,pear,20)})
    (apple,{(2,apple,50),(1,apple,30)})
    (banana,{(5,banana,10),(3,banana,30)})
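The shape of the grouped result can be modelled outside Pig. The following is a minimal Python sketch, not Pig's implementation: the helper name group_by is hypothetical, and a Python list stands in for a bag even though a Pig bag is unordered. It shows that grouping only collects whole records per key and performs no aggregation.

```python
from collections import defaultdict

# Rows mirror orders.data loaded as (orderid, fruit, amount).
data = [(1, "apple", 30), (2, "apple", 50), (3, "banana", 30),
        (4, "pear", 20), (5, "banana", 10)]

def group_by(rows, key_fn):
    """Hypothetical helper: return {key: bag}, each bag holding the whole rows for that key."""
    bags = defaultdict(list)
    for row in rows:
        bags[key_fn(row)].append(row)
    return dict(bags)

# Equivalent in spirit to: grpd = group data by fruit;
grpd = group_by(data, lambda row: row[1])
for key, bag in grpd.items():
    print((key, bag))
```

Each value keeps the full original records, which is why the grouped result can later be fed to any function, aggregate or not.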
Process the grouped result with an aggregate function, for example by summing the amount field of each bag:
    sums = foreach grpd generate group, SUM(data.amount);
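The grouping key can also be an expression over fields, for example the sum of the first two fields (both must be numeric for this to work):
    group data by $0+$1;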
Grouping by multiple keys
The grouped relation again has two fields: a tuple with the alias group, and a bag that collects the records of that group.
    group data by (field1, field2)
    orders = load '/orders.data' as (orderid:int, fruit:chararray, amount:int, type:chararray);
    grpd = group orders by (fruit, type);
    describe grpd;
    grpd: {group: (fruit: chararray,type: chararray),orders: {(orderid: int,fruit: chararray,amount: int,type: chararray)}}
    dump grpd;
    ((pear,y),{(4,pear,20,y)})
    ((apple,x),{(2,apple,50,x),(1,apple,30,x)})
    ((banana,y),{(5,banana,10,y),(3,banana,30,y)})

Aggregate per compound key:
    sums = foreach grpd generate group, SUM(orders.amount);
    dump sums;
    ((pear,y),20)
    ((apple,x),80)
    ((banana,y),40)

Project the components of the key with group.$0 and group.$1 to flatten the result:
    sums2 = foreach grpd generate group.$0, group.$1, SUM(orders.amount);
    dump sums2;
    (pear,y,20)
    (apple,x,80)
    (banana,y,40)
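The difference between sums and sums2 above can be sketched in Python. This is only a model of the semantics under the assumption that a tuple key plays the role of Pig's group tuple; the variable names simply mirror the aliases in the listing.

```python
from collections import defaultdict

# Rows mirror orders.data loaded as (orderid, fruit, amount, type).
orders = [(1, "apple", 30, "x"), (2, "apple", 50, "x"), (3, "banana", 30, "y"),
          (4, "pear", 20, "y"), (5, "banana", 10, "y")]

# group orders by (fruit, type): the key is itself a tuple.
grpd = defaultdict(list)
for row in orders:
    grpd[(row[1], row[3])].append(row)

# sums: the compound key stays nested, as with "generate group, SUM(...)".
sums = {key: sum(r[2] for r in bag) for key, bag in grpd.items()}

# sums2: the key is unpacked into separate fields, as with group.$0, group.$1.
sums2 = sorted((fruit, typ, total) for (fruit, typ), total in sums.items())
print(sums2)
```

Sorting is used here only to make the printed order deterministic; Pig makes no such ordering promise for dump.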
group all puts every record of the relation into a single group
    grpd = group orders all;
    describe grpd;
    grpd: {group: chararray,orders: {(orderid: int,fruit: chararray,amount: int,type: chararray)}}
    dump grpd;
    (all,{(5,banana,10,y),(4,pear,20,y),(3,banana,30,y),(2,apple,50,x),(1,apple,30,x)})
cogroup: grouping several relations at once
    A = LOAD 'data1' AS (owner:chararray,pet:chararray);
    DUMP A;
    (Alice,turtle)
    (Alice,goldfish)
    (Alice,cat)
    (Bob,dog)
    (Bob,cat)

    B = LOAD 'data2' AS (friend1:chararray,friend2:chararray);
    DUMP B;
    (Cindy,Alice)
    (Mark,Alice)
    (Paul,Bob)
    (Paul,Jane)

    X = COGROUP A BY owner, B BY friend2;
    DESCRIBE X;
    X: {group: chararray,A: {owner: chararray,pet: chararray},B: {friend1: chararray,friend2: chararray}}
    DUMP X;
    (Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
    (Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
    (Jane,{},{(Paul,Jane)})
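The COGROUP result above, including the empty bag for Jane, can be reproduced with a small Python model. The function cogroup below is a hypothetical helper written for illustration; it is not part of Pig and makes no attempt to mirror how Pig executes the operation.

```python
A = [("Alice", "turtle"), ("Alice", "goldfish"), ("Alice", "cat"),
     ("Bob", "dog"), ("Bob", "cat")]
B = [("Cindy", "Alice"), ("Mark", "Alice"), ("Paul", "Bob"), ("Paul", "Jane")]

def cogroup(a, a_key, b, b_key):
    """Hypothetical helper: map each key seen in either input to (bag from a, bag from b)."""
    keys = {a_key(r) for r in a} | {b_key(r) for r in b}
    return {k: ([r for r in a if a_key(r) == k],
                [r for r in b if b_key(r) == k])
            for k in keys}

# Equivalent in spirit to: X = COGROUP A BY owner, B BY friend2;
X = cogroup(A, lambda r: r[0], B, lambda r: r[1])
for key in sorted(X):
    print(key, X[key])
```

A key that appears in only one input still gets an entry, with an empty bag for the other input, which is what distinguishes this from an inner join.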
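PARTITION BY and PARALLEL n
PARTITION BY names a custom Partitioner class that decides which reducer receives each key, and PARALLEL n sets the number of reduce tasks.
    A = LOAD 'input_data';
    B = GROUP A BY $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner PARALLEL 2;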
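SimpleCustomPartitioner:
    package org.apache.pig.test.utils;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.pig.impl.io.PigNullableWritable;

    public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
        //@Override
        public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
            if (key.getValueAsPigType() instanceof Integer) {
                // Integer keys: partition index is the key value modulo the partition count.
                int ret = (((Integer) key.getValueAsPigType()).intValue() % numPartitions);
                return ret;
            } else {
                // Any other key type: fall back to the hash code modulo the partition count.
                return (key.hashCode()) % numPartitions;
            }
        }
    }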
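One detail of the partitioner above is worth checking by hand: Java's % operator keeps the sign of the dividend, so a negative key or hash code gives a negative index. The Python sketch below models this arithmetic only. The function names are hypothetical, java_rem imitates Java's remainder because Python's own % behaves differently, and the guarded variant uses the sign-bit mask that Hadoop's HashPartitioner applies; it is an alternative, not what SimpleCustomPartitioner does.

```python
import math

def java_rem(a: int, n: int) -> int:
    """Imitate Java's % for ints: the result takes the sign of a."""
    return int(math.fmod(a, n))

def simple_partition(key: int, num_partitions: int) -> int:
    """Model of the integer branch of SimpleCustomPartitioner."""
    return java_rem(key, num_partitions)

def guarded_partition(key: int, num_partitions: int) -> int:
    """Alternative: clear the sign bit first, in the style of Hadoop's HashPartitioner."""
    return (key & 0x7FFFFFFF) % num_partitions

print(simple_partition(7, 2))    # non-negative key
print(simple_partition(-3, 2))   # negative key gives a negative index
print(guarded_partition(-3, 2))  # guarded variant stays in range
```

With only non-negative integer keys the two variants agree, so the simple version is adequate for test data of that kind.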
Handling NULL values
NULL is a special grouping key: all tuples whose key is null are collected into the same group.
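As a rough illustration of this rule, the Python sketch below uses None to stand in for a null key. The rows are made-up sample data, not taken from orders.data, and the code models only the behaviour of group described here.

```python
from collections import defaultdict

# Hypothetical sample rows (orderid, fruit); None models a null fruit.
rows = [(1, "apple"), (2, None), (3, None), (4, "pear")]

groups = defaultdict(list)
for row in rows:
    # None is an ordinary dictionary key, so all null-keyed rows share one bag.
    groups[row[1]].append(row)

for key, bag in groups.items():
    print((key, bag))
```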
Time: 2024-12-06 18:56:35