MapReduce基础入门（二） / 憋错料

1.Combiner合并

2.序列化

3.Sort排序

4.Partitioner分区

5.综合案例

1.Combiner合并

1）什么是Combiner

Combiner 是 MapReduce 程序中 Mapper 和 Reducer 之外的一种组件，它的作用是在 maptask 之后给 maptask 的结果进行局部汇总，以减轻 reducetask 的计算负载，减少网络传输

2）如何使用combiner

Combiner 和 Reducer 一样，编写一个类，然后继承 Reducer， reduce 方法中写具体的 Combiner 逻辑，然后在 job 中设置 Combiner 类： job.setCombinerClass(FlowSumCombine.class)

3）使用combiner注意事项

（1）Combiner 和 Reducer 的区别在于运行的位置：

Combiner 是在每一个 maptask 所在的节点运行
Reducer 是接收全局所有 Mapper 的输出结果

（2）Combiner 的输出 kv 应该跟 reducer 的输入 kv 类型要对应起来

（3）Combiner 的使用要非常谨慎，因为 Combiner 在 MapReduce 过程中可能调用也可能不调用，可能调一次也可能调多次，所以： Combiner 使用的原则是：有或没有都不能影响业务逻辑，都不能影响最终结果(求平均值时，combiner和reduce逻辑不一样)

2.序列化

1）概述

Java 的序列化是一个重量级序列化框架（ Serializable），一个对象被序列化后，会附带很多额外的信息（各种校验信息， header，继承体系等），不便于在网络中高效传输；所以， hadoop 自己开发了一套序列化机制（ Writable），精简，高效，Hadoop 中的序列化框架已经对基本类型和 null 提供了序列化的实现了，如下

byte	ByteWritable
short	ShortWritable
int	IntWritable
long	LongWritable
float	FloatWritable
double	DoubleWritable
String	Text
null	NullWritable

2)Java序列化

Java序列号只需要实现Serializable接口即可如

  1 public class Student implements Serializable{
  2      public Student(){
  3     }
  4     private int id;
  5 }

3）自定义对象实现mapreduce框架序列化

如果需要将自定义的 bean 放在 key 中传输，则还需要实现 Comparable 接口，因为 mapreduce框中的 shuffle 过程一定会对 key 进行排序，此时，自定义的 bean 实现的接口应该是：

  1 public class FlowBean implements WritableComparable<FlowBean> {
  2
  3 	private String phoneNbr;
  4 	private long up_flow;
  5 	private long d_flow;
  6 	private long sum_flow;
  7
  8 	public void set(String phoneNbr, long up_flow, long d_flow) {
  9 		this.phoneNbr = phoneNbr;
 10 		this.up_flow = up_flow;
 11 		this.d_flow = d_flow;
 12 		this.sum_flow = up_flow + d_flow;
 13 	}
 14
 15 	/**
 16 	 * 序列化，将数据字段以字节流写出去
 17 	 */
 18 	public void write(DataOutput out) throws IOException {
 19 		out.writeUTF(this.phoneNbr);
 20 		out.writeLong(this.up_flow);
 21 		out.writeLong(this.d_flow);
 22 		out.writeLong(this.sum_flow);
 23 	}
 24
 25 	/**
 26 	 * 反序列化，从字节流中读出各个数据字段 读出的顺序应该跟序列化时写入的顺序保持一致
 27 	 */
 28 	public void readFields(DataInput in) throws IOException {
 29 		this.phoneNbr = in.readUTF();
 30 		this.up_flow = in.readLong();
 31 		this.d_flow = in.readLong();
 32 		this.sum_flow = in.readLong();
 33 	}
 34
 35 	public int compareTo(FlowBean o) {
 36 		return this.sum_flow > o.getSum_flow() ? -1 : 1;
 37 	}
 38
 39 	@Override
 40 	public String toString() {
 41 		return phoneNbr+"\t"+up_flow + "\t" + d_flow + "\t" + sum_flow;
 42 	}
 43
 44 	public String getPhoneNbr() {
 45 		return phoneNbr;
 46 	}
 47
 48 	public void setPhoneNbr(String phoneNbr) {
 49 		this.phoneNbr = phoneNbr;
 50 	}
 51
 52 	public long getUp_flow() {
 53 		return up_flow;
 54 	}
 55
 56 	public void setUp_flow(long up_flow) {
 57 		this.up_flow = up_flow;
 58 	}
 59
 60 	public long getD_flow() {
 61 		return d_flow;
 62 	}
 63
 64 	public void setD_flow(long d_flow) {
 65 		this.d_flow = d_flow;
 66 	}
 67
 68 	public long getSum_flow() {
 69 		return sum_flow;
 70 	}
 71
 72 	public void setSum_flow(long sum_flow) {
 73 		this.sum_flow = sum_flow;
 74 	}
 75
 76 }

3.Sort排序

在MapReduce执行Spill溢出过程中，会对结果进行排序，排序的方法是，根据key进行排序，如果我们把自己的FlowBean当作key，并且实现了WritableComparable，只需要重写compareTo方法即可，如，按照sun_flow倒叙排序，重新的compareTo方法如下：

  1 public int compareTo(FlowBean o) {
  2 		return this.sum_flow > o.getSum_flow() ? -1 : 1;
  3 	}

4.Partitioner分区

在MapReduce分区的过程中，Reduce的个数即分区的个数，这个时候只需要设置job.setNumReduceTasks(5)，就可以设置Reduce的个数了，但这个分区是系统随便分区的，要想实现自己的分区，则还需要继承extends Partitioner<KEY, VALUE>类，并且重写getPartition即可

  1 public int getPartition(KEY key, VALUE value, int numPartitions)

5.综合案例

测试数据：

部分数据如下：

1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200

1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200

需求分析：

（1）统计每一个用户（手机号）所耗费的总上行流量、下行流量、总流量

（2）得出上体的结果在加一个需求：将统计结果按照总流量倒叙排序

（3）将流量汇总结果按照手机归属地输出到不同的文件中

实现如下：

FLowBean.java

  1 import java.io.DataInput;
  2 import java.io.DataOutput;
  3 import java.io.IOException;
  4
  5 import org.apache.hadoop.io.WritableComparable;
  6
  7 public class FlowBean implements WritableComparable<FlowBean> {
  8
  9 	private String phoneNbr;
 10 	private long up_flow;
 11 	private long d_flow;
 12 	private long sum_flow;
 13
 14 	public void set(String phoneNbr, long up_flow, long d_flow) {
 15 		this.phoneNbr = phoneNbr;
 16 		this.up_flow = up_flow;
 17 		this.d_flow = d_flow;
 18 		this.sum_flow = up_flow + d_flow;
 19 	}
 20
 21 	/**
 22 	 * 序列化，将数据字段以字节流写出去
 23 	 */
 24 	public void write(DataOutput out) throws IOException {
 25 		out.writeUTF(this.phoneNbr);
 26 		out.writeLong(this.up_flow);
 27 		out.writeLong(this.d_flow);
 28 		out.writeLong(this.sum_flow);
 29 	}
 30
 31 	/**
 32 	 * 反序列化，从字节流中读出各个数据字段 读出的顺序应该跟序列化时写入的顺序保持一致
 33 	 */
 34 	public void readFields(DataInput in) throws IOException {
 35 		this.phoneNbr = in.readUTF();
 36 		this.up_flow = in.readLong();
 37 		this.d_flow = in.readLong();
 38 		this.sum_flow = in.readLong();
 39 	}
 40
 41 	public int compareTo(FlowBean o) {
 42 		return this.sum_flow > o.getSum_flow() ? -1 : 1;
 43 	}
 44
 45 	@Override
 46 	public String toString() {
 47 		return phoneNbr+"\t"+up_flow + "\t" + d_flow + "\t" + sum_flow;
 48 	}
 49
 50 	public String getPhoneNbr() {
 51 		return phoneNbr;
 52 	}
 53
 54 	public void setPhoneNbr(String phoneNbr) {
 55 		this.phoneNbr = phoneNbr;
 56 	}
 57
 58 	public long getUp_flow() {
 59 		return up_flow;
 60 	}
 61
 62 	public void setUp_flow(long up_flow) {
 63 		this.up_flow = up_flow;
 64 	}
 65
 66 	public long getD_flow() {
 67 		return d_flow;
 68 	}
 69
 70 	public void setD_flow(long d_flow) {
 71 		this.d_flow = d_flow;
 72 	}
 73
 74 	public long getSum_flow() {
 75 		return sum_flow;
 76 	}
 77
 78 	public void setSum_flow(long sum_flow) {
 79 		this.sum_flow = sum_flow;
 80 	}
 81
 82 }

PhonePartitioner.java

  1 import java.util.HashMap;
  2 import org.apache.hadoop.mapreduce.Partitioner;
  3
  4 public class PhonePartitioner<KEY, VALUE> extends Partitioner<KEY, VALUE>{
  5
  6 	private static HashMap<String, Integer> areaMap =  new HashMap<>();
  7
  8 	static {
  9 		areaMap.put("136", 0);
 10 		areaMap.put("137", 1);
 11 		areaMap.put("138", 2);
 12 		areaMap.put("139", 3);
 13 	}
 14
 15 	@Override
 16 	public int getPartition(KEY key, VALUE value, int numPartitions) {
 17 		Integer provinceCode = areaMap.get(key.toString().substring(0,3));
 18 		return provinceCode==null?4:provinceCode;
 19 	}
 20 }

FlowCount.java

  1 import java.io.IOException;
  2 import java.net.URI;
  3 import org.apache.commons.lang.StringUtils;
  4 import org.apache.hadoop.conf.Configuration;
  5 import org.apache.hadoop.fs.FileSystem;
  6 import org.apache.hadoop.fs.Path;
  7 import org.apache.hadoop.io.LongWritable;
  8 import org.apache.hadoop.io.Text;
  9 import org.apache.hadoop.mapreduce.Job;
 10 import org.apache.hadoop.mapreduce.Mapper;
 11 import org.apache.hadoop.mapreduce.Reducer;
 12 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 13 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 14 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 15 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 16
 17
 18 //统计手机号总流，分区并且统计流量总数
 19 public class FlowCount {
 20
 21 	public static class FlowCountMapper extends
 22 			Mapper<LongWritable, Text, Text, FlowBean> {
 23 		private FlowBean flowBean = new FlowBean();
 24
 25 		@Override
 26 		protected void map(LongWritable key, Text value, Context context)
 27 				throws IOException, InterruptedException {
 28
 29 			String line = value.toString();
 30 			if (StringUtils.isBlank(line)) {
 31 				return;
 32 			}
 33
 34 			// 切分字段
 35 			String[] fields = line.split("\\s+");
 36
 37 			// 拿到我们需要的若干个字段
 38 			String phoneNbr = fields[1];
 39 			long up_flow = Long.parseLong(fields[fields.length - 3]);
 40 			long d_flow = Long.parseLong(fields[fields.length - 2]);
 41 			// 将数据封装到一个flowbean中
 42 			flowBean.set(phoneNbr, up_flow, d_flow);
 43 			context.write(new Text(phoneNbr), flowBean);
 44 		}
 45 	}
 46
 47 	public static class FlowCountReducer extends
 48 			Reducer<Text, FlowBean, Text, FlowBean> {
 49 		private FlowBean flowBean = new FlowBean();
 50
 51 		@Override
 52 		protected void reduce(Text key, Iterable<FlowBean> values,
 53 				Context context) throws IOException, InterruptedException {
 54 			long up_flow_sum = 0;
 55 			long d_flow_sum = 0;
 56 			for (FlowBean flowBean : values) {
 57 				up_flow_sum += flowBean.getUp_flow();
 58 				d_flow_sum += flowBean.getD_flow();
 59 			}
 60 			flowBean.set(key.toString(), up_flow_sum, d_flow_sum);
 61 			context.write(key, flowBean);
 62 		}
 63 	}
 64
 65 	final static String INPUT_PATH = "hdfs://10.202.62.66:8020/wmapreduce/phoneflow/input/phonedata.txt";
 66 	final static String OUT_PATH = "hdfs://10.202.62.66:8020/wmapreduce/phoneflow/output";
 67
 68 	public static void main(String[] args) throws Exception {
 69 		Configuration conf = new Configuration();
 70
 71 		final FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
 72 		if (fileSystem.exists(new Path(OUT_PATH))) {
 73 			fileSystem.delete(new Path(OUT_PATH), true);
 74 		}
 75
 76 		Job job = Job.getInstance(conf, "flowjob");
 77 		job.setJarByClass(FlowCount.class);
 78
 79 		job.setMapperClass(FlowCountMapper.class);
 80 		job.setReducerClass(FlowCountReducer.class);
 81
 82 		job.setPartitionerClass(PhonePartitioner.class);
 83 		job.setNumReduceTasks(5);
 84
 85 		job.setMapOutputKeyClass(Text.class);
 86 		job.setMapOutputValueClass(FlowBean.class);
 87
 88 		job.setOutputKeyClass(Text.class);
 89 		job.setOutputValueClass(FlowBean.class);
 90
 91 		job.setInputFormatClass(TextInputFormat.class);
 92 		job.setOutputFormatClass(TextOutputFormat.class);
 93
 94 		FileInputFormat.setInputPaths(job, new Path(INPUT_PATH));
 95 		FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
 96
 97 		job.waitForCompletion(true);
 98 	}
 99 }

FlowCountSort.java

  1 import java.io.IOException;
  2 import java.net.URI;
  3 import org.apache.commons.lang.StringUtils;
  4 import org.apache.hadoop.conf.Configuration;
  5 import org.apache.hadoop.fs.FileSystem;
  6 import org.apache.hadoop.fs.Path;
  7 import org.apache.hadoop.io.LongWritable;
  8 import org.apache.hadoop.io.NullWritable;
  9 import org.apache.hadoop.io.Text;
 10 import org.apache.hadoop.mapreduce.Job;
 11 import org.apache.hadoop.mapreduce.Mapper;
 12 import org.apache.hadoop.mapreduce.Reducer;
 13 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 14 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 15 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 16 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 17
 18
 19 //按照总流量数从大到小排序
 20 public class FlowCountSort {
 21
 22 	public static class FlowCountSortMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable>{
 23 		private FlowBean bean = new FlowBean();
 24 		@Override
 25 		protected void map(LongWritable key, Text value,Context context)throws IOException, InterruptedException {
 26 			String line = value.toString();
 27
 28 			if (StringUtils.isBlank(line)) {
 29 				return;
 30 			}
 31
 32 			String[] fields = StringUtils.split(line, "\t");
 33 			String phoneNbr = fields[1];
 34 			long up_flow = Long.parseLong(fields[fields.length - 3]);
 35 			long d_flow = Long.parseLong(fields[fields.length - 2]);
 36 			bean.set(phoneNbr, up_flow, d_flow);
 37
 38 			context.write(bean, NullWritable.get());
 39 		}
 40 	}
 41
 42 	public static class FlowCountSortReducer extends Reducer<FlowBean, NullWritable, Text, FlowBean>{
 43 		@Override
 44 		protected void reduce(FlowBean bean, Iterable<NullWritable> values,Context context) throws IOException, InterruptedException {
 45 			context.write(new Text(bean.getPhoneNbr()), bean);
 46 		}
 47 	}
 48
 49 	final static String INPUT_PATH = "hdfs://10.202.62.66:8020/wmapreduce/phoneflow/input/phonedata.txt";
 50 	final static String OUT_PATH = "hdfs://10.202.62.66:8020/wmapreduce/phoneflow/output";
 51
 52
 53 	public static void main(String[] args) throws Exception {
 54
 55 		Configuration conf = new Configuration();
 56
 57 		final FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
 58 		if (fileSystem.exists(new Path(OUT_PATH))) {
 59 			fileSystem.delete(new Path(OUT_PATH), true);
 60 		}
 61
 62 		Job job = Job.getInstance(conf,"flowjob");
 63 		job.setJarByClass(FlowCountSort.class);
 64
 65 		job.setMapperClass(FlowCountSortMapper.class);
 66 		job.setReducerClass(FlowCountSortReducer.class);
 67
 68 		job.setMapOutputKeyClass(FlowBean.class);
 69 		job.setMapOutputValueClass(NullWritable.class);
 70
 71 		job.setOutputKeyClass(Text.class);
 72 		job.setOutputValueClass(FlowBean.class);
 73
 74 		job.setInputFormatClass(TextInputFormat.class);
 75 		job.setOutputFormatClass(TextOutputFormat.class);
 76
 77 		FileInputFormat.setInputPaths(job, new Path(INPUT_PATH));
 78 		FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
 79
 80 		job.waitForCompletion(true);
 81 	}
 82 }

原文地址：https://www.cnblogs.com/mingyueguli/p/10492137.html

时间： 2024-08-02 06:29:58

MapReduce基础入门（二）

MapReduce基础入门（二）的相关文章

C#基础入门二

MapReduce(一) mapreduce基础入门

Python基础入门 (二)

前端基础入门二（CSS）

Spring MVC 基础入门二

React.js 基础入门二--组件嵌套

C#学习笔记---基础入门(二)

Maven基础入门(二)

[Spring框架]Spring AOP基础入门总结一.