Hadoop Programming Tips (8): Unit Testing

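Required environment:

Hadoop jars (the official release download is sufficient);

JUnit jar (preferably the latest JUnit 4 release, since the test below uses the org.junit.Before and org.junit.Test annotations);

Mockito jar;

MRUnit jar;

powermock-mockito jar.

The jars listed above can be downloaded from: http://download.csdn.net/detail/fansy1990/7690977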

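Use case:

When writing ordinary MapReduce programs for Hadoop, you often need to verify the business logic or the data flow. This setup serves that purpose: it does not require a real Hadoop cluster and only validates the algorithm or code logic, which makes the code easier to debug.

Example:

Mapper: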

package fz.mrtest;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SMSCDRMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final Text status = new Text();
  private static final IntWritable addOne = new IntWritable(1);

  /**
   * Emits (status code, 1) for every SMS CDR record.
   */
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws java.io.IOException, InterruptedException {

    // Sample record format: 655209;1;796764372490213;804422938115889;6
    String[] line = value.toString().split(";");
    // A record is an SMS CDR when its second field is 1
    if (Integer.parseInt(line[1]) == 1) {
      // The fifth field is the status code
      status.set(line[4]);
      context.write(status, addOne);
    }
  }
}
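
The mapper above assumes that every line has at least five semicolon-separated fields and a numeric second field; a malformed line can make it throw. As a minimal sketch, the parsing step can be pulled into a plain helper that is testable without any Hadoop jars. The class SmsCdrParser and the method extractStatus are hypothetical names introduced for illustration only; they are not part of the original code or of MRUnit.

```java
import java.util.Optional;

/** Hypothetical helper (not in the original post) mirroring the mapper's parsing step. */
final class SmsCdrParser {

    private SmsCdrParser() {}

    /**
     * Returns the status code (fifth field) when the record is an SMS CDR
     * (second field equals 1); returns an empty Optional for other record
     * types and for malformed lines.
     */
    static Optional<String> extractStatus(String record) {
        String[] fields = record.split(";");
        if (fields.length < 5) {
            // Too few fields: indexing as the mapper does could throw here.
            return Optional.empty();
        }
        try {
            return Integer.parseInt(fields[1]) == 1
                    ? Optional.of(fields[4])
                    : Optional.empty();
        } catch (NumberFormatException e) {
            // Non-numeric record type field.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // SMS record: prints Optional[6]
        System.out.println(extractStatus("655209;1;796764372490213;804422938115889;6"));
        // Non-SMS record: prints Optional.empty
        System.out.println(extractStatus("655209;0;796764372490213;804422938115889;6"));
        // Malformed line: prints Optional.empty
        System.out.println(extractStatus("garbage"));
    }
}
```

Whether malformed lines should be skipped silently, counted, or allowed to fail the job is a design decision for the real mapper; this sketch simply skips them.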

Reducer:

package fz.mrtest;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SMSCDRReducer extends
  Reducer<Text, IntWritable, Text, IntWritable> {

  /**
   * Sums the counts emitted for one status code.
   */
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws java.io.IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

Test class:

package fz.mrtest;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class SMSCDRMapperReducerTest {

  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
  ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
  MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;

  @Before
  public void setUp() {
    SMSCDRMapper mapper = new SMSCDRMapper();
    SMSCDRReducer reducer = new SMSCDRReducer();
    mapDriver = MapDriver.newMapDriver(mapper);
    reduceDriver = ReduceDriver.newReduceDriver(reducer);
    mapReduceDriver = MapReduceDriver.newMapReduceDriver(mapper, reducer);
  }

  @Test
  public void testMapper() throws IOException {
    mapDriver.withInput(new LongWritable(), new Text(
        "655209;1;796764372490213;804422938115889;6"));
    mapDriver.withOutput(new Text("6"), new IntWritable(1));
    mapDriver.runTest();
  }

  @Test
  public void testReducer() throws IOException {
    List<IntWritable> values = new ArrayList<IntWritable>();
    values.add(new IntWritable(1));
    values.add(new IntWritable(1));
    reduceDriver.withInput(new Text("6"), values);
    reduceDriver.withOutput(new Text("6"), new IntWritable(2));
    reduceDriver.runTest();
  }

  @Test
  public void testMR() throws IOException {
    mapReduceDriver.withInput(new LongWritable(), new Text(
        "655209;1;796764372490213;804422938115889;6"));
    mapReduceDriver.withInput(new LongWritable(), new Text(
        "6552092;1;796764372490213;804422938115889;6"));
    mapReduceDriver.withOutput(new Text("6"), new IntWritable(2));
    mapReduceDriver.runTest();
  }
}

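(The code comes from the official MRUnit site; the test class adds one more test that covers the whole job.) The test class has three test methods, which test the Mapper alone, the Reducer alone, and the Mapper and Reducer together.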
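
To see why the combined test expects the single output ("6", 2) from two input records, it can help to trace the flow by hand. The following is a rough, dependency-free sketch of the map, group-by-key, and reduce steps using plain Java collections. It is not how Hadoop or MRUnit is implemented internally, and the class name SmsCdrFlowSketch and its run method are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical walk-through of the data flow exercised by testMR. */
final class SmsCdrFlowSketch {

    private SmsCdrFlowSketch() {}

    /** Maps, groups by key, then sums; returns status code -> count, sorted by key. */
    static Map<String, Integer> run(List<String> records) {
        // Map phase plus grouping: emit (status, 1) for SMS records
        // and collect the emitted values under their key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            String[] fields = record.split(";");
            if (Integer.parseInt(fields[1]) == 1) {
                grouped.computeIfAbsent(fields[4], k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce phase: sum the grouped values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "655209;1;796764372490213;804422938115889;6",
                "6552092;1;796764372490213;804422938115889;6");
        System.out.println(run(records)); // prints {6=2}
    }
}
```

Both records are SMS CDRs with status code 6, so the map step emits ("6", 1) twice, the grouping step collects the two values under the key "6", and the reduce step sums them to 2.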

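Summary: Hadoop unit tests make it easy to verify that a program is correct without running it in a real environment, which makes efficient development possible. Some special cases still have to be tested in a real environment and need separate consideration, but in general this unit-testing setup works for the MapReduce programs you write.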

Share, grow, enjoy.

When reposting, please credit the blog: http://blog.csdn.net/fansy1990

