Hadoop Programming Tips (6) --- Handling Large Numbers of Small Data Files with CombineFileInputFormat

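Test environment: Hadoop 2.4

Use case: when a job has to process many small data files, this technique lets it process them efficiently.

How it works: CombineFileInputFormat merges multiple small files when the input splits are computed. Each split spawns one Mapper, and a Mapper that handles only a small amount of data is inefficient, because its startup overhead dominates the work it actually does. With Hadoop's default input handling, every input file produces at least one split of its own (a file smaller than one block becomes exactly one split), so many small input files lead to many short-lived Mappers.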
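The effect on the number of splits can be illustrated with a small, Hadoop-free model. This is only a sketch under simplifying assumptions: the class name SplitPackingSketch, the file sizes, and the size cap are all made up for illustration, and the greedy packing below ignores the node and rack locality that the real CombineFileInputFormat also takes into account.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of split planning; NOT Hadoop's actual algorithm. */
public class SplitPackingSketch {

    /** Per-file planning: every small file becomes its own split. */
    public static int countPerFileSplits(long[] fileSizes) {
        return fileSizes.length;
    }

    /**
     * Greedily packs whole files into splits of at most maxSplitSize bytes.
     * A file larger than the cap simply gets a split of its own in this model.
     */
    public static List<List<Long>> packIntoSplits(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            // Close the current split when adding this file would exceed the cap.
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] smallFiles = {4_000, 6_000};      // two small files, sizes in bytes (made up)
        long maxSplitSize = 128L * 1024 * 1024;  // hypothetical cap on a combined split
        System.out.println("per-file splits: " + countPerFileSplits(smallFiles));
        System.out.println("combined splits: " + packIntoSplits(smallFiles, maxSplitSize).size());
    }
}
```

With two small files, the per-file plan yields two splits (and so two Mappers), while the combined plan yields one.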

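Example:

This builds on the previous post, "Hadoop Programming Tips (5) --- Custom Input Format Class InputFormat", but this time the input consists of two files, both small.

Custom input format, CustomCombineFileInputFormat: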

package fz.combineinputformat;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
/**
 * Custom input format: defines the record reader used for combined splits.
 * @author fansy
 *
 */
public class CustomCombineFileInputFormat extends CombineFileInputFormat<Text, Text> {

	@Override
	public RecordReader<Text, Text> createRecordReader(InputSplit split,
			TaskAttemptContext context) throws IOException {
		// CombineFileRecordReader creates one CustomCombineReader per file in the combined split.
		return new CombineFileRecordReader<Text, Text>((CombineFileSplit) split, context, CustomCombineReader.class);
	}

}

Custom record reader, CustomCombineReader:

package fz.combineinputformat;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
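/**
 * Record reader for one file of a combined split.
 * Only the initialization function differs from CustomReader.
 * @author fansy
 *
 */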
public class CustomCombineReader extends RecordReader<Text, Text> {

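	private int index;        // position of this reader's file within the combined split
	private CustomReader in;  // per-file reader from the previous post

	// CombineFileRecordReader instantiates this class reflectively and requires exactly this constructor signature.
	public CustomCombineReader(CombineFileSplit split, TaskAttemptContext cxt, Integer index) {
		this.index = index;
		this.in = new CustomReader();
	}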
	@Override
	public void initialize(InputSplit split, TaskAttemptContext context)
			throws IOException, InterruptedException {
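		// The framework passes in the whole CombineFileSplit; carve out the file at this reader's index.
		CombineFileSplit cfsplit = (CombineFileSplit) split;
		// Use the per-file length getLength(index); getLength() is the total length of all files in the split.
		FileSplit fileSplit = new FileSplit(cfsplit.getPath(index), cfsplit.getOffset(index),
				cfsplit.getLength(index), cfsplit.getLocations());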
		in.initialize(fileSplit, context);
	}

	@Override
	public boolean nextKeyValue() throws IOException, InterruptedException {
		return in.nextKeyValue();
	}

	@Override
	public Text getCurrentKey() throws IOException, InterruptedException {
		return in.getCurrentKey();
	}

	@Override
	public Text getCurrentValue() throws IOException, InterruptedException {
		return in.getCurrentValue();
	}

	@Override
	public float getProgress() throws IOException, InterruptedException {
		return in.getProgress();
	}

	@Override
	public void close() throws IOException {
		in.close();
	}

}

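As you can see, this class reuses the CustomReader class from the previous post and only changes the initialization function, so that small files can be merged into a single split while each file is still read by its own CustomReader. For CustomReader itself, see the previous post: Hadoop Programming Tips (5) --- Custom Input Format Class InputFormat.

Main class: only one line needs to change (again, see the previous post):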
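CombineFileSplit exposes both a total length, getLength(), and a per-file length, getLength(int), and a reader that handles a single file needs the per-file value. The difference can be shown with a toy, Hadoop-free stand-in. The class CombinedSplitModel, its method names, and the file names and sizes below are hypothetical and exist only for this illustration.

```java
/** Toy stand-in for a combined split; names are illustrative, not Hadoop's API. */
public class CombinedSplitModel {
    private final String[] paths;
    private final long[] offsets;
    private final long[] lengths;

    public CombinedSplitModel(String[] paths, long[] offsets, long[] lengths) {
        this.paths = paths;
        this.offsets = offsets;
        this.lengths = lengths;
    }

    /** Total bytes across all files packed into the split. */
    public long totalLength() {
        long total = 0;
        for (long length : lengths) {
            total += length;
        }
        return total;
    }

    /** Bytes that belong only to the file at the given index. */
    public long lengthOf(int index) {
        return lengths[index];
    }

    /** Half-open byte range a per-file reader should cover for the given index. */
    public String describeSlice(int index) {
        long start = offsets[index];
        return paths[index] + " [" + start + ", " + (start + lengths[index]) + ")";
    }

    public static void main(String[] args) {
        CombinedSplitModel split = new CombinedSplitModel(
                new String[] {"a.txt", "b.txt"},   // made-up file names
                new long[] {0, 0},                 // each small file starts at offset 0
                new long[] {4_000, 6_000});        // made-up sizes in bytes
        System.out.println("total length: " + split.totalLength());
        for (int i = 0; i < 2; i++) {
            System.out.println("reader " + i + " covers " + split.describeSlice(i));
        }
    }
}
```

In this model, a reader for a.txt that was given the total length would be told to read 10000 bytes from a 4000-byte file, which is why the slice for each index is built from that file's own length.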

job.setInputFormatClass(CustomCombineFileInputFormat.class);

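I ran two experiments: the first read the input with CombineFileInputFormat, the second with TextInputFormat.

Results:

First, the difference is visible in the terminal output:

With the same two input files, job 096 has a single split, while job 097 has two;

The job monitoring UI shows the same change in the number of Mappers:

Summary: CombineFileInputFormat is very useful in practice and gives a substantial efficiency gain when processing large numbers of small files. That said, in big-data applications the input files are usually large, so this technique only addresses a special case.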

Share, grow, enjoy

When reposting, please credit the blog: http://blog.csdn.net/fansy1990

Time: 2025-01-14 09:02:39
