Combine small files into a sequence file

Combining small files into a sequence file or an Avro file is a good way to feed data to Hadoop.

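Small files are expensive in Hadoop because the NameNode keeps the metadata for every file and block in memory, so a large number of them consumes a lot of NameNode memory.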
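To get a feel for the memory cost, here is a rough back-of-the-envelope sketch. The figure of about 150 bytes of NameNode heap per namespace object (file or block) is a commonly quoted rule of thumb and is only an assumption here, as are the file count, the 10 KB file size and the 128 MB block size. The class name is made up for this example.

```java
class NameNodeMemoryEstimate {
    // Assumption: rule-of-thumb heap cost per namespace object, not a measured value.
    static final long BYTES_PER_OBJECT = 150L;

    // Each file costs one file object plus one object per block.
    static long estimateBytes(long files, long blocksPerFile) {
        return files * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    // Number of blocks needed to hold totalBytes, rounded up.
    static long blocksFor(long totalBytes, long blockSize) {
        return (totalBytes + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long fileCount = 10_000_000L;          // assumed number of small files
        long fileSize = 10L * 1024;            // assumed 10 KB per file
        long blockSize = 128L * 1024 * 1024;   // assumed 128 MB HDFS block size

        // Case 1: every small file is stored on its own (one block each).
        long separate = estimateBytes(fileCount, 1);

        // Case 2: the same data packed into a single large file.
        long combined = estimateBytes(1, blocksFor(fileCount * fileSize, blockSize));

        System.out.println("separate files: " + separate + " bytes");
        System.out.println("one combined file: " + combined + " bytes");
    }
}
```

Under these assumptions the separate files need several gigabytes of NameNode heap, while the combined file needs only a few hundred kilobytes.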

A SequenceFile is a file format that stores key/value pairs; MapReduce jobs read it through SequenceFileInputFormat.

The key and value types can implement their own serialization and deserialization logic.
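As a minimal sketch of what such a type looks like: Hadoop's Writable interface consists of the two methods write(DataOutput) and readFields(DataInput). The FileRecord class below is a made-up example with the same two methods. So that it runs without Hadoop on the classpath, it does not declare "implements Writable"; in a real project it would.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical value type; in a Hadoop project it would declare
// "implements org.apache.hadoop.io.Writable".
class FileRecord {
    String name = "";
    long length;

    FileRecord() {}  // Writables need a no-arg constructor so they can be instantiated and then filled

    FileRecord(String name, long length) {
        this.name = name;
        this.length = length;
    }

    // Same signature as Writable.write: serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeLong(length);
    }

    // Same signature as Writable.readFields: read the fields back in the same order.
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        length = in.readLong();
    }
}

class WritableSketch {
    static byte[] serialize(FileRecord r) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        r.write(out);
        out.flush();
        return buf.toByteArray();
    }

    static FileRecord deserialize(byte[] bytes) throws IOException {
        FileRecord r = new FileRecord();
        r.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        return r;
    }

    public static void main(String[] args) throws IOException {
        // Round trip: the copy carries the same field values as the original.
        FileRecord copy = deserialize(serialize(new FileRecord("a.txt", 42L)));
        System.out.println(copy.name + " " + copy.length);  // prints "a.txt 42"
    }
}
```

The important part is that readFields reads the fields in exactly the order write wrote them.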

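The code below is for reference only. In a real project, adjust it as needed to use resources more efficiently and to meet the project's requirements.

Sample code: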

package myexamples;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class localf2seqfile {
	/*
	 * A local folder holds many small text files that we want to process
	 * with MapReduce, so we pack them into a single sequence file.
	 * key:   source file name
	 * value: file content
	 */
	// Writes every (file name, content) pair to one sequence file at hdfspath.
	static void write2Seqfile(FileSystem fs,Path hdfspath,HashMap<Text,Text> hm)
	{
	    SequenceFile.Writer writer=null;

	    try {
	      writer=SequenceFile.createWriter(fs, fs.getConf(), hdfspath, Text.class, Text.class);

	      for(Map.Entry<Text,Text> entry:hm.entrySet())
	        writer.append(entry.getKey(),entry.getValue());

	    } catch (IOException e) {
	    	e.printStackTrace();
	    }finally{
	    	// createWriter may have thrown, leaving writer null
	    	if(writer!=null){
	    		try{writer.close();}catch(IOException ioe){ioe.printStackTrace();}
	    	}
	    }
	}
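	// Reads every regular file directly under localpath into memory.
	// All contents are held at once, so this only suits small inputs.
	static HashMap<Text,Text> collectFiles(String localpath) throws IOException
	{
		HashMap<Text,Text> hm = new HashMap<Text,Text>();
		File f = new File(localpath);
		File[] files = f.listFiles(); // null if localpath is not a readable directory
		if(files==null) return hm;
		for(File file:files)
		{
			if(!file.isFile()) continue; // skip sub-directories
			// UTF-8 is an assumption about the input files; adjust as needed
			hm.put(new Text(file.getName()), new Text(FileUtils.readFileToString(file,"UTF-8")));
		}

		return hm;
	}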
	// Prints every key/value pair stored in the sequence file.
	static void readSeqFile(FileSystem fs, Path hdfspath) throws IOException
	{
		SequenceFile.Reader reader = new SequenceFile.Reader(fs,hdfspath,fs.getConf());
		try {
			Text key = new Text();
			Text value = new Text();
			while (reader.next(key, value)) {
				System.out.print(key +" ");
				System.out.println(value);
			}
		} finally {
			reader.close(); // close even if reading fails
		}
	}
	public static void main(String[] args) throws IOException {
		// Local source folder: first argument, or the demo path if none is given
		String localpath = args.length>0 ? args[0] : "/home/hadoop/test/sub";

		Configuration conf = new Configuration();
		// "fs.default.name" is the Hadoop 1.x key; newer releases name it "fs.defaultFS"
		conf.set("fs.default.name", "hdfs://namenode:9000");

		FileSystem fs = FileSystem.get(conf);
		System.out.println(fs.getUri());
		Path file = new Path("/user/hadoop/seqfiles/seqdemo.seq");
		if (fs.exists(file)) fs.delete(file,false);
		HashMap<Text,Text> hm = collectFiles(localpath);
		write2Seqfile(fs,file,hm);
		readSeqFile(fs,file);

		fs.close();
	}
}
Time: 2024-10-22 13:53:45
