Hadoop compression

File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.

1. What to compress?

1) Compressing input files
If the input files are compressed, fewer bytes are read from HDFS, which means less time spent reading data. This saving benefits the overall performance of job execution.

If the input files are compressed, they will be decompressed automatically as they are read by MapReduce, using the filename extension to determine which codec to use. For example, a file ending in .gz is identified as a gzip-compressed file and thus read with GzipCodec.
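As a minimal sketch of that extension-based lookup (the input path below is hypothetical), the same resolution MapReduce performs internally can be done by hand with CompressionCodecFactory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        Path input = new Path("/data/logs/part-00000.gz");  // hypothetical input file
        // The codec is chosen from the file extension, just as MapReduce does when reading input.
        CompressionCodec codec = factory.getCodec(input);    // GzipCodec for .gz, null for plain text
        System.out.println(codec == null ? "no codec (uncompressed)" : codec.getClass().getName());
    }
}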

2) Compressing output files
Often we need to store the output as history files. If the amount of output per day is large and the results must be kept for future use, the accumulated files will consume a large amount of HDFS space, even though they may be read only rarely. In such cases it makes sense to compress the output before storing it on HDFS.
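For example, here is a minimal sketch of enabling gzip output compression with the new MapReduce API (the helper method is only illustrative; call it on the Job before submission):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputCompression {
    // Reducers will then write gzip-compressed part files (.gz) to the output directory.
    static void enableGzipOutput(Job job) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}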

3) Compressing map output
Even if your MapReduce application reads and writes uncompressed data, it may benefit from compressing the intermediate output of the map phase. The map output is written to local disk and transferred across the network to the reducer nodes, so using a fast compressor such as LZO or Snappy reduces the volume of data to move and can yield a noticeable performance gain.
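A minimal sketch of turning this on, assuming Hadoop 2.x property names (older releases use mapred.compress.map.output and mapred.map.output.compression.codec) and that the Snappy native libraries are available on the cluster nodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompression {
    // Compresses only the intermediate map output; the final job output is unaffected.
    static void enableSnappyMapOutput(Configuration conf) {
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
    }
}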

2. Common compression formats

Compression format   Tool    Algorithm   File extension   Splittable
gzip                 gzip    DEFLATE     .gz              No
bzip2                bzip2   bzip2       .bz2             Yes
LZO                  lzop    LZO         .lzo             Yes, if indexed
Snappy               N/A     Snappy      .snappy          No

gzip: 
gzip is supported natively by Hadoop. It is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding.

bzip2: 
bzip2 is a freely available, patent-free, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), whilst being around twice as fast at compression and six times faster at decompression.

LZO: 
The LZO compression format is composed of many smaller (~256K) blocks of compressed data, allowing jobs to be split along block boundaries.  Moreover, it was designed with speed in mind: it decompresses about twice as fast as gzip, meaning it’s fast enough to keep up with hard drive read speeds.  It doesn’t compress quite as well as gzip — expect files that are on the order of 50% larger than their gzipped version.  But that is still 20-50% of the size of the files without any compression at all, which means that IO-bound jobs complete the map phase about four times faster.

Snappy: 
Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more. Snappy is widely used inside Google, in everything from BigTable and MapReduce to Google's internal RPC systems.

Some tradeoffs:
All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings. The tools listed in the table above typically give some control over this trade-off at compression time by offering nine levels: -1 optimizes for speed and -9 optimizes for space.

The different tools have very different compression characteristics. Gzip is a general purpose compressor, and sits in the middle of the space/time trade-off. Bzip2 compresses more effectively than gzip, but is slower. Bzip2’s decompression speed is faster than its compression speed, but it is still slower than the other formats. LZO and Snappy, on the other hand, both optimize for speed and are around an order of magnitude faster than gzip, but compress less effectively. Snappy is also significantly faster than LZO for decompression.

3. Issues about compression and input split

When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input will create 16 input splits, each processed independently as input to a separate map task.

Imagine now the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS will store the file as 16 blocks. However, creating a split for each block won’t work since it is impossible to start reading at an arbitrary point in the gzip stream and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way that would allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the next block, thereby synchronizing itself with the stream. For this reason, gzip does not support splitting.

In this case, MapReduce will do the right thing and not try to split the gzipped file, since it knows that the input is gzip-compressed (by looking at the filename extension) and that gzip does not support splitting. This will work, but at the expense of locality: a single map will process the 16 HDFS blocks, most of which will not be local to the map. Also, with fewer maps, the job is less granular, and so may take longer to run.

If the file in our hypothetical example were an LZO file, we would have the same problem since the underlying compression format does not provide a way for a reader to synchronize itself with the stream. However, it is possible to preprocess LZO files using an indexer tool that comes with the Hadoop LZO libraries. The tool builds an index of split points, effectively making them splittable when the appropriate MapReduce input format is used.
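As a rough sketch (the jar and file paths are placeholders, and it assumes the hadoop-lzo library is installed on the cluster), the index is typically built by running the indexer over the compressed file:

hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /data/big_file.lzo

This writes a small big_file.lzo.index file alongside the original; the job must then use an LZO-aware input format (for example, LzoTextInputFormat from the same library) for the split points to be honored.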

A bzip2 file, on the other hand, does provide a synchronization marker between blocks (a 48-bit approximation of pi), so it does support splitting.

4. IO-bound and CPU-bound

Storing compressed data in HDFS allows your hardware allocation to go further since compressed data is often 25% of the size of the original data.  Furthermore, since MapReduce jobs are nearly always IO-bound, storing compressed data means there is less overall IO to do, meaning jobs run faster.  There are two caveats to this, however: some compression formats cannot be split for parallel processing, and others are slow enough at decompression that jobs become CPU-bound, eliminating your gains on IO.

The gzip compression format illustrates the first caveat. Imagine you have a 1.1 GB gzip file, and your cluster has a 128 MB block size. This file will be split into 9 chunks of approximately 128 MB each. In order to process these in parallel in a MapReduce job, a different mapper would be responsible for each chunk. But this means that the second mapper would start on an arbitrary byte about 128 MB into the file. The context dictionary that gzip uses to decompress input will be empty at this point, which means the gzip decompressor will not be able to correctly interpret the bytes. The upshot is that large gzip files in Hadoop need to be processed by a single mapper, which defeats the purpose of parallelism.

The bzip2 compression format illustrates the second caveat, in which jobs become CPU-bound. Bzip2 files compress well and are even splittable, but the decompression algorithm is slow and cannot keep up with the streaming disk reads that are common in Hadoop jobs. While bzip2 compression has some upside because it conserves storage space, running jobs now spend their time waiting for the CPU to finish decompressing data, which slows them down and offsets the other gains.

5. Summary

Reasons to compress:
a) Data is mostly stored and not frequently processed. This is the usual data warehouse (DWH) scenario, where the space saving can be much more significant than the processing overhead.
b) The compression factor is very high, so we save a lot of IO.
c) Decompression is very fast (as with Snappy), so we get some gain at little cost.
d) The data already arrives compressed.

Reasons not to compress:
a) The compressed data is not splittable. Note that many modern formats are built with block-level compression to enable splitting and other partial processing of the files.
b) The data is created in the cluster and compression takes significant time. Note that compression is usually much more CPU-intensive than decompression.
c) The data has little redundancy, so compression gives little gain.
