Write Custom Java to Create LZO Files

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO

LanguageManual LZO

Skip to end of metadata

Go to start of metadata

LZO Compression

General LZO Concepts

LZO is a lossless data compression library that favors speed over compression ratio. See http://www.oberhumer.com/opensource/lzo and http://www.lzop.org for general information about LZO and see Compressed Data Storage for information about compression in Hive.

Imagine a simple data file that has three columns

  • id
  • first name
  • last name

Let‘s populate a data file containing 4 records:

19630001     john          lennon
19630002     paul          mccartney
19630003     george        harrison
19630004     ringo         starr

Let‘s call the data file /path/to/dir/names.txt.

In order to make it into an LZO file, we can use the lzop utility and it will create a names.txt.lzo file.

Now copy the file names.txt.lzo to HDFS.

Prerequisites

Lzo/Lzop Installations

lzo and lzop need to be installed on every node in the Hadoop cluster. The details of these installations are beyond the scope of this document.

core-site.xml

Add the following to your core-site.xml:

  • com.hadoop.compression.lzo.LzoCodec
  • com.hadoop.compression.lzo.LzopCodec

For example:

<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>

<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

Next we run the command to create an LZO index file:

hadoop jar /path/to/jar/hadoop-lzo-cdh4-0.4.15-gplextras.jar com.hadoop.compression.lzo.LzoIndexer  /path/to/HDFS/dir/containing/lzo/files

This creates names.txt.lzo on HDFS.

Table Definition

The following hive -e command creates an LZO-compressed external table:

hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS hive_table_name (column_1  datatype_1......column_N datatype_N)
         PARTITIONED BY (partition_col_1 datatype_1 ....col_P  datatype_P)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t‘
         STORED AS INPUTFORMAT  \"com.hadoop.mapred.DeprecatedLzoTextInputFormat\"
                   OUTPUTFORMAT \"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\";

Note: The double quotes have to be escaped so that the ‘hive -e‘ command works correctly.

See CREATE TABLE and Hive CLI for information about command syntax.

Hive Queries

Option 1: Directly Create LZO Files

  1. Directly create LZO files as the output of the Hive query.
  2. Use lzop command utility or your custom Java to generate .lzo.index for the .lzo files.

Hive Query Parameters

SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec
SET hive.exec.compress.output=true
SET mapreduce.output.fileoutputformat.compress=true

For example:

hive -e "SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec; SET hive.exec.compress.output=true;SET mapreduce.output.fileoutputformat.compress=true; <query-string>"

     Note: If the data sets are large or number of output files are large , then this option does not work.

Option 2: Write Custom Java to Create LZO Files

  1. Create text files as the output of the Hive query.
  2. Write custom Java code to
    1. convert Hive query generated text files to .lzo files
    2. generate .lzo.index files for the .lzo files generated above

Hive Query Parameters

Prefix the query string with these parameters:

SET hive.exec.compress.output=false
SET mapreduce.output.fileoutputformat.compress=false

For example:

hive -e "SET hive.exec.compress.output=false;SET mapreduce.output.fileoutputformat.compress=false;<query-string>"
时间: 2024-11-05 23:18:40

Write Custom Java to Create LZO Files的相关文章

VBA bat create excel files

the database is current excel when you click "create" button, then create excel by customer code . in the data source have a lot of data,contains customer code and the detail info . one customer code point many detail info each a customer code c

Java使用JSP Tag Files &amp; JSP EL Functions打造你自己的页面模板

1. 简单说明:在JSP 2.0后, 你不再需要大刀阔斧地定义一堆TagSupport或BodyTagSupport, 使用JSP Tag Files技术可以实现功能强大的页面模板技术. 在这里抛砖引玉, 结合项目开发, 简单介绍Tag Files技术的应用. 至于详细教程与资料, 请大家参考Java EE Tutorial, 上面有详细的E文资料.http://docs.oracle.com/javaee/5/tutorial/doc/bnama.html2. 定义模板:/WEB-INF/ta

[Tools] Batch Create Markdown Files from a Template with Node.js and Mustache

Creating Markdown files from a template is a straightforward process with Node.js and Mustache. You can define a template, load it into your script, then push whatever data you have into your template, then write the files back out. Node.js built-in

Watch out for these 10 common pitfalls of experienced Java developers &amp; architects--转

原文地址:http://zeroturnaround.com/rebellabs/watch-out-for-these-10-common-pitfalls-of-experienced-java-developers-architects/ Can we start by asking a serious question? How easy is it to find advice for novice Java programmers on the web? Whenever I loo

Resolving Problems installing the Java JCE Unlimited Strength Jurisdiction Policy Files package--转

原文地址:https://www.ca.com/us/services-support/ca-support/ca-support-online/knowledge-base-articles.tec1698523.html Introduction/Summary:  The base Java JVM and SDK installs from Oracle are limited in strength for the cryptographic functions that they c

Java Secure Socket Extension (JSSE) Reference Guide

Skip to Content Oracle Technology Network Software Downloads Documentation Search Java Secure Socket Extension (JSSE) Reference Guide This guide covers the following topics: Skip Navigation Links Introduction Features and Benefits JSSE Standard API S

Java性能提示(全)

http://www.onjava.com/pub/a/onjava/2001/05/30/optimization.htmlComparing the performance of LinkedLists and ArrayLists (and Vectors) (Page last updated May 2001, Added 2001-06-18, Author Jack Shirazi, Publisher OnJava). Tips: ArrayList is faster than

4 kafka集群部署及生产者java客户端编程 + kafka消费者java客户端编程

本博文的主要内容有   kafka的单机模式部署 kafka的分布式模式部署 生产者java客户端编程 消费者java客户端编程 运行kafka ,需要依赖 zookeeper,你可以使用已有的 zookeeper 集群或者利用 kafka自带的zookeeper. 单机模式,用的是kafka自带的zookeeper, 分布式模式,用的是外部安装的zookeeper,即公共的zookeeper. Step 6: Setting up a multi-broker cluster So far w

基础知识(2)- Java SE 8 Programmer II (1z0-809)

Java Class Design Implement encapsulation Implement inheritance including visibility modifiers and composition Implement polymorphism Override hashCode, equals, and toString methods from Object class Create and use singleton classes and immutable cla