Using LZO Compression with Hadoop 2.2.0 and Hive

Environment:

CentOS 6.4, 64-bit

Hadoop2.2.0

Sun JDK1.7.0_45

hive-0.12.0

Prerequisites:

yum -y install  lzo-devel  zlib-devel  gcc autoconf automake libtool

Let's get started.

(1) Install LZO

wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
tar -zxvf lzo-2.06.tar.gz
./configure --enable-shared --prefix=/usr/local/hadoop/lzo/
make && make test && make install

After installation, copy /usr/local/hadoop/lzo/lib/* to /usr/lib/ and /usr/lib64/:

sudo cp /usr/local/hadoop/lzo/lib/* /usr/lib/

sudo cp /usr/local/hadoop/lzo/lib/* /usr/lib64/

Configure the environment variable (vim /etc/bashrc): export PATH=/usr/local/hadoop/lzo/:$PATH
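As an alternative to copying the libraries into /usr/lib and /usr/lib64, the LZO library directory can be registered with the dynamic linker. A minimal sketch (requires root; the path assumes the --prefix used above, and the drop-in file name is my own choice):

```shell
# Defined as a function so it can be run explicitly as root.
register_lzo_libs() {
    # Hypothetical drop-in file name; any name under /etc/ld.so.conf.d/ works.
    echo "/usr/local/hadoop/lzo/lib" > /etc/ld.so.conf.d/hadoop-lzo.conf
    ldconfig    # rebuild the shared-library cache
}
# Run as root, e.g.: sudo bash -c "$(declare -f register_lzo_libs); register_lzo_libs"
```

This keeps system library directories clean and survives upgrades of the LZO build.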

(2) Install LZOP
wget http://www.lzop.org/download/lzop-1.03.tar.gz
tar -zxvf lzop-1.03.tar.gz

export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include/

Note: without this variable, configure fails with:

configure: error: LZO header files not found. Please check your installation or set the environment variable `CPPFLAGS'.

Next:

./configure --enable-shared --prefix=/usr/local/hadoop/lzop
make  && make install

(3) Link lzop into /usr/bin/
ln -s /usr/local/hadoop/lzop/bin/lzop /usr/bin/lzop

(4) Test lzop
lzop /home/hadoop/data/access_20131219.log

Running lzop may fail with:

报错:lzop: error while loading shared libraries: liblzo2.so.2: cannot open shared object file: No such file or directory

Fix: add the environment variable export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64

If a compressed file with the .lzo suffix is produced (/home/hadoop/data/access_20131219.log.lzo), all of the preceding steps worked.
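To double-check the round trip, lzop can verify and decompress the archive in place. A guarded sketch (assumes lzop is on PATH and the sample log path used above exists):

```shell
LOG=/home/hadoop/data/access_20131219.log
if command -v lzop >/dev/null 2>&1 && [ -f "$LOG.lzo" ]; then
    lzop -t "$LOG.lzo"                  # integrity test; silent when the archive is OK
    lzop -dc "$LOG.lzo" | head -n 1     # print the first decompressed line to stdout
    status="verified"
else
    status="skipped: lzop or sample file not available"
fi
echo "$status"
```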
(5) Install Hadoop-LZO

One more prerequisite: Maven plus SVN or Git must already be set up (I used SVN). I won't cover that here; if you can't get those working, there is no point continuing.

I used https://github.com/twitter/hadoop-lzo

Check out the code from https://github.com/twitter/hadoop-lzo/trunk with SVN, then modify one part of pom.xml.

From:

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.current.version>2.1.0-beta</hadoop.current.version>
<hadoop.old.version>1.0.4</hadoop.old.version>
</properties>

To:

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.current.version>2.2.0</hadoop.current.version>
<hadoop.old.version>1.0.4</hadoop.old.version>
</properties>

Then run, in order:

mvn clean package -Dmaven.test.skip=true
tar -cBf - -C target/native/Linux-amd64-64/lib . | tar -xBvf - -C /home/hadoop/hadoop-2.2.0/lib/native/
cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/

Next, sync /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar and /home/hadoop/hadoop-2.2.0/lib/native/ to all other Hadoop nodes. Make sure the user that runs Hadoop has execute permission on the files under /home/hadoop/hadoop-2.2.0/lib/native/.
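The sync step can be scripted. A sketch assuming passwordless SSH; the hostnames in NODES are hypothetical placeholders, replace them with your own DataNodes:

```shell
sync_lzo_artifacts() {
    # Hypothetical hostnames; substitute your own node list.
    NODES="hadoop-node1 hadoop-node2 hadoop-node3"
    JAR=/home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar
    NATIVE=/home/hadoop/hadoop-2.2.0/lib/native/
    for node in $NODES; do
        scp "$JAR" "$node:$JAR"             # copy the hadoop-lzo jar to the same path
        rsync -a "$NATIVE" "$node:$NATIVE"  # mirror the native libs, preserving permissions
    done
}
# Invoke once the host list is correct: sync_lzo_artifacts
```

rsync's -a flag preserves the execute permissions mentioned above, which scp of individual files does not always guarantee.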

(6) Configure Hadoop

Append the following to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib

Append the following to $HADOOP_HOME/etc/hadoop/core-site.xml:

<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.BZip2Codec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

Append the following to $HADOOP_HOME/etc/hadoop/mapred-site.xml:

<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
<name>mapred.child.env</name>
<value>LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib</value>
</property>
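After restarting the cluster, a quick way to confirm the codec is picked up is `hadoop fs -text`, which decompresses files through the codecs configured in core-site.xml. A guarded sketch (the HDFS target directory /tmp/ is my own choice):

```shell
if command -v hadoop >/dev/null 2>&1; then
    # Upload a compressed file and read it back; -text decodes .lzo transparently
    hadoop fs -put /home/hadoop/data/access_20131219.log.lzo /tmp/
    hadoop fs -text /tmp/access_20131219.log.lzo | head -n 1
    result="checked"
else
    result="skipped: hadoop not on PATH"
fi
echo "$result"
```

If -text prints readable log lines rather than raw bytes or a codec error, the LzopCodec is loaded correctly.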

(7) Try LZO in Hive

A: First, create the logs_app_nginx table

CREATE TABLE logs_app_nginx (
ip STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
rt STRING,
referer STRING,
agent STRING,
forwarded String
)
partitioned by (
date string,
host string
)
row format delimited
fields terminated by '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

B: Load the data

LOAD DATA LOCAL INPATH '/home/hadoop/data/access_20131230_25.log.lzo' INTO TABLE logs_app_nginx PARTITION(date='20131229', host='25');

The file /home/hadoop/data/access_20131219.log has the following format:

221.207.93.109  -       [23/Dec/2013:23:22:38 +0800]    "GET /ClientGetResourceDetail.action?id=318880&token=Ocm HTTP/1.1"   200     199     0.008   "xxx.com"        "Android4.1.2/LENOVO/Lenovo A706/ch_lenovo/80"   "-"

Simply running lzop /home/hadoop/data/access_20131219.log produces the LZO-compressed file /home/hadoop/data/access_20131219.log.lzo.

C: Index the LZO files

$HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/logs_app_nginx
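The indexer writes a `.index` file next to each `.lzo` file; without it an LZO file is not splittable and the whole file goes to a single mapper. For small inputs, the non-MapReduce `com.hadoop.compression.lzo.LzoIndexer` in the same jar also works. A guarded sketch:

```shell
JAR=/home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar
if command -v hadoop >/dev/null 2>&1 && [ -f "$JAR" ]; then
    # Index in-process instead of launching a MapReduce job
    hadoop jar "$JAR" com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/logs_app_nginx
    # Expect .lzo.index files alongside the data
    hadoop fs -ls /user/hive/warehouse/logs_app_nginx
    note="indexed"
else
    note="skipped: hadoop or hadoop-lzo jar not available"
fi
echo "$note"
```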

D: Run a MapReduce job through Hive

set hive.exec.reducers.max=10;
set mapred.reduce.tasks=10;
select ip,rt from logs_app_nginx limit 10;

If the Hive console shows output like the following, everything is working:

hive> set hive.exec.reducers.max=10;
hive> set mapred.reduce.tasks=10;
hive> select ip,rt from logs_app_nginx limit 10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1388065803340_0009, Tracking URL = http://lrts216:8088/proxy/application_1388065803340_0009/
Kill Command = /home/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1388065803340_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-12-27 09:13:39,163 Stage-1 map = 0%, reduce = 0%
2013-12-27 09:13:45,343 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2013-12-27 09:13:46,369 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
MapReduce Total cumulative CPU time: 1 seconds 220 msec
Ended Job = job_1388065803340_0009
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.22 sec HDFS Read: 63570 HDFS Write: 315 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 220 msec
OK
221.207.93.109 "XXX.com"
Time taken: 17.498 seconds, Fetched: 10 row(s)

