Installing and Using Pig

1. Download the software:

wget http://apache.fayea.com/pig/pig-0.15.0/pig-0.15.0.tar.gz

2. Extract the archive:

tar -zxvf pig-0.15.0.tar.gz

mv pig-0.15.0 /usr/local/

ln -s /usr/local/pig-0.15.0 /usr/local/pig

3. Configure environment variables:

export PATH=$HOME/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/pig/bin:$PATH

export PIG_CLASSPATH=/usr/local/hadoop/etc/hadoop
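
The shell finds pig by scanning the PATH directories in order, which is why /usr/local/pig/bin has to be listed. The following is a minimal Python sketch for checking that lookup; find_on_path is a hypothetical helper name, not part of Pig or Hadoop, and the messages are only illustrative.

```python
import os
import shutil

def find_on_path(command, path_value=None):
    """Return the full path of `command` as a PATH lookup would resolve it, or None."""
    # shutil.which scans the directories of path_value (default: $PATH) in order.
    return shutil.which(command, path=path_value)

if __name__ == "__main__":
    pig = find_on_path("pig")
    print("pig resolves to:", pig if pig else "not found; check PATH and /usr/local/pig")
    # PIG_CLASSPATH is expected to name the Hadoop configuration directory.
    conf_dir = os.environ.get("PIG_CLASSPATH", "")
    print("PIG_CLASSPATH:", conf_dir or "(unset)", "| exists:", os.path.isdir(conf_dir))
```

If the lookup returns nothing after the exports have been applied, the symlink from step 2 is the first thing to check.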

4. Enter the Grunt shell:

Start Pig in local mode. In this mode all files and all execution stay on the local machine; it is generally used for testing programs.

$ pig -x local

15/10/03 01:14:09 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

15/10/03 01:14:09 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType

2015-10-03 01:14:09,756 [main] INFO  org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2015-10-03 01:14:09,758 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1443860049744.log

2015-10-03 01:14:10,133 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found

2015-10-03 01:14:12,648 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2015-10-03 01:14:12,656 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 01:14:12,685 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///

2015-10-03 01:14:13,573 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum

grunt>

Start Pig in MapReduce mode, the mode used for actual production work:

$ pig

15/10/03 02:11:54 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

15/10/03 02:11:54 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE

15/10/03 02:11:54 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2015-10-03 02:11:55,086 [main] INFO  org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2015-10-03 02:11:55,087 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1443863515062.log

2015-10-03 02:11:55,271 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found

2015-10-03 02:11:59,735 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 02:11:59,740 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2015-10-03 02:11:59,742 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://host61:9000/

2015-10-03 02:12:06,256 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 02:12:06,257 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: host61:9001

2015-10-03 02:12:06,265 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

grunt>

5. Pig can be run in three ways:

1. Script

2. Grunt (the interactive shell)

3. Embedded
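
The script mode can be illustrated as follows. This is a minimal Python sketch that writes a small Pig Latin script to a file and builds the command line that would run it in batch mode. The helper names write_script and pig_command are hypothetical, and the script body (path, delimiter, schema) is an illustrative assumption rather than a required layout.

```python
import os
import tempfile

# A small Pig Latin script; the path, delimiter and schema are illustrative assumptions.
PIG_SCRIPT = """\
records = LOAD '/tmp/mytest.txt' USING PigStorage('#') AS (filename:chararray, size:int);
big = FILTER records BY size > 4096;
DUMP big;
"""

def write_script(text, directory=None):
    """Write the Pig Latin text to a .pig file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".pig", dir=directory, text=True)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    return path

def pig_command(script_path, local=True):
    """Build the command line that runs a script file in batch mode."""
    command = ["pig"]
    if local:
        command += ["-x", "local"]  # omit to use the default MapReduce mode
    return command + [script_path]

if __name__ == "__main__":
    script = write_script(PIG_SCRIPT)
    print(" ".join(pig_command(script)))
    # To actually run it (requires a working Pig install):
    #   import subprocess; subprocess.run(pig_command(script), check=True)
```

The sketch only prints the command; running it is left commented out because it depends on a working Pig installation.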

6. Start Pig and try the common commands:

$ pig

15/10/03 06:01:01 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

15/10/03 06:01:01 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE

15/10/03 06:01:01 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2015-10-03 06:01:01,412 [main] INFO  org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2015-10-03 06:01:01,413 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1443877261408.log

2015-10-03 06:01:01,502 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found

2015-10-03 06:01:03,657 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 06:01:03,657 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2015-10-03 06:01:03,662 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://host61:9000/

2015-10-03 06:01:05,968 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 06:01:05,968 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: host61:9001

2015-10-03 06:01:05,979 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

grunt> help

Commands:

<pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig

File system commands:

fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html

Diagnostic commands:

describe <alias>[::<alias] - Show the schema for the alias. Inner aliases can be described as A::B.

explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml] [-param <param_name>=<param_value>]

[-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for entire script.

-script - Explain the entire script.

-out - Store the output into directory rather than print to stdout.

-brief - Don't expand nested plans (presenting a smaller graph for overview).

-dot - Generate the output in .dot format. Default is text format.

-xml - Generate the output in .xml format. Default is text format.

-param <param_name - See parameter substitution for details.

-param_file <file_name> - See parameter substitution for details.

alias - Alias to explain.

dump <alias> - Compute the alias and writes the results to stdout.

Utility Commands:

exec [-param <param_name>=param_value] [-param_file <file_name>] <script> -

Execute the script with access to grunt environment including aliases.

-param <param_name - See parameter substitution for details.

-param_file <file_name> - See parameter substitution for details.

script - Script to be executed.

run [-param <param_name>=param_value] [-param_file <file_name>] <script> -

Execute the script with access to grunt environment.

-param <param_name - See parameter substitution for details.

-param_file <file_name> - See parameter substitution for details.

script - Script to be executed.

sh  <shell command> - Invoke a shell command.

kill <job_id> - Kill the hadoop job specified by the hadoop job id.

set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.

The following keys are supported:

default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.

debug - Set debug on or off. Default is off.

job.name - Single-quoted name for jobs. Default is PigLatin:<script name>

job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal

stream.skippath - String that contains the path. This is used by streaming.

any hadoop property.

help - Display this message.

history [-n] - Display the list statements in cache.

-n Hide line numbers.

quit - Quit the grunt shell.

grunt> help sh

(The help text shown above is printed again in full and is omitted here; Grunt then reports the following parse error.)

2015-10-03 06:02:22,264 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <EOL> "\n "" at line 2, column 8.

Was expecting one of:

"cat" ...

"clear" ...

"cd" ...

"cp" ...

"copyFromLocal" ...

"copyToLocal" ...

"dump" ...

"\\d" ...

"describe" ...

"\\de" ...

"aliases" ...

"explain" ...

"\\e" ...

"help" ...

"history" ...

"kill" ...

"ls" ...

"mv" ...

"mkdir" ...

"pwd" ...

"quit" ...

"\\q" ...

"register" ...

"using" ...

"as" ...

"rm" ...

"set" ...

"illustrate" ...

"\\i" ...

"run" ...

"exec" ...

"scriptDone" ...

<IDENTIFIER> ...

<PATH> ...

<QUOTEDSTRING> ...

Details at logfile: /home/hadoop/pig_1443877261408.log

List the current directory:

grunt> ls

hdfs://host61:9000/user/hadoop/.Trash <dir>

List the root directory:

grunt> ls /

hdfs://host61:9000/in <dir>

hdfs://host61:9000/out <dir>

hdfs://host61:9000/user <dir>

Change directory:

grunt> cd /

List the current directory again:

grunt> ls

hdfs://host61:9000/in <dir>

hdfs://host61:9000/out <dir>

hdfs://host61:9000/user <dir>

grunt> cd /in

grunt> ls

hdfs://host61:9000/in/jdk-8u60-linux-x64.tar.gz<r 3> 181238643

hdfs://host61:9000/in/mytest1.txt<r 3> 23

hdfs://host61:9000/in/mytest2.txt<r 3> 24

hdfs://host61:9000/in/mytest3.txt<r 3> 4

View the contents of a file:

grunt> cat mytest1.txt

this is the first file

Copy a file from HDFS to the local file system:

grunt> copyToLocal /in/mytest5.txt /home/hadoop/mytest.txt

$ ls -l mytest.txt

-rw-r--r--. 1 hadoop hadoop 102 Oct  3 06:23 mytest.txt

Prefix an operating-system command with sh to run it from inside Grunt:

grunt> sh ls -l /home/hadoop/mytest.txt

-rw-r--r--. 1 hadoop hadoop 102 Oct  3 06:23 /home/hadoop/mytest.txt

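7. Pig's data model:

bag: a table

tuple: a row (record)

field: an attribute (column)

Pig does not require the tuples in a bag to have the same number of fields, or fields of the same type.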
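
The looseness of this model can be pictured with ordinary Python values. The sketch below is an analogy only, not Pig code: a bag is modelled as a list of tuples, and the sample values (including "ext4") are made up for illustration.

```python
# A bag is a collection of tuples; modelled here as a list of Python tuples.
bag = [
    ("bin", 4096),            # two fields: a string and an integer
    ("boot", 1024, "ext4"),   # three fields in the same bag
    (3680,),                  # a single field of a different type
]

def field_counts(bag):
    """Return the number of fields in each tuple of the bag."""
    return [len(t) for t in bag]

def field_types(bag):
    """Return the Python type names of the fields of each tuple."""
    return [tuple(type(f).__name__ for f in t) for t in bag]

print(field_counts(bag))   # [2, 3, 1]
print(field_types(bag))    # [('str', 'int'), ('str', 'int', 'str'), ('int',)]
```

Nothing in the container forces the three tuples to agree, which mirrors the rule stated above.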

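8. Common Pig Latin statements:

LOAD: specifies how the data is loaded;

FOREACH: scans the data row by row and applies some processing to each row;

FILTER: filters rows;

DUMP: prints the result to the screen;

STORE: saves the result to a file;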

9. Data processing example:

Generate a test file:

$ ls -l / | awk '{if (NR != 1) print $NF"#"$5}' > /tmp/mytest.txt

$ cat /tmp/mytest.txt

bin#4096

boot#1024

dev#3680

etc#12288

home#4096

lib#4096

lib64#12288

lost+found#16384

media#4096

mnt#4096

opt#4096

proc#0

root#4096

sbin#12288

selinux#0

srv#4096

sys#0

tmp#4096

usr#4096

var#4096
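
For readers less familiar with awk, the one-liner above skips the first line of the ls -l output and joins the last column (the name) and the fifth column (the size) with a #. A rough Python equivalent is sketched below; name_size_lines is a hypothetical helper and the sample listing is invented for illustration.

```python
def name_size_lines(ls_output):
    """Mimic awk '{if (NR != 1) print $NF"#"$5}' on the text printed by ls -l."""
    result = []
    for number, line in enumerate(ls_output.splitlines(), start=1):
        fields = line.split()          # awk also splits on runs of whitespace by default
        # NR == 1 is the "total ..." header. Lines with fewer than five fields are
        # skipped here as a simplification (awk would print an empty size instead).
        if number == 1 or len(fields) < 5:
            continue
        result.append(fields[-1] + "#" + fields[4])
    return result

# Invented sample of ls -l output; only the fifth and last columns matter here.
sample = """total 98
dr-xr-xr-x.  2 root root  4096 Sep 20 10:01 bin
dr-xr-xr-x.  5 root root  1024 Sep 20 09:40 boot
"""
print("\n".join(name_size_lines(sample)))
```

As with the awk version, a name containing spaces would be cut down to its last word, so this approach suits simple listings only.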

Load the file:

grunt> records = LOAD '/tmp/mytest.txt' USING PigStorage('#') AS (filename:chararray, size:int);

2015-10-03 07:35:48,479 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2015-10-03 07:35:48,480 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 07:35:48,497 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum

2015-10-03 07:35:48,716 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 07:35:48,723 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum

2015-10-03 07:35:48,723 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
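
What the LOAD statement does with each line can be sketched in plain Python: split on the # delimiter and convert the second field to an integer. The function name load is hypothetical, and the handling of bad values is a loose imitation of Pig producing null for a field that cannot be cast to its declared type.

```python
def load(lines, delimiter="#"):
    """Parse 'filename#size' lines into (filename, size) tuples.

    A missing or non-numeric size becomes None, loosely mirroring a null field.
    """
    records = []
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        filename = parts[0]
        try:
            size = int(parts[1])
        except (IndexError, ValueError):
            size = None
        records.append((filename, size))
    return records

print(load(["bin#4096", "boot#1024", "odd#n/a"]))
# [('bin', 4096), ('boot', 1024), ('odd', None)]
```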

Display the records:

grunt> DUMP records;

(bin,4096)

(boot,1024)

(dev,3680)

(etc,12288)

(home,4096)

(lib,4096)

(lib64,12288)

(lost+found,16384)

(media,4096)

(mnt,4096)

(opt,4096)

(proc,0)

(root,4096)

(sbin,12288)

(selinux,0)

(srv,4096)

(sys,0)

(tmp,4096)

(usr,4096)

(var,4096)

Show the schema of records:

grunt> DESCRIBE records;

records: {filename: chararray,size: int}

Filter the records:

grunt> filter_records = FILTER records BY size > 4096;

grunt> DUMP filter_records;

(etc,12288)

(lib64,12288)

(lost+found,16384)

(sbin,12288)

grunt> DESCRIBE filter_records;

filter_records: {filename: chararray,size: int}
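
The filter keeps only tuples whose size is strictly greater than 4096, so the many 4096-byte directories are excluded. A small Python analogy, with made-up sample tuples and a hypothetical helper name, shows the same condition, including a null size that never passes the comparison:

```python
records = [("bin", 4096), ("etc", 12288), ("proc", 0), ("lost+found", 16384), ("odd", None)]

def filter_by_size(records, threshold):
    """Keep tuples whose size is strictly greater than threshold."""
    # A None (null) size never satisfies the comparison, so that tuple is dropped.
    return [r for r in records if r[1] is not None and r[1] > threshold]

print(filter_by_size(records, 4096))
# [('etc', 12288), ('lost+found', 16384)]
```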

Group:

grunt> group_records = GROUP records BY size;

grunt> DUMP group_records;

(0,{(sys,0),(proc,0),(selinux,0)})

(1024,{(boot,1024)})

(3680,{(dev,3680)})

(4096,{(var,4096),(usr,4096),(tmp,4096),(srv,4096),(root,4096),(opt,4096),(mnt,4096),(media,4096),(lib,4096),(home,4096),(bin,4096)})

(12288,{(etc,12288),(lib64,12288),(sbin,12288)})

(16384,{(lost+found,16384)})

grunt> DESCRIBE group_records;

group_records: {group: int,records: {(filename: chararray,size: int)}}
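
The schema above says that each output tuple holds the grouping key (named group) and a bag containing every original tuple with that key. The same shape can be sketched in Python; group_by_size is a hypothetical helper, the sample data is a subset chosen for illustration, and the keys are sorted here only to make the printout stable.

```python
from collections import defaultdict

records = [("bin", 4096), ("boot", 1024), ("etc", 12288), ("home", 4096), ("sbin", 12288)]

def group_by_size(records):
    """Return (group_key, bag) pairs, one per distinct size, ordered by key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[1]].append(record)   # the whole tuple goes into the bag
    return sorted(groups.items())

for key, bag in group_by_size(records):
    print(key, bag)
# 1024 [('boot', 1024)]
# 4096 [('bin', 4096), ('home', 4096)]
# 12288 [('etc', 12288), ('sbin', 12288)]
```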

Flatten the grouped records:

grunt> format_records = FOREACH group_records GENERATE group, FLATTEN(records);

Remove duplicates:

grunt> dis_records = DISTINCT records;

Sort:

grunt> ord_records = ORDER dis_records BY size DESC;

Take the first 3 rows:

grunt> top_records = LIMIT ord_records 3;
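
The DISTINCT, ORDER and LIMIT steps chain naturally into a "top N" query. The following Python sketch imitates that chain on a few invented tuples; top_n_by_size is a hypothetical helper, not a Pig function.

```python
records = [("bin", 4096), ("etc", 12288), ("bin", 4096), ("lost+found", 16384), ("boot", 1024)]

def top_n_by_size(records, n):
    """DISTINCT, then ORDER BY size DESC, then LIMIT n."""
    distinct = set(records)                                       # whole-tuple de-duplication
    ordered = sorted(distinct, key=lambda r: r[1], reverse=True)  # largest size first
    return ordered[:n]

print(top_n_by_size(records, 3))
# [('lost+found', 16384), ('etc', 12288), ('bin', 4096)]
```

When several tuples share the same size, which of them survive the cut depends on the sort's tie handling, so a secondary sort key is worth adding if the result must be reproducible.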

Compute the maximum:

grunt> max_records = FOREACH group_records GENERATE group, MAX(records.size);

grunt> DUMP max_records;

(0,0)

(1024,1024)

(3680,3680)

(4096,4096)

(12288,12288)

(16384,16384)
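
Both columns of this output are identical because the records were grouped by size, so the maximum size inside each group is the group key itself. A Python sketch with a few sample tuples makes this visible and also shows an overall maximum for comparison; the helper name is hypothetical.

```python
from collections import defaultdict

records = [("bin", 4096), ("boot", 1024), ("etc", 12288), ("home", 4096)]

def max_size_per_group(records):
    """GROUP BY size, then take MAX(size) inside each group."""
    groups = defaultdict(list)
    for _, size in records:
        groups[size].append(size)
    return [(key, max(sizes)) for key, sizes in sorted(groups.items())]

print(max_size_per_group(records))
# [(1024, 1024), (4096, 4096), (12288, 12288)]

# The overall maximum across all records, for comparison.
print(max(size for _, size in records))
# 12288
```

In Pig, an overall maximum would usually be obtained by grouping all records into a single group (GROUP records ALL) before applying MAX.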

View the execution plan:

grunt> EXPLAIN max_records;

Save the result sets:

grunt> STORE group_records INTO '/tmp/mytest_group';

grunt> STORE filter_records INTO '/tmp/mytest_filter';

grunt> STORE max_records INTO '/tmp/mytest_max';

10. UDF

Pig supports writing UDFs (user-defined functions) in Java, Python, and JavaScript.
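
As a rough illustration of the Python option, a UDF is an ordinary function decorated with an output schema. The sketch below assumes the Jython route, where Pig supplies the outputSchema decorator when the file is registered; the fallback definition exists only so that the file also runs under plain Python. The file name udf_sketch.py and the function to_kb are hypothetical.

```python
# udf_sketch.py (hypothetical file name).
# Assumption: when registered with Pig's Jython engine, outputSchema is already defined.
# The fallback below only lets the file run and be tested under plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.output_schema = schema   # remember the declared schema for inspection
            return func
        return decorate

@outputSchema("size_kb:double")
def to_kb(size):
    """Convert a size in bytes to kilobytes; pass null (None) through unchanged."""
    if size is None:
        return None
    return size / 1024.0

if __name__ == "__main__":
    print(to_kb(4096))   # 4.0
```

Under those assumptions, the file would typically be registered in Grunt with REGISTER 'udf_sketch.py' USING jython AS myfuncs; and then called as myfuncs.to_kb(size) inside a FOREACH ... GENERATE statement. The exact setup, such as having the Jython jar available to Pig, depends on the installation.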

Time: 2024-08-09 22:02:13
