The log file has the following format:
220.181.108.151 - - [31/Jan/2012:00:02:32 +0800] "GET /home.php?mod=space&uid=158&do=album&view=me&from=space HTTP/1.1" 200 8784 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
208.115.113.82 - - [31/Jan/2012:00:07:54 +0800] "GET /robots.txt HTTP/1.1" 200 582 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; [email protected])"
220.181.94.221 - - [31/Jan/2012:00:09:24 +0800] "GET /home.php?mod=spacecp&ac=pm&op=showmsg&handlekey=showmsg_3&touid=3&pmid=0&daterange=2&pid=398&tid=66 HTTP/1.1" 200 10070 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
112.97.24.243 - - [31/Jan/2012:00:14:48 +0800] "GET /data/cache/style_2_common.css?AZH HTTP/1.1" 200 57752 "http://f.dataguru.cn/forum-58-1.html" "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9A406"
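The sample lines appear to follow the common "combined" access log layout: client IP, two identity fields, timestamp, request line, status code, response size, referer, and user agent. As a minimal sketch of how one such line breaks down into fields, the Python snippet below parses the first sample line with a regular expression. The names LOG_PATTERN and parse_line are hypothetical helpers introduced for illustration only; they are not part of Pig.

```python
import re

# Hypothetical pattern for the "combined" log layout seen in the sample lines.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# First sample line from the log excerpt above.
sample = (
    '220.181.108.151 - - [31/Jan/2012:00:02:32 +0800] '
    '"GET /home.php?mod=space&uid=158&do=album&view=me&from=space HTTP/1.1" '
    '200 8784 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; '
    '+http://www.baidu.com/search/spider.html)"'
)

fields = parse_line(sample)
print(fields["ip"], fields["status"], fields["size"])
```

For the job in this walkthrough only the first field, the client IP, is needed.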
1. Downloading Pig:
Download URL: http://www.apache.org/dyn/closer.cgi/pig
2. Installing Pig:
Extract the archive
$ tar -zxf pig-0.14.0.tar.gz
Set the environment variables (add the lines below to .bash_profile)
$ vi .bash_profile
PIG_INSTALL=/home/grid/pig-0.14.0
PIG_CLASSPATH=/home/grid/hadoop-1.2.1/conf/
PATH=$PATH:$PIG_INSTALL/bin
export PIG_INSTALL PATH PIG_CLASSPATH
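Set JAVA_HOME so that the pig launcher script can find the Java installation
Edit the hosts file so that the cluster hostnames (such as hadoop1) resolve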
Verify the installation
$ pig -help
Connect to the Hadoop cluster
$ pig
grunt> ls
hdfs://hadoop1:9000/user/grid/in <dir>
hdfs://hadoop1:9000/user/grid/out <dir>
3. Running the job
Load the data
grunt> A = LOAD 'in/8/access_log.txt' USING PigStorage(' ') AS (ip, page);
grunt> DESCRIBE A;
A: {ip: bytearray,page: bytearray}
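To make the effect of splitting on a single space concrete, here is a rough Python sketch of what the LOAD step does to one line. It assumes the loader splits each line on the delimiter and binds only the leading fields to the declared names, ignoring the rest. The function load_first_fields is a hypothetical stand-in, not a Pig API.

```python
def load_first_fields(line, names=("ip", "page"), delimiter=" "):
    """Split on the delimiter and bind the leading fields to the schema names.

    zip() stops at the shorter sequence, so fields beyond the declared
    names are ignored (an assumption about how the loader treats them).
    """
    fields = line.rstrip("\n").split(delimiter)
    return dict(zip(names, fields))


# First sample line from the log excerpt at the top of this post.
sample = (
    '220.181.108.151 - - [31/Jan/2012:00:02:32 +0800] '
    '"GET /home.php?mod=space&uid=158&do=album&view=me&from=space HTTP/1.1" '
    '200 8784 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; '
    '+http://www.baidu.com/search/spider.html)"'
)

record = load_first_fields(sample)
print(record)
```

Under that assumption the field named page receives the second space-separated token, which in the sample lines is "-" rather than the requested URL. This does not affect the job, because only ip is used in the following steps.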
Drop the fields that are not needed
grunt> B = FOREACH A GENERATE ip;
Group by IP
grunt> C = GROUP B BY ip;
grunt> DESCRIBE C;
C: {group: bytearray,B: {(ip: bytearray)}}
Count the records in each group
grunt> D = FOREACH C GENERATE group AS ip, COUNT(B) AS count;
View the results
grunt> DUMP D;
(127.0.0.1,2)
(1.59.65.67,2)
(112.4.2.19,9)
(112.4.2.51,80)
(60.2.99.33,42)
(69.28.58.5,1)
(69.28.58.6,9)
(69.28.58.8,5)
(1.193.3.227,3)
(1.202.221.3,6)
(117.136.9.4,6)
(121.31.62.3,26)
(182.204.8.4,59)
(183.9.112.2,25)
(221.12.37.6,25)
(223.4.16.88,2)
(27.9.110.75,122)
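For a quick local cross-check on a small extract of the log, the GROUP and COUNT steps can be approximated in plain Python. This is only a sketch of the same tally, not how Pig executes the job on the cluster. The function count_hits_per_ip is hypothetical, the sample lines are made-up, shortened entries rather than real data, and skipping blank lines is a choice made here.

```python
from collections import Counter


def count_hits_per_ip(lines):
    """Tally how many log lines start with each IP address."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines (a choice made in this sketch)
        ip = line.split(" ", 1)[0]  # first space-separated field
        counts[ip] += 1
    return counts


# Made-up, shortened lines for illustration only.
sample_lines = [
    '220.181.108.151 - - [31/Jan/2012:00:02:32 +0800] "GET /a HTTP/1.1" 200 1',
    '208.115.113.82 - - [31/Jan/2012:00:07:54 +0800] "GET /b HTTP/1.1" 200 2',
    '220.181.108.151 - - [31/Jan/2012:00:09:24 +0800] "GET /c HTTP/1.1" 200 3',
    '',
]

counts = count_hits_per_ip(sample_lines)
for ip, hits in counts.items():
    # Same (ip,count) shape as the DUMP output.
    print("({},{})".format(ip, hits))
```

To run it against a real file, the open file object can be passed directly, since iterating over a file yields its lines.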