Importing Nginx logs into Hive 0.13.1 and syncing to HBase 0.96.2, with an auto-increment ID as the RowKey

Contents:

  I. Nginx configuration and test data

  II. Creating the working directory in Hadoop

  III. Creating the Hive log table and syncing it to HBase

    1) Create the auto-increment ID class and add it to the Hive environment

    2) Create the log table in Hive

    3) Import the Nginx log data into the Hive log table

      Method 1: Load a file from a local directory into Hive

      Method 2: Load data from HDFS

      Method 3: Monitor the logs with a tool such as Flume or Kafka and import them automatically

    4) Create a table in Hive that HBase can recognise

    5) Load the Hive log table into that HBase table, using the auto-increment ID as the rowkey

    6) Verify in HBase: use get, scan, and filter to retrieve specific data

  IV. FAQ

  V. Further reading

 

  I. Nginx configuration and test data

  # The log_format section of the Nginx configuration file

log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for" '
                  '"$gzip_ratio" $request_time $bytes_sent $request_length '
                  '"$upstream_addr" $upstream_status $upstream_response_time';

  # Part of the log generated by Nginx (test data)

8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET / HTTP/1.1" 200 2373 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "2.78" 0.004 2683 369 "unix:/var/run/php5-fpm.sock" 200 0.004

8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/css/base.css HTTP/1.1" 200 940 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 1247 373 "-" - -

8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/a3e2e507/jquery.min.js HTTP/1.1" 200 93636 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.152 93976 359 "-" - -

8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/logo.png HTTP/1.1" 200 6059 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 6369 377 "-" - -

8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/p02.jpg HTTP/1.1" 200 22177 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 22489 376 "-" - -

8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/p03.png HTTP/1.1" 200 3012 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 3321 376 "-" - -

8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/two-dimension-code1.png HTTP/1.1" 200 761 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 1069 392 "-" - -

8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/bg.png HTTP/1.1" 200 11474 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 11785 375 "-" - -

8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/p04.png HTTP/1.1" 200 2860 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 3169 376 "-" - -

8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/p06.png HTTP/1.1" 200 74097 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.062 74409 376 "-" - -

8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/p05.png HTTP/1.1" 200 132072 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.256 132385 376 "-" - -

8.8.8.8 - - [22/Aug/2014:20:23:46 +0800] "GET 8.8.8.8/b519d8ca/image/p07.png HTTP/1.1" 200 207987 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.592 208300 376 "-" - -

8.8.8.8 - - [22/Aug/2014:20:23:46 +0800] "GET 8.8.8.8/b519d8ca/image/p01.png HTTP/1.1" 200 310418 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.918 310731 376 "-" - -

  II. Creating the working directory in Hadoop

  # Create the directory in Hadoop 2.2.0 that the Hive external table will use

root@m1:/home/hadoop# /home/hadoop/hadoop-2.2.0/bin/hadoop fs -mkdir /user/hive/warehouse/nginxlog
root@m1:/home/hadoop# /home/hadoop/hadoop-2.2.0/bin/hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x   - root supergroup          0 2014-01-22 23:13 /user/hive/warehouse/nginxlog
root@m1:/home/hadoop#

  III. Creating the Hive log table and syncing it to HBase

    1) Create the auto-increment ID Java class UDFRowSequence.java and add it to the Hive environment

  # Create a Maven project in Eclipse and package it into a jar with Maven. The build needs hadoop-common-2.2.0.jar and hive-exec-0.13.1.jar, which can be found in the Hadoop 2.2.0 and Hive 0.13.1 installation directories. If you are not sure how to package a Maven project, the Java-client part of the article 《Golang、Php、Python、Java基于Thrift0.9.1实现跨语言调用》 walks through it.
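
  # For reference, a rough sketch of the pom.xml fragments such a project might use. The dependency versions match the environment in this article; the project groupId/artifactId, the provided scope, and the use of the assembly plugin's jar-with-dependencies descriptor (which yields the file name used below) are assumptions about how the jar was built:

<!-- build with: mvn clean package assembly:single -->
<dependencies>
  <!-- Hive UDF base classes (UDF, @Description, @UDFType) -->
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>0.13.1</version>
    <scope>provided</scope>
  </dependency>
  <!-- Hadoop Writable types (LongWritable) -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
  </dependency>
</dependencies>

<build>
  <plugins>
    <!-- produces the *-jar-with-dependencies.jar referenced below -->
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
    </plugin>
  </plugins>
</build>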

  The code of UDFRowSequence.java is shown below. Maven packages it into a file named idoall.org-0.0.1-SNAPSHOT-jar-with-dependencies.jar, which is then uploaded to Hive's lib directory. A download link for the code is given at the end of this article:

package idoall.org.hive;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;

/**
 * UDFRowSequence - returns an auto-increment row sequence number.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
// stateful = true tells Hive the UDF keeps state between rows, so the call is
// not folded into a constant; the counter lives inside a single task, so the
// sequence is only globally unique when the query runs in one mapper/reducer.
@UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF
{
  private LongWritable result = new LongWritable();

  public UDFRowSequence() {
    result.set(0);
  }

  // Called once per row: increment the counter and return it.
  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}

    2) Create the log table in Hive

  # Then add our custom idoall.org-0.0.1-SNAPSHOT-jar-with-dependencies.jar to the Hive runtime environment

hive> ADD JAR /home/hadoop/hive-0.13.1/lib/idoall.org-0.0.1-SNAPSHOT-jar-with-dependencies.jar;

Added /home/hadoop/hive-0.13.1/lib/idoall.org-0.0.1-SNAPSHOT-jar-with-dependencies.jar to class path

Added resource: /home/hadoop/hive-0.13.1/lib/idoall.org-0.0.1-SNAPSHOT-jar-with-dependencies.jar

hive>

  # Create the custom function in Hive to process the data. If you don't want to run the ADD JAR command every time, you can instead add the jar file to the <value> node under the <name>hive.aux.jars.path</name> property in hive-site.xml (a sketch follows the command output below).

hive> CREATE TEMPORARY FUNCTION rowSequence AS 'idoall.org.hive.UDFRowSequence';

OK

Time taken: 0.048 seconds
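
  # If you prefer not to run ADD JAR in every session, a rough hive-site.xml fragment such as the one below registers the jar once and for all. The path is simply this article's example location; depending on the Hive version the value may need the file:// prefix:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///home/hadoop/hive-0.13.1/lib/idoall.org-0.0.1-SNAPSHOT-jar-with-dependencies.jar</value>
</property>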

  # Create the table in Hive that collects the Nginx logs (nginx_accesslog)

create external table nginx_accesslog(
host string,
hostuser string,
times string,
requestmethond string,
requesturl string,
requesthttp string,
status string,
body_bytes_sent string,
referer string,
useragent string,
http_x_forwarded_for string,
gzip_ratio string,
request_time string,
bytes_sent string,
request_length string,
upstream_addr string,
upstream_status string,
upstream_response_time string)
PARTITIONED BY(YEAR STRING, MONTH STRING, DAY STRING)
row format SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*)\\s+-\\s+(.+?|-)\\s+\\[(.*)\\]\\s+\"([^ ]*)\\s+([^ ]*)\\s+([^ |\"]*)\"\\s+(-|[0-9]*)\\s+(-|[0-9]*)\\s+\"(.+?|-)\"\\s+\"(.+?|-)\"\\s+\"(.+?|-)\"\\s+\"(.+?|-)\"\\s+(.+?|-)\\s+(.+?|-)\\s+(.+?|-)\\s+\"(.+?|-)\"\\s+(.+?|-)\\s+(.*)","output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s %18$s")
STORED AS TEXTFILE
location '/user/hive/warehouse/nginxlog';

    3) Import the Nginx log data into the Hive log table

About the OVERWRITE behaviour when loading data (a minimal contrast is sketched below):

•Any existing content in the target table (or partition) is deleted first, and then the contents of the file/directory pointed to by filepath are added to the table/partition.

•If the target table (partition) already contains a file whose name conflicts with a file name under filepath, the existing file is replaced by the new one.

If you use INTO instead of OVERWRITE, records are only appended.
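
  # A minimal sketch of the contrast, using the log file loaded later in this article:

-- Replace whatever is already in the partition
LOAD DATA LOCAL INPATH '/home/hadoop/hive-0.13.1/a.com.access.20140821.log'
OVERWRITE INTO TABLE nginx_accesslog PARTITION (YEAR='2014', MONTH='08', DAY='21');

-- Append: existing rows in the partition are kept, the new file is added alongside them
LOAD DATA LOCAL INPATH '/home/hadoop/hive-0.13.1/a.com.access.20140821.log'
INTO TABLE nginx_accesslog PARTITION (YEAR='2014', MONTH='08', DAY='21');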

      Method 1: Load a file from a local directory into Hive

hive> LOAD DATA LOCAL INPATH '/home/hadoop/hive-0.13.1/a.com.access.20140821.log' OVERWRITE INTO TABLE nginx_accesslog partition (YEAR='2014', MONTH='08',DAY='21');

Copying data from file:/home/hadoop/hive-0.13.1/a.com.access.20140821.log

Copying file: file:/home/hadoop/hive-0.13.1/a.com.access.20140821.log

Loading data to table default.nginx_accesslog partition (year=2014, month=08, day=21)

Partition default.nginx_accesslog{year=2014, month=08, day=21} stats: [numFiles=1, numRows=0, totalSize=3483, rawDataSize=0]

OK

Time taken: 1.046 seconds

hive> select count(0) from nginx_accesslog;

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1408550631561_0005, Tracking URL = http://m1:8088/proxy/application_1408550631561_0005/

Kill Command = /home/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1408550631561_0005

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2014-08-22 23:19:55,322 Stage-1 map = 0%,  reduce = 0%

2014-08-22 23:20:01,669 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.74 sec

2014-08-22 23:20:08,926 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.59 sec

MapReduce Total cumulative CPU time: 1 seconds 590 msec

Ended Job = job_1408550631561_0005

MapReduce Jobs Launched:

Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.59 sec   HDFS Read: 3734 HDFS Write: 3 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 590 msec

OK

13

Time taken: 24.762 seconds, Fetched: 1 row(s)

hive>

      Method 2: Load data from HDFS. We first drop the table, recreate it, and then try loading the data from HDFS.

  # First upload the file to HDFS using the Hadoop command line

root@m1:/home/hadoop# /home/hadoop/hadoop-2.2.0/bin/hadoop fs -copyFromLocal /home/hadoop/hive-0.13.1/a.com.access.20140821.log /user/hive/warehouse/nginxlog/
root@m1:/home/hadoop# /home/hadoop/hadoop-2.2.0/bin/hadoop fs -ls /user/hive/warehouse/nginxlog
Found 1 items
-rw-r--r--   3 root supergroup       3483 2014-08-22 23:18 /user/hive/warehouse/nginxlog/a.com.access.20140821.log
root@m1:/home/hadoop#

  # Run the Hive command to load the data from HDFS, then use a SELECT statement to confirm that the Nginx log data has been imported into the Hive table

#Drop the table first

hive> drop table nginx_accesslog;

OK

Time taken: 0.363 seconds

hive>

#Recreate the table

....omitted here; refer to the CREATE TABLE statement above and run it again

#Load the data from HDFS (if a conflicting file already exists, delete it first); the output below shows the data was loaded successfully

hive> LOAD DATA inpath '/user/hive/warehouse/nginxlog/a.com.access.20140821.log' overwrite INTO TABLE nginx_accesslog partition (YEAR='2014', MONTH='08',DAY='21');

Loading data to table default.nginx_accesslog partition (year=2014, month=08, day=21)

Partition default.nginx_accesslog{year=2014, month=08, day=21} stats: [numFiles=1, numRows=0, totalSize=3483, rawDataSize=0]

OK

Time taken: 0.373 seconds

hive> select * from nginx_accesslog limit 100;

OK

8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     /       HTTP/1.1        200     2373    -       Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       2.78    0.004   2683    369     unix:/var/run/php5-fpm.sock     200     0.004   2014    08      21

8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/css/base.css   HTTP/1.1        200     940     http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   1247    373     -       -       -       2014    08      21

8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/a3e2e507/jquery.min.js  HTTP/1.1        200     93636   http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.152   93976   359     -       -       -       2014    08      21

8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/logo.png HTTP/1.1        200     6059    http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   6369    377     -       -       -       2014    08      21

8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/p02.jpg  HTTP/1.1        200     22177   http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   22489   376     -       -       -       2014    08      21

8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/p03.png  HTTP/1.1        200     3012    http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   3321    376     -       -       -       2014    08      21

8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/two-dimension-code1.png  HTTP/1.1        200     761     http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   1069    392     -       -       -       2014    08      21

8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/bg.png   HTTP/1.1        200     11474   http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   11785   375     -       -       -       2014    08      21

8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/p04.png  HTTP/1.1        200     2860    http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   3169    376     -       -       -       2014    08      21

8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/p06.png  HTTP/1.1        200     74097   http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.062   74409   376     -       -       -       2014    08      21

8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/p05.png  HTTP/1.1        200     132072  http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.256   132385  376     -       -       -       2014    08      21

8.8.8.8 -       22/Aug/2014:20:23:46 +0800      GET     8.8.8.8/b519d8ca/image/p07.png  HTTP/1.1        200     207987  http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.592   208300  376     -       -       -       2014    08      21

8.8.8.8 -       22/Aug/2014:20:23:46 +0800      GET     8.8.8.8/b519d8ca/image/p01.png  HTTP/1.1        200     310418  http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.918   310731  376     -       -       -       2014    08      21

Time taken: 0.056 seconds, Fetched: 13 row(s)

hive>

      Method 3: Monitor the logs with a tool such as Flume or Kafka and import them automatically

   We will not go into detail here. Refer to the examples in these two articles and adapt the code to your own needs: 《Flume1.5.0的安装、部署、简单应用(含分布式、与hadoop2.2.0、hbase0.96的案例)》 and 《Flume+Kafka+Strom基于分布式环境的结合使用》. A rough sketch of the idea follows.
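
  # For illustration only, a rough Flume 1.5 agent sketch: an exec source tails the live Nginx access log and an HDFS sink writes plain-text files under the table location. The agent name, log path, and NameNode address are assumptions, and each dated directory still has to be attached to the table (ALTER TABLE ... ADD PARTITION ... LOCATION ...) before Hive can see it:

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# tail the live Nginx access log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/nginx/access.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# write plain text files into dated sub-directories under the external table's location
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://m1:9000/user/hive/warehouse/nginxlog/%Y/%m/%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 300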

    4) Create a table in Hive that HBase can recognise

  # Create a table in Hive that HBase can recognise; it effectively serves as an intermediate table. When the earlier test data is inserted into it, the data is automatically synced to HBase.

CREATE TABLE h2b_nginx_accesslog(
key int,
host string,
hostuser string,
times string,
requestmethond string,
requesturl string,
requesthttp string,
status string,
body_bytes_sent string,
referer string,
useragent string,
http_x_forwarded_for string,
gzip_ratio string,
request_time string,
bytes_sent string,
request_length string,
upstream_addr string,
upstream_status string,
upstream_response_time string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,log:host,log:hostuser,log:times,log:requestmethond,log:requesturl,log:requesthttp,log:status,log:body_bytes_sent,log:referer,log:useragent,log:http_x_forwarded_for,log:gzip_ratio,log:request_time,log:bytes_sent,log:request_length,log:upstream_addr,log:upstream_status,log:upstream_response_time") TBLPROPERTIES ("hbase.table.name" = "h2b_nginx_accesslog");

  # At this point you can see in HBase that the h2b_nginx_accesslog table has been created automatically;

hbase(main):002:0> list

TABLE

h2b_nginx_accesslog                                                                                                                                                                                                         

1 row(s) in 0.1220 seconds

=> ["h2b_nginx_accesslog"]

/* When you describe the table, only column families are shown, not columns. Every column in an HBase table belongs to a column family, and column families are part of the table schema (individual columns are not). */

hbase(main):003:0> describe "h2b_nginx_accesslog"

DESCRIPTION                                                                                                                                ENABLED                                                                    

 'h2b_nginx_accesslog', {NAME => 'log', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', CO true
 MPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false',
  BLOCKCACHE => 'true'}

1 row(s) in 0.5890 seconds

hbase(main):004:0>

    5) Load the Hive log table into that HBase table, using the auto-increment ID as the rowkey

  # Load the Hive log table (nginx_accesslog) into the table (h2b_nginx_accesslog); the data is automatically synced to HBase.

insert overwrite table h2b_nginx_accesslog
select a.*
from (select
rowSequence(),
host,
hostuser,
times,
requestmethond,
requesturl,
requesthttp,
status,
body_bytes_sent,
referer,
useragent,
http_x_forwarded_for,
gzip_ratio,
request_time,
bytes_sent,
request_length,
upstream_addr,
upstream_status,
upstream_response_time
from nginx_accesslog) a;

  # The output from loading the intermediate table. If you see a log similar to the one below, congratulations, it worked.

hive> insert overwrite table h2b_nginx_accesslog

    > select a.*

    > from (select

    > rowSequence(),

    > host,

    > hostuser,

    > times,

    > requestmethond,

    > requesturl,

    > requesthttp,

    > status,

    > body_bytes_sent,

    > referer,

    > useragent,

    > http_x_forwarded_for,

    > gzip_ratio,

    > request_time,

    > bytes_sent,

    > request_length,

    > upstream_addr,

    > upstream_status,

    > upstream_response_time

    > from nginx_accesslog) a;

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1408550631561_0017, Tracking URL = http://m1:8088/proxy/application_1408550631561_0017/

Kill Command = /home/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1408550631561_0017

Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0

2014-08-24 11:57:24,051 Stage-0 map = 0%,  reduce = 0%

2014-08-24 11:57:32,403 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 1.96 sec

MapReduce Total cumulative CPU time: 1 seconds 960 msec

Ended Job = job_1408550631561_0017

MapReduce Jobs Launched:

Job 0: Map: 1   Cumulative CPU: 1.96 sec   HDFS Read: 3734 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 960 msec

OK

Time taken: 20.378 seconds

hive>

    6) Verify in HBase: use get, scan, and filter to retrieve specific data

  # Log back into HBase to check whether the data has been imported

hbase(main):013:0> get "h2b_nginx_accesslog",1

COLUMN                                                 CELL                                                                                                                                                           

 log:body_bytes_sent                                   timestamp=1408852652522, value=2373                                                                                                                            

 log:bytes_sent                                        timestamp=1408852652522, value=2683                                                                                                                            

 log:gzip_ratio                                        timestamp=1408852652522, value=2.78                                                                                                                            

 log:host                                              timestamp=1408852652522, value=8.8.8.8                                                                                                                         

 log:hostuser                                          timestamp=1408852652522, value=-                                                                                                                               

 log:http_x_forwarded_for                              timestamp=1408852652522, value=-                                                                                                                               

 log:referer                                           timestamp=1408852652522, value=-                                                                                                                               

 log:request_length                                    timestamp=1408852652522, value=369                                                                                                                             

 log:request_time                                      timestamp=1408852652522, value=0.004                                                                                                                           

 log:requesthttp                                       timestamp=1408852652522, value=HTTP/1.1                                                                                                                        

 log:requestmethond                                    timestamp=1408852652522, value=GET                                                                                                                             

 log:requesturl                                        timestamp=1408852652522, value=/                                                                                                                               

 log:status                                            timestamp=1408852652522, value=200                                                                                                                             

 log:times                                             timestamp=1408852652522, value=22/Aug/2014:20:23:45 +0800                                                                                                      

 log:upstream_addr                                     timestamp=1408852652522, value=unix:/var/run/php5-fpm.sock                                                                                                     

 log:upstream_response_time                            timestamp=1408852652522, value=0.004                                                                                                                           

 log:upstream_status                                   timestamp=1408852652522, value=200                                                                                                                             

 log:useragent                                         timestamp=1408852652522, value=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36          

18 row(s) in 0.0440 seconds

hbase(main):015:0> get "h2b_nginx_accesslog",1,{COLUMN => 'log:useragent'}

COLUMN                                                 CELL                                                                                                                                                           

 log:useragent                                         timestamp=1408852652522, value=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36          

1 row(s) in 0.0080 seconds

  # Use a filter to find the record whose requesturl column contains p04.png; the relevant filter classes must be imported first

hbase(main):031:0> import org.apache.hadoop.hbase.filter.CompareFilter

hbase(main):032:0> import org.apache.hadoop.hbase.filter.SingleColumnValueFilter

hbase(main):033:0> import org.apache.hadoop.hbase.filter.SubstringComparator

hbase(main):034:0> import org.apache.hadoop.hbase.util.Bytes

hbase(main):035:0> scan "h2b_nginx_accesslog",{FILTER => SingleColumnValueFilter.new(Bytes.toBytes('log'),Bytes.toBytes('requesturl'),CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new('p04.png'))}

ROW                                                    COLUMN+CELL                                                                                                                                                    

 9                                                     column=log:body_bytes_sent, timestamp=1408852652522, value=2860                                                                                                

 9                                                     column=log:bytes_sent, timestamp=1408852652522, value=3169                                                                                                     

 9                                                     column=log:gzip_ratio, timestamp=1408852652522, value=-                                                                                                        

 9                                                     column=log:host, timestamp=1408852652522, value=8.8.8.8                                                                                                        

 9                                                     column=log:hostuser, timestamp=1408852652522, value=-                                                                                                          

 9                                                     column=log:http_x_forwarded_for, timestamp=1408852652522, value=-                                                                                              

 9                                                     column=log:referer, timestamp=1408852652522, value=http://xxx.com/                                                                                             

 9                                                     column=log:request_length, timestamp=1408852652522, value=376                                                                                                  

 9                                                     column=log:request_time, timestamp=1408852652522, value=0.000                                                                                                  

 9                                                     column=log:requesthttp, timestamp=1408852652522, value=HTTP/1.1                                                                                                

 9                                                     column=log:requestmethond, timestamp=1408852652522, value=GET                                                                                                  

 9                                                     column=log:requesturl, timestamp=1408852652522, value=8.8.8.8/b519d8ca/image/p04.png                                                                           

 9                                                     column=log:status, timestamp=1408852652522, value=200                                                                                                          

 9                                                     column=log:times, timestamp=1408852652522, value=22/Aug/2014:20:23:45 +0800                                                                                    

 9                                                     column=log:upstream_addr, timestamp=1408852652522, value=-                                                                                                     

 9                                                     column=log:upstream_response_time, timestamp=1408852652522, value=-                                                                                            

 9                                                     column=log:upstream_status, timestamp=1408852652522, value=-                                                                                                   

 9                                                     column=log:useragent, timestamp=1408852652522, value=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Sa

                                                       fari/537.36                                                                                                                                                    

1 row(s) in 0.0320 seconds

hbase(main):036:0>

  IV. FAQ

  1) If the imported data shows up as NULL when you SELECT it, your regular expression is probably wrong. You can validate it with the tool below.

  # http://regex101.com is a handy online regular-expression tester.

  # Once the regex passes the online test, copy it into a text file and apply two escape substitutions before using it: first double every backslash ('\' becomes '\\'); second escape every double quote ('"' becomes '\"').
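
  # For example, taking the first few capture groups of the pattern used above, the form tested on regex101 versus the form pasted into "input.regex":

# as tested online:
([^ ]*)\s+-\s+(.+?|-)\s+\[(.*)\]\s+"([^ ]*)

# after the two substitutions, ready for the CREATE TABLE statement:
([^ ]*)\\s+-\\s+(.+?|-)\\s+\\[(.*)\\]\\s+\"([^ ]*)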

  2) If you get an error mentioning "org.apache.hadoop.hive.contrib.serde2.RegexSerDe", run the following command in Hive:

hive> add jar /home/hjl/hive/lib/hive_contrib.jar;

 

  V. Further reading

  LanguageManual DDL

  HBase shell 简单使用

  Add Built-In Hive UDFs on HDInsight Azure

  overwrite在hive内部表及外部表特性

  Hive几种数据导入方式

  hive的内部表与外部表创建

  Hive基本命令整理

  Download link for the Java code mentioned in this article: http://pan.baidu.com/s/1c0hD4n2

----------------------------------------

Author: 迦壹

Original post: Nginx日志导入到Hive,同步Hbase,设置RowKey为autoincrement(ID自增长)

Reproduction notice: This article may be reproduced, but the original source, the author information, and this copyright notice must be indicated with a hyperlink. Thank you!

---------------------------------------
