Overview
To survey the artifacts in the Hadoop ecosystem, I took a close look at the tools for importing and exporting data between an RDBMS and HDFS, and investigated several similar products along the way. The conclusion: they are all secondary development on top of sqoop, or web-UI wrappers around it; underneath, it is still sqoop doing the work. Pentaho's PDI and Oracle's ODI, for example, are both built on it. Beyond those, the Hortonworks Sandbox, the Hue web UI, and Cloudera Manager go even further: they integrate just about every artifact in the Hadoop ecosystem, deployment is not especially complicated, and they are quite powerful.
About Sqoop
Apache sqoop currently ships in two product lines, sqoop1 and sqoop2. Comparing the two: sqoop1 is relatively mature and has few bugs, but its architecture is monolithic; the current stable release is 1.4.6. sqoop2 builds on sqoop1 with substantial improvements: the client and server are separated, and jobs and connections are managed in an integrated way. In terms of day-to-day use it is much simpler than sqoop1, but deployment is more complex, and sqoop2 is not compatible with sqoop1, so existing application scripts mostly have to be rewritten. Judging by the overall trend, though, sqoop2 is set to become the mainstream.
# Note: since 1.99.2, sqoop2 has been unable to import data into HBase; this is expected to be fixed in the sqoop 2.0.0 stable release.
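In the meantime, an HBase import can still be done through the sqoop1 command-line client. A minimal sketch, reusing the MySQL database and table from the test below (the connection string, credentials, and the cf column family are illustrative):

sqoop import --connect jdbc:mysql://your-mysql-ip:3306/sqoop \
    --username sqoop --password 123456 --table t1 \
    --hbase-table t1 --column-family cf --hbase-create-table

--hbase-table, --column-family, and --hbase-create-table are standard sqoop1 options; --hbase-create-table asks sqoop to create the target HBase table if it does not already exist.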
Environment setup
The environment setup follows the instructions on the official site; here I highlight the points that need attention.
1. Changes needed in server/conf/sqoop.properties
org.apache.sqoop.repository.jdbc.url=jdbc:derby:@BASEDIR@/repository/sqoop;create=true
The sqoop here is a database created beforehand on the MySQL side, with privileges granted:
create database sqoop;
create user sqoop identified by '123456';
grant all privileges on sqoop.* to sqoop;
flush privileges;
2. Also in sqoop.properties, point sqoop at the Hadoop installation:
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=your-hadoop-cluster-location
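This property should point at the directory containing the Hadoop configuration files. As an example (assuming the Hadoop install at /home/project/hadoop-2.5.2 that appears in the startup logs below), it would presumably read:

org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/home/project/hadoop-2.5.2/etc/hadoop/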
3. In server/conf/catalina.properties, append all the lib jars under hadoop/share:
common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,your-hadoop-libs
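As an illustration of your-hadoop-libs (assuming the standard Hadoop 2.x layout under /home/project/hadoop-2.5.2 seen in the logs below), the appended entries would look roughly like:

/home/project/hadoop-2.5.2/share/hadoop/common/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/common/lib/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/hdfs/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/hdfs/lib/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/mapreduce/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/mapreduce/lib/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/yarn/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/yarn/lib/*.jar

(all appended to the single common.loader line, comma-separated, with no line breaks).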
4. [Important] Modify Hadoop's yarn-site.xml, appending the following:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
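YARN must be restarted for the new aux-service to take effect; with the standard Hadoop 2.x sbin scripts:

stop-yarn.sh
start-yarn.sh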
Testing
Start the Hadoop and sqoop environments.
1. Start Hadoop with the start-all.sh script.
2. Start sqoop
1. Once the sqoop server has been started (here in the foreground via server run), output like the following appears:
[root@sv004 sqoop-1.99.3-bin-hadoop200]# ./bin/sqoop.sh server run
Sqoop home directory: /home/project/sqoop-1.99.3-bin-hadoop200
Setting SQOOP_HTTP_PORT:     12000
Setting SQOOP_ADMIN_PORT:    12001
Using   CATALINA_OPTS:
Adding to CATALINA_OPTS:    -Dsqoop.http.port=12000 -Dsqoop.admin.port=12001
Using CATALINA_BASE:   /home/project/sqoop-1.99.3-bin-hadoop200/server
Using CATALINA_HOME:   /home/project/sqoop-1.99.3-bin-hadoop200/server
Using CATALINA_TMPDIR: /home/project/sqoop-1.99.3-bin-hadoop200/server/temp
Using JRE_HOME:        /usr/java/jdk1.7.0_67
Using CLASSPATH:       /home/project/sqoop-1.99.3-bin-hadoop200/server/bin/bootstrap.jar
May 11, 2016 6:56:00 PM org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
May 11, 2016 6:56:00 PM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-12000
May 11, 2016 6:56:00 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 634 ms
May 11, 2016 6:56:00 PM org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
May 11, 2016 6:56:00 PM org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.36
May 11, 2016 6:56:00 PM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive sqoop.war
2016-05-11 18:56:00,972 INFO [main] core.SqoopServer (SqoopServer.java:initialize(47)) - Booting up Sqoop server
2016-05-11 18:56:00,979 INFO [main] core.PropertiesConfigurationProvider (PropertiesConfigurationProvider.java:initialize(96)) - Starting config file poller thread
log4j: Parsing for [root] with value=[WARN, file].
log4j: Level token is [WARN].
log4j: Category root set to WARN
log4j: Parsing appender named "file".
log4j: Parsing layout options for "file".
log4j: Setting property [conversionPattern] to [%d{ISO8601} %-5p %c{2} [%l] %m%n].
log4j: End of parsing for "file".
log4j: Setting property [file] to [@LOGDIR@/sqoop.log].
log4j: Setting property [maxBackupIndex] to [5].
log4j: Setting property [maxFileSize] to [25MB].
log4j: setFile called: @LOGDIR@/sqoop.log, true
log4j: setFile ended
log4j: Parsed "file" options.
log4j: Parsing for [org.apache.sqoop] with value=[DEBUG].
log4j: Level token is [DEBUG].
log4j: Category org.apache.sqoop set to DEBUG
log4j: Handling log4j.additivity.org.apache.sqoop=[null]
log4j: Parsing for [org.apache.derby] with value=[INFO].
log4j: Level token is [INFO].
log4j: Category org.apache.derby set to INFO
log4j: Handling log4j.additivity.org.apache.derby=[null]
log4j: Finished configuring.
log4j: Could not find root logger information. Is this OK?
log4j: Parsing for [default] with value=[INFO,defaultAppender].
log4j: Level token is [INFO].
log4j: Category default set to INFO
log4j: Parsing appender named "defaultAppender".
log4j: Parsing layout options for "defaultAppender".
log4j: Setting property [conversionPattern] to [%d %-5p %c: %m%n].
log4j: End of parsing for "defaultAppender".
log4j: Setting property [file] to [@LOGDIR@/default.audit].
log4j: setFile called: @LOGDIR@/default.audit, true
log4j: setFile ended
log4j: Parsed "defaultAppender" options.
log4j: Handling log4j.additivity.default=[null]
log4j: Finished configuring.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/project/sqoop-1.99.3-bin-hadoop200/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/project/sqoop-1.99.3-bin-hadoop200/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/project/hadoop-2.5.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
May 11, 2016 6:56:03 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory ROOT
May 11, 2016 6:56:03 PM org.apache.coyote.http11.Http11Protocol start
INFO: Starting Coyote HTTP/1.1 on http-12000
May 11, 2016 6:56:03 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 3605 ms
Running jps should show a Bootstrap process, which confirms the sqoop server started successfully.
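For example (the PID is illustrative; on a single-node setup the Hadoop daemons will be listed alongside it):

[root@sv004 sqoop-1.99.3-bin-hadoop200]# jps
3072 Bootstrap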
2. Start the sqoop client
Command: sqoop.sh client
[root@sv004 sqoop-1.99.3-bin-hadoop200]# ./bin/sqoop.sh client
Sqoop home directory: /home/project/sqoop-1.99.3-bin-hadoop200
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000>
3. Test preparation
Confirm the version information:
sqoop:000> show version -all
client version:
  Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b
  Compiled by mengweid on Fri Oct 18 14:15:53 EDT 2013
server version:
  Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b
  Compiled by mengweid on Fri Oct 18 14:15:53 EDT 2013
Protocol version:
  [1]
Point the client at the server (the sqoop webapp):
set server --host localhost --port 12000 --webapp sqoop
sqoop:000> set server --host localhost --port 12000 --webapp sqoop
Server is set successfully
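To verify the setting, the shell can echo the server configuration back (the output shown here is approximate):

sqoop:000> show server --all
Server host: localhost
Server port: 12000
Server webapp: sqoop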
Create a connection; a successful run looks like this:
sqoop:000> create connection --cid 2
Creating connection for connector with id 2
Exception has occurred during processing command
Exception: org.apache.sqoop.common.SqoopException Message: CLIENT_0001:Server has returned exception
sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: test-mysql2hdfs

Connection configuration

JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://your-mysql-ip:3306/sqoop
Username: sqoop
Password: ******
JDBC Connection Properties:
There are currently 0 values in the map:
entry#

Security related configuration options

Max connections: 10
New connection was successfully created with validation status FINE and persistent id 6
# Note: the JDBC connection string, username, and password above must match what was prepared on the MySQL side beforehand, and the sqoop database must be the same one referenced in sqoop.properties.
The connection created here has id=6.
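As a sanity check, the shell can list the connections it knows about, either all of them or just this one:

sqoop:000> show connection --all
sqoop:000> show connection --xid 6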
Using the connection just created, create a job [MySQL --> HDFS] as follows:
sqoop:000> create job --xid 6 --type import
Creating job for connection with id 6
Please fill following values to create new job object
Name: importmysql2hdfs

Database configuration

Schema name: sqoop
Table name: t1
Table SQL statement:
Table column names:
Partition column name: id
Nulls in partition column:
Boundary query:

Output configuration

Storage type:
  0 : HDFS
Choose: 0
Output format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
Choose: 0
Compression format:
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
Choose: 0
Output directory: /sqoopuse

Throttling resources

Extractors:
Loaders:
New job was successfully created with validation status FINE and persistent id 4
Note: the table definition and its data were also prepared on the MySQL side beforehand.
mysql> select * from t1;
+------+---------+----------+
| id   | int_col | char_col |
+------+---------+----------+
|    2 |       2 | b        |
|    4 |       4 | d        |
|    1 |       1 | a        |
|    3 |       3 | c        |
+------+---------+----------+
4 rows in set (0.00 sec)
The data imported into HDFS is stored under the output directory /sqoopuse, and the job's id=4.
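Likewise, show job lists the jobs defined so far, which is handy for confirming the new job's id:

sqoop:000> show job --all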
Create a job [HDFS --> MySQL]:
sqoop:000> create job --xid 4 --type export
Creating job for connection with id 4
Please fill following values to create new job object
Name: hdfs2mysqlInfo

Database configuration

Schema name: sqoop
Table name: t1
Table SQL statement:
Table column names:
Stage table name:
Clear stage table:

Input configuration

Input directory: /sqoopuse

Throttling resources

Extractors:
Loaders:
New job was successfully created with validation status FINE and persistent id 11
4. Running the tests
1. Start the job [MySQL --> HDFS]
sqoop:000> start job --jid 4
Submission details
Job ID: 4
Server URL: http://localhost:12000/sqoop/
Created by: root
Creation date: 2016-05-11 19:19:53 JST
Lastly updated by: root
External ID: job_1462962692840_0001
        http://sv004:8088/proxy/application_1462962692840_0001/
2016-05-11 19:19:53 JST: BOOTING  - Progress is not available
2. Wait about 30 seconds, then check the job status
sqoop:000> status job --jid 4
Submission details
Job ID: 4
Server URL: http://localhost:12000/sqoop/
Created by: root
Creation date: 2016-05-11 19:37:16 JST
Lastly updated by: root
External ID: job_1462962692840_0001
        http://sv004:8088/proxy/application_1462962692840_0001/
2016-05-11 19:37:57 JST: SUCCEEDED
Counters:
        org.apache.hadoop.mapreduce.JobCounter
                SLOTS_MILLIS_MAPS: 38212
                MB_MILLIS_MAPS: 39129088
                TOTAL_LAUNCHED_MAPS: 3
                MILLIS_MAPS: 38212
                VCORES_MILLIS_MAPS: 38212
                SLOTS_MILLIS_REDUCES: 0
                OTHER_LOCAL_MAPS: 3
        org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
                BYTES_READ: 0
        org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
                BYTES_WRITTEN: 32
        org.apache.hadoop.mapreduce.TaskCounter
                MAP_INPUT_RECORDS: 0
                MERGED_MAP_OUTPUTS: 0
                PHYSICAL_MEMORY_BYTES: 497262592
                SPILLED_RECORDS: 0
                FAILED_SHUFFLE: 0
                CPU_MILLISECONDS: 3520
                COMMITTED_HEAP_BYTES: 603979776
                VIRTUAL_MEMORY_BYTES: 2741444608
                MAP_OUTPUT_RECORDS: 4
                SPLIT_RAW_BYTES: 346
                GC_TIME_MILLIS: 96
        org.apache.hadoop.mapreduce.FileSystemCounter
                FILE_READ_OPS: 0
                FILE_WRITE_OPS: 0
                FILE_BYTES_READ: 0
                FILE_LARGE_READ_OPS: 0
                HDFS_BYTES_READ: 346
                FILE_BYTES_WRITTEN: 318117
                HDFS_LARGE_READ_OPS: 0
                HDFS_BYTES_WRITTEN: 32
                HDFS_READ_OPS: 12
                HDFS_WRITE_OPS: 6
        org.apache.sqoop.submission.counter.SqoopCounters
                ROWS_READ: 4
Job executed successfully
3. Inspect the output files stored in HDFS
[root@sv004 bin]# ./hadoop fs -ls /sqoopuse
16/05/11 19:43:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r--   3 root supergroup          0 2016-05-11 19:37 /sqoopuse/_SUCCESS
-rw-r--r--   3 root supergroup          8 2016-05-11 19:37 /sqoopuse/part-m-00000
-rw-r--r--   3 root supergroup          8 2016-05-11 19:37 /sqoopuse/part-m-00001
-rw-r--r--   3 root supergroup         16 2016-05-11 19:37 /sqoopuse/part-m-00002
4. Verify the imported data
[root@sv004 bin]# ./hadoop fs -cat /sqoopuse/part*
16/05/11 19:43:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1,1,'a'
2,2,'b'
4,4,'d'
3,3,'c'
This matches the data in MySQL, which shows that no data was lost during the import.
Test the job [HDFS --> MySQL]
1. Empty the data on the MySQL side
mysql> select * from t1;
+------+---------+----------+
| id   | int_col | char_col |
+------+---------+----------+
|    2 |       2 | b        |
|    4 |       4 | d        |
|    1 |       1 | a        |
|    3 |       3 | c        |
+------+---------+----------+
4 rows in set (0.00 sec)

mysql> delete from t1;
Query OK, 4 rows affected (0.27 sec)

mysql> select * from t1;
Empty set (0.00 sec)
2. Start the job [HDFS --> MySQL]
sqoop:000> start job --jid 11
Submission details
Job ID: 11
Server URL: http://localhost:12000/sqoop/
Created by: root
Creation date: 2016-05-11 19:50:42 JST
Lastly updated by: root
External ID: job_1462962692840_0002
        http://sv004:8088/proxy/application_1462962692840_0002/
2016-05-11 19:50:42 JST: BOOTING  - Progress is not available
3. Check the job's run status
sqoop:000> status job --jid 11
Submission details
Job ID: 11
Server URL: http://localhost:12000/sqoop/
Created by: root
Creation date: 2016-05-11 19:50:42 JST
Lastly updated by: root
External ID: job_1462962692840_0002
        http://sv004:8088/proxy/application_1462962692840_0002/
2016-05-11 19:51:39 JST: SUCCEEDED
Counters:
        org.apache.hadoop.mapreduce.JobCounter
                SLOTS_MILLIS_MAPS: 204363
                MB_MILLIS_MAPS: 209267712
                TOTAL_LAUNCHED_MAPS: 8
                MILLIS_MAPS: 204363
                VCORES_MILLIS_MAPS: 204363
                SLOTS_MILLIS_REDUCES: 0
                OTHER_LOCAL_MAPS: 8
        org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
                BYTES_WRITTEN: 0
        org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
                BYTES_READ: 0
        org.apache.hadoop.mapreduce.TaskCounter
                MAP_INPUT_RECORDS: 0
                MERGED_MAP_OUTPUTS: 0
                PHYSICAL_MEMORY_BYTES: 1327665152
                SPILLED_RECORDS: 0
                COMMITTED_HEAP_BYTES: 1610612736
                CPU_MILLISECONDS: 7590
                FAILED_SHUFFLE: 0
                VIRTUAL_MEMORY_BYTES: 7262990336
                SPLIT_RAW_BYTES: 1224
                MAP_OUTPUT_RECORDS: 4
                GC_TIME_MILLIS: 316
        org.apache.hadoop.mapreduce.FileSystemCounter
                FILE_WRITE_OPS: 0
                FILE_READ_OPS: 0
                FILE_LARGE_READ_OPS: 0
                FILE_BYTES_READ: 0
                HDFS_BYTES_READ: 1320
                FILE_BYTES_WRITTEN: 839664
                HDFS_LARGE_READ_OPS: 0
                HDFS_WRITE_OPS: 0
                HDFS_READ_OPS: 32
                HDFS_BYTES_WRITTEN: 0
        org.apache.sqoop.submission.counter.SqoopCounters
                ROWS_READ: 4
Job executed successfully
4. In the MySQL client, confirm the export succeeded and that no data was lost
mysql> select * from t1;          <-------- confirm with select
+------+---------+----------+
| id   | int_col | char_col |
+------+---------+----------+
|    1 |       1 | a        |
|    2 |       2 | b        |
|    4 |       4 | d        |
|    3 |       3 | c        |
+------+---------+----------+
4 rows in set (0.00 sec)
The export succeeded, with no data loss.
---over----