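Understanding --query with $CONDITIONS in Sqoop imports

When importing data with Sqoop, you can pass a SQL statement through --query to control which rows are selected. The statement must also contain $CONDITIONS, which is what allows Sqoop to run the MapReduce import in parallel.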

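Test runs

All tests use Sqoop 1. The MySQL test data was prepared as follows.

(1) Whenever --query is used with a SQL statement, $CONDITIONS is required, even when there is only one map task.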

# Only one map task
[[email protected] /kkb/bin]$ sqoop import --connect jdbc:mysql://node01:3306/sqooptest --username root --password 123456 --target-dir /sqoop/conditiontest --delete-target-dir --query 'select * from Person where score>50' --m 1
Warning: /kkb/install/sqoop-1.4.6-cdh5.14.2/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /kkb/install/sqoop-1.4.6-cdh5.14.2/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /kkb/install/sqoop-1.4.6-cdh5.14.2/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
20/02/07 10:43:20 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.14.2
20/02/07 10:43:20 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/02/07 10:43:20 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/02/07 10:43:20 INFO tool.CodeGenTool: Beginning code generation
# The error below says that $CONDITIONS must be added
20/02/07 10:43:20 ERROR tool.ImportTool: Import failed: java.io.IOException: Query [select * from Person where score>50] must contain '$CONDITIONS' in WHERE clause.
    at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:332)
    at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1858)
    at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1657)
    at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:106)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:494)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:621)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
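
Because the token has to reach Sqoop literally, the way the query is quoted on the command line matters. The following is a minimal sketch, not taken from the original post, of how a POSIX shell treats the token under three quoting styles; the variable names are illustrative only, and it assumes no shell variable named CONDITIONS is set.

```shell
#!/bin/sh
# Sketch: how the shell treats the $CONDITIONS token under different quoting.
# Variable names are illustrative; none of them are Sqoop options.
unset CONDITIONS   # assumption: no shell variable of that name exists

# Single quotes: the shell passes the token through unchanged.
single_quoted='select * from Person where score>50 and $CONDITIONS'
# Double quotes: the shell expands $CONDITIONS itself (here to an empty string).
double_quoted="select * from Person where score>50 and $CONDITIONS"
# Double quotes with a backslash: the token survives.
escaped="select * from Person where score>50 and \$CONDITIONS"

printf '%s\n' "single : $single_quoted"
printf '%s\n' "double : $double_quoted"
printf '%s\n' "escaped: $escaped"
```

Only the single-quoted and escaped forms still contain the token, so only those would pass the check that raised the error above.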


(2) With a single map task, --split-by can be omitted: the task processes the whole result set, so nothing needs to be split.

[[email protected] /kkb/bin]$ sqoop import --connect jdbc:mysql://node01:3306/sqooptest --username root --password 123456 --target-dir /sqoop/conditiontest --delete-target-dir --query 'select * from Person where score>50 and $CONDITIONS' --m 1
Warning: /kkb/install/sqoop-1.4.6-cdh5.14.2/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /kkb/install/sqoop-1.4.6-cdh5.14.2/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /kkb/install/sqoop-1.4.6-cdh5.14.2/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
20/02/07 10:44:08 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.14.2
20/02/07 10:44:08 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/02/07 10:44:08 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/02/07 10:44:08 INFO tool.CodeGenTool: Beginning code generation
Fri Feb 07 10:44:08 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
20/02/07 10:44:09 INFO manager.SqlManager: Executing SQL statement: select * from Person where score>50 and  (1 = 0)
20/02/07 10:44:09 INFO manager.SqlManager: Executing SQL statement: select * from Person where score>50 and  (1 = 0)
20/02/07 10:44:09 INFO manager.SqlManager: Executing SQL statement: select * from Person where score>50 and  (1 = 0)
20/02/07 10:44:09 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /kkb/install/hadoop-2.6.0-cdh5.14.2
Note: /tmp/sqoop-hadoop/compile/ed46f1d7a2aeb419d32799b64aabdc82/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
20/02/07 10:44:13 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/ed46f1d7a2aeb419d32799b64aabdc82/QueryResult.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/kkb/install/hadoop-2.6.0-cdh5.14.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/kkb/install/hbase-1.2.0-cdh5.14.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/02/07 10:44:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/02/07 10:44:14 INFO tool.ImportTool: Destination directory /sqoop/conditiontest is not present, hence not deleting.
20/02/07 10:44:14 INFO mapreduce.ImportJobBase: Beginning query import.
20/02/07 10:44:14 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/02/07 10:44:15 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/02/07 10:44:15 INFO client.RMProxy: Connecting to ResourceManager at node01/192.168.200.100:8032
Fri Feb 07 10:44:21 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
20/02/07 10:44:21 INFO db.DBInputFormat: Using read commited transaction isolation
20/02/07 10:44:21 INFO mapreduce.JobSubmitter: number of splits:1
20/02/07 10:44:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1581042804134_0001
20/02/07 10:44:22 INFO impl.YarnClientImpl: Submitted application application_1581042804134_0001
20/02/07 10:44:22 INFO mapreduce.Job: The url to track the job: http://node01:8088/proxy/application_1581042804134_0001/
20/02/07 10:44:22 INFO mapreduce.Job: Running job: job_1581042804134_0001
20/02/07 10:44:33 INFO mapreduce.Job: Job job_1581042804134_0001 running in uber mode : true
20/02/07 10:44:33 INFO mapreduce.Job:  map 0% reduce 0%
20/02/07 10:44:36 INFO mapreduce.Job:  map 100% reduce 0%
20/02/07 10:44:36 INFO mapreduce.Job: Job job_1581042804134_0001 completed successfully
20/02/07 10:44:36 INFO mapreduce.Job: Counters: 32
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=0
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=100
        HDFS: Number of bytes written=181519
        HDFS: Number of read operations=128
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=6
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=0
        Total time spent by all reduces in occupied slots (ms)=0
        TOTAL_LAUNCHED_UBERTASKS=1
        NUM_UBER_SUBMAPS=1
        Total time spent by all map tasks (ms)=1825
        Total vcore-milliseconds taken by all map tasks=0
        Total megabyte-milliseconds taken by all map tasks=0
    Map-Reduce Framework
        Map input records=1
        Map output records=1
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=44
        CPU time spent (ms)=950
        Physical memory (bytes) snapshot=160641024
        Virtual memory (bytes) snapshot=3020165120
        Total committed heap usage (bytes)=24555520
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=21
20/02/07 10:44:36 INFO mapreduce.ImportJobBase: Transferred 177.2646 KB in 21.4714 seconds (8.2558 KB/sec)
20/02/07 10:44:36 INFO mapreduce.ImportJobBase: Retrieved 1 records.

The import completes and the result is as expected.
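
The repeated "(1 = 0)" statements in the log above appear right after code generation begins, which suggests that Sqoop first substitutes an always-false predicate for the token so it can read column metadata without fetching rows. The sketch below only emulates that visible substitution with sed; it is not Sqoop's actual code, and the variable names are hypothetical.

```shell
#!/bin/sh
# Sketch: emulate the placeholder substitution visible in the log.
# This models the observed effect only; it is not how Sqoop is implemented.
query='select * from Person where score>50 and $CONDITIONS'

# An always-false predicate returns no rows but still exposes the column list.
probe_query=$(printf '%s\n' "$query" | sed 's/\$CONDITIONS/(1 = 0)/')
printf '%s\n' "$probe_query"
```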

(3) With multiple map tasks, --split-by is required to partition the data, and $CONDITIONS is replaced with the query range of each task.

# Two map tasks specified
[[email protected] /kkb/bin]$ sqoop import --connect jdbc:mysql://node01:3306/sqooptest --username root --password 123456 --target-dir /sqoop/conditiontest --delete-target-dir --query 'select * from Person where score>1 and $CONDITIONS' --m 2 --split-by id
Warning: /kkb/install/sqoop-1.4.6-cdh5.14.2/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /kkb/install/sqoop-1.4.6-cdh5.14.2/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /kkb/install/sqoop-1.4.6-cdh5.14.2/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
20/02/07 12:19:45 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.14.2
20/02/07 12:19:45 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/02/07 12:19:45 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/02/07 12:19:45 INFO tool.CodeGenTool: Beginning code generation
Fri Feb 07 12:19:45 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
20/02/07 12:19:46 INFO manager.SqlManager: Executing SQL statement: select * from Person where score>1 and  (1 = 0)
20/02/07 12:19:46 INFO manager.SqlManager: Executing SQL statement: select * from Person where score>1 and  (1 = 0)
20/02/07 12:19:46 INFO manager.SqlManager: Executing SQL statement: select * from Person where score>1 and  (1 = 0)
20/02/07 12:19:46 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /kkb/install/hadoop-2.6.0-cdh5.14.2
Note: /tmp/sqoop-hadoop/compile/c4d8789310abaa32f4e44e03103924e6/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
20/02/07 12:19:49 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/c4d8789310abaa32f4e44e03103924e6/QueryResult.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/kkb/install/hadoop-2.6.0-cdh5.14.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/kkb/install/hbase-1.2.0-cdh5.14.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/02/07 12:19:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/02/07 12:19:50 INFO tool.ImportTool: Destination directory /sqoop/conditiontest deleted.
20/02/07 12:19:50 INFO mapreduce.ImportJobBase: Beginning query import.
20/02/07 12:19:50 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/02/07 12:19:50 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/02/07 12:19:50 INFO client.RMProxy: Connecting to ResourceManager at node01/192.168.200.100:8032
Fri Feb 07 12:19:55 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
20/02/07 12:19:55 INFO db.DBInputFormat: Using read commited transaction isolation
# Query the id range: the minimum and maximum id
20/02/07 12:19:55 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(id), MAX(id) FROM (select * from Person where score>1 and  (1 = 1) ) AS t1
# The id range is 1 to 5, divided into 2 splits
20/02/07 12:19:55 INFO db.IntegerSplitter: Split size: 2; Num splits: 2 from: 1 to: 5
20/02/07 12:19:55 INFO mapreduce.JobSubmitter: number of splits:2
20/02/07 12:19:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1581042804134_0002
20/02/07 12:19:56 INFO impl.YarnClientImpl: Submitted application application_1581042804134_0002
20/02/07 12:19:56 INFO mapreduce.Job: The url to track the job: http://node01:8088/proxy/application_1581042804134_0002/
20/02/07 12:19:56 INFO mapreduce.Job: Running job: job_1581042804134_0002
20/02/07 12:20:04 INFO mapreduce.Job: Job job_1581042804134_0002 running in uber mode : true
20/02/07 12:20:04 INFO mapreduce.Job:  map 0% reduce 0%
20/02/07 12:20:06 INFO mapreduce.Job:  map 50% reduce 0%
20/02/07 12:20:07 INFO mapreduce.Job:  map 100% reduce 0%
20/02/07 12:20:08 INFO mapreduce.Job: Job job_1581042804134_0002 completed successfully
20/02/07 12:20:09 INFO mapreduce.Job: Counters: 32
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=0
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=315
        HDFS: Number of bytes written=370770
        HDFS: Number of read operations=260
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=14
    Job Counters
        Launched map tasks=2
        Other local map tasks=2
        Total time spent by all maps in occupied slots (ms)=0
        Total time spent by all reduces in occupied slots (ms)=0
        TOTAL_LAUNCHED_UBERTASKS=2
        NUM_UBER_SUBMAPS=2
        Total time spent by all map tasks (ms)=2861
        Total vcore-milliseconds taken by all map tasks=0
        Total megabyte-milliseconds taken by all map tasks=0
    Map-Reduce Framework
        Map input records=5
        Map output records=5
        Input split bytes=189
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=69
        CPU time spent (ms)=1210
        Physical memory (bytes) snapshot=327987200
        Virtual memory (bytes) snapshot=6043004928
        Total committed heap usage (bytes)=47128576
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=110
20/02/07 12:20:09 INFO mapreduce.ImportJobBase: Transferred 362.0801 KB in 18.2879 seconds (19.7989 KB/sec)
20/02/07 12:20:09 INFO mapreduce.ImportJobBase: Retrieved 5 records.

Inspecting the imported data suggests that $CONDITIONS was replaced with 1<=id<=2 for the first map task and with 2<id<=5 for the second.
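
The split reported in the log (split size 2, 2 splits, from 1 to 5) can be reproduced with a simplified model. The sketch below is an assumption about how an integer range might be divided, not Sqoop's implementation: split_conditions is a hypothetical helper, and the exact predicate text Sqoop generates may differ. It uses half-open ranges for every task except the last, which is closed so that the maximum id is included.

```shell
#!/bin/sh
# Sketch only: a simplified model of splitting an integer column across map tasks.
# split_conditions is a hypothetical helper, not a Sqoop command.
split_conditions() {
  col=$1 min_id=$2 max_id=$3 num_maps=$4
  size=$(( (max_id - min_id) / num_maps ))
  if [ "$size" -lt 1 ]; then size=1; fi
  lo=$min_id
  i=1
  while [ "$i" -le "$num_maps" ] && [ "$lo" -le "$max_id" ]; do
    if [ "$i" -eq "$num_maps" ]; then
      # Last task: closed interval so the maximum id is included.
      echo "$col >= $lo AND $col <= $max_id"
    else
      hi=$(( lo + size ))
      # Other tasks: half-open interval so no row is read twice.
      echo "$col >= $lo AND $col < $hi"
      lo=$hi
    fi
    i=$(( i + 1 ))
  done
}

# Values taken from the log: id runs from 1 to 5 and there are 2 map tasks.
query='select * from Person where score>1 and $CONDITIONS'
split_conditions id 1 5 2 | while IFS= read -r cond; do
  printf '%s\n' "$query" | sed "s/\\\$CONDITIONS/( $cond )/"
done
```

For integer ids these two ranges select the same rows as the ranges inferred from the data: ids 1 and 2 go to the first task, ids 3 to 5 to the second.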

How it works

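When Sqoop runs a --query import with several map tasks in parallel, each map task imports part of the data. The source rows are partitioned with '--split-by <column>', and each partition is handed to a different map task. Every map task runs its own copy of the SQL statement, so the WHERE clause must contain $CONDITIONS. This token is not a Linux shell variable; it is a placeholder that Sqoop itself replaces with a different predicate for each task, based on the boundaries it computes. It therefore has to reach Sqoop unexpanded: wrap the query in single quotes, or write \$CONDITIONS inside double quotes. For example, with --split-by id, Sqoop queries the minimum and maximum id to determine the overall range, divides that range by the number of map tasks, and has each map task import the rows in its own id range. The original post illustrates this with a diagram.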
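
The step in which Sqoop finds the minimum and maximum id corresponds to the BoundingValsQuery line in the log of test (3). The sketch below rebuilds a query of that shape from the user's statement; the variable names are illustrative, the whitespace differs slightly from the logged statement, and the construction is inferred from the log rather than from Sqoop's source.

```shell
#!/bin/sh
# Sketch: rebuild a bounding-values query of the shape shown in the log.
# Variable names are hypothetical; the construction is inferred from the log.
query='select * from Person where score>1 and $CONDITIONS'
split_col=id

# An always-true predicate keeps every row the query can return.
inner=$(printf '%s\n' "$query" | sed 's/\$CONDITIONS/(1 = 1)/')
bounds_query="SELECT MIN($split_col), MAX($split_col) FROM ($inner) AS t1"
printf '%s\n' "$bounds_query"
```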

The above is my understanding of $CONDITIONS, based on the blog post referenced below. I will keep refining it.

Reference:

(1) https://www.cnblogs.com/kouryoushine/p/7814312.html

Original post: https://www.cnblogs.com/youngchaolin/p/12271211.html

Time: 2024-10-03 21:54:19
