Importing SQL Server 2012 Tables into Hive/HBase with Sqoop on the Azure Cloud Platform

My name is Farooq and I am with the HDInsight support team here at Microsoft. In this blog I will give a brief overview of Sqoop in HDInsight and then use an example of importing data from a Windows Azure SQL Database table to an HDInsight cluster to demonstrate how you can get started with Sqoop in HDInsight.

What is Sqoop?

Sqoop is an Apache project and part of the Hadoop ecosystem. It allows data transfer between a Hadoop/HDInsight cluster and relational databases such as SQL Server, Oracle, and MySQL. Sqoop is a collection of related tools, for example import, export, list-all-tables, and list-databases. To use Sqoop, you specify the tool you want to use and the arguments that control the tool. For more information on Sqoop please check the Sqoop User Guide.
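
The general invocation pattern is sqoop <tool-name> [tool-arguments]. As a minimal sketch with placeholder connection details (not a command from the walkthrough below), listing the tables in a SQL database from an HDInsight head node might look like this:

    sqoop.cmd list-tables --connect "jdbc:sqlserver://<ServerName>:1433;username=<UserName>;password=<Password>;database=<DatabaseName>"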

When do you need to use Sqoop?

You need to use Sqoop only when you are trying to import/export data between Hadoop and a relational database. HDInsight provides a full-featured Hadoop Distributed File System (HDFS) over Windows Azure Blob storage (WASB), and if you want to upload data to HDInsight or WASB from any other source, for example from your local computer's file system, you should use one of the tools discussed in this article. The same article also discusses how to import data to HDFS from SQL Database/SQL Server using Sqoop. In this blog I will elaborate on that with an example and provide more detailed information along the way.

What do I need to do for Sqoop to work in my HDInsight cluster?

HDInsight 2.1 includes Sqoop 1.4.3. The Microsoft SQL Server SQOOP Connector for Hadoop is now part of Apache Sqoop 1.4, so you do not need to install the connector separately. All HDInsight clusters also have the Microsoft SQL Server JDBC driver installed, so all the components needed to transfer data between an HDInsight cluster and SQL Server are already present in the cluster and you do not have to install anything.
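
You can confirm the bundled Sqoop version from the Hadoop Command Line with Sqoop's standard version tool (a quick sanity check rather than a required step):

    sqoop.cmd version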

How can I run a Sqoop job?

With the HDInsight preview version, we could only run Sqoop commands from the Hadoop Command Line after opening a remote desktop (RDP) session to the HDInsight cluster head node. The release version of the HDInsight SDK, however, includes PowerShell cmdlets to run Sqoop jobs remotely. So we can:

  1. Run Sqoop jobs locally from HDInsight head node using Hadoop Command Line
  2. Run Sqoop jobs remotely using HDInsight SDK PowerShell cmdlets

We recommend that you run your Sqoop commands remotely using the HDInsight SDK cmdlets. We will discuss both options in detail. First, let's see how we can run Sqoop jobs locally from the HDInsight head node using the Hadoop Command Line.

Run Sqoop jobs locally from HDInsight head node using Hadoop Command Line

I am assuming you already have a Windows Azure SQL Database. If you don't and you want to create one, please follow the steps in this article. Let's follow the steps below to create a test table in your Windows Azure SQL Database and populate it with some sample data, which we will import into our HDInsight cluster shortly. I will show how to do this from the Windows Azure portal, but you can also connect to the Windows Azure SQL Database from SSMS and do the same.

Note: if you want to transfer data from a SQL Server in your own environment instead, you need to change the Sqoop command with the appropriate connection information. It should be very similar to the connection string provided later in this blog, under the 'More sample Sqoop commands' section, for SQL Server on a Windows Azure VM.

  1. Log in to your Windows Azure portal, select 'SQL Databases' on the left, and click 'Manage' at the bottom.

  2. Provide your Windows Azure SQL Database user ID and password to log in, then click 'New Query' to open a new query window for running T-SQL queries.

  3. Copy and paste the following T-SQL query and execute it to create a test table, Table1.

    CREATE TABLE [dbo].[Table1](
        [ID] [int] NOT NULL,
        [FName] [nvarchar](50) NOT NULL,
        [LName] [nvarchar](50) NOT NULL,
        CONSTRAINT [PK_Table_4] PRIMARY KEY CLUSTERED
        (
            [ID] ASC
        )
    ) ON [PRIMARY]
    GO

  4. Run the following to populate Table1 with four rows.

    INSERT INTO [dbo].[Table1] VALUES (1,'John','Doe'), (2,'Harry','Hoe'), (3,'Carla','Coe'), (4,'Jackie','Joe');
    GO

  5. Finally, run the following T-SQL to make sure the table is populated with the sample data. You should see the four rows you just inserted, as sketched below the query.

    SELECT * FROM [dbo].[Table1]
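
    The result grid should show exactly the four rows inserted above:

      ID  FName   LName
      1   John    Doe
      2   Harry   Hoe
      3   Carla   Coe
      4   Jackie  Joe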

Now let's follow the steps below to import the rows in Table1 into the HDInsight cluster.

  1. Log in to your HDInsight cluster head node via Remote Desktop (RDP) and double-click the 'Hadoop Command Line' icon on the desktop to open the Hadoop Command Line. RDP access is turned off by default, but you can follow the steps in this blog to enable RDP and then RDP to the head node of your HDInsight cluster.
  2. In Hadoop Command Line please navigate to the "C:\apps\dist\sqoop-1.4.3.1.3.1.0-06\bin" folder.

    Note: Please verify the path for the Sqoop bin folder in your environment. It may slightly vary from version to version.
  3. Run the following Sqoop command to import all the rows of table "Table1" from the Windows Azure SQL Database "mfarooqSQLDB" to the HDInsight cluster.

    sqoop.cmd import --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table1 --target-dir /user/hdp/SqoopImportTable1

    Once the command executes successfully, you should see the MapReduce job progress, followed by a summary of the records retrieved, in the Hadoop Command Line window.

  4. There are quite a number of tools available to upload/download and view data in WASB. Let's use the Azure Storage Explorer tool. You need to install the tool on your workstation and configure it for your cluster's storage account. Once that is done, open the tool and navigate to the /user/hdp/SqoopImportTable1 folder. You should see four files, indicating that four map tasks were used. You can select a file and click the 'View' button to see the actual text data, which should match the sketch below.
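
    Assuming Sqoop's default text output format (fields terminated by commas, records by newlines), the combined contents of the four part files should be the four rows of Table1:

      1,John,Doe
      2,Harry,Hoe
      3,Carla,Coe
      4,Jackie,Joe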

Now let's export the same rows from the HDInsight cluster back to the Windows Azure SQL Database. Use a different table with the same schema as 'Table1'; otherwise you would get a primary key violation error, since the rows already exist in 'Table1'.

  1. Create an empty table 'Table2' with the same schema as 'Table1'.

    CREATE TABLE [dbo].[Table2](
        [ID] [int] NOT NULL,
        [FName] [nvarchar](50) NOT NULL,
        [LName] [nvarchar](50) NOT NULL,
        CONSTRAINT [PK_Table_2] PRIMARY KEY CLUSTERED
        (
            [ID] ASC
        )
    ) ON [PRIMARY]
    GO

  2. Run the following Sqoop command from the Hadoop Command Line.

    sqoop.cmd export --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table2 --export-dir /user/hdp/SqoopImportTable1 --input-fields-terminated-by ","
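
To confirm the export, you can run a quick T-SQL query against Table2 in the Windows Azure SQL Database; it should return the same four rows that were originally imported from Table1:

    SELECT * FROM [dbo].[Table2]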

More sample Sqoop commands:

Import from a SQL Server on a Windows Azure VM:

sqoop.cmd import --connect "jdbc:sqlserver://<WindowsAzureVMServerName>.cloudapp.net:1433;username=<SQLServerUserName>;password=<SQLServerPassword>;database=<SQLServerDatabaseName>" --table Table_1 --target-dir /user/hdp/SqoopImportTable

Export to a SQL Server on a Windows Azure VM:

sqoop.cmd export --connect "jdbc:sqlserver://<WindowsAzureVMServerName>.cloudapp.net:1433;username=<SQLServerUserName>;password=<SQLServerPassword>;database=<SQLServerDatabaseName>" --table Table_2 --export-dir /user/hdp/SqoopImportTable2 --input-fields-terminated-by ","

Importing to Hive from Windows Azure SQL Database:

sqoop.cmd import --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table1 --hive-import

Note: This will store the files under the hive/warehouse/<TableName> folder in HDFS (for example, hive/warehouse/table1/part-m-00000).
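
To verify a Hive import, you can query the new table from the head node using Hive's standard -e option for running a query from the command line. A quick check, assuming the default table name table1 (on Windows clusters the executable may be hive.cmd):

hive -e "SELECT * FROM table1;"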

Run Sqoop jobs remotely using HDInsight SDK PowerShell cmdlets

To use the HDInsight PowerShell tools you need to install the Windows Azure PowerShell tools first and then install the HDInsight PowerShell tools. Then you need to prepare your workstation to use the HDInsight SDK. Please follow the detailed steps in this earlier blog post to install the tools and prepare your workstation to use the HDInsight SDK.

Once you have installed and configured the Windows Azure PowerShell tools and the HDInsight SDK, running a Sqoop job is very easy. Please follow the steps below to import all the rows of table "Table2" from the Windows Azure SQL Database "mfarooqSQLDB" to the HDInsight cluster.

  1. Open the Windows Azure PowerShell console on the workstation and run the following cmdlets one at a time.

    Note: You can also use Windows PowerShell ISE to type the code and run it all at once. PowerShell ISE makes editing easier, and you can open the tool from "C:\Windows\System32\WindowsPowerShell\v1.0\powershell_ise.exe".

  2. Set the variables for your Windows Azure Subscription name and the HDInsight cluster name.

    $subscriptionName = "<WindowsAzureSubscriptionName>"
    $clusterName = "<HDInsightClusterName>"
    Select-AzureSubscription $subscriptionName
    Use-AzureHDInsightCluster $clusterName -Subscription $subscriptionName

  3. Define the Sqoop job that we want to run. In this exercise we will import all the rows of table "Table2" that we created earlier in the Windows Azure SQL Database.

    $sqoop = New-AzureHDInsightSqoopJobDefinition -Command "import --connect jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName> --table Table2 --target-dir /user/hdp/SqoopImportTable8"

  4. Run the Sqoop job that we just defined.

    $sqoopJob = Start-AzureHDInsightJob -Subscription $subscriptionName -Cluster $clusterName -JobDefinition $sqoop

  5. Run the following to wait for the completion or failure of the HDInsight job and show its progress.

    Wait-AzureHDInsightJob -Subscription $subscriptionName -WaitTimeoutInSeconds 3600 -Job $sqoopJob

  6. Run the following to retrieve the log output for a job from the storage account associated with a specified cluster.

    Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subscriptionName -StandardError -JobId $sqoopJob.JobId
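
    Sqoop typically writes its progress and summary to standard error, which is why -StandardError is used above. The same cmdlet also exposes a -StandardOutput switch if you want the job's standard output stream instead (a minor variation, shown here for reference):

    Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subscriptionName -StandardOutput -JobId $sqoopJob.JobId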

If the Sqoop job completes successfully, you should see the completed job details reported in your Windows Azure PowerShell window.

Troubleshooting tips

When you run a Sqoop command, it runs a MapReduce job in the Hadoop cluster (map tasks only, no reduce tasks). You can specify the number of map tasks; by default four are used. There is no separate log file specific to Sqoop, so troubleshoot Sqoop job failures or performance issues as you would any other MapReduce job failure or performance issue, starting with the task logs. I plan to write more on how to troubleshoot Sqoop issues, focusing on some specific scenarios, in the near future.
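
For example, -m (short for --num-mappers) is Sqoop's standard argument for setting the number of map tasks, and forcing a single mapper is a common first step when isolating a failure. Here is a sketch reusing the earlier placeholders, with a hypothetical new target directory (Sqoop requires that the target directory not already exist):

sqoop.cmd import --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table1 --target-dir /user/hdp/SqoopImportTable1SingleMap -m 1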

That's all for today, and I hope you found this blog useful. I look forward to your comments and suggestions.
