Build an ETL Pipeline With Kafka Connect via JDBC Connectors

This article is an in-depth tutorial for using Kafka to move data from PostgreSQL to Hadoop HDFS via JDBC connections.


Tutorial: Discover how to build a pipeline with Kafka, leveraging the DataDirect PostgreSQL JDBC driver to move data from PostgreSQL to HDFS. Let’s go streaming!

Apache Kafka is an open-source distributed streaming platform that enables you to build streaming data pipelines between different applications. You can also build real-time streaming applications that react to streams of data; Kafka focuses on providing a scalable, high-throughput, low-latency platform for interacting with data streams.

Earlier this year, Apache Kafka announced a new tool called Kafka Connect, which helps users easily move datasets in and out of Kafka using connectors, and it has support for JDBC connectors out of the box! One of the major benefits for DataDirect customers is that you can now easily build an ETL pipeline using Kafka and your DataDirect JDBC drivers: you can connect to your data sources, pull their data into Kafka, and export it from there to another data source.

(Image from https://kafka.apache.org/)

Environment Setup

Before proceeding any further with this tutorial, make sure that you have installed and properly configured the following. This tutorial assumes you are working on Ubuntu 16.04 LTS and that you have PostgreSQL, Apache Hadoop, and Hive installed.
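
As a quick sanity check before you start, you can confirm that each prerequisite is installed and on your PATH. The commands below are a minimal sketch; the exact versions reported on your machine will differ.

    # Quick prerequisite check (a minimal sketch; your versions will differ)
    java -version          # a JDK is required by Kafka, Hadoop, and Hive
    psql --version         # PostgreSQL client, to confirm the database is installed
    hadoop version         # Hadoop must be installed and configured
    hive --version         # Hive is needed for the HDFS connector's Hive integration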

  1. Installing Apache Kafka and required tools. To make the installation process easier for people trying this out for the first time, we will install the Confluent Platform. This takes care of installing Apache Kafka, the Schema Registry, and Kafka Connect, which includes connectors for moving files, JDBC connectors, and the HDFS connector for Hadoop.

    1. To begin with, install Confluent’s public key by running the command: wget -qO - http://packages.confluent.io/deb/2.0/archive.key | sudo apt-key add -
    2. Now add the repository to your sources.list by running the following command: sudo add-apt-repository "deb http://packages.confluent.io/deb/2.0 stable main"
    3. Update your package lists and then install the Confluent Platform by running the following commands:

       sudo apt-get update
       sudo apt-get install confluent-platform-2.11.7

  2. Install DataDirect PostgreSQL JDBC driver
    1. Download DataDirect PostgreSQL JDBC driver by visiting here.
    2. Install the PostgreSQL JDBC driver by running the following command: java -jar PROGRESS_DATADIRECT_JDBC_POSTGRESQL_ALL.jar
    3. Follow the instructions on the screen to install the driver successfully (you can install the driver in evaluation mode, where you can try it for 15 days, or in license mode if you have bought the driver).
  3. Configuring data sources for Kafka Connect
    1. Create a new file called postgres.properties, paste the following configuration, and save the file. To learn more about the modes that are used in the configuration below, visit this page.

       name=test-postgres-jdbc
       connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
       tasks.max=1
       connection.url=jdbc:datadirect:postgresql://<server>:<port>;User=<user>;Password=<password>;Database=<dbname>
       mode=timestamp+incrementing
       incrementing.column.name=<id>
       timestamp.column.name=<modifiedtimestamp>
       topic.prefix=test_jdbc_
       table.whitelist=actor

    2. Create another file called hdfs.properties, paste the following configuration, and save the file. To learn more about the HDFS connector and the configuration options used, visit this page.

       name=hdfs-sink
       connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
       tasks.max=1
       topics=test_jdbc_actor
       hdfs.url=hdfs://<server>:<port>
       flush.size=2
       hive.metastore.uris=thrift://<server>:<port>
       hive.integration=true
       schema.compatibility=BACKWARD

    3. Note that postgres.properties and hdfs.properties contain the connection configuration details and control the behavior of the JDBC and HDFS connectors, respectively.
    4. Create a symbolic link for DataDirect Postgres JDBC driver in Hive lib folder by using the following command: ln -s /path/to/datadirect/lib/postgresql.jar /path/to/hive/lib/postgresql.jar
    5. Also make the DataDirect Postgres JDBC driver available on Kafka Connect process’s CLASSPATH by running the following command: export CLASSPATH=/path/to/datadirect/lib/postgresql.jar
    6. Start the Hadoop cluster by running the following commands:

       cd /path/to/hadoop/sbin
       ./start-dfs.sh
       ./start-yarn.sh

  4. Configuring and running Kafka services
    1. Download the configuration files for the Kafka, ZooKeeper, and Schema Registry services.
    2. Start the ZooKeeper service by providing the zookeeper.properties file path as a parameter, using the command: zookeeper-server-start /path/to/zookeeper.properties
    3. Start the Kafka service by providing the server.properties file path as a parameter, using the command: kafka-server-start /path/to/server.properties
    4. Start the Schema Registry service by providing the schema-registry.properties file path as a parameter, using the command: schema-registry-start /path/to/schema-registry.properties (once all three services are up, you can run the quick checks shown right after this list)
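
Before moving on, it is worth confirming that the services came up cleanly. The commands below are a minimal sketch that assumes the default ports used by the Confluent Platform (Kafka on 9092, ZooKeeper on 2181, the Schema Registry on 8081); adjust them if your configuration files differ.

    # List the topics Kafka knows about; a fresh install typically shows the Schema Registry's _schemas topic
    kafka-topics --list --zookeeper localhost:2181

    # Ask the Schema Registry for its registered subjects; an empty list ([]) means it is up but nothing is registered yet
    curl http://localhost:8081/subjects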

Ingesting Data Into HDFS using Kafka Connect

To start ingesting data from PostgreSQL, the final thing you have to do is start Kafka Connect by running the following command:

connect-standalone /path/to/connect-avro-standalone.properties \
    /path/to/postgres.properties /path/to/hdfs.properties
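
The first argument, connect-avro-standalone.properties, is the Kafka Connect worker configuration that ships with the Confluent Platform. As a rough sketch (the exact contents depend on your installation, and the hosts and file paths here are assumptions), it points the worker at the Kafka broker and the Schema Registry, configures Avro converters, and tells standalone mode where to store source offsets:

    # Sketch of a stock connect-avro-standalone.properties; hosts and paths are assumptions
    bootstrap.servers=localhost:9092
    key.converter=io.confluent.connect.avro.AvroConverter
    key.converter.schema.registry.url=http://localhost:8081
    value.converter=io.confluent.connect.avro.AvroConverter
    value.converter.schema.registry.url=http://localhost:8081
    # Standalone mode stores source offsets (e.g. the last timestamp/id read from PostgreSQL) in a local file
    offset.storage.file.filename=/tmp/connect.offsets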

Running connect-standalone imports the data from PostgreSQL into Kafka using the DataDirect PostgreSQL JDBC driver and creates a topic named test_jdbc_actor. The HDFS connector then exports the data from Kafka to HDFS by reading the test_jdbc_actor topic. The data also stays in Kafka, so you can reuse it to export to any other data source.
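
To confirm that the pipeline actually delivered rows, you can look at what the HDFS connector wrote. The checks below are a minimal sketch that assumes the connector's default topics directory (/topics) and the Hive table that hive.integration=true creates with the same name as the topic; adjust the path and table name if your configuration differs.

    # List the files the HDFS connector committed for the topic (assumes the default /topics directory)
    hadoop fs -ls /topics/test_jdbc_actor

    # Query the Hive table created by the connector's Hive integration (table name assumed to match the topic)
    hive -e 'SELECT * FROM test_jdbc_actor LIMIT 5;'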

Next Steps

We hope this tutorial helped you understand how you can build a simple ETL pipeline using Kafka Connect and DataDirect PostgreSQL JDBC drivers. This tutorial is not limited to PostgreSQL. In fact, you can create ETL pipelines leveraging any of our DataDirect JDBC drivers, which we offer for relational databases like Oracle, DB2, and SQL Server, cloud sources like Salesforce and Eloqua, or big data sources like CDH Hive, Spark SQL, and Cassandra, by following similar steps. Also, subscribe to our blog via email or RSS feed for more awesome tutorials.
