My journey introducing the data build tool (dbt) into projects’ analytical stacks

Reposted from: https://www.lantrns.co/my-journey-introducing-the-data-build-tool-dbt-in-projects-analytical-stacks/

Not sure I remember how, but I had the good luck a few weeks ago to stumble upon posts from Tristan Handy where he mentioned a tool his team built, simply called the Data Build Tool (dbt). I immediately took an interest, as I was struggling with ETL (Extract, Transform, Load) at that moment, and had the extreme good fortune of having two clients who were willing to let me experiment with it, and even eventually use it for their own data warehouse.

Starting out with such a tool can sometimes be a bit confusing, as you’re not sure how and where to start. You can envision how dbt will fit in your overall analytical stack, but you might need a little bit of guidance to get things started. Although the documentation is really good, I feel it’s always a good thing to read about someone else’s experiments before starting my own.

So I decided to put together this little article that just goes over my own experience working with dbt, what I’ve picked up along the way, and how it all fits in our overall data engineering journey. I’m no power user of dbt and this may seem superficial to more experienced users, but the goal here is to help new users by sharing my own journey so far.

Getting started

As I mentioned, I have the good fortune to be working with clients who are really keen on adopting the technologies that will make their analytical stack more robust and scalable. And that was actually one problem I was facing with one project. What had been deployed so far worked perfectly for the needs that were specified, but as new requirements came in, it was obvious that our solution wasn’t scalable. We were using Python scripts to do the ETL stuff, and that was neither agile, nor DRY (Don’t Repeat Yourself), nor fun. So we wanted to improve our ETL approach, and dbt just fell into our lap right then and there.

Another project is with a startup that just started investing in its analytical infrastructure. This is pretty rare, but here I was with the opportunity to propose and build an analytical stack from scratch. Of course, I wanted dbt to be part of it.

So this is how I started looking more seriously at what dbt had to offer, how it could be used in a production environment, how we could test it and how it could handle future needs.

What’s interesting with dbt is that instead of running your own ETL, you take advantage of the power of cloud-based data warehouses to do the transformation for you.

Instead of adopting a traditional ETL approach, which requires maintaining scripts that do all three steps (extract, transform and load data into the data warehouse), we are moving to an ELT approach where data is first moved (extract + load) to the data warehouse and the transformation is done on the data warehouse directly.

That has the benefit of reducing script complexity (we are now only dealing with pure SQL), easing maintenance (all models are simple select statements that are run through a tool called dbt – the data build tool) and improving integrity (once business logic is added within a model, all models further down the chain will always reuse that same business logic).
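
To make that last point concrete, here is a minimal sketch of what that reuse looks like in dbt. The model and column names (stg_orders, orders_with_margin, and so on) are hypothetical, invented for illustration rather than taken from an actual project:

-- models/transform/orders_with_margin.sql (hypothetical): the margin logic lives in exactly one place
SELECT
    order_id,
    order_date,
    amount,
    amount - cost AS margin
FROM {{ ref('stg_orders') }}

-- models/entities/fct_monthly_margin.sql (hypothetical): every downstream model reuses that logic through ref()
SELECT
    DATE_TRUNC('month', order_date) AS order_month,
    SUM(margin) AS total_margin
FROM {{ ref('orders_with_margin') }}
GROUP BY 1

Because the downstream model selects from {{ ref('orders_with_margin') }} rather than re-deriving the margin, a change to the business logic only ever has to be made once.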

What is dbt and how does it fit

I should first mention that dbt is the work of the good folks at Fishtown Analytics. I believe that dbt was built while the co-founders were working for another company and that they kinda kept dbt with them as they built their new company (I might be completely wrong on this one). But the point is that they developed that tool, made it publicly available and are building a great community around it. My hat’s off to them – we should all aspire to bring such excellent tools to our data engineering community.

So, what is dbt?

I like to think of it as SQL on steroids.

Essentially, you build SQL models on top of source tables (coming from your production database, cloud services, website events, etc.), which are then referenced by further models, acting as building blocks towards a final model that’s part of a mart within your data warehouse.

For example, let’s say I want to build a fact table for transactions, but transactions may be coming from different sources. I would build models that refer to one another until I get to that final table. Here’s an example taken from a project I’m working on…

dbt – an example of data modeling

The graphic above might seem like a nightmare, but there is method to that madness (I think, lol). Essentially, you’re looking at models on the left that refer to tables in the staging area, a bunch of transformations in the middle, and the final entity at the far right. We’ll talk more about this below, as well as how to generate that graph.
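
Before moving on, here is a hedged sketch of what one of those middle “merge” models might look like. The Stripe source and all of the model names are assumptions made for illustration, not the project’s actual models:

-- models/transform/transaction/transaction_1_mergesources.sql (hypothetical)
-- Each source gets its own chain of models; a merge model unions them so the
-- final fct_transaction entity only ever needs to reference one upstream model.
WITH eventbrite AS (
    SELECT * FROM {{ ref('transaction_eventbrite_2_cleaned') }}
),
stripe AS (
    SELECT * FROM {{ ref('transaction_stripe_2_cleaned') }}
)
SELECT * FROM eventbrite
UNION ALL
SELECT * FROM stripe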

Community

Now that we have a high-level understanding of what dbt is and how it fits within your analytical stack, I think we should talk a little bit about its community, because it does make working with dbt such a positive experience.

We should mention first that dbt is totally open source and you can go play inside the beast if that’s your thing.

Once you’ve satisfied your curiosity, you should definitely consider joining the dbt community on Slack. This is a pretty active community composed of members who are really helpful to one another. The creators of dbt are also always present and ready to help whenever you stumble upon a problem. What’s also pretty cool is that the discussions will definitely make you a better data engineer, as the questions, answers and resources shared open up your horizons to many other facets of building a modern analytical stack.

One other resource that’s starting to pick up steam is dbt’s Discourse environment. This is where you can find some pretty thorough answers to commonly asked questions. Or at least that’s the idea. It’s still pretty new, but I think there’s a lot of potential here, and there are already some gems living there.

And last but not least, dbt’s documentation. This of course will be a resource you refer to often as you start working with dbt, but also as you get more sophisticated with the tool. The documentation is solid and pretty much answers any question you might have about using the tool’s different features.

So, all in all, I think that this community is as important to the success of dbt as the tool itself. You can always count on the documentation, the already answered questions in Discourse, or the availability of other community members on Slack, to point you in the right direction.

First Steps

Alright, now that we’ve completed the grand tour of dbt, it’s about time we get our hands dirty.

To get things started, I would definitely recommend going through the dbt tutorial provided in the documentation. You’ll learn how to install dbt and create a project.

There are also useful guides to consult as you discover different aspects of dbt. I would definitely recommend going through the “Configuring models” guide, as this is an essential skill for working with dbt.
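
To give a flavour of what that guide covers, here is a minimal sketch of the models: block of dbt_project.yml. The project name (‘analytics’), the folder names and the materializations are assumptions on my part, and the exact syntax depends on your dbt version:

# dbt_project.yml (excerpt) – a sketch, assuming a project named 'analytics'
models:
  analytics:
    staging:
      materialized: view    # lightweight renaming/typecasting on top of source tables
    transform:
      materialized: view
    entities:
      materialized: table   # facts and dimensions queried by BI tools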

I would also recommend that you take 20 minutes of your time and listen to the “Project Walkthrough Video”, which confirmed to me that dbt was actually as easy as I suspected it to be. Up until that video, I was convinced that I was missing something that would make dbt complex to use, but nope. If you know SQL, you’ll be comfortable with dbt right from the start.

Hopefully, by this point you have a good understanding of dbt, how it fits within your architecture, how to set it up and have created your first project. Now what?

Well, you would need some data of course. Without going over the details, because this is out of scope for this article, let’s assume you have a few source tables in a Redshift/PostgreSQL database.

An example for me was to transform an export of Eventbrite registrations into a clean DW structure to be used in Tableau. My staging model would look something like this…


SELECT
    CAST('eventbrite' AS varchar) AS source,
    CAST("Order #" AS integer) AS registration_natural_key,
    CAST("Event ID" AS integer) AS event_natural_key,
    CAST("Event Name" AS varchar(64)) AS event_name,
    CAST("Attendee #" AS integer) AS attendee_natural_key,
    CAST("Email" AS varchar(128)) AS attendee_email,
    CAST("First Name" AS varchar(32)) AS attendee_first_name,
    CAST("Last Name" AS varchar(32)) AS attendee_last_name,
    CAST("Job Title" AS varchar(128)) AS attendee_job_title,
    CAST("Company" AS varchar(32)) AS company_name,
    CAST("Ticket Type" AS varchar(32)) AS tickets_type,
    CAST("Quantity" AS smallint) AS tickets_quantity,
    CAST("Total Paid" AS decimal(6, 2)) AS transaction_amount,
    CAST("Currency" AS char(3)) AS currency,
    CAST("Order Date" AS timestamp) AS registration_datetime
FROM
    eventbrite.registrations

I could then derive entities such as dim_event…


WITH registration AS (
    SELECT * FROM {{ ref('stg_eventbrite_registrations') }}
)

SELECT DISTINCT
    {{ dbt_utils.surrogate_key('registration.source', 'registration.event_natural_key', 'registration.event_name') }} AS surrogate_key,
    registration.source,
    registration.event_natural_key AS natural_key,
    registration.event_name AS name
FROM registration

dim_attendee…


WITH registration AS (
    SELECT * FROM {{ ref('stg_eventbrite_registrations') }}
)

SELECT DISTINCT
    {{ dbt_utils.surrogate_key('registration.source', 'registration.attendee_natural_key', 'registration.attendee_email') }} AS surrogate_key,
    registration.source,
    registration.attendee_natural_key AS natural_key,
    registration.attendee_email AS email,
    registration.attendee_first_name AS first_name,
    registration.attendee_last_name AS last_name,
    registration.attendee_job_title AS job_title,
    registration.company_name
FROM registration

and fct_registration…


WITH registration AS (
    SELECT * FROM {{ ref('stg_eventbrite_registrations') }}
)

SELECT
    {{ dbt_utils.surrogate_key('registration.source', 'registration.registration_natural_key', 'registration.attendee_natural_key', 'registration.event_natural_key') }} AS surrogate_key,
    registration.source,
    registration.registration_natural_key AS natural_key,
    registration.attendee_natural_key,
    registration.event_natural_key,
    registration.tickets_quantity,
    registration.tickets_type,
    registration.transaction_amount,
    registration.currency,
    registration.registration_datetime
FROM registration

Sinter provides a tool to view the relationships between our models, which is a great tool to have at our disposal once our data warehouse becomes complex. For our current example, we end up with the following DAG (directed acyclic graph)…

dbt – another example of data modeling

And that’s it, we just created a first version of our data warehouse.

Good practices

The example above is pretty basic, but as things become more complex, you’ll need to start picking up some best practices. Some of them you can introduce right away, some you’ll learn as you go, and some will impose themselves upon you.

Here’s a non-exhaustive list of “good practices” I picked up during my first few projects.

Folder Structure

How you structure your data warehouse and how you structure your dbt/models folder are two entirely different things. There have been some good discussions around folder organization in dbt, but I think in the end it all depends on how you like to work.

Here’s what works for me. I have 4 main folders:

  • Staging – This is where I create my foundational models on top of the source tables that were exported to the data warehouse. I select fields, typecast, filter and rename them if necessary. You know, basic maintenance.
  • Transform – This is where the bulk of the work happens. It’s where staging models are being used and transformed to become the final entities. You would join sources, add aggregates, filter, etc.
  • Entities – This is where I have my final fact and dimension models.
  • FinalModels (I usually have a more business-focused name, such as ‘Marketplace’) – This is where I create the main queries that are used by BI tools (see the sketch right after this list). There are discussions about building your final model within dbt or directly in your BI tool, but I personally like to have the “big ones” here in dbt with all the other queries.
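
To illustrate what lives in that last folder, here is a hedged sketch of one of those “big ones”, joining the Eventbrite fact and dimension models from the example above. The file name is invented, and the join keys simply reuse the source and natural key columns from those models:

-- models/marketplace/registrations_report.sql (hypothetical name)
WITH registrations AS (
    SELECT * FROM {{ ref('fct_registration') }}
),
events AS (
    SELECT * FROM {{ ref('dim_event') }}
),
attendees AS (
    SELECT * FROM {{ ref('dim_attendee') }}
)
SELECT
    events.name AS event_name,
    attendees.company_name,
    registrations.tickets_type,
    registrations.tickets_quantity,
    registrations.transaction_amount,
    registrations.registration_datetime
FROM registrations
LEFT JOIN events
    ON registrations.source = events.source
    AND registrations.event_natural_key = events.natural_key
LEFT JOIN attendees
    ON registrations.source = attendees.source
    AND registrations.attendee_natural_key = attendees.natural_key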

As for the Transform folder, even here I have a specific folder structure underneath. Here’s how it looks:

  • EntityName

    • Source1

      • EntityName_Source1_1_FirstTransformation.sql
      • EntityName_Source1_2_SecondTransformation.sql
    • Source2
    • EntityName_1_MergeSources.sql

As you can see, I like to keep it structured by source and sequence. This really helps whenever I have to introduce new features or when debugging.

Testing Suite

Here’s something that I didn’t use at first, but that was always in the back of my mind.

I was actually looking for a solution outside of dbt, because I had a specific test scenario in mind and didn’t know enough about dbt’s testing framework to get it working within dbt itself. So I asked my question directly on Slack, and Tristan pointed me to custom schema tests.
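
For reference, a custom schema test is essentially just a macro whose query returns the number of failing rows; dbt considers the test passed when it returns 0. Here is a minimal sketch – the test name and column are invented, and the exact convention depends on your dbt version:

-- macros/test_non_negative.sql (hypothetical)
{% macro test_non_negative(model, column_name) %}

SELECT COUNT(*)
FROM {{ model }}
WHERE {{ column_name }} < 0

{% endmacro %}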

And that kinda propelled me into the world of testing within the confines of dbt. dbt already has a few tests that can be used right out of the box, but there’s also the dbt-utils package, which provides other tests you can use in your own test suite.
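
As an illustration, here is roughly what a schema.yml combining built-in tests and a dbt-utils test could look like for the Eventbrite models above. The accepted currency values and the row-count comparison are assumptions, not the project’s actual test suite, and the schema.yml layout depends on your dbt version:

# schema.yml – a sketch, not the project's actual tests
version: 2
models:
  - name: stg_eventbrite_registrations
    columns:
      - name: registration_natural_key
        tests:
          - not_null
      - name: currency
        tests:
          - accepted_values:
              values: ['CAD', 'USD']
  - name: fct_registration
    columns:
      - name: surrogate_key
        tests:
          - unique
          - not_null
    tests:
      - dbt_utils.equal_rowcount:
          compare_model: ref('stg_eventbrite_registrations')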

Right now, for one project, I have over 60 tests that I run in my test environment after making changes and that run automatically every hour in the prod environment. Whenever a test fails in prod, an alert is sent to our Slack channel.

Amazing!

Data Quality Exploration

Once I’ve made changes in my dev environment, built my models and tested them, I like to use Tableau to actually visualize my Data Warehouse before it gets built in prod. Some sort of Quality Assessment.

I’m not doing anything fancy at this point, but I do like to plot my main entities’ data sets to make sure that no weird anomalies were introduced during development.

The testing suite that was discussed previously safeguards my data at the atomic level, but my Tableau QA workbook safeguards data quality at the aggregated level.

Workflow

That brings me to my overall workflow. Here are the steps I take when introducing a new feature within a data warehouse:

  1. Research/Exploration into required fields and their sources
  2. Write up tests as acceptance criteria
  3. Build data models in dbt
  4. Testing in development environment
  5. Quality assessment with Tableau
  6. Commit code to git repository
  7. Pull new code in prod
  8. Build new ETL in prod
  9. Testing in prod environment
  10. Integrate new feature in final BI report
  11. Push to Sandbox for client approval

Extras

I did briefly touch on those previously, but there are a few extras worth knowing about:

  • dbt-utils – This package is provided by the creators of dbt and packs in a few extras that are worth looking at. I personally only use a few of what’s provided (such as the ‘surrogate_key’ SQL helper, as seen in the models above), but there is a lot more to explore here that might also prove useful.
  • Sinter – Not sure how that service is associated with dbt and Fishtown Analytics, but their service is all about dbt. Essentially, they provide the means to run, schedule and manage your dbt jobs. Simple, yet really well made.
  • dbt Graph Viewer – Actually provided by the folks at Sinter, this little tool is something I use frequently. Not only can you see the whole DAG of your project, but you can also focus on a single model and look at its parent and/or child models.

I should also quickly mention macros, which are potentially a big thing but that I haven’t gotten around to yet. I did just read an article by Claire Carroll on design patterns, which talks about keeping your code DRY using Jinja’s templating language and introducing macros. Anyway, there’s still a whole lot of stuff to discover.
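
To give a flavour anyway, a macro is just a reusable snippet of SQL wrapped in Jinja. Here is a minimal, hypothetical sketch – the macro, column and model names are invented:

-- macros/cents_to_dollars.sql (hypothetical)
{% macro cents_to_dollars(column_name) %}
    ROUND({{ column_name }} / 100.0, 2)
{% endmacro %}

-- used inside any model, keeping the conversion logic DRY:
-- SELECT {{ cents_to_dollars('amount_in_cents') }} AS amount FROM {{ ref('stg_payments') }}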

The Big Picture

As you can see, dbt is more than just a simple tool. I like to think of it as a core tool within your data warehouse building arsenal.

You can think of it as following the principles of the Unix Tools Philosophy, where the goal is to have an ensemble of tools that do one job only, but do it really well.

Something else is important in that philosophy: those tools should interact together easily (through the pipe “|” symbol, which feeds the output of one tool as input to another). The fact that dbt stays really close to SQL is what makes it so easy to adopt, but also so easy to connect with many other tools. It’s a common language that allows one tool’s output to become another tool’s input.

The choice of SQL also ensures interoperability within a modular BI architecture. As Ajay Kulkarni, co-founder of TimescaleDB, argues:

“We believe that SQL has become the universal interface for data analysis.”
(Source: Why SQL is beating NoSQL, and what this means for the future of data)

A lot has been said and written about the modern BI architecture, with cloud-based services as its underlying infrastructure and modularity as a core principle. A tool such as dbt adds to that philosophy by being itself modular (and replaceable), while remaining flexible enough to work with many different inputs (data sources) and outputs (data warehouses).

Here’s to your dbt journey, to the future development of dbt and the community that is being fostered around that project!
