In this tutorial, we'll walk through setting up a minimal project using DBT (Data Build Tool) for data transformation. We'll cover setting up the environment, creating queries, running transformations, and deploying to production. I set up this project because I wanted to figure out how to spin up my own DBT project on my own data set. Existing tutorials use the jaffle_shop data and well-established dbt projects, and I just wanted to figure out the minimum needed to run recurring data transforms using DBT.
This example uses BigQuery (BQ), but I am going to try to keep the tutorial as warehouse-agnostic as possible. The code for the project is found in this GitHub repo.
Project Goal
The goal of this project is to take data from a public dataset, perform transformations using DBT, and schedule these transformations to run on a regular basis. The transformed data will be added to one or more target tables for analysis. We will divide the steps into two parts:
Part 1: DBT setup
Part 2: GCP deploy
Part 1: DBT setup
Prerequisites
Basic knowledge of SQL
Access to a cloud provider (e.g., Google Cloud Platform for BigQuery)
Working terminal environment. I am running on a Mac, so my instructions will be for Mac.
Steps
Install DBT CLI: DBT has a flavor (adapter) for every datastore you use, in addition to the flavor of their cloud-hosted product. I am using the open-source version with the BigQuery adapter. To figure out which flavor of DBT you need, check out the install instructions.
Here are the commands used to install the DBT CLI by tapping the DBT Homebrew repository and installing the BigQuery adapter:
brew untap dbt-labs/dbt    # remove any stale tap from a previous install
brew tap dbt-labs/dbt      # add the dbt Labs Homebrew tap
brew install dbt-bigquery  # install the BigQuery flavor of the dbt CLI
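You can confirm the install worked by checking the CLI version:
dbt --version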
Initialize DBT Project: The setup for a DBT Cloud project is different from a non-cloud project. If you use a cloud project, the web UI will create a project for you, but if you are like me and want to use an open-source option, you will need to create your own project in a repo. The dbt CLI has a command for this. I chose to name the new DBT project cust-analytics:
dbt init cust-analytics
and then inside the cust-analytics repo I configured git:
cd cust-analytics
git init
I would also recommend adding the following to .gitignore
profiles.yml
.user.yml
logs/
The dbt init command created a lot of extra empty directories, so if you want to remove the boilerplate, you will end up with a slimmed-down directory like this:
├── README.md
├── dbt_project.yml
├── models
│ └── example
│ ├── bank_and_population_data.sql
│ └── bank_data.sql
└── profiles.yml
In the dbt_project.yml, we need to complete the last config step by naming the project and pointing to the models:
name: 'cust_analytics'
version: '1.0.0'
config-version: 2
profile: 'cust_analytics'
model-paths: ["models"]
clean-targets:
  - "target"
  - "dbt_packages"

models:
  cust_analytics:
    # Config indicated by + and applies to all files under models/example/
    example:
      +schema: test_public_data # in BQ dataset = dbt schema and project = dbt database
This says what to build, where to put the build auxiliary files, and where my transformations are located. I have assigned the BQ dataset to be called test_public_data.
Create Queries: With the completely configured repo, we can now add SQL queries that will build and populate our tables. Add SQL queries to the .sql files in your project directory. If you already have tables in your dataset, you will need to follow your schema's protocol, but if, like me, you have no tables in the dataset, then DBT will create the tables for you.
My two queries are on data available in the BigQuery public data project. These are not actual queries with an analytics use case, but I wanted to test the concept of combining resources from two different datasets and creating a new view. So, TL;DR, this is an engineer testing a feature set, not an analyst gathering insights.
The format of the queries is pretty straightforward. I created a CTE and then used that CTE to create a grouping of useful data. DBT formats everything like a SELECT statement, so instead of inserting data into a DB, you can think of it as just querying the data you want and letting DBT handle the insert work for you. The last line of the file was the most interesting to me.
select * from bank_assets;
This says that I want to use all the data in my bank_assets CTE in the new table that DBT is creating. The table's name is the name of the file, BTW.
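To make the shape of a model file concrete, here is a minimal sketch of what one could look like. The public tables, columns, and join key below are illustrative placeholders, not the exact queries from this project:
-- e.g. models/example/bank_and_population_data.sql
-- Sketch of a dbt model: build a CTE that combines two public datasets,
-- then select from it. dbt wraps the final SELECT in the DDL needed to
-- create the table/view, which is named after this file.
with bank_assets as (
    select
        banks.state,
        banks.total_assets,          -- placeholder column
        population.state_population  -- placeholder column
    from `bigquery-public-data.some_dataset.banks` as banks
    join `bigquery-public-data.other_dataset.population` as population
        on banks.state = population.state
)

select * from bank_assets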
Configure Profiles: The second to last step is to connect the local instance of DBT to our GCP project. Edit the profiles.yml file with the necessary connection information. Locally I am using the JSON key method of connecting to the DB, but in part 2 we will switch to OAuth via cloud service accounts.
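For reference, a profiles.yml using the service-account JSON key method looks roughly like this; the GCP project ID, dataset, and keyfile path are placeholders for your own values:
# profiles.yml — connection settings for the cust_analytics profile
# referenced in dbt_project.yml
cust_analytics:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: my-gcp-project-id                  # placeholder GCP project
      dataset: analytics                          # placeholder default BQ dataset
      keyfile: /path/to/service-account-key.json  # placeholder path to JSON key
      threads: 4
      location: US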
One thing that I particularly like to do is store my profiles.yml in a specific location based on convenience or need. When you run dbt init, it creates a profiles.yml at ~/.dbt/profiles.yml. I prefer to keep it close to the repo and not share it across DBT projects.
Run DBT: Execute the DBT transformations locally to ensure they run successfully. This is the last step in the tutorial but is also a valid way to run DBT jobs. If you are running DBT on-prem, or connecting to a different cloud for your data warehouse, running the dbt run command through a cron job is a valid option. Since I moved my profiles.yml from the default location, I need to specify where dbt should read it from:
dbt run --profiles-dir .
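If you do go the cron route, a scheduled run could look something like the entry below; the repo path and log file are placeholders for your own setup:
# Hypothetical crontab entry: run the transformations every day at 06:00
0 6 * * * cd /path/to/cust-analytics && dbt run --profiles-dir . >> /tmp/dbt_run.log 2>&1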
By following this tutorial, you should have a basic understanding of how to set up a project using DBT for data transformation.