#
Course Introduction
OBJECTIVES
● Data-maturity model
● dbt and data architectures
● Data warehouses, data lakes, and lakehouses
● ETL and ELT procedures
● dbt fundamentals
● Analytics Engineering
TOP-DOWN: 1. Theory → 2. Practice → 3. Project
BOTTOM-UP: 1. Practice → 2. Project → 3. Theory
#
Data Maturity Model
Maslow's Hierarchy of Needs
1. Physiological Needs
2. Safety Needs
3. Belongingness & Love Needs
4. Esteem Needs
5. Self-actualization
Data-Maturity Model
1. Data Collection
2. Data Wrangling
3. Data Integration
4. BI and Analytics
5. Artificial Intelligence
Typical Data Architecture
1. Data Collection
2. Data Wrangling
3. Data Integration
4. ML/AI/BI
#
ETL - ELT
● ETL (Extract, Transform, Load): data is extracted from the sources, transformed on a separate processing server, and then loaded into the warehouse
● ELT (Extract, Load, Transform): raw data is loaded into the warehouse first and transformed there
#
Normalization
#
1NF
● First Normal Form: every column holds atomic values, there are no repeating groups, and each record is unique
#
2NF
● Second Normal Form: the table is in 1NF and every non-key column depends on the whole primary key (no partial dependencies)
#
3NF
● Third Normal Form: the table is in 2NF and no non-key column depends on another non-key column (no transitive dependencies)
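The progression can be sketched with a hypothetical hosts/listings schema (all table and column names here are illustrative, not the course dataset):

```sql
-- Unnormalized: a comma-separated amenities column and duplicated host
-- details in every listing row violate 1NF and 3NF.
-- listings(listing_id, host_name, host_email, amenities = 'wifi,kitchen')

-- 3NF decomposition: atomic values, one fact per table, no transitive
-- dependencies (host_email depends on the host, not on the listing).
CREATE TABLE hosts (
    host_id    INTEGER PRIMARY KEY,
    host_name  VARCHAR,
    host_email VARCHAR
);

CREATE TABLE listings (
    listing_id INTEGER PRIMARY KEY,
    host_id    INTEGER REFERENCES hosts (host_id)
);

CREATE TABLE listing_amenities (
    listing_id INTEGER REFERENCES listings (listing_id),
    amenity    VARCHAR
);
```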
#
Data Warehouse
On-Premise / Cloud
PNG Credits: https://databricks.com/wp-content/uploads/2020/01/data-lakehouse-new-1024x538.png
#
External Tables
Data Lake
#
Data Lake
Unstructured / Structured / Semi-Structured Data
PNG Credits: https://databricks.com/wp-content/uploads/2020/01/data-lakehouse-new-1024x538.png
#
Data Lakehouse
PNG Credits: https://databricks.com/wp-content/uploads/2020/01/data-lakehouse-new-1024x538.png
#
Slowly Changing Dimensions
SCD Type 0
Not updating the DWH table when a Dimension changes
Source DWH
SCD Type 1
Updating the DWH table when a Dimension changes, overwriting the original data
Source DWH
No Air-conditioning
Installed Air-conditioning
DWH updated
SCD Type 2
Keeping full history - Adding additional (historic data) rows for each dimension change
Source DWH
Current rental price ($300)
Change in the rental price ($450)
DWH updated
SCD Type 3
Keeping limited data history - adding separate columns for original and current value
Source DWH
Listed as Private
Host changed Private to Entire
Host changed Entire to Shared
#
dbt Overview
PNG Credits: https://data.solita.fi/wp-content/uploads/2021/06/dbt-transform.png
#
Analytics Engineering
#
Common Table Expression (CTE)
CTE Syntax
CTE Example
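A minimal sketch of the CTE syntax (table and column names are illustrative):

```sql
-- WITH defines a named, temporary result set that the main query can reference
WITH cleaned_listings AS (
    SELECT
        id,
        listing_name,
        price
    FROM raw_listings
    WHERE price IS NOT NULL
)
SELECT *
FROM cleaned_listings;
```

Several CTEs can be chained with commas, which keeps multi-step transformations readable and is the style dbt models typically use.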
PROJECT OVERVIEW
Analytics Engineering with Airbnb
ANALYTICS ENGINEERING WITH AIRBNB
● Simulating the life of an Analytics Engineer at Airbnb
● Loading, Cleansing, Exposing data
● Writing tests, automations and documentation
● Data source: Inside Airbnb: Berlin
TECH STACK
REQUIREMENTS
● Modeling changes are easy to follow and revert
● Explicit dependencies between models
● Explore dependencies between models
● Data quality tests
● Error reporting
● Incremental load of fact tables
● Track history of dimension tables
● Easy-to-access documentation
NEXT STEPS - SETUP
● Snowflake registration
● Dataset import
● dbt installation
● dbt setup, Snowflake connection
That’s it :)
SNOWFLAKE
Registering a Trial account
DATA FLOW
Overview
INPUT DATA MODEL
DATA FLOW OVERVIEW
● Raw layer: raw_listings, raw_hosts, raw_reviews
● Staging layer: src_listings, src_hosts, src_reviews (each with basic checks)
● Core layer: Hosts (cleansed), Listings (cleansed), Reviews (cleansed)
● Presentation layer: Listings mart table, Reviews mart table
DBT SETUP
Mac
INSTALLING DBT
● Install Python3 virtualenv
● Create a virtualenv
● Activate virtualenv
● Install dbt-snowflake
DBT SETUP
Windows
INSTALLING DBT
● Install Python3
● Create a virtualenv
● Activate virtualenv
● Install dbt and the dbt-snowflake connector
VSCODE SETUP
Installing the dbt power user extension
DBT SETUP
dbt init and connecting to Snowflake
MODELS
LEARNING OBJECTIVES
● Understand the data flow of our project
● Understand the concept of Models in dbt
● Create three basic models:
○ src_listings
○ src_reviews: guided exercises
○ src_hosts: individual lab
MODELS OVERVIEW
● Models are the basic building block of your business logic
● Materialized as tables, views, etc…
● They live in SQL files in the `models` folder
● Models can reference each other and use templates and macros
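A minimal model sketch under these conventions (the raw table follows the project's raw layer; column names are illustrative):

```sql
-- models/src/src_listings.sql
-- A model is a single SELECT statement; dbt materializes it
-- (as a view by default) under the model's file name.
WITH raw_listings AS (
    SELECT * FROM AIRBNB.RAW.RAW_LISTINGS
)
SELECT
    id AS listing_id,
    name AS listing_name,
    price
FROM raw_listings
```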
DATA FLOW PROGRESS
● Raw layer: raw_listings, raw_hosts, raw_reviews
● Staging layer: src_listings, src_hosts, src_reviews (each with basic checks)
GUIDED EXERCISE
src_reviews.sql
Create a new model in the `models/src/` folder called
`src_reviews.sql`.
● Use a CTE to reference the AIRBNB.RAW.RAW_REVIEWS table
● SELECT every column and every record, and rename the following columns:
○ date to review_date
○ comments to review_text
○ sentiment to review_sentiment
● Execute `dbt run` and verify that your model has been created
(You can find the solution among the resources)
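One possible solution sketch, assuming the raw table also carries listing_id and reviewer_name columns:

```sql
-- models/src/src_reviews.sql
-- Reference the raw table through a CTE, then rename the requested columns
WITH raw_reviews AS (
    SELECT * FROM AIRBNB.RAW.RAW_REVIEWS
)
SELECT
    listing_id,
    date AS review_date,
    reviewer_name,
    comments AS review_text,
    sentiment AS review_sentiment
FROM raw_reviews
```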
MATERIALIZATIONS
LEARNING OBJECTIVES
● Understand how models can be connected
● Understand the four built-in materializations
● Understand how materializations can be configured on the file and
project level
● Use dbt run with extra parameters
MATERIALIZATIONS
MATERIALIZATIONS OVERVIEW

View
● Use it: you want a lightweight representation; you don't reuse the data too often
● Don't use it: you read from the same model several times

Table
● Use it: you read from this model repeatedly
● Don't use it: you are building single-use models; your model is populated incrementally

Incremental (table appends)
● Use it: fact tables; appends to tables
● Don't use it: you want to update historical records

Ephemeral (CTEs)
● Use it: you merely want an alias for your data
● Don't use it: you read from the same model several times
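A materialization can be set at the top of a model file with a config block (the model and the on_schema_change setting are illustrative):

```sql
-- e.g. at the top of models/fct/fct_reviews.sql
{{
  config(
    materialized='incremental',
    on_schema_change='fail'
  )
}}
```

The same setting can also be applied per folder in dbt_project.yml, with file-level config blocks overriding the project-level default.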
DATA FLOW PROGRESS
● Raw layer: raw_listings, raw_hosts, raw_reviews
● Staging layer: src_listings, src_hosts, src_reviews (each with basic checks)
● Core layer: dim_listings_cleansed (cleansing), dim_hosts_cleansed (cleansing), fct_reviews (incremental), dim_listings_with_hosts (final dimension table)
GUIDED EXERCISE
dim_hosts_cleansed.sql
Create a new model in the `models/dim/` folder called
`dim_hosts_cleansed.sql`.
● Use a CTE to reference the `src_hosts` model
● SELECT every column and every record, and add a cleansing step to
host_name:
○ If host_name is not null, keep the original value
○ If host_name is null, replace it with the value ‘Anonymous’
○ Use the NVL(column_name, default_null_value) function
● Execute `dbt run` and verify that your model has been created
(You can find the solution among the resources)
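One possible solution sketch (the column list is assumed from the hosts source):

```sql
-- models/dim/dim_hosts_cleansed.sql
WITH src_hosts AS (
    SELECT * FROM {{ ref('src_hosts') }}
)
SELECT
    host_id,
    -- Replace missing host names with a default value
    NVL(host_name, 'Anonymous') AS host_name,
    is_superhost,
    created_at,
    updated_at
FROM src_hosts
```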
SOURCES & SEEDS
LEARNING OBJECTIVES
● Understand the difference between seeds and sources
● Understand source-freshness
● Integrate sources into our project
SOURCES AND SEEDS OVERVIEW
● Seeds are local files that you upload to the data warehouse from dbt
● Sources are an abstraction layer on top of your input tables
● Source freshness can be checked automatically
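A sketch of a source definition with a freshness rule (names, the loaded_at_field, and the thresholds are illustrative):

```yaml
# models/sources.yml
version: 2

sources:
  - name: airbnb
    schema: raw
    tables:
      - name: listings
        identifier: raw_listings
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 24, period: hour}
```

Models can then use `{{ source('airbnb', 'listings') }}` instead of hard-coded table names, and `dbt source freshness` checks the thresholds.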
SNAPSHOTS
LEARNING OBJECTIVES
● Understand how dbt handles type-2 slowly changing dimensions
● Understand snapshot strategies
● Learn how to create snapshots on top of our listings and hosts models
SNAPSHOTS
Overview
TYPE-2 SLOWLY CHANGING DIMENSIONS

Initial source table:

host_id | host_name | email
1 | Alice | alice.airbnb@gmail.com
2 | Bob | bob.airbnb@gmail.com

Bob changes his email address:

host_id | host_name | email
1 | Alice | alice.airbnb@gmail.com
2 | Bob | bobs.new.address@gmail.com

Snapshot table with validity columns (Bob keeps host_id 2; the old row is closed out and a new row is added):

host_id | host_name | email | dbt_valid_from | dbt_valid_to
1 | Alice | alice.airbnb@gmail.com | 2022-01-01 00:00:00 | null
2 | Bob | bob.airbnb@gmail.com | 2022-01-01 00:00:00 | 2022-03-01 12:53:20
2 | Bob | bobs.new.address@gmail.com | 2022-03-01 12:53:20 | null
CONFIGURATION AND STRATEGIES
● Snapshots live in the snapshots folder
● Strategies:
○ Timestamp: a unique key and an updated_at field are defined on the
source model. These columns are used to determine changes.
○ Check: any change in a set of columns (or all columns) is picked
up as an update.
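A timestamp-strategy snapshot might look like this (table, key, and column names are illustrative):

```sql
-- snapshots/scd_raw_listings.sql
{% snapshot scd_raw_listings %}

{{
  config(
    target_schema='dev',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}

SELECT * FROM AIRBNB.RAW.RAW_LISTINGS

{% endsnapshot %}
```

Running `dbt snapshot` builds the target table on the first run and maintains the dbt_valid_from / dbt_valid_to columns on subsequent runs.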
GUIDED EXERCISE
scd_raw_hosts.sql
Create a new snapshot in the `snapshots/` folder called
`scd_raw_hosts.sql`.
● Set the target table name to scd_raw_hosts
● Set the output schema to dev
● Use the timestamp strategy, figure out the unique key and updated_at
column to use
● Execute `dbt snapshot` and verify that your snapshot has been created
(You can find the solution among the resources)
TESTS
LEARNING OBJECTIVES
● Understand how tests can be defined
● Configure built-in generic tests
● Create your own singular tests
TESTS OVERVIEW
● There are two types of tests: singular and generic
● Singular tests are SQL queries stored in the `tests` folder which are expected to return an empty result set
● There are four built-in generic tests:
○ unique
○ not_null
○ accepted_values
○ relationships
● You can define your own custom generic tests or import tests from dbt packages (will discuss later)
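A sketch of how the built-in generic tests are attached to a model in schema.yml (model, column, and accepted values are illustrative):

```yaml
# models/schema.yml
version: 2

models:
  - name: dim_listings_cleansed
    columns:
      - name: listing_id
        tests:
          - unique
          - not_null
      - name: room_type
        tests:
          - accepted_values:
              values: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']
      - name: host_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleansed')
              field: host_id
```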
GUIDED EXERCISE
TEST dim_hosts_cleansed
Create generic tests for the `dim_hosts_cleansed` model.
● host_id: unique values, no nulls
● host_name shouldn't contain any null values
● is_superhost should only contain the values t and f
● Execute `dbt test` to verify that your tests are passing
● Bonus: Figure out which tests to write for `fct_reviews` and implement
them
(You can find the solution among the resources)
MACROS, CUSTOM
TESTS AND
PACKAGES
LEARNING OBJECTIVES
● Understand how macros are created
● Use macros to implement your own generic tests
● Find and install third-party dbt packages
MACROS, CUSTOM TESTS AND PACKAGES
● Macros are Jinja templates created in the `macros` folder
● There are many built-in macros in dbt
● You can use macros in model definitions and tests
● A special macro, called test, can be used to implement your own generic tests
● dbt packages can be installed easily to get access to a plethora of macros and tests
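A custom generic test is a macro wrapped in a test block; it returns the rows that violate the rule, so an empty result means the test passes (the test name and logic are illustrative):

```sql
-- macros/positive_value.sql
{% test positive_value(model, column_name) %}
SELECT *
FROM {{ model }}
WHERE {{ column_name }} < 1
{% endtest %}
```

It can then be listed like a built-in test, e.g. `- positive_value` under a column in schema.yml.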
DOCUMENTATION
LEARNING OBJECTIVES
● Understand how to document models
● Use the documentation generator and server
● Add assets and markdown to the documentation
● Discuss dev vs. production documentation serving
DOCUMENTATION OVERVIEW
● Documentation can be defined in two ways:
○ In yaml files (like schema.yml)
○ In standalone markdown files
● dbt ships with a lightweight documentation web server
● To customize the landing page, a special file, overview.md, is used
● You can add your own assets (like images) to a special project folder
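Descriptions can live directly in schema.yml, optionally pulling longer text from a markdown doc block via the doc() macro (model, column, and doc-block names are illustrative):

```yaml
# models/schema.yml
version: 2

models:
  - name: dim_listings_cleansed
    description: Cleansed table containing Airbnb listings.
    columns:
      - name: minimum_nights
        description: '{{ doc("dim_listings_cleansed__minimum_nights") }}'
```

The referenced block is defined in a markdown file between `{% docs dim_listings_cleansed__minimum_nights %}` and `{% enddocs %}`; `dbt docs generate` and `dbt docs serve` then build and serve the site.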
ANALYSES, HOOKS
AND EXPOSURES
LEARNING OBJECTIVES
● Understand how to store ad-hoc analytical queries in dbt
● Work with dbt hooks to manage table permissions
● Build a dashboard in Preset
● Create a dbt exposure to document the dashboard
HOOKS
● Hooks are SQL statements that are executed at predefined times
● Hooks can be configured on the project, subfolder, or model level
● Hook types:
○ on_run_start: executed at the start of dbt {run, seed, snapshot}
○ on_run_end: executed at the end of dbt {run, seed, snapshot}
○ pre-hook: executed before a model/seed/snapshot is built
○ post-hook: executed after a model/seed/snapshot is built
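A post-hook managing table permissions might be configured project-wide in dbt_project.yml (the role name REPORTER is illustrative):

```yaml
# dbt_project.yml
models:
  +post-hook:
    - "GRANT SELECT ON {{ this }} TO ROLE REPORTER"
```

Because `{{ this }}` resolves to the relation just built, the grant runs after every model without hard-coding table names.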