Agent skill
dbt Patterns
Comprehensive guide to dbt (data build tool) patterns, modeling best practices, testing strategies, and production workflows for modern data transformation
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/dbt-patterns
SKILL.md
dbt Patterns
What is dbt?
dbt (data build tool): SQL-first transformation tool that enables analytics engineers to transform data in the warehouse using SELECT statements.
Core Concept
Raw Data (Extract & Load) → dbt (Transform) → Analytics-Ready Data
Traditional ETL: Extract → Transform → Load
Modern ELT: Extract → Load → Transform (with dbt)
Why dbt?
- Version control: SQL as code (Git)
- Testing: Built-in data quality tests
- Documentation: Auto-generated docs
- Modularity: Reusable models
- Lineage: Visual data lineage
- Collaboration: Team workflows
dbt Project Structure
Standard Layout
my_dbt_project/
├── dbt_project.yml # Project configuration
├── profiles.yml # Connection profiles
├── models/ # SQL models
│ ├── staging/ # Raw data cleaning
│ ├── intermediate/ # Business logic
│ └── marts/ # Final analytics tables
├── tests/ # Custom tests
├── macros/ # Reusable SQL
├── seeds/ # CSV reference data
├── snapshots/ # SCD Type 2
└── analyses/ # Ad-hoc queries
dbt_project.yml
name: 'my_project'
version: '1.0.0'
config-version: 2
profile: 'my_profile'
model-paths: ["models"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
models:
my_project:
staging:
+materialized: view
+schema: staging
intermediate:
+materialized: ephemeral
marts:
+materialized: table
+schema: analytics
Model Layers (Staging → Intermediate → Marts)
Staging Layer
Purpose: Clean and standardize raw data
Pattern:
-- models/staging/stg_orders.sql
with source as (
select * from {{ source('raw', 'orders') }}
),
renamed as (
select
id as order_id,
user_id,
created_at as order_created_at,
status as order_status,
total_amount
from source
)
select * from renamed
Best Practices:
- One staging model per source table
- Rename columns to consistent naming
- Cast data types
- No business logic
- Materialized as views (lightweight)
Intermediate Layer
Purpose: Business logic and transformations
Pattern:
-- models/intermediate/int_orders_with_customer.sql
with orders as (
select * from {{ ref('stg_orders') }}
),
customers as (
select * from {{ ref('stg_customers') }}
),
joined as (
select
orders.*,
customers.customer_name,
customers.customer_segment
from orders
left join customers
on orders.user_id = customers.customer_id
)
select * from joined
Best Practices:
- Complex joins
- Business logic
- Calculations
- Materialized as ephemeral (not persisted)
Marts Layer
Purpose: Final analytics-ready tables
Pattern:
-- models/marts/fct_orders.sql
with orders as (
select * from {{ ref('int_orders_with_customer') }}
),
aggregated as (
select
order_id,
user_id,
customer_name,
order_created_at,
order_status,
total_amount,
case
when order_status = 'completed' then total_amount
else 0
end as completed_revenue
from orders
)
select * from aggregated
Best Practices:
- Business-friendly naming
- Denormalized for analytics
- Materialized as tables (fast queries)
- Documented
Naming Conventions
Model Naming
Staging: stg_<source>_<table>
stg_salesforce_accounts
stg_stripe_payments
Intermediate: int_<entity>_<verb>
int_orders_joined
int_customers_enriched
Facts: fct_<entity>
fct_orders
fct_revenue
Dimensions: dim_<entity>
dim_customers
dim_products
Column Naming
IDs: <entity>_id
customer_id, order_id
Dates: <entity>_<verb>_at
order_created_at, customer_updated_at
Booleans: is_<condition> or has_<condition>
is_active, has_subscription
Amounts: <entity>_amount
order_amount, refund_amount
Materializations
View
{{ config(materialized='view') }}
select * from {{ ref('stg_orders') }}
- Pros: Always fresh, no storage
- Cons: Slow queries (recomputed each time)
- Use for: Staging models, rarely queried
Table
{{ config(materialized='table') }}
select * from {{ ref('fct_orders') }}
- Pros: Fast queries
- Cons: Storage cost, stale data
- Use for: Marts, frequently queried
Incremental
{{ config(
materialized='incremental',
unique_key='order_id'
) }}
select * from {{ ref('stg_orders') }}
{% if is_incremental() %}
where order_created_at > (select max(order_created_at) from {{ this }})
{% endif %}
- Pros: Fast builds, handles large data
- Cons: Complex logic
- Use for: Large fact tables
Ephemeral
{{ config(materialized='ephemeral') }}
select * from {{ ref('stg_orders') }}
- Pros: No storage, CTE in downstream models
- Cons: Recomputed in each downstream model
- Use for: Intermediate models
Testing
Schema Tests
# models/schema.yml
version: 2
models:
- name: fct_orders
description: "Order facts table"
columns:
- name: order_id
description: "Unique order identifier"
tests:
- unique
- not_null
- name: user_id
description: "Customer who placed order"
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: order_status
tests:
- accepted_values:
values: ['pending', 'completed', 'cancelled']
- name: total_amount
tests:
- not_null
- dbt_utils.expression_is_true:
expression: ">= 0"
Custom Tests
-- tests/assert_positive_revenue.sql
select *
from {{ ref('fct_orders') }}
where total_amount < 0
Test Types
- unique: No duplicates
- not_null: No nulls
- accepted_values: Value in list
- relationships: Foreign key check
- custom: SQL query returns 0 rows
Macros
Reusable SQL
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
({{ column_name }} / 100.0)::decimal(10,2)
{% endmacro %}
Usage:
select
order_id,
{{ cents_to_dollars('total_amount_cents') }} as total_amount_dollars
from {{ ref('stg_orders') }}
Common Macros
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) %}
{%- if custom_schema_name is none -%}
{{ target.schema }}
{%- else -%}
{{ custom_schema_name | trim }}
{%- endif -%}
{% endmacro %}
Sources
Defining Sources
# models/staging/sources.yml
version: 2
sources:
- name: raw
database: analytics_db
schema: raw_data
tables:
- name: orders
description: "Raw orders from production DB"
columns:
- name: id
tests:
- unique
- not_null
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
loaded_at_field: _loaded_at
Using Sources
select * from {{ source('raw', 'orders') }}
Source Freshness
dbt source freshness
Documentation
Model Documentation
# models/schema.yml
version: 2
models:
- name: fct_orders
description: |
Order facts table containing all orders with customer information.
This table is updated daily at 2 AM UTC.
**Business Rules:**
- Only includes orders with status 'completed' or 'pending'
- Cancelled orders are excluded
columns:
- name: order_id
description: "Unique identifier for each order"
- name: total_amount
description: "Total order amount in dollars"
Generate Docs
dbt docs generate
dbt docs serve
Output: Interactive documentation website with:
- Model descriptions
- Column descriptions
- Data lineage (DAG)
- Source freshness
Snapshots (SCD Type 2)
Snapshot Configuration
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at'
)
}}
select * from {{ source('raw', 'customers') }}
{% endsnapshot %}
Run Snapshots
dbt snapshot
Output
customer_id | name | updated_at | dbt_valid_from | dbt_valid_to | dbt_scd_id
1 | John | 2024-01-01 | 2024-01-01 | 2024-01-15 | abc123
1 | John | 2024-01-15 | 2024-01-15 | null | def456
Incremental Models
Basic Incremental
{{ config(
materialized='incremental',
unique_key='order_id'
) }}
select
order_id,
user_id,
order_created_at,
total_amount
from {{ ref('stg_orders') }}
{% if is_incremental() %}
-- Only process new/updated records
where order_created_at > (select max(order_created_at) from {{ this }})
{% endif %}
Incremental with Delete+Insert
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='delete+insert'
) }}
Incremental with Merge
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
merge_update_columns=['order_status', 'total_amount']
) }}
Packages
Installing Packages
# packages.yml
packages:
- package: dbt-labs/dbt_utils
version: 1.1.0
- package: calogica/dbt_expectations
version: 0.9.0
- package: dbt-labs/codegen
version: 0.11.0
dbt deps
Using Packages
-- dbt_utils
select
{{ dbt_utils.surrogate_key(['order_id', 'user_id']) }} as unique_key,
*
from {{ ref('stg_orders') }}
-- dbt_expectations
tests:
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 1000000
Production Workflows
Development Workflow
# 1. Create feature branch
git checkout -b feature/new-model
# 2. Develop model
# Edit models/marts/fct_new_model.sql
# 3. Run model
dbt run --select fct_new_model
# 4. Test model
dbt test --select fct_new_model
# 5. Document model
# Edit models/schema.yml
# 6. Commit and push
git add .
git commit -m "Add new model"
git push origin feature/new-model
# 7. Create PR
# Review, approve, merge
CI/CD Pipeline
# .github/workflows/dbt_ci.yml
name: dbt CI
on: [pull_request]
jobs:
dbt_run:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Setup Python
uses: actions/setup-python@v2
with:
python-version: '3.9'
- name: Install dbt
run: pip install dbt-snowflake
- name: Run dbt
run: |
dbt deps
dbt run --select state:modified+ --defer --state ./prod_manifest
dbt test --select state:modified+ --defer --state ./prod_manifest
Production Deployment
# Daily production run
dbt run --target prod
dbt test --target prod
dbt source freshness --target prod
Performance Optimization
Optimize Queries
-- Bad: Multiple CTEs with same source
with orders_1 as (
select * from {{ ref('stg_orders') }}
where status = 'completed'
),
orders_2 as (
select * from {{ ref('stg_orders') }}
where status = 'pending'
)
-- Good: Single CTE, filter later
with orders as (
select * from {{ ref('stg_orders') }}
where status in ('completed', 'pending')
),
completed_orders as (
select * from orders where status = 'completed'
),
pending_orders as (
select * from orders where status = 'pending'
)
Use Incremental Models
-- For large tables (millions of rows)
{{ config(materialized='incremental') }}
Partition Tables
{{ config(
materialized='table',
partition_by={
"field": "order_date",
"data_type": "date",
"granularity": "day"
}
) }}
Best Practices
1. One Model Per File
✓ models/staging/stg_orders.sql
✗ models/staging/all_staging_models.sql
2. Use CTEs, Not Subqueries
-- Good
with orders as (
select * from {{ ref('stg_orders') }}
)
select * from orders
-- Bad
select * from (
select * from {{ ref('stg_orders') }}
) as orders
3. Explicit Column Selection
-- Good
select
order_id,
user_id,
total_amount
from {{ ref('stg_orders') }}
-- Bad
select * from {{ ref('stg_orders') }}
4. Test Everything
# Every model should have tests
models:
- name: fct_orders
columns:
- name: order_id
tests: [unique, not_null]
5. Document Everything
# Every model and column should have description
models:
- name: fct_orders
description: "Order facts table"
columns:
- name: order_id
description: "Unique order ID"
Common Patterns
Slowly Changing Dimensions (SCD)
-- Type 1: Overwrite
{{ config(materialized='table') }}
select * from {{ ref('stg_customers') }}
-- Type 2: Historical tracking
{% snapshot customers_snapshot %}
{{ config(strategy='timestamp', updated_at='updated_at') }}
select * from {{ ref('stg_customers') }}
{% endsnapshot %}
Fact Tables
-- Transactional facts
{{ config(materialized='incremental', unique_key='order_id') }}
select
order_id,
customer_id,
product_id,
order_date,
quantity,
amount
from {{ ref('stg_orders') }}
Dimension Tables
-- Dimension table
{{ config(materialized='table') }}
select
customer_id,
customer_name,
customer_segment,
customer_region
from {{ ref('stg_customers') }}
Troubleshooting
Common Errors
Compilation Error:
Compilation Error in model fct_orders
Model 'stg_orders' not found
Fix: Check model name, ensure it exists
Test Failure:
Failure in test unique_fct_orders_order_id
Got 5 results, expected 0
Fix: Investigate duplicate order_ids
Freshness Error:
Source 'raw.orders' is stale (loaded 25 hours ago)
Fix: Check ETL pipeline, data loading
Summary
dbt: SQL-first transformation tool
Layers:
- Staging: Clean raw data
- Intermediate: Business logic
- Marts: Analytics-ready
Materializations:
- View: Always fresh, slow
- Table: Fast, stale
- Incremental: Large data
- Ephemeral: No storage
Testing:
- unique, not_null, accepted_values, relationships
- Custom SQL tests
Best Practices:
- One model per file
- Use CTEs
- Test everything
- Document everything
- Version control
Workflow:
- Develop → Test → Document → PR → Merge → Deploy
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
Didn't find tool you were looking for?