Agent skill

ELT Modeling

Comprehensive guide to ELT (Extract, Load, Transform) modeling patterns, dimensional modeling, fact and dimension tables, and data warehouse design

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/elt-modeling

SKILL.md

ELT Modeling

ELT vs ETL

ETL (Traditional)

Extract → Transform → Load

1. Extract from source
2. Transform in ETL tool (Informatica, Talend)
3. Load to warehouse

Pros: Clean data before loading
Cons: Slow, expensive transformation servers

ELT (Modern)

Extract → Load → Transform

1. Extract from source
2. Load raw data to warehouse
3. Transform in warehouse (SQL, dbt)

Pros: Fast loading, leverage warehouse power
Cons: Raw data in warehouse (need governance)

Why ELT?

  • Cloud warehouses: Snowflake, BigQuery, Redshift (powerful, scalable)
  • Separation of concerns: EL (Fivetran, Airbyte) + T (dbt)
  • Flexibility: Transform multiple ways from same raw data
  • Speed: Parallel processing in warehouse

Dimensional Modeling

Star Schema

        dim_customers
              |
        dim_products
              |
         fct_orders  ← Central fact table
              |
        dim_dates
              |
        dim_locations

Characteristics:

  • One fact table (center)
  • Multiple dimension tables (points)
  • Denormalized (fast queries)

Snowflake Schema

    dim_customers
          |
    dim_customer_segments
          |
     fct_orders
          |
    dim_products
          |
    dim_product_categories

Characteristics:

  • Normalized dimensions
  • Less redundancy
  • More joins (slower queries)

Recommendation: Use star schema for analytics (faster)


Fact Tables

Definition

Fact table: Stores measurable events (transactions, orders, clicks)

Characteristics

  • Large: Millions to billions of rows
  • Numeric measures: Amounts, quantities, counts
  • Foreign keys: Links to dimensions
  • Grain: Level of detail (one row per order, per day, etc.)

Types of Facts

Transaction Facts:

sql
-- One row per transaction
fct_orders:
  order_id (PK)
  customer_id (FK)
  product_id (FK)
  order_date_id (FK)
  quantity
  amount

Periodic Snapshot Facts:

sql
-- One row per period (daily, monthly)
fct_inventory_daily:
  date_id (PK)
  product_id (PK)
  warehouse_id (PK)
  quantity_on_hand
  quantity_sold

Accumulating Snapshot Facts:

sql
-- One row per process (order lifecycle)
fct_order_lifecycle:
  order_id (PK)
  order_date
  payment_date
  shipment_date
  delivery_date
  days_to_ship
  days_to_deliver

Fact Table Design

sql
CREATE TABLE fct_orders (
  -- Surrogate key
  order_key BIGINT PRIMARY KEY,
  
  -- Natural key
  order_id VARCHAR(50) NOT NULL,
  
  -- Foreign keys (dimensions)
  customer_key BIGINT NOT NULL,
  product_key BIGINT NOT NULL,
  date_key INT NOT NULL,
  
  -- Degenerate dimensions (no separate dim table)
  order_number VARCHAR(50),
  
  -- Measures
  quantity INT,
  unit_price DECIMAL(10,2),
  total_amount DECIMAL(10,2),
  discount_amount DECIMAL(10,2),
  
  -- Audit columns
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

Dimension Tables

Definition

Dimension table: Stores descriptive attributes (who, what, where, when)

Characteristics

  • Small: Thousands to millions of rows
  • Descriptive: Text, categories, hierarchies
  • Slowly changing: Updates over time (SCD)

Dimension Design

sql
CREATE TABLE dim_customers (
  -- Surrogate key
  customer_key BIGINT PRIMARY KEY,
  
  -- Natural key
  customer_id VARCHAR(50) NOT NULL,
  
  -- Attributes
  customer_name VARCHAR(255),
  email VARCHAR(255),
  phone VARCHAR(50),
  
  -- Hierarchies
  city VARCHAR(100),
  state VARCHAR(100),
  country VARCHAR(100),
  region VARCHAR(100),
  
  -- Segments
  customer_segment VARCHAR(50),
  customer_tier VARCHAR(50),
  
  -- SCD Type 2 columns
  effective_date DATE,
  expiration_date DATE,
  is_current BOOLEAN,
  
  -- Audit columns
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

Slowly Changing Dimensions (SCD)

Type 1: Overwrite

sql
-- Customer moves, update address
UPDATE dim_customers
SET city = 'New York', state = 'NY'
WHERE customer_id = '123';

-- No history preserved

Type 2: Add New Row (Historical)

sql
-- Customer moves, add new row
INSERT INTO dim_customers (
  customer_id, city, state,
  effective_date, is_current
) VALUES (
  '123', 'New York', 'NY',
  '2024-01-15', TRUE
);

-- Mark old row as not current
UPDATE dim_customers
SET is_current = FALSE,
    expiration_date = '2024-01-14'
WHERE customer_id = '123'
  AND is_current = TRUE;

-- History preserved

Type 3: Add New Column

sql
-- Track previous value
ALTER TABLE dim_customers
ADD COLUMN previous_city VARCHAR(100);

UPDATE dim_customers
SET previous_city = city,
    city = 'New York'
WHERE customer_id = '123';

Grain Definition

What is Grain?

Grain: Level of detail in fact table (one row represents what?)

Examples

Order grain:      One row per order
Order line grain: One row per product in order
Daily grain:      One row per day per product
Hourly grain:     One row per hour per product

Choosing Grain

Too fine (order line):
- More rows
- More storage
- More flexibility

Too coarse (monthly):
- Fewer rows
- Less storage
- Less flexibility

Balance: Choose finest grain needed for analysis

Kimball vs Inmon

Kimball (Bottom-Up, Dimensional)

Data Marts (Star Schemas) → Enterprise Data Warehouse

Pros:
- Fast to implement
- Business-friendly (denormalized)
- Query performance

Cons:
- Data redundancy
- Hard to change

Inmon (Top-Down, Normalized)

Enterprise Data Warehouse (3NF) → Data Marts

Pros:
- Single source of truth
- Flexible
- Less redundancy

Cons:
- Slow to implement
- Complex queries (many joins)

Modern Approach: Hybrid (normalized staging, dimensional marts)


Data Vault

Components

Hubs:       Business keys (customers, products)
Links:      Relationships (orders link customers + products)
Satellites: Descriptive attributes (customer details)

Example

sql
-- Hub: Customer
CREATE TABLE hub_customer (
  customer_hub_key BIGINT PRIMARY KEY,
  customer_id VARCHAR(50) UNIQUE,
  load_date TIMESTAMP,
  record_source VARCHAR(50)
);

-- Satellite: Customer Details
CREATE TABLE sat_customer_details (
  customer_hub_key BIGINT,
  load_date TIMESTAMP,
  customer_name VARCHAR(255),
  email VARCHAR(255),
  phone VARCHAR(50),
  PRIMARY KEY (customer_hub_key, load_date)
);

-- Link: Order
CREATE TABLE link_order (
  order_link_key BIGINT PRIMARY KEY,
  customer_hub_key BIGINT,
  product_hub_key BIGINT,
  load_date TIMESTAMP
);

Pros:

  • Audit trail
  • Flexible
  • Handles source changes

Cons:

  • Complex
  • Many joins
  • Steep learning curve

Conformed Dimensions

Definition

Conformed dimension: Shared across multiple fact tables

Example

sql
-- dim_date used by multiple facts
dim_date ← fct_orders
        ← fct_shipments
        ← fct_returns

-- Ensures consistent date attributes

Benefits

  • Consistency: Same date logic everywhere
  • Drill-across: Compare metrics across facts
  • Reusability: Build once, use many times

Bridge Tables

Many-to-Many Relationships

sql
-- Customer can have multiple accounts
-- Account can have multiple customers

CREATE TABLE bridge_customer_account (
  customer_key BIGINT,
  account_key BIGINT,
  allocation_percentage DECIMAL(5,2),
  PRIMARY KEY (customer_key, account_key)
);

Junk Dimensions

Definition

Junk dimension: Collection of low-cardinality flags

Example

sql
-- Instead of many boolean columns in fact
CREATE TABLE dim_order_flags (
  order_flag_key INT PRIMARY KEY,
  is_gift BOOLEAN,
  is_express_shipping BOOLEAN,
  is_first_order BOOLEAN,
  has_discount BOOLEAN
);

-- Fact table references junk dimension
fct_orders:
  order_key
  customer_key
  order_flag_key  ← Reference to junk dimension
  amount

Role-Playing Dimensions

Definition

Role-playing dimension: Same dimension used multiple times with different meanings

Example

sql
-- dim_date used multiple times
fct_orders:
  order_key
  order_date_key      ← dim_date (order date)
  ship_date_key       ← dim_date (ship date)
  delivery_date_key   ← dim_date (delivery date)
  amount

Surrogate Keys

Natural Key vs Surrogate Key

sql
-- Natural key: Business identifier
customer_id = 'CUST-12345'

-- Surrogate key: System-generated
customer_key = 1001 (auto-increment or hash)

Why Surrogate Keys?

  • Performance: Integer joins faster than string joins
  • SCD Type 2: Multiple rows for same natural key
  • Independence: Source system can change natural key
  • Consistency: Same format across all tables

Generating Surrogate Keys

sql
-- Auto-increment
customer_key BIGINT AUTO_INCREMENT PRIMARY KEY

-- Hash (dbt)
{{ dbt_utils.surrogate_key(['customer_id', 'effective_date']) }}

-- Sequence (PostgreSQL)
customer_key BIGINT DEFAULT nextval('customer_key_seq')

Data Warehouse Layers

Bronze (Raw)

Purpose: Exact copy of source data
Format: As-is from source
Schema: Source schema
Example: raw_salesforce_accounts

Silver (Cleaned)

Purpose: Cleaned, standardized
Format: Consistent types, naming
Schema: Staging schema
Example: stg_salesforce_accounts

Gold (Analytics)

Purpose: Business-ready
Format: Dimensional models
Schema: Analytics schema
Example: dim_customers, fct_orders

Best Practices

1. Define Grain Clearly

✓ "One row per order"
✗ "Order data"

2. Use Surrogate Keys

✓ customer_key BIGINT
✗ customer_id VARCHAR(50)

3. Denormalize Dimensions

✓ dim_customers includes city, state, country
✗ Separate dim_cities, dim_states, dim_countries

4. Keep Facts Narrow

✓ fct_orders: Keys + measures only
✗ fct_orders: Keys + measures + customer name, product name

5. Use Conformed Dimensions

✓ dim_date shared across all facts
✗ Each fact has own date dimension

Common Patterns

Daily Snapshot

sql
-- Capture state once per day
INSERT INTO fct_inventory_daily
SELECT
  CURRENT_DATE as snapshot_date,
  product_id,
  warehouse_id,
  quantity_on_hand,
  quantity_reserved
FROM current_inventory;

Cumulative Metrics

sql
-- Running totals
SELECT
  order_date,
  SUM(amount) OVER (
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) as cumulative_revenue
FROM fct_orders;

Summary

ELT: Extract → Load → Transform (modern approach)

Dimensional Modeling:

  • Star schema (recommended)
  • Fact tables (measures)
  • Dimension tables (attributes)

Fact Types:

  • Transaction (one per event)
  • Periodic snapshot (one per period)
  • Accumulating snapshot (one per process)

SCD Types:

  • Type 1: Overwrite
  • Type 2: Historical (recommended)
  • Type 3: Previous value

Grain: Level of detail (define clearly!)

Surrogate Keys: Use for performance and SCD

Layers:

  • Bronze: Raw
  • Silver: Cleaned
  • Gold: Analytics-ready

Best Practices:

  • Define grain
  • Use surrogate keys
  • Denormalize dimensions
  • Keep facts narrow
  • Use conformed dimensions

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results