Agent skill
ELT Modeling
Comprehensive guide to ELT (Extract, Load, Transform) modeling patterns, dimensional modeling, fact and dimension tables, and data warehouse design
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/elt-modeling
SKILL.md
ELT Modeling
ELT vs ETL
ETL (Traditional)
Extract → Transform → Load
1. Extract from source
2. Transform in ETL tool (Informatica, Talend)
3. Load to warehouse
Pros: Clean data before loading
Cons: Slow, expensive transformation servers
ELT (Modern)
Extract → Load → Transform
1. Extract from source
2. Load raw data to warehouse
3. Transform in warehouse (SQL, dbt)
Pros: Fast loading, leverage warehouse power
Cons: Raw data in warehouse (need governance)
Why ELT?
- Cloud warehouses: Snowflake, BigQuery, Redshift (powerful, scalable)
- Separation of concerns: EL (Fivetran, Airbyte) + T (dbt)
- Flexibility: Transform multiple ways from same raw data
- Speed: Parallel processing in warehouse
Dimensional Modeling
Star Schema
dim_customers
|
dim_products
|
fct_orders ← Central fact table
|
dim_dates
|
dim_locations
Characteristics:
- One fact table (center)
- Multiple dimension tables (points)
- Denormalized (fast queries)
Snowflake Schema
dim_customers
|
dim_customer_segments
|
fct_orders
|
dim_products
|
dim_product_categories
Characteristics:
- Normalized dimensions
- Less redundancy
- More joins (slower queries)
Recommendation: Use star schema for analytics (faster)
Fact Tables
Definition
Fact table: Stores measurable events (transactions, orders, clicks)
Characteristics
- Large: Millions to billions of rows
- Numeric measures: Amounts, quantities, counts
- Foreign keys: Links to dimensions
- Grain: Level of detail (one row per order, per day, etc.)
Types of Facts
Transaction Facts:
-- One row per transaction
fct_orders:
order_id (PK)
customer_id (FK)
product_id (FK)
order_date_id (FK)
quantity
amount
Periodic Snapshot Facts:
-- One row per period (daily, monthly)
fct_inventory_daily:
date_id (PK)
product_id (PK)
warehouse_id (PK)
quantity_on_hand
quantity_sold
Accumulating Snapshot Facts:
-- One row per process (order lifecycle)
fct_order_lifecycle:
order_id (PK)
order_date
payment_date
shipment_date
delivery_date
days_to_ship
days_to_deliver
Fact Table Design
CREATE TABLE fct_orders (
-- Surrogate key
order_key BIGINT PRIMARY KEY,
-- Natural key
order_id VARCHAR(50) NOT NULL,
-- Foreign keys (dimensions)
customer_key BIGINT NOT NULL,
product_key BIGINT NOT NULL,
date_key INT NOT NULL,
-- Degenerate dimensions (no separate dim table)
order_number VARCHAR(50),
-- Measures
quantity INT,
unit_price DECIMAL(10,2),
total_amount DECIMAL(10,2),
discount_amount DECIMAL(10,2),
-- Audit columns
created_at TIMESTAMP,
updated_at TIMESTAMP
);
Dimension Tables
Definition
Dimension table: Stores descriptive attributes (who, what, where, when)
Characteristics
- Small: Thousands to millions of rows
- Descriptive: Text, categories, hierarchies
- Slowly changing: Updates over time (SCD)
Dimension Design
CREATE TABLE dim_customers (
-- Surrogate key
customer_key BIGINT PRIMARY KEY,
-- Natural key
customer_id VARCHAR(50) NOT NULL,
-- Attributes
customer_name VARCHAR(255),
email VARCHAR(255),
phone VARCHAR(50),
-- Hierarchies
city VARCHAR(100),
state VARCHAR(100),
country VARCHAR(100),
region VARCHAR(100),
-- Segments
customer_segment VARCHAR(50),
customer_tier VARCHAR(50),
-- SCD Type 2 columns
effective_date DATE,
expiration_date DATE,
is_current BOOLEAN,
-- Audit columns
created_at TIMESTAMP,
updated_at TIMESTAMP
);
Slowly Changing Dimensions (SCD)
Type 1: Overwrite
-- Customer moves, update address
UPDATE dim_customers
SET city = 'New York', state = 'NY'
WHERE customer_id = '123';
-- No history preserved
Type 2: Add New Row (Historical)
-- Customer moves, add new row
INSERT INTO dim_customers (
customer_id, city, state,
effective_date, is_current
) VALUES (
'123', 'New York', 'NY',
'2024-01-15', TRUE
);
-- Mark old row as not current
UPDATE dim_customers
SET is_current = FALSE,
expiration_date = '2024-01-14'
WHERE customer_id = '123'
AND is_current = TRUE;
-- History preserved
Type 3: Add New Column
-- Track previous value
ALTER TABLE dim_customers
ADD COLUMN previous_city VARCHAR(100);
UPDATE dim_customers
SET previous_city = city,
city = 'New York'
WHERE customer_id = '123';
Grain Definition
What is Grain?
Grain: Level of detail in fact table (one row represents what?)
Examples
Order grain: One row per order
Order line grain: One row per product in order
Daily grain: One row per day per product
Hourly grain: One row per hour per product
Choosing Grain
Too fine (order line):
- More rows
- More storage
- More flexibility
Too coarse (monthly):
- Fewer rows
- Less storage
- Less flexibility
Balance: Choose finest grain needed for analysis
Kimball vs Inmon
Kimball (Bottom-Up, Dimensional)
Data Marts (Star Schemas) → Enterprise Data Warehouse
Pros:
- Fast to implement
- Business-friendly (denormalized)
- Query performance
Cons:
- Data redundancy
- Hard to change
Inmon (Top-Down, Normalized)
Enterprise Data Warehouse (3NF) → Data Marts
Pros:
- Single source of truth
- Flexible
- Less redundancy
Cons:
- Slow to implement
- Complex queries (many joins)
Modern Approach: Hybrid (normalized staging, dimensional marts)
Data Vault
Components
Hubs: Business keys (customers, products)
Links: Relationships (orders link customers + products)
Satellites: Descriptive attributes (customer details)
Example
-- Hub: Customer
CREATE TABLE hub_customer (
customer_hub_key BIGINT PRIMARY KEY,
customer_id VARCHAR(50) UNIQUE,
load_date TIMESTAMP,
record_source VARCHAR(50)
);
-- Satellite: Customer Details
CREATE TABLE sat_customer_details (
customer_hub_key BIGINT,
load_date TIMESTAMP,
customer_name VARCHAR(255),
email VARCHAR(255),
phone VARCHAR(50),
PRIMARY KEY (customer_hub_key, load_date)
);
-- Link: Order
CREATE TABLE link_order (
order_link_key BIGINT PRIMARY KEY,
customer_hub_key BIGINT,
product_hub_key BIGINT,
load_date TIMESTAMP
);
Pros:
- Audit trail
- Flexible
- Handles source changes
Cons:
- Complex
- Many joins
- Steep learning curve
Conformed Dimensions
Definition
Conformed dimension: Shared across multiple fact tables
Example
-- dim_date used by multiple facts
dim_date ← fct_orders
← fct_shipments
← fct_returns
-- Ensures consistent date attributes
Benefits
- Consistency: Same date logic everywhere
- Drill-across: Compare metrics across facts
- Reusability: Build once, use many times
Bridge Tables
Many-to-Many Relationships
-- Customer can have multiple accounts
-- Account can have multiple customers
CREATE TABLE bridge_customer_account (
customer_key BIGINT,
account_key BIGINT,
allocation_percentage DECIMAL(5,2),
PRIMARY KEY (customer_key, account_key)
);
Junk Dimensions
Definition
Junk dimension: Collection of low-cardinality flags
Example
-- Instead of many boolean columns in fact
CREATE TABLE dim_order_flags (
order_flag_key INT PRIMARY KEY,
is_gift BOOLEAN,
is_express_shipping BOOLEAN,
is_first_order BOOLEAN,
has_discount BOOLEAN
);
-- Fact table references junk dimension
fct_orders:
order_key
customer_key
order_flag_key ← Reference to junk dimension
amount
Role-Playing Dimensions
Definition
Role-playing dimension: Same dimension used multiple times with different meanings
Example
-- dim_date used multiple times
fct_orders:
order_key
order_date_key ← dim_date (order date)
ship_date_key ← dim_date (ship date)
delivery_date_key ← dim_date (delivery date)
amount
Surrogate Keys
Natural Key vs Surrogate Key
-- Natural key: Business identifier
customer_id = 'CUST-12345'
-- Surrogate key: System-generated
customer_key = 1001 (auto-increment or hash)
Why Surrogate Keys?
- Performance: Integer joins faster than string joins
- SCD Type 2: Multiple rows for same natural key
- Independence: Source system can change natural key
- Consistency: Same format across all tables
Generating Surrogate Keys
-- Auto-increment
customer_key BIGINT AUTO_INCREMENT PRIMARY KEY
-- Hash (dbt)
{{ dbt_utils.surrogate_key(['customer_id', 'effective_date']) }}
-- Sequence (PostgreSQL)
customer_key BIGINT DEFAULT nextval('customer_key_seq')
Data Warehouse Layers
Bronze (Raw)
Purpose: Exact copy of source data
Format: As-is from source
Schema: Source schema
Example: raw_salesforce_accounts
Silver (Cleaned)
Purpose: Cleaned, standardized
Format: Consistent types, naming
Schema: Staging schema
Example: stg_salesforce_accounts
Gold (Analytics)
Purpose: Business-ready
Format: Dimensional models
Schema: Analytics schema
Example: dim_customers, fct_orders
Best Practices
1. Define Grain Clearly
✓ "One row per order"
✗ "Order data"
2. Use Surrogate Keys
✓ customer_key BIGINT
✗ customer_id VARCHAR(50)
3. Denormalize Dimensions
✓ dim_customers includes city, state, country
✗ Separate dim_cities, dim_states, dim_countries
4. Keep Facts Narrow
✓ fct_orders: Keys + measures only
✗ fct_orders: Keys + measures + customer name, product name
5. Use Conformed Dimensions
✓ dim_date shared across all facts
✗ Each fact has own date dimension
Common Patterns
Daily Snapshot
-- Capture state once per day
INSERT INTO fct_inventory_daily
SELECT
CURRENT_DATE as snapshot_date,
product_id,
warehouse_id,
quantity_on_hand,
quantity_reserved
FROM current_inventory;
Cumulative Metrics
-- Running totals
SELECT
order_date,
SUM(amount) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as cumulative_revenue
FROM fct_orders;
Summary
ELT: Extract → Load → Transform (modern approach)
Dimensional Modeling:
- Star schema (recommended)
- Fact tables (measures)
- Dimension tables (attributes)
Fact Types:
- Transaction (one per event)
- Periodic snapshot (one per period)
- Accumulating snapshot (one per process)
SCD Types:
- Type 1: Overwrite
- Type 2: Historical (recommended)
- Type 3: Previous value
Grain: Level of detail (define clearly!)
Surrogate Keys: Use for performance and SCD
Layers:
- Bronze: Raw
- Silver: Cleaned
- Gold: Analytics-ready
Best Practices:
- Define grain
- Use surrogate keys
- Denormalize dimensions
- Keep facts narrow
- Use conformed dimensions
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
Didn't find tool you were looking for?