Data Freshness and Latency

Overview

Data Freshness measures how current the data is (age), while Latency measures how long it takes for data to flow through the pipeline (processing time). Both are critical for real-time analytics, operational dashboards, and time-sensitive decision-making.

Core Principle: "Stale data leads to stale decisions. Monitor freshness, optimize latency."

1. Freshness vs. Latency

Metric	Definition	Example	Measurement
Freshness	How old is the data?	Data is 5 minutes old	`NOW() - MAX(event_timestamp)`
Latency	How long does processing take?	Pipeline takes 2 minutes	`processing_end_time - event_timestamp`

Example

Event occurs: 10:00:00
Event arrives in pipeline: 10:00:05 (5 sec ingestion latency)
Processing completes: 10:02:00 (1 min 55 sec processing latency)
Data queried by user: 10:05:00

Freshness at query time: 5 minutes (10:05:00 - 10:00:00)
Total latency: 2 minutes (10:02:00 - 10:00:00 = 5 sec ingestion + 1 min 55 sec processing)
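
The timeline can be checked with plain datetime arithmetic. This is a minimal sketch: the calendar date is an arbitrary assumption, since the example gives only clock times, and the variable names are illustrative.

```python
from datetime import datetime

# Clock times from the example timeline (the date itself is arbitrary).
event_time = datetime(2024, 1, 1, 10, 0, 0)
arrival_time = datetime(2024, 1, 1, 10, 0, 5)
processed_time = datetime(2024, 1, 1, 10, 2, 0)
query_time = datetime(2024, 1, 1, 10, 5, 0)

# Latency describes the pipeline: how long each hop took.
ingestion_latency = arrival_time - event_time
processing_latency = processed_time - arrival_time
total_latency = processed_time - event_time

# Freshness describes the data as seen by a reader: its age at query time.
freshness_at_query = query_time - event_time

print(f"ingestion:  {ingestion_latency.total_seconds():.0f}s")
print(f"processing: {processing_latency.total_seconds():.0f}s")
print(f"total:      {total_latency.total_seconds():.0f}s")
print(f"freshness:  {freshness_at_query.total_seconds():.0f}s")
```

Note that freshness keeps growing after processing finishes, while latency is fixed once the event has been processed.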

2. Freshness Requirements by Use Case

Use Case	Freshness SLO	Acceptable Latency	Example
Real-time fraud detection	< 1 second	< 100ms	Credit card transaction scoring
Live dashboards	< 1 minute	< 10 seconds	Website analytics
Operational metrics	< 5 minutes	< 1 minute	Server health monitoring
Business intelligence	< 1 hour	< 15 minutes	Sales reports
Data warehouse	< 24 hours	< 4 hours	Historical analysis
Compliance reporting	< 7 days	Days	Annual audits

3. Measuring Data Freshness

SQL Freshness Check

sql

-- Check freshness of latest record
SELECT 
    MAX(created_at) as latest_record,
    NOW() as checked_at,
    EXTRACT(EPOCH FROM (NOW() - MAX(created_at))) / 60 as age_minutes
FROM events;

-- Alert if data is stale (> 10 minutes old)
SELECT 
    CASE 
        WHEN MAX(created_at) < NOW() - INTERVAL '10 minutes' 
        THEN 'STALE'
        ELSE 'FRESH'
    END as freshness_status
FROM events;

Python Freshness Monitoring

python

import pandas as pd

def check_freshness(df: pd.DataFrame, timestamp_col: str, max_age_minutes: int = 10) -> dict:
    """Check whether the newest record is within the allowed age"""
    # Compare in UTC so the result does not depend on the server's local time zone.
    # Naive timestamps are assumed to already be in UTC.
    latest_timestamp = pd.to_datetime(df[timestamp_col], utc=True).max()
    age = pd.Timestamp.now(tz='UTC') - latest_timestamp
    age_minutes = age.total_seconds() / 60
    
    is_fresh = age_minutes <= max_age_minutes
    
    return {
        'is_fresh': is_fresh,
        'latest_timestamp': latest_timestamp,
        'age_minutes': age_minutes,
        'threshold_minutes': max_age_minutes
    }

# Usage (df and alert() are placeholders for your own data and alerting hook)
result = check_freshness(df, 'event_time', max_age_minutes=10)
if not result['is_fresh']:
    alert(f"Data is stale: {result['age_minutes']:.1f} minutes old")

dbt Freshness Tests

yaml

# models/sources.yml
version: 2

sources:
  - name: production
    database: analytics
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    
    tables:
      - name: events
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 10, period: minute}
          error_after: {count: 30, period: minute}
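
When `dbt source freshness` runs, the age of the newest `loaded_at_field` value is compared against these two thresholds. The sketch below mirrors that pass/warn/error decision in Python for illustration only; `freshness_status` and its parameters are hypothetical names, not part of dbt.

```python
from datetime import datetime, timedelta, timezone

def freshness_status(loaded_at: datetime, now: datetime,
                     warn_after: timedelta, error_after: timedelta) -> str:
    """Classify a source by the age of its newest record."""
    age = now - loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"

# Thresholds from the events table above: warn after 10 minutes, error after 30.
now = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
warn_after = timedelta(minutes=10)
error_after = timedelta(minutes=30)

for minutes_old in (5, 15, 40):
    loaded_at = now - timedelta(minutes=minutes_old)
    print(minutes_old, freshness_status(loaded_at, now, warn_after, error_after))
```

The table-level thresholds (10 and 30 minutes) override the source-level defaults (12 and 24 hours) for `events` only.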

4. Latency Measurement

End-to-End Pipeline Latency

python

def measure_pipeline_latency(event_id: str):
    """Measure latency from event to availability"""
    
    # Get event timestamp from source
    event_time = get_event_timestamp(event_id)
    
    # Get processing completion time
    processed_time = get_processed_timestamp(event_id)
    
    # Calculate latency
    latency = (processed_time - event_time).total_seconds()
    
    # Track percentiles
    latency_metrics.observe(latency)
    
    return {
        'event_id': event_id,
        'event_time': event_time,
        'processed_time': processed_time,
        'latency_seconds': latency
    }
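
The `latency_metrics.observe(...)` call assumes some metrics backend that derives percentiles. As a rough illustration of why percentiles matter more than the mean, here is a sketch using the nearest-rank method on made-up latency samples; `percentile` is a hypothetical helper and the sample values are not measurements.

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Nine fast events and one slow outlier (seconds); values are invented.
latencies = [0.8, 1.2, 0.9, 1.1, 30.0, 1.0, 0.7, 1.3, 0.95, 1.05]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
mean = sum(latencies) / len(latencies)

print(f"p50={p50}s p95={p95}s p99={p99}s mean={mean:.2f}s")
```

Here the median stays near one second while the tail percentiles expose the outlier, which is the behavior a latency SLO usually cares about.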

Per-Stage Latency Tracking

python

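import time

class PipelineStage:
    """Context manager that times one pipeline stage"""

    def __init__(self, name: str):
        self.name = name
        self.start_time = None
        self.end_time = None
    
    def __enter__(self):
        # perf_counter() is monotonic, so the duration is unaffected by system clock changes
        self.start_time = time.perf_counter()
        return self
    
    def __exit__(self, *args):
        self.end_time = time.perf_counter()
        latency = self.end_time - self.start_time
        
        # Log to monitoring (log_metric is a placeholder for your metrics client)
        log_metric(f'pipeline.{self.name}.latency', latency)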

# Usage
with PipelineStage('ingestion'):
    ingest_data()

with PipelineStage('transformation'):
    transform_data()

with PipelineStage('loading'):
    load_data()

Prometheus Metrics

python

from prometheus_client import Histogram

# Define latency histogram
pipeline_latency = Histogram(
    'pipeline_latency_seconds',
    'Time taken for data to flow through pipeline',
    ['stage', 'source'],
    buckets=[0.1, 0.5, 1, 5, 10, 30, 60, 300]  # seconds
)

# Record latency
with pipeline_latency.labels(stage='transform', source='kafka').time():
    transform_data()
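
Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and a final `+Inf` bucket counts everything. The pure-Python sketch below illustrates that counting rule with the bucket bounds from the example; `cumulative_bucket_counts` is a hypothetical helper, not part of `prometheus_client`.

```python
BUCKETS = [0.1, 0.5, 1, 5, 10, 30, 60, 300]  # seconds, as in the histogram above

def cumulative_bucket_counts(observations, buckets=BUCKETS):
    """Count observations per upper bound, the way a cumulative histogram does."""
    bounds = list(buckets) + [float("inf")]
    return {bound: sum(1 for value in observations if value <= bound)
            for bound in bounds}

# Invented latencies in seconds; 400s exceeds the largest finite bucket.
counts = cumulative_bucket_counts([0.05, 0.3, 0.7, 2, 45, 400])
for bound, count in counts.items():
    print(f"le={bound}: {count}")
```

An observation above the largest finite bound only shows up in `+Inf`, so the bounds should bracket the latency SLO you intend to alert on.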

5. Freshness Monitoring and Alerting

Automated Freshness Checks

python

import schedule
import time

def monitor_freshness():
    """Continuously monitor data freshness"""
    tables = ['events', 'users', 'orders']
    
    for table in tables:
        freshness = check_table_freshness(table)
        
        if not freshness['is_fresh']:
            alert(
                severity='warning',
                message=f"Table {table} is stale: {freshness['age_minutes']} minutes old",
                threshold=freshness['threshold_minutes']
            )

# Run every 5 minutes
schedule.every(5).minutes.do(monitor_freshness)

while True:
    schedule.run_pending()
    time.sleep(60)
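
The monitor above calls `check_table_freshness`, which is not defined in this document. One possible shape is sketched below against an in-memory SQLite database. The signature, the `created_at` column, and the ISO-8601 UTC string storage are all assumptions; a real warehouse client would differ.

```python
import sqlite3
from datetime import datetime, timezone

def check_table_freshness(conn, table, timestamp_col="created_at",
                          max_age_minutes=10, now=None):
    """Return a freshness report for the newest row in a table."""
    now = now or datetime.now(timezone.utc)
    # Identifiers cannot be bound as query parameters, so table and column
    # names must come from trusted configuration, never from user input.
    row = conn.execute(f"SELECT MAX({timestamp_col}) FROM {table}").fetchone()
    if row[0] is None:
        # An empty table has no newest record, so treat it as stale.
        return {"is_fresh": False, "latest_timestamp": None,
                "age_minutes": None, "threshold_minutes": max_age_minutes}
    latest = datetime.fromisoformat(row[0])
    age_minutes = (now - latest).total_seconds() / 60
    return {"is_fresh": age_minutes <= max_age_minutes,
            "latest_timestamp": latest,
            "age_minutes": age_minutes,
            "threshold_minutes": max_age_minutes}

# Demo with fixed timestamps so the result is reproducible.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (created_at TEXT)")
conn.executemany("INSERT INTO events VALUES (?)",
                 [("2024-01-01T09:50:00+00:00",), ("2024-01-01T09:56:00+00:00",)])
report = check_table_freshness(
    conn, "events", now=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc))
print(report["is_fresh"], report["age_minutes"])
```

Passing `now` explicitly keeps the check testable; in production it would default to the current UTC time.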

Watermark Tracking

python

from datetime import datetime, timedelta

class WatermarkTracker:
    """Track high-water mark for streaming data"""
    
    def __init__(self, table_name: str):
        self.table_name = table_name
        self.watermark = self.load_watermark()
    
    def load_watermark(self) -> datetime:
        """Load last processed timestamp"""
        result = db.execute(
            f"SELECT MAX(processed_at) FROM {self.table_name}_watermark"
        ).fetchone()
        return result[0] if result[0] else datetime.min
    
    def update_watermark(self, timestamp: datetime):
        """Update watermark after processing"""
        db.execute(
            f"INSERT INTO {self.table_name}_watermark (processed_at) VALUES (%s)",
            (timestamp,)
        )
        self.watermark = timestamp
    
    def get_lag(self) -> timedelta:
        """Get lag between watermark and current time"""
        return datetime.now() - self.watermark
    
    def is_lagging(self, threshold_minutes: int = 10) -> bool:
        """Check if processing is lagging"""
        lag_minutes = self.get_lag().total_seconds() / 60
        return lag_minutes > threshold_minutes
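
The tracker stores and reports the watermark but does not show how a batch uses it. A minimal in-memory sketch of that step follows: select only records newer than the watermark, then advance the watermark to the newest record processed. `select_new_records` and the record layout are hypothetical.

```python
from datetime import datetime

def select_new_records(records, watermark):
    """Return records newer than the watermark and the advanced watermark."""
    new_records = [r for r in records if r["event_time"] > watermark]
    # If nothing is new, the watermark stays where it was.
    new_watermark = max((r["event_time"] for r in new_records), default=watermark)
    return new_records, new_watermark

records = [
    {"id": 1, "event_time": datetime(2024, 1, 1, 10, 0)},
    {"id": 2, "event_time": datetime(2024, 1, 1, 10, 1)},
    {"id": 3, "event_time": datetime(2024, 1, 1, 10, 2)},
]

watermark = datetime(2024, 1, 1, 10, 0)  # record 1 was processed earlier
batch, watermark = select_new_records(records, watermark)
print([r["id"] for r in batch], watermark)

# Re-running against the same input processes nothing a second time.
second_batch, watermark = select_new_records(records, watermark)
print(second_batch)
```

The strict `>` comparison avoids reprocessing, but it can skip late records that share the watermark's exact timestamp, so some pipelines subtract a small overlap and deduplicate by key instead.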

6. Improving Freshness

Change Data Capture (CDC)

python

# Debezium CDC example
# Instead of batch ETL every hour, stream changes in real-time

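from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'dbserver1.inventory.customers',
    bootstrap_servers=['localhost:9092'],
    # Deletes are followed by tombstone records with a null value, so guard against None
    value_deserializer=lambda m: json.loads(m.decode('utf-8')) if m is not None else None
)

for message in consumer:
    if message.value is None:
        continue  # Tombstone record, nothing to apply
    
    # With the JSON converter and schemas enabled, the event is nested under 'payload'
    change_event = message.value.get('payload', message.value)
    
    if change_event['op'] in ('c', 'r'):  # Create, or row read during the initial snapshot
        insert_to_warehouse(change_event['after'])
    elif change_event['op'] == 'u':  # Update
        update_warehouse(change_event['after'])
    elif change_event['op'] == 'd':  # Delete
        delete_from_warehouse(change_event['before'])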

Incremental Updates

sql

-- Instead of full table refresh
-- DELETE FROM target_table;
-- INSERT INTO target_table SELECT * FROM source_table;

-- Use incremental update
-- COALESCE handles the first run, when target_table is empty and MAX() returns NULL
INSERT INTO target_table
SELECT * FROM source_table
WHERE updated_at > (SELECT COALESCE(MAX(updated_at), '-infinity') FROM target_table)
ON CONFLICT (id) DO UPDATE SET
    column1 = EXCLUDED.column1,
    updated_at = EXCLUDED.updated_at;
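
To see the incremental pattern run end to end, the sketch below applies the same idea to an in-memory SQLite database (which supports `ON CONFLICT ... DO UPDATE` from version 3.24). The table layout and the ISO-8601 text timestamps are assumptions made for the demo, and the empty-string fallback in `COALESCE` is a SQLite-specific stand-in for "earlier than any timestamp".

```python
import sqlite3

# Copy only rows newer than the target's high-water mark, updating on key conflict.
INCREMENTAL_UPSERT = """
INSERT INTO target_table (id, column1, updated_at)
SELECT id, column1, updated_at FROM source_table
WHERE updated_at > (SELECT COALESCE(MAX(updated_at), '') FROM target_table)
ON CONFLICT (id) DO UPDATE SET
    column1 = excluded.column1,
    updated_at = excluded.updated_at
"""

def sync(conn: sqlite3.Connection) -> None:
    """Run one incremental sync from source_table into target_table."""
    conn.execute(INCREMENTAL_UPSERT)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_table (id INTEGER PRIMARY KEY, column1 TEXT, updated_at TEXT);
CREATE TABLE target_table (id INTEGER PRIMARY KEY, column1 TEXT, updated_at TEXT);
""")
conn.executemany(
    "INSERT INTO source_table VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01T10:00:00"), (2, "b", "2024-01-01T10:05:00")],
)
sync(conn)  # first run: the target is empty, so every source row is copied

# Later, one row changes and one row is added in the source.
conn.execute("UPDATE source_table SET column1 = 'a2', "
             "updated_at = '2024-01-01T10:10:00' WHERE id = 1")
conn.execute("INSERT INTO source_table VALUES (3, 'c', '2024-01-01T10:11:00')")
sync(conn)  # second run: only rows with a newer updated_at are touched

rows = conn.execute("SELECT id, column1 FROM target_table ORDER BY id").fetchall()
print(rows)
```

This pattern relies on `updated_at` being set on every write and does not propagate deletes from the source, which is one reason CDC is often preferred when deletes matter.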

Parallel Processing

python

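from concurrent.futures import ThreadPoolExecutor
import numpy as np
import pandas as pd

def process_partition(partition_df: pd.DataFrame):
    """Process a partition of data"""
    # Transform and load
    transformed = transform(partition_df)
    load_to_warehouse(transformed)

# Split row positions into 10 roughly equal chunks, then slice the DataFrame
chunks = np.array_split(np.arange(len(large_df)), 10)
partitions = [large_df.iloc[chunk] for chunk in chunks]

# Process in parallel (threads help when the work is I/O-bound, such as warehouse loads)
with ThreadPoolExecutor(max_workers=10) as executor:
    # list() consumes the results so that exceptions raised in workers surface here
    list(executor.map(process_partition, partitions))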

7. Trade-offs

Freshness vs. Cost

Real-time streaming (< 1 min freshness):
- Cost: $$$$ (Kafka, Flink, dedicated infrastructure)
- Use when: Fraud detection, live dashboards

Micro-batch (5-15 min freshness):
- Cost: $$ (Spark Streaming, scheduled jobs)
- Use when: Operational metrics, near-real-time analytics

Batch (hourly/daily freshness):
- Cost: $ (Airflow, cron jobs)
- Use when: Reporting, historical analysis
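
The tiers above can be read as a simple decision rule keyed on the freshness target. The sketch below encodes one such rule. The tiers leave gaps (between 1 and 5 minutes, and between 15 minutes and hourly), so the exact cut-offs used here are assumptions, and `suggest_processing_mode` is a hypothetical helper.

```python
from datetime import timedelta

def suggest_processing_mode(freshness_target: timedelta) -> str:
    """Map a freshness target to the cheapest tier that can plausibly meet it."""
    if freshness_target < timedelta(minutes=1):
        return "streaming"      # highest cost: dedicated streaming infrastructure
    if freshness_target <= timedelta(minutes=15):
        return "micro-batch"    # moderate cost: frequent scheduled jobs
    return "batch"              # lowest cost: hourly or daily jobs

for target in (timedelta(seconds=1), timedelta(minutes=5), timedelta(hours=24)):
    print(target, "->", suggest_processing_mode(target))
```

Starting from the loosest target the use case tolerates keeps cost down; tightening a target later usually means moving up a tier.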

Freshness vs. Completeness

python

# Trade-off: Wait for all data vs. process what we have

import time

def process_with_timeout(timeout_seconds: int = 300, poll_interval_seconds: float = 1.0):
    """Process data with timeout to ensure freshness"""
    start_time = time.monotonic()  # Monotonic clock: unaffected by system clock changes
    data_buffer = []
    
    while time.monotonic() - start_time < timeout_seconds:
        new_data = fetch_data()
        data_buffer.extend(new_data)
        
        if is_complete(data_buffer):
            break  # Got all data
        
        time.sleep(poll_interval_seconds)  # Avoid a busy loop between polls
    
    # Process what we have, even if incomplete
    if len(data_buffer) > 0:
        process(data_buffer)
    else:
        alert("No data received within timeout")

8. Freshness SLAs and SLOs

Defining SLOs

yaml

# data_freshness_slos.yml
services:
  - name: user_events
    freshness_slo:
      target: 95  # 95% of data should be fresh
      threshold: 5  # within 5 minutes
      measurement_window: 1h
    
  - name: order_analytics
    freshness_slo:
      target: 99
      threshold: 15  # within 15 minutes
      measurement_window: 24h
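
An SLO of this form is met when the share of records arriving within `threshold` minutes is at least `target` percent over the measurement window. The sketch below evaluates that rule on a list of per-record delays; `slo_compliance` is a hypothetical helper, the delays are invented, and it assumes the delays were already collected for a single window.

```python
def slo_compliance(ages_minutes, threshold_minutes, target_percent):
    """Return the share of records within the threshold and whether the SLO is met."""
    if not ages_minutes:
        return None  # No data in the window: compliance is undefined
    within = sum(1 for age in ages_minutes if age <= threshold_minutes)
    percent = 100 * within / len(ages_minutes)
    return {"percent_within_threshold": percent,
            "meets_slo": percent >= target_percent}

# 19 records arrived within a minute, one took 20 minutes (invented values).
delays = [1] * 19 + [20]

user_events = slo_compliance(delays, threshold_minutes=5, target_percent=95)
order_analytics = slo_compliance(delays, threshold_minutes=15, target_percent=99)
print(user_events)
print(order_analytics)
```

The same delays satisfy the 95% target but miss the 99% target, which shows that the target percentage, not only the threshold, determines how much tail latency an SLO tolerates.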

Measuring SLO Compliance

python

def calculate_freshness_slo(table_name: str, threshold_minutes: int,
                            window_hours: int = 1, target_percent: float = 95.0):
    """Calculate % of rows in the window that are newer than the freshness threshold"""
    
    # table_name is interpolated into the SQL, so it must come from trusted configuration
    query = f"""
    SELECT 
        COUNT(*) FILTER (
            WHERE created_at > NOW() - INTERVAL '{threshold_minutes} minutes'
        )::FLOAT / NULLIF(COUNT(*), 0) * 100 as freshness_percent
    FROM {table_name}
    WHERE created_at > NOW() - INTERVAL '{window_hours} hours'
    """
    
    result = db.execute(query).fetchone()
    freshness_percent = result[0]  # None when the window contains no rows
    
    return {
        'table': table_name,
        'freshness_percent': freshness_percent,
        'threshold_minutes': threshold_minutes,
        'meets_slo': freshness_percent is not None and freshness_percent >= target_percent
    }

9. Tools for Freshness Monitoring

Monte Carlo Freshness Checks

yaml

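# Monte Carlo automatically monitors freshness
# Illustrative monitor definition; field names may differ from Monte Carlo's actual schema
monitors:
  - type: freshness
    table: production.events
    field: created_at
    threshold: 10 minutes
    alert:
      - slack: "#data-alerts"  # Quoted: an unquoted # would start a YAML comment
      - pagerduty: data-team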

Custom Grafana Dashboard

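promql

# Prometheus query for freshness
# (assumes event_timestamp is exported per table as a Unix-time gauge)
time() - max(event_timestamp) by (table)

yaml

# Alerting rule in the Prometheus 2.x rule-file format
groups:
  - name: data_freshness
    rules:
      - alert: DataStale
        # Aggregate by table so that {{ $labels.table }} is available in annotations
        expr: (time() - max by (table) (event_timestamp)) > 600  # 10 minutes
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Data is stale in {{ $labels.table }}"
          description: "Latest data is {{ $value }}s old"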

10. Handling Stale Data

Fallback to Cached Data

python

def get_data_with_fallback(cache_ttl_minutes: int = 60):
    """Get fresh data or fall back to cache"""
    
    # Try to get fresh data
    fresh_data = fetch_from_warehouse()
    freshness = check_freshness(fresh_data, 'updated_at', max_age_minutes=10)
    
    if freshness['is_fresh']:
        # Update cache
        cache.set('latest_data', fresh_data, ttl=cache_ttl_minutes * 60)
        return fresh_data
    else:
        # Fall back to cache
        cached_data = cache.get('latest_data')
        # Compare with None: the truth value of a DataFrame is ambiguous and raises ValueError
        if cached_data is not None:
            logger.warning("Using cached data (warehouse data is stale)")
            return cached_data
        else:
            raise DataUnavailableError("No fresh or cached data available")

Display Staleness to Users

python

def get_dashboard_data():
    """Get data with freshness indicator"""
    data = fetch_data()
    freshness = check_freshness(data, 'event_time')
    
    return {
        'data': data,
        'metadata': {
            'last_updated': freshness['latest_timestamp'],
            'age_minutes': freshness['age_minutes'],
            'is_fresh': freshness['is_fresh'],
            'warning': f"Data is {freshness['age_minutes']:.0f} minutes old" if not freshness['is_fresh'] else None
        }
    }

11. Real Freshness Issues

Case Study: The Stale Dashboard

Problem: Executive dashboard showing yesterday's revenue
Root Cause: ETL job failed at 2 AM, no alerting on freshness
Impact: Wrong business decisions made based on stale data
Solution: Added freshness monitoring with PagerDuty alerts
Prevention: Implemented SLO tracking and automated freshness tests

Case Study: The Slow Pipeline

Problem: Real-time fraud detection taking 5 minutes (SLO: < 1 second)
Root Cause: Single-threaded processing, no partitioning
Solution: Implemented Kafka partitioning and parallel consumers
Result: Latency reduced from 5 minutes to 200ms

12. Data Freshness Checklist

SLOs Defined: Do we have freshness SLOs for each critical table?
Monitoring: Are we continuously monitoring freshness?
Alerting: Do we get alerted when data goes stale?
Latency Tracking: Are we measuring P50/P95/P99 latencies?
Optimization: Have we optimized for our freshness requirements?
Fallbacks: Do we have fallback strategies for stale data?
User Communication: Do we show data age to end users?
SLO Compliance: Are we meeting our freshness SLOs > 95% of the time?

Related Skills

43-data-reliability/data-quality-monitoring
43-data-reliability/data-contracts
42-cost-engineering/infra-sizing
