PolicyEngine Simulation Mechanics

This skill covers advanced patterns for working with policyengine.py simulations, including caching, result access, and entity mapping.

CRITICAL: Environment Setup

Before writing any code, check the environment. The policyengine.py package must be installed in the project's .venv.

bash

# Always run from the policyengine.py repo root:
cd /path/to/policyengine.py
uv run python script.py

# Or activate first:
source .venv/bin/activate
python script.py

# NEVER use bare `pip install` — always:
uv pip install -e ".[uk]"   # for UK work
uv pip install -e ".[us]"   # for US work

If from policyengine.core import Simulation fails:

bash

cd /path/to/policyengine.py
uv pip install -e ".[uk]"
# Then re-run with: uv run python script.py

CRITICAL: Correct Import Paths

Only these imports exist — do not guess others:

python

# Core simulation
from policyengine.core import Simulation

# UK model
from policyengine.tax_benefit_models.uk import (
    uk_latest,          # The model version (pass as tax_benefit_model_version=)
    uk_model,           # The model itself
    PolicyEngineUKDataset,
    UKYearData,
    create_datasets,    # Create & cache datasets from HF source
    load_datasets,      # Load cached datasets from disk
    ensure_datasets,    # Create if missing, load if present (recommended)
)

# US model
from policyengine.tax_benefit_models.us import (
    us_latest,
    PolicyEngineUSDataset,
    ensure_datasets,
)

# Outputs
from policyengine.outputs.aggregate import Aggregate, AggregateType
from policyengine.outputs.change_aggregate import ChangeAggregate, ChangeAggregateType

# Plotting
from policyengine.utils.plotting import COLORS, format_fig

There is NO:

policyengine.core.dataset_registry
policyengine.datasets
policyengine.core.dataset_version.DatasetVersion.list()

UK Datasets

Loading UK datasets

Use ensure_datasets() — it returns a dict[str, PolicyEngineUKDataset], building files in ./data/ on first run and loading from disk on subsequent runs.

WARNING: from policyengine.tax_benefit_models.uk import datasets gives you the Python submodule, not a dict. Never index it like a dict.

python

from policyengine.tax_benefit_models.uk import ensure_datasets

uk = ensure_datasets(
    datasets=[
        "hf://policyengine/policyengine-uk-data/frs_2023_24.h5",
        "hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5",
    ],
    years=[2026],
    data_folder="./data",
)

efrs = uk["enhanced_frs_2023_24_2026"]
frs  = uk["frs_2023_24_2026"]

Dict key format: "{stem}_{year}" e.g. "enhanced_frs_2023_24_2026"

To force regeneration: delete ./data/ and call ensure_datasets() again.

Loading US datasets

python

from policyengine.tax_benefit_models.us import ensure_datasets

us = ensure_datasets(
    datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
    years=[2026],
    data_folder="./data",
)

ecps = us["enhanced_cps_2024_2026"]

Default US dataset: enhanced_cps_2024.h5 (Enhanced CPS), years 2024–2028.

Inspecting available variables

Always inspect the dataset to find available variable names — never guess:

python

from policyengine.tax_benefit_models.uk import ensure_datasets

uk = ensure_datasets(years=[2026], data_folder="./data")
d = uk["enhanced_frs_2023_24_2026"]

# Input variables (present in raw data)
print("household:", list(d.data.household.columns))
print("person:   ", list(d.data.person.columns))
print("benunit:  ", list(d.data.benunit.columns))

Input variables are what's in the raw survey data — demographics, reported incomes, consumption, wealth, flags.

Computed variables (household_net_income, income_tax, universal_credit, etc.) are not in the raw dataset — they are calculated by the simulation. To see what's available after running:

python

from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest

sim = Simulation(dataset=d, tax_benefit_model_version=uk_latest)
sim.run()

print("household (post-sim):", list(sim.output_dataset.data.household.columns))
print("person (post-sim):   ", list(sim.output_dataset.data.person.columns))

The computed variables available are defined by uk_latest.entity_variables — inspect this to see the full list without running a simulation:

python

from policyengine.tax_benefit_models.uk import uk_latest
print(uk_latest.entity_variables)  # dict: entity → [variable names]

For Analysts: Core Concepts

When running simulations with policyengine.py (the microsimulation package, not the API client), you work with three key components:

Simulation.ensure() - Smart caching to avoid redundant computation
simulation.output_dataset.data - Accessing calculated results
map_to_entity() - Converting data between entity levels (person ↔ household)

Note: This is for microsimulation with policyengine.py, not the policyengine Python API client (which uses Simulation(situation=...)).

Simulation Lifecycle

The Four Methods

python

from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest

simulation = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
)

# Method 1: Always run (no caching)
simulation.run()

# Method 2: Run only if needed (recommended)
simulation.ensure()

# Method 3: Save results to disk
simulation.save()

# Method 4: Load results from disk
simulation.load()

When to Use Each

run(): Use when you need fresh results or parameters changed
ensure(): Use for iterative development (checks cache → disk → run)
save(): Use to persist large simulation results
load(): Use to resume from previous session

How ensure() Works

python

def ensure(self):
    # 1. Check in-memory LRU cache (100 simulations)
    cached = _cache.get(self.id)
    if cached:
        self.output_dataset = cached.output_dataset
        return

    # 2. Try loading from disk
    try:
        self.tax_benefit_model_version.load(self)
    except Exception:
        # 3. Only run if both cache and disk fail
        self.run()
        self.save()

    # 4. Add to cache for next ensure() call
    _cache.add(self.id, self)

Performance impact:

First call: Full simulation runtime (seconds to minutes)
Same session: Instant (in-memory cache)
New session: Fast (disk load, no recomputation)

Example: Reusing Baseline Across Reforms

python

# Run baseline once
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()  # First call: runs simulation
baseline.save()    # Persist to disk

# Test multiple reforms
for reform in [reform1, reform2, reform3]:
    baseline.ensure()  # Instant from cache!

    reform_sim = Simulation(
        dataset=dataset,
        tax_benefit_model_version=uk_latest,
        policy=reform
    )
    reform_sim.run()  # Only reform needs to run

    # Compare results...

Accessing Results: output_dataset.data

After running a simulation, all calculated variables are in simulation.output_dataset.data.

Structure (UK Example)

python

simulation.run()

# Access output container
output = simulation.output_dataset.data

# Entity-level MicroDataFrames
output.person      # Person-level results
output.benunit     # Benefit unit results
output.household   # Household-level results

US Entity Structure

python

# US has more entities
output.person
output.tax_unit       # Federal tax filing unit
output.spm_unit       # Supplemental Poverty Measure unit
output.family         # Census family definition
output.marital_unit   # Married couple or single
output.household

Available Variables

Each dataframe contains input variables + calculated variables:

python

# Person-level (UK)
print(output.person.columns)
# ['person_id', 'person_household_id', 'age', 'employment_income',
#  'income_tax', 'national_insurance', 'net_income', ...]

# Household-level (UK)
print(output.household.columns)
# ['household_id', 'region', 'rent', 'household_net_income',
#  'household_benefits', 'household_tax', ...]

# Benunit-level (UK)
print(output.benunit.columns)
# ['benunit_id', 'universal_credit', 'child_benefit',
#  'working_tax_credit', 'child_tax_credit', ...]

Direct Data Access

python

# Get specific columns
incomes = output.household[["household_id", "household_net_income"]]

# Filter data
high_earners = output.person[output.person["employment_income"] > 100000]

# Calculate statistics (automatically weighted!)
mean_income = output.household["household_net_income"].mean()
total_tax = output.household["household_tax"].sum()

# Access individual values
first_hh_income = output.household["household_net_income"].iloc[0]

MicroDataFrame Automatic Weighting

All operations respect survey weights automatically:

python

# These are all weighted calculations
total_population = output.person["person_weight"].sum()
mean_income = output.household["household_net_income"].mean()
poverty_rate = output.household["in_absolute_poverty_bhc"].mean()

# Groupby operations are weighted
by_region = output.household.groupby("region")["household_net_income"].mean()

Entity Mapping with map_to_entity()

Convert data between entity levels (e.g., sum person income to household, or broadcast household rent to persons).

Method Signature

python

output.map_to_entity(
    source_entity: str,        # Entity to map from
    target_entity: str,        # Entity to map to
    columns: list[str] = None, # Columns to map (None = all)
    values: np.ndarray = None, # Custom values instead
    how: str = "sum"           # Aggregation method
)

Aggregation Methods

Person → Group (aggregation):

how="sum" (default): Sum values within each group
how="first": Take first value in each group
how="mean": Average values
how="max": Maximum value
how="min": Minimum value

Group → Person (expansion):

how="project" (default): Broadcast group value to all members
how="divide": Split group value equally among members

Example 1: Sum Person Income to Household

python

# Sum employment income across all people in each household
household_employment = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=["employment_income"],
    how="sum"
)

# Result is MicroDataFrame at household level
print(household_employment.columns)
# ['household_id', 'employment_income']  # Now household total

Example 2: Broadcast Household Rent to Persons

python

# Give each person their household's rent value
person_rent = output.map_to_entity(
    source_entity="household",
    target_entity="person",
    columns=["rent"],
    how="project"
)

# Each person now has their household's rent
print(person_rent.columns)
# ['person_id', 'rent']

Example 3: Divide Household Value Per Person

python

# Split household savings equally among members
person_savings_share = output.map_to_entity(
    source_entity="household",
    target_entity="person",
    columns=["total_savings"],
    how="divide"
)

# If household has £12,000 savings and 3 people, each gets £4,000

Example 4: Map Custom Values

python

import numpy as np

# Calculate custom person-level values
custom_tax = np.where(
    output.person["employment_income"] > 50000,
    output.person["income_tax"] * 1.1,  # 10% increase for high earners
    output.person["income_tax"]
)

# Aggregate to household level
household_custom_tax = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=custom_tax,
    how="sum"
)

Example 5: Multi-Column Mapping

python

# Map multiple income sources to household level
household_incomes = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=[
        "employment_income",
        "self_employment_income",
        "pension_income",
        "savings_interest_income"
    ],
    how="sum"
)

# Result has all columns at household level

Example 6: Cross-Entity Mapping (Group to Group)

python

# UK: Map benunit benefits to household level
# (Multiple benunits can exist in one household)
household_uc = output.map_to_entity(
    source_entity="benunit",
    target_entity="household",
    columns=["universal_credit", "child_benefit"],
    how="sum"
)

Automatic Mapping in Aggregate Classes

The Aggregate and ChangeAggregate classes automatically handle entity mapping when the variable and target entity don't match:

python

from policyengine.outputs.aggregate import Aggregate, AggregateType

# income_tax is person-level, but we want household-level sum
total_tax = Aggregate(
    simulation=simulation,
    variable="income_tax",  # Person-level
    entity="household",      # Household-level aggregation
    aggregate_type=AggregateType.SUM,
)
total_tax.run()
# Automatically maps income_tax from person to household using sum()

Common Patterns

Pattern 1: Compare Baseline vs Reform

python

# Run both simulations
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()

reform = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
    policy=reform_policy
)
reform.ensure()

# Get outputs
baseline_out = baseline.output_dataset.data
reform_out = reform.output_dataset.data

# Calculate differences
baseline_income = baseline_out.household["household_net_income"]
reform_income = reform_out.household["household_net_income"]

difference = reform_income - baseline_income

# Count winners/losers (weighted)
winners = (difference > 0).sum()
losers = (difference < 0).sum()
unchanged = (difference == 0).sum()

Pattern 2: Calculate Custom Derived Variable

python

# Calculate marginal tax rate at person level
person_data = output.person.copy()
person_data["mtr"] = (
    (person_data["income_tax"] + person_data["national_insurance"])
    / person_data["employment_income"].clip(lower=1)
) * 100

# Map to household level (max MTR in household)
household_mtr = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=person_data["mtr"].values,
    how="max"
)

Pattern 3: Extract Subset for Analysis

python

# Get London households with children
london_hh = output.household[output.household["region"] == "LONDON"]

households_with_children = output.person.groupby("person_household_id")["age"].apply(
    lambda ages: (ages < 18).any()
)

# Combine filters
london_ids = set(london_hh["household_id"])
hh_with_kids_ids = set(households_with_children[households_with_children].index)
target_ids = london_ids & hh_with_kids_ids

# Extract subset
subset_hh = output.household[output.household["household_id"].isin(target_ids)]
subset_persons = output.person[output.person["person_household_id"].isin(target_ids)]

Pattern 4: Reuse Baseline Across Multiple Reforms

python

# Run baseline once
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()
baseline.save()

# Test multiple reforms efficiently
reforms = [reform1, reform2, reform3]
results = {}

for reform in reforms:
    baseline.ensure()  # Instant from cache

    reform_sim = Simulation(
        dataset=dataset,
        tax_benefit_model_version=uk_latest,
        policy=reform
    )
    reform_sim.run()

    # Calculate impact
    from policyengine.outputs.change_aggregate import ChangeAggregate, ChangeAggregateType

    revenue = ChangeAggregate(
        baseline_simulation=baseline,
        reform_simulation=reform_sim,
        variable="household_tax",
        aggregate_type=ChangeAggregateType.SUM,
    )
    revenue.run()

    results[reform.name] = revenue.result

Direct Data Analysis (without Aggregate)

For custom analyses (decile breakdowns, percentiles, groupby), work directly with output_dataset.data after running the simulation. This is often simpler than using Aggregate.

Full working example: energy spending by income decile and tenure type

python

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import PolicyEngineUKDataset, uk_latest

# Load dataset
dataset = PolicyEngineUKDataset(
    name="Enhanced FRS 2026",
    description="EFRS 2026",
    filepath="./data/enhanced_frs_2023_24_year_2026.h5",
    year=2026,
)
dataset.load()

# Run simulation to compute income variables
simulation = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
simulation.ensure()  # caches to disk after first run

# Access results as DataFrames
hh = simulation.output_dataset.data.household

# Assign income decile (weighted quantile)
hh["income_decile"] = pd.qcut(
    hh["household_net_income"],
    q=10,
    labels=[f"D{i}" for i in range(1, 11)],
)

# Group and calculate stats
stats = (
    hh.groupby(["income_decile", "tenure_type"])["domestic_energy_consumption"]
    .agg(
        mean="mean",
        p25=lambda x: np.percentile(x, 25),
        p75=lambda x: np.percentile(x, 75),
    )
    .reset_index()
)

Key points:

simulation.output_dataset.data.household is a MicroDataFrame with weights
domestic_energy_consumption is household-level (annual £)
tenure_type values: OWNED_OUTRIGHT, OWNED_WITH_MORTGAGE, RENT_FROM_COUNCIL, RENT_PRIVATELY, RENT_FROM_HA
Income deciles must be computed from simulation output (not raw data)

Performance Tips

Use ensure() for iterative work: Can save minutes when re-running analyses
Filter before mapping: Reduces computation on large datasets
Use Aggregate classes: Optimised implementations for common operations
Batch similar calculations: Run multiple aggregates in sequence
Cache intermediate results: Store derived calculations

python

# Good: Filter then map
high_earners = output.person[output.person["employment_income"] > 100000]
high_earner_hh_income = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=high_earners["employment_income"].values,
    how="sum"
)

# Less efficient: Map then filter
all_hh_income = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=["employment_income"],
    how="sum"
)
high_earner_hh = all_hh_income[all_hh_income["employment_income"] > 100000]

For Contributors: Implementation

Current implementation:

bash

# Simulation lifecycle
cat policyengine.py/src/policyengine/core/simulation.py

# Entity mapping logic
cat policyengine.py/src/policyengine/core/dataset.py

# Cache implementation
cat policyengine.py/src/policyengine/core/cache.py

Key patterns:

Simulation caching: LRU cache with max 100 entries, keyed by UUID
Entity mapping: Automatic detection of mapping direction (person→group or group→person)
MicroDataFrame: All entity data uses weighted DataFrames from microdf package

Related skills:

policyengine-core-skill - Understanding simulation engine architecture
microdf-skill - Working with weighted DataFrames
policyengine-python-client-skill - Basic simulation usage

Debugging Tips

Verify Simulation Ran

python

assert simulation.output_dataset is not None, "Simulation hasn't run"

# Check for expected variables
expected = ["household_net_income", "household_tax"]
actual = simulation.output_dataset.data.household.columns
assert all(v in actual for v in expected), "Missing variables"

Check Entity Linkages

python

# Verify person-household mapping is valid
person_hh_ids = set(output.person["person_household_id"])
household_ids = set(output.household["household_id"])
assert person_hh_ids.issubset(household_ids), "Invalid linkage"

Verify Weights

python

# Check weights sum correctly
total_persons = output.person["person_weight"].sum()
print(f"Weighted population: {total_persons:,.0f}")

# Check for missing weights
assert not output.person["person_weight"].isna().any(), "Missing weights"

Search AI Tools

Install this agent skill to your Project

SKILL.md

PolicyEngine Simulation Mechanics

CRITICAL: Environment Setup

CRITICAL: Correct Import Paths

UK Datasets

Loading UK datasets

Loading US datasets

Inspecting available variables

For Analysts: Core Concepts

Simulation Lifecycle

The Four Methods

When to Use Each

How ensure() Works

Example: Reusing Baseline Across Reforms

Accessing Results: output_dataset.data

Structure (UK Example)

US Entity Structure

Available Variables

Direct Data Access

MicroDataFrame Automatic Weighting

Entity Mapping with map_to_entity()

Method Signature

Aggregation Methods

Example 1: Sum Person Income to Household

Example 2: Broadcast Household Rent to Persons

Example 3: Divide Household Value Per Person

Example 4: Map Custom Values

Example 5: Multi-Column Mapping

Example 6: Cross-Entity Mapping (Group to Group)

Automatic Mapping in Aggregate Classes

Common Patterns

Pattern 1: Compare Baseline vs Reform

Pattern 2: Calculate Custom Derived Variable

Pattern 3: Extract Subset for Analysis

Pattern 4: Reuse Baseline Across Multiple Reforms

Direct Data Analysis (without Aggregate)

Full working example: energy spending by income decile and tenure type

Performance Tips

For Contributors: Implementation

Debugging Tips

Verify Simulation Ran

Check Entity Linkages

Verify Weights

Related Documentation