policyengine-simulation-mechanics
ALWAYS LOAD THIS SKILL before writing any policyengine.py microsimulation code. Contains correct import paths, environment setup, dataset loading, and analysis patterns. Triggers: "write a script", "policyengine.py", "microsimulation script", "run a simulation", "load the dataset", "FRS", "EFRS", "enhanced FRS", "CPS", "enhanced CPS", "by income decile", "by tenure", "by region", "energy spending", "domestic energy", "household net income", "output_dataset", "ensure_datasets", "uk_datasets", "us_datasets", "import datasets", "from policyengine", "Simulation(dataset=", "uk_latest", "us_latest", "plotly", "analysis script", "decile breakdown", "percentile", "groupby", "weighted", "mean", "median", "p25", "p75", "tenure type", "income band", "policy reform script".
PolicyEngine Simulation Mechanics
This skill covers advanced patterns for working with policyengine.py simulations, including caching, result access, and entity mapping.
CRITICAL: Environment Setup
Before writing any code, check the environment. The policyengine.py package must be installed in the project's .venv.
# Always run from the policyengine.py repo root:
cd /path/to/policyengine.py
uv run python script.py
# Or activate first:
source .venv/bin/activate
python script.py
# NEVER use bare `pip install` — always:
uv pip install -e ".[uk]" # for UK work
uv pip install -e ".[us]" # for US work
If from policyengine.core import Simulation fails:
cd /path/to/policyengine.py
uv pip install -e ".[uk]"
# Then re-run with: uv run python script.py
CRITICAL: Correct Import Paths
Only these imports exist — do not guess others:
# Core simulation
from policyengine.core import Simulation
# UK model
from policyengine.tax_benefit_models.uk import (
uk_latest, # The model version (pass as tax_benefit_model_version=)
uk_model, # The model itself
PolicyEngineUKDataset,
UKYearData,
create_datasets, # Create & cache datasets from HF source
load_datasets, # Load cached datasets from disk
ensure_datasets, # Create if missing, load if present (recommended)
)
# US model
from policyengine.tax_benefit_models.us import (
us_latest,
PolicyEngineUSDataset,
ensure_datasets,
)
# Outputs
from policyengine.outputs.aggregate import Aggregate, AggregateType
from policyengine.outputs.change_aggregate import ChangeAggregate, ChangeAggregateType
# Plotting
from policyengine.utils.plotting import COLORS, format_fig
There is NO:
- policyengine.core.dataset_registry
- policyengine.datasets
- policyengine.core.dataset_version.DatasetVersion.list()
UK Datasets
Loading UK datasets
Use ensure_datasets() — it returns a dict[str, PolicyEngineUKDataset], building files in ./data/ on first run and loading from disk on subsequent runs.
WARNING: from policyengine.tax_benefit_models.uk import datasets gives you the Python submodule, not a dict. Never index it like a dict.
from policyengine.tax_benefit_models.uk import ensure_datasets
uk = ensure_datasets(
datasets=[
"hf://policyengine/policyengine-uk-data/frs_2023_24.h5",
"hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5",
],
years=[2026],
data_folder="./data",
)
efrs = uk["enhanced_frs_2023_24_2026"]
frs = uk["frs_2023_24_2026"]
Dict key format: "{stem}_{year}" e.g. "enhanced_frs_2023_24_2026"
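A minimal sketch of how the key relates to the source path, assuming the key is simply the filename stem joined to the year as described above. `dataset_key` is a hypothetical helper written for illustration; it is not part of policyengine.py.

```python
def dataset_key(source: str, year: int) -> str:
    """Build the '{stem}_{year}' key for a dataset path or URL (hypothetical helper)."""
    filename = source.rsplit("/", 1)[-1]  # e.g. 'enhanced_frs_2023_24.h5'
    stem = filename.rsplit(".", 1)[0]     # e.g. 'enhanced_frs_2023_24'
    return f"{stem}_{year}"


key = dataset_key(
    "hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5", 2026
)
print(key)
```

If a lookup raises KeyError, printing the dict's keys and comparing them with the output of a helper like this is a quick way to spot a mismatched stem or year.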
To force regeneration: delete ./data/ and call ensure_datasets() again.
Loading US datasets
from policyengine.tax_benefit_models.us import ensure_datasets
us = ensure_datasets(
datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
years=[2026],
data_folder="./data",
)
ecps = us["enhanced_cps_2024_2026"]
Default US dataset: enhanced_cps_2024.h5 (Enhanced CPS), years 2024–2028.
Inspecting available variables
Always inspect the dataset to find available variable names — never guess:
from policyengine.tax_benefit_models.uk import ensure_datasets
uk = ensure_datasets(years=[2026], data_folder="./data")
d = uk["enhanced_frs_2023_24_2026"]
# Input variables (present in raw data)
print("household:", list(d.data.household.columns))
print("person: ", list(d.data.person.columns))
print("benunit: ", list(d.data.benunit.columns))
Input variables are what's in the raw survey data — demographics, reported incomes, consumption, wealth, flags.
Computed variables (household_net_income, income_tax, universal_credit, etc.) are not in the raw dataset — they are calculated by the simulation. To see what's available after running:
from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest
sim = Simulation(dataset=d, tax_benefit_model_version=uk_latest)
sim.run()
print("household (post-sim):", list(sim.output_dataset.data.household.columns))
print("person (post-sim): ", list(sim.output_dataset.data.person.columns))
The computed variables available are defined by uk_latest.entity_variables — inspect this to see the full list without running a simulation:
from policyengine.tax_benefit_models.uk import uk_latest
print(uk_latest.entity_variables) # dict: entity → [variable names]
For Analysts: Core Concepts
When running simulations with policyengine.py (the microsimulation package, not the API client), you work with three key components:
- Simulation.ensure() - Smart caching to avoid redundant computation
- simulation.output_dataset.data - Accessing calculated results
- map_to_entity() - Converting data between entity levels (person ↔ household)
Note: This is for microsimulation with policyengine.py, not the policyengine Python API client (which uses Simulation(situation=...)).
Simulation Lifecycle
The Four Methods
from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest
simulation = Simulation(
dataset=dataset,
tax_benefit_model_version=uk_latest,
)
# Method 1: Always run (no caching)
simulation.run()
# Method 2: Run only if needed (recommended)
simulation.ensure()
# Method 3: Save results to disk
simulation.save()
# Method 4: Load results from disk
simulation.load()
When to Use Each
- run(): Use when you need fresh results or parameters changed
- ensure(): Use for iterative development (checks cache → disk → run)
- save(): Use to persist large simulation results
- load(): Use to resume from previous session
How ensure() Works
def ensure(self):
    # 1. Check in-memory LRU cache (100 simulations)
    cached = _cache.get(self.id)
    if cached:
        self.output_dataset = cached.output_dataset
        return
    # 2. Try loading from disk
    try:
        self.tax_benefit_model_version.load(self)
    except Exception:
        # 3. Only run if both cache and disk fail
        self.run()
        self.save()
    # 4. Add to cache for next ensure() call
    _cache.add(self.id, self)
Performance impact:
- First call: Full simulation runtime (seconds to minutes)
- Same session: Instant (in-memory cache)
- New session: Fast (disk load, no recomputation)
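The cache → disk → run order can be illustrated with a toy, standard-library model. This is a sketch under assumptions, not the library's implementation: `LRUCache`, the `disk` dict standing in for saved files, and the string "results" are all invented for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache (illustrative stand-in)."""

    def __init__(self, max_size=100):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def add(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict least recently used


memory = LRUCache(max_size=2)  # small size so eviction is visible
disk = {}                      # stand-in for results saved to disk
runs = []                      # ids that were actually computed


def ensure(sim_id):
    cached = memory.get(sim_id)      # 1. in-memory cache
    if cached is not None:
        return cached
    if sim_id in disk:               # 2. disk
        result = disk[sim_id]
    else:                            # 3. compute, then persist
        runs.append(sim_id)
        result = f"output for {sim_id}"
        disk[sim_id] = result
    memory.add(sim_id, result)       # 4. cache for the next call
    return result


ensure("baseline")
ensure("baseline")   # served from memory
ensure("reform_a")
ensure("reform_b")   # evicts "baseline" from memory
ensure("baseline")   # reloaded from disk, not recomputed
print(runs)
```

Each id is computed only once: after eviction from memory, the result is still recovered from the disk stand-in rather than recomputed.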
Example: Reusing Baseline Across Reforms
# Run baseline once
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure() # First call: runs simulation
baseline.save() # Persist to disk
# Test multiple reforms
for reform in [reform1, reform2, reform3]:
    baseline.ensure() # Instant from cache!
    reform_sim = Simulation(
        dataset=dataset,
        tax_benefit_model_version=uk_latest,
        policy=reform,
    )
    reform_sim.run() # Only reform needs to run
    # Compare results...
Accessing Results: output_dataset.data
After running a simulation, all calculated variables are in simulation.output_dataset.data.
Structure (UK Example)
simulation.run()
# Access output container
output = simulation.output_dataset.data
# Entity-level MicroDataFrames
output.person # Person-level results
output.benunit # Benefit unit results
output.household # Household-level results
US Entity Structure
# US has more entities
output.person
output.tax_unit # Federal tax filing unit
output.spm_unit # Supplemental Poverty Measure unit
output.family # Census family definition
output.marital_unit # Married couple or single
output.household
Available Variables
Each dataframe contains input variables + calculated variables:
# Person-level (UK)
print(output.person.columns)
# ['person_id', 'person_household_id', 'age', 'employment_income',
# 'income_tax', 'national_insurance', 'net_income', ...]
# Household-level (UK)
print(output.household.columns)
# ['household_id', 'region', 'rent', 'household_net_income',
# 'household_benefits', 'household_tax', ...]
# Benunit-level (UK)
print(output.benunit.columns)
# ['benunit_id', 'universal_credit', 'child_benefit',
# 'working_tax_credit', 'child_tax_credit', ...]
Direct Data Access
# Get specific columns
incomes = output.household[["household_id", "household_net_income"]]
# Filter data
high_earners = output.person[output.person["employment_income"] > 100000]
# Calculate statistics (automatically weighted!)
mean_income = output.household["household_net_income"].mean()
total_tax = output.household["household_tax"].sum()
# Access individual values
first_hh_income = output.household["household_net_income"].iloc[0]
MicroDataFrame Automatic Weighting
All operations respect survey weights automatically:
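# Means and shares are weighted automatically
mean_income = output.household["household_net_income"].mean()
poverty_rate = output.household["in_absolute_poverty_bhc"].mean()
# Population total: sum the raw weight array, so the weights are not
# applied a second time to the weight column itself
total_population = output.person["person_weight"].values.sum()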
# Groupby operations are weighted
by_region = output.household.groupby("region")["household_net_income"].mean()
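To make the effect of weighting concrete, here is a plain-Python sketch of a weighted mean and weighted total next to their unweighted counterparts. The incomes and weights are made-up numbers, and `weighted_mean` is a hypothetical helper, not the microdf implementation.

```python
def weighted_mean(values, weights):
    """Mean of values where each record counts `weight` times."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Three sampled households, each representing a different number of real ones
incomes = [20_000, 40_000, 90_000]
weights = [3_000, 1_500, 500]

unweighted = sum(incomes) / len(incomes)
weighted = weighted_mean(incomes, weights)
weighted_total = sum(v * w for v, w in zip(incomes, weights))

print(unweighted)      # 50000.0
print(weighted)        # 33000.0
print(weighted_total)  # 165000000
```

The low-income household represents the most real households, so the weighted mean sits well below the simple sample mean.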
Entity Mapping with map_to_entity()
Convert data between entity levels (e.g., sum person income to household, or broadcast household rent to persons).
Method Signature
output.map_to_entity(
source_entity: str, # Entity to map from
target_entity: str, # Entity to map to
columns: list[str] = None, # Columns to map (None = all)
values: np.ndarray = None, # Custom values instead
how: str = "sum" # Aggregation method
)
Aggregation Methods
Person → Group (aggregation):
- how="sum" (default): Sum values within each group
- how="first": Take first value in each group
- how="mean": Average values
- how="max": Maximum value
- how="min": Minimum value
Group → Person (expansion):
- how="project" (default): Broadcast group value to all members
- how="divide": Split group value equally among members
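The semantics of the two directions can be sketched in plain Python. This illustrates the listed methods on toy data only; the function names `aggregate_sum`, `project` and `divide` are hypothetical and do not reflect how map_to_entity() is implemented.

```python
from collections import defaultdict

# Five people in two households (toy data)
person_household_id = [1, 1, 2, 2, 2]
employment_income = [30_000, 10_000, 0, 25_000, 5_000]


def aggregate_sum(values, group_ids):
    """Person → group: sum values within each group."""
    totals = defaultdict(float)
    for value, group in zip(values, group_ids):
        totals[group] += value
    return dict(totals)


def project(group_values, group_ids):
    """Group → person: broadcast the group value to every member."""
    return [group_values[group] for group in group_ids]


def divide(group_values, group_ids):
    """Group → person: split the group value equally among members."""
    sizes = defaultdict(int)
    for group in group_ids:
        sizes[group] += 1
    return [group_values[group] / sizes[group] for group in group_ids]


household_income = aggregate_sum(employment_income, person_household_id)
rent = {1: 12_000, 2: 9_000}
person_rent = project(rent, person_household_id)
person_rent_share = divide(rent, person_household_id)

print(household_income)   # {1: 40000.0, 2: 30000.0}
print(person_rent)        # [12000, 12000, 9000, 9000, 9000]
print(person_rent_share)  # [6000.0, 6000.0, 3000.0, 3000.0, 3000.0]
```

Note that "project" repeats the household value for each member, so summing the projected column over persons counts it once per member, whereas "divide" preserves the household total.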
Example 1: Sum Person Income to Household
# Sum employment income across all people in each household
household_employment = output.map_to_entity(
source_entity="person",
target_entity="household",
columns=["employment_income"],
how="sum"
)
# Result is MicroDataFrame at household level
print(household_employment.columns)
# ['household_id', 'employment_income'] # Now household total
Example 2: Broadcast Household Rent to Persons
# Give each person their household's rent value
person_rent = output.map_to_entity(
source_entity="household",
target_entity="person",
columns=["rent"],
how="project"
)
# Each person now has their household's rent
print(person_rent.columns)
# ['person_id', 'rent']
Example 3: Divide Household Value Per Person
# Split household savings equally among members
person_savings_share = output.map_to_entity(
source_entity="household",
target_entity="person",
columns=["total_savings"],
how="divide"
)
# If household has £12,000 savings and 3 people, each gets £4,000
Example 4: Map Custom Values
import numpy as np
# Calculate custom person-level values
custom_tax = np.where(
output.person["employment_income"] > 50000,
output.person["income_tax"] * 1.1, # 10% increase for high earners
output.person["income_tax"]
)
# Aggregate to household level
household_custom_tax = output.map_to_entity(
source_entity="person",
target_entity="household",
values=custom_tax,
how="sum"
)
Example 5: Multi-Column Mapping
# Map multiple income sources to household level
household_incomes = output.map_to_entity(
source_entity="person",
target_entity="household",
columns=[
"employment_income",
"self_employment_income",
"pension_income",
"savings_interest_income"
],
how="sum"
)
# Result has all columns at household level
Example 6: Cross-Entity Mapping (Group to Group)
# UK: Map benunit benefits to household level
# (Multiple benunits can exist in one household)
household_uc = output.map_to_entity(
source_entity="benunit",
target_entity="household",
columns=["universal_credit", "child_benefit"],
how="sum"
)
Automatic Mapping in Aggregate Classes
The Aggregate and ChangeAggregate classes automatically handle entity mapping when the variable and target entity don't match:
from policyengine.outputs.aggregate import Aggregate, AggregateType
# income_tax is person-level, but we want household-level sum
total_tax = Aggregate(
simulation=simulation,
variable="income_tax", # Person-level
entity="household", # Household-level aggregation
aggregate_type=AggregateType.SUM,
)
total_tax.run()
# Automatically maps income_tax from person to household using sum()
Common Patterns
Pattern 1: Compare Baseline vs Reform
# Run both simulations
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()
reform = Simulation(
dataset=dataset,
tax_benefit_model_version=uk_latest,
policy=reform_policy
)
reform.ensure()
# Get outputs
baseline_out = baseline.output_dataset.data
reform_out = reform.output_dataset.data
# Calculate differences
baseline_income = baseline_out.household["household_net_income"]
reform_income = reform_out.household["household_net_income"]
difference = reform_income - baseline_income
# Count winners/losers (weighted)
winners = (difference > 0).sum()
losers = (difference < 0).sum()
unchanged = (difference == 0).sum()
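A plain-Python sketch of the same winner/loser count with explicit weights, using made-up incomes. The `tolerance` parameter is an addition for illustration: exact float comparison can classify tiny rounding differences as gains or losses, so a small threshold is sometimes preferable. `weighted_outcomes` is a hypothetical helper.

```python
def weighted_outcomes(baseline, reform, weights, tolerance=0.0):
    """Weighted counts of households that gain, lose, or are unchanged."""
    counts = {"winners": 0.0, "losers": 0.0, "unchanged": 0.0}
    for base, new, weight in zip(baseline, reform, weights):
        change = new - base
        if change > tolerance:
            counts["winners"] += weight
        elif change < -tolerance:
            counts["losers"] += weight
        else:
            counts["unchanged"] += weight
    return counts


baseline_income = [20_000, 35_000, 50_000, 80_000]
reform_income = [20_500, 35_000, 49_000, 80_000]
household_weight = [1_000, 2_000, 1_500, 500]

outcomes = weighted_outcomes(baseline_income, reform_income, household_weight)
total = sum(household_weight)
shares = {name: count / total for name, count in outcomes.items()}

print(outcomes)  # {'winners': 1000.0, 'losers': 1500.0, 'unchanged': 2500.0}
print(shares)    # {'winners': 0.2, 'losers': 0.3, 'unchanged': 0.5}
```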
Pattern 2: Calculate Custom Derived Variable
# Calculate the average (effective) tax rate on employment income at person
# level, in percent. This is tax paid divided by income, not a marginal rate.
person_data = output.person.copy()
person_data["effective_rate"] = (
    (person_data["income_tax"] + person_data["national_insurance"])
    / person_data["employment_income"].clip(lower=1)  # avoid division by zero
) * 100
# Map to household level (highest effective rate in household)
household_effective_rate = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=person_data["effective_rate"].values,
    how="max"
)
Pattern 3: Extract Subset for Analysis
# Get London households with children
london_hh = output.household[output.household["region"] == "LONDON"]
households_with_children = output.person.groupby("person_household_id")["age"].apply(
lambda ages: (ages < 18).any()
)
# Combine filters
london_ids = set(london_hh["household_id"])
hh_with_kids_ids = set(households_with_children[households_with_children].index)
target_ids = london_ids & hh_with_kids_ids
# Extract subset
subset_hh = output.household[output.household["household_id"].isin(target_ids)]
subset_persons = output.person[output.person["person_household_id"].isin(target_ids)]
Pattern 4: Reuse Baseline Across Multiple Reforms
# Run baseline once
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()
baseline.save()
# Test multiple reforms efficiently
reforms = [reform1, reform2, reform3]
results = {}
# Import once, outside the loop
from policyengine.outputs.change_aggregate import ChangeAggregate, ChangeAggregateType
for reform in reforms:
    baseline.ensure() # Instant from cache
    reform_sim = Simulation(
        dataset=dataset,
        tax_benefit_model_version=uk_latest,
        policy=reform,
    )
    reform_sim.run()
    # Calculate impact
    revenue = ChangeAggregate(
        baseline_simulation=baseline,
        reform_simulation=reform_sim,
        variable="household_tax",
        aggregate_type=ChangeAggregateType.SUM,
    )
    revenue.run()
    results[reform.name] = revenue.result
Direct Data Analysis (without Aggregate)
For custom analyses (decile breakdowns, percentiles, groupby), work directly with output_dataset.data after running the simulation. This is often simpler than using Aggregate.
Full working example: energy spending by income decile and tenure type
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import PolicyEngineUKDataset, uk_latest
# Load dataset
dataset = PolicyEngineUKDataset(
name="Enhanced FRS 2026",
description="EFRS 2026",
filepath="./data/enhanced_frs_2023_24_year_2026.h5",
year=2026,
)
dataset.load()
# Run simulation to compute income variables
simulation = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
simulation.ensure() # caches to disk after first run
# Access results as DataFrames
hh = simulation.output_dataset.data.household
# Assign income decile. Note: pd.qcut ignores survey weights, so these are
# unweighted sample deciles, not population deciles.
hh["income_decile"] = pd.qcut(
    hh["household_net_income"],
    q=10,
    labels=[f"D{i}" for i in range(1, 11)],
)
# Group and calculate stats (np.percentile is likewise unweighted)
stats = (
    hh.groupby(["income_decile", "tenure_type"])["domestic_energy_consumption"]
    .agg(
        mean="mean",
        p25=lambda x: np.percentile(x, 25),
        p75=lambda x: np.percentile(x, 75),
    )
    .reset_index()
)
Key points:
- simulation.output_dataset.data.household is a MicroDataFrame with weights
- domestic_energy_consumption is household-level (annual £)
- tenure_type values: OWNED_OUTRIGHT, OWNED_WITH_MORTGAGE, RENT_FROM_COUNCIL, RENT_PRIVATELY, RENT_FROM_HA
- Income deciles must be computed from simulation output (not raw data)
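Because pd.qcut splits by row count rather than by survey weight, population deciles need a weighted assignment. The sketch below shows one common convention in plain Python: sort by income, accumulate weights, and place each record by its cumulative weight share. `weighted_deciles` is a hypothetical helper, the data is made up, and other tie-breaking or boundary conventions are equally defensible.

```python
import math


def weighted_deciles(values, weights):
    """Assign deciles 1-10 so each decile holds roughly 10% of total weight."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = sum(weights)
    deciles = [0] * len(values)
    cumulative = 0.0
    for i in order:
        cumulative += weights[i]
        # Share of total weight at or below this record, scaled to 0-10
        decile = math.ceil(cumulative * 10 / total)
        deciles[i] = min(max(decile, 1), 10)
    return deciles


# Equal weights: deciles follow the income rank
incomes = [50, 10, 40, 20, 30, 60, 100, 90, 70, 80]
print(weighted_deciles(incomes, [1] * 10))
# [5, 1, 4, 2, 3, 6, 10, 9, 7, 8]

# Unequal weights: the lowest-income record covers half the population
print(weighted_deciles([10, 20, 30], [5, 3, 2]))
# [5, 8, 10]
```

With only a few heavily weighted records, some deciles stay empty; on a full survey dataset the groups come out close to 10% of the weighted population each.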
Performance Tips
- Use ensure() for iterative work: Can save minutes when re-running analyses
- Filter before mapping: Reduces computation on large datasets
- Use Aggregate classes: Optimised implementations for common operations
- Batch similar calculations: Run multiple aggregates in sequence
- Cache intermediate results: Store derived calculations
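# Good: Mask then map. Assuming `values` needs one entry per person row
# (as in Example 4), zero out non-matching rows instead of dropping them.
import numpy as np
is_high_earner = output.person["employment_income"] > 100000
high_earner_hh_income = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=np.where(is_high_earner, output.person["employment_income"], 0),
    how="sum"
)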
# Less efficient: Map then filter
all_hh_income = output.map_to_entity(
source_entity="person",
target_entity="household",
columns=["employment_income"],
how="sum"
)
high_earner_hh = all_hh_income[all_hh_income["employment_income"] > 100000]
For Contributors: Implementation
Current implementation:
# Simulation lifecycle
cat policyengine.py/src/policyengine/core/simulation.py
# Entity mapping logic
cat policyengine.py/src/policyengine/core/dataset.py
# Cache implementation
cat policyengine.py/src/policyengine/core/cache.py
Key patterns:
- Simulation caching: LRU cache with max 100 entries, keyed by UUID
- Entity mapping: Automatic detection of mapping direction (person→group or group→person)
- MicroDataFrame: All entity data uses weighted DataFrames from microdf package
Related skills:
- policyengine-core-skill - Understanding simulation engine architecture
- microdf-skill - Working with weighted DataFrames
- policyengine-python-client-skill - Basic simulation usage
Debugging Tips
Verify Simulation Ran
assert simulation.output_dataset is not None, "Simulation hasn't run"
# Check for expected variables
expected = ["household_net_income", "household_tax"]
actual = simulation.output_dataset.data.household.columns
assert all(v in actual for v in expected), "Missing variables"
Check Entity Linkages
# Verify person-household mapping is valid
person_hh_ids = set(output.person["person_household_id"])
household_ids = set(output.household["household_id"])
assert person_hh_ids.issubset(household_ids), "Invalid linkage"
Verify Weights
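# Check weights sum correctly (sum the raw array so the weights are not
# applied a second time to the weight column itself)
total_persons = output.person["person_weight"].values.sum()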
print(f"Weighted population: {total_persons:,.0f}")
# Check for missing weights
assert not output.person["person_weight"].isna().any(), "Missing weights"
Related Documentation
In policyengine.py repo:
- .claude/policyengine-guide.md - High-level patterns
- .claude/quick-reference.md - Syntax cheat sheet
- .claude/working-with-simulations.md - Detailed simulation guide
- examples/ - Full working examples
- docs/core-concepts.md - Architecture documentation