Agent skill
training-archive-gating
Mandatory training archive with model gating (APPROVED/REVIEW/DROP). Trigger when: (1) training run completes, (2) need to decide which models to deploy, (3) want historical training reference, (4) need checkpoint recommendations for overfitting.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/training-archive-gating-smith6jt-cop-skills-registry
Training Archive & Model Gating System (v2.4.1)
Experiment Overview
| Item | Details |
|---|---|
| Date | 2024-12-24 |
| Goal | Create mandatory post-training archival with automatic model classification |
| Environment | alpaca_trading.training package, Colab/local training |
| Status | Success |
Context
Training runs produced summaries with validation metrics, but four things were missing:
- Structured archival - Summaries were ephemeral JSON files
- Model classification - No clear criteria for deployment readiness
- Overfitting detection - No automatic checkpoint recommendations
- Historical reference - No way to track model improvements over time
The solution: TrainingArchiveManager with automatic gating based on fitness, profit factor, consistency, and drawdown thresholds.
Verified Workflow
Model Gating Thresholds
NOTE (v2.4.1): Thresholds are calibrated for reward_scale=0.001. MaxDD is a proxy metric reflecting reward volatility during validation, not actual equity drawdown. With conservative reward scaling:
- 8% proxy MaxDD = rewards staying mostly positive
- 15% proxy MaxDD = occasional negative reward streaks
| Classification | Fitness | PF | Consistency | MaxDD | Action |
|---|---|---|---|---|---|
| APPROVED | >= 0.70 | >= 1.8 | >= 85% | <= 8% | Deploy to production |
| REVIEW | 0.50 to < 0.70 | 1.3 to < 1.8 | 65% to < 85% | > 8% to 15% | Manual review required |
| DROP | < 0.50 | < 1.3 | < 65% | > 15% | Do not deploy |
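The table above can be read as a small decision function. The sketch below is illustrative only: `Thresholds` and `classify_by_thresholds` are hypothetical names, not part of `alpaca_trading.training`. It assumes that APPROVED requires every metric to be in its APPROVED range and that a single metric in the DROP range is enough to drop a model. The source confirms the first rule ("Require ALL thresholds met") but does not spell out the second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container mirroring the table (not ModelGatingConfig)."""
    approved_min_fitness: float = 0.70
    approved_min_pf: float = 1.8
    approved_min_consistency: float = 0.85
    approved_max_drawdown: float = 0.08
    review_min_fitness: float = 0.50
    review_min_pf: float = 1.3
    review_min_consistency: float = 0.65
    review_max_drawdown: float = 0.15


def classify_by_thresholds(fitness, pf, consistency, max_dd, t=Thresholds()):
    """Return 'APPROVED', 'REVIEW', or 'DROP' for one model.

    Assumption: any single metric in the DROP range drops the model.
    """
    approved = (
        fitness >= t.approved_min_fitness
        and pf >= t.approved_min_pf
        and consistency >= t.approved_min_consistency
        and max_dd <= t.approved_max_drawdown
    )
    if approved:
        return "APPROVED"
    dropped = (
        fitness < t.review_min_fitness
        or pf < t.review_min_pf
        or consistency < t.review_min_consistency
        or max_dd > t.review_max_drawdown
    )
    return "DROP" if dropped else "REVIEW"
```

Under these assumptions, a model with PF=5.0 but 50% consistency is dropped despite its strong profit factor, which is the behavior the single-metric approach failed to provide.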
Overfitting Detection
# Fitness decline from peak triggers checkpoint recommendation
fitness_decline_threshold = 0.05 # 5% decline from peak
fitness_oscillation_threshold = 0.10 # 10% swing = unstable training
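One way these two thresholds could be applied to a fitness history is sketched below. `detect_overfitting` is a hypothetical helper, not the package's implementation. Two assumptions are made that the source does not confirm: the decline is measured relative to the peak, and oscillation is the largest jump between consecutive validation points.

```python
def detect_overfitting(fitness_history, decline_threshold=0.05,
                       oscillation_threshold=0.10):
    """Return (flags, use_checkpoint, best_idx) for validation fitness values.

    Assumptions: relative decline from peak; oscillation is the largest
    absolute change between consecutive validation points.
    """
    if not fitness_history:
        return [], False, None

    # Index of the first occurrence of the peak fitness.
    best_idx = max(range(len(fitness_history)), key=fitness_history.__getitem__)
    peak = fitness_history[best_idx]
    final = fitness_history[-1]

    flags = []
    decline = (peak - final) / peak if peak > 0 else 0.0
    use_checkpoint = decline > decline_threshold
    if use_checkpoint:
        flags.append("FITNESS_DECLINE")

    swings = [abs(b - a) for a, b in zip(fitness_history, fitness_history[1:])]
    if swings and max(swings) > oscillation_threshold:
        flags.append("UNSTABLE_TRAINING")

    return flags, use_checkpoint, best_idx
```

With a history of [0.70, 0.75, 0.78, 0.75], the final value is about 3.8% below the peak, so under these assumptions no checkpoint is recommended.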
Archive Usage
from alpaca_trading.training import TrainingArchiveManager
# Archive training run with automatic gating
archive_mgr = TrainingArchiveManager(archive_dir='training_archives')
archive = archive_mgr.archive_training_run(summary_data)
# Results
print(f"APPROVED: {archive.approved_count}")
print(f"REVIEW: {archive.review_count}")
print(f"DROP: {archive.dropped_count}")
# Get approved models for deployment
approved = archive_mgr.get_approved_models(archive.timestamp)
print(f"Ready for deployment: {approved}")
# Get symbol history across runs
history = archive_mgr.get_symbol_history('AAPL')
Archive Structure
training_archives/
├── index.json # Master index of all runs
├── {timestamp}/
│ ├── summary.json # Raw training summary
│ ├── model_assessments.json # Per-model gating decisions
│ └── recommendations.md # Human-readable report
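Given this layout, a quick integrity check can confirm that every run directory holds the three expected files. The sketch below relies only on the directory structure shown above. `find_incomplete_runs` is a hypothetical helper, and it makes no assumption about the contents of index.json or the timestamp format.

```python
from pathlib import Path

# File names taken from the archive structure shown above.
EXPECTED_FILES = ("summary.json", "model_assessments.json", "recommendations.md")


def find_incomplete_runs(archive_dir):
    """Map each run directory name to the expected files it is missing."""
    root = Path(archive_dir)
    missing = {}
    for run_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        absent = [name for name in EXPECTED_FILES
                  if not (run_dir / name).is_file()]
        if absent:
            missing[run_dir.name] = absent
    return missing
```

An empty result means every archived run is complete.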
Gating Configuration
from alpaca_trading.training import ModelGatingConfig
from alpaca_trading.training.gating import assess_model_quality  # defined in gating.py
# Custom thresholds (stricter than v2.4.1 defaults)
config = ModelGatingConfig(
approved_min_fitness=0.80, # Default: 0.70
approved_min_pf=2.0, # Default: 1.8
approved_min_consistency=0.90, # Default: 0.85
approved_max_drawdown=0.05, # Default: 0.08 (5% proxy MaxDD)
)
# Use custom config
classification, flags, use_checkpoint, best_idx = assess_model_quality(
final_fitness=0.75,
final_pf=1.9,
final_consistency=0.88,
final_max_dd=0.06, # 6% proxy MaxDD
fitness_history=[0.70, 0.75, 0.78, 0.75],
config=config,
)
Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Manual assessment | Inconsistent criteria, subjective | Use fixed thresholds in code |
| Single metric gating | Models with high PF but poor consistency slipped through | Require ALL thresholds met |
| No overfitting detection | Models deployed that had peaked earlier | Track fitness history, recommend checkpoints |
| ASCII-only reports | Unicode errors on Windows (emoji characters) | Use encoding='utf-8' on all file operations |
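The encoding lesson in the last row can be illustrated with a minimal sketch. On Windows, Python's default text encoding is often a legacy code page, so writing emoji without an explicit encoding can raise UnicodeEncodeError. `write_report` and `read_report` are hypothetical helpers, not functions from the package.

```python
from pathlib import Path


def write_report(path, text):
    """Write a report as UTF-8 regardless of the platform default encoding."""
    Path(path).write_text(text, encoding="utf-8")


def read_report(path):
    """Read a report back using the same explicit encoding."""
    return Path(path).read_text(encoding="utf-8")
```

Passing the encoding on both the write and the read keeps reports portable between Colab (Linux) and local Windows machines.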
Key Insights
Why Multiple Thresholds
- Fitness alone is insufficient - High fitness can mask poor profit factor
- Consistency matters - A model with PF=5.0 but 50% consistency is risky
- Drawdown is critical - High-equity models can still blow up
- All thresholds must pass - A single weak metric can indicate problems
Checkpoint Recommendations
When use_checkpoint=True is returned:
- Model's final fitness declined >5% from peak
- best_idx indicates which validation point had peak fitness
- Calculate the checkpoint update: checkpoint_update = (best_idx + 1) * validation_interval
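As a worked example of the formula, the sketch below finds the peak in a fitness history and converts its index into a training update. `recommended_checkpoint_update` is a hypothetical helper, and validation_interval=50 is a made-up value for illustration. It assumes validation points are recorded once per interval, starting at the end of the first interval.

```python
def recommended_checkpoint_update(fitness_history, validation_interval):
    """Return the training update at which validation fitness peaked."""
    if not fitness_history:
        raise ValueError("fitness_history must not be empty")
    # Index of the first occurrence of the peak fitness.
    best_idx = max(range(len(fitness_history)), key=fitness_history.__getitem__)
    return (best_idx + 1) * validation_interval


# Peak is 0.78 at index 2, so the recommended update is (2 + 1) * 50.
print(recommended_checkpoint_update([0.70, 0.75, 0.78, 0.75],
                                    validation_interval=50))  # 150
```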
Flags Returned
| Flag | Meaning |
|---|---|
| FITNESS_DECLINE | Final < peak by >5% |
| UNSTABLE_TRAINING | Fitness oscillation >10% |
| LOW_FITNESS | Below REVIEW threshold |
| LOW_PF | Profit factor below threshold |
| LOW_CONSISTENCY | Consistency below threshold |
| HIGH_DRAWDOWN | Max drawdown above threshold |
Files Created
alpaca_trading/training/__init__.py # Package exports
alpaca_trading/training/gating.py # ModelGatingConfig, assess_model_quality()
alpaca_trading/training/archive.py # TrainingArchiveManager
tests/test_training_archive.py # 20 unit tests
References
- alpaca_trading/training/gating.py: Lines 40-128 (gating logic)
- alpaca_trading/training/archive.py: Lines 55-393 (archive manager)
- tests/test_training_archive.py: Full test suite
- CLAUDE.md: Model Gating Standards section