backtest-expert
Expert guidance for systematic backtesting of trading strategies. Use when developing, testing, stress-testing, or validating quantitative trading strategies. Covers "beating ideas to death" methodology, parameter robustness testing, slippage modeling, bias prevention, and interpreting backtest results. Applicable when user asks about backtesting, strategy validation, robustness testing, avoiding overfitting, or systematic trading development.
Install this agent skill to your Project
npx add-skill https://github.com/tradermonty/claude-trading-skills/tree/main/skills/backtest-expert
Backtest Expert
Systematic approach to backtesting trading strategies based on professional methodology that prioritizes robustness over optimistic results.
Core Philosophy
Goal: Find strategies that "break the least", not strategies that "profit the most" on paper.
Principle: Add friction, stress test assumptions, and see what survives. If a strategy holds up under pessimistic conditions, it's more likely to work in live trading.
When to Use This Skill
Use this skill when:
- Developing or validating systematic trading strategies
- Evaluating whether a trading idea is robust enough for live implementation
- Troubleshooting why a backtest might be misleading
- Learning proper backtesting methodology
- Avoiding common pitfalls (curve-fitting, look-ahead bias, survivorship bias)
- Assessing parameter sensitivity and regime dependence
- Setting realistic expectations for slippage and execution costs
Prerequisites
- Python 3.9+ (for evaluation script)
- No API keys required
- No external data dependencies — metrics are user-provided
Workflow
1. State the Hypothesis
Define the edge in one sentence.
Example: "Stocks that gap up more than 3% on earnings and pull back to the previous day's close within the first hour offer a mean-reversion opportunity."
If you can't articulate the edge clearly, don't proceed to testing.
2. Codify Rules with Zero Discretion
Define with complete specificity:
- Entry: Exact conditions, timing, price type
- Exit: Stop loss, profit target, time-based exit
- Position sizing: Fixed dollar amount, % of portfolio, volatility-adjusted
- Filters: Market cap, volume, sector, volatility conditions
- Universe: What instruments are eligible
Critical: No subjective judgment allowed. Every decision must be rule-based and unambiguous.
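One way to enforce zero discretion is to keep every rule in a single immutable record that the backtest reads from. The sketch below is illustrative only: the class name `StrategyRules` and all field names are hypothetical, the 3% gap and 60-minute window come from the example hypothesis above, and the remaining values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategyRules:
    """Hypothetical rule set: every field is a number, nothing is left to judgment."""
    min_gap_pct: float           # entry: minimum earnings gap-up, in percent
    entry_window_minutes: int    # entry: pullback must happen within this window
    stop_loss_pct: float         # exit: stop distance below entry, in percent
    profit_target_pct: float     # exit: target distance above entry, in percent
    max_hold_days: int           # exit: time-based exit
    risk_per_trade_pct: float    # position sizing: % of portfolio risked per trade
    min_market_cap_usd: float    # filter / universe
    min_avg_daily_volume: int    # filter / universe

    def __post_init__(self) -> None:
        # Reject rule sets that are incomplete or nonsensical before any test runs.
        if self.stop_loss_pct <= 0 or self.profit_target_pct <= 0:
            raise ValueError("stop loss and profit target must be positive")
        if not 0 < self.risk_per_trade_pct <= 100:
            raise ValueError("risk per trade must be in (0, 100]")
        if self.entry_window_minutes <= 0 or self.max_hold_days <= 0:
            raise ValueError("time limits must be positive")


rules = StrategyRules(
    min_gap_pct=3.0,             # from the example hypothesis
    entry_window_minutes=60,     # "within the first hour"
    stop_loss_pct=2.0,           # placeholder value
    profit_target_pct=4.0,       # placeholder value
    max_hold_days=5,             # placeholder value
    risk_per_trade_pct=1.0,      # placeholder value
    min_market_cap_usd=2e9,      # placeholder value
    min_avg_daily_volume=500_000,  # placeholder value
)
```

Freezing the record means a parameter cannot be quietly adjusted mid-test; any change requires constructing a new, explicitly different rule set.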
3. Run Initial Backtest
Test over:
- Minimum 5 years (preferably 10+)
- Multiple market regimes (bull, bear, high/low volatility)
- Realistic costs: Commissions + conservative slippage
Examine initial results for basic viability. If fundamentally broken, iterate on hypothesis.
4. Stress Test the Strategy
This is where 80% of testing time should be spent.
Parameter sensitivity:
- Test stop loss at 50%, 75%, 100%, 125%, 150% of baseline
- Test profit target at 80%, 90%, 100%, 110%, 120% of baseline
- Vary entry/exit timing by ±15-30 minutes
- Look for "plateaus" of stable performance, not narrow spikes
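The stop-loss and profit-target variations above can be enumerated mechanically. This is a minimal sketch, assuming a baseline stop of 2% and a baseline target of 4% (both placeholder values); `sensitivity_grid` is a hypothetical helper, and running the backtest for each combination is left to your own engine.

```python
from itertools import product

# Multipliers taken from the list above.
STOP_MULTIPLIERS = (0.50, 0.75, 1.00, 1.25, 1.50)
TARGET_MULTIPLIERS = (0.80, 0.90, 1.00, 1.10, 1.20)


def sensitivity_grid(base_stop_pct: float, base_target_pct: float) -> list[dict]:
    """Return every stop/target combination to re-run the backtest with."""
    return [
        {
            "stop_loss_pct": round(base_stop_pct * s, 4),
            "profit_target_pct": round(base_target_pct * t, 4),
        }
        for s, t in product(STOP_MULTIPLIERS, TARGET_MULTIPLIERS)
    ]


grid = sensitivity_grid(base_stop_pct=2.0, base_target_pct=4.0)
print(len(grid))  # 5 stop levels x 5 target levels = 25 runs
```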
Execution friction:
- Increase slippage to 1.5-2x typical estimates
- Model worst-case fills (buy at ask+1 tick, sell at bid-1 tick)
- Add realistic order rejection scenarios
- Test with pessimistic commission structures
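A rough sketch of the friction rules above, for a long trade. The function names, the quote values, and the cost percentages are all assumptions made for illustration; substitute figures that match your market and broker.

```python
def worst_case_fill(side: str, bid: float, ask: float,
                    tick_size: float, ticks: int = 1) -> float:
    """Pessimistic fill: buy above the ask, sell below the bid."""
    if side == "buy":
        return ask + ticks * tick_size
    if side == "sell":
        return bid - ticks * tick_size
    raise ValueError(f"unknown side: {side!r}")


def net_return_pct(entry_fill: float, exit_fill: float,
                   commission_pct_per_side: float,
                   slippage_pct_per_side: float) -> float:
    """Long round-trip return in percent after commissions and slippage."""
    gross = (exit_fill - entry_fill) / entry_fill * 100.0
    return gross - 2.0 * (commission_pct_per_side + slippage_pct_per_side)


typical_slippage_pct = 0.05                         # assumed typical estimate
stressed_slippage_pct = typical_slippage_pct * 2.0  # upper end of the 1.5-2x range

entry = worst_case_fill("buy", bid=100.00, ask=100.02, tick_size=0.01)
exit_ = worst_case_fill("sell", bid=101.00, ask=101.02, tick_size=0.01)
gross_pct = (exit_ - entry) / entry * 100.0
net_pct = net_return_pct(entry, exit_,
                         commission_pct_per_side=0.05,  # assumed pessimistic rate
                         slippage_pct_per_side=stressed_slippage_pct)
print(f"gross {gross_pct:.2f}%  net {net_pct:.2f}%")
```

Order rejections and partial fills are not modeled here; they need per-order logic inside the backtest engine itself.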
Time robustness:
- Analyze year-by-year performance
- Require positive expectancy in majority of years
- Ensure strategy doesn't rely on 1-2 exceptional periods
- Test in different market regimes separately
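The year-by-year checks above can be sketched as follows, assuming trades are available as `(year, return_pct)` pairs. The function name, the input layout, and the sample trades are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def time_robustness(trades: list[tuple[int, float]]) -> dict:
    """Summarize per-year expectancy and dependence on the single best year."""
    by_year = defaultdict(list)
    for year, return_pct in trades:
        by_year[year].append(return_pct)

    expectancy = {y: mean(r) for y, r in sorted(by_year.items())}
    yearly_total = {y: sum(r) for y, r in by_year.items()}
    total = sum(yearly_total.values())
    positive_years = sum(1 for e in expectancy.values() if e > 0)

    return {
        "expectancy_by_year": expectancy,
        "majority_positive": positive_years > len(expectancy) / 2,
        # Share of total profit from the best year; a value near or above 1.0
        # means the strategy depends on one exceptional period.
        "best_year_share": max(yearly_total.values()) / total if total > 0 else None,
    }


sample_trades = [  # made-up data for illustration
    (2019, 1.0), (2019, -0.5),
    (2020, 2.0), (2020, 1.0),
    (2021, -1.0), (2021, -0.5),
]
report = time_robustness(sample_trades)
print(report)
```

In this made-up sample the majority of years are positive, yet the best year contributes more than the entire net profit, which is exactly the dependence the list above warns about.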
Sample size:
- Absolute minimum: 30 trades
- Preferred: 100+ trades
- High confidence: 200+ trades
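The thresholds above can be encoded directly, and a standard error shows why they matter. This is a sketch: the tier labels are made up for illustration, and the standard-error formula is the usual binomial one, which assumes trades are independent.

```python
import math


def sample_size_tier(total_trades: int) -> str:
    """Map a trade count onto the thresholds listed above (labels are illustrative)."""
    if total_trades < 30:
        return "insufficient"
    if total_trades < 100:
        return "minimum"
    if total_trades < 200:
        return "preferred"
    return "high confidence"


def win_rate_standard_error(win_rate: float, total_trades: int) -> float:
    """Binomial standard error of an observed win rate (win_rate as a fraction)."""
    return math.sqrt(win_rate * (1.0 - win_rate) / total_trades)


for n in (30, 100, 200):
    se = win_rate_standard_error(0.62, n)
    print(f"{n:>3} trades: {sample_size_tier(n):<15} win-rate SE = {se:.3f}")
```

The uncertainty around an observed win rate shrinks only with the square root of the trade count, so going from 30 to 200 trades narrows it by a factor of about 2.6.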
5. Out-of-Sample Validation
Walk-forward analysis:
- Optimize on training period (e.g., Year 1-3)
- Test on validation period (Year 4)
- Roll forward and repeat
- Compare in-sample vs out-of-sample performance
Warning signs:
- Out-of-sample performance below 50% of in-sample performance
- Need frequent parameter re-optimization
- Parameters change dramatically between periods
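The rolling train/test split and the 50% warning sign can be sketched as below. The helper names are hypothetical, the year range is an example, and the performance metric passed to `oos_degraded` is whatever you optimize on (expectancy, for instance).

```python
def walk_forward_windows(first_year: int, last_year: int,
                         train_years: int = 3, test_years: int = 1):
    """Return ((train_start, train_end), (test_start, test_end)) year pairs."""
    windows = []
    start = first_year
    while start + train_years + test_years - 1 <= last_year:
        train = (start, start + train_years - 1)
        test = (train[1] + 1, train[1] + test_years)
        windows.append((train, test))
        start += test_years  # roll forward by one test period
    return windows


def oos_degraded(in_sample: float, out_of_sample: float,
                 threshold: float = 0.5) -> bool:
    """True when out-of-sample performance is below 50% of in-sample."""
    if in_sample <= 0:
        return True  # nothing worth validating
    return out_of_sample / in_sample < threshold


windows = walk_forward_windows(2015, 2020)
for train, test in windows:
    print(f"optimize on {train[0]}-{train[1]}, validate on {test[0]}-{test[1]}")
```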
6. Evaluate Results
Questions to answer:
- Does edge survive pessimistic assumptions?
- Is performance stable across parameter variations?
- Does strategy work in multiple market regimes?
- Is sample size sufficient for statistical confidence?
- Are results realistic, not "too good to be true"?
Decision criteria:
- ✅ Deploy: Survives all stress tests with acceptable performance
- 🔄 Refine: Core logic sound but needs parameter adjustment
- ❌ Abandon: Fails stress tests or relies on fragile assumptions
Use the evaluation script for a structured, quantitative assessment:
python3 skills/backtest-expert/scripts/evaluate_backtest.py \
--total-trades 150 \
--win-rate 62 \
--avg-win-pct 1.8 \
--avg-loss-pct 1.2 \
--max-drawdown-pct 15 \
--years-tested 8 \
--num-parameters 3 \
--slippage-tested \
--output-dir reports/
The script scores across 5 dimensions (Sample Size, Expectancy, Risk Management, Robustness, Execution Realism), detects red flags, and outputs a Deploy/Refine/Abandon verdict.
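For intuition about the Expectancy dimension, the example inputs above can be checked by hand. The formulas below are the standard textbook definitions; whether `evaluate_backtest.py` computes them in exactly this way is an assumption, since the script's internals are not shown here.

```python
def expectancy_pct(win_rate_pct: float, avg_win_pct: float,
                   avg_loss_pct: float) -> float:
    """Average return per trade, in percent (avg_loss_pct given as a positive number)."""
    p_win = win_rate_pct / 100.0
    return p_win * avg_win_pct - (1.0 - p_win) * avg_loss_pct


def profit_factor(win_rate_pct: float, avg_win_pct: float,
                  avg_loss_pct: float) -> float:
    """Expected gross gains divided by expected gross losses."""
    p_win = win_rate_pct / 100.0
    return (p_win * avg_win_pct) / ((1.0 - p_win) * avg_loss_pct)


# Inputs from the example command: 62% win rate, 1.8% avg win, 1.2% avg loss.
exp = expectancy_pct(62, 1.8, 1.2)   # 0.62 * 1.8 - 0.38 * 1.2
pf = profit_factor(62, 1.8, 1.2)     # 1.116 / 0.456
print(f"expectancy {exp:.2f}% per trade, profit factor {pf:.2f}")
```

With these inputs the expectancy works out to 0.66% per trade before any additional friction is applied.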
Key Testing Principles
Punish the Strategy
Add friction everywhere:
- Commissions higher than reality
- Slippage 1.5-2x typical
- Worst-case fills
- Order rejections
- Partial fills
Rationale: Strategies that survive pessimistic assumptions often outperform in live trading.
Seek Plateaus, Not Peaks
Look for parameter ranges where performance is stable, not optimal values that create performance spikes.
Good: Strategy profitable with stop loss anywhere from 1.5% to 3.0%
Bad: Strategy only works with stop loss at exactly 2.13%
Stable performance indicates genuine edge; narrow optima suggest curve-fitting.
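One simple way to tell a plateau from a spike is to compare a parameter value's result with its immediate neighbors. This is a sketch, not a standard test: the function name, the 30% tolerance, and the expectancy figures are all assumptions chosen for illustration.

```python
def is_plateau(results: list[float], index: int, tolerance: float = 0.3) -> bool:
    """True if results[index] and its neighbors are all profitable and similar.

    `results` holds one performance value per parameter setting, ordered by
    parameter value. A neighbor passes if it is positive and no more than
    `tolerance` (as a fraction) below the center value.
    """
    center = results[index]
    if center <= 0:
        return False
    neighbors = results[max(0, index - 1):index] + results[index + 1:index + 2]
    return all(n > 0 and n >= center * (1.0 - tolerance) for n in neighbors)


# Expectancy per trade for stop losses of 1.5%, 2.0%, 2.5%, 3.0% (made-up numbers).
stable = [0.50, 0.55, 0.60, 0.52]
# Same idea, but only the middle setting is profitable (made-up numbers).
spiky = [-0.20, 0.90, -0.10]

print(is_plateau(stable, 2))  # neighbors stay close to the best value
print(is_plateau(spiky, 1))   # neighbors are losing money
```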
Test All Cases, Not Cherry-Picked Examples
Wrong approach: Study hand-picked "market leaders" that worked
Right approach: Test every stock that met criteria, including those that failed
Selective examples create survivorship bias and overestimate strategy quality.
Separate Idea Generation from Validation
Intuition: Useful for generating hypotheses
Validation: Must be purely data-driven
Never let attachment to an idea influence interpretation of test results.
Common Failure Patterns
Recognize these patterns early to save time:
- Parameter sensitivity: Only works with exact parameter values
- Regime-specific: Great in some years, terrible in others
- Slippage sensitivity: Unprofitable when realistic costs added
- Small sample: Too few trades for statistical confidence
- Look-ahead bias: "Too good to be true" results
- Over-optimization: Many parameters, poor out-of-sample results
See references/failed_tests.md for detailed examples and diagnostic framework.
Output
- reports/backtest_eval_<timestamp>.json: structured evaluation with per-dimension scores, red flags, and verdict
- reports/backtest_eval_<timestamp>.md: human-readable report with dimension table, key metrics, and red flag details
Resources
Methodology Reference
File: references/methodology.md
When to read: For detailed guidance on specific testing techniques.
Contents:
- Stress testing methods
- Parameter sensitivity analysis
- Slippage and friction modeling
- Sample size requirements
- Market regime classification
- Common biases and pitfalls (survivorship, look-ahead, curve-fitting, etc.)
Failed Tests Reference
File: references/failed_tests.md
When to read: When strategy fails tests, or learning from past mistakes.
Contents:
- Why failures are valuable
- Common failure patterns with examples
- Case study documentation framework
- Red flags checklist for evaluating backtests
Critical Reminders
Time allocation: Spend 20% generating ideas, 80% trying to break them.
Context-free requirement: If strategy requires "perfect context" to work, it's not robust enough for systematic trading.
Red flag: If backtest results look too good (>90% win rate, minimal drawdowns, perfect timing), audit carefully for look-ahead bias or data issues.
Tool limitations: Understand your backtesting platform's quirks (interpolation methods, handling of low liquidity, data alignment issues).
Statistical significance: Small edges require large sample sizes to prove. As a rule of thumb, a 5% edge per trade needs 100+ trades to distinguish from luck; the exact number depends on how variable the per-trade returns are.
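The link between edge size and sample size can be made concrete with a back-of-the-envelope calculation. The sketch below assumes independent trades and a normal approximation, and asks how many trades are needed before the mean return sits `z` standard errors above zero. The function name and the example figures are hypothetical.

```python
import math


def required_trades(mean_return_pct: float, std_return_pct: float,
                    z: float = 1.96) -> int:
    """Trades needed so that mean / (std / sqrt(n)) >= z.

    Rearranged: n >= (z * std / mean) ** 2.
    """
    if mean_return_pct <= 0:
        raise ValueError("no positive edge to detect")
    return math.ceil((z * std_return_pct / mean_return_pct) ** 2)


# Assumed figures: 0.5% average return per trade, 2.5% standard deviation.
print(required_trades(0.5, 2.5, z=2.0))
# Halving the edge quadruples the requirement.
print(required_trades(0.25, 2.5, z=2.0))
```

Because the requirement grows with the square of the noise-to-edge ratio, a thin edge in a volatile instrument can need several hundred trades before it is distinguishable from luck.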
Discretionary vs Systematic Differences
This skill focuses on systematic/quantitative backtesting where:
- All rules are codified in advance
- No discretion or "feel" in execution
- Testing happens on all historical examples, not cherry-picked cases
- Context (news, macro) is deliberately stripped out
Discretionary traders study differently—this skill may not apply to setups requiring subjective judgment.