Agent skill

experiment-design-checklist

Generates a rigorous experiment design given a hypothesis. Use when asked to design experiments, plan experiments, create an experimental setup, or figure out how to test a research hypothesis. Covers controls, baselines, ablations, metrics, statistical tests, and compute estimates.


Install this agent skill to your Project

npx add-skill https://github.com/48Nauts-Operator/opencode-baseline/tree/main/.opencode/skill/experiment-design-checklist

SKILL.md

Experiment Design Checklist

Prevent the "I ran experiments for 3 months and they're meaningless" disaster through rigorous upfront design.

The Core Principle

Before running ANY experiment, you should be able to answer:

  1. What specific claim will this experiment support or refute?
  2. What would convince a skeptical reviewer?
  3. What could go wrong that would invalidate the results?

Process

Step 1: State the Hypothesis Precisely

Convert your research question into falsifiable predictions:

Template:

If [intervention/method], then [measurable outcome], because [mechanism].

Examples:

  • "If we add auxiliary contrastive loss, then downstream task accuracy increases by >2%, because representations become more separable."
  • "If we use learned positional encodings, then performance on sequences >4096 tokens improves, because the model can extrapolate beyond training length."

Null hypothesis: What does "no effect" look like? This is what you're trying to reject.

Step 2: Identify Variables

Independent Variables (what you manipulate):

| Variable | Levels | Rationale |
|----------|--------|-----------|
| [Var 1] | [Level A, B, C] | [Why these levels] |

Dependent Variables (what you measure):

| Metric | How Measured | Why This Metric |
|--------|--------------|-----------------|
| [Metric 1] | [Procedure] | [Justification] |

Control Variables (what you hold constant):

| Variable | Fixed Value | Why Fixed |
|----------|-------------|-----------|
| [Var 1] | [Value] | [Prevents confound X] |
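
The three tables map directly onto a run grid: every combination of independent-variable levels, with the control variables copied unchanged into each run. A minimal sketch, assuming a full factorial design; the variable names and values (`loss_weight`, `encoder`, `batch_size`, `learning_rate`) are hypothetical placeholders, not part of this skill.

```python
from itertools import product


def build_run_grid(independent, controls):
    """Return one config dict per combination of independent-variable levels.

    independent: dict mapping variable name -> list of levels to try
    controls:    dict mapping variable name -> the single fixed value
    """
    names = list(independent)
    runs = []
    for levels in product(*(independent[name] for name in names)):
        config = dict(controls)           # controls are identical in every run
        config.update(zip(names, levels))  # only the manipulated variables vary
        runs.append(config)
    return runs


# Hypothetical example values, for illustration only.
independent = {"loss_weight": [0.0, 0.1, 1.0], "encoder": ["small", "base"]}
controls = {"batch_size": 256, "learning_rate": 3e-4}

grid = build_run_grid(independent, controls)
print(len(grid), "runs")
for config in grid:
    print(config)
```

Listing the grid before training makes the run count visible early, which feeds into the compute estimate in Step 7.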

Step 3: Choose Baselines

Every experiment needs comparisons. No result is meaningful in isolation.

Baseline Hierarchy:

  1. Random/Trivial Baseline
    • What does random chance achieve?
    • Sanity check that the task isn't trivial
  2. Simple Baseline
    • Simplest reasonable approach
    • Often embarrassingly effective
  3. Standard Baseline
    • Well-known method from literature
    • Apples-to-apples comparison
  4. State-of-the-Art Baseline
    • Current best published result
    • Only if you're claiming SOTA
  5. Ablated Self
    • Your method minus key components
    • Shows each component contributes

For each baseline, document:

  • Source (paper, implementation)
  • Hyperparameters used
  • Whether you re-ran or used reported numbers
  • Any modifications made
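
The first rung of the hierarchy is cheap to compute before any model is trained. A minimal sketch for a classification task, assuming labels are available as plain Python lists; the function names and the toy labels are illustrative, not part of this skill.

```python
from collections import Counter


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)


def uniform_chance_accuracy(train_labels):
    """Expected accuracy of guessing uniformly among the observed classes."""
    return 1.0 / len(set(train_labels))


# Toy, imbalanced label sets for illustration only.
train = ["cat"] * 70 + ["dog"] * 20 + ["bird"] * 10
test = ["cat"] * 7 + ["dog"] * 2 + ["bird"] * 1

print("majority-class accuracy:", majority_class_accuracy(train, test))
print("uniform-chance accuracy:", uniform_chance_accuracy(train))
```

On imbalanced data the majority-class number is usually the more honest floor, since uniform chance can make a weak model look better than it is.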

Step 4: Design Ablations

Ablations answer: "Is each component necessary?"

Ablation Template:

| Variant | What's Removed/Changed | Expected Effect | If No Effect... |
|---------|------------------------|-----------------|-----------------|
| Full Model | Nothing | Best performance | - |
| w/o Component A | Remove A | Performance drops X% | A isn't helping |
| w/o Component B | Remove B | Performance drops Y% | B isn't helping |
| Component A only | Only A, no B | Shows A's isolated contribution | - |

Good ablations are:

  • Surgical (one change at a time)
  • Interpretable (clear what was changed)
  • Informative (result tells you something)
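
The "one change at a time" rule can be enforced by generating the variants rather than writing them by hand. A minimal sketch, assuming each component can be switched on or off with a boolean flag; the component names are hypothetical.

```python
def ablation_variants(components):
    """Build the full model, leave-one-out, and single-component configs.

    Each config maps component name -> enabled flag, so every leave-one-out
    variant differs from the full model in exactly one flag.
    """
    full = {name: True for name in components}
    variants = {"full": full}
    for name in components:
        variants[f"without_{name}"] = {**full, name: False}
    for name in components:
        variants[f"only_{name}"] = {other: other == name for other in components}
    return variants


# Hypothetical component names, for illustration only.
components = ["contrastive_loss", "augmentation", "ema_teacher"]
for variant_name, flags in ablation_variants(components).items():
    print(f"{variant_name:28s} {flags}")
```

With only two components, "without B" and "only A" describe the same configuration, so duplicates can be dropped before budgeting runs.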

Step 5: Address Confounds

Things that could explain your results OTHER than your hypothesis:

Common Confounds:

| Confound | How to Check | How to Control |
|----------|--------------|----------------|
| Hyperparameter tuning advantage | Same tuning budget for all | Report tuning procedure |
| Compute advantage | Matched FLOPs/params | Report compute used |
| Data leakage | Check train/test overlap | Strict separation |
| Random seed luck | Multiple seeds | Report variance |
| Implementation bugs (baseline) | Verify baseline numbers | Use official implementations |
| Cherry-picked examples | Random or systematic selection | Pre-register selection criteria |
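
The data-leakage check is one of the few confounds that can be automated outright. A minimal sketch for text datasets, assuming examples are strings and that "overlap" means an exact match after lowercasing and whitespace normalization; near-duplicates and paraphrases are not caught, and the function names are illustrative.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def fingerprint(text):
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_overlap(train_examples, test_examples):
    """Return the test examples whose normalized text also appears in train."""
    train_hashes = {fingerprint(example) for example in train_examples}
    return [example for example in test_examples
            if fingerprint(example) in train_hashes]


# Toy data for illustration only.
train = ["The cat sat on the mat.", "A quick  brown fox"]
test = ["a quick brown fox", "An unseen sentence."]

leaked = find_overlap(train, test)
print(f"{len(leaked)} of {len(test)} test examples also appear in train:", leaked)
```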

Step 6: Statistical Rigor

Sample Size:

  • How many random seeds? (Minimum: 3, better: 5+)
  • How many data splits? (If applicable)
  • Power analysis: Can you detect expected effect size?
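
The power-analysis question can be answered roughly before any run. A minimal sketch using the normal approximation for a two-sided comparison of two independent groups, n ≈ 2·((z₁₋α/₂ + z_power)·σ/δ)², where δ is the expected difference and σ the seed-to-seed standard deviation. Both inputs are assumptions you have to supply, for example from a pilot run. The approximation underestimates n when the answer is small, so treat the result as a lower bound.

```python
import math
from statistics import NormalDist


def seeds_per_method(effect, std, alpha=0.05, power=0.8):
    """Approximate number of seeds per method needed to detect `effect`.

    effect: expected difference in the metric between the two methods
    std:    assumed seed-to-seed standard deviation of the metric
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * std / effect) ** 2
    return math.ceil(n)


# Hypothetical pilot numbers: metric std of 1.0 point across seeds.
for effect in (2.0, 1.0, 0.5):
    print(f"effect={effect}: at least {seeds_per_method(effect, std=1.0)} seeds per method")
```

The required seed count grows with the square of σ/δ, so halving the expected effect roughly quadruples the number of seeds.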

What to Report:

  • Mean ± standard deviation (or standard error)
  • Confidence intervals where appropriate
  • Statistical significance tests if claiming "better"
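
The first two reporting items can be produced by one small helper. A minimal sketch, assuming one score per seed and a t-based 95% confidence interval; the critical values are hard-coded for 3 to 10 seeds because the standard library has no t-distribution, and the scores shown are made up.

```python
import math
from statistics import mean, stdev

# Two-sided 95% t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def summarize(scores):
    """Mean, sample std, standard error, and 95% CI for per-seed scores."""
    n = len(scores)
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch only covers 3 to 10 seeds")
    center = mean(scores)
    sd = stdev(scores)            # sample standard deviation (n - 1 denominator)
    sem = sd / math.sqrt(n)       # standard error of the mean
    half_width = T_CRIT_95[n - 1] * sem
    return {"n": n, "mean": center, "std": sd, "sem": sem,
            "ci95": (center - half_width, center + half_width)}


# Made-up accuracies from five seeds, for illustration only.
result = summarize([81.2, 80.7, 82.1, 81.5, 80.9])
print(f"{result['mean']:.2f} ± {result['std']:.2f} (std, n={result['n']})")
print(f"95% CI: [{result['ci95'][0]:.2f}, {result['ci95'][1]:.2f}]")
```

Whichever spread is reported, state whether it is the standard deviation or the standard error, since the two differ by a factor of √n.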

Appropriate Tests:

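| Comparison | Test | Assumptions |
|------------|------|-------------|
| Two methods, normal data | t-test (Welch's if variances differ) | Normality; equal variance for Student's t |
| Two methods, unknown distribution | Mann-Whitney U | Independent samples; at least ordinal data |
| Multiple methods | ANOVA + post-hoc | Normality, equal variance |
| Multiple methods, unknown distribution | Kruskal-Wallis | Independent samples; at least ordinal data |
| Paired comparisons | Wilcoxon signed-rank | Same test instances for both methods |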
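
For paired comparisons with only a handful of seeds, an exact test is easy to write without extra dependencies. The sketch below is a paired sign-flip permutation test, which is not one of the tests in the table; it is swapped in here because it is distribution-free like Wilcoxon signed-rank and fits in a few lines. It assumes both methods were run on the same seeds or test instances, and the scores are made up.

```python
from itertools import product


def paired_permutation_pvalue(scores_a, scores_b):
    """Exact two-sided p-value for the mean paired difference.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so every one of the 2**n sign assignments is enumerated. Keep n small
    (roughly 15 pairs or fewer) or the enumeration becomes slow.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Made-up per-seed accuracies; method A beats method B on every seed.
method_a = [0.82, 0.85, 0.81, 0.84, 0.83]
method_b = [0.79, 0.80, 0.80, 0.81, 0.78]
print("p =", paired_permutation_pvalue(method_a, method_b))
```

With n pairs the smallest possible two-sided p-value is 2/2ⁿ, so five seeds can never go below 0.0625 even when one method wins every time. That is a concrete reason to prefer more than five seeds if significance at α = 0.05 matters.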

Avoid:

  • p-hacking (running until significant)
  • Uncorrected multiple comparisons (apply a correction such as Bonferroni)
  • Reporting only favorable metrics
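
The Bonferroni correction amounts to dividing α by the number of comparisons, or equivalently multiplying each p-value by that number. A minimal sketch; the p-values are made up, and Bonferroni is conservative, so less strict corrections may suit large families of tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values, reject flags) under Bonferroni correction."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p <= alpha / m for p in p_values]
    return adjusted, reject


# Made-up raw p-values from three method comparisons.
raw = [0.01, 0.04, 0.03]
adjusted, reject = bonferroni(raw)
for p, p_adj, significant in zip(raw, adjusted, reject):
    print(f"raw={p:.3f}  adjusted={p_adj:.3f}  significant={significant}")
```

Two of the three raw p-values fall below 0.05, but only one survives the correction.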

Step 7: Compute Budget

Before running, estimate:

| Component | Estimate | Notes |
|-----------|----------|-------|
| Single training run | X GPU-hours | [Details] |
| Hyperparameter search | Y runs × X hours | [Search strategy] |
| Baselines | Z runs × W hours | [Which baselines] |
| Ablations | N variants × X hours | [Which ablations] |
| Seeds | M seeds × above | [How many seeds] |
| Total | T GPU-hours | Buffer: 1.5-2x |

Go/No-Go Decision: Is this feasible with available resources?
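
The table reduces to a few multiplications, so the estimate can live next to the design document. A minimal sketch under one stated assumption: the seed multiplier applies to the final evaluation runs (full model, baselines, ablations) but not to the hyperparameter search. All numbers are hypothetical.

```python
def estimate_gpu_hours(run_hours, search_runs, baseline_runs, baseline_hours,
                       ablation_variants, seeds, buffer=1.5):
    """Estimate total GPU-hours for an experiment plan.

    Assumption: the hyperparameter search is run once, while the full model,
    baselines, and ablations are each repeated for every seed.
    """
    search = search_runs * run_hours
    per_seed = (run_hours                         # full model
                + baseline_runs * baseline_hours  # baselines
                + ablation_variants * run_hours)  # ablations
    evaluation = seeds * per_seed
    subtotal = search + evaluation
    return {"search": search, "evaluation": evaluation,
            "subtotal": subtotal, "with_buffer": subtotal * buffer}


# Hypothetical plan: 8 GPU-hour runs, 20 search runs, 3 baselines at 6 hours,
# 4 ablation variants, 5 seeds, and the lower end of the 1.5-2x buffer.
budget = estimate_gpu_hours(run_hours=8, search_runs=20, baseline_runs=3,
                            baseline_hours=6, ablation_variants=4, seeds=5)
for item, hours in budget.items():
    print(f"{item:12s} {hours:8.1f} GPU-hours")
```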

Step 8: Pre-Registration (Optional but Recommended)

Write down BEFORE running:

  • Exact hypotheses
  • Primary metrics (not chosen post-hoc)
  • Analysis plan
  • What would constitute "success"

This prevents unconscious goal-post moving.
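
One lightweight way to make the written plan tamper-evident is to record a hash of it before the first run, for example in a commit message or a lab notebook. A minimal sketch, assuming the plan fits in a JSON-serializable dict; the field names and values are hypothetical, and this complements a formal pre-registration rather than replacing one.

```python
import hashlib
import json


def freeze_plan(plan):
    """Return a SHA-256 digest of the plan in a canonical JSON form.

    Sorting keys makes the digest independent of dict ordering, so any later
    change to the plan's content produces a different digest.
    """
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical plan fields, for illustration only.
plan = {
    "hypothesis": "Auxiliary contrastive loss raises downstream accuracy by >2%",
    "primary_metric": "downstream accuracy",
    "analysis": "paired test across 5 seeds, alpha = 0.05",
    "success": "mean improvement > 2 points with p < 0.05",
}
print("plan digest:", freeze_plan(plan))
```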

Output: Experiment Design Document

```markdown
# Experiment Design: [Title]

## Hypothesis
[Precise statement]

## Variables
### Independent
[Table]

### Dependent
[Table]

### Controls
[Table]

## Baselines
1. [Baseline 1]: [Source, details]
2. [Baseline 2]: [Source, details]

## Ablations
[Table]

## Confound Mitigation
[Table]

## Statistical Plan
- Seeds: [N]
- Tests: [Which tests for which comparisons]
- Significance threshold: [α level]

## Compute Budget
[Table with total estimate]

## Success Criteria
- Primary: [What must be true]
- Secondary: [Nice to have]

## Timeline
- Phase 1: [What, when]
- Phase 2: [What, when]

## Known Risks
1. [Risk 1]: [Mitigation]
2. [Risk 2]: [Mitigation]
```

Red Flags in Experiment Design

🚩 "We'll figure out the metrics later"
🚩 "One run should be enough"
🚩 "We don't need baselines, it's obviously better"
🚩 "Let's just see what happens"
🚩 "We can always run more if it's not significant"
🚩 No compute estimate before starting
🚩 Vague success criteria
