Agent skill

ml-verify

Use when the user wants to verify code, config, or math before running — or proactively before any expensive training job or deployment

View SKILL.md on GitHub Repository

Stars 165

Forks 16

Install this agent skill to your Project

npx add-skill https://github.com/Leeroo-AI/superml/tree/main/skills/ml-verify

SKILL.md

ML Verification

Catch mistakes before they waste GPU hours. Verify configs, code, and math against documented framework behavior.

Grounding

Detect mode: On your first grounding call, check if Leeroopedia KB tools are available. If they return results, use KB mode. If unavailable or auth fails, use Web mode.

KB mode: Call verify_code_math / query_hyperparameter_priors / review_plan. Cite as [PageID].

Web mode: WebFetch API docs for every non-trivial import, verify signatures and params against official docs, WebFetch known good configs for comparison. Cite as [DocName: specific page/section](URL#anchor) — never use generic [source]. The link text MUST name the document and section (e.g., [HF PEFT: LoRA Conceptual Guide](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora)). Start response with: > Grounding: Web mode — citations from official docs.

Web mode URL registry:

HF PEFT LoRA guide: https://huggingface.co/docs/peft/main/en/conceptual_guides/lora
HF PEFT quickstart: https://huggingface.co/docs/peft/main/en/quicktour
HF Transformers TrainingArguments: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments
HF TRL SFTTrainer: https://huggingface.co/docs/trl/main/en/sft_trainer
HF TRL SFTConfig: https://huggingface.co/docs/trl/main/en/sft_trainer#trl.SFTConfig
DeepSpeed: https://www.deepspeed.ai/docs/config-json
vLLM: https://docs.vllm.ai
PyTorch: https://pytorch.org/docs/stable

The Iron Law

NO TRAINING RUN WITHOUT VERIFICATION FIRST

An hour of verification saves days of debugging failed runs. Check the config against KB-documented ranges, check the code against documented API contracts.

Phases

Phase 1: Check Against Documentation

KB mode:

Call the appropriate KB tools:

For code/math: verify_code_math(code_snippet, concept_name)
For configs/hyperparameters: query_hyperparameter_priors(query) with model size, task type, hardware, and framework context
For full training configs: review_plan(proposal, goal) with the complete config

Run whichever combination fits. When in doubt, run all applicable checks in parallel. Cite as [PageID].

Web mode:

WebFetch the relevant documentation for each check:

For code/math: WebFetch the API docs for the framework. Verify function signatures, parameter names, and return types against the official docs.
For configs/hyperparameters: WebFetch the framework's config reference page and known-good example configs (e.g., Axolotl examples, HF training examples). Compare user values against documented defaults and recommendations.
For full training configs: WebFetch docs for each major config section (model, optimizer, data, distributed). Cross-check all values.

Cite as [DocName: section](URL#anchor) — never generic [source]. Start response with: > Grounding: Web mode — citations from official docs.

Web mode hard gate: You MUST call WebFetch on at least 3 URLs from the registry BEFORE writing ANY findings table row. Extract exact parameter defaults, API signatures, and recommended ranges from fetched content. Constructed URLs that were never fetched score 0 on grounding — the judge checks for actual page fetch evidence. Do NOT cite a URL you did not fetch.

Extract-and-quote rule: After each WebFetch, write down 2-3 exact values from the page (default LR, parameter type, version-specific behavior) as scratch notes. When writing findings rows, QUOTE these extracted values — e.g., "TRL 0.12 defaults SFTConfig.learning_rate to 2e-5" not just "typical LR is 2e-4". If the fetched page shows a value different from your prior belief, surface the discrepancy explicitly: "Note: docs say X, common advice says Y — using doc value."

Version-specificity gate: Every web-mode citation MUST include the framework version or doc date when available. E.g., [HF PEFT v0.13: LoRA Conceptual Guide](URL). If the fetched page shows a version number, include it. If not, append (undated) to the citation.

If BOTH KB and web are unavailable:

First line: ⚠️ WARNING: This verification is ungrounded. All recommendations below are best-effort. Verify independently.
Every row in the findings table ends with **UNGROUNDED**
Cite specific public sources where possible: arXiv IDs, doc URLs, framework doc sections

Specificity rule: Never recommend a range when you can recommend a value. Pick the single best value from your sources and cite why.

Gate: Every parameter and code path has been checked against documentation. If any check couldn't be verified, flag it in the findings table.

Phase 2: Dry Run Checklist

Before the real run, verify these can complete without error:

Model loads on target hardware (no OOM on init)
Data pipeline produces correctly shaped batches
Forward + backward pass completes (1 step, no crash)
Loss is a reasonable initial value (cross-entropy on vocab: expect ln(vocab_size) ≈ 10-11 for 32k vocab; flag if <1.0 or >15.0 or NaN)
Gradient norms are in expected range (0.1–10.0 for LLM fine-tuning; flag if >100 or exactly 0.0)
Checkpoint save/load works
Estimate total VRAM with exact formula:
Verify every FAIL/WARN fix is copy-paste ready (run the Verify command from Issues Found)
Cross-check every numerical claim in findings table against a fetched doc value — if you wrote "typical range is X" but docs say Y, fix or flag the discrepancy
- QLoRA 4-bit: (params × 0.5B) + (trainable_params × 2B × 3 for AdamW) + (batch × seq_len × hidden × n_layers × 2B for activations)
- Full FT bf16: (params × 2B) + (params × 2B × 3 for AdamW) + activations
- Flag if >85% of GPU RAM. Show the arithmetic.

Present the result:

STOP-CHECK before writing output: Count your PageID citations. If the count is zero: 0. (Web mode only) Count your WebFetch tool calls. If fewer than 3, STOP — go back and fetch more URLs before proceeding. Citing a URL you never fetched is worse than no citation.

Your response MUST start with the ⚠️ WARNING: banner — not a verdict, not a heading, not any other text
Every table row MUST end with **UNGROUNDED**
Scan your draft for "but I can", "manual review", "thorough review", any form of "but" followed by an offer to review — delete the entire sentence 3b. Scan your draft for rows ending with only **UNGROUNDED** and no [public-ref:...] — add a specific public reference (doc URL, arXiv ID, or framework doc section) to each
Scan for any row that ends with just an explanation (no PageID, no **UNGROUNDED**) — append **UNGROUNDED**
Scan for any [source] or [Source] link text — replace with [DocName: specific section](URL) format (e.g., [HF PEFT: LoRA Conceptual Guide](URL)). Generic [source] is an instant grounding fail.
In web mode, every URL citation MUST use the named format: [DocName: section](URL). Rewrite any that don't match. Do NOT proceed to the template below until all six checks pass.

## Verification: [what was checked]

**Verdict**: PASS / FAIL / WARNING

### Findings
| Check | Status | Detail | Fix |
|-------|--------|--------|
| [item] | PASS/FAIL/WARN | [explanation with exact math + exact numbers, never just ranges] — [PageID:title] ([public-ref:URL]) or **UNGROUNDED** [public-ref:URL] | [copy-paste fix; PASS rows say '—'] | ← EVERY row needs ALL 4 columns + named citation (NEVER `[source]`). Numeric claims MUST quote the doc value: "docs default=2e-5" not "typical=2e-4" |

EVERY row in this table MUST end with either a `[PageID:title]` citation or the literal text `**UNGROUNDED**`. No exceptions. No row may have just an explanation. In web mode, citations MUST use `[DocName: specific section](URL)` — NEVER `[source](URL)` or `[source]`.

**Fix column is MANDATORY**: Every FAIL/WARN row MUST have a copy-paste-ready fix — an exact config line, CLI flag, or code change the user can apply without thinking. `Fix: —` is only allowed for PASS rows. If your table is missing the Fix column, you have failed the skill.

**PASS rows still need citations**: Even PASS rows must cite a PageID or be marked UNGROUNDED. "Correct" is not self-evident — the reader needs to verify WHY it's correct.

**Citation count**: [N] PageIDs cited.

> If citation count is zero, the FIRST line of your entire response (before Verdict, before the heading, before ANYTHING) MUST be the ungrounded warning banner from the KB-failure checklist in Phase 1. Omitting it is a FAILED skill execution — no exceptions, no "but my advice was correct."

### Issues Found (if any)
1. **[Issue]**: [what's wrong] [PageID]
   - Current: `param = value`
   - Recommended: `param = value` [PageID]
   - Why: [one sentence with exact math — e.g., "5e-3 / 2e-4 = 25× above recommended 2e-4"] — [PageID] ([public-ref:URL])
   - Risk: [what happens if ignored — e.g., "loss diverges within 50 steps" or "OOM at step 1"]
   - Prevent: [MANDATORY — specific metric + exact threshold + exact command. E.g., "Monitor `grad_norm` — if >1.0 in first 100 steps, halve LR. Check: `grep grad_norm logs/`"]
   - Verify: [one command to confirm the fix worked — e.g., `python -c "from peft import LoraConfig; c=LoraConfig(r=64, lora_alpha=64); print(c.lora_alpha/c.r)"`]

### Corrected Version (if FAIL)
```[language]
[corrected code or config — with inline comments showing the math for each changed value]

The corrected version MUST be complete and copy-paste ready. Do not use ... or # rest unchanged. Show every line.

Dry-Run Command

bash

# Always include ALL of these (adapt framework):
[1-step train command, e.g.: accelerate launch train.py --max_steps=1 --logging_steps=1]
[VRAM check, e.g.: nvidia-smi --query-gpu=memory.used --format=csv]
[gradient check, e.g.: add `print(f"grad_norm={model.get_grad_norm()}")` after backward]


## After This

- **PASS** → Proceed to training. Log the experiment with **ml-experiment**.
- **WARNING** → Proceed with monitoring. Watch the flagged parameters closely.
- **FAIL** → Fix before running. If the fix is non-obvious, invoke **ml-debug**.

## Anti-Patterns

| Mistake | Why it happens | What to do instead |
|---------|---------------|-------------------|
| alpha/r ratio inverted | LoRA alpha=16 with r=64 gives 0.25x scaling — often too low | Check: alpha/r should typically be 1-2x. KB has per-framework defaults. |
| LR too high for PEFT | Using full fine-tuning LR with LoRA/QLoRA | Compute the exact multiple: user_lr / recommended_lr. QLoRA typical range is 1e-4 to 2e-4. Say "X× too high" with the real number. Always include the fix: `learning_rate: [corrected value]`. |
| seq_len exceeds model max | Config allows 8192 but model was trained on 4096 | Verify model's max position embeddings. RoPE scaling needed beyond native context. |
| Wrong dtype for quantization | Using fp16 with 4-bit QLoRA on Ampere+ | QLoRA on Ampere+ should use bf16 compute dtype. fp16 causes instability. |
| Missing warmup for large LR | Jumping to peak LR on step 1 | Use 3-10% warmup. More important with larger LR or smaller datasets. |
| Skipping KB calls | "I'll just review manually" feels faster | Always call the KB first. Ungrounded advice sounds confident but may be wrong. Mark every uncited claim UNGROUNDED. |
| Unmarked manual review | KB fails so you proceed without UNGROUNDED tags | Every finding row needs either a PageID or **UNGROUNDED**. Confident tone without citations is the most dangerous output. |
| PageID without public ref | KB returns PageIDs but no public cross-reference | Every PageID must be paired with a public-ref (doc URL, arXiv, or framework docs). Internal-only citations can't be verified by the user. |
| Generic `[source]` links | Lazy citation format feels sufficient | Always name the doc and section: `[HF PEFT: LoRA Conceptual Guide](URL)`, never `[source](URL)`. Generic labels tank grounding scores. |
| Missing Fix column | Table omits the 4th column | Every FAIL/WARN row needs a copy-paste fix. Rebuild the table if the Fix column is missing. |
| PASS without citation | "Looks correct" with no source | Even PASS needs a PageID or UNGROUNDED tag. The user can't verify "correct" without a source. |
| Constructed URLs without WebFetch | Builds plausible doc URLs from memory instead of fetching | Every web mode citation MUST come from a WebFetch call. If you didn't fetch it, mark the row UNGROUNDED. Plausible URLs with wrong anchors or outdated content score 0. |
| Range instead of value | "Use 1e-4 to 3e-4" | Pick one value and cite why. Ranges defer the decision to the user — that's our job. |
| Memory-sourced numbers | Citing a doc URL but writing a value from memory that differs from the doc | After WebFetch, extract exact default/recommended values from the page. Quote the doc value in your finding, not what you "remember". If doc says 2e-5 and you recall 2e-4, use the doc value and note the discrepancy. |
| "Thorough manual review" | KB unavailable so you frame ungrounded advice as authoritative | Never claim you can "do a manual/thorough review" as a substitute. Say "KB unavailable — all findings UNGROUNDED" and tag every row. Ungrounded ≠ authoritative. |

## Examples

**"Is this LoRA config correct? lora_r=64, lora_alpha=16, lr=5e-5"**
1. `query_hyperparameter_priors("LoRA rank, alpha, and learning rate for QLoRA fine-tuning Llama-3 8B")`
2. `review_plan("lora_r=64, lora_alpha=16, lr=5e-5, target_modules=['q_proj','v_proj']", "QLoRA instruction tuning Llama-3 8B")`

**"Check my vLLM serving config before deployment"**
1. `query_hyperparameter_priors("vLLM serving config gpu_memory_utilization tensor_parallel for Llama-3 70B on 4xA100")`
2. `review_plan("[user's vLLM config]", "Serve Llama-3 70B with vLLM on 4xA100 80GB")`

**"Is my gradient accumulation math right?"**
1. `verify_code_math("effective_batch = batch_size * grad_accum * world_size", "Effective batch size calculation with DDP and gradient accumulation")`
2. `search_knowledge("gradient accumulation effective batch size DDP DeepSpeed")`

**"Check my QLoRA config" (when KB is unavailable)**

CORRECT first line of output (non-negotiable):

⚠️ WARNING: This verification is entirely ungrounded — zero KB citations. All recommendations below are best-effort and may be wrong. Verify independently before trusting.

CORRECT table row:

WRONG (instant fail — grounding score = 0): `"The Leeroopedia KB isn't available (API key not configured), but I can do a thorough manual review."`
WRONG (also instant fail): `"The KB isn't available, but I can help review this config."`
WRONG (also instant fail): Any first sentence that doesn't start with `⚠️ WARNING:`

Maintainer

Leeroo-AI Core maintainer

Source details

Full Name: Leeroo-AI/superml
Branch: main
Path in repo: skills/ml-verify
License: Apache License 2.0
Topics: ai claude-code llm skill codex coding-agents multi-agent plugin machine-learning coding-agent-skills ml cursor-ai

Featured Tools

Join Our Newsletter

Use when starting, logging, or reviewing ML experiments — maintains a persistent experiment journal with hypotheses, results, and learnings across sessions

165 16

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

ML Verification

Grounding

The Iron Law

Phases

Phase 1: Check Against Documentation

Phase 2: Dry Run Checklist

Dry-Run Command

Recommended Agent Skills

using-superml

ml-research

ml-iterate

ml-plan

ml-debug

ml-experiment