Agent skill

visual-verify

This skill should be used when the user asks to 'verify visual output', 'check how it looks', 'render and review', 'visual verify', 'check the slide', 'does this look right', or when any task produces rendered visual output (slides, charts, documents, UI). Starts a render-vision-fix loop using Gemini vision.

Stars 6
Forks 1

Install this agent skill to your Project

npx add-skill https://github.com/edwinhu/workflows/tree/main/skills/visual-verify

SKILL.md

Announce: "I'm using visual-verify to set up a render-vision-fix loop."

NO VISUAL TASK IS COMPLETE WITHOUT RENDERING, SCORING, AND MEETING THE THRESHOLD.

Source code correctness does NOT imply visual correctness. You MUST render to PNG, score with vision (0-10) — Gemini via look-at, or a subagent with Read if Gemini is unavailable — and iterate until score >= 9.5. Claiming "done" with a score below threshold delivers broken visuals to the user.

Skipping the score check is NOT HELPFUL — the user gets a visual artifact with defects you didn't verify. </EXTREMELY-IMPORTANT>

Agentic Vision Routing

Route based on image complexity, not source language.

Agentic vision (--agentic) enables Gemini's Think-Act-Observe loop: it can crop, zoom, annotate, and re-examine regions of the image autonomously. This catches fine-grained defects that full-image viewing misses.

Image characteristic --agentic? Why
Dense diagram (5+ nodes, many arrows/labels) YES Can crop individual arrow paths, zoom into small labels to check connections
Small text or labels near container edges YES Implicit zoom catches clipping that full-image view misses
Side-by-side or before/after comparison YES Can crop matching regions and annotate differences
Simple layout (2-3 large elements) NO Full image is sufficient; agentic adds latency for no gain
Single chart with large text NO No fine-grained details to investigate

Decision rule: If the image has details that are easy to miss at full resolution, use --agentic. If you can see everything clearly at a glance, skip it.

The Loop

0. PAGE MAP -> If Typst + Touying: Skill("teaching:find-slide-page")
       |      Returns heading → physical page mapping
       |      Skip if: single-page file, non-Typst, or page already known
       |
1. CHANGE  -> Modify source code (Task agent)
       |
2. RENDER  -> Produce PNG (see references/render-commands.md)
       |      Render fails? -> fix source, back to step 1
       |
3. VISION  -> Two-part check:
       |      a) INTENT — Does the render match the design intent?
       |         (Does the visual structure argue what it should?)
       |      b) DEFECTS — Scan for the 9 visual defect categories:
       |         1. Text clipped by or overflowing its container
       |         2. Text or shapes overlapping other elements
       |         3. Arrows crossing through elements instead of routing around
       |         4. Arrows landing on wrong element or pointing into empty space
       |         5. Labels floating ambiguously (not anchored to what they describe)
       |         6. Labels squeezed between adjacent elements without clearance
       |         7. Uneven spacing (cramped sections next to spacious ones)
       |         8. Text too small to read at rendered size
       |         9. Parallel sub-diagrams with inconsistent layout
       |      Complexity-routed look-at call with SCORING:
       |      Dense/fine-grained? -> --agentic (Gemini crops/zooms/annotates)
       |      Simple/large elements? -> vision-only (structured pixel feedback)
       |      → Score 0-10 against checklist items
       |      → Record in SCORES.md
       |
4. DECIDE  -> Score >= 9.5 AND exit criteria met? → DONE
              Score < 9.5 OR defects remain?      → extract fixes, back to step 1

When to Stop

The loop is done when ALL of these hold:

  • Score >= 9.5
  • The rendered output matches the design intent (not just defect-free but correct)
  • No text is clipped, overlapping, or unreadable
  • All arrows/lines connect to the right elements and route cleanly
  • Spacing is consistent and composition is balanced
  • You'd show it to someone without caveats

Don't stop after one clean pass just because there are no critical bugs — if the composition could be better, improve it. Conversely, don't loop forever on cosmetics — 5 iterations max before escalating to the user.

Invocation

Skill(skill="ralph-loop:ralph-loop", args="Visual Task N: [TASK NAME] --max-iterations 5 --completion-promise VTASKN_9_5")

Score Tracking

Initialize SCORES.md before the first iteration:

markdown
# Visual Verify Scores

| Iteration | Score | Threshold | BLOCKING | COSMETIC | Delta |
|-----------|-------|-----------|----------|----------|-------|

Each vision call must score the output 0-10:

  • 10.0 = all checklist items pass, zero issues
  • 9.5 = 95% pass, 1-2 cosmetic issues remain (default threshold)
  • < 9.0 = BLOCKING issues present

The score reflects the fraction of checklist items that pass. Gemini counts BLOCKING and COSMETIC issues against the domain-specific checklist, and the score = (items passing / total items) * 10.

Vision Calls

Primary: Gemini via look-at

Agentic (dense diagrams, fine-grained details, comparisons):

bash
python3 "${CLAUDE_SKILL_DIR}/../../skills/look-at/scripts/look_at.py" \
    --file "/tmp/visual-verify.png" \
    --goal "[CONTEXT-ENRICHED GOAL]" \
    --agentic

Gemini will autonomously crop, zoom, and annotate regions it wants to inspect more closely. This is most valuable for:

  • Verifying arrow connections in dense node diagrams
  • Checking whether small labels are fully visible (not clipped)
  • Comparing spatial consistency across sub-diagrams

Vision-only (simple layouts, large elements):

bash
python3 "${CLAUDE_SKILL_DIR}/../../skills/look-at/scripts/look_at.py" \
    --file "/tmp/visual-verify.png" \
    --goal "[CONTEXT-ENRICHED GOAL]"

If look-at fails because the Google API key is unavailable, missing, or returns an API error, DO NOT fall back to pdftotext. pdftotext cannot detect overlap, misalignment, or spatial defects — it only extracts text. Using it as a visual check produces false passes.

Instead, spawn a subagent that uses the Read tool directly on the rendered PNG:

Agent(
    prompt="Read the image at /tmp/visual-verify.png and score it. [CONTEXT-ENRICHED GOAL]. Score 0-10 against the checklist. Report BLOCKING and COSMETIC issues separately.",
    description="Vision fallback: score rendered PNG",
    subagent_type="general-purpose"
)

Why this is safe: The Iron Law against Read on images exists to protect the main conversation's context window (images cost 1000-5000 tokens). Subagent context is throwaway — the tokens are discarded when the agent returns its short summary. The cost rationale does not apply.

Why pdftotext is NOT safe: In the lecture 22 incident (2026-03-22), Gemini was unavailable. The agent fell back to pdftotext -layout, checked only VIS-17 (a table), declared "only one diagram in 22.typ," and missed VIS-18 entirely — a CeTZ timeline with anchor: "south" causing severe text overlap. pdftotext showed all text present (no clipping), so the agent declared it clean. The overlap was invisible to text extraction. </EXTREMELY-IMPORTANT>

Goal Assembly

NEVER call look-at with a generic goal. Goals must reference the spec, checklist items, and prior feedback.

Context Piece Source
spec_text SPEC.md, PLAN.md task, or user request
checklist_items Domain + task specific
previous_feedback Gemini's output from prior iteration

For diagrams (fletcher, CeTZ): use describe-first strategy, NOT judgment prompts.

  • Step 1: Ask Gemini to transcribe all text elements with [FULL | PARTIAL] annotations
  • Step 2: Claude diffs transcription against expected elements from source code
  • Step 3 (optional): Run spatial/overlap check for non-clipping issues
  • Why: Judgment prompts ("is X visible?") invite yes-man answers. Transcription forces the model to report what it actually sees — clipping becomes objectively detectable via text mismatches.

See references/goal-templates.md for full copy-paste templates per domain.

Translating Non-Python Feedback

Before editing Typst source to apply fixes, load the tinymist skill: Skill(skill="tinymist:typst") It has Fletcher/CeTZ reference docs with correct parameter names (label-side, label-pos, spacing, inset). Without it, you will guess at parameter names and waste iterations.

Gemini returns structured pixel measurements. Claude translates to source code:

Gemini says Claude translates to (Typst example)
"Move label 15px left" Adjust label-pos or node coordinates by ~0.5em
"Text clipped at right edge" Increase inset or reduce scale() percentage
"Node/diagram cut off at edge" Reduce canvas length, node inset, or spacing; or shift coordinates toward center
"Elements overlap vertically" Increase spacing parameter
"Font too small to read" Increase #set text(Npt) value

Complex Diagrams (3+ Failed Iterations)

If the same spatial issue persists after 3 iterations, escalate to the reference sketch approach: have Gemini draw an ideal layout in matplotlib and translate coordinates.

See references/complex-diagram-strategy.md for the full approach.

Quick Render Reference

Domain Command
Typst tinymist compile input.typ /tmp/visual-verify.png --pages N --ppi 144 (use /find-slide-page to get N)
Python python3 script.py (script saves to known path)
Screenshot screencapture -x /tmp/visual-verify.png

See references/render-commands.md for the full reference.

Red Flags — STOP If You Catch Yourself:

Action Why Wrong Do Instead
Falling back to pdftotext when Gemini is down pdftotext extracts text only — it cannot detect overlap, misalignment, or spatial defects. It produces false passes. Spawn a subagent with Read on the rendered PNG
Declaring "only N diagrams" without grepping for all VIS-* items You will miss diagrams and skip their visual check rg 'VIS-' file.typ to enumerate ALL diagrams before checking any
Skipping visual check because "the code looks correct" The anchor: "south" bug was syntactically valid Typst that produced severe overlap Render and score EVERY diagram, no exceptions

When NOT to Use

  • One-off visual checks: Use look-at directly, not the full loop
  • Text-only verification: Use standard dev-verify
  • Compilation checks only: Just run the compile command

Reference Files

  • references/goal-templates.md -- copy-paste goal templates per domain
  • references/render-commands.md -- render commands for all supported domains
  • references/rationalization-prevention.md -- excuses, red flags, honesty framing, drive-aligned consequences
  • references/complex-diagram-strategy.md -- reference sketch approach for persistent layout failures
  • references/examples.md -- worked examples (Typst slide, matplotlib chart, diagram escalation)

Expand your agent's capabilities with these related and highly-rated skills.

edwinhu/workflows

audit-fix-loop

This skill should be used when the user asks to 'iteratively improve', 'audit and fix', 'hill-climb quality', 'grade and improve', 'score and fix', 'audit loop', 'quality loop', or needs structured iterative improvement of an artifact using scored independent audits. Also use when the user invokes a ralph loop for quality improvement rather than task completion.

6 1
Explore
edwinhu/workflows

ds-spec-reviewer

Internal skill used by ds-brainstorm at Phase 1 exit gate. Dispatches a reviewer subagent to verify SPEC.md completeness before planning. NOT user-facing.

6 1
Explore
edwinhu/workflows

pptx-render

Use when the user asks to "render pptx", "show pptx slide", "compare with pptx", "pptx to image", "export pptx slide", "original slide", "show me the original", "what does the pptx look like", or needs to extract a specific PPTX slide's content for visual comparison.

6 1
Explore
edwinhu/workflows

obsidian-organize

Organize Obsidian notes according to clawd's preferences. Use when user asks to "organize notes", "move notes to right folder", "clean up vault", "tidy vault", "file this note", or when creating new notes in the Obsidian vault. Also use when moving, renaming, or categorizing notes, or when the vault root has stray files.

6 1
Explore
edwinhu/workflows

dev-verify

This skill should be used when the user asks to 'verify completion', 'check that tests pass', 'confirm feature works', or REQUIRED Phase 7 of /dev workflow (final). Enforces fresh runtime evidence before claiming completion.

6 1
Explore
edwinhu/workflows

dev

This skill should be used when the user asks to 'start a feature', 'build a feature', 'implement a feature', 'develop', 'new feature', or needs the full 7-phase development workflow with TDD enforcement.

6 1
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results