Agent skill

evaluate

Evaluate the quality of any AI-generated artifact — visualizations, code, documents, conversations, or any skill output. Works in 3 phases: (1) Generate evaluation specs tailored to the artifact type, (2) Run comprehensive evaluation against those specs, (3) Produce a beautiful visual report using the /visualize skill. Use after any skill produces output, or invoke directly with /evaluate <file-or-context>. Supports evaluating: HTML visualizations, code projects, documents, agent conversations, slide decks, dashboards, or any artifact with quality dimensions.

Install this agent skill in your project

npx add-skill https://github.com/careerhackeralex/visualize/tree/main/eval

SKILL.md

Evaluate

Comprehensive quality evaluation for any AI-generated artifact. Produces its report as a visualization.

How It Works

┌──────────────────────────────────────────────┐
│                                              │
│  Phase 1: SPEC GENERATION                    │
│  Analyze the artifact type                   │
│  Generate tailored evaluation criteria       │
│  Define scoring dimensions + weights         │
│  Set quality gates                           │
│           │                                  │
│           ▼                                  │
│  Phase 2: EVALUATION                         │
│  Run automated checks (when possible)        │
│  Visual/manual inspection                    │
│  Score each dimension with evidence          │
│  Identify systemic vs local issues           │
│           │                                  │
│           ▼                                  │
│  Phase 3: REPORT (via /visualize)            │
│  Generate a beautiful HTML eval report       │
│  Scores, charts, screenshots, fix list       │
│  Radar chart of dimensions                   │
│  Before/after tracking                       │
│                                              │
└──────────────────────────────────────────────┘

Phase 1: Spec Generation

For any artifact, generate evaluation specs in four steps:

1. Identify Artifact Type

HTML Visualization → visual design, interactivity, technical, content, shareability
Code/Project → correctness, readability, architecture, test coverage, performance
Document/Report → clarity, structure, accuracy, completeness, tone
Conversation/Agent → helpfulness, accuracy, tone, efficiency, safety
Slide Deck → all visualization dims + narrative flow, persuasion, pacing
Dashboard → data accuracy, information density, scannability, actionability
Custom → derive dimensions from the skill's SKILL.md and stated goals

2. Generate Dimensions

For each artifact type, produce 6-10 evaluation dimensions. Each dimension needs:

Name — short, clear label
Description — what this dimension measures
Weight — percentage (all weights sum to 100%)
Scoring anchors — what does a 10, 8, 6, 4 look like?
Automated checks — any programmatic tests (if applicable)
Deductions — specific issues and their point costs
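The requirements above can be checked mechanically before a spec is used. The following is a minimal sketch, not part of the skill itself: the field names (`name`, `description`, `weight`, `anchors`, `deductions`) and the example weights are assumptions chosen to mirror the list above.

```javascript
// Sketch of a spec validator. The object shape is hypothetical; the skill
// only states what each dimension needs, not how it is stored.
function validateSpec(dimensions) {
  const problems = [];

  // The skill asks for 6-10 dimensions per artifact type.
  if (dimensions.length < 6 || dimensions.length > 10) {
    problems.push(`expected 6-10 dimensions, got ${dimensions.length}`);
  }

  // Weights are percentages and must sum to 100.
  const totalWeight = dimensions.reduce((sum, d) => sum + d.weight, 0);
  if (Math.abs(totalWeight - 100) > 1e-9) {
    problems.push(`weights sum to ${totalWeight}, expected 100`);
  }

  // Every dimension needs a name, a description, and anchors for 10, 8, 6, 4.
  for (const d of dimensions) {
    const label = d.name || '?';
    if (!d.name || !d.description) {
      problems.push(`dimension "${label}" is missing a name or description`);
    }
    for (const level of [10, 8, 6, 4]) {
      if (!d.anchors || !d.anchors[level]) {
        problems.push(`dimension "${label}" has no anchor for score ${level}`);
      }
    }
  }
  return problems;
}

// Illustrative document spec; the weights and anchor texts are made up.
const names = ['Clarity', 'Structure', 'Accuracy', 'Completeness',
               'Tone', 'Formatting', 'Actionability', 'Brevity'];
const weights = [20, 15, 15, 15, 10, 10, 10, 5]; // sums to 100
const documentSpec = names.map((name, i) => ({
  name,
  description: `How well the document performs on ${name.toLowerCase()}`,
  weight: weights[i],
  anchors: { 10: 'exemplary', 8: 'solid', 6: 'adequate', 4: 'weak' },
  deductions: [],
}));

console.log(validateSpec(documentSpec)); // []
```

An empty array means the spec satisfies the structural rules; it says nothing about whether the dimensions are well chosen.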

3. Set Quality Gates

Define gates based on the artifact's purpose:

Gate	Criteria	Meaning
🚀 EXCEPTIONAL	Overall ≥ 9.5, all ≥ 9	Best-in-class. Share everywhere.
✅ SHIP	Overall ≥ 9.0, all ≥ 8	Production-ready.
⚠️ ACCEPTABLE	Overall ≥ 8.0, all ≥ 7	Usable but not impressive.
🔧 NEEDS WORK	Overall ≥ 7.0, but overall < 8.0 or any < 7	Fix before releasing.
❌ FAIL	Overall < 7.0 or any < 5	Major rework.
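The gate table can be read as a small decision function. The sketch below is one possible reading, not the skill's own code. It assumes FAIL is checked first, the remaining gates are tried from strictest to loosest, and anything left over is NEEDS WORK. The function name `classifyGate` is hypothetical.

```javascript
// Sketch: map an overall score plus per-dimension scores to a quality gate.
// Precedence (FAIL first, then top-down) is an assumption; the table itself
// does not state an evaluation order.
function classifyGate(overall, dimensionScores) {
  const lowest = Math.min(...dimensionScores);

  if (overall < 7.0 || lowest < 5) return 'FAIL';
  if (overall >= 9.5 && lowest >= 9) return 'EXCEPTIONAL';
  if (overall >= 9.0 && lowest >= 8) return 'SHIP';
  if (overall >= 8.0 && lowest >= 7) return 'ACCEPTABLE';
  return 'NEEDS WORK';
}

console.log(classifyGate(9.2, [8, 9, 10]));   // SHIP
console.log(classifyGate(9.2, [6, 10, 10]));  // NEEDS WORK (one dimension below 7)
console.log(classifyGate(9.2, [4, 10, 10]));  // FAIL (one dimension below 5)
```

Under this reading a single weak dimension caps the gate regardless of the overall score, which matches the "all ≥" conditions in the table.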

4. Output Spec Document

Write the spec to eval-spec-[artifact-name].md for reference and reuse.

Phase 2: Evaluation

For HTML Visualizations

Open in browser at 3 viewports (1280×720, 768×1024, 375×667).

Automated audit (run in browser console):

javascript

(function() {
  const audit = {};
  const style = [...document.querySelectorAll('style')].map(s => s.textContent).join(' ');
  const html = document.documentElement.outerHTML;

  // Structure
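  // outerHTML never includes the doctype, so check the DOM node instead
  audit.hasDoctype = !!document.doctype && document.doctype.name.toLowerCase() === 'html';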
  audit.hasLangAttr = !!document.documentElement.lang;
  audit.hasCharset = !!document.querySelector('meta[charset]');
  audit.hasViewport = !!document.querySelector('meta[name="viewport"]');
  audit.hasTitle = document.title.length > 0;

  // Menu system
  audit.menuExists = !!document.querySelector('.viz-menu');
  audit.menuHasTheme = !!html.match(/cycleTheme|themeLabel/i);
  audit.menuHasDownload = !!html.match(/htmlToImage|html-to-image/i);
  audit.menuHasPrint = !!html.match(/window\.print/i);

  // Theme system
  audit.hasCSSVars = !!style.match(/--bg\s*:/);
  audit.hasDarkTheme = !!style.match(/(\.theme-dark|:root)[\s\S]*?--bg/);
  audit.hasLightTheme = !!style.match(/\.theme-light/);
  audit.themePersistedToStorage = !!html.match(/localStorage.*theme/i);

  // Typography
  audit.hasInterFont = !!html.match(/fonts\.googleapis.*Inter|font-family.*Inter/i);
  audit.hasFontFallback = !!style.match(/-apple-system|system-ui/);
  audit.bodyFontSize = parseFloat(getComputedStyle(document.body).fontSize);
  audit.bodyFontOK = audit.bodyFontSize >= 14;

  // Layout
  audit.usesFlexOrGrid = !!(style.match(/display\s*:\s*(flex|grid)/));
  audit.hasMaxWidth = !!style.match(/max-width/);
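  // Media queries live in <style>; Tailwind-style prefixes (sm:, md:, lg:) live in class attributes
  audit.hasResponsiveBreakpoints = !!style.match(/@media.*max-width|@media.*min-width/) || /\b(sm|md|lg):[\w-]/.test(html);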

  // Print & Accessibility
  audit.hasPrintStyles = !!style.match(/@media\s*print/);
  audit.hasPrintColorAdjust = !!style.match(/print-color-adjust/);
  audit.hasReducedMotion = !!style.match(/prefers-reduced-motion/);
  audit.hasAriaLabels = !!html.match(/aria-label/);
  audit.hasSemanticHTML = !!html.match(/<(header|main|nav|section|article|footer)/);

  // Animations
  audit.hasKeyframes = !!style.match(/@keyframes/);
  audit.hasTransitions = !!style.match(/transition\s*:/);

  // Performance
  audit.fileSizeKB = Math.round(new Blob([html]).size / 1024);
  audit.fileSizeOK = audit.fileSizeKB < 200;
  audit.noExternalImages = document.querySelectorAll('img[src^="http"]').length === 0;
  audit.htmlToImageLoaded = typeof htmlToImage !== 'undefined';

  // Summary
  const bools = Object.entries(audit).filter(([k,v]) => typeof v === 'boolean');
  const passed = bools.filter(([k,v]) => v).length;
  audit._passed = passed;
  audit._total = bools.length;
  audit._percent = Math.round(passed / bools.length * 100);
  audit._failures = bools.filter(([k,v]) => !v).map(([k]) => k);

  console.table(audit);
  return audit;
})();

Visual scoring — 8 dimensions for visualizations:

#	Dimension	Weight	10 =	6 =
D1	First Impression	15%	Apple keynote quality	Generic template feel
D2	Typography	15%	Perfect hierarchy, Inter font, fluid sizing	All same size, no hierarchy
D3	Color & Contrast	10%	Harmonious, WCAG AA, both themes beautiful	Clashing, low contrast
D4	Layout & Spacing	15%	Consistent rhythm, responsive, generous space	Cramped, broken at mobile
D5	Content Quality	15%	Clear message in 5 seconds, zero filler	Confusing, placeholder text
D6	Interactivity	10%	Menu + theme + download + print all flawless	Missing features, broken
D7	Technical	10%	Zero errors, semantic, accessible, print-ready	Console errors, broken layout
D8	Shareability	10%	Would tweet this unprompted	Worse than Canva
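The weights in the table suggest that the overall score is a weighted average of the eight dimension scores. The skill does not spell out the formula, so the sketch below is an assumption. The `overallScore` name and the example scores are illustrative.

```javascript
// Weights copied from the table above (percentages, summing to 100).
const VIZ_WEIGHTS = { D1: 15, D2: 15, D3: 10, D4: 15, D5: 15, D6: 10, D7: 10, D8: 10 };

// Sketch: weighted average of dimension scores, rounded to one decimal.
function overallScore(scores, weights = VIZ_WEIGHTS) {
  let weighted = 0;
  let weightSum = 0;
  for (const [id, weight] of Object.entries(weights)) {
    if (typeof scores[id] !== 'number') {
      throw new Error(`missing score for ${id}`);
    }
    weighted += scores[id] * weight;
    weightSum += weight;
  }
  return Math.round((weighted * 10) / weightSum) / 10;
}

// Made-up scores for one visualization.
const exampleScores = { D1: 10, D2: 9, D3: 9, D4: 9, D5: 10, D6: 9, D7: 8, D8: 9 };
console.log(overallScore(exampleScores)); // 9.2
```

With these example scores the lowest dimension is 8, so the artifact would meet the SHIP criteria (overall ≥ 9.0, all ≥ 8).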

For Code/Projects

Dimensions: Correctness, Readability, Architecture, Error Handling, Performance, Testing, Documentation, Security

For Documents

Dimensions: Clarity, Structure, Accuracy, Completeness, Tone, Formatting, Actionability, Brevity

For Agent Conversations

Dimensions: Helpfulness, Accuracy, Tone, Efficiency, Safety, Context Awareness, Tool Usage, Follow-through

Phase 3: Visual Report (via /visualize)

After scoring, generate the eval report as a beautiful HTML dashboard using the visualize skill:

Report Structure

Hero — artifact name, overall score (big number), quality gate badge
Radar Chart — all dimensions plotted on a radar/spider chart (Chart.js)
Dimension Cards — each dimension as a card with score, bar, key notes
Automated Audit — pass/fail checklist with percentages
Screenshots — key views embedded (if HTML artifact)
Fix List — prioritized fixes as a kanban-style layout (critical / high / medium / low)
Systemic Issues — patterns that affect all outputs (flagged for SKILL.md fixes)
History — if re-evaluating, show before/after score comparison chart
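For the radar chart, a Chart.js configuration can be built as plain data and handed to the chart constructor in the report page. The sketch below assumes Chart.js 3 or later, where the radial axis is named `r`. The `radarConfig` helper and the `{ name, score }` input shape are hypothetical.

```javascript
// Sketch: build a Chart.js radar config from scored dimensions.
// No DOM access here, so the function can be tested outside the browser.
function radarConfig(dimensions, label = 'Current run') {
  return {
    type: 'radar',
    data: {
      labels: dimensions.map(d => d.name),
      datasets: [{
        label,
        data: dimensions.map(d => d.score),
        fill: true,
      }],
    },
    options: {
      scales: {
        // Pin the radial axis to the 0-10 scoring range so runs are comparable.
        r: { min: 0, max: 10, ticks: { stepSize: 2 } },
      },
    },
  };
}

const config = radarConfig([
  { name: 'First Impression', score: 9 },
  { name: 'Typography', score: 8 },
  { name: 'Color & Contrast', score: 9 },
]);
console.log(config.data.labels.length); // 3

// In the report page (browser only):
//   new Chart(document.querySelector('canvas'), config);
```

Fixing the axis to 0-10 keeps before/after charts on the same scale; without it Chart.js would fit the axis to the data.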

Report Filename

eval-report-[artifact-name]-[date].html
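Both filename patterns (`eval-spec-[artifact-name].md` and `eval-report-[artifact-name]-[date].html`) can be generated by one small helper. The skill does not define how the artifact name is normalized or how the date is formatted, so the lowercase slug and the `YYYY-MM-DD` UTC date below are assumptions, as are the function names.

```javascript
// Sketch: turn an artifact name into a filename-safe slug (assumed convention).
function slugify(name) {
  return String(name)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything non-alphanumeric into "-"
    .replace(/^-+|-+$/g, '');    // trim leading/trailing dashes
}

function specFilename(artifactName) {
  return `eval-spec-${slugify(artifactName)}.md`;
}

function reportFilename(artifactName, date = new Date()) {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD in UTC (assumed format)
  return `eval-report-${slugify(artifactName)}-${day}.html`;
}

const when = new Date('2025-01-15T12:00:00Z');
console.log(specFilename('Sales Dashboard'));         // eval-spec-sales-dashboard.md
console.log(reportFilename('Sales Dashboard', when)); // eval-report-sales-dashboard-2025-01-15.html
```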

The report itself must score ≥ 9.0 on the visualize eval criteria.

This is the ultimate dogfood test — our evaluation tool produces evaluations using our visualization tool.

The Improvement Loop

Generate artifact (any skill)
       ↓
/evaluate → Spec + Score + Visual Report
       ↓
Review report → identify fixes
       ↓
Fix (systemic → SKILL.md, local → artifact)
       ↓
/evaluate again → compare scores
       ↓
Ship when gate = SHIP or EXCEPTIONAL

Max 3 loops per artifact. If it can't reach SHIP in 3 loops, the problem is in the skill — update the skill's instructions, not the artifact.
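The loop rule can be written as a small decision helper. This is a sketch under two assumptions: `loop` is the 1-based number of the evaluation that just finished, and the returned action labels are hypothetical names, not output of the skill.

```javascript
const MAX_LOOPS = 3; // from the rule above

// Sketch: decide what to do after an evaluation run.
function nextStep(loop, gate) {
  if (gate === 'SHIP' || gate === 'EXCEPTIONAL') return 'ship';
  // Out of loops without reaching SHIP: the skill's instructions need fixing.
  if (loop >= MAX_LOOPS) return 'update-skill';
  return 'fix-and-re-evaluate';
}

// Sketch: per-dimension change between two runs, for the before/after comparison.
function scoreDelta(before, after) {
  const delta = {};
  for (const id of Object.keys(after)) {
    if (typeof before[id] === 'number') {
      delta[id] = Math.round((after[id] - before[id]) * 10) / 10;
    }
  }
  return delta;
}

console.log(nextStep(1, 'NEEDS WORK')); // fix-and-re-evaluate
console.log(nextStep(3, 'ACCEPTABLE')); // update-skill
console.log(scoreDelta({ D1: 7, D2: 8 }, { D1: 9, D2: 8 })); // { D1: 2, D2: 0 }
```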

Quick Start

# Evaluate a visualization
/evaluate path/to/visualization.html

# Evaluate with an explicit artifact type
/evaluate path/to/code-project --type code

# Re-evaluate after fixes (tracks improvement)
/evaluate path/to/visualization.html --loop 2

# Generate specs only (no scoring)
/evaluate --specs-only --type dashboard

Maintainer

careerhackeralex Core maintainer

Source details

Full Name: careerhackeralex/visualize
Branch: main
Path in repo: eval
License: MIT License
Topics: ai claude-code skill visualization open-source presentations html dashboards infographics
