Agent skill

eval-workflow

Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance

View SKILL.md on GitHub Repository

Stars 107

Forks 15

Install this agent skill to your Project

npx add-skill https://github.com/jmagly/aiwg/tree/main/agentic/code/addons/aiwg-evals/skills/eval-workflow

SKILL.md

Workflow Evaluation

Run automated evaluation tests against a multi-agent workflow.

Research Foundation

REF-001: BP-9 - Continuous evaluation of agent performance
REF-002: KAMI benchmark methodology for real agentic task evaluation

Usage

bash

/eval-workflow flow-security-review-cycle
/eval-workflow flow-inception-to-elaboration --scenario distractor-test
/eval-workflow flow-deploy-to-production --verbose --strict

Arguments

Argument	Required	Description
workflow-name	Yes	Workflow (flow command) to evaluate

Options

Option	Default	Description
--scenario	all	Specific scenario to run
--verbose	false	Show detailed test output
--output	stdout	Output file for results
--strict	false	Fail on any test failure
--timeout	300	Maximum seconds per scenario

What Gets Evaluated

Orchestration Quality

Agent coordination: Parallel agents launched correctly in single message
Handoff fidelity: Artifacts pass correctly between phases
Gate enforcement: Phase gates checked before transition

Archetype Resistance

grounding-test — Archetype 1: Premature action without reading state
distractor-test — Archetype 3: Context pollution from irrelevant artifacts
recovery-test — Archetype 4: Fragile execution when subagent fails

Output Validation

Required artifacts created in correct .aiwg/ paths
Document structure matches templates
Traceability links intact

Process

Load Workflow: Read flow command definition
Select Scenarios: Based on --scenario flag or all applicable
Setup Workspace: Create isolated .aiwg/working/ test space
Execute Flow: Run workflow against each scenario
Validate Outputs: Check artifact presence, structure, and content
Generate Report: Output results with pass/fail per assertion
Cleanup: Remove test workspace

Output Format

json

{
  "workflow": "flow-security-review-cycle",
  "timestamp": "2026-04-01T10:30:00Z",
  "scenarios": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "assertions": [
        {"name": "threat-model-created", "passed": true},
        {"name": "security-gate-run", "passed": true}
      ],
      "duration_ms": 45000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.7,
      "assertions": [
        {"name": "correct-assets-only", "passed": false, "evidence": "Distractor file referenced in output"}
      ],
      "duration_ms": 38000
    }
  },
  "summary": {
    "passed": 4,
    "failed": 1,
    "total": 5,
    "score": 0.80
  }
}

Examples

bash

# Full evaluation of a workflow
/eval-workflow flow-security-review-cycle

# Single scenario with verbose output
/eval-workflow flow-inception-to-elaboration --scenario grounding-test --verbose

# Strict mode with output saved
/eval-workflow flow-deploy-to-production --strict --output .aiwg/reports/deploy-eval.json

Success Criteria

Metric	Target
Artifact creation	100%
Grounding compliance	>90%
Distractor resistance	>80%
Recovery success	≥80%
Overall	≥85%

Related Commands

/eval-agent - Test individual agents
/eval-report - Generate aggregate quality report
aiwg lint agents - Static validation

Evaluate workflow: $ARGUMENTS

References

@$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/subagent-scoping.md — Parallel agent coordination rules evaluated in workflows
@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success thresholds and test criteria
@$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC flow commands available for workflow evaluation
@$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint and eval commands

Maintainer

jmagly Core maintainer

Source details

Full Name: jmagly/aiwg
Branch: main
Path in repo: agentic/code/addons/aiwg-evals/skills/eval-workflow
License: MIT License
Topics: claude-code anthropic workflow-automation developer-tools agentic-coding prompt-engineering autonomous-agents multi-agent orchestration sdlc

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

jmagly/aiwg

research-document

Generate summaries and literature notes from research papers

107 15

Explore

jmagly/aiwg

research-archive

Package research artifacts for long-term archival

107 15

Explore

jmagly/aiwg

research-cite

Format citations and generate bibliographies

107 15

Explore

jmagly/aiwg

induct-research

Induct research sources into a research repository. Point at an issue, a single file, a directory of papers, or a URI and the skill reads, annotates, and files structured induction tasks — one per source. Similar to address-issues but for research corpora instead of code backlogs.

107 15

Explore

jmagly/aiwg