Agent skill

skillgrade-setup

Sets up and runs skillgrade evaluation pipelines for Agent Skills. Use when initializing eval configurations, running trials, reviewing results, or integrating with CI. Don't use for writing grader scripts, general test authoring, or non-agentic documentation.

Install this agent skill in your project

npx add-skill https://github.com/mgechev/skillgrade/tree/main/skills/skillgrade-setup

SKILL.md

Skillgrade Evaluation Setup

Procedures

Step 1: Install Skillgrade

Verify Node.js 20+ and Docker are available.
Run npm i -g skillgrade to install the CLI globally.
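
The prerequisite check in this step can be scripted. The following is a minimal POSIX shell sketch, not part of skillgrade itself; the function names node_major, node_version_ok, and check_prereqs are illustrative.

```shell
# Extract the major version from a string such as "v20.11.1".
node_major() {
  version="${1#v}"
  echo "${version%%.*}"
}

# Succeed only if the given Node.js version string is 20 or newer.
node_version_ok() {
  [ "$(node_major "$1")" -ge 20 ]
}

# Print one status line per prerequisite; never aborts the shell.
check_prereqs() {
  if ! command -v node >/dev/null 2>&1; then
    echo "node: not found"
  elif node_version_ok "$(node --version)"; then
    echo "node: ok"
  else
    echo "node: older than 20"
  fi

  if command -v docker >/dev/null 2>&1; then
    echo "docker: ok"
  else
    echo "docker: not found"
  fi
}

check_prereqs
```

Finding the docker binary does not prove the daemon is running; docker info is one way to confirm that separately.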

Step 2: Initialize an Eval Configuration

Navigate to the skill directory (must contain a SKILL.md).
Set the appropriate API key environment variable (GEMINI_API_KEY, ANTHROPIC_API_KEY, or OPENAI_API_KEY).
Run skillgrade init to generate an eval.yaml with AI-generated tasks and graders.
If an eval.yaml already exists, pass --force to overwrite: skillgrade init --force.
Without an API key, a well-commented template is generated instead.
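
As a rough sketch of the flow above, the script below reports which outcome to expect from skillgrade init before calling it. The helper first_api_key_var is hypothetical, and the order in which it checks the three variables is its own choice, not necessarily the order skillgrade uses.

```shell
# Print the name of the first API key variable that is exported and
# non-empty. printenv only sees exported variables, which matches what
# a child process such as skillgrade would see.
first_api_key_var() {
  for name in GEMINI_API_KEY ANTHROPIC_API_KEY OPENAI_API_KEY; do
    if [ -n "$(printenv "$name")" ]; then
      echo "$name"
      return 0
    fi
  done
  return 1
}

if key_var=$(first_api_key_var); then
  echo "$key_var is exported: init should generate tasks and graders"
else
  echo "no API key exported: init should write the commented template"
fi

# Only call the CLI from a skill directory, and never overwrite silently.
if [ ! -f SKILL.md ]; then
  echo "no SKILL.md here: change to the skill directory first"
elif [ -f eval.yaml ]; then
  echo "eval.yaml exists: rerun as 'skillgrade init --force' to overwrite"
elif command -v skillgrade >/dev/null 2>&1; then
  skillgrade init
fi
```

The script deliberately refuses to pass --force on its own, so an existing eval.yaml is only overwritten by an explicit command.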

Step 3: Configure eval.yaml

Read references/eval-yaml-spec.md for the full configuration schema.
Define one or more tasks under the tasks: key. Each task requires:
- name: unique task identifier
- instruction: what the agent should accomplish
- workspace: files to copy into the evaluation container
- graders: one or more scoring mechanisms (see the skillgrade-graders skill)
Optionally configure defaults: for agent, provider, trials, timeout, and threshold.
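
To make the shape of such a file concrete, the sketch below writes an illustrative config to a scratch file named eval.example.yaml. The top-level and per-task keys are the ones listed above. The nested keys (src, dest, type, run, rubric, weight), the fixture paths, and the timeout unit are assumptions and should be checked against references/eval-yaml-spec.md.

```shell
# Write to a scratch file so an existing eval.yaml is left untouched.
cat > eval.example.yaml <<'EOF'
# Nested keys below (src, dest, type, run, rubric, weight) are assumptions;
# verify them against references/eval-yaml-spec.md.
defaults:
  agent: gemini
  provider: docker
  trials: 5
  timeout: 300      # assumed to be seconds
  threshold: 0.8

tasks:
  - name: fix-linting
    instruction: |
      Fix the lint errors in app.js without changing its behavior.
    workspace:
      - src: fixtures/app.js         # hypothetical fixture path
        dest: app.js
    graders:
      - type: deterministic
        run: bash graders/check.sh   # hypothetical grader script
        weight: 0.7
      - type: llm_rubric
        rubric: |
          The agent fixed every lint error and left unrelated code alone.
        weight: 0.3
EOF

echo "wrote eval.example.yaml"
```

The task name fix-linting matches the name used in the run examples in Step 4, so the same file can be used to try the --eval flag.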

Step 4: Run Evaluations

Select an appropriate preset based on the evaluation goal:
- --smoke (5 trials): Quick capability check.
- --reliable (15 trials): Reliable pass rate estimate.
- --regression (30 trials): High-confidence regression detection.
Run the evaluation: skillgrade --smoke.
Run a specific eval by name: skillgrade --eval=fix-linting.
Run multiple evals: skillgrade --eval=fix-linting,write-tests.
Run only deterministic graders (skip LLM calls): skillgrade --grader=deterministic.
Run only LLM rubric graders: skillgrade --grader=llm_rubric.
The agent is auto-detected from the API key. Override with --agent=gemini|claude|codex.
Override the provider with --provider=docker|local.
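
The flags above can presumably be combined in a single invocation, as the CI step later does with --regression --ci --provider=local. The hypothetical helper below only prints the command it would run, so the combination can be reviewed first. Combining a preset with --eval and --grader in this way is an assumption.

```shell
# Build a skillgrade command line from a preset, an optional
# comma-separated list of eval names, and an optional grader type.
# Prints the command instead of running it.
build_run_cmd() {
  preset="$1"
  evals="${2:-}"
  grader="${3:-}"

  case "$preset" in
    smoke|reliable|regression) ;;
    *) echo "unknown preset: $preset" >&2; return 1 ;;
  esac

  cmd="skillgrade --$preset"
  if [ -n "$evals" ]; then
    cmd="$cmd --eval=$evals"
  fi
  if [ -n "$grader" ]; then
    cmd="$cmd --grader=$grader"
  fi
  echo "$cmd"
}

build_run_cmd smoke fix-linting,write-tests deterministic
# prints: skillgrade --smoke --eval=fix-linting,write-tests --grader=deterministic
```

Restricting a smoke run to deterministic graders avoids LLM grading calls while iterating on a task.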

Step 5: Review Results

Run skillgrade preview for a CLI report.
Run skillgrade preview browser to open the web UI at http://localhost:3847.
Reports are saved to $TMPDIR/skillgrade/<skill-name>/results/. Override with --output=DIR.
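
A small sketch for locating the default report directory follows. The helper results_dir is hypothetical. Two details are assumptions of this sketch: falling back to /tmp when TMPDIR is unset, and using the skill's directory name as <skill-name>.

```shell
# Build the default report directory for a skill, following the
# $TMPDIR/skillgrade/<skill-name>/results/ layout described above.
results_dir() {
  base="${TMPDIR:-/tmp}"   # /tmp fallback is an assumption
  base="${base%/}"         # drop a trailing slash if present
  echo "$base/skillgrade/$1/results"
}

dir=$(results_dir skillgrade-setup)
if [ -d "$dir" ]; then
  ls -lt "$dir"            # newest reports first
else
  echo "no reports yet in $dir"
fi
```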

Step 6: Integrate with CI

Add a GitHub Actions step that installs skillgrade, navigates to the skill directory, and runs with --regression --ci --provider=local.
Use --provider=local in CI. The runner is already an ephemeral sandbox, so Docker adds overhead without benefit.
The --ci flag causes a non-zero exit code if the pass rate falls below --threshold (default: 0.8).
Read references/ci-example.md for a complete workflow template.
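
The sketch below shows what the CI step body might look like as a shell function, along with the arithmetic behind the threshold. It assumes the pass rate is the number of passing trials divided by the total number of trials. The names min_passes and run_ci_eval are illustrative, and references/ci-example.md remains the authoritative workflow template.

```shell
# Smallest number of passing trials that meets a threshold given in
# percent, assuming pass rate = passing trials / total trials.
min_passes() {
  trials="$1"
  threshold_pct="$2"
  echo $(( (trials * threshold_pct + 99) / 100 ))
}

min_passes 30 80    # prints 24: with --regression, 24 of 30 trials must pass

# Body of a CI step. skill_dir is the directory that contains SKILL.md.
# Defined here but not called, since it installs packages and runs trials.
run_ci_eval() {
  skill_dir="$1"
  npm i -g skillgrade || return 1
  cd "$skill_dir" || return 1
  skillgrade --regression --ci --provider=local
}
```

The API key from Step 2 still has to be present in the runner's environment, for example through the CI system's secret store.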

Error Handling

If skillgrade init fails with "No SKILL.md found," verify the current directory contains a valid SKILL.md file.
If an evaluation hangs, check that Docker is running and that the container has network access for API calls.
If all trials fail with "No API key," ensure the environment variable is exported, not just set inline for a different command.
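
The difference between setting and exporting a variable can be seen with a short shell experiment. DEMO_API_KEY is a placeholder name used only for this illustration, and child_sees is a hypothetical helper.

```shell
# Print what a child process sees for the variable named in $1.
child_sees() {
  sh -c "printenv $1 || echo '<unset>'"
}

unset DEMO_API_KEY

# 1. An inline assignment applies only to the command it prefixes.
DEMO_API_KEY=abc true
child_sees DEMO_API_KEY    # prints <unset>

# 2. A plain assignment creates a shell variable that is not exported.
DEMO_API_KEY=abc
child_sees DEMO_API_KEY    # prints <unset>

# 3. export makes the variable visible to child processes.
export DEMO_API_KEY
child_sees DEMO_API_KEY    # prints abc
```

Only the third form would make a real key visible to the skillgrade process.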

Maintainer

mgechev Core maintainer

Source details

Full Name: mgechev/skillgrade
Branch: main
Path in repo: skills/skillgrade-setup
License: MIT License
Topics: agent claude-code gemini-cli skill codex eval
