Agent skill
skillgrade-setup
Sets up and runs skillgrade evaluation pipelines for Agent Skills. Use when initializing eval configurations, running trials, reviewing results, or integrating with CI. Don't use for writing grader scripts, general test authoring, or non-agentic documentation.
Install this agent skill in your project
npx add-skill https://github.com/mgechev/skillgrade/tree/main/skills/skillgrade-setup
SKILL.md
Skillgrade Evaluation Setup
Procedures
Step 1: Install Skillgrade
- Verify that Node.js 20+ and Docker are available.
- Run `npm i -g skillgrade` to install the CLI globally.
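The prerequisite check above could be scripted roughly as follows. This is a minimal sketch: `node_major_ok` and `check_prereqs` are hypothetical helper names, not part of skillgrade, and the check assumes `node --version` prints a string of the form `v20.11.0`.

```shell
# node_major_ok VERSION: succeed if VERSION (e.g. "v20.11.0") has major >= 20.
node_major_ok() {
  major=${1#v}        # drop the leading "v" printed by `node --version`
  major=${major%%.*}  # keep only the major component
  [ "$major" -ge 20 ] 2>/dev/null
}

# check_prereqs: hypothetical helper that reports the first missing prerequisite.
check_prereqs() {
  command -v node >/dev/null 2>&1 || { echo "node not found" >&2; return 1; }
  node_major_ok "$(node --version)" || { echo "Node.js 20+ required" >&2; return 1; }
  command -v docker >/dev/null 2>&1 || { echo "docker not found" >&2; return 1; }
}

# Usage: check_prereqs && npm i -g skillgrade
```

The block only defines functions, so sourcing it has no side effects until `check_prereqs` is called.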
Step 2: Initialize an Eval Configuration
- Navigate to the skill directory (must contain a `SKILL.md`).
- Set the appropriate API key environment variable (`GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY`).
- Run `skillgrade init` to generate an `eval.yaml` with AI-powered tasks and graders.
- If an `eval.yaml` already exists, pass `--force` to overwrite: `skillgrade init --force`.
- Without an API key, a well-commented template is generated instead.
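Before running `skillgrade init`, it can help to confirm which API key, if any, is visible to the shell. The sketch below uses a hypothetical helper, `detect_api_key`. The order in which it checks the three variables is an assumption made for illustration and may not match the precedence skillgrade itself uses.

```shell
# detect_api_key: print the name of the first API key variable that is set.
# The Gemini, Anthropic, OpenAI check order is an assumption, not documented behavior.
detect_api_key() {
  if [ -n "${GEMINI_API_KEY:-}" ]; then
    echo GEMINI_API_KEY
  elif [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo ANTHROPIC_API_KEY
  elif [ -n "${OPENAI_API_KEY:-}" ]; then
    echo OPENAI_API_KEY
  else
    return 1  # no key set: init would generate the commented template instead
  fi
}

# Usage (inside a skill directory that contains SKILL.md):
#   detect_api_key || echo "no API key set" >&2
#   skillgrade init
```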
Step 3: Configure eval.yaml
- Read `references/eval-yaml-spec.md` for the full configuration schema.
- Define one or more tasks under the `tasks:` key. Each task requires:
  - `name`: unique task identifier
  - `instruction`: what the agent should accomplish
  - `workspace`: files to copy into the evaluation container
  - `graders`: one or more scoring mechanisms (see the `skillgrade-graders` skill)
- Optionally configure `defaults:` for agent, provider, trials, timeout, and threshold.
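As a rough sanity check that an `eval.yaml` mentions the required keys listed above, something like the following could be used. `check_eval_yaml` is a hypothetical helper. It is a plain text search, not a YAML parser or schema validator, so it can only catch keys that are missing from the file entirely.

```shell
# check_eval_yaml [FILE]: report required keys that never appear in FILE.
# Text match only; it does not verify nesting or that every task is complete.
check_eval_yaml() {
  file=${1:-eval.yaml}
  [ -f "$file" ] || { echo "$file not found" >&2; return 1; }
  status=0
  for key in tasks name instruction workspace graders; do
    # Match the key at any indentation, optionally after a list dash.
    if ! grep -Eq "^[[:space:]]*(-[[:space:]]+)?${key}:" "$file"; then
      echo "missing key: $key" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage: check_eval_yaml eval.yaml
```

For real validation, the schema in `references/eval-yaml-spec.md` remains the authority.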
Step 4: Run Evaluations
- Select an appropriate preset based on the evaluation goal:
  - `--smoke` (5 trials): Quick capability check.
  - `--reliable` (15 trials): Reliable pass rate estimate.
  - `--regression` (30 trials): High-confidence regression detection.
- Run the evaluation: `skillgrade --smoke`.
- Run a specific eval by name: `skillgrade --eval=fix-linting`.
- Run multiple evals: `skillgrade --eval=fix-linting,write-tests`.
- Run only deterministic graders (skip LLM calls): `skillgrade --grader=deterministic`.
- Run only LLM rubric graders: `skillgrade --grader=llm_rubric`.
- The agent is auto-detected from the API key. Override with `--agent=gemini|claude|codex`.
- Override the provider with `--provider=docker|local`.
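The preset choice above can be captured in a small lookup so that scripts refer to a goal rather than a flag. In this sketch the goal labels (`quick`, `estimate`, `regression`) and the function name `preset_flag` are hypothetical. The flags and trial counts are the ones listed in this step.

```shell
# preset_flag GOAL: print the skillgrade preset flag for an evaluation goal.
preset_flag() {
  case "$1" in
    quick)      echo "--smoke" ;;       # 5 trials
    estimate)   echo "--reliable" ;;    # 15 trials
    regression) echo "--regression" ;;  # 30 trials
    *) echo "unknown goal: $1" >&2; return 1 ;;
  esac
}

# Usage: skillgrade "$(preset_flag quick)"
```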
Step 5: Review Results
- Run `skillgrade preview` for a CLI report.
- Run `skillgrade preview browser` to open the web UI at `http://localhost:3847`.
- Reports are saved to `$TMPDIR/skillgrade/<skill-name>/results/`. Override with `--output=DIR`.
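To locate saved reports from a script, the default path can be rebuilt from the pattern above. `results_dir` is a hypothetical helper, and the fallback to `/tmp` when `TMPDIR` is unset is an assumption rather than documented skillgrade behavior.

```shell
# results_dir SKILL_NAME: print the default report directory for a skill.
results_dir() {
  base=${TMPDIR:-/tmp}  # assumption: use /tmp when TMPDIR is unset
  # Strip one trailing slash so the joined path has no "//".
  echo "${base%/}/skillgrade/$1/results"
}

# Usage: ls "$(results_dir my-skill)"
```

If `--output=DIR` was passed when running the evaluation, look in that directory instead.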
Step 6: Integrate with CI
- Add a GitHub Actions step that installs skillgrade, navigates to the skill directory, and runs with `--regression --ci --provider=local`.
- Use `--provider=local` in CI: the runner is already an ephemeral sandbox, so Docker adds overhead without benefit.
- The `--ci` flag causes a non-zero exit code if the pass rate falls below `--threshold` (default: 0.8).
- Read `references/ci-example.md` for a complete workflow template.
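The shell portion of such a CI step might look like the sketch below. `ci_eval` is a hypothetical wrapper name, and the skill directory argument is whatever path the repository uses. The workflow file itself is described in `references/ci-example.md` and is not reproduced here.

```shell
# ci_eval SKILL_DIR: install skillgrade, then run the regression preset in CI mode.
# The function's exit status is skillgrade's, so a pass rate below the threshold
# makes the CI step fail.
ci_eval() {
  skill_dir=$1
  npm i -g skillgrade || return 1
  # Run in a subshell so the caller's working directory is left unchanged.
  (
    cd "$skill_dir" || exit 1
    skillgrade --regression --ci --provider=local
  )
}

# Usage: ci_eval path/to/skill
```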
Error Handling
- If `skillgrade init` fails with "No SKILL.md found," verify the current directory contains a valid `SKILL.md` file.
- If evaluation hangs, check that Docker is running and the container has network access for API calls.
- If all trials fail with "No API key," ensure the environment variable is exported, not just set inline for a different command.
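The difference between an inline assignment and an exported variable, which underlies the last item, can be demonstrated with plain shell. This sketch uses a dummy value and a hypothetical helper, `child_sees_key`. It overwrites `GEMINI_API_KEY` in the current shell, so run it in a throwaway session.

```shell
# child_sees_key: succeed if a child process can see GEMINI_API_KEY.
child_sees_key() {
  sh -c '[ -n "${GEMINI_API_KEY:-}" ]'
}

unset GEMINI_API_KEY

# Inline assignment: only the `true` command on this line receives the variable.
GEMINI_API_KEY=dummy true
if child_sees_key; then inline_result=visible; else inline_result=missing; fi

# Exported: every later child process, including skillgrade, inherits it.
export GEMINI_API_KEY=dummy
if child_sees_key; then export_result=visible; else export_result=missing; fi

echo "inline: $inline_result, exported: $export_result"
# prints "inline: missing, exported: visible"
```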