Agent skill

skill-monitor

Analyze skill effectiveness across sessions. Computes per-skill metrics (action rate, friction, outcomes), identifies degrading skills, and generates improvement recommendations. Requires session-scan data in metrics.jsonl.

View SKILL.md on GitHub Repository

Stars 252

Forks 17

Install this agent skill to your Project

npx add-skill https://github.com/oliver-kriska/claude-elixir-phoenix/tree/main/.claude/skills/skill-monitor

SKILL.md

Skill Monitor

Closed-loop skill effectiveness monitoring. Reads session metrics, computes per-skill signals, identifies what's working and what needs improvement.

Inspired by the deploy-monitor-evaluate-improve feedback loop: skills get better over time instead of staying static.

Requirements

Requires .claude/session-metrics/metrics.jsonl from /session-scan. If no data: suggest running /session-scan first.

Usage

/skill-monitor                     # Dashboard: all skills
/skill-monitor --skill review      # Deep-dive on one skill
/skill-monitor --improve           # Generate improvement recommendations
/skill-monitor --window 30d        # Change comparison window (default: 7d)

What Main Context Does

Step 1: Parse Arguments

Extract from $ARGUMENTS:

--skill NAME: Focus on one skill (e.g., review, plan, investigate)
--improve: Spawn analysis agent for improvement recommendations
--window PERIOD: Comparison window (7d, 30d, all; default: 7d)

Step 2: Load Metrics

Read .claude/session-metrics/metrics.jsonl. For each entry, extract the skill_effectiveness field (added by compute-metrics.py v2).

Filter by window period. Count sessions with and without skill usage.

If no skill_effectiveness data exists in metrics: "Metrics were computed before skill tracking was added. Run /session-scan --rescan to recompute."

Step 3: Compute Per-Skill Aggregates

For each skill found across all sessions, aggregate:

| Metric              | Computation                                    |
|---------------------|------------------------------------------------|
| Total invocations   | Sum of invocation_count across sessions        |
| Sessions used in    | Count of sessions containing this skill        |
| Action rate         | Weighted avg of per-session action_rate         |
| Avg post-errors     | Weighted avg of avg_post_errors                |
| Avg post-corrections| Weighted avg of avg_post_corrections           |
| Outcome distribution| Count of effective/friction/no_action/mixed    |
| Effectiveness score | action_rate - (0.3 * avg_post_corrections)     |
| Adjusted score      | For analysis/check skills, use lower thresholds |

Skill type weighting: Analysis and check skills (verify, triage, perf, boundaries, pr-review, audit) have low action rates BY DESIGN — their success is "found issues" or "confirmed things pass". Apply adjusted thresholds:

Skill Type	Flag Threshold	Expected Action Rate
Execution (work, quick, full)	< 0.5	> 0.7
Analysis (perf, boundaries, audit, pr-review)	< 0.3	0.3-0.5
Check (verify, triage)	< 0.1	0.0-0.3
Knowledge (compound, learn, brief)	< 0.5	> 0.5

Also compute baseline friction (avg friction of sessions WITHOUT any skill usage) vs skill friction (avg friction of sessions WITH skill usage). Delta = skill_friction - baseline_friction. Negative delta = skills reduce friction (good).

Step 4: Display Dashboard

Dashboard mode (no --skill):

## Skill Effectiveness Dashboard (last {window})

Baseline friction (no skills): 0.32 | With skills: 0.18 | Delta: -0.14

| Skill           | Uses | Sessions | Action% | Errors | Corrections | Outcome    | Score |
|-----------------|------|----------|---------|--------|-------------|------------|-------|
| /phx:review     | 12   | 8        | 92%     | 0.5    | 0.1         | effective  | 0.89  |
| /phx:plan       | 9    | 7        | 100%    | 0.2    | 0.0         | effective  | 1.00  |
| /phx:investigate| 5    | 5        | 80%     | 1.2    | 0.4         | mixed      | 0.68  |

Skills needing attention: /phx:investigate (high post-errors)

Flag skills using type-adjusted thresholds (see weighting table above). Also flag if avg_post_corrections > 1 or outcome is predominantly "friction". When displaying flagged skills, note if the flag is "expected" for the skill type (e.g., verify at 0.24 is normal for a check skill).

Skill deep-dive (--skill NAME):

Show per-session breakdown for that skill, including session IDs, dates, and individual outcome signals. If session reports exist in .claude/session-analysis/, reference them.

Step 5: Improvement Mode (--improve)

Spawn skill-effectiveness-analyzer agent:

Agent(subagent_type="skill-effectiveness-analyzer", model="sonnet", prompt="""
Analyze skill effectiveness data and recommend improvements.

Metrics data: {aggregated_metrics_json}

Sessions with friction outcomes: {session_ids}

For each underperforming skill:
1. Identify failure patterns from outcome signals
2. Propose specific skill/agent changes
3. Suggest new Iron Laws if patterns are systematic

Write recommendations to: .claude/skill-metrics/recommendations-{date}.md
""")

Step 6: Write Output

Write aggregated metrics to .claude/skill-metrics/dashboard-{date}.json:

json

{
  "computed_at": "2026-03-03T14:00:00Z",
  "window": "7d",
  "baseline_friction": 0.32,
  "skill_friction": 0.18,
  "friction_delta": -0.14,
  "skills": { ... },
  "flagged_skills": ["investigate"]
}

Append-only: never modify previous dashboard files.

Iron Laws

NEVER modify metrics.jsonl — read-only from this skill
Baseline comparison is mandatory — raw numbers without baseline are meaningless
Flag, don't judge — surface data, let the human decide what to fix
Evidence tags on recommendations — every suggestion needs session citations

Integration

/session-scan → metrics.jsonl (with skill_effectiveness)
       ↓
/skill-monitor → dashboard + flagged skills
       ↓
/skill-monitor --improve → recommendations
       ↓
Developer updates skills/agents → deploy → repeat

References

references/effectiveness-metrics.md — Full metrics schema and evaluation criteria
references/improvement-template.md — Template for improvement recommendations

Maintainer

oliver-kriska Core maintainer

Source details

Full Name: oliver-kriska/claude-elixir-phoenix
Branch: main
Path in repo: .claude/skills/skill-monitor
License: MIT License
Topics: claude-code claude claude-code-skills automation claude-skills vibe-coding claude-code-plugin elixir elixir-phoenix phoenix elixir-lang phoenix-framework

Featured Tools

Join Our Newsletter

Guide plugin development workflow — editing skills, agents, hooks, or eval framework in this repo. Use when modifying files in plugins/elixir-phoenix/, lab/eval/, or lab/autoresearch/. Ensures changes pass eval, lint, and tests before committing.

252 17

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Skill Monitor

Requirements

Usage

What Main Context Does

Step 1: Parse Arguments

Step 2: Load Metrics

Step 3: Compute Per-Skill Aggregates

Step 4: Display Dashboard

Step 5: Improvement Mode (--improve)

Step 6: Write Output

Iron Laws

Integration

References

Recommended Agent Skills

lab:autoresearch

promote

session-trends

cc-changelog

session-scan

plugin-dev-workflow