proof-of-work
Proof artifact generation patterns for task validation. Covers screenshots, test results, deployments, and confidence scoring.
Install this agent skill to your Project
npx add-skill https://github.com/MadAppGang/claude-code/tree/main/plugins/autopilot/skills/proof-of-work
SKILL.md
plugin: autopilot
updated: 2026-01-20
Proof-of-Work
Version: 0.1.0
Purpose: Generate validation artifacts for autonomous task completion
Status: Phase 1
When to Use
Use this skill when you need to:
- Generate proof artifacts after task completion
- Capture screenshots for UI verification
- Parse and report test results
- Calculate confidence scores for task validation
- Determine if a task can be auto-approved
Overview
Proof-of-work is the mechanism that validates task completion. Every finished task must include verifiable artifacts that demonstrate the work was done correctly.
Proof Types by Task
Bug Fix Proof
| Artifact | Required | Purpose |
|---|---|---|
| Git diff | Yes | Show minimal, focused changes |
| Test results | Yes | All tests passing |
| Regression test | Yes | Specific test for the bug |
| Error log (before/after) | Optional | Evidence that the error no longer occurs |
Feature Proof
| Artifact | Required | Purpose |
|---|---|---|
| Screenshots | Yes | Visual verification |
| Test results | Yes | Functionality works |
| Coverage report | Yes | >= 80% coverage |
| Build output | Yes | Builds successfully |
| Deployment URL | Optional | Live demo |
UI Change Proof
| Artifact | Required | Purpose |
|---|---|---|
| Desktop screenshot | Yes | 1920x1080 view |
| Mobile screenshot | Yes | 375x667 view |
| Tablet screenshot | Yes | 768x1024 view |
| Accessibility score | Yes | Lighthouse accessibility score >= 80 |
| Visual regression | Optional | BackstopJS diff |
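The "Required" columns in the three tables above can be treated as a checklist. The following is a minimal sketch of that idea, not part of the skill itself: the names `TaskType`, `REQUIRED_ARTIFACTS`, and `missingArtifacts`, and the artifact identifiers, are hypothetical and chosen only for illustration.

```typescript
// Hypothetical names: nothing in this block is defined by the skill itself.
type TaskType = 'bug-fix' | 'feature' | 'ui-change';

// Required artifacts per task type, copied from the "Required: Yes" rows above.
const REQUIRED_ARTIFACTS: Record<TaskType, readonly string[]> = {
  'bug-fix': ['git-diff', 'test-results', 'regression-test'],
  'feature': ['screenshots', 'test-results', 'coverage-report', 'build-output'],
  'ui-change': [
    'desktop-screenshot',
    'mobile-screenshot',
    'tablet-screenshot',
    'accessibility-score',
  ],
};

// Returns the required artifacts that have not been collected yet.
function missingArtifacts(
  taskType: TaskType,
  collected: readonly string[],
): string[] {
  return REQUIRED_ARTIFACTS[taskType].filter(
    (name) => collected.indexOf(name) === -1,
  );
}

// A bug fix with a diff and test results still lacks its regression test.
const missing = missingArtifacts('bug-fix', ['git-diff', 'test-results']);
console.log(missing);
```

Optional artifacts are left out of the map on purpose, so an empty result means only that the required set is complete.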
Screenshot Capture
Playwright Pattern:
import { chromium } from 'playwright';

// Viewports match the sizes listed in the UI Change Proof table.
const VIEWPORTS = [
  { name: 'desktop', width: 1920, height: 1080 },
  { name: 'mobile', width: 375, height: 667 },
  { name: 'tablet', width: 768, height: 1024 },
];

async function captureScreenshots(url: string, outputDir: string): Promise<void> {
  const browser = await chromium.launch({ headless: true });
  try {
    const context = await browser.newContext();
    const page = await context.newPage();

    for (const { name, width, height } of VIEWPORTS) {
      // Resize first, then reload so the page lays out for this viewport.
      await page.setViewportSize({ width, height });
      await page.goto(url);
      await page.waitForLoadState('networkidle');
      await page.screenshot({
        path: `${outputDir}/${name}.png`,
        fullPage: true,
      });
    }
  } finally {
    // Close the browser even if a navigation or screenshot fails.
    await browser.close();
  }
}
Confidence Scoring
Algorithm:
interface ProofArtifacts {
  testResults?: { passed: number; total: number };
  buildSuccessful?: boolean;
  lintErrors?: number;
  screenshots?: string[];
  testCoverage?: number; // percentage, 0-100
  performanceScore?: number; // collected but not scored below
}

function calculateConfidence(artifacts: ProofArtifacts): number {
  let score = 0;

  // Tests (40 points): all-or-nothing, and an empty test run earns nothing
  if (artifacts.testResults) {
    const { passed, total } = artifacts.testResults;
    if (total > 0 && passed === total) {
      score += 40;
    }
  }

  // Build (20 points)
  if (artifacts.buildSuccessful) {
    score += 20;
  }

  // Coverage (20 points): a missing report or 0% coverage earns nothing
  if (artifacts.testCoverage) {
    if (artifacts.testCoverage >= 80) score += 20;
    else if (artifacts.testCoverage >= 60) score += 15;
    else if (artifacts.testCoverage >= 40) score += 10;
    else score += 5;
  }

  // Screenshots (10 points): full credit for three or more
  if (artifacts.screenshots) {
    if (artifacts.screenshots.length >= 3) score += 10;
    else if (artifacts.screenshots.length >= 1) score += 5;
  }

  // Lint (10 points): only when lint ran and reported zero errors
  if (artifacts.lintErrors === 0) {
    score += 10;
  }

  return score; // 0-100
}
Confidence Thresholds
| Confidence | Action |
|---|---|
| >= 95% | Auto-approve (In Review -> Done) |
| 80-94% | Manual review required |
| < 80% | Validation failed, iterate |
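The threshold table can be expressed as a small lookup. This is a sketch under the assumption that the score is the 0-100 value returned by `calculateConfidence`; the `ReviewAction` type and the `actionForConfidence` function are hypothetical names introduced here.

```typescript
// Hypothetical names: ReviewAction and actionForConfidence are illustrative.
type ReviewAction = 'auto-approve' | 'manual-review' | 'iterate';

// Maps a 0-100 confidence score to the action from the thresholds table.
function actionForConfidence(score: number): ReviewAction {
  if (score >= 95) return 'auto-approve'; // In Review -> Done
  if (score >= 80) return 'manual-review';
  return 'iterate'; // validation failed
}

console.log(actionForConfidence(100));
console.log(actionForConfidence(85));
console.log(actionForConfidence(40));
```

Because every component of the score is awarded in multiples of 5, the only scores that reach auto-approval are 95 and 100.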
Proof Summary Template
# Proof of Work
**Task**: {issue_id}
**Type**: {task_type}
**Confidence**: {score}%
## Test Results
- Total: {total}
- Passed: {passed}
- Failed: {failed}
- Coverage: {coverage}%
## Build
- Status: {status}
- Duration: {duration}
## Screenshots
- Desktop: proof/desktop.png
- Mobile: proof/mobile.png
- Tablet: proof/tablet.png
## Artifacts
- test-results.txt
- coverage.json
- build-output.txt
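One way to fill the template is a plain string builder. The sketch below assumes the placeholders map to the fields of a hypothetical `ProofSummaryInput` object and that `{failed}` can be derived as total minus passed; neither the interface nor `renderProofSummary` is defined by the skill.

```typescript
// Hypothetical shape: field names mirror the template placeholders.
interface ProofSummaryInput {
  issueId: string;
  taskType: string;
  score: number;
  total: number;
  passed: number;
  coverage: number;
  buildStatus: string;
  buildDuration: string;
}

// Fills the Proof Summary Template; failed tests are derived as total - passed.
function renderProofSummary(input: ProofSummaryInput): string {
  const failed = input.total - input.passed;
  return [
    '# Proof of Work',
    `**Task**: ${input.issueId}`,
    `**Type**: ${input.taskType}`,
    `**Confidence**: ${input.score}%`,
    '## Test Results',
    `- Total: ${input.total}`,
    `- Passed: ${input.passed}`,
    `- Failed: ${failed}`,
    `- Coverage: ${input.coverage}%`,
    '## Build',
    `- Status: ${input.buildStatus}`,
    `- Duration: ${input.buildDuration}`,
    '## Screenshots',
    '- Desktop: proof/desktop.png',
    '- Mobile: proof/mobile.png',
    '- Tablet: proof/tablet.png',
    '## Artifacts',
    '- test-results.txt',
    '- coverage.json',
    '- build-output.txt',
  ].join('\n');
}

// All values below are placeholders for illustration only.
const summary = renderProofSummary({
  issueId: 'ISSUE-1',
  taskType: 'feature',
  score: 100,
  total: 15,
  passed: 15,
  coverage: 85,
  buildStatus: 'success',
  buildDuration: '42s',
});
console.log(summary);
```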
Examples
Example 1: Feature Proof Generation
const proof = {
  testResults: { passed: 15, total: 15 },
  buildSuccessful: true,
  lintErrors: 0,
  screenshots: ['desktop.png', 'mobile.png', 'tablet.png'],
  testCoverage: 85,
};
const confidence = calculateConfidence(proof);
// 40 (tests) + 20 (build) + 20 (coverage) + 10 (screenshots) + 10 (lint) = 100%
Example 2: Partial Proof
const proof = {
  testResults: { passed: 12, total: 15 }, // Some failing
  buildSuccessful: true,
  lintErrors: 2,
  screenshots: ['desktop.png'],
  testCoverage: 65,
};
const confidence = calculateConfidence(proof);
// 0 (tests fail) + 20 (build) + 15 (coverage) + 5 (1 screenshot) + 0 (lint errors) = 40%
// Result: Validation failed, must iterate
Best Practices
- Always capture screenshots for UI work
- Run full test suite, not just affected tests
- Include coverage report for features
- Build must pass before any proof is valid
- Store proofs in session directory for debugging
- Generate proof summary in markdown for Linear comments
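The scoring function treats the build as one 20-point component, so a proof with a failed build can still reach 80 points and land in manual review. One possible way to enforce the "build must pass" practice is to gate the score on the build result. This is a sketch of that interpretation, not the skill's own behavior; `gatedConfidence` is a hypothetical helper that takes the scoring function as a parameter.

```typescript
// Hypothetical helper: returns 0 unless the build is known to have passed,
// otherwise defers to the supplied scoring function.
function gatedConfidence<T extends { buildSuccessful?: boolean }>(
  artifacts: T,
  scoreFn: (artifacts: T) => number,
): number {
  // A missing build result is treated the same as a failed build.
  if (artifacts.buildSuccessful !== true) return 0;
  return scoreFn(artifacts);
}

// Stand-in scorer for the demo; in practice this would be calculateConfidence.
const fixedScore = () => 80;

console.log(gatedConfidence({ buildSuccessful: false }, fixedScore));
console.log(gatedConfidence({ buildSuccessful: true }, fixedScore));
```

Passing the scorer as an argument keeps the gate independent of how the individual points are weighted.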