Agent skill

pinchbench

Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.

Install this agent skill to your Project

npx add-skill https://github.com/aiskillstore/marketplace/tree/main/skills/pinchbench/pinchbench

Metadata

Additional technical details for this skill

author: pinchbench
version: 1.0.0
homepage: https://pinchbench.com
repository: https://github.com/pinchbench/skill

SKILL.md

PinchBench Benchmark Skill

PinchBench measures how well an LLM performs as the brain of an OpenClaw agent. Results are published on a public leaderboard at pinchbench.com.

Prerequisites

Python 3.10+
uv package manager
OpenClaw instance (this agent)

Quick Start

bash

cd <skill_directory>

# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4

# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only

# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar,task_02_stock

# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload

Available Tasks (23)

Task	Category	Description
`task_00_sanity`	Basic	Verify agent works
`task_01_calendar`	Productivity	Calendar event creation
`task_02_stock`	Research	Stock price lookup
`task_03_blog`	Writing	Blog post creation
`task_04_weather`	Coding	Weather script
`task_05_summary`	Analysis	Document summarization
`task_06_events`	Research	Conference research
`task_07_email`	Writing	Email drafting
`task_08_memory`	Memory	Context retrieval
`task_09_files`	Files	File structure creation
`task_10_workflow`	Integration	Multi-step API workflow
`task_11_clawdhub`	Skills	ClawHub interaction
`task_12_skill_search`	Skills	Skill discovery
`task_13_image_gen`	Creative	Image generation
`task_14_humanizer`	Writing	Text humanization
`task_15_daily_summary`	Productivity	Daily digest
`task_16_email_triage`	Email	Inbox triage
`task_17_email_search`	Email	Email search
`task_18_market_research`	Research	Market analysis
`task_19_spreadsheet_summary`	Analysis	Spreadsheet analysis
`task_20_eli5_pdf_summary`	Analysis	PDF simplification
`task_21_openclaw_comprehension`	Knowledge	OpenClaw docs comprehension
`task_22_second_brain`	Memory	Knowledge management
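Because `--suite` accepts comma-separated task IDs, the table above can be used to assemble a suite for a single category. The following is a minimal sketch, not part of PinchBench: the `TASKS` mapping is a hand-copied subset of the table, and `suite_for` is a hypothetical helper name.

```python
# Hypothetical helper: build a --suite value from the task table above.
# TASKS is a hand-copied subset of that table; PinchBench may not expose such a mapping.
TASKS = {
    "task_02_stock": "Research",
    "task_06_events": "Research",
    "task_08_memory": "Memory",
    "task_16_email_triage": "Email",
    "task_17_email_search": "Email",
    "task_18_market_research": "Research",
    "task_22_second_brain": "Memory",
}


def suite_for(*categories: str) -> str:
    """Return a comma-separated string of task IDs in the given categories."""
    wanted = set(categories)
    return ",".join(task for task, cat in TASKS.items() if cat in wanted)


if __name__ == "__main__":
    # The printed string can be passed as the value of --suite.
    print(suite_for("Email"))
```

The resulting string would be used as `uv run benchmark.py --model <model> --suite <string>`.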

Command Line Options

Option	Description
`--model`	Model identifier (e.g., `anthropic/claude-sonnet-4`)
`--suite`	`all`, `automated-only`, or comma-separated task IDs
`--output-dir`	Results directory (default: `results/`)
`--timeout-multiplier`	Scale task timeouts for slower models
`--runs`	Number of runs per task for averaging
`--no-upload`	Skip uploading to leaderboard
`--register`	Request new API token for submissions
`--upload FILE`	Upload previous results JSON
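When scripting repeated runs, for example to compare several models, the options above can be assembled programmatically. This sketch uses only flags listed in the table. The function name `build_command` is hypothetical, and it is an assumption that `--runs` takes an integer and `--timeout-multiplier` takes a number.

```python
import shlex


def build_command(model, suite=None, output_dir=None, runs=None,
                  timeout_multiplier=None, upload=True):
    """Assemble an argv list for benchmark.py from the documented flags.

    Assumption: --runs takes an integer and --timeout-multiplier a number;
    the table above does not state the value formats.
    """
    cmd = ["uv", "run", "benchmark.py", "--model", model]
    if suite is not None:
        cmd += ["--suite", suite]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if runs is not None:
        cmd += ["--runs", str(runs)]
    if timeout_multiplier is not None:
        cmd += ["--timeout-multiplier", str(timeout_multiplier)]
    if not upload:
        cmd.append("--no-upload")
    return cmd


if __name__ == "__main__":
    cmd = build_command("anthropic/claude-sonnet-4",
                        suite="automated-only", runs=3, upload=False)
    # Print the shell form; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the arguments safe to hand to `subprocess.run` without shell quoting.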

Token Registration

To submit results to the leaderboard:

bash

# Register for an API token (one-time)
uv run benchmark.py --register

# Run benchmark (auto-uploads with token)
uv run benchmark.py --model anthropic/claude-sonnet-4

Results

Results are saved as JSON in the output directory:

bash

# View task scores
jq '.tasks[] | {task_id, score: .grading.mean}' results/0001_anthropic-claude-sonnet-4.json

# Show failed tasks
jq '.tasks[] | select(.grading.mean < 0.5)' results/*.json

# Calculate overall score
jq '{average: ([.tasks[].grading.mean] | add / length)}' results/*.json
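The same queries can be run from Python when `jq` is not available. This sketch assumes only the fields the `jq` commands above rely on: a top-level `tasks` array whose entries have `task_id` and `grading.mean`. The function names are hypothetical, and the rest of the results schema is not assumed.

```python
import json
from pathlib import Path


def task_scores(result: dict) -> dict:
    """Map each task_id to its mean grading score."""
    return {t["task_id"]: t["grading"]["mean"] for t in result["tasks"]}


def failed_tasks(result: dict, threshold: float = 0.5) -> list:
    """Return the IDs of tasks scoring below the threshold (0.5, as in the jq example)."""
    return [tid for tid, score in task_scores(result).items() if score < threshold]


def average_score(result: dict) -> float:
    """Return the unweighted mean of all task scores in one results file."""
    scores = list(task_scores(result).values())
    if not scores:
        raise ValueError("results file contains no tasks")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Assumes the default output directory, results/.
    for path in sorted(Path("results").glob("*.json")):
        result = json.loads(path.read_text())
        print(path.name, round(average_score(result), 3), failed_tasks(result))
```

Like the `jq` version, this averages each results file separately, so comparing models means comparing one line of output per file.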

Adding Custom Tasks

Create a Markdown file in tasks/ that follows TASK_TEMPLATE.md. Each task needs:

YAML frontmatter (id, name, category, grading_type, timeout)
Prompt section
Expected behavior
Grading criteria
Automated checks (Python grading function)
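Before running a new task, it can help to confirm that its frontmatter has every field listed above. This is a minimal sketch, not PinchBench's own loader: it handles only flat `key: value` pairs, and the sample values (`task_23_example`, `automated`) are hypothetical. TASK_TEMPLATE.md remains the authority on valid values.

```python
REQUIRED_FIELDS = ("id", "name", "category", "grading_type", "timeout")


def frontmatter_fields(text: str) -> dict:
    """Parse flat `key: value` pairs between the leading '---' delimiters.

    This is a simplification; use a real YAML parser for nested values.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return {}  # closing delimiter never found


def missing_fields(text: str) -> list:
    """Return the required frontmatter fields absent from a task file."""
    found = frontmatter_fields(text)
    return [name for name in REQUIRED_FIELDS if name not in found]


if __name__ == "__main__":
    # Hypothetical task file with the timeout field left out.
    sample = """---
id: task_23_example
name: Example task
category: Basic
grading_type: automated
---
Prompt goes here.
"""
    print(missing_fields(sample))
```

The check covers only the frontmatter. The prompt, expected behavior, grading criteria, and grading function still need to be reviewed against the template.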

Leaderboard

View results at pinchbench.com. The leaderboard shows:

Model rankings by overall score
Per-task breakdowns
Historical performance trends

Maintainer

aiskillstore (core maintainer)

Source details

Full Name: aiskillstore/marketplace
Branch: main
Path in repo: skills/pinchbench/pinchbench
Topics: claude-code claude codex-skills skills codex claude-skills ai-skills
