Sponsored by

Find leads on Reddit on auto pilot

Agent skills
prompt-tuner

Agent skill

prompt-tuner

Improve embedded LLM system prompt based on evaluation test failures

View SKILL.md on GitHub Repository

Stars 163

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/prompt-tuner

SKILL.md

Prompt Tuner

Iteratively improve the embedded backend's system prompt to increase command generation accuracy.

When to Use

"tune the prompt"
"improve embedded accuracy"
"fix LLM command generation"
"run prompt tuning cycle"
After evaluation tests show low accuracy

Workflow

Phase 1: Baseline Measurement

Run evaluation tests with embedded backend:

bash

./target/release/caro test --backend embedded

Record:

Overall accuracy percentage
Category breakdown (Website Claim, Natural Variant, Edge Case)
List of failed test cases with expected vs actual commands

Phase 2: Failure Analysis

For each failed test case, identify the pattern:

Pattern	Example	Fix
Wrong path	`find /` instead of `find .`	Add rule: "ALWAYS use current directory '.'"
GNU flags	`--max-depth` on macOS	Add rule: "Use BSD-compatible flags"
Missing filters	No `-name "*.py"`	Add rule: "Include ALL relevant filters"
Time semantics	`-mtime -1` vs `-mtime 1`	Add clear mtime documentation
Quote style	Single vs double quotes	Usually equivalent, low priority
Flag order	`-type f -name` vs `-name -type f`	Usually equivalent, low priority

Phase 3: Prompt Improvement

Edit the system prompt in:

src/backends/embedded/embedded_backend.rs

Function: create_system_prompt()

Improvement strategies:

Add explicit rules for common failure patterns
Add concrete examples matching test cases
Use numbered rules with clear priorities
Keep prompt concise (LLMs perform worse with verbose prompts)

Phase 4: Verification

Build and re-run tests:

bash

cargo build --release
./target/release/caro test --backend embedded

Compare results:

Did overall accuracy improve?
Which categories improved/regressed?
Are remaining failures semantic equivalents (flag order, quotes)?

Phase 5: Iterate or Commit

If accuracy improved significantly:

bash

git add src/backends/embedded/embedded_backend.rs
git commit -m "feat(prompt): Improve embedded backend accuracy from X% to Y%"

If not improved or regressed:

Revert changes
Try different approach
Consider if test expectations need adjustment

Success Metrics

Level	Accuracy	Action
Poor	< 50%	Major prompt rewrite needed
Acceptable	50-70%	Targeted improvements
Good	70-85%	Minor tuning
Excellent	> 85%	Consider semantic equivalence in remaining failures

Tips

One change at a time: Make small, focused improvements
Check equivalents: Some "failures" are functionally equivalent commands
Test edge cases: Simple prompts often break on edge cases
BSD vs GNU: macOS uses BSD utilities, not GNU coreutils
Keep examples minimal: Too many examples can confuse small models

Example Session

User: /prompt-tuner

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/prompt-tuner
License: MIT License

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Didn't find tool you were looking for?