# optimize-prompt-gepa (Agent Skill)

Optimizes prompts using the full GEPA methodology (Genetic-Pareto Evolution). Use when the user wants to improve a prompt's accuracy on test cases, mentions "optimize prompt" or "improve prompt", or has examples of desired input/output pairs. Implements Pareto frontier selection, trace-based reflection, and crossover mutations.

Install this agent skill in your project:

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/testing/optimize-prompt-gepa

## SKILL.md
You (Claude) are the optimizer. You run prompts, capture traces, reflect on failures, and evolve improvements.
<quick_start> User provides:
- A seed prompt to optimize
- Test cases (input + expected output pairs)
You run the full GEPA loop and return the optimized prompt.
Example:
Seed: "Extract action items"
Test case:
Input: "Meeting notes: John will prepare the report by Friday. Sarah to review."
Expected: "- John: Prepare report (Due: Friday)\n- Sarah: Review (Due: unspecified)"
After GEPA optimization:
"Extract action items from text. Think step by step:
1. Identify each person mentioned
2. Find what they committed to do
3. Extract any deadline mentioned
Format each item as:
- [Person]: [Task] (Due: [deadline or 'unspecified'])
Rules:
- Skip items without clear ownership
- If deadline is vague (e.g., 'soon', 'later'), mark as 'unspecified'
- One line per action item"
</quick_start>
Required:
- Seed prompt - What prompt do you want to optimize?
- Test cases - Examples of input and expected output
  - Minimum: 1 example (I'll generate more synthetically)
  - Recommended: 5-10 examples for robust optimization
Optional:
- Target score (default: 90%)
- Max iterations (default: 10)
- Diversity weight (default: 0.3) - How much to favor diverse solutions
Please provide your prompt and test cases.
<data_structures>
# Prompt Candidate
Candidate = {
id: string,
prompt: string,
scores: {test_case_id: float}, # Score per test case
avg_score: float,
parent_ids: [string], # For tracking lineage
mutation_type: "reflection" | "crossover" | "seed"
}
# Pareto Frontier
ParetoFrontier = [Candidate] # Candidates not dominated by any other
# Test Case with Trace
EvaluatedCase = {
input: string,
expected: string,
actual: string,
trace: string, # Full reasoning chain
score: float,
feedback: string
}
</data_structures>
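If it helps to keep state organized, here is a minimal Python sketch of these structures. The derived `avg_score` property and the field defaults are illustrative assumptions, not part of the spec:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Literal

@dataclass
class Candidate:
    id: str
    prompt: str
    scores: Dict[str, float] = field(default_factory=dict)  # test_case_id -> score in [0, 1]
    parent_ids: List[str] = field(default_factory=list)     # lineage tracking
    mutation_type: Literal["reflection", "crossover", "seed"] = "seed"

    @property
    def avg_score(self) -> float:
        # Derived rather than stored, so it never goes stale (an assumption).
        return sum(self.scores.values()) / len(self.scores) if self.scores else 0.0

@dataclass
class EvaluatedCase:
    input: str
    expected: str
    actual: str
    trace: str      # full reasoning chain
    score: float    # raw 0-10 score per the scoring guide
    feedback: str

# The Pareto frontier is simply a list of candidates not dominated by any other.
ParetoFrontier = List[Candidate]
```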
Assign unique IDs to test cases (tc_1, tc_2, etc.)
Given these examples:
{for each test_case: input -> expected}
Generate 5 more examples that:
- Follow the EXACT same output format
- Cover edge cases:
* Empty/null inputs
* Multiple items
* Missing information
* Ambiguous cases
- Use different names, numbers, contexts
Return as JSON array: [{"input": "...", "expected": "..."}, ...]
Add generated examples to test_cases with IDs.
seed_candidate = {
id: "c_0",
prompt: seed_prompt,
scores: {},
parent_ids: [],
mutation_type: "seed"
}
For each test_case:

1. Run with trace capture:

   {seed_prompt}

   Input: {test_case.input}

   Think through this step by step, then provide your final answer.

   ## Reasoning:
   [Your step-by-step thinking]

   ## Answer:
   [Your final output]

2. Parse the trace (reasoning) and the answer separately
3. Score the answer against the expected output (0-10)
4. Store: seed_candidate.scores[test_case.id] = score / 10

Calculate avg_score = mean(all scores)
Initialize:

pareto_frontier = [seed_candidate]
all_candidates = [seed_candidate]
best_avg_score = avg_score

Report: "Baseline score: {avg_score:.0%}"
4. Optimization Loop

FOR each iteration (up to max_iterations):

4a. Pareto Selection

Select a parent candidate using tournament selection with a diversity bonus:
For 3 random candidates from pareto_frontier:
Calculate selection_score = avg_score + diversity_weight * uniqueness
(uniqueness = how different this candidate's strengths are from others)
Select candidate with highest selection_score
selected_parent = winner
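A minimal sketch of this selection step. The uniqueness measure here (mean absolute distance between a candidate's per-case scores and the rest of the frontier) is one possible definition; the spec leaves it open:

```python
import random

def uniqueness(cand, frontier):
    """Assumed metric: how far this candidate's per-case scores sit
    from the average of the other frontier members."""
    others = [c for c in frontier if c.id != cand.id]
    if not others:
        return 0.0
    dists = []
    for tc_id, s in cand.scores.items():
        mean_other = sum(o.scores.get(tc_id, 0.0) for o in others) / len(others)
        dists.append(abs(s - mean_other))
    return sum(dists) / len(dists) if dists else 0.0

def tournament_select(frontier, diversity_weight=0.3, k=3):
    # Pick k random contestants, then keep the best combined score.
    contestants = random.sample(frontier, min(k, len(frontier)))
    return max(
        contestants,
        key=lambda c: c.avg_score + diversity_weight * uniqueness(c, frontier),
    )
```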
4b. Mini-batch Evaluation

Select a mini-batch of 3 test cases, prioritizing:
- Cases where selected_parent scored lowest (exploitation)
- 1 random case (exploration)
Run selected_parent.prompt on the mini-batch WITH TRACES.
Collect: [{input, expected, actual, trace, score, feedback}, ...]

mini_batch_score = average score

Report: "Iteration {i}: Testing '{selected_parent.id}' on mini-batch: {mini_batch_score:.0%}"
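A sketch of the mini-batch picker under these priorities (standard library only; test cases assumed to be dicts with an "id" key, matching the structures above):

```python
import random

def select_mini_batch(parent, test_cases, size=3):
    # Exploitation: the cases where the parent has scored lowest so far.
    ranked = sorted(test_cases, key=lambda tc: parent.scores.get(tc["id"], 0.0))
    batch = ranked[: size - 1]
    # Exploration: one random case not already in the batch.
    remaining = [tc for tc in test_cases if tc not in batch]
    if remaining:
        batch.append(random.choice(remaining))
    return batch
```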
4c. Early Success Check

IF mini_batch_score >= target_score:
    Run full validation on ALL test cases
    IF full_avg >= target_score:
        Report: "✓ Target reached: {full_avg:.0%}"
        GOTO step 5 (output)
4d. Trace-Based Reflection

Collect failures (score < 0.8) with their TRACES:
## Reflection Task
Current prompt:
{selected_parent.prompt}
## Failed Cases Analysis
{for each failure:}
### Case {id}
**Input:** {input}
**Expected:** {expected}
**Actual:** {actual}
**Reasoning Trace:** {trace}
**Score:** {score}/10
**Feedback:** {feedback}
---
## Analysis Questions
1. **Trace Analysis**: Where in the reasoning did the model go wrong?
- Did it misunderstand the task?
- Did it miss information in the input?
- Did it apply wrong formatting?
2. **Pattern Recognition**: What patterns do you see across failures?
- Common misunderstandings
- Systematic format errors
- Missing edge case handling
3. **Root Cause**: What's the SINGLE most impactful fix?
4. **Specific Rules**: List 3-5 explicit rules to add to the prompt.
Provide your analysis:
Save reflection_analysis
4e. Generate Mutations

Create 2 new candidates:
Mutation 1: Reflection-based
Current prompt:
{selected_parent.prompt}
Analysis of failures:
{reflection_analysis}
Create an improved prompt that:
- Addresses ALL identified issues
- Includes explicit rules from analysis
- Adds step-by-step reasoning instructions if helpful
- Specifies exact output format with examples
Write ONLY the new prompt (no explanation):
Mutation 2: Crossover (if pareto_frontier has 2+ candidates)
You have two successful prompts with different strengths:
Prompt A (excels on: {cases where A > B}):
{candidate_a.prompt}
Prompt B (excels on: {cases where B > A}):
{candidate_b.prompt}
Create a NEW prompt that combines the best elements of both.
Merge their rules, keep the most specific instructions from each.
Write ONLY the merged prompt:
Create new candidates:
- mutation_1 = {id: "c_{n}", prompt: reflection_result, parent_ids: [selected_parent.id], mutation_type: "reflection"}
- mutation_2 = {id: "c_{n+1}", prompt: crossover_result, parent_ids: [a.id, b.id], mutation_type: "crossover"}
4f. Full Evaluation of New Candidates

For each new candidate:
- Run on ALL test cases with traces
- Calculate scores per test case and avg_score
4g. Update Pareto Frontier

For each new candidate:
- Add to all_candidates
Check Pareto dominance:
- Candidate A dominates B if A scores >= B on ALL test cases AND > on at least one
Update pareto_frontier:
- Add new candidate if not dominated by any existing
- Remove any existing candidates now dominated by new one
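As a minimal sketch, the dominance check and frontier update:

```python
def dominates(a, b, test_ids):
    """A dominates B: >= on every test case AND strictly > on at least one."""
    ge_all = all(a.scores.get(t, 0.0) >= b.scores.get(t, 0.0) for t in test_ids)
    gt_any = any(a.scores.get(t, 0.0) > b.scores.get(t, 0.0) for t in test_ids)
    return ge_all and gt_any

def update_frontier(frontier, new, test_ids):
    if any(dominates(old, new, test_ids) for old in frontier):
        return frontier  # newcomer is dominated; frontier unchanged
    # Keep only existing candidates the newcomer does not dominate.
    survivors = [old for old in frontier if not dominates(new, old, test_ids)]
    return survivors + [new]
```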
4h. Track Best

IF any new candidate has avg_score > best_avg_score:
    best_avg_score = new avg_score
    Report: "✓ New best: {best_avg_score:.0%} (candidate {id})"
ELSE:
    Report: "No improvement. Pareto frontier size: {len(pareto_frontier)}"
4i. Diversity Check

IF all candidates in pareto_frontier have similar prompts (>80% overlap):
    Report: "⚠ Low diversity. Injecting random mutation."
    Create random_mutation with aggressive changes
    Add to next iteration's candidates
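To make the ">80% overlap" test concrete, difflib's similarity ratio is a rough stand-in (an assumption; the spec doesn't fix a similarity measure):

```python
from difflib import SequenceMatcher

def low_diversity(frontier, threshold=0.8):
    """True when every pair of frontier prompts is more similar than the threshold."""
    prompts = [c.prompt for c in frontier]
    for i in range(len(prompts)):
        for j in range(i + 1, len(prompts)):
            if SequenceMatcher(None, prompts[i], prompts[j]).ratio() < threshold:
                return False  # at least one sufficiently different pair exists
    return len(prompts) > 1
```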
END FOR
5. Output

Present final results:
## GEPA Optimization Results
### Performance
| Metric | Value |
|--------|-------|
| Baseline Score | {seed_candidate.avg_score:.0%} |
| Final Score | {best_candidate.avg_score:.0%} |
| Improvement | +{improvement:.0%} |
| Iterations | {iterations_run} |
| Candidates Evaluated | {len(all_candidates)} |
| Pareto Frontier Size | {len(pareto_frontier)} |
### Original Prompt
{seed_prompt}
### Optimized Prompt
{best_candidate.prompt}
### Per-Case Performance
| Test Case | Before | After | Δ |
|-----------|--------|-------|---|
{for each test_case:}
| {id} | {seed_scores[id]:.0%} | {best_scores[id]:.0%} | {delta} |
### Key Discoveries
{Summarize main patterns found during reflection:}
1. {discovery_1}
2. {discovery_2}
3. {discovery_3}
### Alternative Prompts (Pareto Frontier)
{If pareto_frontier has multiple candidates with different strengths:}
- **{candidate.id}**: Best for {cases where it excels} ({avg:.0%} avg)
<scoring_guide>
Scoring Outputs (0-10)
| Score | Criteria |
|---|---|
| 10 | Perfect match: correct content AND exact format |
| 9 | Correct content, trivial format difference (whitespace, punctuation) |
| 7-8 | Correct content, minor format difference (ordering, capitalization) |
| 5-6 | Mostly correct content, wrong format structure |
| 3-4 | Partial content, significant omissions |
| 1-2 | Minimal correct content |
| 0 | Completely wrong or empty |
Feedback Template
Score: X/10
✓ Correct: [what's right]
✗ Wrong: [what's wrong]
→ Fix: [specific instruction that would fix it]
Be STRICT about format matching. Format errors indicate missing instructions in the prompt. </scoring_guide>
<trace_analysis_guide>
How to Analyze Reasoning Traces
When examining a trace, look for:

1. Task Understanding
   - Did the model correctly interpret what to do?
   - Did it miss any requirements?

2. Information Extraction
   - Did it find all relevant info in the input?
   - Did it hallucinate information not present?

3. Logic Errors
   - Where did the reasoning go wrong?
   - What assumption was incorrect?

4. Format Application
   - Did it know the expected format?
   - Did it apply it correctly?
Red Flags in Traces
- "I assume..." → Missing explicit instruction
- "I'm not sure if..." → Ambiguous requirement
- Skipping steps → Need more structured guidance
- Wrong interpretation → Need examples in prompt </trace_analysis_guide>
<pareto_frontier_guide>
Pareto Dominance
Candidate A dominates Candidate B if:
- A.scores[tc] >= B.scores[tc] for ALL test cases
- A.scores[tc] > B.scores[tc] for AT LEAST ONE test case
Why Pareto Matters
Different prompts may excel on different cases:
- Prompt A: Great at edge cases, weak on simple cases
- Prompt B: Great at simple cases, weak on edge cases
Both belong in the Pareto frontier. Crossover can combine their strengths.
Frontier Maintenance
- Max size: 5 candidates (prevent explosion)
- If over limit, keep most diverse set using k-medoids </pareto_frontier_guide>
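One concrete way to prune, sketched below: greedy max-min selection over per-case score vectors. This is a simpler stand-in for full k-medoids, not the spec's required algorithm:

```python
def prune_frontier(frontier, test_ids, max_size=5):
    """Keep the top scorer, then repeatedly add the candidate farthest
    (in per-case score space) from everything already kept."""
    if len(frontier) <= max_size:
        return frontier

    def dist(a, b):
        return sum(abs(a.scores.get(t, 0.0) - b.scores.get(t, 0.0)) for t in test_ids)

    kept = [max(frontier, key=lambda c: c.avg_score)]
    rest = [c for c in frontier if c is not kept[0]]
    while len(kept) < max_size:
        far = max(rest, key=lambda c: min(dist(c, k) for k in kept))
        kept.append(far)
        rest.remove(far)
    return kept
```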
<edge_cases> Only 1 test case: Generate 5+ synthetic examples covering edge cases before starting.
Perfect baseline (100%): Report success, no optimization needed. Suggest additional edge cases to test robustness.
No improvement after 5 iterations:
- Increase diversity_weight to 0.5
- Try aggressive mutations (rewrite from scratch based on learnings)
- Check if test cases have conflicting requirements
Pareto frontier explodes (>5 candidates):
- Keep only the 5 most diverse candidates
- Prioritize candidates with unique strengths
Crossover produces worse results:
- Reduce crossover frequency
- Focus on reflection-based mutations
Oscillating scores (up/down/up):
- Indicates conflicting requirements in test cases
- Review test cases for consistency
- Consider splitting into sub-tasks </edge_cases>
<success_criteria> Optimization completes when:
- ✓ Full dataset score >= target_score (default 90%), OR
- ✓ Max iterations reached, OR
- ✓ No improvement for 3 consecutive iterations (early stopping)
Always return:
- Best prompt from Pareto frontier
- Score improvement trajectory
- Key discoveries from trace analysis
- Alternative prompts if Pareto frontier has multiple strong candidates </success_criteria>
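These stopping conditions translate into a small predicate (defaults mirror the spec: 90% target, 10 max iterations, 3-iteration patience):

```python
def should_stop(full_score, iteration, no_improve_streak,
                target=0.9, max_iters=10, patience=3):
    return (
        full_score >= target          # target reached on the full dataset
        or iteration >= max_iters     # iteration budget exhausted
        or no_improve_streak >= patience  # early stopping
    )
```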
<example_session>
Example: Action Item Extraction
User Input:
Seed prompt: "Extract action items from meeting notes"
Test cases:
1. Input: "John will send the report by Friday"
Expected: "- John: Send report (Due: Friday)"
2. Input: "We should discuss the budget sometime"
Expected: ""
3. Input: "Sarah and Mike to review the proposal by EOD"
Expected: "- Sarah: Review proposal (Due: EOD)\n- Mike: Review proposal (Due: EOD)"
GEPA Execution:
Iteration 1: Baseline 40%
- tc_1: 8/10 (format slightly off)
- tc_2: 0/10 (returned items when should be empty)
- tc_3: 4/10 (missed second person)
Reflection: "Model doesn't know to skip vague items or split multiple people"
Mutation 1 (reflection): Added rules for ownership and multiple people
Iteration 2: 70%
- tc_2 now correct (empty)
- tc_3 still failing (format)
Crossover with seed: Merged format examples
Iteration 3: 90% ✓ Target reached
Final Optimized Prompt:
Extract action items from meeting notes.
Step-by-step:
1. Find each person with a specific commitment
2. Identify their task and any deadline
3. Format as: "- [Person]: [Task] (Due: [deadline])"
Rules:
- SKIP vague items without clear ownership ("we should...", "someone needs to...")
- If multiple people share a task, create separate lines for each
- If no deadline mentioned, use "Due: unspecified"
- If NO valid action items exist, return empty string
Example:
Input: "John and Mary will review docs by Monday. We should improve process."
Output:
- John: Review docs (Due: Monday)
- Mary: Review docs (Due: Monday)
</example_session>