Agent skill
perf-optimization
Profile-driven performance optimization protocol. Use when profiling data (CPU, heap, trace) is available or when the user requests performance analysis. Covers methodology, pattern catalog, safety invariants, and when-to-stop heuristics. Language-specific tooling is in languages/*.md.
Install this agent skill to your Project
npx add-skill https://github.com/irahardianto/awesome-agv/tree/main/.agents/skills/perf-optimization
SKILL.md
Performance Optimization Skill
When to Use
- User provides profiling data (pprof, flamegraph, py-spy, Chrome DevTools, Dart DevTools)
- User asks to analyze or optimize performance of a specific component
- A benchmark regression is detected
- After deploying a new feature that touches a hot path
Core Methodology
graph LR
P1[Profile] --> P2[Analyze]
P2 --> P3[Prioritize]
P3 --> P4[Optimize]
P4 --> P5[Benchmark]
P5 --> P6{Improvement?}
P6 -->|Yes| P7[Verify & Ship]
P6 -->|No| P3
Step 1: Profile
Collect profiling data using the language-appropriate tool. Load the relevant languages/*.md module for exact commands.
Output: Raw profiling data (CPU profile, heap profile, or trace).
Step 2: Analyze
Read the profile. Focus on these principles (universal across all runtimes):
- Focus on
cum(cumulative): The total resources consumed by a function AND everything it called. This finds the expensive architectural flows. - Contextualize
flat: Resources consumed by the function itself only. If a runtime function (GC, malloc, syscall) has high flat time, trace it UP the call chain to find the user-land code that triggered it. - Ignore runtime noise: Scheduler overhead (
runtime.mcall,runtime.systemstack, GC workers) will always appear. Note if GC pressure is high, but don't try to "fix" the scheduler. - Separate benchmark artifacts from production cost: Test harness allocations (e.g.,
httptest.NewRequest,ResponseRecorder) inflate heap profiles but don't exist in production.
Output: Structured analysis document in docs/research_logs/{component}-perf-analysis.md.
Step 3: Prioritize
Rank fixes by impact/risk ratio:
| Priority | Criteria |
|---|---|
| Do first | Low risk, high impact (caching, pre-allocation, fast-reject) |
| Do second | Medium risk, high impact (library swap, algorithm change) |
| Do last | High risk, high impact (major refactor, custom implementation) |
| Skip | Any risk, low impact (micro-optimization below noise floor) |
Rule: If a fix requires more than 1 day AND saves < 20% on the hot path, defer it.
Step 4: Optimize
Implement one fix at a time. For each fix:
- Write tests FIRST (TDD — Red → Green → Refactor)
- Implement the fix
- Run all existing tests to verify no regression
- Benchmark immediately
Never batch multiple optimizations into one commit. Each fix must be independently verifiable and revertable.
Step 5: Benchmark
Compare before/after with the exact same benchmark configuration (same -benchtime, same -count, same machine load). Report:
ns/op(latency)B/op(memory per operation)allocs/op(heap allocations per operation)
Step 6: When to Stop
Stop optimizing when any of these are true:
- Remaining CPU is in hardware-optimized assembly (AES-NI, P-256, SIMD) — you cannot beat the hardware
- Remaining allocations are from the language runtime itself (GC, goroutine stacks, HTTP server internals)
- The fix requires a custom implementation of a well-audited library — the security/maintenance risk outweighs the perf gain
- The measured improvement is < 5% and within benchmark noise
Optimization Pattern Catalog
These are generic, language-agnostic patterns. Apply them when the profiling data shows the corresponding symptom.
Pattern: Result Caching
Symptom: Same expensive computation repeated with identical inputs (crypto verification, JSON parsing, regex compilation).
Fix: Cache results keyed by input hash. Use bounded LRU with TTL to prevent memory exhaustion.
Safety invariant: When caching security-sensitive results (auth tokens, permission checks):
- ALWAYS re-validate expiry/revocation on cache hit
- ALWAYS bound cache size (DoS protection)
- ALWAYS set TTL shorter than the security credential's validity period
Pattern: Pre-allocation
Symptom: High allocs/op from repeatedly constructing the same objects (option structs, config slices, header maps).
Fix: Build the object once at init time, share it read-only across requests. Safe for concurrent use if the object is immutable after construction.
Pattern: Fast-Reject / Short-Circuit
Symptom: Expensive validation path runs even for clearly invalid inputs.
Fix: Add a cheap structural pre-check before the expensive path. Examples: check string length before regex, count delimiters before parsing, check content-type before deserialization.
Pattern: Library Swap
Symptom: High allocation count or CPU in a third-party library's internal parsing/serialization.
Fix: Replace with a library that uses lower-allocation strategies (manual scanners vs encoding/json.Decoder, zero-copy parsing, arena allocation).
Safety invariant: When swapping security-critical libraries (JWT, TLS, crypto):
- Explicitly restrict accepted algorithms (prevent algorithm confusion attacks)
- Verify the replacement library is well-audited and actively maintained
- Run the full existing test suite — no behavioral change allowed
Pattern: Pooling
Symptom: High GC pressure from many short-lived objects of the same type being allocated and discarded rapidly.
Fix: Use an object pool (sync.Pool in Go, object pool in Java, arena in Rust) to reuse allocations.
Caveat: Only effective when objects are uniform in size and have a clear acquire/release lifecycle. Misuse creates subtle bugs.
Pattern: Batching
Symptom: Many small I/O operations (DB queries, HTTP calls, file writes) dominating wall-clock time.
Fix: Batch operations into fewer, larger calls. Examples: batch INSERT, pipeline Redis commands, buffer writes.
Pattern: Artifact Partitioning by Change Frequency
Symptom: Deploying a small change invalidates a large cached artifact (JS bundle, Docker image, compiled binary), forcing consumers to re-download/rebuild the entire thing.
Fix: Partition build artifacts by change frequency so that stable layers survive volatile deploys:
- Stable layer: dependencies, vendor libraries, base images — changes rarely
- Volatile layer: application code — changes on every deploy
Examples across stacks:
- JS/Bundler: Vite
manualChunks/ WebpacksplitChunksto isolate vendor libraries into separate chunks - Docker: multi-stage builds with
COPY go.mod+RUN go mod downloadBEFORECOPY . .— dependency layer caches across builds - Monorepo: separate packages by change frequency so CI only rebuilds what changed
Safety invariant: Total artifact size stays the same or slightly increases (chunk overhead). The benefit is on repeat consumption — stable layers serve from cache.
When NOT to apply: One-shot artifacts with no caching benefit (single-use CI, ephemeral environments).
Pattern: Dependency Discovery Parallelization
Symptom: Sequential resource discovery creates waterfalls — each resource is discovered only after the previous one completes (download → parse → discover next → download → ...).
Fix: Declare dependencies as early as possible so the system can fetch them in parallel:
- Move resource declarations upstream (earlier in the boot/parse sequence)
- Use explicit hints to bypass sequential discovery chains
Examples across stacks:
- Browser:
<link rel="preconnect">to establish connections before CSS/JS requests them; move CSS@importto HTML<link>for parallel discovery - Go:
go mod downloadbefore build to prefetch modules - DB: connection pool warm-up at startup instead of on first query
- DNS:
dns-prefetchhints for domains the app will contact
Safety invariant: Only pre-declare resources you WILL use. Unused preconnects/prefetches waste resources (TCP connections, DNS queries, module downloads).
Pattern: Concurrent-Fetch Dedup
Symptom: Network tab shows two identical API calls fired at the same time. Multiple UI components mount simultaneously and each independently calls the same fetch function.
Fix: Add a loading-state guard (semaphore) at the store/service layer:
async function fetchData() {
if (isLoading) return // ← drop duplicate in-flight request
isLoading = true
try { data = await api.getData() }
finally { isLoading = false }
}
When to apply: When the same data store is used by multiple co-mounted components (e.g., a navigation bar and a page view both calling fetchProfile() on mount).
Caveat: This is a simple semaphore, not request dedup. If the data needs refreshing after the in-flight call completes, the caller should retry. For advanced use cases, consider a proper request dedup cache (e.g., TanStack Query's staleTime).
Anti-Patterns (Things NOT to Do)
- Don't optimize runtime internals. If
runtime.mallocgcorruntime.gcBgMarkWorkeris high, fix the USER CODE that triggers allocations — don't try to tune the GC directly. - Don't replace battle-tested crypto with custom implementations. The performance ceiling of ECDSA/RSA is in the math. Accept it.
- Don't optimize based on gut feeling. Always profile first. Premature optimization is the root of all evil.
- Don't combine multiple optimizations into one commit. If a combined commit causes a regression, you can't isolate which fix is at fault.
- Don't disable security features for performance. Algorithm restriction, input validation, and expiry checks are non-negotiable.
- Don't profile without a stable baseline. Run benchmarks with fixed parameters (
-benchtime,-count, same machine load). Without a reproducible baseline, before/after comparisons are meaningless noise.
Language Modules
Load the relevant language module when working with a specific runtime:
| Module | Use when |
|---|---|
| Go | Go services, APIs, CLI tools |
| Rust | Rust binaries, libraries |
| Python | Python services, CLI, data pipelines |
| Frontend | Web frontends (JS/TS bundle, rendering, network) |
Contributing: After completing a perf optimization session, extract generalizable patterns from your
docs/research_logs/findings into this catalog. Project-specific details stay in the research log; reusable patterns belong here.
Profiling Scripts
Language-specific data extraction scripts live in scripts/:
| Script | Purpose |
|---|---|
| go-pprof.sh | Extract Go pprof CPU/heap profiles into agent-readable markdown |
| frontend-lighthouse.sh | Two modes: lighthouse (Core Web Vitals, needs Chrome) or bundle (Vite chunk analysis, always works) |
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
guardrails
Pre-flight checklist and post-implementation self-review protocol. Use before generating any code (pre-flight) and after writing code but before verification (self-review) to catch issues early.
code-review
Structured code review protocol for inspecting code quality against the full rule set. Use when auditing code written by yourself or another agent, during the /audit workflow, or when the user asks for a code review.
mobile-design
Generates distinctive, production-grade mobile interfaces for Flutter and React Native. Prioritizes platform-native patterns, adaptive layouts, and fluid motion. Use when building mobile apps, screens, widgets, or when the user requests to style or create visually striking mobile UI.
adr
Architecture Decision Record skill for documenting significant architectural decisions with context, options, and consequences. Use during the Research phase when choosing between approaches, or whenever the user asks to document an architectural decision.
frontend-design
Generates distinctive, production-grade frontend interfaces and artifacts (React, Vue, HTML/CSS). Prioritizes bold aesthetics, unique typography, and motion to avoid generic designs. Use when building websites, landing pages, dashboards, posters, or when the user requests to style, beautify, or create visually striking UI.
debugging-protocol
Comprehensive protocol for validating root causes of software issues. Use when you need to systematically debug a complex bug, flaky test, or unknown system behavior by forming hypotheses and validating them with specific tasks.
Didn't find tool you were looking for?