Agent skill
python-advanced-profiling-and-optimization
Advanced Python profiling with cProfile, py-spy, and optimization techniques including Python 3.13 features. PROACTIVELY activate for: (1) CPU profiling with cProfile, (2) Flame graph generation with py-spy, (3) Memory optimization, (4) Async performance patterns, (5) Python 3.13 JIT and free-threaded mode. Triggers: "python performance", "cProfile", "py-spy", "flame graph", "memory optimization", "async performance", "JIT", "GIL"
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/python-advanced-profiling-and-optimization
SKILL.md
Python Advanced Profiling and Optimization
This skill provides expert-level knowledge on profiling Python applications to identify performance bottlenecks in CPU and memory usage, and guides evaluation and application of experimental Python 3.13 features.
CPU Profiling with cProfile
cProfile is Python's built-in deterministic profiler, measuring execution time of all functions.
Basic Usage
import cProfile
import pstats
# Profile a specific function
def main():
# Your application code
result = expensive_operation()
return result
if __name__ == '__main__':
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()
# Print stats sorted by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 functions
Command-Line Profiling
# Profile entire script
python -m cProfile -o output.prof script.py
# View results
python -m pstats output.prof
# Then in pstats shell:
# sort cumtime
# stats 20
Interpreting Output
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.156 0.000 0.234 0.000 utils.py:45(process_data)
5000 0.089 0.000 0.089 0.000 {built-in method json.loads}
Columns:
- ncalls: Number of calls
- tottime: Total time in function (excluding subcalls)
- percall: tottime / ncalls
- cumtime: Cumulative time (including subcalls) - Most important
- percall: cumtime / ncalls
What to look for:
- High cumtime functions are bottlenecks
- High ncalls with moderate cumtime suggests optimization opportunity
- Built-in functions with high time may indicate data structure issues
Context Manager for Profiling
import cProfile
import pstats
from contextlib import contextmanager
@contextmanager
def profile(output_file=None):
"""Context manager for profiling code blocks"""
profiler = cProfile.Profile()
profiler.enable()
try:
yield profiler
finally:
profiler.disable()
if output_file:
profiler.dump_stats(output_file)
else:
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)
# Usage
with profile('endpoint.prof'):
process_api_request(data)
Flame Graph Generation with py-spy
py-spy is a sampling profiler that can attach to running processes without code modification.
Installation
pip install py-spy
Basic Usage
# Profile running process by PID
py-spy record -o profile.svg --pid 12345
# Profile script directly
py-spy record -o profile.svg -- python script.py
# Live top-like view
py-spy top --pid 12345
Reading Flame Graphs
Flame graph structure:
- X-axis: Alphabetical order (NOT time)
- Y-axis: Stack depth (bottom = entry point, top = deepest call)
- Width: Percentage of total time
What to look for:
- Wide blocks: Functions consuming most time
- Tall stacks: Deep call chains (may indicate recursion issues)
- Flat plateaus: Hot paths through code
Profiling Production Services
# Attach to running Uvicorn/Gunicorn worker
# Find PID: ps aux | grep uvicorn
py-spy record -o api-profile.svg --pid $(pgrep -f uvicorn) --duration 30
# Profile with native extensions
py-spy record --native -o profile.svg --pid 12345
Comparing Before/After
# Before optimization
py-spy record -o before.svg -- python script.py
# After optimization
py-spy record -o after.svg -- python script.py
# Compare visually or use speedscope.app
Memory Optimization Techniques
Reducing memory footprint improves performance and allows scaling to more users.
Using slots
For classes with fixed attributes, __slots__ significantly reduces memory usage.
# BAD: Uses __dict__ (high memory overhead)
class User:
def __init__(self, name, email):
self.name = name
self.email = email
# GOOD: Uses __slots__ (lower memory)
class User:
__slots__ = ('name', 'email')
def __init__(self, name, email):
self.name = name
self.email = email
# Memory savings: ~40-50% per instance
When to use: Classes with many instances (models, data structures) When NOT to use: Classes needing dynamic attributes
Generators vs Lists
Generators process data lazily, reducing memory footprint.
# BAD: Loads entire dataset into memory
def get_all_users():
users = []
for row in db.query("SELECT * FROM users"):
users.append(process_user(row))
return users
total = sum(user.score for user in get_all_users())
# GOOD: Generator yields one at a time
def get_all_users():
for row in db.query("SELECT * FROM users"):
yield process_user(row)
total = sum(user.score for user in get_all_users())
Memory savings: Proportional to dataset size (can be orders of magnitude)
String Concatenation
# BAD: Creates new string object each iteration
result = ""
for item in items:
result += str(item) + "," # O(n^2) time and memory
# GOOD: Join at the end
result = ",".join(str(item) for item in items) # O(n)
Using array Module for Numeric Data
# BAD: List of integers (high memory)
numbers = [1, 2, 3, 4, 5] * 100000 # ~800 KB
# GOOD: Array of integers (low memory)
import array
numbers = array.array('i', [1, 2, 3, 4, 5] * 100000) # ~400 KB
# EVEN BETTER: NumPy for numerical computation
import numpy as np
numbers = np.array([1, 2, 3, 4, 5] * 100000) # ~400 KB + vectorized operations
Memory Profiling with memory_profiler
pip install memory-profiler
from memory_profiler import profile
@profile
def load_data():
data = [i for i in range(1000000)] # Line-by-line memory usage
processed = [x * 2 for x in data]
return processed
if __name__ == '__main__':
load_data()
Run with:
python -m memory_profiler script.py
Python 3.13 Docstring Stripping
Python 3.13 supports docstring stripping to reduce memory:
# Build Python with docstring stripping
./configure --with-pydoc=no
# Or use PYTHONOPTIMIZE=2
python -OO script.py # Strips docstrings
Memory savings: 5-10% in typical applications
Evaluating Experimental Python 3.13 Features
Python 3.13 introduces experimental performance features that require careful evaluation.
JIT Compiler
Python 3.13 includes an experimental JIT compiler for potential performance gains.
Enabling JIT
# Build Python with JIT enabled
./configure --enable-experimental-jit
make
Or use official Python 3.13+ with:
PYTHON_JIT=1 python script.py
When to Use JIT
Good candidates:
- CPU-bound workloads
- Tight loops with numeric computation
- Long-running processes
- Pure Python code (not C extensions)
Poor candidates:
- I/O-bound workloads
- Short-lived scripts
- Code dominated by C extensions (NumPy, etc.)
Benchmarking JIT
import time
def cpu_intensive_task(n):
"""Pure Python computation - good JIT candidate"""
result = 0
for i in range(n):
result += i * i
return result
# Benchmark without JIT
start = time.perf_counter()
result = cpu_intensive_task(10_000_000)
print(f"Without JIT: {time.perf_counter() - start:.3f}s")
# Run again with PYTHON_JIT=1
# Expected: 5-15% improvement for CPU-bound code
Current Limitations
- Modest gains: 5-15% for CPU-bound code (as of Python 3.13)
- Warm-up time: Initial runs may be slower
- Memory overhead: JIT compilation uses extra memory
- Experimental: May have bugs, not production-ready yet
Free-Threaded Mode (No-GIL)
Python 3.13 includes experimental support for running without the Global Interpreter Lock (GIL).
Enabling Free-Threaded Mode
# Build Python without GIL
./configure --disable-gil
make
Or use official builds:
python3.13t # 't' suffix indicates free-threaded build
When to Use
Good candidates:
- CPU-bound multi-threaded workloads
- Applications with parallel computation
- Web servers handling concurrent requests
- Data processing pipelines
Poor candidates:
- I/O-bound workloads (already benefits from asyncio)
- Single-threaded applications
- Code relying on C extensions without thread safety
Critical Considerations
C Extension Compatibility: Most C extensions are NOT thread-safe without GIL
Compatible: Pure Python, some newer C extensions Incompatible: NumPy, pandas, many others (as of early 2024)
Check compatibility:
import sys
print(sys._is_gil_enabled()) # False if running free-threaded
Benchmarking Multi-Threading
import threading
import time
def cpu_work(n):
"""CPU-intensive task"""
result = sum(i * i for i in range(n))
return result
def benchmark_threads(n_threads, work_size):
start = time.perf_counter()
threads = []
for _ in range(n_threads):
thread = threading.Thread(target=cpu_work, args=(work_size,))
threads.append(thread)
thread.start()
for thread in threads:
thread.join()
elapsed = time.perf_counter() - start
print(f"{n_threads} threads: {elapsed:.3f}s")
# With GIL: No speedup (may be slower due to contention)
# Without GIL: Near-linear speedup for CPU-bound work
benchmark_threads(4, 10_000_000)
Migration Strategy
- Profile first: Confirm CPU-bound multi-threaded bottleneck
- Check dependencies: Verify all C extensions are compatible
- Test thoroughly: Free-threaded mode changes behavior
- Benchmark: Measure actual performance improvement
- Monitor stability: Watch for race conditions and crashes
AsyncIO Performance Patterns
Async code can be fast or slow depending on implementation.
Good Patterns
import asyncio
# GOOD: Concurrent I/O operations
async def fetch_all_users():
async with aiohttp.ClientSession() as session:
tasks = [
session.get(f'https://api.example.com/users/{i}')
for i in range(100)
]
responses = await asyncio.gather(*tasks)
return responses
# GOOD: Using asyncio.to_thread for blocking operations
async def process_request():
# Blocking call offloaded to thread pool
result = await asyncio.to_thread(blocking_cpu_work, data)
return result
Anti-Patterns
# BAD: Blocking the event loop
async def slow_handler():
time.sleep(5) # Blocks entire event loop!
return "done"
# GOOD: Use asyncio.sleep
async def fast_handler():
await asyncio.sleep(5) # Yields control
return "done"
# BAD: Not awaiting concurrent operations
async def sequential_fetches():
result1 = await fetch_user(1)
result2 = await fetch_user(2) # Waits for result1 first
return result1, result2
# GOOD: Concurrent fetches
async def parallel_fetches():
result1, result2 = await asyncio.gather(
fetch_user(1),
fetch_user(2) # Runs concurrently
)
return result1, result2
Anti-Patterns
Profiling I/O-Bound Code with CPU Profiler
CPU profilers show time spent in functions, not time waiting for I/O.
For I/O-bound code, use async profiling or tracing tools.
Enabling Experimental Features Without Measurement
JIT and no-GIL modes are experimental. Always benchmark before deploying.
Optimizing Without Profiling
"Premature optimization is the root of all evil." Profile first.
Ignoring Profiler Overhead
cProfile adds ~1-3% overhead. For very hot loops, use py-spy instead.
Using Lists When Generators Suffice
Loading gigabytes into memory when streaming would work.
Optimization Checklist
- Profile with cProfile to identify slow functions
- Generate flame graphs with py-spy for visual analysis
- Use slots for classes with many instances
- Replace lists with generators for large datasets
- Use array/NumPy for numeric data
- Fix string concatenation in loops (use join)
- Profile memory with memory_profiler if usage is high
- Check for blocking calls in asyncio (use asyncio.to_thread)
- Benchmark JIT if CPU-bound (Python 3.13+)
- Evaluate free-threaded mode only for multi-threaded CPU-bound workloads
- Validate C extension compatibility before disabling GIL
Performance Targets
- API response time: < 100ms (p50), < 500ms (p99)
- Memory per worker: < 512 MB typical
- CPU usage: < 70% average (headroom for spikes)
- GC pause time: < 10ms (minimize object churn)
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
Didn't find tool you were looking for?