Agent skill
chaos-engineering
Test system resilience through controlled failures. Use when validating fault tolerance, disaster recovery, or system reliability. Covers chaos experiments.
Stars
163
Forks
31
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/chaos-engineering
SKILL.md
Chaos Engineering
Principles
- Build a Hypothesis: Define expected behavior
- Minimize Blast Radius: Start small
- Run in Production: Real conditions matter
- Automate: Make experiments repeatable
- Minimize Impact: Have abort conditions
Experiment Process
- Steady State: Define normal metrics
- Hypothesis: "System will maintain X under condition Y"
- Introduce Variables: Inject failure
- Observe: Compare to steady state
- Analyze: Confirm or disprove hypothesis
Common Experiments
Network Failures
bash
# Add latency
tc qdisc add dev eth0 root netem delay 100ms
# Packet loss
tc qdisc add dev eth0 root netem loss 10%
# Remove
tc qdisc del dev eth0 root
Resource Exhaustion
bash
# CPU stress
stress --cpu 4 --timeout 60s
# Memory stress
stress --vm 2 --vm-bytes 1G --timeout 60s
# Disk fill
dd if=/dev/zero of=/tmp/fill bs=1M count=1024
Service Failures
- Kill processes
- Restart containers
- Terminate instances
- Block dependencies
Chaos Tools
- Chaos Monkey: Random instance termination
- Gremlin: Comprehensive chaos platform
- Litmus: Kubernetes chaos engineering
- Chaos Mesh: Cloud-native chaos
Experiment Template
markdown
## Experiment: [Name]
### Hypothesis
If [condition], then [expected behavior].
### Steady State
- Metric A: [baseline value]
- Metric B: [baseline value]
### Method
1. [Step 1]
2. [Step 2]
3. [Step 3]
### Abort Conditions
- If [condition], stop immediately
### Results
[What happened]
### Findings
[What we learned]
Safety Rules
- Start in non-production
- Have rollback ready
- Monitor continuously
- Communicate with team
- Document everything
Didn't find tool you were looking for?