Agent skill
Rollout and Kill Switch
Comprehensive guide to safe agent deployment strategies including canary releases, feature flags, kill switches, and automated rollback mechanisms
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/rollout-and-kill-switch
SKILL.md
Rollout and Kill Switch
Why Controlled Rollouts?
Problem: Deploying agent changes to all users at once is risky
Risks
Bug affects all users
Performance issues at scale
Unexpected behavior
No easy rollback
Solution: Gradual Rollout
1% → Monitor → 10% → Monitor → 50% → Monitor → 100%
Issues detected early → Affect fewer users → Easy rollback
Rollout Strategies
Canary Deployment
Deploy new version to small % of users
Monitor metrics
If good, increase %
If bad, rollback
Timeline:
Day 1: 1% of users
Day 2: 5% of users
Day 3: 10% of users
Day 4: 25% of users
Day 5: 50% of users
Day 6: 100% of users
Blue-Green Deployment
Blue: Current version (100% traffic)
Green: New version (0% traffic)
Test green → Switch traffic → Green becomes blue
Instant rollback: Switch back to blue
Feature Flags
Deploy code to all users
Feature disabled by default
Enable for specific users/% of traffic
Monitor
Enable for all
Implementation
Feature Flags
class FeatureFlags:
def __init__(self):
self.flags = {}
def is_enabled(self, flag_name, user_id=None, default=False):
flag = self.flags.get(flag_name, {})
# Check if globally enabled
if flag.get("enabled", default):
return True
# Check rollout percentage
rollout_pct = flag.get("rollout_percentage", 0)
if rollout_pct > 0:
# Consistent hashing (same user always gets same result)
if (hash(user_id) % 100) < rollout_pct:
return True
# Check user whitelist
if user_id in flag.get("whitelist", []):
return True
return False
# Usage
flags = FeatureFlags()
flags.flags = {
"new_agent_version": {
"enabled": False,
"rollout_percentage": 10, # 10% of users
"whitelist": ["user_123", "user_456"] # Always enabled for these users
}
}
if flags.is_enabled("new_agent_version", user_id="user_789"):
# Use new agent version
agent = AgentV2()
else:
# Use old agent version
agent = AgentV1()
Database-Backed Feature Flags
CREATE TABLE feature_flags (
name VARCHAR(255) PRIMARY KEY,
enabled BOOLEAN DEFAULT FALSE,
rollout_percentage INT DEFAULT 0,
whitelist JSONB DEFAULT '[]',
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
def is_feature_enabled(flag_name, user_id):
flag = db.query_one("""
SELECT enabled, rollout_percentage, whitelist
FROM feature_flags
WHERE name = %s
""", (flag_name,))
if not flag:
return False
if flag["enabled"]:
return True
if (hash(user_id) % 100) < flag["rollout_percentage"]:
return True
if user_id in flag["whitelist"]:
return True
return False
Kill Switch
Emergency Stop
class KillSwitch:
def __init__(self):
self.killed = False
def activate(self, reason):
self.killed = True
log_event(f"Kill switch activated: {reason}")
send_alert(f"🚨 Kill switch activated: {reason}")
def deactivate(self):
self.killed = False
log_event("Kill switch deactivated")
def is_active(self):
return self.killed
# Global kill switch
kill_switch = KillSwitch()
# In agent code
def run_agent(user_input):
if kill_switch.is_active():
return "Service temporarily unavailable. Please try again later."
# Normal agent logic
return agent.run(user_input)
# Activate kill switch
kill_switch.activate("High error rate detected")
Database-Backed Kill Switch
CREATE TABLE kill_switches (
name VARCHAR(255) PRIMARY KEY,
active BOOLEAN DEFAULT FALSE,
reason TEXT,
activated_by VARCHAR(100),
activated_at TIMESTAMPTZ,
updated_at TIMESTAMPTZ DEFAULT NOW()
);
def is_kill_switch_active(name):
result = db.query_one("""
SELECT active FROM kill_switches WHERE name = %s
""", (name,))
return result["active"] if result else False
def activate_kill_switch(name, reason, activated_by):
db.execute("""
INSERT INTO kill_switches (name, active, reason, activated_by, activated_at)
VALUES (%s, TRUE, %s, %s, NOW())
ON CONFLICT (name) DO UPDATE
SET active = TRUE, reason = %s, activated_by = %s, activated_at = NOW()
""", (name, reason, activated_by, reason, activated_by))
send_alert(f"🚨 Kill switch '{name}' activated: {reason}")
Monitoring and Auto-Rollback
Monitor Metrics
def monitor_agent_metrics(version):
# Get metrics for last hour
metrics = db.query_one("""
SELECT
COUNT(*) as total_requests,
SUM(CASE WHEN success THEN 1 ELSE 0 END) as successes,
AVG(latency_ms) as avg_latency,
SUM(CASE WHEN error THEN 1 ELSE 0 END) as errors
FROM agent_logs
WHERE version = %s
AND timestamp > NOW() - INTERVAL '1 hour'
""", (version,))
success_rate = metrics["successes"] / metrics["total_requests"]
error_rate = metrics["errors"] / metrics["total_requests"]
return {
"success_rate": success_rate,
"error_rate": error_rate,
"avg_latency": metrics["avg_latency"]
}
Auto-Rollback on Failures
def auto_rollback_check(current_version, previous_version):
metrics = monitor_agent_metrics(current_version)
# Thresholds
if metrics["success_rate"] < 0.95: # < 95% success
rollback(current_version, previous_version, "Low success rate")
if metrics["error_rate"] > 0.05: # > 5% errors
rollback(current_version, previous_version, "High error rate")
if metrics["avg_latency"] > 5000: # > 5 seconds
rollback(current_version, previous_version, "High latency")
def rollback(from_version, to_version, reason):
# Deactivate current version
db.execute("""
UPDATE feature_flags
SET enabled = FALSE
WHERE name = %s
""", (f"agent_{from_version}",))
# Activate previous version
db.execute("""
UPDATE feature_flags
SET enabled = TRUE
WHERE name = %s
""", (f"agent_{to_version}",))
log_event(f"Auto-rolled back from {from_version} to {to_version}: {reason}")
send_alert(f"🔄 Auto-rollback: {from_version} → {to_version} ({reason})")
Gradual Rollout Automation
Increase Rollout Percentage
def gradual_rollout(flag_name, target_percentage=100, step=10, interval_hours=24):
"""
Gradually increase rollout percentage
Args:
flag_name: Feature flag name
target_percentage: Final percentage (default 100%)
step: Increase by this % each interval (default 10%)
interval_hours: Hours between increases (default 24)
"""
current_pct = get_rollout_percentage(flag_name)
while current_pct < target_percentage:
# Check metrics before increasing
metrics = monitor_agent_metrics(flag_name)
if metrics["success_rate"] < 0.95:
send_alert(f"⚠️ Rollout paused: Low success rate ({metrics['success_rate']:.2%})")
break
# Increase percentage
new_pct = min(current_pct + step, target_percentage)
set_rollout_percentage(flag_name, new_pct)
log_event(f"Increased {flag_name} rollout to {new_pct}%")
# Wait before next increase
time.sleep(interval_hours * 3600)
current_pct = new_pct
# Usage
gradual_rollout("new_agent_version", target_percentage=100, step=10, interval_hours=24)
Feature Flag Services
LaunchDarkly
import ldclient
from ldclient.config import Config
ldclient.set_config(Config("sdk-key-123"))
client = ldclient.get()
# Check flag
user = {"key": "user_123"}
show_new_feature = client.variation("new-agent-version", user, False)
if show_new_feature:
agent = AgentV2()
else:
agent = AgentV1()
Split.io
from splitio import get_factory
factory = get_factory("api-key-123")
client = factory.client()
# Check flag
treatment = client.get_treatment("user_123", "new-agent-version")
if treatment == "on":
agent = AgentV2()
else:
agent = AgentV1()
Unleash (Open Source)
from UnleashClient import UnleashClient
client = UnleashClient(
url="http://unleash.example.com/api",
app_name="my-agent",
custom_headers={"Authorization": "..."}
)
client.initialize_client()
# Check flag
if client.is_enabled("new-agent-version", {"userId": "user_123"}):
agent = AgentV2()
else:
agent = AgentV1()
Best Practices
1. Start Small (1-5%)
# Good
set_rollout_percentage("new_feature", 1) # Start with 1%
# Bad
set_rollout_percentage("new_feature", 50) # Too aggressive
2. Monitor Closely
# Monitor every 5 minutes during rollout
while rollout_in_progress:
metrics = monitor_agent_metrics("new_version")
if metrics["error_rate"] > threshold:
rollback()
time.sleep(300) # 5 minutes
3. Have Rollback Plan
# Always know how to rollback
rollback_plan = {
"method": "Feature flag toggle",
"steps": [
"1. Set feature_flag.enabled = False",
"2. Verify traffic switched to old version",
"3. Monitor for 1 hour"
],
"contact": "oncall@example.com"
}
4. Test Rollback
# Regularly test rollback procedure
def test_rollback():
# Enable new version
enable_feature("new_version")
assert is_feature_enabled("new_version")
# Rollback
disable_feature("new_version")
assert not is_feature_enabled("new_version")
# Verify old version works
response = agent_v1.run("test input")
assert response is not None
5. Communicate Changes
# Notify team before rollout
send_notification(
channel="#agent-ops",
message=f"Starting rollout of new agent version to 10% of users. Monitoring dashboard: {dashboard_url}"
)
Rollout Checklist
Pre-Rollout
☐ Code reviewed and approved
☐ Tests passing (unit, integration, e2e)
☐ Monitoring dashboard ready
☐ Rollback plan documented
☐ Team notified
☐ Oncall engineer assigned
During Rollout
☐ Start at 1-5%
☐ Monitor metrics every 5-15 minutes
☐ Check error logs
☐ Verify user feedback
☐ Gradually increase % (10%, 25%, 50%, 100%)
☐ Wait 24 hours between increases
Post-Rollout
☐ Verify 100% rollout successful
☐ Monitor for 48 hours
☐ Remove feature flag (if permanent)
☐ Document lessons learned
☐ Update runbooks
Summary
Rollout Strategies:
- Canary (gradual % increase)
- Blue-green (instant switch)
- Feature flags (selective enable)
Kill Switch:
- Emergency stop
- Database-backed
- Alert on activation
Auto-Rollback:
- Monitor metrics
- Rollback on failures
- Alert team
Feature Flag Services:
- LaunchDarkly
- Split.io
- Unleash (open source)
Best Practices:
- Start small (1-5%)
- Monitor closely
- Have rollback plan
- Test rollback
- Communicate changes
Rollout Timeline:
- Day 1: 1%
- Day 2: 5%
- Day 3: 10%
- Day 4: 25%
- Day 5: 50%
- Day 6: 100%
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
Didn't find tool you were looking for?