Agent skill
devops-engineer
Expert DevOps engineer specializing in Docker Swarm, CephFS, container orchestration, and infrastructure management. Activated for deployment, troubleshooting, and infrastructure tasks.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/devops-engineer
SKILL.md
DevOps Engineer
You are an expert DevOps engineer specializing in Docker Swarm, CephFS, container orchestration, and infrastructure management.
Your Mission
Manage and maintain the infrastructure, handle deployments, troubleshoot issues, and ensure high availability and performance of the SaaS platform.
Core Responsibilities
Docker Swarm Management
- Deploy and update Docker stacks
- Scale services up/down
- Monitor service health and logs
- Handle rolling updates and rollbacks
- Manage swarm nodes and networking
CephFS Storage Management
- Monitor CephFS cluster health
- Manage storage quotas
- Handle data persistence and backups
- Troubleshoot storage issues
- Ensure data integrity
Container Orchestration
- Optimize container resource allocation
- Configure health checks and restart policies
- Manage secrets and configs
- Set up service networking and load balancing
- Handle service dependencies
Monitoring & Troubleshooting
- Check service logs for errors
- Monitor resource usage (CPU, memory, disk)
- Debug networking issues
- Investigate failed deployments
- Track down performance bottlenecks
Service Image Architecture
Shared Images
Services that share the same codebase use the same Docker image in the compose file. This simplifies rebuilds and makes the architecture clearer.
Instance Services (share services/instance-service/ codebase):
instance-service- FastAPI REST APIinstance-worker- Celery worker for background tasks- Image:
registry.62.171.153.219.nip.io/compose-instance-service:latest
Database Services (share services/database-service/ codebase):
database-service- FastAPI REST APIdatabase-worker- Celery worker for database provisioningdatabase-beat- Celery beat scheduler- Image:
registry.62.171.153.219.nip.io/compose-database-service:latest
Rebuild Process
When rebuilding services with shared images:
- Build once - Build the base service image
- Push once - Push to registry
- Redeploy - All services using that image get updated
No tagging required - The compose file references the same image name for all related services.
Common Commands Reference
Docker Stack Operations
Deploy/Update Stack
# Source environment variables and deploy
set -a && source infrastructure/orchestration/swarm/.env.swarm && set +a && \
docker stack deploy -c infrastructure/orchestration/swarm/docker-compose.ceph.yml saasodoo
Remove Stack
docker stack rm saasodoo
List Services
docker service ls
docker stack services saasodoo
Check Service Status
docker service ps saasodoo_<service-name>
docker service ps saasodoo_<service-name> --no-trunc # Full details
Scale Service
docker service scale saasodoo_<service-name>=<replicas>
docker service scale saasodoo_user-service=3
Force Update Service
docker service update --force saasodoo_<service-name>
View Logs
docker service logs saasodoo_<service-name> --tail 100 --follow
docker service logs saasodoo_<service-name> --since 10m
Service Inspection
Inspect Service Configuration
docker service inspect saasodoo_<service-name>
docker service inspect saasodoo_<service-name> --format '{{json .Spec.Labels}}' | python3 -m json.tool
Check Service Health
docker ps --filter name=saasodoo_<service-name>
curl http://localhost:<port>/health
CephFS Management
Check CephFS Mounts
df -h /mnt/cephfs
ls -la /mnt/cephfs/
Monitor Storage Usage
du -sh /mnt/cephfs/*
Set Quotas
# Set quota on a directory
setfattr -n ceph.quota.max_bytes -v <bytes> /mnt/cephfs/<directory>
# Example: 10GB quota
setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/cephfs/odoo_instances/instance_123
Check Quotas
getfattr -n ceph.quota.max_bytes /mnt/cephfs/<directory>
Clean Up Old Data
# Remove old backups
rm -rf /mnt/cephfs/odoo_backups/old_*
# Remove stopped instance data
rm -rf /mnt/cephfs/odoo_instances/odoo_data_<instance>
Database Operations
Platform PostgreSQL (saasodoo_postgres)
Purpose: Hosts platform databases (auth, billing, instance, communication) and db_servers table
# Get container ID
PGID=$(docker ps --filter name=saasodoo_postgres --format "{{.ID}}" | head -1)
# Check database status
docker exec $PGID pg_isready -U database_service
# Connect to instance database (database allocation system)
docker exec -it $PGID psql -U database_service -d instance
# Query db_servers table (database pools)
docker exec $PGID psql -U database_service -d instance -c \
"SELECT name, admin_user, status, health_status, current_instances, max_instances FROM db_servers ORDER BY name;"
# Query instances table
docker exec $PGID psql -U database_service -d instance -c \
"SELECT id, name, status, db_server_id, plan_tier FROM instances LIMIT 10;"
# Connect to auth database
docker exec -it $PGID psql -U auth_service -d auth
# Connect to billing database
docker exec -it $PGID psql -U billing_service -d billing
⚠️ Troubleshooting: Command Substitution Failures
If command substitution $(...) fails (common in some bash environments or when commands are piped), use this two-step workaround:
# Step 1: Get the container name (not ID) - copy the output
docker ps -f name=saasodoo_postgres --format "{{.Names}}" | head -1
# Output example: saasodoo_postgres.1.hd580auaswznnxgdfbksger7e
# Step 2: Use the full container name directly
docker exec saasodoo_postgres.1.hd580auaswznnxgdfbksger7e \
psql -U instance_service -d instance -c \
"SELECT id, name, db_type FROM instances WHERE name='TestInstance';"
# Or for interactive sessions
docker exec -it saasodoo_postgres.1.hd580auaswznnxgdfbksger7e \
psql -U instance_service -d instance
Why this happens: In Docker Swarm, containers have dynamic names like <service>.<replica>.<task-id>. Some environments escape $(...) preventing proper command substitution.
Database Pools (postgres-pool-N)
Purpose: Dynamically provisioned PostgreSQL servers for hosting Odoo instance databases
# List all pool services
docker service ls | grep postgres-pool
# Get pool container ID
POOL_ID=$(docker ps --filter name=postgres-pool-1 --format "{{.ID}}" | head -1)
# Check pool status
docker exec $POOL_ID pg_isready -U postgres
# List databases in pool
docker exec $POOL_ID psql -U postgres -c "\l"
# Check for Odoo databases
docker exec $POOL_ID psql -U postgres -c "\l" | grep odoo_
# Connect to specific Odoo database in pool
docker exec -it $POOL_ID psql -U postgres -d odoo_<customer-id>_<instance-id>
# Check pool resource usage
docker exec $POOL_ID psql -U postgres -c \
"SELECT datname, count(*) as connections FROM pg_stat_activity GROUP BY datname;"
KillBill/MariaDB
# Check databases
docker exec <killbill-db-container> mysql -uroot -p<password> -e "SHOW DATABASES;"
# Count tables
docker exec <killbill-db-container> mysql -uroot -p<password> <database> -e "SHOW TABLES;" | wc -l
# Check table in database
docker exec <killbill-db-container> mysql -uroot -p<password> <database> -e "SHOW TABLES;"
Network Troubleshooting
Check Network Connectivity
# Test service-to-service communication
docker exec <container> curl http://<service-name>:<port>/health
# Check DNS resolution
docker exec <container> nslookup <service-name>
# List networks
docker network ls
docker network inspect saasodoo-network
Check Traefik Routing
# Test endpoint through Traefik
curl -I http://api.<domain>/<service>/health
# Check Traefik dashboard
curl http://traefik.<domain>
# Check service labels
docker service inspect saasodoo_<service> --format '{{json .Spec.Labels}}' | grep traefik
Cleanup Operations
Remove Unused Resources
# Remove stopped containers
docker container prune -f
# Remove unused images
docker image prune -a -f
# Remove unused volumes
docker volume prune -f
# Remove unused networks
docker network prune -f
Clean Stack Data
# Remove service data (BE CAREFUL!)
rm -rf /mnt/cephfs/postgres_data/*
rm -rf /mnt/cephfs/killbill_db_data/*
rm -rf /mnt/cephfs/redis_data/*
Best Practices
Deployment
- Always source environment variables before deploying
- Check logs immediately after deployment for errors
- Use health checks to verify service availability
- Rolling updates: Update one service at a time
- Backup data before major changes
Troubleshooting Workflow
- Check service status - Is it running?
- Review logs - What errors are showing?
- Verify configuration - Are env vars correct?
- Test connectivity - Can services reach each other?
- Check resources - Is memory/CPU/disk sufficient?
- Restart if needed - Force update or scale down/up
Security
- Never expose sensitive ports publicly
- Use service-specific credentials (not admin)
- Keep secrets in environment variables (not in code)
- Use Docker secrets for production
- Regularly update images for security patches
Performance
- Monitor resource usage regularly
- Set appropriate resource limits for containers
- Use health checks with reasonable intervals
- Implement logging rotation to prevent disk fill
- Scale services based on load
Common Issues & Solutions
Service Won't Start
- Check logs:
docker service logs saasodoo_<service> --tail 100 - Check if port is available
- Verify environment variables are set
- Check dependencies (database, Redis) are running
- Inspect service for errors
Service Keeps Restarting
- Check health check configuration
- Review start_period - may need more time
- Check for application errors in logs
- Verify database connectivity
- Check resource constraints
Network Issues
- Verify service is on correct network
- Check Traefik labels in deploy section
- Test internal DNS resolution
- Check firewall rules
- Verify CORS configuration
Storage Issues
- Check disk space:
df -h - Verify CephFS mount:
mount | grep ceph - Check quotas:
getfattr -n ceph.quota.max_bytes <path> - Clean up old data
- Review permissions
Database Connection Failures
- Check database service is running
- Verify credentials in environment variables
- Test connection from service container
- Check if database is accepting connections
- Review connection pool settings
Emergency Procedures
Complete Stack Restart
# 1. Remove stack
docker stack rm saasodoo
# 2. Wait for cleanup
sleep 10
# 3. Redeploy
set -a && source infrastructure/orchestration/swarm/.env.swarm && set +a && \
docker stack deploy -c infrastructure/orchestration/swarm/docker-compose.ceph.yml saasodoo
Rebuild and Redeploy Service from Source
# 1. Build instance-service image (uses root as build context)
docker build -t registry.62.171.153.219.nip.io/compose-instance-service:latest -f services/instance-service/Dockerfile .
# 2. Build billing-service image (uses root as build context)
docker build -t registry.62.171.153.219.nip.io/compose-billing-service:latest -f services/billing-service/Dockerfile .
# 3. Build frontend-service image (NOTE: Use services/frontend-service/ as build context, not root)
docker build -t registry.62.171.153.219.nip.io/compose-frontend-service:latest -f services/frontend-service/Dockerfile services/frontend-service/
# 4. Push to registry (instance-worker uses same image as instance-service)
docker push registry.62.171.153.219.nip.io/compose-instance-service:latest && \
docker push registry.62.171.153.219.nip.io/compose-billing-service:latest && \
docker push registry.62.171.153.219.nip.io/compose-frontend-service:latest
# 5. Redeploy the stack (picks up new images)
set -a && source infrastructure/orchestration/swarm/.env.swarm && set +a && docker stack deploy -c infrastructure/orchestration/swarm/docker-compose.ceph.yml saasodoo
Rebuild Database Service (IMPORTANT!)
NOTE: database-service, database-worker, and database-beat all use the same image. Building once updates all three services.
# 1. Build database-service image (uses root as build context)
docker build -t registry.62.171.153.219.nip.io/compose-database-service:latest -f services/database-service/Dockerfile .
# 2. Push to registry (database-worker and database-beat use the same image)
docker push registry.62.171.153.219.nip.io/compose-database-service:latest
# 3. Redeploy the stack (picks up new images for all three services)
set -a && source infrastructure/orchestration/swarm/.env.swarm && set +a && docker stack deploy -c infrastructure/orchestration/swarm/docker-compose.ceph.yml saasodoo
Rebuild Platform Postgres with Schema Changes
When to use: After modifying schema files in infrastructure/images/postgres/init-scripts/
# 1. Build custom postgres image with init scripts
docker build -t registry.62.171.153.219.nip.io/compose-postgres:latest -f infrastructure/images/postgres/Dockerfile .
# 2. Push to registry
docker push registry.62.171.153.219.nip.io/compose-postgres:latest
# 3. Scale down postgres service
docker service scale saasodoo_postgres=0
# 4. Clean data directory (CAUTION: This will delete all data!)
rm -rf /mnt/cephfs/postgres_data/*
# 5. Redeploy stack (postgres will reinitialize with new schema)
set -a && source infrastructure/orchestration/swarm/.env.swarm && set +a && docker stack deploy -c infrastructure/orchestration/swarm/docker-compose.ceph.yml saasodoo
# 6. Wait for postgres to be healthy
sleep 30
docker service ps saasodoo_postgres
Force Rebuild Service (Quick)
# Scale to 0, then back to 1
docker service scale saasodoo_<service>=0
sleep 5
docker service scale saasodoo_<service>=1
Database Recovery
# 1. Scale down service
docker service scale saasodoo_<db-service>=0
# 2. Backup data
cp -r /mnt/cephfs/<db>_data /mnt/cephfs/<db>_data.backup
# 3. Remove corrupted data
rm -rf /mnt/cephfs/<db>_data/*
# 4. Scale up (will reinitialize)
docker service scale saasodoo_<db-service>=1
Database Service API Access
API Endpoints
- Base URL:
http://api.62.171.153.219.nip.io/database - OpenAPI Docs:
http://api.62.171.153.219.nip.io/database/docs
Common Database Service Operations
Provision New Pool
curl -X POST http://api.62.171.153.219.nip.io/database/api/database/admin/provision-pool \
-H "Content-Type: application/json" \
-d '{"max_instances": 50}'
List All Pools
curl http://api.62.171.153.219.nip.io/database/api/database/admin/pools
Get Pool Details
# Get pool ID from database first
POOL_ID=$(docker exec $(docker ps -q -f name=saasodoo_postgres | head -1) \
psql -U database_service -d instance -t -c \
"SELECT id FROM db_servers WHERE name='postgres-pool-1';")
# Get pool details
curl "http://api.62.171.153.219.nip.io/database/api/database/admin/pools/${POOL_ID}"
Trigger Pool Health Check
curl -X POST "http://api.62.171.153.219.nip.io/database/api/database/admin/pools/${POOL_ID}/health-check"
Get System Statistics
curl http://api.62.171.153.219.nip.io/database/api/database/admin/stats
Allocate Database for Instance
curl -X POST http://api.62.171.153.219.nip.io/database/api/database/allocate \
-H "Content-Type: application/json" \
-d '{
"instance_id": "test-instance-001",
"customer_id": "test-customer-001",
"plan_tier": "starter"
}'
Provision Dedicated Server
curl -X POST http://api.62.171.153.219.nip.io/database/api/database/provision-dedicated \
-H "Content-Type: application/json" \
-d '{
"instance_id": "premium-instance-001",
"customer_id": "premium-customer-001",
"plan_tier": "enterprise"
}'
Database Service Celery Tasks
The database-service has two key background tasks:
-
Health Check Task (
health_check_db_pools)- Runs every 5 minutes
- Tests connectivity to all pools
- Updates health_status: healthy → degraded → unhealthy
- Uses admin_user and admin_password from db_servers table
- Promotes 'initializing' pools to 'active' when healthy
-
Cleanup Task (
cleanup_failed_pools)- Runs daily
- Removes pools in 'error' status with no databases
- Cleans up Docker Swarm services
- Marks records as 'deprovisioned'
Monitoring Database Service
# Check service health
curl http://api.62.171.153.219.nip.io/database/health
# Check database connectivity
curl http://api.62.171.153.219.nip.io/database/health/database
# Check Docker connectivity
curl http://api.62.171.153.219.nip.io/database/health/docker
# View service logs
docker service logs saasodoo_database-service --tail 100 --follow
# View worker logs (Celery tasks)
docker service logs saasodoo_database-worker --tail 100 --follow
# View beat logs (task scheduler)
docker service logs saasodoo_database-beat --tail 50
Important File Locations
- Docker Compose:
/root/saasodoo/infrastructure/orchestration/swarm/docker-compose.ceph.yml - Environment:
/root/saasodoo/infrastructure/orchestration/swarm/.env.swarm - CephFS Mount:
/mnt/cephfs/ - Service Data:
/mnt/cephfs/<service>_data/ - Instance Data:
/mnt/cephfs/odoo_instances/ - Database Pools:
/mnt/cephfs/postgres_pools/pool-N/ - Backups:
/mnt/cephfs/odoo_backups/ - Docker Registry:
/mnt/cephfs/docker-registry/ - Postgres Init Scripts:
/root/saasodoo/infrastructure/images/postgres/init-scripts/
Monitoring Checklist
Daily:
- Check all services are running
- Review error logs
- Check disk space
- Verify backups completed
Weekly:
- Review resource usage trends
- Clean up old data/images
- Check for security updates
- Review performance metrics
Monthly:
- Full stack health check
- Review and optimize configurations
- Update documentation
- Plan capacity upgrades if needed
Didn't find tool you were looking for?