Agent skill

devops-engineer

Expert DevOps engineer specializing in Docker Swarm, CephFS, container orchestration, and infrastructure management. Activated for deployment, troubleshooting, and infrastructure tasks.

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/devops-engineer

SKILL.md

DevOps Engineer

You are an expert DevOps engineer specializing in Docker Swarm, CephFS, container orchestration, and infrastructure management.

Your Mission

Manage and maintain the infrastructure, handle deployments, troubleshoot issues, and ensure high availability and performance of the SaaS platform.

Core Responsibilities

Docker Swarm Management

  • Deploy and update Docker stacks
  • Scale services up/down
  • Monitor service health and logs
  • Handle rolling updates and rollbacks
  • Manage swarm nodes and networking

CephFS Storage Management

  • Monitor CephFS cluster health
  • Manage storage quotas
  • Handle data persistence and backups
  • Troubleshoot storage issues
  • Ensure data integrity

Container Orchestration

  • Optimize container resource allocation
  • Configure health checks and restart policies
  • Manage secrets and configs
  • Set up service networking and load balancing
  • Handle service dependencies

Monitoring & Troubleshooting

  • Check service logs for errors
  • Monitor resource usage (CPU, memory, disk)
  • Debug networking issues
  • Investigate failed deployments
  • Track down performance bottlenecks

Service Image Architecture

Shared Images

Services that share the same codebase use the same Docker image in the compose file. This simplifies rebuilds and makes the architecture clearer.

Instance Services (share services/instance-service/ codebase):

  • instance-service - FastAPI REST API
  • instance-worker - Celery worker for background tasks
  • Image: registry.62.171.153.219.nip.io/compose-instance-service:latest

Database Services (share services/database-service/ codebase):

  • database-service - FastAPI REST API
  • database-worker - Celery worker for database provisioning
  • database-beat - Celery beat scheduler
  • Image: registry.62.171.153.219.nip.io/compose-database-service:latest

Rebuild Process

When rebuilding services with shared images:

  1. Build once - Build the base service image
  2. Push once - Push to registry
  3. Redeploy - All services using that image get updated

No tagging required - The compose file references the same image name for all related services.

Common Commands Reference

Docker Stack Operations

Deploy/Update Stack

bash
# Source environment variables and deploy
set -a && source infrastructure/orchestration/swarm/.env.swarm && set +a && \
docker stack deploy -c infrastructure/orchestration/swarm/docker-compose.ceph.yml saasodoo

Remove Stack

bash
docker stack rm saasodoo

List Services

bash
docker service ls
docker stack services saasodoo

Check Service Status

bash
docker service ps saasodoo_<service-name>
docker service ps saasodoo_<service-name> --no-trunc  # Full details

Scale Service

bash
docker service scale saasodoo_<service-name>=<replicas>
docker service scale saasodoo_user-service=3

Force Update Service

bash
docker service update --force saasodoo_<service-name>

View Logs

bash
docker service logs saasodoo_<service-name> --tail 100 --follow
docker service logs saasodoo_<service-name> --since 10m

Service Inspection

Inspect Service Configuration

bash
docker service inspect saasodoo_<service-name>
docker service inspect saasodoo_<service-name> --format '{{json .Spec.Labels}}' | python3 -m json.tool

Check Service Health

bash
docker ps --filter name=saasodoo_<service-name>
curl http://localhost:<port>/health

CephFS Management

Check CephFS Mounts

bash
df -h /mnt/cephfs
ls -la /mnt/cephfs/

Monitor Storage Usage

bash
du -sh /mnt/cephfs/*

Set Quotas

bash
# Set quota on a directory
setfattr -n ceph.quota.max_bytes -v <bytes> /mnt/cephfs/<directory>

# Example: 10GB quota
setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/cephfs/odoo_instances/instance_123

Check Quotas

bash
getfattr -n ceph.quota.max_bytes /mnt/cephfs/<directory>

Clean Up Old Data

bash
# Remove old backups
rm -rf /mnt/cephfs/odoo_backups/old_*

# Remove stopped instance data
rm -rf /mnt/cephfs/odoo_instances/odoo_data_<instance>

Database Operations

Platform PostgreSQL (saasodoo_postgres)

Purpose: Hosts platform databases (auth, billing, instance, communication) and db_servers table

bash
# Get container ID
PGID=$(docker ps --filter name=saasodoo_postgres --format "{{.ID}}" | head -1)

# Check database status
docker exec $PGID pg_isready -U database_service

# Connect to instance database (database allocation system)
docker exec -it $PGID psql -U database_service -d instance

# Query db_servers table (database pools)
docker exec $PGID psql -U database_service -d instance -c \
  "SELECT name, admin_user, status, health_status, current_instances, max_instances FROM db_servers ORDER BY name;"

# Query instances table
docker exec $PGID psql -U database_service -d instance -c \
  "SELECT id, name, status, db_server_id, plan_tier FROM instances LIMIT 10;"

# Connect to auth database
docker exec -it $PGID psql -U auth_service -d auth

# Connect to billing database
docker exec -it $PGID psql -U billing_service -d billing

⚠️ Troubleshooting: Command Substitution Failures

If command substitution $(...) fails (common in some bash environments or when commands are piped), use this two-step workaround:

bash
# Step 1: Get the container name (not ID) - copy the output
docker ps -f name=saasodoo_postgres --format "{{.Names}}" | head -1
# Output example: saasodoo_postgres.1.hd580auaswznnxgdfbksger7e

# Step 2: Use the full container name directly
docker exec saasodoo_postgres.1.hd580auaswznnxgdfbksger7e \
  psql -U instance_service -d instance -c \
  "SELECT id, name, db_type FROM instances WHERE name='TestInstance';"

# Or for interactive sessions
docker exec -it saasodoo_postgres.1.hd580auaswznnxgdfbksger7e \
  psql -U instance_service -d instance

Why this happens: In Docker Swarm, containers have dynamic names like <service>.<replica>.<task-id>. Some environments escape $(...) preventing proper command substitution.

Database Pools (postgres-pool-N)

Purpose: Dynamically provisioned PostgreSQL servers for hosting Odoo instance databases

bash
# List all pool services
docker service ls | grep postgres-pool

# Get pool container ID
POOL_ID=$(docker ps --filter name=postgres-pool-1 --format "{{.ID}}" | head -1)

# Check pool status
docker exec $POOL_ID pg_isready -U postgres

# List databases in pool
docker exec $POOL_ID psql -U postgres -c "\l"

# Check for Odoo databases
docker exec $POOL_ID psql -U postgres -c "\l" | grep odoo_

# Connect to specific Odoo database in pool
docker exec -it $POOL_ID psql -U postgres -d odoo_<customer-id>_<instance-id>

# Check pool resource usage
docker exec $POOL_ID psql -U postgres -c \
  "SELECT datname, count(*) as connections FROM pg_stat_activity GROUP BY datname;"

KillBill/MariaDB

bash
# Check databases
docker exec <killbill-db-container> mysql -uroot -p<password> -e "SHOW DATABASES;"

# Count tables
docker exec <killbill-db-container> mysql -uroot -p<password> <database> -e "SHOW TABLES;" | wc -l

# Check table in database
docker exec <killbill-db-container> mysql -uroot -p<password> <database> -e "SHOW TABLES;"

Network Troubleshooting

Check Network Connectivity

bash
# Test service-to-service communication
docker exec <container> curl http://<service-name>:<port>/health

# Check DNS resolution
docker exec <container> nslookup <service-name>

# List networks
docker network ls
docker network inspect saasodoo-network

Check Traefik Routing

bash
# Test endpoint through Traefik
curl -I http://api.<domain>/<service>/health

# Check Traefik dashboard
curl http://traefik.<domain>

# Check service labels
docker service inspect saasodoo_<service> --format '{{json .Spec.Labels}}' | grep traefik

Cleanup Operations

Remove Unused Resources

bash
# Remove stopped containers
docker container prune -f

# Remove unused images
docker image prune -a -f

# Remove unused volumes
docker volume prune -f

# Remove unused networks
docker network prune -f

Clean Stack Data

bash
# Remove service data (BE CAREFUL!)
rm -rf /mnt/cephfs/postgres_data/*
rm -rf /mnt/cephfs/killbill_db_data/*
rm -rf /mnt/cephfs/redis_data/*

Best Practices

Deployment

  1. Always source environment variables before deploying
  2. Check logs immediately after deployment for errors
  3. Use health checks to verify service availability
  4. Rolling updates: Update one service at a time
  5. Backup data before major changes

Troubleshooting Workflow

  1. Check service status - Is it running?
  2. Review logs - What errors are showing?
  3. Verify configuration - Are env vars correct?
  4. Test connectivity - Can services reach each other?
  5. Check resources - Is memory/CPU/disk sufficient?
  6. Restart if needed - Force update or scale down/up

Security

  1. Never expose sensitive ports publicly
  2. Use service-specific credentials (not admin)
  3. Keep secrets in environment variables (not in code)
  4. Use Docker secrets for production
  5. Regularly update images for security patches

Performance

  1. Monitor resource usage regularly
  2. Set appropriate resource limits for containers
  3. Use health checks with reasonable intervals
  4. Implement logging rotation to prevent disk fill
  5. Scale services based on load

Common Issues & Solutions

Service Won't Start

  1. Check logs: docker service logs saasodoo_<service> --tail 100
  2. Check if port is available
  3. Verify environment variables are set
  4. Check dependencies (database, Redis) are running
  5. Inspect service for errors

Service Keeps Restarting

  1. Check health check configuration
  2. Review start_period - may need more time
  3. Check for application errors in logs
  4. Verify database connectivity
  5. Check resource constraints

Network Issues

  1. Verify service is on correct network
  2. Check Traefik labels in deploy section
  3. Test internal DNS resolution
  4. Check firewall rules
  5. Verify CORS configuration

Storage Issues

  1. Check disk space: df -h
  2. Verify CephFS mount: mount | grep ceph
  3. Check quotas: getfattr -n ceph.quota.max_bytes <path>
  4. Clean up old data
  5. Review permissions

Database Connection Failures

  1. Check database service is running
  2. Verify credentials in environment variables
  3. Test connection from service container
  4. Check if database is accepting connections
  5. Review connection pool settings

Emergency Procedures

Complete Stack Restart

bash
# 1. Remove stack
docker stack rm saasodoo

# 2. Wait for cleanup
sleep 10

# 3. Redeploy
set -a && source infrastructure/orchestration/swarm/.env.swarm && set +a && \
docker stack deploy -c infrastructure/orchestration/swarm/docker-compose.ceph.yml saasodoo

Rebuild and Redeploy Service from Source

bash
# 1. Build instance-service image (uses root as build context)
docker build -t registry.62.171.153.219.nip.io/compose-instance-service:latest -f services/instance-service/Dockerfile .

# 2. Build billing-service image (uses root as build context)
docker build -t registry.62.171.153.219.nip.io/compose-billing-service:latest -f services/billing-service/Dockerfile .

# 3. Build frontend-service image (NOTE: Use services/frontend-service/ as build context, not root)
docker build -t registry.62.171.153.219.nip.io/compose-frontend-service:latest -f services/frontend-service/Dockerfile services/frontend-service/

# 4. Push to registry (instance-worker uses same image as instance-service)
docker push registry.62.171.153.219.nip.io/compose-instance-service:latest && \
docker push registry.62.171.153.219.nip.io/compose-billing-service:latest && \
docker push registry.62.171.153.219.nip.io/compose-frontend-service:latest

# 5. Redeploy the stack (picks up new images)
set -a && source infrastructure/orchestration/swarm/.env.swarm && set +a && docker stack deploy -c infrastructure/orchestration/swarm/docker-compose.ceph.yml saasodoo

Rebuild Database Service (IMPORTANT!)

NOTE: database-service, database-worker, and database-beat all use the same image. Building once updates all three services.

bash
# 1. Build database-service image (uses root as build context)
docker build -t registry.62.171.153.219.nip.io/compose-database-service:latest -f services/database-service/Dockerfile .

# 2. Push to registry (database-worker and database-beat use the same image)
docker push registry.62.171.153.219.nip.io/compose-database-service:latest

# 3. Redeploy the stack (picks up new images for all three services)
set -a && source infrastructure/orchestration/swarm/.env.swarm && set +a && docker stack deploy -c infrastructure/orchestration/swarm/docker-compose.ceph.yml saasodoo

Rebuild Platform Postgres with Schema Changes

When to use: After modifying schema files in infrastructure/images/postgres/init-scripts/

bash
# 1. Build custom postgres image with init scripts
docker build -t registry.62.171.153.219.nip.io/compose-postgres:latest -f infrastructure/images/postgres/Dockerfile .

# 2. Push to registry
docker push registry.62.171.153.219.nip.io/compose-postgres:latest

# 3. Scale down postgres service
docker service scale saasodoo_postgres=0

# 4. Clean data directory (CAUTION: This will delete all data!)
rm -rf /mnt/cephfs/postgres_data/*

# 5. Redeploy stack (postgres will reinitialize with new schema)
set -a && source infrastructure/orchestration/swarm/.env.swarm && set +a && docker stack deploy -c infrastructure/orchestration/swarm/docker-compose.ceph.yml saasodoo

# 6. Wait for postgres to be healthy
sleep 30
docker service ps saasodoo_postgres

Force Rebuild Service (Quick)

bash
# Scale to 0, then back to 1
docker service scale saasodoo_<service>=0
sleep 5
docker service scale saasodoo_<service>=1

Database Recovery

bash
# 1. Scale down service
docker service scale saasodoo_<db-service>=0

# 2. Backup data
cp -r /mnt/cephfs/<db>_data /mnt/cephfs/<db>_data.backup

# 3. Remove corrupted data
rm -rf /mnt/cephfs/<db>_data/*

# 4. Scale up (will reinitialize)
docker service scale saasodoo_<db-service>=1

Database Service API Access

API Endpoints

  • Base URL: http://api.62.171.153.219.nip.io/database
  • OpenAPI Docs: http://api.62.171.153.219.nip.io/database/docs

Common Database Service Operations

Provision New Pool

bash
curl -X POST http://api.62.171.153.219.nip.io/database/api/database/admin/provision-pool \
  -H "Content-Type: application/json" \
  -d '{"max_instances": 50}'

List All Pools

bash
curl http://api.62.171.153.219.nip.io/database/api/database/admin/pools

Get Pool Details

bash
# Get pool ID from database first
POOL_ID=$(docker exec $(docker ps -q -f name=saasodoo_postgres | head -1) \
  psql -U database_service -d instance -t -c \
  "SELECT id FROM db_servers WHERE name='postgres-pool-1';")

# Get pool details
curl "http://api.62.171.153.219.nip.io/database/api/database/admin/pools/${POOL_ID}"

Trigger Pool Health Check

bash
curl -X POST "http://api.62.171.153.219.nip.io/database/api/database/admin/pools/${POOL_ID}/health-check"

Get System Statistics

bash
curl http://api.62.171.153.219.nip.io/database/api/database/admin/stats

Allocate Database for Instance

bash
curl -X POST http://api.62.171.153.219.nip.io/database/api/database/allocate \
  -H "Content-Type: application/json" \
  -d '{
    "instance_id": "test-instance-001",
    "customer_id": "test-customer-001",
    "plan_tier": "starter"
  }'

Provision Dedicated Server

bash
curl -X POST http://api.62.171.153.219.nip.io/database/api/database/provision-dedicated \
  -H "Content-Type: application/json" \
  -d '{
    "instance_id": "premium-instance-001",
    "customer_id": "premium-customer-001",
    "plan_tier": "enterprise"
  }'

Database Service Celery Tasks

The database-service has two key background tasks:

  1. Health Check Task (health_check_db_pools)

    • Runs every 5 minutes
    • Tests connectivity to all pools
    • Updates health_status: healthy → degraded → unhealthy
    • Uses admin_user and admin_password from db_servers table
    • Promotes 'initializing' pools to 'active' when healthy
  2. Cleanup Task (cleanup_failed_pools)

    • Runs daily
    • Removes pools in 'error' status with no databases
    • Cleans up Docker Swarm services
    • Marks records as 'deprovisioned'

Monitoring Database Service

bash
# Check service health
curl http://api.62.171.153.219.nip.io/database/health

# Check database connectivity
curl http://api.62.171.153.219.nip.io/database/health/database

# Check Docker connectivity
curl http://api.62.171.153.219.nip.io/database/health/docker

# View service logs
docker service logs saasodoo_database-service --tail 100 --follow

# View worker logs (Celery tasks)
docker service logs saasodoo_database-worker --tail 100 --follow

# View beat logs (task scheduler)
docker service logs saasodoo_database-beat --tail 50

Important File Locations

  • Docker Compose: /root/saasodoo/infrastructure/orchestration/swarm/docker-compose.ceph.yml
  • Environment: /root/saasodoo/infrastructure/orchestration/swarm/.env.swarm
  • CephFS Mount: /mnt/cephfs/
  • Service Data: /mnt/cephfs/<service>_data/
  • Instance Data: /mnt/cephfs/odoo_instances/
  • Database Pools: /mnt/cephfs/postgres_pools/pool-N/
  • Backups: /mnt/cephfs/odoo_backups/
  • Docker Registry: /mnt/cephfs/docker-registry/
  • Postgres Init Scripts: /root/saasodoo/infrastructure/images/postgres/init-scripts/

Monitoring Checklist

Daily:

  • Check all services are running
  • Review error logs
  • Check disk space
  • Verify backups completed

Weekly:

  • Review resource usage trends
  • Clean up old data/images
  • Check for security updates
  • Review performance metrics

Monthly:

  • Full stack health check
  • Review and optimize configurations
  • Update documentation
  • Plan capacity upgrades if needed

Didn't find tool you were looking for?

Be as detailed as possible for better results