Agent skill

server-management

Server management principles and decision-making. Process management, monitoring strategy, and scaling decisions. Teaches thinking, not commands.


Install this agent skill to your Project

npx add-skill https://github.com/davila7/claude-code-templates/tree/main/cli-tool/components/skills/development/server-management

SKILL.md

Server Management

Server management principles for production operations. Learn to THINK, not memorize commands.


1. Process Management Principles

Tool Selection

| Scenario | Tool |
| --- | --- |
| Node.js app | PM2 (clustering, reload) |
| Any app | systemd (Linux native) |
| Containers | Docker/Podman |
| Orchestration | Kubernetes, Docker Swarm |

Process Management Goals

| Goal | What It Means |
| --- | --- |
| Restart on crash | Auto-recovery |
| Zero-downtime reload | No service interruption |
| Clustering | Use all CPU cores |
| Persistence | Survive server reboot |
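
To make "restart on crash" concrete, here is a minimal sketch of the loop a process manager runs on your behalf. It is a toy for building intuition, not a replacement for PM2 or systemd; the names `backoff_delays` and `supervise` and the default delay values are assumptions made for illustration.

```python
import subprocess
import sys
import time
from typing import Callable, Iterator, Sequence


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0) -> Iterator[float]:
    """Yield restart delays that grow exponentially until they hit the cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def supervise(cmd: Sequence[str], max_restarts: int = 5,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run cmd and restart it after each non-zero exit.

    Stops when the process exits cleanly or the restart budget is used up,
    and returns how many restarts were performed.
    """
    delays = backoff_delays()
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if exit_code == 0 or restarts >= max_restarts:
            return restarts
        # Wait a little longer after each crash to avoid a tight crash loop.
        sleep(next(delays))
        restarts += 1


if __name__ == "__main__":
    # Hypothetical worker that always crashes, to show the restart budget.
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print("restarts:", supervise(crashing, max_restarts=2, sleep=lambda s: None))
```

The backoff and the restart cap are the parts worth thinking about: without them, a process that crashes on startup is restarted in a tight loop and hides the real failure.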

2. Monitoring Principles

What to Monitor

| Category | Key Metrics |
| --- | --- |
| Availability | Uptime, health checks |
| Performance | Response time, throughput |
| Errors | Error rate, types |
| Resources | CPU, memory, disk |

Alert Severity Strategy

| Level | Response |
| --- | --- |
| Critical | Immediate action |
| Warning | Investigate soon |
| Info | Review daily |
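
One way to think about the severity levels is as thresholds on a metric. The sketch below assumes a metric where higher is worse (for example, disk usage); the `Thresholds` type and the 80/95 percent values are hypothetical and should be tuned per system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Boundaries for a metric where a higher value is worse."""
    warning: float
    critical: float


def classify(value: float, thresholds: Thresholds) -> str:
    """Map a metric reading to one of the three alert levels."""
    if value >= thresholds.critical:
        return "critical"   # immediate action
    if value >= thresholds.warning:
        return "warning"    # investigate soon
    return "info"           # review daily


# Hypothetical thresholds for disk usage, in percent.
DISK_USAGE_PERCENT = Thresholds(warning=80.0, critical=95.0)

if __name__ == "__main__":
    for reading in (50.0, 85.0, 97.0):
        print(reading, classify(reading, DISK_USAGE_PERCENT))
```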

Monitoring Tool Selection

| Need | Options |
| --- | --- |
| Simple/Free | PM2 metrics, htop |
| Full observability | Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |

3. Log Management Principles

Log Strategy

| Log Type | Purpose |
| --- | --- |
| Application logs | Debug, audit |
| Access logs | Traffic analysis |
| Error logs | Issue detection |

Log Principles

  1. Rotate logs to prevent disk fill
  2. Structured logging (JSON) for parsing
  3. Appropriate levels (error/warn/info/debug)
  4. No sensitive data in logs
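
The four principles above can be sketched together with Python's standard `logging` module. This is a minimal illustration, not a complete logging setup: the `context` attribute, the `SENSITIVE_KEYS` set, the logger name `app`, and the size limits are all assumptions chosen for the example.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

# Hypothetical list of keys whose values must never reach the log file.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields are passed via logger.info(..., extra={"context": {...}}).
        context = getattr(record, "context", {})
        for key, value in context.items():
            redact = key.lower() in SENSITIVE_KEYS
            payload[key] = "[REDACTED]" if redact else value
        return json.dumps(payload)


def build_logger(path: str, level: int = logging.INFO) -> logging.Logger:
    """Create a logger that writes JSON lines and rotates at about 10 MB."""
    handler = RotatingFileHandler(path, maxBytes=10_000_000, backupCount=5)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.setLevel(level)
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("app.log")
    log.info("login attempt",
             extra={"context": {"user": "alice", "password": "hunter2"}})
```

Redacting by key name only catches fields you anticipated; secrets interpolated into the message string still leak, so the safer habit is to never pass them to the logger at all.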

4. Scaling Decisions

When to Scale

| Symptom | Solution |
| --- | --- |
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |

Scaling Strategy

| Type | When to Use |
| --- | --- |
| Vertical | Quick fix, single instance |
| Horizontal | Sustainable, distributed |
| Auto | Variable traffic |
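
For auto-scaling, a common starting point is a proportional rule: scale the instance count by the ratio of the observed metric to its target. The sketch below is modeled on the rule used by the Kubernetes Horizontal Pod Autoscaler, with its tolerance band and stabilization window left out; the function name and the default bounds are assumptions.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Return the instance count that would bring the metric back to target.

    Uses ceil(current_replicas * current_metric / target_metric), then clamps
    the result to the configured bounds.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 instances at 90% average CPU with a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

The upper bound matters as much as the formula: without it, a runaway metric (or a bug that inflates it) scales cost along with it.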

5. Health Check Principles

What Constitutes Healthy

| Check | Meaning |
| --- | --- |
| HTTP 200 | Service responding |
| Database connected | Data accessible |
| Dependencies OK | External services reachable |
| Resources OK | CPU/memory not exhausted |

Health Check Implementation

  • Simple: Just return 200
  • Deep: Check all dependencies
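  • Choose based on what the load balancer needs: a simple check for frequent liveness probes, a deep check when traffic should be withheld until dependencies are reachable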
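
The difference between the two styles can be sketched without any web framework. The functions below return a status code and a body that an HTTP handler could send; the names `shallow_health` and `deep_health`, and the shape of the response body, are assumptions for illustration.

```python
from typing import Callable, Dict, Tuple


def shallow_health() -> Tuple[int, dict]:
    """Simple check: the process is up and able to answer."""
    return 200, {"status": "ok"}


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Deep check: run every dependency probe and report each result.

    A probe that raises is treated as failed, so one broken dependency
    cannot crash the health endpoint itself.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    status_code = 200 if healthy else 503
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return status_code, body


if __name__ == "__main__":
    # Hypothetical probes; real ones would ping the database, cache, etc.
    print(deep_health({"database": lambda: True, "payments_api": lambda: False}))
```

Deep checks are more informative but also more expensive and more likely to fail for reasons outside the service, which is why the choice depends on who is calling the endpoint and how often.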

6. Security Principles

| Area | Principle |
| --- | --- |
| Access | SSH keys only, no passwords |
| Firewall | Only needed ports open |
| Updates | Regular security patches |
| Secrets | Environment vars, not files |
| Audit | Log access and changes |
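
As an illustration of the secrets principle, the sketch below reads a secret from the environment and fails fast at startup if it is missing. `require_secret`, `MissingSecretError`, and the variable name `DATABASE_URL` are hypothetical.

```python
import os
from typing import Mapping


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def require_secret(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the secret's value, or raise if it is unset or empty.

    The error message names the variable but never includes a value,
    so nothing sensitive ends up in logs or tracebacks.
    """
    value = environ.get(name)
    if not value:
        raise MissingSecretError(f"required secret {name} is not set")
    return value


if __name__ == "__main__":
    try:
        database_url = require_secret("DATABASE_URL")
    except MissingSecretError as exc:
        raise SystemExit(str(exc))
    print("DATABASE_URL is configured")
```

Failing at startup is deliberate: a service that boots without its credentials tends to fail later, under load, with a far less obvious error.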

7. Troubleshooting Priority

When something's wrong:

  1. Check if running (process status)
  2. Check logs (error messages)
  3. Check resources (disk, memory, CPU)
  4. Check network (ports, DNS)
  5. Check dependencies (database, APIs)
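
The order matters because each step rules out a cheaper explanation before a more expensive one. A minimal sketch of that idea: run named checks in priority order and stop at the first one that fails. The helper names, the 10 percent free-disk threshold, and the example host and port are assumptions.

```python
import shutil
import socket
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def disk_has_space(path: str = "/", min_free_ratio: float = 0.10) -> bool:
    """True if at least min_free_ratio of the disk holding path is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_ratio


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run checks in order; return the name of the first failure, else None."""
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that cannot run counts as a failure
        if not ok:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical checks for a service expected to listen on localhost:8080.
    checks = [
        ("resources: disk", disk_has_space),
        ("network: port 8080", lambda: port_is_open("127.0.0.1", 8080)),
    ]
    print("first failing check:", first_failure(checks))
```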

8. Anti-Patterns

| ❌ Don't | ✅ Do |
| --- | --- |
| Run as root | Use non-root user |
| Ignore logs | Set up log rotation |
| Skip monitoring | Monitor from day one |
| Manual restarts | Auto-restart config |
| No backups | Regular backup schedule |

Remember: A well-managed server is boring. That's the goal.
