Agent skill

observability

Logging, metrics, and distributed tracing. OpenTelemetry, Prometheus, Grafana, and production debugging.

Stars 1
Forks 0

Install this agent skill to your Project

npx add-skill https://github.com/pluginagentmarketplace/custom-plugin-backend/tree/main/skills/observability

SKILL.md

Observability Skill

Bonded to: backend-observability-agent (Primary)


Quick Start

bash
# Invoke observability skill
"Set up structured logging for my application"
"Configure Prometheus metrics"
"Implement distributed tracing with OpenTelemetry"

Three Pillars

Pillar Purpose Tools
Logs What happened ELK, Loki
Metrics System state Prometheus, Grafana
Traces Request flow Jaeger, OpenTelemetry

Examples

Structured Logging

python
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()

def process_request(request):
    log = logger.bind(
        correlation_id=request.headers.get("X-Correlation-ID"),
        user_id=request.user.id
    )
    log.info("request_started", method=request.method)

Prometheus Metrics

python
from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter('http_requests_total', 'Total requests', ['method', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'Request latency')

@REQUEST_LATENCY.time()
async def handle_request(request):
    response = await process(request)
    REQUEST_COUNT.labels(method=request.method, status=response.status_code).inc()
    return response

OpenTelemetry Tracing

python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)

    with tracer.start_as_current_span("validate"):
        validate(order_id)

    with tracer.start_as_current_span("charge"):
        charge(order_id)

Key Metrics (RED Method)

Metric Description Alert Threshold
Rate Requests/sec Drop > 50%
Errors Error rate % > 1% for 5 min
Duration P99 latency > 500ms for 5 min

Troubleshooting

Issue Cause Solution
Missing logs Log level too high Adjust log level
High cardinality Too many labels Reduce label values
Broken traces Context not propagated Forward headers

Resources

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results