Agent skill

analyzing-apm-data

Monitor application performance using the RED methodology (Rate, Errors, Duration) with Observe. Use when analyzing service health, investigating errors, tracking latency, or building APM dashboards. Covers when to use metrics vs spans, combining RED signals, and troubleshooting workflows. Cross-references working-with-intervals, aggregating-gauge-metrics, and analyzing-tdigest-metrics skills.

Stars 1
Forks 1

Install this agent skill to your Project

npx add-skill https://github.com/rustomax/observe-community-mcp/tree/main/skills/analyzing-apm-data

SKILL.md

Analyzing APM Data

Application Performance Monitoring (APM) tracks the health and performance of your services using telemetry data. This skill teaches the RED methodology and how to choose between pre-aggregated metrics and raw span data for different APM use cases.

When to Use This Skill

  • Monitoring microservices health and performance
  • Building service dashboards with Rate, Errors, Duration (RED)
  • Investigating production incidents and errors
  • Tracking latency and throughput across services
  • Setting up SLO (Service Level Objective) monitoring
  • Understanding when to use metrics vs spans for APM
  • Root cause analysis of service degradation
  • Analyzing service dependencies and call patterns
  • Identifying database bottlenecks and slow queries
  • Understanding which services depend on a failing component

Prerequisites

  • Access to Observe tenant with OpenTelemetry span data
  • Understanding of distributed tracing concepts
  • Familiarity with metrics queries (see aggregating-gauge-metrics, analyzing-tdigest-metrics)
  • Familiarity with interval queries (see working-with-intervals)

Key Concepts

The RED Methodology

RED is a standard approach for monitoring microservices health:

R - Rate: Request volume (requests per second/minute/hour)

  • How much traffic is each service handling?
  • Are there traffic spikes or drops?

E - Errors: Error count and error rate (percentage)

  • How many requests are failing?
  • Which services have the highest error rates?

D - Duration: Latency percentiles (p50, p95, p99)

  • How fast are services responding?
  • Are there latency spikes?

Metrics vs Spans Decision Tree

Observe provides two ways to analyze APM data:

Metrics (Pre-aggregated, Fast)

  • Request counts: span_call_count_5m
  • Error counts: span_error_count_5m
  • Latency percentiles: span_*_duration_tdigest_5m
  • Pre-aggregated at 5-minute intervals

Spans (Raw Data, Detailed)

  • Dataset: OpenTelemetry/Span (interface: otel_span)
  • Complete trace information
  • All span attributes available

Decision Matrix:

Need Use Why
Dashboard, 24h+ range Metrics Fast, efficient, pre-aggregated
"How many errors?" Metrics Quick counts, good for alerts
"What went wrong?" Spans Error messages, stack traces
Latency overview Metrics (tdigest) Efficient percentiles
Filter by endpoint Spans Full attribute filtering
Real-time monitoring Metrics Consistent performance
Root cause analysis Spans Complete context, trace IDs
Service dependencies Spans Trace relationships, span types
Database bottlenecks Spans DB attributes, query details

Recommended Workflow:

  1. Start with metrics (fast overview)
  2. Identify anomalies (high errors, latency spikes)
  3. Drill down with spans (detailed investigation)

Discovery Workflow

Step 1: Find APM metrics

discover_context("request error latency", result_type="metric")

Common metrics:

  • span_call_count_5m - Request counts (gauge)
  • span_error_count_5m - Error counts (gauge)
  • span_sn_service_node_duration_tdigest_5m - Latency (tdigest)

Step 2: Find span dataset

discover_context("opentelemetry span trace")

Look for: OpenTelemetry/Span (interface: otel_span)

Step 3: Get detailed schemas

discover_context(metric_name="span_call_count_5m")
discover_context(dataset_id="<span_dataset_id>")

RED Methodology Patterns

Pattern 1: Rate - Request Volume (Metrics)

Get total requests per service over 24 hours:

opal
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)
| sort desc(total_requests)
| limit 10

Output: One row per service with total request count.

Use case: Identify highest-traffic services, detect traffic drops.

Pattern 2: Errors - Error Count (Metrics)

Get total errors per service:

opal
align options(bins: 1), errors:sum(m("span_error_count_5m"))
aggregate total_errors:sum(errors), group_by(service_name)
fill total_errors:0
| sort desc(total_errors)
| limit 10

Output: Services ranked by error count (including zeros with fill).

Use case: Quick error overview, dashboard widgets.

Pattern 3: Duration - Latency Percentiles (Metrics)

Get latency percentiles per service:

opal
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
          p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          p99:tdigest_quantile(tdigest_combine(combined), 0.99),
          group_by(service_name)
| make_col p50_ms:p50/1000000, p95_ms:p95/1000000, p99_ms:p99/1000000
| sort desc(p95_ms)
| limit 10

Output: Latency percentiles in milliseconds per service.

Use case: SLO tracking, latency monitoring, performance comparison.

Note: Uses tdigest double-combine pattern (see analyzing-tdigest-metrics skill).

Pattern 4: Combined RED Dashboard (Metrics)

Calculate Rate, Errors, and Error Rate together:

opal
align options(bins: 1),
      requests:sum(m("span_call_count_5m")),
      errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
          total_errors:sum(errors),
          group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0
| sort desc(error_rate)
| limit 10

Output: Services with request count, error count, and error rate percentage.

Use case: Complete service health dashboard.

Pattern 5: Trending - Requests Over Time (Metrics)

Track request volume over time for charting:

opal
align 1h, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_hour:sum(rate), group_by(service_name)
| filter service_name = "frontend"

Output: Hourly time-series data for one service.

Use case: Dashboard charts, trend analysis, capacity planning.

Pipe rule: Time-series (1h) requires pipe | after align.

Detailed Investigation Patterns (Spans)

Pattern 6: Rate from Spans (Detailed)

Count requests from raw spans (1-hour window recommended):

opal
make_col svc:service_name
| statsby request_count:count(), group_by(svc)
| sort desc(request_count)
| limit 10

Use case: When you need exact counts for short time ranges or want to filter by span attributes first.

Note: Use shorter time ranges (1h) for performance.

Pattern 7: Error Analysis with Messages (Spans)

Get error details including messages:

opal
filter error = true
| make_col svc:service_name,
          error_msg:string(error_message),
          span:span_name,
          status:string(status_code)
| statsby error_count:count(), group_by(svc, error_msg, span)
| sort desc(error_count)
| limit 10

Output: Error counts WITH full error messages and span names.

Use case: Root cause analysis - understand WHY errors happened.

Key advantage: Spans show WHAT went wrong, metrics only show HOW MANY.

Pattern 8: Latency from Spans (Filtered)

Calculate latency percentiles for specific span types:

opal
filter span_type = "Service entry point"
| make_col svc:service_name, dur_ms:duration / 1ms
| statsby p50:percentile(dur_ms, 0.50),
          p95:percentile(dur_ms, 0.95),
          p99:percentile(dur_ms, 0.99),
          group_by(svc)
| sort desc(p95)
| limit 10

Output: Latency percentiles for user-facing requests only.

Use case: Focus on end-user experience, exclude internal calls.

Duration conversion: duration / 1ms converts nanoseconds to milliseconds.

Pattern 9: Error Rate from Spans

Calculate error rate using raw span data:

opal
make_col svc:service_name, is_error:if(error = true, 1, 0)
| statsby total:count(), error_count:sum(is_error), group_by(svc)
| make_col error_rate:float64(error_count) / float64(total) * 100.0
| sort desc(error_rate)
| limit 10

Output: Services with total requests, errors, and error rate.

Use case: When you need to filter spans first (e.g., only specific endpoints).

Complete APM Workflow Example

Scenario: You notice performance issues and need to investigate.

Step 1: Check overall service health (Metrics - Fast)

opal
align options(bins: 1),
      requests:sum(m("span_call_count_5m")),
      errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
          total_errors:sum(errors),
          group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0
| sort desc(error_rate)

Finding: cartservice has 0.48% error rate - investigate further.

Step 2: Get error details (Spans - Detailed)

opal
filter error = true
| filter service_name = "cartservice"
| make_col error_msg:string(error_message),
          span:span_name,
          endpoint:string(attributes.\"http.target\")
| statsby error_count:count(), group_by(error_msg, span, endpoint)
| sort desc(error_count)

Finding: "Can't access cart storage. System.ApplicationException: Wasn't able to connect to redis" - Redis connection issue!

Step 3: Check latency impact (Metrics)

opal
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          group_by(service_name)
| filter service_name = "cartservice"
| make_col p95_ms:p95/1000000

Finding: p95 latency is elevated - errors are impacting response time.

Step 4: View hourly trend (Metrics)

opal
align 1h, errors:sum(m("span_error_count_5m"))
| aggregate errors_per_hour:sum(errors), group_by(service_name)
| filter service_name = "cartservice"

Finding: Errors started 3 hours ago - correlate with Redis deployment?

Result: Complete picture of issue (what, when, impact) using metrics + spans.

Dependency Tracking and Service Relationships

Understanding service dependencies is critical for root cause analysis. Spans contain relationship information through trace IDs and span types.

Key Span Fields for Dependency Tracking

Trace relationships:

  • trace_id - Links all spans in a single request
  • span_id - Unique identifier for this span
  • parent_span_id - Links to calling span

Span direction:

  • kind - "CLIENT" (outgoing call) or "SERVER" (incoming request)
  • span_type - "Service entry point", "Remote call", "Database call"

Target identification:

  • span_name - Operation name (e.g., "grpc.oteldemo.CartService/GetCart")
  • attributes.db.* - Database connection details

Pattern 10: What Services Does X Depend On? (Downstream)

Find all services that a given service calls:

opal
filter service_name = "frontend" and kind = "CLIENT"
| make_col target:span_name, dur_ms:duration / 1ms
| statsby calls:count(),
          p95_ms:percentile(dur_ms, 0.95),
          group_by(target)
| sort desc(calls)

Output: All downstream dependencies with call volume and latency.

Use case: "Frontend is slow - which downstream services are contributing?"

Pattern 11: What Services Depend On X? (Upstream)

Find all services that call a given service (requires trace traversal):

opal
filter span_type = "Service entry point"
| make_col svc:service_name
| statsby total:count(), group_by(svc)
| sort desc(total)

Output: Services ordered by request volume (entry points).

Note: Full upstream analysis requires joining spans by trace_id (advanced).

Pattern 12: Database Dependencies

Identify which services are making database calls:

opal
filter string(attributes."db.type") = "sql"
| make_col caller:service_name,
          db_name:string(attributes."db.name"),
          dur_ms:duration / 1ms
| statsby call_count:count(),
          p95_latency:percentile(dur_ms, 0.95),
          group_by(caller, db_name)
| sort desc(call_count)

Output: Service-to-database call patterns with latency.

Use case: "Are there any slow database queries impacting my services?"

Pattern 13: Service Call Patterns (All Outbound)

Get a complete picture of all outbound service calls:

opal
filter kind = "CLIENT"
| make_col caller:service_name, target:span_name
| statsby call_count:count(), group_by(caller, target)
| sort desc(call_count)
| limit 20

Output: Complete service-to-service call matrix.

Use case: Understanding service dependencies architecture-wide.

Dependency Troubleshooting Workflow

Scenario: Service X has elevated latency - is it a downstream dependency?

Step 1: Identify service's dependencies

opal
filter service_name = "X" and kind = "CLIENT"
| make_col target:span_name, dur_ms:duration / 1ms
| statsby calls:count(),
          p95_ms:percentile(dur_ms, 0.95),
          group_by(target)
| sort desc(p95_ms)

Step 2: Check if database calls are slow

opal
filter service_name = "X"
| filter string(attributes."db.type") = "sql"
| make_col db:string(attributes."db.name"),
          query:string(attributes."db.statement"),
          dur_ms:duration / 1ms
| statsby avg_ms:avg(dur_ms),
          p95_ms:percentile(dur_ms, 0.95),
          count:count(),
          group_by(db, query)
| sort desc(p95_ms)
| limit 10

Step 3: Compare to baseline Run same queries for different time range to see if latency changed.

Result: Identify which downstream dependency is causing the issue.

Common Use Cases

Use Case 1: Service Health Dashboard

Goal: Real-time overview of all services.

Solution: Use metrics with summary pattern:

opal
align options(bins: 1),
      requests:sum(m("span_call_count_5m")),
      errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
          total_errors:sum(errors),
          group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0

Add latency query separately (tdigest metrics).

Why metrics: Fast, efficient, updates every 5 minutes.

Use Case 2: SLO Tracking

Goal: Track p95 latency against 100ms SLO.

Solution: Use tdigest metrics:

opal
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          group_by(service_name)
| make_col p95_ms:p95/1000000, slo_breach:if(p95/1000000 > 100, true, false)
| filter slo_breach = true

Output: Services exceeding 100ms p95 SLO.

Use Case 3: Error Investigation

Goal: Understand why a specific service is failing.

Solution: Start with metrics (count), drill down with spans (details):

opal
filter error = true
| filter service_name = "target-service"
| make_col error_msg:string(error_message),
          trace:trace_id,
          timestamp:start_time
| sort desc(timestamp)
| limit 50

Output: Recent errors with messages and trace IDs for further investigation.

Use Case 4: Capacity Planning

Goal: Understand traffic patterns over 30 days.

Solution: Use metrics with daily buckets:

opal
align 1d, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_day:sum(rate), group_by(service_name)

Time range: Set to 30d in query parameters.

Output: Daily request volume for capacity analysis.

Best Practices

  1. Start with metrics for dashboards, alerts, and overview queries
  2. Use spans for investigation when you need details or filtering
  3. Keep time ranges short when querying spans (1h recommended, max 24h)
  4. Use time-series patterns (align 1h) for trending charts
  5. Use summary patterns (options(bins: 1)) for single statistics
  6. Combine metrics (requests + errors) in single query for efficiency
  7. Filter spans early to improve performance (filter before make_col)
  8. Use fill to show services with zero errors in dashboards

Performance Considerations

Metrics queries:

  • Consistent performance across time ranges (1h same speed as 30d)
  • Ideal for: Dashboards, real-time monitoring, long time ranges
  • Limitation: Fixed dimensions, no custom filtering

Span queries:

  • Performance degrades with longer time ranges
  • Ideal for: Short time windows (1h), detailed investigation, filtered analysis
  • Limitation: Slower for high-volume services over long periods

Recommended approach:

  • Dashboards: 100% metrics
  • Alerts: Metrics
  • Investigation: Metrics (overview) → Spans (details)
  • Trace analysis: Spans only

Common Pitfalls

Pitfall 1: Using Spans for Long Time Ranges

Wrong:

opal
make_col svc:service_name
| statsby request_count:count(), group_by(svc)

With 30-day time range - very slow!

Correct:

opal
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)

Metrics handle 30 days efficiently.

Pitfall 2: Forgetting Time Units

Wrong:

opal
make_col dur_ms:float64(duration)/1000000

Correct:

opal
make_col dur_ms:duration / 1ms

Use OPAL duration units (see working-with-intervals skill).

Pitfall 3: Not Using fill for Zeros

Wrong:

opal
aggregate total_errors:sum(errors), group_by(service_name)

Services with zero errors won't appear.

Correct:

opal
aggregate total_errors:sum(errors), group_by(service_name)
fill total_errors:0

All services shown (important for dashboards).

Pitfall 4: Wrong Metric for Latency

Wrong:

opal
align options(bins: 1), dur:avg(m("span_duration_5m"))

Don't use avg() on pre-aggregated duration.

Correct:

opal
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_*_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)

Use tdigest metrics for percentiles.

Related Skills

  • working-with-intervals - Understanding span duration and temporal queries
  • aggregating-gauge-metrics - Request and error count metrics (gauge type)
  • analyzing-tdigest-metrics - Latency percentile metrics (tdigest type)
  • filtering-event-datasets - Filtering techniques applicable to spans
  • time-series-analysis - Time-bucketed analysis with timechart (alternative to align)

Summary

APM in Observe uses the RED methodology:

  • Rate: Request counts from span_call_count_5m metric
  • Errors: Error counts from span_error_count_5m metric
  • Duration: Latency percentiles from span_*_duration_tdigest_5m metric

Key decision: Metrics for speed and overview, Spans for detail and investigation.

Workflow: Metrics (identify issues) → Spans (root cause analysis) → Traces (follow request flow).

Performance: Metrics scale to long time ranges, spans are best for short windows.


Last Updated: November 14, 2025 Version: 1.0 Tested With: Observe OPAL (OpenTelemetry/Span + ServiceExplorer Metrics)

Expand your agent's capabilities with these related and highly-rated skills.

rustomax/observe-community-mcp

working-with-reference-tables

Work with Reference Tables (static CSV lookup data) using OPAL to enrich datasets with descriptive information. Use when you need to map IDs to human-readable names, add static metadata from CSV uploads, or perform lookups without temporal considerations. Covers both explicit and implicit lookup patterns, column name matching, and when to choose Reference Tables vs Resources vs Correlation Tags.

1 1
Explore
rustomax/observe-community-mcp

working-with-resources

Work with Resource datasets (mutable state tracking) using OPAL temporal joins. Use when you need to enrich Events/Intervals with contextual state information, track resource state changes over time, or navigate between datasets using temporal relationships. Covers temporal join mechanics (lookup, join, follow), automatic field matching, and when to use Resources vs Reference Tables.

1 1
Explore
rustomax/observe-community-mcp

analyzing-text-patterns

Extract and analyze recurring patterns from log messages, span names, and event names using punctuation-based template discovery. Use when you need to understand log diversity, identify common message structures, detect unusual formats, or prepare for log parser development. Works by removing variable content and preserving structural markers.

1 1
Explore
rustomax/observe-community-mcp

time-series-analysis

Analyze event datasets (logs) and intervals over time using OPAL timechart. Use when you need to visualize trends, track metrics over time, or create time-series charts. Covers timechart for temporal binning, bin duration options (1h, 5m, 1d), options(bins:N) for controlling bin count, and understanding temporal output columns (_c_valid_from, _c_valid_to, _c_bucket). Returns multiple rows per group for time-series visualization. For single summaries, see aggregating-event-datasets skill.

1 1
Explore
rustomax/observe-community-mcp

detecting-anomalies

Detect anomalies in metrics and time-series data using OPAL statistical methods. Use when you need to identify unusual patterns, spikes, drops, or outliers in observability data. Covers statistical outlier detection (Z-score, IQR), threshold-based alerts, rate-of-change detection with window functions, and moving average baselines. Choose pattern based on data distribution and anomaly type.

1 1
Explore
rustomax/observe-community-mcp

aggregating-gauge-metrics

Aggregate pre-computed metrics (gauge, counter, delta types) using OPAL. Use when analyzing request counts, error rates, resource utilization, or any numeric metrics over time. Covers align + m() + aggregate pattern, summary vs time-series output, and common aggregation functions. For percentile metrics (tdigest), see analyzing-tdigest-metrics skill.

1 1
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results