Investigating Textual Data in Event Datasets

Investigate and analyze textual data in logs, span events, and other event datasets using OPAL filtering and pattern matching. Use when analyzing error messages, searching log patterns, troubleshooting application issues, or finding specific events in textual data. Covers discovering textual datasets, error detection patterns, regex matching, wide net filtering strategies, aggregation with sampling, and context extraction from nested fields.

For pre-aggregated metrics, see the aggregating-gauge-metrics skill. For distributed tracing analysis, see the analyzing-apm-data skill. For time-series trends, see the time-series-analysis skill.


Table of Contents

  1. When to Use This Skill
  2. Quick Reference
  3. Understanding Textual Datasets
  4. Discovery Workflow
  5. Error Detection Patterns
  6. Text Search vs Regex Matching
  7. Wide Net Filtering Strategy
  8. Aggregation and Sampling
  9. Context Extraction
  10. Complete Examples
  11. Common Pitfalls
  12. Cross-References

When to Use This Skill

Use this skill when users ask questions like:

Error Analysis:

  • "Show me errors in Kubernetes logs"
  • "What are the top 10 error types in my containers?"
  • "Find all Redis connection errors"
  • "Which namespaces have the most errors?"

Pattern Search:

  • "Search logs for timeout messages"
  • "Find all database connection failures"
  • "Show me warnings in stderr"
  • "Get recent errors from CloudWatch logs"

Troubleshooting:

  • "Are there any errors related to authentication?"
  • "Show me error trends over the last 24 hours"
  • "Find all exceptions in the frontend service"
  • "What errors happened in the last hour?"

When NOT to use this skill:

  • Metrics queries (error counts from metrics) → Use aggregating-gauge-metrics
  • APM/tracing analysis (spans, traces) → Use analyzing-apm-data
  • Time-series trending → Use time-series-analysis
  • Simple filtering (known field values) → Use filtering-event-datasets

Quick Reference

Error Detection Patterns

Pattern                  OPAL Query                                     Use Case
Stream filtering         filter stream = "stderr"                       Container stderr logs
Text search              filter contains(body, "error")                 Exact substring (case-sensitive)
Case-insensitive regex   filter body ~ /error/i                         Flexible error matching
Multiple patterns        filter body ~ /error|exception|failed/i        Wide net approach
Wide net                 filter body ~ /error/i or stream = "stderr"    Multiple conditions
Recent errors            filter ... | sort desc(timestamp) | limit 20   Latest events

Common Field Names by Dataset Type

Dataset Type   Message Field   Severity Field   Context Fields
K8s Logs       body            stream           namespace, pod, container
CloudWatch     message         level            logGroup, logStream
Spans          error_message   error (bool)     service_name, span_name
Span Events    event_name      N/A              trace_id, span_id

Critical: Always inspect dataset schema first to identify correct field names!


Understanding Textual Datasets

What Are Textual Event Datasets?

Event datasets contain point-in-time log entries with text messages. Each event has:

  • Single timestamp (not a duration)
  • Text field with log message (body, message, log, etc.)
  • Severity indicators (level, stream, severity)
  • Context fields (service, namespace, pod, container)

Common Dataset Types

1. Container Logs (Kubernetes, Docker):

  • Interface: log
  • Message field: body
  • Severity: stream ("stdout", "stderr")
  • Context: Nested resource_attributes.k8s.*

2. Cloud Provider Logs (CloudWatch, Stackdriver):

  • Interface: log
  • Message field: message or log
  • Severity: level or severity
  • Context: logGroup, logStream, resource.*

3. Span Events (OpenTelemetry):

  • Interface: log
  • Message field: event_name
  • Context: trace_id, span_id, service_name

4. Application Logs (Custom):

  • Interface: log
  • Varies by implementation

Key Difference from Metrics

Aspect             Event Datasets (Logs)                 Metrics
Query approach     filter + statsby                      align + aggregate
Data granularity   Individual log entries                Pre-aggregated values
Best for           Detailed investigation, text search   Volume trends, counts
Performance        Slower for large volumes              Fast, optimized

Rule: Use metrics for volume/trends, use logs for detailed investigation.


Discovery Workflow

Step 1: Identify User Intent

Listen for dataset hints:

  • "kubernetes logs" → K8s container logs
  • "cloudwatch" → AWS CloudWatch logs
  • "stderr" → Container error stream
  • "application logs" → Generic logs
  • "span events" → OpenTelemetry events

No specific hint? Use discovery to find textual datasets.

Step 2: Discover Textual Datasets

python
# General search
discover_context("logs")

# Specific search
discover_context("kubernetes logs")
discover_context("cloudwatch")
discover_context("application errors")

# Filter by interface type
discover_context("", interface_filter="log")

Look for:

  • Interface: log (event datasets with text)
  • Category: "Logs", "Events"
  • Dataset names with "Log", "Event", "CloudWatch", "K8s"

Step 3: Get Detailed Schema

CRITICAL: Always get field names before writing queries!

python
# Get complete field list
discover_context(dataset_id="42161740")

# Identify:
# 1. Message field: body, message, log, event_name
# 2. Severity field: stream, level, severity
# 3. Context fields: namespace, pod, service_name, etc.

Step 4: Check Field Samples

Pay attention to:

  • Field type: text, string, keyword
  • Sample values: See actual field content
  • Nested fields: resource_attributes.*, attributes.*

Example schema output:

body (text) - Sample: "Error: connection timeout to redis:6379"
stream (string) - Sample: "stderr"
namespace (string) - Sample: "default"
resource_attributes (object) - Nested fields:
  - k8s.namespace.name
  - k8s.pod.name

Error Detection Patterns

Pattern 1: Stream Filtering (Container Logs)

When to use: Kubernetes/Docker logs with stream field

Assumption: Container errors are typically written to stderr

opal
filter stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
          pod, container
| statsby error_count:count(), group_by(namespace, pod, container)
| sort desc(error_count)
| limit 20

Result: Error volume by container (1h):

namespace          pod                              container                error_count
default            opentelemetry-collector-xyz      otel-collector          1422
default            recommendationservice-abc        server                  118
observe            cluster-metrics-def              metrics-agent           60

Use case: "Which containers are generating the most stderr output?"

Limitation: Not all errors go to stderr - some apps write errors to stdout


Pattern 2: Text Search with contains()

When to use: Exact substring matching (case-sensitive)

opal
filter contains(body, "error") or contains(body, "ERROR") or contains(body, "Error")
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
          error_snippet:body
| sort desc(timestamp)
| limit 20

Result: Recent errors with exact text match

Pros:

  • Simple syntax
  • Fast for exact matches

Cons:

  • Case-sensitive (must check "error", "ERROR", "Error")
  • No pattern flexibility

When to use instead of regex: Known exact string, simple search
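
The case-sensitivity trade-off can be seen outside OPAL with a small Python sketch. Python's `in` operator and `re` module stand in for contains() and `~ /error/i` here; this is an analogy for the matching behavior, not OPAL's implementation, and the sample messages are invented.

```python
import re

messages = [
    "error: disk full",
    "ERROR: disk full",
    "Error: disk full",
    "all good",
]

# Case-sensitive substring check, analogous to contains(body, "error").
substring_hits = [m for m in messages if "error" in m]

# Case-insensitive regex, analogous to body ~ /error/i.
regex_hits = [m for m in messages if re.search(r"error", m, re.IGNORECASE)]

print(len(substring_hits), len(regex_hits))
```

The substring check finds only the lowercase variant, which is why the contains() query above has to list three spellings.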


Pattern 3: Case-Insensitive Regex

When to use: Flexible error matching with case variations

CRITICAL SYNTAX: Use /pattern/i with forward slashes (NOT string quotes)

opal
filter body ~ /error/i
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
          container
| statsby error_count:count(), group_by(namespace, container)
| sort desc(error_count)
| limit 20

Result (1h):

namespace          container                error_count
observe            cluster-metrics          59
default            prometheus-server        17
kube-system        calico-node             4

Regex patterns:

  • /error/i - Matches "error", "ERROR", "Error"
  • /error|exception|failed/i - Alternation (OR)
  • /[Ee]rror/ - Character class (case-sensitive)
  • /timeout.*error/i - Sequence matching

Syntax rules:

  • CORRECT: body ~ /pattern/i
  • WRONG: body ~ "pattern" (string literal, not regex)
  • WRONG: body ~ "(?i)pattern" (PCRE not supported)
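
To get a feel for what each of the four patterns selects, the sketch below runs them over a few invented log lines using Python's `re` module. Python's engine is not OPAL's; the assumption is only that these simple constructs (literals, alternation, character classes, `.*`) behave the same way in both.

```python
import re

samples = [
    "Error: connection refused",
    "java.lang.NullPointerException",
    "request FAILED after 3 retries",
    "timeout waiting for upstream error handler",
    "healthy",
]

# Keys are the OPAL-style patterns; values are Python stand-ins.
patterns = {
    "/error/i": re.compile(r"error", re.IGNORECASE),
    "/error|exception|failed/i": re.compile(r"error|exception|failed", re.IGNORECASE),
    "/[Ee]rror/": re.compile(r"[Ee]rror"),
    "/timeout.*error/i": re.compile(r"timeout.*error", re.IGNORECASE),
}

# For each pattern, collect the sample lines it matches.
matches = {
    label: [s for s in samples if rx.search(s)]
    for label, rx in patterns.items()
}

for label, hits in matches.items():
    print(label, len(hits))
```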

Pattern 4: Multiple Error Patterns (Wide Net)

When to use: Catch different error expressions

opal
filter body ~ /error|exception|failed|failure/i
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
          container
| statsby count:count(), group_by(namespace, container)
| sort desc(count)

Result: Catches more errors than a single pattern

  • "error" - Standard error messages
  • "exception" - Java, Python exceptions
  • "failed" - Command/operation failures
  • "failure" - Alternative phrasing

Regex alternation: Use | for OR matching


Pattern 5: Wide Net Strategy (Multiple Conditions)

When to use: Maximum error detection across different log formats

Principle: Combine text matching + severity fields + stream filtering

opal
filter body ~ /error|exception|failed/i
    or stream = "stderr"
    or level = "error"
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
          pod, container
| statsby error_count:count(), group_by(namespace, container)
| sort desc(error_count)

Why this works:

  • body ~ /error/i - Catches text mentions
  • stream = "stderr" - Catches stderr output (might not have "error" in text)
  • level = "error" - Catches structured severity (CloudWatch, syslog)

Result (1h):

namespace          container                error_count    Source
default            opentelemetry-collector  1420           stderr (no "error" text)
observe            cluster-metrics          59             body matches /error/
default            prometheus-server        17             body matches /error/

Best practice: Always cast a wide net for error detection!
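
The wide net is an OR over independent signals, so it helps to see which signal catches which event. The following Python sketch mirrors the three conditions on invented sample events; the function name wide_net_reasons and the event dictionaries are hypothetical, and Python's `re` is only a stand-in for OPAL's regex matching.

```python
import re

ERROR_RX = re.compile(r"error|exception|failed", re.IGNORECASE)


def wide_net_reasons(event):
    """Return the list of wide-net conditions this event satisfies."""
    reasons = []
    if ERROR_RX.search(event.get("body", "")):
        reasons.append("body")
    if event.get("stream") == "stderr":
        reasons.append("stream")
    if event.get("level") == "error":
        reasons.append("level")
    return reasons


events = [
    {"body": "Exporting failed, will retry", "stream": "stderr"},
    {"body": "panic: runtime crash", "stream": "stderr"},
    {"body": "Error translating metrics", "stream": "stdout"},
    {"body": "upstream unavailable", "stream": "stdout", "level": "error"},
    {"body": "request served in 12ms", "stream": "stdout", "level": "info"},
]

# An event is kept if at least one condition fires.
caught = [e for e in events if wide_net_reasons(e)]
for e in caught:
    print(wide_net_reasons(e), e["body"])
```

The second and fourth events carry no error keyword in the body, so a text-only filter would miss them.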


Pattern 6: Recent Errors with Details

When to use: Troubleshooting recent issues, seeing actual error messages

opal
filter body ~ /error/i or stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
          pod,
          container,
          error_msg:body,
          error_time:format_time(timestamp, 'YYYY-MM-DD HH24:MI:SS')
| sort desc(timestamp)
| limit 20

Result: Latest 20 errors with full context and messages

Use case: "What are the most recent errors?"

Note: format_time() produces human-readable timestamps for display only; the query still sorts on the original timestamp column.


Text Search vs Regex Matching

Text Fields vs String Fields

Text fields (like body, message, log):

  • Unstructured text content
  • Use contains() for exact substring
  • Use ~ /pattern/ for regex

String fields (like stream, level, namespace):

  • Structured categorical values
  • Use =, != for exact match
  • Use ~ /pattern/ for regex

When to Use Each

Scenario            Approach            Example
Exact substring     contains()          contains(body, "timeout")
Case-insensitive    Regex with /i       body ~ /timeout/i
Pattern matching    Regex               body ~ /error[0-9]+/i
Multiple patterns   Regex alternation   body ~ /error|exception/i
Exact field value   Equality            stream = "stderr"

OPAL Regex Syntax Reference

CRITICAL: OPAL uses POSIX ERE (Extended Regular Expressions), NOT PCRE

Correct syntax:

opal
body ~ /pattern/         # Case-sensitive regex
body ~ /pattern/i        # Case-insensitive (i flag)
body ~ /error|exception/ # Alternation (OR)
body ~ /[Ee]rror/        # Character class
body ~ /timeout.*error/  # Sequence

Incorrect syntax (will fail or do literal string match):

opal
body ~ "pattern"              # String literal, NOT regex
body ~ "(?i)pattern"          # PCRE inline modifiers not supported
body ~ "error[0-9]+"          # String matching, not regex

Common regex patterns:

  • . - Any character
  • * - Zero or more
  • + - One or more
  • ? - Zero or one
  • [abc] - Character class
  • [a-z] - Range
  • | - Alternation (OR)
  • ^ - Start of line
  • $ - End of line

Wide Net Filtering Strategy

Why Wide Net Matters

Problem: Different logs express errors differently

  • Some use "error" in text
  • Some write to stderr (no "error" keyword)
  • Some use structured level field
  • Some use "exception", "failed", "failure"

Solution: Combine multiple conditions to catch all variations

Wide Net Template

opal
filter <text_patterns> or <severity_field> or <stream_field>
| make_col <context_fields>
| statsby count(), group_by(<group_fields>)

Example 1: Kubernetes Logs

opal
filter body ~ /error|exception|failed|failure/i
    or stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
          container
| statsby error_count:count(), group_by(namespace, container)
| sort desc(error_count)

Catches:

  • Body text: "Error connecting to database"
  • Stderr: Container crashes (no "error" in text)

Example 2: CloudWatch Logs

opal
filter message ~ /error|exception/i
    or level = "ERROR"
    or level = "FATAL"
| make_col logGroup, logStream
| statsby error_count:count(), group_by(logGroup, logStream)
| sort desc(error_count)

Catches:

  • Message text: "Connection error"
  • Structured level: {"level": "ERROR", "message": "..."}

Example 3: Application Logs (Generic)

opal
filter body ~ /error|exception|failed|timeout|refused/i
    or stream = "stderr"
    or level = "error"
    or severity = "ERROR"
| make_col service:string(resource_attributes."service.name"),
          msg:body
| statsby count:count(), sample:any(msg), group_by(service)
| sort desc(count)

Principle: Check all possible error indicators in the dataset


Aggregation and Sampling

Pattern 7: Error Counts by Group

When to use: "Which namespaces/services have most errors?"

opal
filter body ~ /error/i or stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
          container
| statsby error_count:count(), group_by(namespace, container)
| sort desc(error_count)
| limit 20

Result: Top 20 error sources

Aggregation: statsby count() counts all matching events per group


Pattern 8: Error Counts with Sample Messages

When to use: "Show me top errors WITH example messages"

Critical function: any() - Returns one sample value from the group

opal
filter body ~ /error/i
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
          container,
          error_snippet:body
| statsby top_errors:count(),
          sample_msg:any(error_snippet),
          group_by(namespace, container)
| sort desc(top_errors)
| limit 10

Result (1h):

namespace   container              top_errors  sample_msg
default     prometheus-server      17          "Error translating OTLP metrics to Prometheus write request"
kube-system calico-node           4           "Watch error received from Upstream"
default     frontend              2           "Error: 8 RESOURCE_EXHAUSTED"
default     cartservice           1           "Can't access cart storage... connect to redis"

Use case: See error counts AND understand what the errors look like

Why any(): Provides context without listing all error messages
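
Conceptually, count() plus any() reduces each group to a row count and one representative message. A rough Python equivalent is sketched below, assuming the sample is simply the first value seen; which value OPAL's any() actually returns should be treated as arbitrary. The helper name count_with_sample and the rows are illustrative.

```python
def count_with_sample(rows, keys, sample_field):
    """Group rows by `keys`, counting rows and keeping one sample value per group."""
    groups = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in groups:
            # Keep the first value seen as the sample (an arbitrary choice).
            groups[key] = {"count": 0, "sample": row[sample_field]}
        groups[key]["count"] += 1
    # Largest groups first, like `sort desc(top_errors)`.
    return sorted(groups.items(), key=lambda item: item[1]["count"], reverse=True)


rows = [
    {"namespace": "default", "container": "prometheus-server", "body": "Error translating OTLP metrics"},
    {"namespace": "default", "container": "prometheus-server", "body": "Error sending batch"},
    {"namespace": "kube-system", "container": "calico-node", "body": "Watch error received from Upstream"},
]

summary = count_with_sample(rows, ["namespace", "container"], "body")
for key, stats in summary:
    print(key, stats["count"], stats["sample"])
```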


Pattern 9: Error Trends Over Time

When to use: "Show me error trends in the last 24 hours"

Use timechart for time-series aggregation (NOT statsby)

opal
filter body ~ /error/i or stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name")
| timechart 1h, error_count:count(), group_by(namespace)

Result: Time-series data (multiple rows per namespace)

_c_bucket    namespace    error_count
2025-11-14T00:00:00Z    default    145
2025-11-14T01:00:00Z    default    132
2025-11-14T02:00:00Z    default    89
...

Output includes:

  • _c_bucket - Time bucket
  • _c_valid_from, _c_valid_to - Bucket boundaries
  • One row per (namespace, time_bucket)

For trending: See time-series-analysis skill
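
The difference between a single summary and a time series comes down to whether the time bucket is part of the grouping key. The Python sketch below illustrates that idea on invented events; it is not how timechart is implemented, and hour_bucket is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


events = [
    (datetime(2025, 11, 14, 0, 5, tzinfo=timezone.utc), "default"),
    (datetime(2025, 11, 14, 0, 40, tzinfo=timezone.utc), "default"),
    (datetime(2025, 11, 14, 1, 10, tzinfo=timezone.utc), "default"),
    (datetime(2025, 11, 14, 1, 15, tzinfo=timezone.utc), "observe"),
]

# statsby-like: one count per namespace.
summary = Counter(ns for _, ns in events)

# timechart-like: one count per (hour bucket, namespace).
series = Counter((hour_bucket(ts), ns) for ts, ns in events)

for (bucket, ns), count in sorted(series.items()):
    print(bucket.isoformat(), ns, count)
```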


Pattern 10: Targeted Component Search

When to use: "Find all Redis connection errors"

Use specific regex targeting component names

opal
filter body ~ /redis.*error|connection.*redis|redis.*timeout/i
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
          pod,
          error_msg:body,
          error_time:format_time(timestamp, 'YYYY-MM-DD HH24:MI:SS')
| sort desc(timestamp)
| limit 20

Result: Only Redis-related errors

Common targeted searches:

  • Database: /postgres|mysql|database.*error/i
  • Network: /timeout|connection.*refused|network.*error/i
  • Authentication: /auth.*failed|unauthorized|403|401/i
  • Resources: /out of memory|resource exhausted|disk full/i
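
One caveat with bare status codes such as 401 or 403: an unanchored pattern also matches longer numbers that happen to contain those digits. The Python sketch below demonstrates this on invented messages and shows one possible tightening that requires a non-digit (or the string boundary) on each side. It assumes parenthesized groups are available in OPAL as they are in POSIX ERE; verify against your environment before relying on it.

```python
import re

# The authentication pattern from the list above.
loose = re.compile(r"auth.*failed|unauthorized|403|401", re.IGNORECASE)

# Possible tightening: the code must not be embedded in a longer number.
tight = re.compile(
    r"auth.*failed|unauthorized|(^|[^0-9])(401|403)([^0-9]|$)",
    re.IGNORECASE,
)

samples = [
    "HTTP 401: Unauthorized",
    "Auth failed: invalid token",
    "request completed in 4013ms",
    "order 14030 shipped",
]

loose_hits = [s for s in samples if loose.search(s)]
tight_hits = [s for s in samples if tight.search(s)]

print(len(loose_hits), len(tight_hits))
```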

Context Extraction

Handling Nested Fields

Problem: Kubernetes metadata is often nested in resource_attributes.*

Correct syntax: Quote fields with dots

opal
# CORRECT
make_col namespace:string(resource_attributes."k8s.namespace.name"),
         pod:string(resource_attributes."k8s.pod.name"),
         node:string(resource_attributes."k8s.node.name")

# WRONG (will fail)
make_col namespace:resource_attributes.k8s.namespace.name

Rule: object."field.with.dots" - quote only the field name
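
A way to picture why the quotes matter: if the attribute object stores each key as one flat string containing dots (an assumption here, in line with OpenTelemetry attribute naming), then "k8s.namespace.name" is a single key, not three levels of nesting. The Python sketch below uses a hypothetical event dictionary to contrast the two readings.

```python
event = {
    "body": "Error: connection timeout to redis:6379",
    "resource_attributes": {
        # Each key is one flat string that happens to contain dots.
        "k8s.namespace.name": "default",
        "k8s.pod.name": "cartservice-abc",
    },
}


def get_attr(event, key):
    """Look up one flat, dotted key inside resource_attributes."""
    return event["resource_attributes"].get(key)


def get_by_path(obj, path):
    """Walk the path one dot-separated segment at a time; None if a segment is missing."""
    for segment in path.split("."):
        if not isinstance(obj, dict) or segment not in obj:
            return None
        obj = obj[segment]
    return obj


# Quoted reading: the whole dotted name is one key.
namespace = get_attr(event, "k8s.namespace.name")

# Unquoted reading: every dot is treated as a level of nesting.
unquoted = get_by_path(event, "resource_attributes.k8s.namespace.name")

print(namespace, unquoted)
```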

Common Nested Field Patterns

Kubernetes:

opal
resource_attributes."k8s.namespace.name"
resource_attributes."k8s.pod.name"
resource_attributes."k8s.container.name"
resource_attributes."k8s.node.name"
resource_attributes."k8s.deployment.name"

OpenTelemetry:

opal
resource_attributes."service.name"
resource_attributes."service.version"
resource_attributes."deployment.environment"
attributes."http.status_code"
attributes."db.name"

CloudWatch:

opal
resource."aws.region"
resource."aws.account_id"

Extracting Context Fields

Template:

opal
make_col service:string(resource_attributes."service.name"),
         namespace:string(resource_attributes."k8s.namespace.name"),
         pod:string(resource_attributes."k8s.pod.name"),
         container:container,                    # Top-level field
         error_msg:body

Type casting: Use string() for nested variant/object fields


Complete Examples

Example 1: Top 10 Error Types in K8s Logs

User question: "Give me top 10 error types in Kubernetes container logs. Tell me which namespaces have most errors."

Step 1: Discovery

python
discover_context("kubernetes logs")
# Result: Kubernetes Explorer/Kubernetes Logs (ID: 42161740)

discover_context(dataset_id="42161740")
# Fields: body (text), stream (string), namespace, pod, container
# Nested: resource_attributes.k8s.*

Step 2: Query

opal
filter body ~ /error|exception|failed/i or stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
          container,
          error_snippet:body
| statsby error_count:count(),
          sample_error:any(error_snippet),
          group_by(namespace, container)
| sort desc(error_count)
| limit 10

Step 3: Result

namespace   container                error_count  sample_error
default     opentelemetry-collector  1420         "Exporting failed..."
default     recommendationservice    118          "gRPC connection timeout"
observe     cluster-metrics          59           "Error translating metrics"
...

Explanation:

  • Wide net filter catches text + stderr
  • Extract namespace from nested field
  • Count + sample message per group
  • Sort by volume, limit to top 10

Example 2: Recent Errors in Production Namespace

User question: "Show me recent errors from the production namespace in the last hour"

Step 1: Query

opal
filter body ~ /error|exception/i or stream = "stderr"
| filter string(resource_attributes."k8s.namespace.name") = "production"
| make_col pod:string(resource_attributes."k8s.pod.name"),
          container,
          error_msg:body,
          error_time:format_time(timestamp, 'YYYY-MM-DD HH24:MI:SS')
| sort desc(timestamp)
| limit 20

Step 2: Result

pod                        container    error_time           error_msg
frontend-abc123           server       2025-11-14 17:45:32  "Error: RESOURCE_EXHAUSTED"
cartservice-def456        cart         2025-11-14 17:42:18  "Can't connect to redis:6379"
...

Explanation:

  • Wide net filter for errors
  • Second filter for specific namespace
  • Format timestamp for readability
  • Sort by time (most recent first)

Example 3: Database Connection Errors

User question: "Find all database connection errors across all services"

Step 1: Query

opal
filter body ~ /database.*error|db.*connection|postgres.*error|mysql.*failed/i
| make_col service:string(resource_attributes."service.name"),
          namespace:string(resource_attributes."k8s.namespace.name"),
          error_msg:body
| statsby error_count:count(),
          sample:any(error_msg),
          group_by(service, namespace)
| sort desc(error_count)

Step 2: Result

service          namespace   error_count  sample
payment-service  production  24           "PostgreSQL connection timeout to db:5432"
user-service     production  12           "MySQL error: Too many connections"
...

Explanation:

  • Targeted regex for database-related errors
  • Extract service and namespace context
  • Count + sample per service
  • Identify which services have DB issues

Example 4: Error Volume Comparison

User question: "Compare error volumes between production and staging namespaces over the last 24 hours"

Step 1: Query

opal
filter body ~ /error|exception|failed/i or stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name")
| filter namespace = "production" or namespace = "staging"
| timechart 1h, error_count:count(), group_by(namespace)

Step 2: Result (time-series)

_c_bucket              namespace    error_count
2025-11-13T18:00:00Z  production   342
2025-11-13T18:00:00Z  staging      89
2025-11-13T19:00:00Z  production   298
2025-11-13T19:00:00Z  staging      102
...

Explanation:

  • Wide net error detection
  • Filter to specific namespaces
  • Time-series aggregation (hourly buckets)
  • Compare error trends visually

Example 5: Authentication Failures

User question: "Show me all authentication failures in the last hour"

Step 1: Query

opal
filter body ~ /auth.*failed|unauthorized|403|401|authentication.*error/i
| make_col service:string(resource_attributes."service.name"),
          namespace:string(resource_attributes."k8s.namespace.name"),
          pod:string(resource_attributes."k8s.pod.name"),
          error_msg:body,
          error_time:format_time(timestamp, 'YYYY-MM-DD HH24:MI:SS')
| sort desc(timestamp)
| limit 50

Step 2: Result

service        namespace   pod              error_time           error_msg
api-gateway    production  api-gw-abc123   2025-11-14 17:52:14  "HTTP 401: Unauthorized"
auth-service   production  auth-xyz789     2025-11-14 17:48:32  "Auth failed: invalid token"
...

Explanation:

  • Targeted regex for auth-related errors
  • Full context extraction
  • Recent errors first
  • Higher limit (50) to catch patterns

Common Pitfalls

Pitfall 1: Using String Quotes for Regex

WRONG:

opal
filter body ~ "error"           # String literal matching
filter body ~ "(?i)error"       # PCRE syntax not supported
filter body ~ "error|exception" # String matching, not regex alternation

CORRECT:

opal
filter body ~ /error/           # Regex (case-sensitive)
filter body ~ /error/i          # Regex (case-insensitive)
filter body ~ /error|exception/ # Regex alternation

Symptom: Query returns 0 results or unexpected matches

Fix: Use forward slashes /pattern/ for regex, NOT string quotes


Pitfall 2: Not Quoting Nested Fields

WRONG:

opal
make_col namespace:resource_attributes.k8s.namespace.name

CORRECT:

opal
make_col namespace:string(resource_attributes."k8s.namespace.name")

Symptom: "Field not found" error

Fix: Quote field names with dots: object."field.with.dots"


Pitfall 3: Assuming Field Names

WRONG:

opal
# Assuming all logs have "message" field
filter message ~ /error/i

CORRECT:

opal
# Check the schema first with discover_context(dataset_id="...")
# K8s logs use "body", CloudWatch uses "message"
# Then filter on the field the schema actually reports
filter body ~ /error/i

Symptom: "Column not found: message" error

Fix: ALWAYS run discover_context(dataset_id="...") to get exact field names


Pitfall 4: Case-Sensitive Text Search

WRONG:

opal
filter contains(body, "error")  # Misses "Error", "ERROR"

CORRECT:

opal
# Option 1: Multiple contains
filter contains(body, "error") or contains(body, "ERROR") or contains(body, "Error")

# Option 2: Regex (better)
filter body ~ /error/i

Symptom: Missing errors that use different capitalization

Fix: Use regex with /i flag for case-insensitive matching


Pitfall 5: Narrow Error Detection

WRONG:

opal
filter stream = "stderr"  # Misses stdout errors

CORRECT:

opal
filter body ~ /error|exception|failed/i
    or stream = "stderr"

Symptom: Missing errors that don't match a single condition

Fix: Use wide net strategy - combine multiple error indicators


Pitfall 6: Using statsby for Time-Series

WRONG:

opal
# Trying to get hourly trends
filter body ~ /error/i
| statsby count(), group_by(namespace)  # Returns ONE row per namespace

CORRECT:

opal
# Use timechart for time-series
filter body ~ /error/i
| timechart 1h, count(), group_by(namespace)  # Returns multiple rows (time buckets)

Symptom: Getting summary instead of trends

Fix: Use timechart for time-series, statsby for single summary


Pitfall 7: Forgetting Type Casting

WRONG:

opal
make_col namespace:resource_attributes."k8s.namespace.name"
# Might be variant type, causes issues in group_by

CORRECT:

opal
make_col namespace:string(resource_attributes."k8s.namespace.name")

Symptom: Aggregation errors or unexpected grouping

Fix: Cast nested fields to expected type: string(), int64(), etc.


Cross-References

Related Skills

filtering-event-datasets:

  • Basic filtering syntax (contains(), ~, comparison operators)
  • When to use filter vs aggregation
  • Use for: Simple known-value filtering

aggregating-event-datasets:

  • statsby for aggregations
  • make_col for derived columns
  • Aggregation functions (count(), sum(), any())
  • Use for: Counting, grouping, summarizing

time-series-analysis:

  • timechart for temporal trending
  • Time bucket configuration
  • Use for: Error trends over time

analyzing-apm-data:

  • Span-based error analysis (error field, error_message)
  • Service-level error tracking
  • Use for: APM/tracing error investigation

aggregating-gauge-metrics:

  • Error count metrics (error_count_5m)
  • Volume trending with metrics
  • Use for: High-level error volume (fast)

When to Use Which Skill

User Question                             Skill to Use                 Why
"Show me errors in K8s logs"              investigating-textual-data   Text search in logs
"What's the error rate for my service?"   analyzing-apm-data           APM metrics
"Count errors by service (metrics)"       aggregating-gauge-metrics    Pre-aggregated metrics
"Show error trends over 24h"              time-series-analysis         Time-series aggregation
"Filter logs for namespace=production"    filtering-event-datasets     Simple filtering
"Count errors by container"               aggregating-event-datasets   Event aggregation

Decision Matrix

User asks about errors
        |
        v
  From what source?
        |
        +-- Logs → Textual investigation → investigating-textual-data (logs, events, regex, wide net)
        +-- Spans → APM → analyzing-apm-data (spans, error field)
        +-- Metrics → Volume → aggregating-gauge-metrics (error_count_5m)
        +-- No source specified → Discover textual datasets (kubernetes logs, cloudwatch, etc.)

Summary

Key Takeaways:

  1. Always discover schema first - Field names vary by dataset
  2. Use regex with /pattern/i - Forward slashes, NOT string quotes
  3. Cast a wide net - Combine text patterns + severity + stream
  4. Quote nested fields - resource_attributes."k8s.namespace.name"
  5. Sample with any() - Get error counts WITH example messages
  6. Metrics vs logs - Metrics for volume, logs for details
  7. statsby vs timechart - Summary vs time-series

Common workflow:

1. discover_context("user intent keywords")
2. discover_context(dataset_id="...")  # Get schema
3. Write wide net filter (regex + severity + stream)
4. Extract context (namespace, service, pod)
5. Aggregate with samples (count + any())
6. Sort and limit results
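
Steps 3 to 6 can be assembled mechanically once the field names are known. The Python sketch below builds the OPAL text for a wide-net query from those names. build_wide_net_query is a hypothetical helper, not part of any tool described here, and it assumes the group fields are existing top-level columns so that no make_col step is needed.

```python
def build_wide_net_query(message_field, patterns, extra_conditions, group_fields, limit=20):
    """Assemble a wide-net OPAL query string from discovered field names."""
    regex = "|".join(patterns)
    # Text match first, then any severity or stream conditions, all OR-ed together.
    conditions = [f"{message_field} ~ /{regex}/i"] + list(extra_conditions)
    group_by = ", ".join(group_fields)
    lines = [
        "filter " + "\n    or ".join(conditions),
        f"| statsby error_count:count(), sample:any({message_field}), group_by({group_by})",
        "| sort desc(error_count)",
        f"| limit {limit}",
    ]
    return "\n".join(lines)


query = build_wide_net_query(
    message_field="body",
    patterns=["error", "exception", "failed"],
    extra_conditions=['stream = "stderr"'],
    group_fields=["container"],
)
print(query)
```

Swapping in message_field="message" and level conditions would produce the CloudWatch variant of the same query shape.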

For more:

  • OPAL syntax → filtering-event-datasets
  • Aggregations → aggregating-event-datasets
  • Trends → time-series-analysis
  • APM errors → analyzing-apm-data
  • Pattern discovery → analyzing-text-patterns
