Investigating Textual Data in Event Datasets
Investigate and analyze textual data in logs, span events, and other event datasets using OPAL filtering and pattern matching. Use when analyzing error messages, searching log patterns, troubleshooting application issues, or finding specific events in textual data. Covers discovering textual datasets, error detection patterns, regex matching, wide net filtering strategies, aggregation with sampling, and context extraction from nested fields.
For pre-aggregated metrics, see aggregating-gauge-metrics skill. For distributed tracing analysis, see analyzing-apm-data skill. For time-series trends, see time-series-analysis skill.
Table of Contents
- When to Use This Skill
- Quick Reference
- Understanding Textual Datasets
- Discovery Workflow
- Error Detection Patterns
- Text Search vs Regex Matching
- Wide Net Filtering Strategy
- Aggregation and Sampling
- Context Extraction
- Complete Examples
- Common Pitfalls
- Cross-References
When to Use This Skill
Use this skill when users ask questions like:
Error Analysis:
- "Show me errors in Kubernetes logs"
- "What are the top 10 error types in my containers?"
- "Find all Redis connection errors"
- "Which namespaces have the most errors?"
Pattern Search:
- "Search logs for timeout messages"
- "Find all database connection failures"
- "Show me warnings in stderr"
- "Get recent errors from CloudWatch logs"
Troubleshooting:
- "Are there any errors related to authentication?"
- "Show me error trends over the last 24 hours"
- "Find all exceptions in the frontend service"
- "What errors happened in the last hour?"
When NOT to use this skill:
- Metrics queries (error counts from metrics) → Use aggregating-gauge-metrics
- APM/tracing analysis (spans, traces) → Use analyzing-apm-data
- Time-series trending → Use time-series-analysis
- Simple filtering (known field values) → Use filtering-event-datasets
Quick Reference
Error Detection Patterns
| Pattern | OPAL Query | Use Case |
|---|---|---|
| Stream filtering | `filter stream = "stderr"` | Container stderr logs |
| Text search | `filter contains(body, "error")` | Exact substring (case-sensitive) |
| Case-insensitive regex | `filter body ~ /error/i` | Flexible error matching |
| Multiple patterns | `filter body ~ /error\|exception\|failed/i` | Wide net approach |
| Wide net | `filter body ~ /error/i or stream = "stderr"` | Multiple conditions |
| Recent errors | `filter ... \| sort desc(timestamp) \| limit 20` | Latest events |
Common Field Names by Dataset Type
| Dataset Type | Message Field | Severity Field | Context Fields |
|---|---|---|---|
| K8s Logs | `body` | `stream` | `namespace`, `pod`, `container` |
| CloudWatch | `message` | `level` | `logGroup`, `logStream` |
| Spans | `error_message` | `error` (bool) | `service_name`, `span_name` |
| Span Events | `event_name` | N/A | `trace_id`, `span_id` |
Critical: Always inspect dataset schema first to identify correct field names!
Understanding Textual Datasets
What Are Textual Event Datasets?
Event datasets contain point-in-time log entries with text messages. Each event has:
- Single timestamp (not a duration)
- Text field with log message (`body`, `message`, `log`, etc.)
- Severity indicators (`level`, `stream`, `severity`)
- Context fields (service, namespace, pod, container)
Common Dataset Types
1. Container Logs (Kubernetes, Docker):
- Interface: `log`
- Message field: `body`
- Severity: `stream` ("stdout", "stderr")
- Context: Nested `resource_attributes.k8s.*`
2. Cloud Provider Logs (CloudWatch, Stackdriver):
- Interface: `log`
- Message field: `message` or `log`
- Severity: `level` or `severity`
- Context: `logGroup`, `logStream`, `resource.*`
3. Span Events (OpenTelemetry):
- Interface: `log`
- Message field: `event_name`
- Context: `trace_id`, `span_id`, `service_name`
4. Application Logs (Custom):
- Interface: `log`
- Varies by implementation
Key Difference from Metrics
| Aspect | Event Datasets (Logs) | Metrics |
|---|---|---|
| Query approach | `filter` → `statsby` | `align` → `aggregate` |
| Data granularity | Individual log entries | Pre-aggregated values |
| Best for | Detailed investigation, text search | Volume trends, counts |
| Performance | Slower for large volumes | Fast, optimized |
Rule: Use metrics for volume/trends, use logs for detailed investigation.
Discovery Workflow
Step 1: Identify User Intent
Listen for dataset hints:
- "kubernetes logs" → K8s container logs
- "cloudwatch" → AWS CloudWatch logs
- "stderr" → Container error stream
- "application logs" → Generic logs
- "span events" → OpenTelemetry events
No specific hint? Use discovery to find textual datasets.
Step 2: Discover Textual Datasets
# General search
discover_context("logs")
# Specific search
discover_context("kubernetes logs")
discover_context("cloudwatch")
discover_context("application errors")
# Filter by interface type
discover_context("", interface_filter="log")
Look for:
- Interface: `log` (event datasets with text)
- Category: "Logs", "Events"
- Dataset names with "Log", "Event", "CloudWatch", "K8s"
Step 3: Get Detailed Schema
CRITICAL: Always get field names before writing queries!
# Get complete field list
discover_context(dataset_id="42161740")
# Identify:
# 1. Message field: body, message, log, event_name
# 2. Severity field: stream, level, severity
# 3. Context fields: namespace, pod, service_name, etc.
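The identification step above can be sketched in Python. This is only an illustration of the selection logic, not part of the discovery tooling: `pick_field` and the two candidate lists are hypothetical names, the candidates are the field names listed in the comments above, and the schema is assumed to be available as a plain list of field names.

```python
# Hypothetical helper: pick the first known candidate present in a schema.
MESSAGE_CANDIDATES = ["body", "message", "log", "event_name"]
SEVERITY_CANDIDATES = ["stream", "level", "severity"]


def pick_field(schema_fields, candidates):
    """Return the first candidate that appears in schema_fields, else None."""
    present = set(schema_fields)
    for name in candidates:
        if name in present:
            return name
    return None


# Assumed field list for a Kubernetes log dataset.
k8s_fields = ["timestamp", "body", "stream", "namespace", "pod", "container"]
message_field = pick_field(k8s_fields, MESSAGE_CANDIDATES)
severity_field = pick_field(k8s_fields, SEVERITY_CANDIDATES)
print(message_field, severity_field)
```

If no candidate is present, the helper returns `None`, which is the cue to read the schema by hand rather than guess a field name.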
Step 4: Check Field Samples
Pay attention to:
- Field type: `text`, `string`, `keyword`
- Sample values: See actual field content
- Nested fields: `resource_attributes.*`, `attributes.*`
Example schema output:
body (text) - Sample: "Error: connection timeout to redis:6379"
stream (string) - Sample: "stderr"
namespace (string) - Sample: "default"
resource_attributes (object) - Nested fields:
- k8s.namespace.name
- k8s.pod.name
Error Detection Patterns
Pattern 1: Stream Filtering (Container Logs)
When to use: Kubernetes/Docker logs with stream field
Assumption: Container errors are typically written to stderr
filter stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
pod, container
| statsby error_count:count(), group_by(namespace, pod, container)
| sort desc(error_count)
| limit 20
Result: Error volume by container (1h):
namespace pod container error_count
default opentelemetry-collector-xyz otel-collector 1422
default recommendationservice-abc server 118
observe cluster-metrics-def metrics-agent 60
Use case: "Which containers are generating the most stderr output?"
Limitation: Not all errors go to stderr - some apps write errors to stdout
Pattern 2: Text Search with contains()
When to use: Exact substring matching (case-sensitive)
filter contains(body, "error") or contains(body, "ERROR") or contains(body, "Error")
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
error_snippet:body
| sort desc(timestamp)
| limit 20
Result: Recent errors with exact text match
Pros:
- Simple syntax
- Fast for exact matches
Cons:
- Case-sensitive (must check "error", "ERROR", "Error")
- No pattern flexibility
When to use instead of regex: Known exact string, simple search
Pattern 3: Case-Insensitive Regex
When to use: Flexible error matching with case variations
CRITICAL SYNTAX: Use /pattern/i with forward slashes (NOT string quotes)
filter body ~ /error/i
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
container
| statsby error_count:count(), group_by(namespace, container)
| sort desc(error_count)
| limit 20
Result (1h):
namespace container error_count
observe cluster-metrics 59
default prometheus-server 17
kube-system calico-node 4
Regex patterns:
- `/error/i` - Matches "error", "ERROR", "Error"
- `/error|exception|failed/i` - Alternation (OR)
- `/[Ee]rror/` - Character class (case-sensitive)
- `/timeout.*error/i` - Sequence matching
Syntax rules:
- CORRECT: `body ~ /pattern/i`
- WRONG: `body ~ "pattern"` (string literal, not regex)
- WRONG: `body ~ "(?i)pattern"` (PCRE not supported)
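To see why the `/i` flag matters, it can help to compare case-sensitive substring matching with case-insensitive regex matching outside OPAL. The Python sketch below is only an illustration: Python's `re` module is not OPAL's regex engine, and the sample lines are made up. For a plain literal such as `error`, the matching behavior should be comparable.

```python
import re

# Sample log lines (illustrative only, not taken from a real dataset).
lines = [
    "Error: connection timeout to redis:6379",
    "ERROR translating metrics",
    "request completed with error code 7",
    "request completed successfully",
]

# Case-sensitive substring test, analogous to contains(body, "error").
substring_hits = [line for line in lines if "error" in line]

# Case-insensitive regex, analogous to body ~ /error/i.
pattern = re.compile(r"error", re.IGNORECASE)
regex_hits = [line for line in lines if pattern.search(line)]

print(len(substring_hits))  # only the all-lowercase occurrence is found
print(len(regex_hits))      # all three capitalizations are found
```

The substring test finds one line and the regex finds three, which is the gap the case-insensitive pattern is meant to close.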
Pattern 4: Multiple Error Patterns (Wide Net)
When to use: Catch different error expressions
filter body ~ /error|exception|failed|failure/i
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
container
| statsby count:count(), group_by(namespace, container)
| sort desc(count)
Result: Catches more errors than single pattern
- "error" - Standard error messages
- "exception" - Java, Python exceptions
- "failed" - Command/operation failures
- "failure" - Alternative phrasing
Regex alternation: Use | for OR matching
Pattern 5: Wide Net Strategy (Multiple Conditions)
When to use: Maximum error detection across different log formats
Principle: Combine text matching + severity fields + stream filtering
filter body ~ /error|exception|failed/i
or stream = "stderr"
or level = "error"
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
pod, container
| statsby error_count:count(), group_by(namespace, container)
| sort desc(error_count)
Why this works:
- `body ~ /error/i` - Catches text mentions
- `stream = "stderr"` - Catches stderr output (might not have "error" in text)
- `level = "error"` - Catches structured severity (CloudWatch, syslog)
Result (1h):
namespace container error_count Source
default opentelemetry-collector 1420 stderr (no "error" text)
observe cluster-metrics 59 body matches /error/
default prometheus-server 17 body matches /error/
Best practice: Always cast a wide net for error detection!
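The wide net is a logical OR over independent error indicators. A minimal Python sketch of that predicate follows, assuming records are plain dictionaries; `is_error` and the sample records are hypothetical and exist only to show that each condition catches a record the others miss.

```python
import re

ERROR_TEXT = re.compile(r"error|exception|failed", re.IGNORECASE)


def is_error(record):
    """Wide-net predicate: text match OR stderr stream OR error level."""
    return bool(
        ERROR_TEXT.search(record.get("body", ""))
        or record.get("stream") == "stderr"
        or record.get("level") == "error"
    )


# Hypothetical records; the first three are each caught by a different condition.
records = [
    {"body": "Exporting failed, will retry", "stream": "stdout"},
    {"body": "panic: nil pointer dereference", "stream": "stderr"},
    {"body": "upstream unavailable", "level": "error"},
    {"body": "GET /healthz 200", "stream": "stdout"},
]

matched = [r for r in records if is_error(r)]
print(len(matched))
```

Dropping any one of the three conditions loses one of the first three records, which is the argument for combining them.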
Pattern 6: Recent Errors with Details
When to use: Troubleshooting recent issues, seeing actual error messages
filter body ~ /error/i or stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
pod,
container,
error_msg:body,
error_time:format_time(timestamp, 'YYYY-MM-DD HH24:MI:SS')
| sort desc(timestamp)
| limit 20
Result: Latest 20 errors with full context and messages
Use case: "What are the most recent errors?"
Note: Use format_time() to produce human-readable timestamps (display only)
Text Search vs Regex Matching
Text Fields vs String Fields
Text fields (like `body`, `message`, `log`):
- Unstructured text content
- Use `contains()` for exact substring
- Use `~ /pattern/` for regex
String fields (like `stream`, `level`, `namespace`):
- Structured categorical values
- Use `=`, `!=` for exact match
- Use `~ /pattern/` for regex
When to Use Each
| Scenario | Approach | Example |
|---|---|---|
| Exact substring | `contains()` | `contains(body, "timeout")` |
| Case-insensitive | Regex with `/i` | `body ~ /timeout/i` |
| Pattern matching | Regex | `body ~ /error[0-9]+/i` |
| Multiple patterns | Regex alternation | `body ~ /error\|exception/i` |
| Exact field value | Equality | `stream = "stderr"` |
OPAL Regex Syntax Reference
CRITICAL: OPAL uses POSIX ERE (Extended Regular Expressions), NOT PCRE
Correct syntax:
body ~ /pattern/ # Case-sensitive regex
body ~ /pattern/i # Case-insensitive (i flag)
body ~ /error|exception/ # Alternation (OR)
body ~ /[Ee]rror/ # Character class
body ~ /timeout.*error/ # Sequence
Incorrect syntax (will fail or do literal string match):
body ~ "pattern" # String literal, NOT regex
body ~ "(?i)pattern" # PCRE inline modifiers not supported
body ~ "error[0-9]+" # String matching, not regex
Common regex patterns:
- `.` - Any character
- `*` - Zero or more
- `+` - One or more
- `?` - Zero or one
- `[abc]` - Character class
- `[a-z]` - Range
- `|` - Alternation (OR)
- `^` - Start of line
- `$` - End of line
Wide Net Filtering Strategy
Why Wide Net Matters
Problem: Different logs express errors differently
- Some use "error" in text
- Some write to stderr (no "error" keyword)
- Some use structured `level` field
- Some use "exception", "failed", "failure"
Solution: Combine multiple conditions to catch all variations
Wide Net Template
filter <text_patterns> or <severity_field> or <stream_field>
| make_col <context_fields>
| statsby count(), group_by(<group_fields>)
Example 1: Kubernetes Logs
filter body ~ /error|exception|failed|failure/i
or stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
container
| statsby error_count:count(), group_by(namespace, container)
| sort desc(error_count)
Catches:
- Body text: "Error connecting to database"
- Stderr: Container crashes (no "error" in text)
Example 2: CloudWatch Logs
filter message ~ /error|exception/i
or level = "ERROR"
or level = "FATAL"
| make_col logGroup, logStream
| statsby error_count:count(), group_by(logGroup, logStream)
| sort desc(error_count)
Catches:
- Message text: "Connection error"
- Structured level: `{"level": "ERROR", "message": "..."}`
Example 3: Application Logs (Generic)
filter body ~ /error|exception|failed|timeout|refused/i
or stream = "stderr"
or level = "error"
or severity = "ERROR"
| make_col service:string(resource_attributes."service.name"),
msg:body
| statsby count:count(), sample:any(msg), group_by(service)
| sort desc(count)
Principle: Check all possible error indicators in the dataset
Aggregation and Sampling
Pattern 7: Error Counts by Group
When to use: "Which namespaces/services have most errors?"
filter body ~ /error/i or stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
container
| statsby error_count:count(), group_by(namespace, container)
| sort desc(error_count)
| limit 20
Result: Top 20 error sources
Aggregation: statsby count() counts all matching events per group
Pattern 8: Error Counts with Sample Messages
When to use: "Show me top errors WITH example messages"
Critical function: any() - Returns one sample value from the group
filter body ~ /error/i
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
container,
error_snippet:body
| statsby top_errors:count(),
sample_msg:any(error_snippet),
group_by(namespace, container)
| sort desc(top_errors)
| limit 10
Result (1h):
namespace container top_errors sample_msg
default prometheus-server 17 "Error translating OTLP metrics to Prometheus write request"
kube-system calico-node 4 "Watch error received from Upstream"
default frontend 2 "Error: 8 RESOURCE_EXHAUSTED"
default cartservice 1 "Can't access cart storage... connect to redis"
Use case: See error counts AND understand what the errors look like
Why any(): Provides context without listing all error messages
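The count-plus-sample idea can be sketched in Python to make the shape of the result concrete. This is an illustration under assumptions, not a description of how OPAL implements `any()`: the records are hypothetical, and keeping the first message seen per group is an arbitrary choice made here.

```python
import re
from collections import defaultdict

# Hypothetical log records with namespace, container, and body fields.
records = [
    {"namespace": "default", "container": "frontend", "body": "Error: 8 RESOURCE_EXHAUSTED"},
    {"namespace": "default", "container": "frontend", "body": "error: upstream reset"},
    {"namespace": "kube-system", "container": "calico-node", "body": "Watch error received from Upstream"},
    {"namespace": "default", "container": "frontend", "body": "request ok"},
]

pattern = re.compile(r"error", re.IGNORECASE)
counts = defaultdict(int)
samples = {}

for rec in records:
    if not pattern.search(rec["body"]):
        continue
    key = (rec["namespace"], rec["container"])
    counts[key] += 1
    # Keep one message per group. Which one is kept is arbitrary (first seen
    # here); do not rely on a particular row being returned.
    samples.setdefault(key, rec["body"])

# Sort groups by descending count, like `sort desc(top_errors)`.
summary = sorted(
    ((ns, c, n, samples[(ns, c)]) for (ns, c), n in counts.items()),
    key=lambda row: row[2],
    reverse=True,
)
for row in summary:
    print(row)
```

Each output row carries the group keys, the count, and a single representative message, which is the same shape as the result table above.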
Pattern 9: Error Trends Over Time
When to use: "Show me error trends in the last 24 hours"
Use timechart for time-series aggregation (NOT statsby)
filter body ~ /error/i or stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name")
| timechart 1h, error_count:count(), group_by(namespace)
Result: Time-series data (multiple rows per namespace)
_c_bucket namespace error_count
2025-11-14T00:00:00Z default 145
2025-11-14T01:00:00Z default 132
2025-11-14T02:00:00Z default 89
...
Output includes:
- `_c_bucket` - Time bucket
- `_c_valid_from`, `_c_valid_to` - Bucket boundaries
- One row per (namespace, time_bucket)
For trending: See time-series-analysis skill
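The difference between a single summary and a time-series result comes down to whether the time bucket is part of the grouping key. A minimal Python sketch, assuming hourly buckets and hypothetical events that have already passed the error filter (`hour_bucket` is an illustrative helper, not an OPAL function):

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical (timestamp, namespace) pairs for already-filtered events.
events = [
    (datetime(2025, 11, 14, 0, 5, tzinfo=timezone.utc), "default"),
    (datetime(2025, 11, 14, 0, 40, tzinfo=timezone.utc), "default"),
    (datetime(2025, 11, 14, 1, 15, tzinfo=timezone.utc), "default"),
    (datetime(2025, 11, 14, 1, 20, tzinfo=timezone.utc), "observe"),
]


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


# Summary-style result: one count per namespace.
per_namespace = Counter(ns for _, ns in events)

# Time-series-style result: one count per (hour bucket, namespace).
per_bucket = Counter((hour_bucket(ts), ns) for ts, ns in events)

print(per_namespace["default"])
for (bucket, ns), n in sorted(per_bucket.items()):
    print(bucket.isoformat(), ns, n)
```

The summary collapses the `default` namespace into one row, while the bucketed version keeps one row per hour, so the trend stays visible.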
Pattern 10: Targeted Component Search
When to use: "Find all Redis connection errors"
Use specific regex targeting component names
filter body ~ /redis.*error|connection.*redis|redis.*timeout/i
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
pod,
error_msg:body,
error_time:format_time(timestamp, 'YYYY-MM-DD HH24:MI:SS')
| sort desc(timestamp)
| limit 20
Result: Only Redis-related errors
Common targeted searches:
- Database: `/postgres|mysql|database.*error/i`
- Network: `/timeout|connection.*refused|network.*error/i`
- Authentication: `/auth.*failed|unauthorized|403|401/i`
- Resources: `/out of memory|resource exhausted|disk full/i`
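Targeted patterns are order-sensitive, so it is worth checking one against known messages before trusting its coverage. The Python sketch below tests the Redis pattern from Pattern 10 against two messages quoted elsewhere in this document. Python's `re` is used as a stand-in for OPAL's engine, which is an assumption; the constructs involved (`.`, `*`, `|`) are basic enough that the behavior should be comparable.

```python
import re

# The Redis pattern from Pattern 10, compiled case-insensitively.
redis_pattern = re.compile(
    r"redis.*error|connection.*redis|redis.*timeout", re.IGNORECASE
)

# Messages quoted elsewhere in this document.
messages = [
    "Error: connection timeout to redis:6379",
    "Can't connect to redis:6379",
]

results = {msg: bool(redis_pattern.search(msg)) for msg in messages}
for msg, hit in results.items():
    print(hit, msg)
```

The first message matches through `connection.*redis`. The second mentions Redis but says "connect" rather than "connection" and has no "error" or "timeout" after "redis", so none of the three alternatives match. When a targeted pattern misses a known message like this, consider adding a broader alternative and narrowing it with a second filter.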
Context Extraction
Handling Nested Fields
Problem: Kubernetes metadata often nested in resource_attributes.*
Correct syntax: Quote fields with dots
# CORRECT
make_col namespace:string(resource_attributes."k8s.namespace.name"),
pod:string(resource_attributes."k8s.pod.name"),
node:string(resource_attributes."k8s.node.name")
# WRONG (will fail)
make_col namespace:resource_attributes.k8s.namespace.name
Rule: object."field.with.dots" - quote only the field name
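The quoting rule is easier to remember with a picture of the data. The Python sketch below assumes the attribute is stored under a single key whose name contains dots, which is what the quoting rule suggests; the record and the `walk_dotted_path` helper are hypothetical.

```python
# Assumed shape of one log record: the attribute key itself contains dots.
record = {
    "body": "Error: connection timeout to redis:6379",
    "resource_attributes": {
        "k8s.namespace.name": "default",
        "k8s.pod.name": "cartservice-abc",
    },
}

# Quoted access: look up the whole dotted name as a single key, then cast.
namespace = str(record["resource_attributes"]["k8s.namespace.name"])


def walk_dotted_path(obj, path):
    """Unquoted-style access: treat every dot as a nesting level."""
    for part in path.split("."):
        if not isinstance(obj, dict) or part not in obj:
            return None
        obj = obj[part]
    return obj


missing = walk_dotted_path(record, "resource_attributes.k8s.namespace.name")
print(namespace, missing)
```

Treating every dot as a nesting level looks for a `k8s` object that does not exist under this assumption, so the lookup comes back empty. Quoting the full name avoids that.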
Common Nested Field Patterns
Kubernetes:
resource_attributes."k8s.namespace.name"
resource_attributes."k8s.pod.name"
resource_attributes."k8s.container.name"
resource_attributes."k8s.node.name"
resource_attributes."k8s.deployment.name"
OpenTelemetry:
resource_attributes."service.name"
resource_attributes."service.version"
resource_attributes."deployment.environment"
attributes."http.status_code"
attributes."db.name"
CloudWatch:
resource."aws.region"
resource."aws.account_id"
Extracting Context Fields
Template:
make_col service:string(resource_attributes."service.name"),
namespace:string(resource_attributes."k8s.namespace.name"),
pod:string(resource_attributes."k8s.pod.name"),
container:container, # Top-level field
error_msg:body
Type casting: Use string() for nested variant/object fields
Complete Examples
Example 1: Top 10 Error Types in K8s Logs
User question: "Give me top 10 error types in Kubernetes container logs. Tell me which namespaces have most errors."
Step 1: Discovery
discover_context("kubernetes logs")
# Result: Kubernetes Explorer/Kubernetes Logs (ID: 42161740)
discover_context(dataset_id="42161740")
# Fields: body (text), stream (string), namespace, pod, container
# Nested: resource_attributes.k8s.*
Step 2: Query
filter body ~ /error|exception|failed/i or stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name"),
container,
error_snippet:body
| statsby error_count:count(),
sample_error:any(error_snippet),
group_by(namespace, container)
| sort desc(error_count)
| limit 10
Step 3: Result
namespace container error_count sample_error
default opentelemetry-collector 1420 "Exporting failed..."
default recommendationservice 118 "gRPC connection timeout"
observe cluster-metrics 59 "Error translating metrics"
...
Explanation:
- Wide net filter catches text + stderr
- Extract namespace from nested field
- Count + sample message per group
- Sort by volume, limit to top 10
Example 2: Recent Errors in Production Namespace
User question: "Show me recent errors from the production namespace in the last hour"
Step 1: Query
filter body ~ /error|exception/i or stream = "stderr"
| filter string(resource_attributes."k8s.namespace.name") = "production"
| make_col pod:string(resource_attributes."k8s.pod.name"),
container,
error_msg:body,
error_time:format_time(timestamp, 'YYYY-MM-DD HH24:MI:SS')
| sort desc(timestamp)
| limit 20
Step 2: Result
pod container error_time error_msg
frontend-abc123 server 2025-11-14 17:45:32 "Error: RESOURCE_EXHAUSTED"
cartservice-def456 cart 2025-11-14 17:42:18 "Can't connect to redis:6379"
...
Explanation:
- Wide net filter for errors
- Second filter for specific namespace
- Format timestamp for readability
- Sort by time (most recent first)
Example 3: Database Connection Errors
User question: "Find all database connection errors across all services"
Step 1: Query
filter body ~ /database.*error|db.*connection|postgres.*error|mysql.*failed/i
| make_col service:string(resource_attributes."service.name"),
namespace:string(resource_attributes."k8s.namespace.name"),
error_msg:body
| statsby error_count:count(),
sample:any(error_msg),
group_by(service, namespace)
| sort desc(error_count)
Step 2: Result
service namespace error_count sample
payment-service production 24 "PostgreSQL connection timeout to db:5432"
user-service production 12 "MySQL error: Too many connections"
...
Explanation:
- Targeted regex for database-related errors
- Extract service and namespace context
- Count + sample per service
- Identify which services have DB issues
Example 4: Error Volume Comparison
User question: "Compare error volumes between production and staging namespaces over the last 24 hours"
Step 1: Query
filter body ~ /error|exception|failed/i or stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name")
| filter namespace = "production" or namespace = "staging"
| timechart 1h, error_count:count(), group_by(namespace)
Step 2: Result (time-series)
_c_bucket namespace error_count
2025-11-13T18:00:00Z production 342
2025-11-13T18:00:00Z staging 89
2025-11-13T19:00:00Z production 298
2025-11-13T19:00:00Z staging 102
...
Explanation:
- Wide net error detection
- Filter to specific namespaces
- Time-series aggregation (hourly buckets)
- Compare error trends visually
Example 5: Authentication Failures
User question: "Show me all authentication failures in the last hour"
Step 1: Query
filter body ~ /auth.*failed|unauthorized|403|401|authentication.*error/i
| make_col service:string(resource_attributes."service.name"),
namespace:string(resource_attributes."k8s.namespace.name"),
pod:string(resource_attributes."k8s.pod.name"),
error_msg:body,
error_time:format_time(timestamp, 'YYYY-MM-DD HH24:MI:SS')
| sort desc(timestamp)
| limit 50
Step 2: Result
service namespace pod error_time error_msg
api-gateway production api-gw-abc123 2025-11-14 17:52:14 "HTTP 401: Unauthorized"
auth-service production auth-xyz789 2025-11-14 17:48:32 "Auth failed: invalid token"
...
Explanation:
- Targeted regex for auth-related errors
- Full context extraction
- Recent errors first
- Higher limit (50) to catch patterns
Common Pitfalls
Pitfall 1: Using String Quotes for Regex
WRONG:
filter body ~ "error" # String literal matching
filter body ~ "(?i)error" # PCRE syntax not supported
filter body ~ "error|exception" # String matching, not regex alternation
CORRECT:
filter body ~ /error/ # Regex (case-sensitive)
filter body ~ /error/i # Regex (case-insensitive)
filter body ~ /error|exception/ # Regex alternation
Symptom: Query returns 0 results or unexpected matches
Fix: Use forward slashes /pattern/ for regex, NOT string quotes
Pitfall 2: Not Quoting Nested Fields
WRONG:
make_col namespace:resource_attributes.k8s.namespace.name
CORRECT:
make_col namespace:string(resource_attributes."k8s.namespace.name")
Symptom: "Field not found" error
Fix: Quote field names with dots: object."field.with.dots"
Pitfall 3: Assuming Field Names
WRONG:
# Assuming all logs have "message" field
filter message ~ /error/i
CORRECT:
# Check schema first!
# K8s logs use "body", CloudWatch uses "message"
discover_context(dataset_id="...")
# Then use correct field
filter body ~ /error/i
Symptom: "Column not found: message" error
Fix: ALWAYS run discover_context(dataset_id="...") to get exact field names
Pitfall 4: Case-Sensitive Text Search
WRONG:
filter contains(body, "error") # Misses "Error", "ERROR"
CORRECT:
# Option 1: Multiple contains
filter contains(body, "error") or contains(body, "ERROR") or contains(body, "Error")
# Option 2: Regex (better)
filter body ~ /error/i
Symptom: Missing errors that use different capitalization
Fix: Use regex with /i flag for case-insensitive matching
Pitfall 5: Narrow Error Detection
WRONG:
filter stream = "stderr" # Misses stdout errors
CORRECT:
filter body ~ /error|exception|failed/i
or stream = "stderr"
Symptom: Missing errors that don't match single condition
Fix: Use wide net strategy - combine multiple error indicators
Pitfall 6: Using statsby for Time-Series
WRONG:
# Trying to get hourly trends
filter body ~ /error/i
| statsby count(), group_by(namespace) # Returns ONE row per namespace
CORRECT:
# Use timechart for time-series
filter body ~ /error/i
| timechart 1h, count(), group_by(namespace) # Returns multiple rows (time buckets)
Symptom: Getting summary instead of trends
Fix: Use timechart for time-series, statsby for single summary
Pitfall 7: Forgetting Type Casting
WRONG:
make_col namespace:resource_attributes."k8s.namespace.name"
# Might be variant type, causes issues in group_by
CORRECT:
make_col namespace:string(resource_attributes."k8s.namespace.name")
Symptom: Aggregation errors or unexpected grouping
Fix: Cast nested fields to expected type: string(), int64(), etc.
Cross-References
Related Skills
filtering-event-datasets:
- Basic filtering syntax (`contains()`, `~`, comparison operators)
- When to use `filter` vs aggregation
- Use for: Simple known-value filtering
aggregating-event-datasets:
- `statsby` for aggregations
- `make_col` for derived columns
- Aggregation functions (`count()`, `sum()`, `any()`)
- Use for: Counting, grouping, summarizing
time-series-analysis:
- `timechart` for temporal trending
- Time bucket configuration
- Use for: Error trends over time
analyzing-apm-data:
- Span-based error analysis (`error` field, `error_message`)
- Service-level error tracking
- Use for: APM/tracing error investigation
aggregating-gauge-metrics:
- Error count metrics (`error_count_5m`)
- Volume trending with metrics
- Use for: High-level error volume (fast)
When to Use Which Skill
| User Question | Skill to Use | Why |
|---|---|---|
| "Show me errors in K8s logs" | investigating-textual-data | Text search in logs |
| "What's the error rate for my service?" | analyzing-apm-data | APM metrics |
| "Count errors by service (metrics)" | aggregating-gauge-metrics | Pre-aggregated metrics |
| "Show error trends over 24h" | time-series-analysis | Time-series aggregation |
| "Filter logs for namespace=production" | filtering-event-datasets | Simple filtering |
| "Count errors by container" | aggregating-event-datasets | Event aggregation |
Decision Matrix
User asks about errors. From what source?
- Logs → Textual investigation → investigating-textual-data (logs, events, regex, wide net)
- Spans → APM → analyzing-apm-data (spans, error field)
- Metrics → Volume → aggregating-gauge-metrics (error_count_5m)
- No source specified → Discover textual datasets (kubernetes logs, cloudwatch, etc.)
Summary
Key Takeaways:
- Always discover schema first - Field names vary by dataset
- Use regex with `/pattern/i` - Forward slashes, NOT string quotes
- Cast wide net - Combine text patterns + severity + stream
- Quote nested fields - `resource_attributes."k8s.namespace.name"`
- Sample with `any()` - Get error counts WITH example messages
- Metrics vs logs - Metrics for volume, logs for details
- `statsby` vs `timechart` - Summary vs time-series
Common workflow:
1. discover_context("user intent keywords")
2. discover_context(dataset_id="...") # Get schema
3. Write wide net filter (regex + severity + stream)
4. Extract context (namespace, service, pod)
5. Aggregate with samples (count + any())
6. Sort and limit results
For more:
- OPAL syntax → filtering-event-datasets
- Aggregations → aggregating-event-datasets
- Trends → time-series-analysis
- APM errors → analyzing-apm-data
- Pattern discovery → analyzing-text-patterns