sym-debug
Investigate stuck runs and execution failures by tracing Symphony and Codex logs with issue/session identifiers; use when runs stall, retry repeatedly, or fail unexpectedly.
Debug
Goals
- Find why a run is stuck, retrying, or failing.
- Correlate Linear issue identity to a Codex session quickly.
- Read the right logs in the right order to isolate root cause.
Log Sources
- Primary runtime log file: `<logs-root>/log/symphony.log`
  - When Symphony runs with `--logs-root`, it writes rotating JSON logs under this path (see `apps/symphony/README.md`).
  - Includes orchestrator, agent runner, and Codex app-server lifecycle logs.
- Rotated runtime logs: `<logs-root>/log/symphony.log*`
  - Check these when the relevant run is older than the active file.
- Stdout fallback: structured JSON log stream
  - Without `--logs-root`, logs stream to stdout instead of a file.
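To decide quickly between file-log mode and stdout mode, it can help to list whatever log files actually exist under the logs root. The following is a minimal sketch, not part of Symphony: `list_symphony_logs` is a hypothetical helper name, and it only assumes the `<logs-root>/log/symphony.log*` layout described above.

```shell
# list_symphony_logs LOGS_ROOT
# Hypothetical helper: prints the active log file and any rotated siblings
# found under LOGS_ROOT/log, one path per line.
# Returns non-zero when no file matches, which suggests stdout mode.
list_symphony_logs() (
  found=1
  for f in "$1"/log/symphony.log*; do
    # An unmatched glob stays literal, so test that the path is a real file.
    if [ -f "$f" ]; then
      printf '%s\n' "$f"
      found=0
    fi
  done
  exit "$found"
)
```

If the function returns non-zero for the directory passed to `--logs-root`, the logs are most likely on stdout and the journalctl or docker commands apply instead.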
Correlation Keys
- `issue_identifier`: human ticket key (example: `MT-625`)
- `issue_id`: Linear UUID (stable internal ID)
- `session_id`: Codex thread-turn pair (`<thread_id>-<turn_id>`)
These fields are emitted by Symphony runtime lifecycle logs (notably in
`apps/symphony/src/orchestrator.rs` and `apps/symphony/src/codex/app_server.rs`).
Use them as your join keys during debugging.
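As a rough illustration of using these fields as join keys, the sketch below pulls the value of one key out of log lines read from stdin. It assumes the keys appear as `key=value` text ending at a space or semicolon, which is what the `session_id=[^ ;]+` pattern in the commands implies; if your log lines are rendered as JSON objects, the pattern would need adjusting. `field_value` is a hypothetical helper name, and it uses `grep` only so the sketch runs where `rg` is not installed.

```shell
# field_value KEY
# Hypothetical helper: reads log lines on stdin and prints the value of every
# KEY=value pair, one per line.
# Assumption: values end at the first space or semicolon.
field_value() (
  key=$1
  grep -oE "${key}=[^ ;]+" | cut -d= -f2-
)
```

For example, piping the lines that match one ticket through `field_value session_id | sort -u` gives the distinct sessions for that ticket.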
Quick Triage (Stuck Run)
- Confirm scheduler/worker symptoms for the ticket.
- Find recent lines for the ticket (`issue_identifier` first).
- Extract `session_id` from matching lines.
- Trace that `session_id` across start, stream, completion/failure, and stall handling logs.
- Decide class of failure: timeout/stall, app-server startup failure, turn failure, or orchestrator retry loop.
Commands
# File-log mode (`--logs-root` enabled): expand to active + rotated files.
LOG_PATHS=( ${LOG_GLOB:-log/symphony.log*} )
# 1) Narrow by ticket key (fastest entry point)
rg -n "issue_identifier=MT-625" "${LOG_PATHS[@]}"
# 2) If needed, narrow by Linear UUID
rg -n "issue_id=<linear-uuid>" "${LOG_PATHS[@]}"
# 3) Pull session IDs seen for that ticket (filter by ticket first, then extract)
rg --no-filename "issue_identifier=MT-625" "${LOG_PATHS[@]}" | rg -o "session_id=[^ ;]+" | sort -u
# 4) Trace one session end-to-end
rg -n "session_id=<thread>-<turn>" "${LOG_PATHS[@]}"
# 5) Focus on stuck/retry signals
rg -n "Issue stalled|scheduling retry|turn_timeout|turn_failed|Codex session failed|Codex session ended with error" "${LOG_PATHS[@]}"
# Stdout mode (startup banner shows `Logs: stdout`): use your runtime stream.
journalctl -u symphony --since "30 minutes ago" --no-pager \
| rg -n "issue_identifier=MT-625|issue_id=<linear-uuid>|session_id=<thread>-<turn>|Issue stalled|scheduling retry|turn_timeout|turn_failed|Codex session failed|Codex session ended with error"
# Containerized deploys can use docker logs instead of journalctl.
docker logs <symphony-container> --since 30m 2>&1 \
| rg -n "issue_identifier=MT-625|issue_id=<linear-uuid>|session_id=<thread>-<turn>|Issue stalled|scheduling retry|turn_timeout|turn_failed|Codex session failed|Codex session ended with error"
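The stdout-mode commands above repeat one long alternation. A small helper can build it from whichever identifiers are known, which avoids retyping the signal strings. This is only a sketch: `triage_pattern` and `SIGNALS` are hypothetical names, and the signal strings are copied verbatim from the commands above.

```shell
# Signal strings copied from the stuck/retry search above.
SIGNALS='Issue stalled|scheduling retry|turn_timeout|turn_failed|Codex session failed|Codex session ended with error'

# triage_pattern [ISSUE_KEY] [ISSUE_UUID] [SESSION_ID]
# Hypothetical helper: prints one alternation usable with rg or grep -E.
# Empty or missing arguments are skipped.
triage_pattern() (
  pattern=$SIGNALS
  [ -n "${3:-}" ] && pattern="session_id=$3|$pattern"
  [ -n "${2:-}" ] && pattern="issue_id=$2|$pattern"
  [ -n "${1:-}" ] && pattern="issue_identifier=$1|$pattern"
  printf '%s\n' "$pattern"
)
```

With this in place, the journalctl pipeline could end in `rg -n "$(triage_pattern MT-625)"`. The identifiers are inserted into the pattern unescaped, so this assumes they contain no regex metacharacters beyond hyphens.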
Investigation Flow
- Locate the ticket slice:
  - Search by `issue_identifier=<KEY>`.
  - If noise is high, add `issue_id=<UUID>`.
- Establish timeline:
  - Identify first `Codex session started ... session_id=...`.
  - Follow with `Codex session completed`, `ended with error`, or worker exit lines.
- Classify the problem:
  - Stall loop: `Issue stalled ... restarting with backoff`.
  - App-server startup: `Codex session failed ...`.
  - Turn execution failure: `turn_failed`, `turn_cancelled`, `turn_timeout`, or `ended with error`.
  - Worker crash: `Agent task exited ... reason=...`.
- Validate scope:
  - Check whether failures are isolated to one issue/session or repeating across multiple tickets.
- Capture evidence:
  - Save key log lines with timestamps, `issue_identifier`, `issue_id`, and `session_id`.
  - Record probable root cause and the exact failing stage.
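The classification step above can be sketched as a filter that tags each matching log line with a failure class. The class labels (`stall_loop`, `app_server_startup`, `turn_failure`, `worker_crash`) and the function name are hypothetical; only the matched strings come from the list above. The first matching rule wins, and lines with no known signal are dropped.

```shell
# classify_lines
# Hypothetical helper: reads log lines on stdin and prints
# "<class><TAB><line>" for each line that carries a known failure signal.
classify_lines() {
  while IFS= read -r line; do
    case $line in
      *"Issue stalled"*)        class=stall_loop ;;
      *"Codex session failed"*) class=app_server_startup ;;
      *turn_failed*|*turn_cancelled*|*turn_timeout*|*"ended with error"*)
                                class=turn_failure ;;
      *"Agent task exited"*)    class=worker_crash ;;
      *)                        continue ;;
    esac
    printf '%s\t%s\n' "$class" "$line"
  done
}
```

Piping a ticket slice through `classify_lines | cut -f1 | sort | uniq -c` gives a per-class count, which helps when validating whether a failure is isolated or repeating.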
Reading Codex Session Logs
In Symphony, Codex session diagnostics are emitted into `log/symphony.log` and
keyed by `session_id`. Read them as a lifecycle:
- `Codex session started ... session_id=...`
- Session stream/lifecycle events for the same `session_id`
- Terminal event:
  - `Codex session completed ...`, or
  - `Codex session ended with error ...`, or
  - `Issue stalled ... restarting with backoff`
For one specific session investigation, keep the trace narrow:
- Capture one `session_id` for the ticket.
- Build a timestamped slice for only that session, reusing the `LOG_PATHS` array from Commands: `rg -n "session_id=<thread>-<turn>" "${LOG_PATHS[@]}"`
- Mark the exact failing stage:
  - Startup failure before stream events (`Codex session failed ...`).
  - Turn/runtime failure after stream events (`turn_*` / `ended with error`).
  - Stall recovery (`Issue stalled ... restarting with backoff`).
- Pair findings with `issue_identifier` and `issue_id` from nearby lines to confirm you are not mixing concurrent retries.
Always pair session findings with `issue_identifier`/`issue_id` to avoid mixing
concurrent runs.
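One way to check that a session slice is not mixing runs is to list the distinct `issue_identifier` values that appear on lines mentioning the session. The sketch below assumes the same `key=value` rendering as the search commands, with values ending at a space or semicolon; `issues_for_session` is a hypothetical helper name.

```shell
# issues_for_session SESSION_ID
# Hypothetical helper: reads log lines on stdin and prints the distinct
# issue_identifier values found on lines that mention SESSION_ID.
# The trailing ([ ;]|$) keeps th1-tn1 from also matching th1-tn10.
issues_for_session() (
  grep -E "session_id=$1([ ;]|\$)" \
    | grep -oE 'issue_identifier=[^ ;]+' \
    | cut -d= -f2- \
    | sort -u
)
```

More than one line of output means the slice spans several tickets and should be narrowed before drawing conclusions.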
Notes
- Prefer `rg` over `grep` for speed on large logs.
- Check rotated logs (`<logs-root>/log/symphony.log*`) before concluding data is missing.
- If required context fields are missing in new log statements, align with existing structured lifecycle logging in `apps/symphony/src/orchestrator.rs` and `apps/symphony/src/codex/app_server.rs`.