Agent skill
ci-watch-and-fix
Watch GitHub Actions CI after PR creation, detect failures, extract logs, apply minimal fixes, and re-push — keeping the delivery session alive until CI resolves or escalating after 3 cycles. Activate immediately after gh pr create and before marking the story done.
Install this agent skill to your Project
npx add-skill https://github.com/Fr-e-d/GAAI-framework/tree/main/.gaai/core/skills/delivery/ci-watch-and-fix
Metadata
Additional technical details for this skill
- id: SKILL-DELIVERY-CI-WATCH-001
- owner: Delivery Orchestrator
- track: delivery
- author: gaai-framework
- status: stable
- version: 1.0
- category: delivery
- updated at: 1772409600
SKILL.md
CI Watch and Fix
Purpose / When to Activate
Owner: Delivery Orchestrator.
Activate immediately after gh pr create and before marking the story done.
This skill keeps the delivery session alive through GitHub Actions CI execution. It detects failures, fetches logs, applies minimal fixes, and re-pushes — up to 3 remediation cycles. If CI does not converge within 3 cycles, it escalates without marking the story done.
Do NOT use gh pr checks --watch. Active polling is mandatory to ensure the log file receives periodic output and the daemon heartbeat monitor does not falsely kill the session. See AC7.
External Dependencies
gh CLI authenticated with repo + actions:read scopes (already present in the project environment; no additional setup required).
Process
Initialization
cycle = 0
flaky_retry_used = false
previous_failure_signatures = {} # map: failure signature (hash of check_name + first 100 chars of failure log) → true
Step 0 — Branch Protection Check (once, before loop)
# Determine if CI is a hard gate or advisory
# gh api returns 404 when the branch is unprotected and 403 when branch
# protection is unavailable (private repo on a free plan)
bp_status = gh api repos/<repo>/branches/staging/protection --jq '.required_status_checks' 2>&1
if bp_status contains "403" OR bp_status contains "404" OR bp_status is empty:
    ci_is_advisory = true
    echo "[ci-watch-and-fix] No branch protection on staging — CI is advisory" >> $LOG_DIR/<story-id>.log
else:
    ci_is_advisory = false
    echo "[ci-watch-and-fix] Branch protection active — CI is a hard gate" >> $LOG_DIR/<story-id>.log
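The Step 0 decision above can be sketched as a small pure function. This is a minimal Python sketch, not the skill's actual implementation: `is_ci_advisory` is a hypothetical helper that receives the combined stdout/stderr text of the `gh api` call and applies the same three conditions.

```python
def is_ci_advisory(bp_status: str) -> bool:
    """Return True when CI should be treated as advisory.

    Mirrors the Step 0 rule: empty output, or output mentioning
    403 or 404, means no enforceable branch protection was found.
    """
    text = bp_status.strip()
    if not text:
        return True
    # Plain substring matching, as in the pseudocode. Note that this
    # inherits its weakness: a status-check name containing "403" or
    # "404" would also match.
    return "403" in text or "404" in text
```

Keeping the rule in a pure function makes it easy to exercise against captured CLI output without calling GitHub.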
Main Loop (max 3 cycles)
while cycle < 3:
    cycle += 1

    # Heartbeat — always write a log line at the start of each cycle
    echo "[ci-watch-and-fix] cycle ${cycle}/3 — polling PR #<pr-number> checks" >> $LOG_DIR/<story-id>.log

    # Step 1 — Poll PR checks
    run: gh pr checks <pr-number> --repo <repo>

    # Step 1b — No checks registered?
    # If no CI checks are registered on the PR (no workflows triggered),
    # treat as advisory pass — nothing to wait for.
    if no checks exist:
        echo "[ci-watch-and-fix] No CI checks registered — CI PASS (no checks)" >> $LOG_DIR/<story-id>.log
        exit loop → return CI PASS

    # Step 2 — All passing?
    if all checks pass:
        echo "[ci-watch-and-fix] CI PASS — all checks green" >> $LOG_DIR/<story-id>.log
        exit loop → return CI PASS

    # Step 3 — Identify failed checks and their run IDs
    for each failed check:
        get run_id from check

        # Step 4 — Fetch failure logs (truncated to last 3000 chars per job)
        raw_log = gh run view <run-id> --repo <repo> --log-failed
        failure_log = last 3000 chars of raw_log

        # Step 4b — Pre-existing infra failure detection (fast-path)
        # Detect infrastructure-level failures that code changes cannot fix.
        # These are pre-existing conditions unrelated to the story's changes.
        INFRA_PATTERNS = [
            "recent account payments have failed",
            "spending limit needs to be increased",
            "Actions minutes",
            "Actions quota",
            "not started because", # job queuing failure (billing gate)
            "out of Actions minutes",
        ]
        if any(pattern matches failure_log) for any failed job:
            if ci_is_advisory:
                echo "[ci-watch-and-fix] Infra failure detected but CI is advisory (no branch protection) — CI PASS (advisory skip)" >> $LOG_DIR/<story-id>.log
                exit loop → return CI PASS (advisory)
            else:
                echo "[ci-watch-and-fix] Infra failure detected AND branch protection active — ESCALATE (cannot merge)" >> $LOG_DIR/<story-id>.log
                convergence_failure_reason = "Pre-existing infrastructure failure: GitHub Actions billing/quota limit. Branch protection prevents merge without CI PASS."
                goto ESCALATE

        # Step 5 — Flaky test detection
        signature = hash(check_name + first_100_chars_of_failure_log)
        if signature in previous_failure_signatures:
            # Same failure seen in a previous cycle → suspected flaky
            if flaky_retry_used:
                # Already used the one flaky retry → escalate
                goto ESCALATE
            else:
                flaky_retry_used = true
                echo "[ci-watch-and-fix] suspected flaky test in <check_name> — pushing empty commit retry" >> $LOG_DIR/<story-id>.log
                git commit --allow-empty -m "ci: retry (suspected flaky)" (in worktree)
                git push origin <story_branch> (in worktree)
                sleep 60
                continue # next cycle without applying code changes
        else:
            previous_failure_signatures[signature] = true

        # Step 6 — Analyze and fix (non-flaky failures)
        analyze failure_log to identify root cause
        apply minimal corrective code changes (in worktree — do not expand scope)
        git add → git commit -m "fix(ci/<story-id>): <description>" (in worktree)

    # Push all fixes
    git push origin <story_branch> (in worktree)

    # Step 7 — Wait then re-poll
    echo "[ci-watch-and-fix] fixes pushed — waiting 60s before re-poll" >> $LOG_DIR/<story-id>.log
    sleep 60

# Exhausted 3 cycles without CI PASS
goto ESCALATE
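The branching in Steps 1, 1b and 2 can be sketched as a classifier over the poll result. This is a hedged Python sketch, not the skill's implementation. It assumes a gh version whose `gh pr checks` supports `--json name,bucket`, where `bucket` is one of pass, fail, pending, skipping or cancel. `classify_checks` and its return labels are hypothetical names.

```python
import json


def classify_checks(checks_json: str) -> str:
    """Map `gh pr checks --json name,bucket` output to a decision.

    Returns one of: "no_checks", "pending", "pass", "fail".
    """
    checks = json.loads(checks_json) if checks_json.strip() else []
    if not checks:
        return "no_checks"   # Step 1b: nothing to wait for
    buckets = {check.get("bucket") for check in checks}
    if "pending" in buckets:
        return "pending"     # keep polling and emit heartbeat lines
    if buckets <= {"pass", "skipping"}:
        return "pass"        # Step 2: all checks green
    return "fail"            # fail, cancel, or any unknown bucket
```

Reporting "pending" before "fail" is a design choice of this sketch: it waits for the whole run to finish before remediation starts, so each cycle sees all failures at once.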
Heartbeat Rule
The daemon heartbeat monitor kills sessions that stay silent for more than 30 minutes. This skill MUST emit at least one line to $LOG_DIR/<story-id>.log every 5 minutes during CI wait time. The 60-second sleep between cycles stays well within that limit. If a single CI run takes longer than 5 minutes to complete, emit periodic heartbeat lines:
# During long CI waits, poll every 60s and emit a heartbeat line each time
while ci_running:
    sleep 60
    echo "[ci-watch-and-fix] waiting for CI — elapsed: <N>s" >> $LOG_DIR/<story-id>.log
    check if checks are still in_progress
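The wait loop above can be sketched in Python with the clock and the status check injected, so it can be exercised without real sleeping. This is a minimal sketch under stated assumptions: `wait_for_ci`, `is_running` and `log` are hypothetical names, and the caller is assumed to append each logged line to $LOG_DIR/<story-id>.log.

```python
import time
from typing import Callable


def wait_for_ci(
    is_running: Callable[[], bool],
    log: Callable[[str], None],
    poll_seconds: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until is_running() returns False, logging one heartbeat per poll.

    Returns the total number of seconds waited.
    """
    elapsed = 0
    while is_running():
        sleep(poll_seconds)
        elapsed += poll_seconds
        # One line per poll keeps the log well inside the 5-minute rule.
        log(f"[ci-watch-and-fix] waiting for CI - elapsed: {elapsed}s")
    return elapsed
```

With the default 60-second interval, the log receives a line every minute, which satisfies the 5-minute requirement with margin.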
Escalation Path (AC3)
Trigger when: (cycle >= 3 AND CI not passing, OR flaky retries exhausted, OR an infra failure is detected in Step 4b) AND ci_is_advisory == false.
When ci_is_advisory == true, infra failures and exhausted retries produce CI PASS (advisory) — never CI FAIL. The merge proceeds. The escalation path below only applies when branch protection is active.
ESCALATE:
    # 1. Produce ci_remediation_report
    report_path = docs/ci-failures/<story-id>-<timestamp>.md
    write report containing:
        - story_id
        - pr_number
        - total_cycles_attempted
        - flaky_retry_used
        - per-cycle summary:
            - cycle number
            - checks that failed
            - failure log excerpt (last 500 chars)
            - fix attempted (or "flaky retry" / "none")
        - convergence_failure_reason: why CI did not converge

    # 2. Commit the report to the PR branch (in worktree)
    git add <report_path>
    git commit -m "ci(<story-id>): CI remediation report — convergence failed"
    git push origin <story_branch>

    # 3. Return CI FAIL — do NOT mark story done
    # The delivery wrapper's on_exit trap will mark the story failed (non-zero exit)
    return CI FAIL
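One possible way to render the report listed in step 1 is sketched below in Python. The field names come from the list above. The Markdown layout, the `CycleSummary` type and the `render_report` helper are assumptions made for illustration, not the skill's defined format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CycleSummary:
    cycle: int
    failed_checks: List[str]
    failure_log: str     # trimmed to its last 500 chars when rendered
    fix_attempted: str   # a description, "flaky retry", or "none"


def render_report(story_id: str, pr_number: int, flaky_retry_used: bool,
                  cycles: List[CycleSummary], reason: str) -> str:
    """Build the remediation report body as Markdown text."""
    lines = [
        f"# CI remediation report: {story_id}",
        "",
        f"- story_id: {story_id}",
        f"- pr_number: {pr_number}",
        f"- total_cycles_attempted: {len(cycles)}",
        f"- flaky_retry_used: {str(flaky_retry_used).lower()}",
        f"- convergence_failure_reason: {reason}",
    ]
    for c in cycles:
        excerpt = c.failure_log[-500:]
        lines += [
            "",
            f"## Cycle {c.cycle}",
            f"- checks that failed: {', '.join(c.failed_checks)}",
            f"- fix attempted: {c.fix_attempted}",
            "- failure log excerpt (last 500 chars):",
            "",
            # Indent the excerpt so it renders as a literal block.
            *("    " + line for line in excerpt.splitlines()),
        ]
    return "\n".join(lines) + "\n"
```

The returned string would be written to `report_path` before the `git add` in step 2.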
NEVER mark the story done when returning CI FAIL.
Flaky Test Detection Heuristic (AC4)
A CI failure is classified as likely flaky if:
- The same CI check fails in the current cycle AND
- A previous cycle saw a failure in that same check with an identical error message (matched via the first 100 characters of the failure log for that check)
When a likely-flaky failure is detected:
- Do NOT apply code changes
- Push an empty commit to re-trigger CI: git commit --allow-empty -m "ci: retry (suspected flaky)"
- Count this as consuming the flaky retry slot (max 1 flaky retry total per story)
- If the flaky retry slot is already used and the same failure recurs → escalate
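The signature comparison described in this heuristic can be sketched as follows. This is a minimal Python sketch: the source only says `hash(...)`, so the use of SHA-256 is an assumption, and `failure_signature` and `is_suspected_flaky` are hypothetical helper names.

```python
import hashlib
from typing import Set


def failure_signature(check_name: str, failure_log: str) -> str:
    """Hash the check name plus the first 100 characters of its failure log."""
    material = check_name + failure_log[:100]
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def is_suspected_flaky(signature: str, previous_signatures: Set[str]) -> bool:
    """Return True if the signature was seen in an earlier cycle.

    A signature seen for the first time is recorded and reported as
    not flaky, matching the else-branch of Step 5.
    """
    if signature in previous_signatures:
        return True
    previous_signatures.add(signature)
    return False
```

Log lines fetched with `gh run view --log-failed` typically carry timestamps, so a real implementation may need to strip them before hashing. Otherwise identical failures from different runs would not match. That normalization is not shown here.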
Fix Principles
When applying corrective code changes for non-flaky failures:
- Minimal change only — fix what CI identifies, nothing more
- No scope expansion — do not refactor, add features, or change behavior beyond the CI failure
- Commit message convention: fix(ci/<story-id>): <description> (distinguishable from feature commits)
- Truncate logs: analyze only the last 3000 chars of each failed job log to stay within context limits
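The truncation rule and the commit message convention are both simple string operations. A minimal Python sketch, with hypothetical helper names `truncate_log` and `fix_commit_message`:

```python
MAX_LOG_CHARS = 3000  # per-job limit stated in the Fix Principles


def truncate_log(raw_log: str, limit: int = MAX_LOG_CHARS) -> str:
    """Keep only the last `limit` characters of a failed job log."""
    # Guard limit <= 0: raw_log[-0:] would return the whole string.
    return raw_log[-limit:] if limit > 0 else ""


def fix_commit_message(story_id: str, description: str) -> str:
    """Format a CI fix commit message in the fix(ci/<story-id>) convention."""
    return f"fix(ci/{story_id}): {description}"
```

Keeping the tail rather than the head is deliberate: the failing assertion or error usually appears at the end of a job log.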
Outputs
CI PASS:
status: CI PASS
cycles_used: <n>
flaky_retry_used: <true|false>
CI PASS (advisory):
status: CI PASS
advisory: true
reason: <"no_checks" | "infra_failure_advisory">
note: "CI failed but branch protection is not active — merge permitted"
The Delivery Orchestrator treats CI PASS (advisory) identically to CI PASS — it proceeds to merge. The advisory flag is logged for traceability but does not block the delivery.
CI FAIL:
status: CI FAIL
cycles_used: <n>
flaky_retry_used: <true|false>
escalation_reason: <why convergence failed>
remediation_report: docs/ci-failures/<story-id>-<timestamp>.md
CI FAIL is only returned when branch protection is active AND CI cannot pass. When branch protection is absent, infra failures produce CI PASS (advisory) instead.
Non-Goals
This skill must NOT:
- Modify acceptance criteria or product scope
- Apply fixes to pre-existing CI failures unrelated to this story's changes
- Attempt to fix infrastructure failures (missing secrets, missing bindings, quota limits, billing limits). These are detected via the Step 4b fast-path (do NOT burn retry cycles). When ci_is_advisory, they produce CI PASS (advisory). When branch protection is active, they produce ESCALATE.
- Merge the PR (that is the Orchestrator's responsibility after CI PASS)
- Use gh pr checks --watch (heartbeat requirement, see AC7)
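The Step 4b fast-path referenced in the list above can be sketched as a pattern check followed by a two-way decision. This is a hedged Python sketch: the patterns are copied from Step 4b, but case-insensitive substring matching is an assumption (the source only says a pattern "matches"), and `is_infra_failure` and `infra_outcome` are hypothetical names.

```python
INFRA_PATTERNS = [
    "recent account payments have failed",
    "spending limit needs to be increased",
    "Actions minutes",
    "Actions quota",
    "not started because",  # job queuing failure (billing gate)
    "out of Actions minutes",
]


def is_infra_failure(failure_log: str) -> bool:
    """Return True if any Step 4b pattern appears in the log (case-insensitive)."""
    haystack = failure_log.lower()
    return any(pattern.lower() in haystack for pattern in INFRA_PATTERNS)


def infra_outcome(failure_log: str, ci_is_advisory: bool) -> str:
    """Return "continue", "pass_advisory", or "escalate"."""
    if not is_infra_failure(failure_log):
        return "continue"  # not an infra failure: go on to Step 5
    return "pass_advisory" if ci_is_advisory else "escalate"
```

Running this check before flaky detection and fixing is what keeps infrastructure failures from consuming retry cycles.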
Quality Checks
- Every cycle emits at least one heartbeat line to $LOG_DIR/<story-id>.log
- Flaky detection compares against previous cycle signatures, not just the current cycle
- Escalation report is committed before returning CI FAIL
- Story is never marked done on CI FAIL
- Log truncation is applied before analysis (max 3000 chars per job)