Agent skill

ci-watch-and-fix

Watch GitHub Actions CI after PR creation, detect failures, extract logs, apply minimal fixes, and re-push — keeping the delivery session alive until CI resolves or escalating after 3 cycles. Activate immediately after gh pr create and before marking the story done.

Install this agent skill to your Project

npx add-skill https://github.com/Fr-e-d/GAAI-framework/tree/main/.gaai/core/skills/delivery/ci-watch-and-fix

Metadata

Additional technical details for this skill

id: SKILL-DELIVERY-CI-WATCH-001
owner: Delivery Orchestrator
track: delivery
author: gaai-framework
status: stable
version: 1.0
category: delivery
updated at: 1772409600

SKILL.md

CI Watch and Fix

Purpose / When to Activate

Owner: Delivery Orchestrator.

Activate immediately after gh pr create and before marking the story done.

This skill keeps the delivery session alive through GitHub Actions CI execution. It detects failures, fetches logs, applies minimal fixes, and re-pushes — up to 3 remediation cycles. If CI does not converge within 3 cycles, it escalates without marking the story done.

Do NOT use gh pr checks --watch. Active polling is mandatory to ensure the log file receives periodic output and the daemon heartbeat monitor does not falsely kill the session. See AC7.


External Dependencies

  • gh CLI authenticated with repo + actions:read scopes (already present in the project environment — no additional setup required).

Process

Initialization

cycle = 0
flaky_retry_used = false
previous_failure_signatures = {}   # map: failure signature (hash of check_name + first 100 chars of failure log) → true

Step 0 — Branch Protection Check (once, before loop)

# Determine if CI is a hard gate or advisory
# gh api reports HTTP 404 when the branch has no protection rule, and HTTP 403
# when branch protection is unavailable for the repo (e.g. private repo on a free plan)
bp_status = gh api repos/<repo>/branches/staging/protection --jq '.required_status_checks' 2>&1
if bp_status contains "403" OR bp_status contains "404" OR bp_status is empty:
    ci_is_advisory = true
    echo "[ci-watch-and-fix] No branch protection on staging — CI is advisory" >> $LOG_DIR/<story-id>.log
else:
    ci_is_advisory = false
    echo "[ci-watch-and-fix] Branch protection active — CI is a hard gate" >> $LOG_DIR/<story-id>.log
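
The Step 0 decision can be sketched in Python as follows. This is a minimal sketch, assuming the combined stdout/stderr of the gh api call has already been captured as a string; the function name is_ci_advisory is hypothetical and not part of the skill.

```python
def is_ci_advisory(bp_output: str) -> bool:
    """Return True when the branch-protection lookup indicates no hard gate.

    Mirrors the Step 0 rule: empty output, or output mentioning HTTP 403
    or 404, means CI is advisory. `bp_output` is assumed to be the combined
    stdout/stderr of the `gh api .../protection` call.
    """
    text = bp_output.strip()
    if not text:
        return True
    return "403" in text or "404" in text
```

Because this is a substring match, as in the pseudocode above, a status-check name that happens to contain "403" or "404" would be misread as advisory. Checking the exit status of gh api alongside the output would be a stricter variant.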

Main Loop (max 3 cycles)

while cycle < 3:
    cycle += 1

    # Heartbeat — always write a log line at the start of each cycle
    echo "[ci-watch-and-fix] cycle ${cycle}/3 — polling PR #<pr-number> checks" >> $LOG_DIR/<story-id>.log

    # Step 1 — Poll PR checks
    run: gh pr checks <pr-number> --repo <repo>

    # Step 1b — No checks registered?
    # If no CI checks are registered on the PR (no workflows triggered),
    # treat as advisory pass — nothing to wait for.
    if no checks exist:
        echo "[ci-watch-and-fix] No CI checks registered — CI PASS (no checks)" >> $LOG_DIR/<story-id>.log
        exit loop → return CI PASS (advisory, reason: no_checks)

    # Step 2 — All passing?
    if all checks pass:
        echo "[ci-watch-and-fix] CI PASS — all checks green" >> $LOG_DIR/<story-id>.log
        exit loop → return CI PASS

    # Step 3 — Identify failed checks and their run IDs
    for each failed check:
        get run_id from check

        # Step 4 — Fetch failure logs (truncated to last 3000 chars per job)
        raw_log = gh run view <run-id> --repo <repo> --log-failed
        failure_log = last 3000 chars of raw_log

        # Step 4b — Pre-existing infra failure detection (fast-path)
        # Detect infrastructure-level failures that code changes cannot fix.
        # These are pre-existing conditions unrelated to the story's changes.
        INFRA_PATTERNS = [
            "recent account payments have failed",
            "spending limit needs to be increased",
            "Actions minutes",
            "Actions quota",
            "not started because",       # job queuing failure (billing gate)
            "out of Actions minutes",
        ]
        if any pattern in INFRA_PATTERNS matches failure_log:
            if ci_is_advisory:
                echo "[ci-watch-and-fix] Infra failure detected but CI is advisory (no branch protection) — CI PASS (advisory skip)" >> $LOG_DIR/<story-id>.log
                exit loop → return CI PASS (advisory)
            else:
                echo "[ci-watch-and-fix] Infra failure detected AND branch protection active — ESCALATE (cannot merge)" >> $LOG_DIR/<story-id>.log
                convergence_failure_reason = "Pre-existing infrastructure failure: GitHub Actions billing/quota limit. Branch protection prevents merge without CI PASS."
                goto ESCALATE

        # Step 5 — Flaky test detection
        signature = hash(check_name + first_100_chars_of_failure_log)
        if signature in previous_failure_signatures:
            # Same failure seen in a previous cycle → suspected flaky
            if flaky_retry_used:
                # Already used the one flaky retry → escalate
                goto ESCALATE
            else:
                flaky_retry_used = true
                echo "[ci-watch-and-fix] suspected flaky test in <check_name> — pushing empty commit retry" >> $LOG_DIR/<story-id>.log
                git commit --allow-empty -m "ci: retry (suspected flaky)" (in worktree)
                git push origin <story_branch> (in worktree)
                sleep 60
                continue outer while loop  # start the next cycle without applying code changes
        else:
            previous_failure_signatures[signature] = true

        # Step 6 — Analyze and fix (non-flaky failures)
        analyze failure_log to identify root cause
        apply minimal corrective code changes (in worktree — do not expand scope)
        git add → git commit -m "fix(ci/<story-id>): <description>" (in worktree)

    # Push all fixes
    git push origin <story_branch> (in worktree)

    # Step 7 — Wait then re-poll
    echo "[ci-watch-and-fix] fixes pushed — waiting 60s before re-poll" >> $LOG_DIR/<story-id>.log
    sleep 60

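# Exhausted 3 cycles without CI PASS
# Per the Escalation Path, this yields CI FAIL only when ci_is_advisory == false;
# when CI is advisory, the result is CI PASS (advisory).
goto ESCALATE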
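
Steps 1 to 3 of the loop (poll, classify, collect run IDs) can be sketched as a pure function over the JSON form of the check list. This is a sketch under assumptions: it presumes the installed gh version supports gh pr checks <pr-number> --json name,bucket,link with bucket values such as pass, fail, pending, and cancel, and that each check link has the GitHub Actions shape .../actions/runs/<run-id>/.... The function name summarize_checks is hypothetical.

```python
import json
import re


def summarize_checks(checks_json: str) -> dict:
    """Classify PR checks into an overall state plus the failed run IDs.

    `checks_json` is assumed to be a JSON array of objects with `name`,
    `bucket`, and `link` keys.
    """
    checks = json.loads(checks_json)
    if not checks:
        # Step 1b: nothing registered on the PR
        return {"state": "no_checks", "failed": []}

    failed = []
    pending = False
    for check in checks:
        bucket = check.get("bucket", "")
        if bucket in ("fail", "cancel"):
            # Step 3: recover the run ID from the check's link, if present
            match = re.search(r"/actions/runs/(\d+)", check.get("link", ""))
            failed.append({
                "name": check["name"],
                "run_id": match.group(1) if match else None,
            })
        elif bucket == "pending":
            pending = True

    if failed:
        state = "fail"
    elif pending:
        state = "pending"
    else:
        state = "pass"  # Step 2: all checks green (or skipped)
    return {"state": state, "failed": failed}
```

A "pending" result is the case where the heartbeat wait applies; "fail" feeds each run_id into gh run view --log-failed.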

Heartbeat Rule

The daemon heartbeat monitor kills sessions that have been silent for more than 30 minutes. This skill MUST emit at least one line to $LOG_DIR/<story-id>.log every 5 minutes during CI wait time. The 60-second sleep between cycles is short enough to satisfy this on its own. If a single CI run takes more than 5 minutes to complete, emit periodic heartbeat lines:

# During long CI waits, poll every 60s and emit a heartbeat line each time
while ci_running:
    sleep 60
    echo "[ci-watch-and-fix] waiting for CI — elapsed: <N>s" >> $LOG_DIR/<story-id>.log
    check if checks are still in_progress
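
The wait loop above can be sketched in Python with the polling and logging passed in as callables, which keeps it free of any gh dependency. The names wait_for_ci, is_running, and log are hypothetical; the log line format is the one used in the pseudocode.

```python
import time
from typing import Callable


def wait_for_ci(
    is_running: Callable[[], bool],
    log: Callable[[str], None],
    poll_seconds: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until CI stops running, logging one heartbeat line per poll.

    Returns the total number of seconds waited. `is_running` and `log`
    are caller-supplied hooks (assumptions, not part of the skill).
    """
    elapsed = 0
    while is_running():
        sleep(poll_seconds)
        elapsed += poll_seconds
        # \u2014 is the dash character used in the skill's log lines
        log(f"[ci-watch-and-fix] waiting for CI \u2014 elapsed: {elapsed}s")
    return elapsed
```

With the default 60-second interval, a heartbeat line is written every minute, well inside the 5-minute requirement.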

Escalation Path (AC3)

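Trigger when ci_is_advisory == false AND any of the following holds: all 3 cycles are exhausted without CI passing; the flaky retry is already used and the same failure recurs; or an infra failure is detected (Step 4b).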

When ci_is_advisory == true, infra failures and exhausted retries produce CI PASS (advisory) — never CI FAIL. The merge proceeds. The escalation path below only applies when branch protection is active.

ESCALATE:
    # 1. Produce ci_remediation_report
    report_path = docs/ci-failures/<story-id>-<timestamp>.md
    write report containing:
        - story_id
        - pr_number
        - total_cycles_attempted
        - flaky_retry_used
        - per-cycle summary:
            - cycle number
            - checks that failed
            - failure log excerpt (last 500 chars)
            - fix attempted (or "flaky retry" / "none")
        - convergence_failure_reason: why CI did not converge

    # 2. Commit the report to the PR branch (in worktree)
    git add <report_path>
    git commit -m "ci(<story-id>): CI remediation report — convergence failed"
    git push origin <story_branch>

    # 3. Return CI FAIL — do NOT mark story done
    # The delivery wrapper's on_exit trap will mark the story failed (non-zero exit)
    return CI FAIL

NEVER mark the story done when returning CI FAIL.
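
One way to produce the report body listed in step 1 is sketched below. The Markdown layout, the CycleSummary type, and the render_report function are assumptions for illustration; the skill only specifies which fields the report must contain.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CycleSummary:
    cycle: int
    failed_checks: List[str]
    log_excerpt: str
    fix_attempted: str  # a description, "flaky retry", or "none"


def render_report(story_id: str, pr_number: int, flaky_retry_used: bool,
                  cycles: List[CycleSummary], reason: str) -> str:
    """Render the ci_remediation_report fields as a Markdown document."""
    lines = [
        f"# CI remediation report: {story_id}",
        "",
        f"- story_id: {story_id}",
        f"- pr_number: {pr_number}",
        f"- total_cycles_attempted: {len(cycles)}",
        f"- flaky_retry_used: {str(flaky_retry_used).lower()}",
        f"- convergence_failure_reason: {reason}",
        "",
    ]
    for c in cycles:
        lines += [
            f"## Cycle {c.cycle}",
            f"- checks that failed: {', '.join(c.failed_checks)}",
            f"- fix attempted: {c.fix_attempted}",
            "- failure log excerpt (last 500 chars):",
            "",
            c.log_excerpt[-500:],  # keep only the tail, as the skill requires
            "",
        ]
    return "\n".join(lines)
```

The returned string would be written to docs/ci-failures/<story-id>-<timestamp>.md before the commit in step 2.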


Flaky Test Detection Heuristic (AC4)

A CI failure is classified as likely flaky if:

  1. The same CI check fails in the current cycle AND
  2. A previous cycle saw a failure in that same check with an identical error message (matched via the first 100 characters of the failure log for that check)

When a likely-flaky failure is detected:

  • Do NOT apply code changes
  • Push an empty commit to re-trigger CI: git commit --allow-empty -m "ci: retry (suspected flaky)"
  • Count this as consuming the flaky retry slot (max 1 flaky retry total per story)
  • If the flaky retry slot is already used and the same failure recurs → escalate
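
The heuristic above can be sketched as a small decision function. SHA-256 is an assumption here (the skill only says "hash"), and the names failure_signature and classify_failure are hypothetical.

```python
import hashlib
from typing import Set


def failure_signature(check_name: str, failure_log: str) -> str:
    """Hash the check name plus the first 100 characters of its failure log."""
    material = check_name + failure_log[:100]
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def classify_failure(check_name: str, failure_log: str,
                     seen: Set[str], flaky_retry_used: bool) -> str:
    """Return 'fix', 'flaky_retry', or 'escalate' for one failed check.

    `seen` holds signatures from earlier cycles and is updated in place
    when a new signature is encountered.
    """
    sig = failure_signature(check_name, failure_log)
    if sig not in seen:
        seen.add(sig)
        return "fix"            # first sighting: analyze and fix
    if flaky_retry_used:
        return "escalate"       # the single retry slot is already spent
    return "flaky_retry"        # same failure again: empty-commit retry
```

The caller is responsible for setting flaky_retry_used to true after acting on a "flaky_retry" result.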

Fix Principles

When applying corrective code changes for non-flaky failures:

  • Minimal change only — fix what CI identifies, nothing more
  • No scope expansion — do not refactor, add features, or change behavior beyond the CI failure
  • Commit message convention: fix(ci/<story-id>): <description> — distinguishable from feature commits
  • Truncate logs: analyze only the last 3000 chars of each failed job log to stay within context limits
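
The two mechanical rules in this list, log truncation and the commit message convention, can be sketched as helpers. The function names tail_log and fix_commit_message are hypothetical.

```python
def tail_log(raw_log: str, max_chars: int = 3000) -> str:
    """Keep only the last `max_chars` characters of a failed job log."""
    if max_chars <= 0:
        return ""
    return raw_log[-max_chars:]


def fix_commit_message(story_id: str, description: str) -> str:
    """Format a CI fix commit message as fix(ci/<story-id>): <description>."""
    return f"fix(ci/{story_id}): {description}"
```

Keeping the tail rather than the head reflects the skill's rule of analyzing the last 3000 characters of each failed job log.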

Outputs

CI PASS:

status: CI PASS
cycles_used: <n>
flaky_retry_used: <true|false>

CI PASS (advisory):

status: CI PASS
advisory: true
reason: <"no_checks" | "infra_failure_advisory">
note: "CI failed but branch protection is not active — merge permitted"

The Delivery Orchestrator treats CI PASS (advisory) identically to CI PASS — it proceeds to merge. The advisory flag is logged for traceability but does not block the delivery.

CI FAIL:

status: CI FAIL
cycles_used: <n>
flaky_retry_used: <true|false>
escalation_reason: <why convergence failed>
remediation_report: docs/ci-failures/<story-id>-<timestamp>.md

CI FAIL is only returned when branch protection is active AND CI cannot pass. When branch protection is absent, infra failures produce CI PASS (advisory) instead.
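
The three output shapes can be modeled with one record type, sketched below. The CiResult class and its method names are hypothetical; the field names follow the outputs above, and the merge and exit-code rules follow the statements that advisory passes are treated identically to CI PASS and that CI FAIL leads to a non-zero exit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CiResult:
    status: str                          # "CI PASS" or "CI FAIL"
    cycles_used: int = 0
    flaky_retry_used: bool = False
    advisory: bool = False               # logged for traceability only
    reason: Optional[str] = None         # e.g. "no_checks"
    remediation_report: Optional[str] = None

    def proceed_to_merge(self) -> bool:
        """Advisory passes are treated identically to plain passes."""
        return self.status == "CI PASS"

    def exit_code(self) -> int:
        """Non-zero on CI FAIL so a wrapper can mark the story failed."""
        return 0 if self.status == "CI PASS" else 1
```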


Non-Goals

This skill must NOT:

  • Modify acceptance criteria or product scope
  • Apply fixes to pre-existing CI failures unrelated to this story's changes
  • Attempt to fix infrastructure failures (missing secrets, missing bindings, quota limits, billing limits) — these are detected via Step 4b fast-path (do NOT burn retry cycles). When ci_is_advisory, they produce CI PASS (advisory). When branch protection is active, they produce ESCALATE.
  • Merge the PR (that is the Orchestrator's responsibility after CI PASS)
  • Use gh pr checks --watch (heartbeat requirement — see AC7)
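
The Step 4b fast-path referenced in this list can be sketched as follows. The pattern list is copied from the Process section; matching case-insensitively is an assumption, and the function names is_infra_failure and infra_outcome are hypothetical.

```python
from typing import Optional

INFRA_PATTERNS = (
    "recent account payments have failed",
    "spending limit needs to be increased",
    "Actions minutes",
    "Actions quota",
    "not started because",
    "out of Actions minutes",
)


def is_infra_failure(failure_log: str) -> bool:
    """True when the log contains any known infrastructure-failure phrase."""
    lowered = failure_log.lower()
    return any(pattern.lower() in lowered for pattern in INFRA_PATTERNS)


def infra_outcome(failure_log: str, ci_is_advisory: bool) -> Optional[str]:
    """Map an infra failure to its outcome, or None for ordinary failures."""
    if not is_infra_failure(failure_log):
        return None
    return "CI PASS (advisory)" if ci_is_advisory else "ESCALATE"
```

A None result means the failure continues to flaky detection and the normal fix path.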

Quality Checks

  • Every cycle emits at least one heartbeat line to $LOG_DIR/<story-id>.log
  • Flaky detection compares against previous cycle signatures, not just the current cycle
  • Escalation report is committed before returning CI FAIL
  • Story is never marked done on CI FAIL
  • Log truncation is applied before analysis (max 3000 chars per job)
