Agent skill

ci-watch-and-fix

Watch GitHub Actions CI after PR creation, detect failures, extract logs, apply minimal fixes, and re-push — keeping the delivery session alive until CI resolves or escalating after 3 cycles. Activate immediately after gh pr create and before marking the story done.

Install this agent skill to your Project

npx add-skill https://github.com/Fr-e-d/GAAI-framework/tree/main/.gaai/core/skills/delivery/ci-watch-and-fix

Metadata

Additional technical details for this skill

id: SKILL-DELIVERY-CI-WATCH-001
owner: Delivery Orchestrator
track: delivery
author: gaai-framework
status: stable
version: 1.0
category: delivery
updated at: 1772409600

SKILL.md

CI Watch and Fix

Purpose / When to Activate

Owner: Delivery Orchestrator.

Activate immediately after gh pr create and before marking the story done.

This skill keeps the delivery session alive through GitHub Actions CI execution. It detects failures, fetches logs, applies minimal fixes, and re-pushes — up to 3 remediation cycles. If CI does not converge within 3 cycles, it escalates without marking the story done.

Do NOT use gh pr checks --watch. Active polling is mandatory to ensure the log file receives periodic output and the daemon heartbeat monitor does not falsely kill the session. See AC7.

External Dependencies

gh CLI authenticated with repo + actions:read scopes (already present in the project environment — no additional setup required).

Process

Initialization

cycle = 0
flaky_retry_used = false
previous_failure_signatures = {}   # map: failure signature → true, where signature = hash(check_name + first 100 chars of failure log)

Step 0 — Branch Protection Check (once, before loop)

# Determine if CI is a hard gate or advisory
# gh api returns 404 when the branch has no protection rule, and 403 when
# branch protection is unavailable for the repo (e.g. a private repo on a free plan)
bp_status = gh api repos/<repo>/branches/staging/protection --jq '.required_status_checks' 2>&1
if bp_status contains "403" OR bp_status contains "404" OR bp_status is empty:
    ci_is_advisory = true
    echo "[ci-watch-and-fix] No branch protection on staging — CI is advisory" >> $LOG_DIR/<story-id>.log
else:
    ci_is_advisory = false
    echo "[ci-watch-and-fix] Branch protection active — CI is a hard gate" >> $LOG_DIR/<story-id>.log
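The Step 0 classification rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the names `is_ci_advisory` and `fetch_branch_protection` are hypothetical, and the substring test simply mirrors the pseudocode above.

```python
import subprocess


def is_ci_advisory(bp_output: str) -> bool:
    """Return True when CI should be treated as advisory (Step 0 rule)."""
    text = bp_output.strip()
    if not text:
        # Empty output: no required status checks were reported
        return True
    # Plain substring match on the HTTP status, as in the pseudocode
    return "403" in text or "404" in text


def fetch_branch_protection(repo: str, branch: str = "staging") -> str:
    """Hypothetical helper: run the gh api call and merge stderr into stdout (2>&1)."""
    proc = subprocess.run(
        ["gh", "api", f"repos/{repo}/branches/{branch}/protection",
         "--jq", ".required_status_checks"],
        capture_output=True, text=True, check=False,
    )
    return proc.stdout + proc.stderr
```

A substring match could misfire if a required check name happens to contain "403" or "404"; inspecting the command's exit code as well would be a stricter variant.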

Main Loop (max 3 cycles)

while cycle < 3:
    cycle += 1

    # Heartbeat — always write a log line at the start of each cycle
    echo "[ci-watch-and-fix] cycle ${cycle}/3 — polling PR #<pr-number> checks" >> $LOG_DIR/<story-id>.log

    # Step 1 — Poll PR checks
    run: gh pr checks <pr-number> --repo <repo>

    # Step 1b — No checks registered?
    # If no CI checks are registered on the PR (no workflows triggered),
    # treat as advisory pass — nothing to wait for.
    if no checks exist:
        echo "[ci-watch-and-fix] No CI checks registered — CI PASS (no checks)" >> $LOG_DIR/<story-id>.log
        exit loop → return CI PASS

    # Step 2 — All passing?
    if all checks pass:
        echo "[ci-watch-and-fix] CI PASS — all checks green" >> $LOG_DIR/<story-id>.log
        exit loop → return CI PASS

    # Step 3 — Identify failed checks and their run IDs
    for each failed check:
        get run_id from check

        # Step 4 — Fetch failure logs (truncated to last 3000 chars per job)
        raw_log = gh run view <run-id> --repo <repo> --log-failed
        failure_log = last 3000 chars of raw_log

        # Step 4b — Pre-existing infra failure detection (fast-path)
        # Detect infrastructure-level failures that code changes cannot fix.
        # These are pre-existing conditions unrelated to the story's changes.
        INFRA_PATTERNS = [
            "recent account payments have failed",
            "spending limit needs to be increased",
            "Actions minutes",
            "Actions quota",
            "not started because",       # job queuing failure (billing gate)
            "out of Actions minutes",
        ]
        if any pattern in INFRA_PATTERNS matches failure_log:
            if ci_is_advisory:
                echo "[ci-watch-and-fix] Infra failure detected but CI is advisory (no branch protection) — CI PASS (advisory skip)" >> $LOG_DIR/<story-id>.log
                exit loop → return CI PASS (advisory)
            else:
                echo "[ci-watch-and-fix] Infra failure detected AND branch protection active — ESCALATE (cannot merge)" >> $LOG_DIR/<story-id>.log
                convergence_failure_reason = "Pre-existing infrastructure failure: GitHub Actions billing/quota limit. Branch protection prevents merge without CI PASS."
                goto ESCALATE

        # Step 5 — Flaky test detection
        signature = hash(check_name + first_100_chars_of_failure_log)
        if signature in previous_failure_signatures:
            # Same failure seen in a previous cycle → suspected flaky
            if flaky_retry_used:
                # Already used the one flaky retry → escalate
                goto ESCALATE
            else:
                flaky_retry_used = true
                echo "[ci-watch-and-fix] suspected flaky test in <check_name> — pushing empty commit retry" >> $LOG_DIR/<story-id>.log
                git commit --allow-empty -m "ci: retry (suspected flaky)" (in worktree)
                git push origin <story_branch> (in worktree)
                sleep 60
                continue outer while loop  # next cycle without applying code changes
        else:
            previous_failure_signatures[signature] = true

        # Step 6 — Analyze and fix (non-flaky failures)
        analyze failure_log to identify root cause
        apply minimal corrective code changes (in worktree — do not expand scope)
        git add → git commit -m "fix(ci/<story-id>): <description>" (in worktree)

    # Push all fixes
    git push origin <story_branch> (in worktree)

    # Step 7 — Wait then re-poll
    echo "[ci-watch-and-fix] fixes pushed — waiting 60s before re-poll" >> $LOG_DIR/<story-id>.log
    sleep 60

# Exhausted 3 cycles without CI PASS (if ci_is_advisory, this yields CI PASS (advisory) instead; see Escalation Path)
goto ESCALATE
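Steps 1 to 3 of the loop above (poll, classify, collect run IDs) might look like the following in Python. This sketch assumes the installed gh version supports `gh pr checks <pr-number> --repo <repo> --json name,bucket,link` and that `bucket` takes values such as `pass`, `fail`, `pending`, `skipping`, and `cancel`. Neither assumption comes from the skill itself. The function name `summarize_checks` is hypothetical, and treating `cancel` as a failure is a choice made only for this example.

```python
import json
import re
from typing import Dict, List

# Run IDs appear in check links as .../actions/runs/<run-id>/job/<job-id>
RUN_ID_RE = re.compile(r"/actions/runs/(\d+)")


def summarize_checks(checks_json: str) -> Dict[str, object]:
    """Classify the JSON output of `gh pr checks --json name,bucket,link`."""
    # Assumption: empty output is treated the same as an empty list of checks
    checks = json.loads(checks_json) if checks_json.strip() else []
    if not checks:
        return {"verdict": "no_checks", "failed_run_ids": []}

    failed = [c for c in checks if c.get("bucket") in ("fail", "cancel")]
    if failed:
        run_ids: List[str] = []
        for check in failed:
            match = RUN_ID_RE.search(check.get("link") or "")
            # Several failed jobs can share one workflow run; keep each ID once
            if match and match.group(1) not in run_ids:
                run_ids.append(match.group(1))
        return {"verdict": "fail", "failed_run_ids": run_ids}

    if any(c.get("bucket") == "pending" for c in checks):
        return {"verdict": "pending", "failed_run_ids": []}
    return {"verdict": "pass", "failed_run_ids": []}
```

Each returned run ID can then be passed to `gh run view <run-id> --repo <repo> --log-failed` as in Step 4.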

Heartbeat Rule

The daemon heartbeat monitor kills sessions that have been silent for more than 30 minutes. This skill MUST emit at least one line to $LOG_DIR/<story-id>.log every 5 minutes during CI wait time. The 60-second sleep between cycles stays well inside that window, but if a single CI run takes more than 5 minutes to complete, emit periodic heartbeat lines:

# During long CI waits, poll every 60s and emit a heartbeat line each time
while ci_running:
    sleep 60
    echo "[ci-watch-and-fix] waiting for CI — elapsed: <N>s" >> $LOG_DIR/<story-id>.log
    check if checks are still in_progress
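The heartbeat wait above can be sketched as a small Python function. The name `wait_for_ci` and its callback parameters are hypothetical; the sleep function is injectable only so the loop can be exercised without real delays, and the log line uses a plain hyphen as its separator.

```python
import time
from typing import Callable


def wait_for_ci(
    is_running: Callable[[], bool],
    log: Callable[[str], None],
    poll_seconds: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until CI stops running, logging one heartbeat line per poll.

    Returns the total number of seconds waited.
    """
    elapsed = 0
    while is_running():
        sleep(poll_seconds)
        elapsed += poll_seconds
        # One line per poll keeps the log well under the 5-minute heartbeat budget
        log(f"[ci-watch-and-fix] waiting for CI - elapsed: {elapsed}s")
    return elapsed
```

In a real session, `is_running` would re-run the checks poll and `log` would append to $LOG_DIR/<story-id>.log.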

Escalation Path (AC3)

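Trigger when ci_is_advisory == false AND any of the following holds: all 3 cycles are exhausted without CI passing; the flaky retry slot is already used and the same failure recurs; Step 4b detects an infrastructure failure.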

When ci_is_advisory == true, infra failures and exhausted retries produce CI PASS (advisory) — never CI FAIL. The merge proceeds. The escalation path below only applies when branch protection is active.

ESCALATE:
    # 1. Produce ci_remediation_report
    report_path = docs/ci-failures/<story-id>-<timestamp>.md
    write report containing:
        - story_id
        - pr_number
        - total_cycles_attempted
        - flaky_retry_used
        - per-cycle summary:
            - cycle number
            - checks that failed
            - failure log excerpt (last 500 chars)
            - fix attempted (or "flaky retry" / "none")
        - convergence_failure_reason: why CI did not converge

    # 2. Commit the report to the PR branch (in worktree)
    git add <report_path>
    git commit -m "ci(<story-id>): CI remediation report — convergence failed"
    git push origin <story_branch>

    # 3. Return CI FAIL — do NOT mark story done
    # The delivery wrapper's on_exit trap will mark the story failed (non-zero exit)
    return CI FAIL

NEVER mark the story done when returning CI FAIL.
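A possible shape for the remediation report, sketched in Python. The skill lists the report fields but not the layout, so the line format below is an assumption, as are the names `CycleSummary`, `report_path`, and `render_report` and the UTC timestamp format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class CycleSummary:
    cycle: int
    failed_checks: List[str]
    log_excerpt: str
    fix_attempted: str = "none"  # or "flaky retry", or a short description


def report_path(story_id: str, now: datetime) -> str:
    # Timestamp format is an assumption; the skill only specifies <timestamp>
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"docs/ci-failures/{story_id}-{stamp}.md"


def render_report(story_id: str, pr_number: int, cycles: List[CycleSummary],
                  flaky_retry_used: bool, reason: str) -> str:
    lines = [
        f"story_id: {story_id}",
        f"pr_number: {pr_number}",
        f"total_cycles_attempted: {len(cycles)}",
        f"flaky_retry_used: {str(flaky_retry_used).lower()}",
        "",
    ]
    for c in cycles:
        lines += [
            f"cycle {c.cycle}",
            f"  failed checks: {', '.join(c.failed_checks)}",
            f"  fix attempted: {c.fix_attempted}",
            # Only the last 500 chars of the failure log go into the report
            f"  log excerpt: {c.log_excerpt[-500:]}",
            "",
        ]
    lines.append(f"convergence_failure_reason: {reason}")
    return "\n".join(lines) + "\n"
```

The rendered text would be written to the path from `report_path`, then committed and pushed to the PR branch before CI FAIL is returned.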

Flaky Test Detection Heuristic (AC4)

A CI failure is classified as likely flaky if:

The same CI check fails in the current cycle AND
A previous cycle saw a failure in that same check with an identical error message (matched via the first 100 characters of the failure log for that check)

When a likely-flaky failure is detected:

Do NOT apply code changes
Push an empty commit to re-trigger CI: git commit --allow-empty -m "ci: retry (suspected flaky)"
Count this as consuming the flaky retry slot (max 1 flaky retry total per story)
If the flaky retry slot is already used and the same failure recurs → escalate
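The heuristic above can be sketched as a small state machine in Python. The skill says only "hash", so SHA-256 is an assumption, and the names `failure_signature` and `FlakyTracker` are hypothetical.

```python
import hashlib


def failure_signature(check_name: str, failure_log: str) -> str:
    """Hash of the check name plus the first 100 chars of its failure log."""
    material = check_name + failure_log[:100]
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class FlakyTracker:
    """Tracks signatures across cycles and the single flaky retry slot."""

    def __init__(self) -> None:
        self.seen = set()
        self.flaky_retry_used = False

    def decide(self, check_name: str, failure_log: str) -> str:
        """Return "fix", "retry", or "escalate" for one failed check."""
        sig = failure_signature(check_name, failure_log)
        if sig not in self.seen:
            # First time this failure is seen: treat as a real failure
            self.seen.add(sig)
            return "fix"
        if self.flaky_retry_used:
            # Same failure again and the one retry is spent
            return "escalate"
        self.flaky_retry_used = True
        return "retry"
```

A "retry" decision corresponds to pushing the empty commit; no code changes are applied in that cycle.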

Fix Principles

When applying corrective code changes for non-flaky failures:

Minimal change only — fix what CI identifies, nothing more
No scope expansion — do not refactor, add features, or change behavior beyond the CI failure
Commit message convention: fix(ci/<story-id>): <description> — distinguishable from feature commits
Truncate logs: analyze only the last 3000 chars of each failed job log to stay within context limits

Outputs

CI PASS:

status: CI PASS
cycles_used: <n>
flaky_retry_used: <true|false>

CI PASS (advisory):

status: CI PASS
advisory: true
reason: <"no_checks" | "infra_failure_advisory">
note: "CI failed but branch protection is not active — merge permitted"

The Delivery Orchestrator treats CI PASS (advisory) identically to CI PASS — it proceeds to merge. The advisory flag is logged for traceability but does not block the delivery.

CI FAIL:

status: CI FAIL
cycles_used: <n>   # 3 unless escalation was triggered early by the Step 4b infra fast-path
flaky_retry_used: <true|false>
escalation_reason: <why convergence failed>
remediation_report: docs/ci-failures/<story-id>-<timestamp>.md

CI FAIL is only returned when branch protection is active AND CI cannot pass. When branch protection is absent, infra failures produce CI PASS (advisory) instead.
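How a caller might consume these outputs, as a minimal Python sketch. The Orchestrator's real logic is not shown in this skill; `should_merge` is a hypothetical name, and the sample payload values are illustrative only.

```python
from typing import Dict


def should_merge(result: Dict[str, object]) -> bool:
    """Advisory and regular passes are treated identically: only status matters."""
    return result.get("status") == "CI PASS"


# Illustrative payloads following the three output shapes above
ci_pass = {"status": "CI PASS", "cycles_used": 2, "flaky_retry_used": False}
ci_pass_advisory = {"status": "CI PASS", "advisory": True, "reason": "no_checks"}
ci_fail = {
    "status": "CI FAIL",
    "cycles_used": 3,
    "flaky_retry_used": True,
    "escalation_reason": "same failure recurred after flaky retry",
}
```

The `advisory` flag is deliberately ignored by the merge decision and would only be written to the log for traceability.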

Non-Goals

This skill must NOT:

Modify acceptance criteria or product scope
Apply fixes to pre-existing CI failures unrelated to this story's changes
Attempt to fix infrastructure failures (missing secrets, missing bindings, quota limits, billing limits) — these are detected via Step 4b fast-path (do NOT burn retry cycles). When ci_is_advisory, they produce CI PASS (advisory). When branch protection is active, they produce ESCALATE.
Merge the PR (that is the Orchestrator's responsibility after CI PASS)
Use gh pr checks --watch (heartbeat requirement — see AC7)
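The Step 4b fast-path referenced in the list above can be sketched in Python. The pattern list is copied from the Process section; case-insensitive substring matching is an assumption (the skill says only "matches"), and the names `tail`, `is_infra_failure`, and `infra_outcome` are hypothetical.

```python
from typing import Optional

INFRA_PATTERNS = [
    "recent account payments have failed",
    "spending limit needs to be increased",
    "Actions minutes",
    "Actions quota",
    "not started because",  # job queuing failure (billing gate)
    "out of Actions minutes",
]


def tail(log: str, limit: int = 3000) -> str:
    """Keep only the last `limit` characters of a job log."""
    return log[-limit:]


def is_infra_failure(raw_log: str) -> bool:
    """True when the truncated log contains any known infra pattern."""
    failure_log = tail(raw_log).lower()
    return any(pattern.lower() in failure_log for pattern in INFRA_PATTERNS)


def infra_outcome(raw_log: str, ci_is_advisory: bool) -> Optional[str]:
    """Map an infra failure to its outcome; None means continue to Step 5."""
    if not is_infra_failure(raw_log):
        return None
    return "CI PASS (advisory)" if ci_is_advisory else "ESCALATE"
```

Because the check runs on the truncated log, a pattern that appears only in the early part of a very long log would not be detected.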

Quality Checks

Every cycle emits at least one heartbeat line to $LOG_DIR/<story-id>.log
Flaky detection compares against previous cycle signatures, not just the current cycle
Escalation report is committed before returning CI FAIL
Story is never marked done on CI FAIL
Log truncation is applied before analysis (max 3000 chars per job)

Maintainer

Fr-e-d Core maintainer

Source details

Full Name: Fr-e-d/GAAI-framework
Branch: main
Path in repo: .gaai/core/skills/delivery/ci-watch-and-fix
License: Other
Topics: claude-code ai-agents ai-coding codex-cli cursor gemini-cli agentic-coding context-engineering vibe-coding opencode autonomous-agents devtools windsurf ai-governance ai-developer-tools ai-memory-system
