From 1f18f5d7a33421e12a518b3d33ad77bae1f3cb0b Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Fri, 29 May 2026 18:10:43 -0700
Subject: [PATCH] docs(agent-queue): add Phase 1 Slice 3 prompt (resilience & insights, single host)

---
 agent-queue/docs/jobs/phase1-slice3.md | 168 +++++++++++++++++++++++++
 1 file changed, 168 insertions(+)
 create mode 100644 agent-queue/docs/jobs/phase1-slice3.md

diff --git a/agent-queue/docs/jobs/phase1-slice3.md b/agent-queue/docs/jobs/phase1-slice3.md
new file mode 100644
index 0000000..b263ac0
--- /dev/null
+++ b/agent-queue/docs/jobs/phase1-slice3.md
@@ -0,0 +1,168 @@
+---
+engine: devin
+cwd: /Users/sd9235/code/mygh/learning_ai_devops_tools
+yolo: true
+lock: devops-tools
+timeout: 4h
+---
+
+ROLE: Senior engineer. Implement Phase 1 — Slice 3: RESILIENCE & INSIGHTS (single host).
+This is a LARGE, fully self-contained slice (git + log parsing only — NO network,
+NO external service, NO credentials) so it runs end-to-end without blockers.
+
+SOURCE OF TRUTH: agent-queue/docs/GIGAFACTORY_ROADMAP.md (read §11 lifecycle/retry,
+§25 durability/crash-recovery, §26 execution insights, §17 observability, §14 Phase 1).
+Implement the SINGLE-HOST bash equivalents of §25 and §26.
+
+PREREQUISITE / BRANCHING:
+- Builds on Slice 1 (PR #1, branch feat/gigafactory-p1-slice1).
+- Base on `main` IF PR #1 (and PR #2 if present) are merged; otherwise branch off
+  feat/gigafactory-p1-slice1. Do NOT revert or duplicate earlier slice code.
+- This slice is INDEPENDENT of Slice 2 (profiles/deps) — do not depend on it.
+- New branch: feat/gigafactory-p1-slice3. Commit in logical steps, push, open a PR.
+  DO NOT merge (human gate).
+
+STRICT SCOPE:
+- Edit ONLY under agent-queue/ (agent-queue.sh, selftest.sh, README.md,
+  docs/GIGAFACTORY_ROADMAP.md). No other repo.
+- DO NOT modify/delete anything under agent-queue/queue/ (live jobs). DO NOT run
+  `agent-queue.sh run` against the real queue. selftest.sh uses its own temp
+  AGENT_QUEUE_ROOT and temp git repos only.
+- bash, single host, macOS + Linux safe, zero new runtime deps.
+
+==================================================================
+A. CRASH RECOVERY & WORK PRESERVATION (single-host §25)
+==================================================================
+A1. ORPHAN RECOVERY: On `run` startup (and at the top of each run loop), detect
+     jobs stuck in building/ whose worker is no longer alive — i.e. the meta has a
+     `pid=` whose process is dead (and `pidstart` mismatch guards PID reuse), or no
+     live pid at all. Such a job is an ORPHAN from a previous crash/power-off.
+     Recover it deterministically (never lose or strand it):
+     - increment an `attempts=` counter in the meta,
+     - log a clear recovery line,
+     - move it back to inbox/ for re-selection (subject to retry policy A3),
+     - recovery MUST be idempotent (running it twice recovers once).
+
+A2. WIP CHECKPOINTING (work preservation): when a job's `cwd` is inside a git repo,
+     the worker preserves partial work on a dedicated branch so a crash never loses it:
+     - at START: ensure/create branch `aq/wip/` (from current HEAD), record
+       `wip_branch=` + `wip_base=` in meta. NEVER touch main/protected branches.
+     - on EVERY exit path (success, failure, timeout, signal/trap): commit any
+       changes in cwd to `aq/wip/` with a message like
+       "aq wip: ()" and record `wip_commit=` in meta.
+     - use a trap so even SIGTERM/SIGINT/timeout still checkpoints.
+     - if cwd is NOT a git repo: skip cleanly (log "wip: cwd not a git repo").
+     RESUME: when an orphan/retry of a job whose `aq/wip/` branch exists is
+     relaunched, check out / fast-forward that branch first so the agent continues
+     from the checkpoint instead of from zero. Document the resume behavior.
+
+A3. RETRY POLICY (make the reserved `retry` field FUNCTIONAL):
+     parse `retry: { max: N, backoff: 5m, on: [timeout, verify_failed, crash] }`.
+     On a failure whose class is in `on` (agent rc!=0 => crash/agent_error,
+     timeout => timeout, verify fail => verify_failed), requeue to inbox/ with the
+     backoff delay honored (record `next_eligible=` epoch; selection skips until
+     then) up to `max` attempts. On exhaustion → failed/ with
+     result=retries_exhausted (single-host stand-in for dead_letter), preserving the
+     wip branch + full diagnostics in the log. Default when `retry` absent = no
+     retry (current behavior).
+
+A4. STATE INTEGRITY: keep all meta writes append-only (as today); never truncate a
+     live meta. Recovery/retry/backoff bookkeeping must be crash-safe (re-derivable
+     from meta + folder location).
+
+==================================================================
+B. EXECUTION INSIGHTS & TOKEN ACCOUNTING (single-host §26)
+==================================================================
+B1. PER-RUN METRICS: on completion, record into the job meta:
+     duration_s, exit, result, attempts, and repo deltas for the run —
+     files_changed, lines_added, lines_deleted (from `git -C diff --numstat`
+     against wip_base, or against HEAD~ if applicable).
+B2. TOKEN/COST CAPTURE (best-effort, honest): add a single extensible adapter
+     `parse_usage ` that extracts, when present in the engine's
+     output: model, tokens_in, tokens_out, tokens_cached, cost_usd, turns,
+     tool_calls. Where the engine does not expose usage, omit the field or set an
+     `estimated=true` marker — DO NOT fabricate precise numbers. Centralize all
+     per-engine patterns in this one function (devin/claude/codex/copilot stubs;
+     real patterns where known, TODO-commented otherwise).
+B3. SURFACE in `status`: add an insights sub-line per finished/running job
+     (duration, attempts, tokens/cost if known, +/- lines).
+B4. NEW COMMAND `aq insights [job]`:
+     - with a job id: print that job's full metrics.
+     - without: print a table of recent finished jobs + an AGGREGATE rollup by
+       engine (total tokens, total cost (mark if any estimated), job count,
+       success rate, avg duration).
+B5. dashboard.mjs: surface a compact insights column/panel (tokens or cost +
+     attempts) for finished jobs. Keep it read-only from meta (agent-queue.sh
+     stays the single source of truth).
+B6. PRIVACY: never write prompt content or secrets into meta/insights/logs beyond
+     what already exists.
+
+==================================================================
+TESTS (selftest.sh — tests are sacred; only ADD; use temp git repos + stubs)
+==================================================================
+- orphan recovery: craft a building/ job whose meta pid is a dead PID → a `run`
+  startup recovers it to inbox/ with attempts incremented; running recovery twice
+  recovers exactly once.
+- wip checkpoint (git): job with a git-repo cwd that creates a file → after the
+  run, branch aq/wip/ exists and contains a commit with the change; main
+  branch untouched. Non-git cwd → skipped cleanly (no error).
+- wip resume: a recovered job whose aq/wip/ has a prior commit → the relaunch
+  checks out that branch (assert HEAD is on aq/wip/ when the agent runs).
+- retry policy: verify-fail job with retry.max=1 on=[verify_failed] → requeued once
+  (attempts=2) then → failed/ result=retries_exhausted; backoff next_eligible
+  respected (job not picked before its delay — use a tiny backoff like 1s).
+- retry on crash: agent rc!=0 with on=[crash] retries; without `crash` in `on`,
+  it goes straight to failed/ (no retry).
+- insights parse: feed a stub engine log containing a known usage line →
+  parse_usage extracts tokens/cost into meta; `aq insights ` prints them;
+  a no-usage log → fields omitted/estimated, no crash.
+- insights aggregate: two finished jobs → `aq insights` prints a per-engine rollup
+  with correct totals + success rate.
+- numstat deltas: a run that adds N lines → lines_added recorded.
+- REGRESSION: all existing selftest cases (Slice 0 + Slice 1) still green.
+
+==================================================================
+DOCS
+==================================================================
+- README: new "Resilience" section (orphan recovery, WIP checkpoint/resume, retry)
+  and "Insights" section (metrics, `aq insights`, token caveat) + document the
+  `retry` frontmatter (now active) and the new result= values
+  (retries_exhausted). Update the manifest table: move `retry` from RESERVED to ACTIVE.
+- docs/GIGAFACTORY_ROADMAP.md: tick the single-host items you fully completed in
+  §11 (retry/dead-letter stand-in), §25 (orphan/WIP/retry — note "single-host
+  subset"), §26 (capture/insights — single-host subset); bump §0 Phase 1 %.
+
+==================================================================
+CONSTRAINTS
+==================================================================
+- bash style consistent with the existing script; no new runtime deps; mac+linux
+  safe (no GNU-only flags without a fallback — note macOS has BSD date/stat);
+  no emojis in code; no leftover debug noise; conventional commits.
+- Be careful with `set -euo pipefail` + traps so the WIP-on-exit checkpoint always
+  runs even on failure/timeout.
+
+VERIFY GATE (must pass before finishing):
+- bash agent-queue/selftest.sh → fully green (existing + all new cases).
+- bash -n agent-queue/agent-queue.sh ; node --check agent-queue/dashboard.mjs.
+- shellcheck --severity=error agent-queue/agent-queue.sh (if available) → clean.
+
+FINAL OUTPUT — print the implementation report in EXACTLY this format:
+
+## Implementation Report — Phase 1 Slice 3
+### Branch & commits
+- branch / based-on: (based on main | feat/gigafactory-p1-slice1)
+- commits: (one per line)
+- PR:
+### Files changed
+- :
+### What was implemented (A1-A4, B1-B6)
+- :
+### Tests added
+- : (plus selftest.sh PASS/FAIL summary)
+### Verify gate results
+- selftest.sh:
+- bash -n / node --check / shellcheck:
+### Deviations / assumptions
+-
+### Suggested next slice
+-
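
NON-BINDING SKETCH (not part of the patch or of the report format above): one way the A3 backoff bookkeeping might look in bash. It assumes `backoff` is either a bare number of seconds or a number with an `s`/`m`/`h` suffix, and that `next_eligible=` stores epoch seconds. The function names `backoff_to_seconds`, `next_eligible_epoch`, and `is_eligible` are hypothetical and do not come from agent-queue.sh; treat this as an illustration to adapt, not as the required implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the A3 retry backoff; names are illustrative only.

# Convert a backoff spec ("90", "30s", "5m", "2h") into whole seconds.
# Prints the seconds on stdout; returns 1 on malformed input.
backoff_to_seconds() {
  local spec="${1:-}" num unit
  case "$spec" in
    ''|*[!0-9smh]*) return 1 ;;   # empty, or contains an unexpected character
  esac
  num="${spec%[smh]}"             # strip at most one trailing unit letter
  unit="${spec#"$num"}"           # whatever was stripped ("" when no suffix)
  case "$num" in
    ''|*[!0-9]*) return 1 ;;      # e.g. "m" or "5mm"
  esac
  case "$unit" in
    ''|s) echo "$((10#$num))" ;;  # 10# avoids octal parsing of "08"
    m)    echo "$((10#$num * 60))" ;;
    h)    echo "$((10#$num * 3600))" ;;
    *)    return 1 ;;
  esac
}

# Epoch second at which a requeued job becomes selectable again.
# $1 = backoff spec, $2 = "now" in epoch seconds (defaults to `date +%s`).
next_eligible_epoch() {
  local secs now="${2:-$(date +%s)}"
  secs="$(backoff_to_seconds "$1")" || return 1
  echo "$((now + secs))"
}

# Succeeds when a job may be picked: no next_eligible recorded (treated as 0)
# or the recorded epoch has already been reached.
# $1 = next_eligible value from meta, $2 = "now" in epoch seconds.
is_eligible() {
  local next="${1:-0}" now="${2:-$(date +%s)}"
  [ "$now" -ge "$next" ]
}
```

Passing "now" as an optional argument keeps the helpers deterministic under selftest.sh, and the only clock call is `date +%s`, which behaves the same with BSD and GNU date, so no GNU-only flag is involved.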