docs(agent-queue): add Phase 1 Slice 3 prompt (resilience & insights, single host)

2026-05-29 18:10:43 -07:00 · 2026-05-29 18:10:43 -07:00 · 1f18f5d7a3
commit 1f18f5d7a3
parent beb225162a
1 changed file with 168 additions and 0 deletions
--- a/agent-queue/docs/jobs/phase1-slice3.md
+++ b/agent-queue/docs/jobs/phase1-slice3.md
@@ -0,0 +1,168 @@
+---
+engine: devin
+cwd: /Users/sd9235/code/mygh/learning_ai_devops_tools
+yolo: true
+lock: devops-tools
+timeout: 4h
+---
+
+ROLE: Senior engineer. Implement Phase 1 — Slice 3: RESILIENCE & INSIGHTS (single host).
+This is a LARGE, fully self-contained slice (git + log parsing only — NO network,
+NO external service, NO credentials) so it runs end-to-end without blockers.
+
+SOURCE OF TRUTH: agent-queue/docs/GIGAFACTORY_ROADMAP.md (read §11 lifecycle/retry,
+§25 durability/crash-recovery, §26 execution insights, §17 observability, §14 Phase 1).
+Implement the SINGLE-HOST bash equivalents of §25 and §26.
+
+PREREQUISITE / BRANCHING:
+- Builds on Slice 1 (PR #1, branch feat/gigafactory-p1-slice1).
+- Base on `main` IF PR #1 (and PR #2 if present) are merged; otherwise branch off
+  feat/gigafactory-p1-slice1. Do NOT revert or duplicate earlier slice code.
+- This slice is INDEPENDENT of Slice 2 (profiles/deps) — do not depend on it.
+- New branch: feat/gigafactory-p1-slice3. Commit in logical steps, push, open a PR.
+  DO NOT merge (human gate).
+
+STRICT SCOPE:
+- Edit ONLY under agent-queue/ (agent-queue.sh, dashboard.mjs, selftest.sh,
+  README.md, docs/GIGAFACTORY_ROADMAP.md). No other repo.
+- DO NOT modify/delete anything under agent-queue/queue/ (live jobs). DO NOT run
+  `agent-queue.sh run` against the real queue. selftest.sh uses its own temp
+  AGENT_QUEUE_ROOT and temp git repos only.
+- bash, single host, macOS + Linux safe, zero new runtime deps.
+
+==================================================================
+A. CRASH RECOVERY & WORK PRESERVATION (single-host §25)
+==================================================================
+A1. ORPHAN RECOVERY: On `run` startup (and at the top of each run loop), detect
+    jobs stuck in building/ whose worker is no longer alive: the meta `pid=` is
+    dead, its `pidstart` no longer matches (PID reuse), or there is no pid at
+    all. Such a job is an ORPHAN from a previous crash/power-off.
+    Recover it deterministically (never lose or strand it):
+      - increment an `attempts=` counter in the meta,
+      - log a clear recovery line,
+      - move it back to inbox/ for re-selection (subject to retry policy A3),
+      - recovery MUST be idempotent (running it twice recovers once).
+
+A2. WIP CHECKPOINTING (work preservation): when a job's `cwd` is inside a git repo,
+    the worker preserves partial work on a dedicated branch so a crash never loses it:
+      - at START: ensure/create branch `aq/wip/<job>` (from current HEAD), record
+        `wip_branch=` + `wip_base=` in meta. NEVER touch main/protected branches.
+      - on EVERY exit path (success, failure, timeout, signal/trap): commit any
+        changes in cwd to `aq/wip/<job>` with a message like
+        "aq wip: <job> (<stage/exit>)" and record `wip_commit=` in meta.
+      - use a trap so even SIGTERM/SIGINT/timeout still checkpoints.
+      - if cwd is NOT a git repo: skip cleanly (log "wip: cwd not a git repo").
+    RESUME: when an orphan/retry of a job whose `aq/wip/<job>` branch exists is
+    relaunched, check out / fast-forward that branch first so the agent continues
+    from the checkpoint instead of from zero. Document the resume behavior.
+
+A3. RETRY POLICY (make the reserved `retry` field FUNCTIONAL):
+    parse `retry: { max: N, backoff: 5m, on: [timeout, verify_failed, crash] }`.
+    On a failure whose class is in `on` (agent rc!=0 => crash/agent_error,
+    timeout => timeout, verify fail => verify_failed), requeue to inbox/ with the
+    backoff delay honored (record `next_eligible=` epoch; selection skips until
+    then) up to `max` retries (`max`+1 attempts total). On exhaustion → failed/ with
+    result=retries_exhausted (single-host stand-in for dead_letter), preserving the
+    wip branch + full diagnostics in the log. Default when `retry` absent = no
+    retry (current behavior).
+
+A4. STATE INTEGRITY: keep all meta writes append-only (as today); never truncate a
+    live meta. Recovery/retry/backoff bookkeeping must be crash-safe (re-derivable
+    from meta + folder location).
+
+==================================================================
+B. EXECUTION INSIGHTS & TOKEN ACCOUNTING (single-host §26)
+==================================================================
+B1. PER-RUN METRICS: on completion, record into the job meta:
+      duration_s, exit, result, attempts, and repo deltas for the run —
+      files_changed, lines_added, lines_deleted (from `git -C <cwd> diff --numstat`
+      against wip_base, or against HEAD~ when no wip_base is recorded).
+B2. TOKEN/COST CAPTURE (best-effort, honest): add a single extensible adapter
+      `parse_usage <engine> <logfile>` that extracts, when present in the engine's
+      output: model, tokens_in, tokens_out, tokens_cached, cost_usd, turns,
+      tool_calls. Where the engine does not expose usage, omit the field or set an
+      `estimated=true` marker — DO NOT fabricate precise numbers. Centralize all
+      per-engine patterns in this one function (devin/claude/codex/copilot stubs;
+      real patterns where known, TODO-commented otherwise).
+B3. SURFACE in `status`: add an insights sub-line per finished/running job
+      (duration, attempts, tokens/cost if known, +/- lines).
+B4. NEW COMMAND `aq insights [job]`:
+      - with a job id: print that job's full metrics.
+      - without: print a table of recent finished jobs + an AGGREGATE rollup by
+        engine (total tokens, total cost (mark if any estimated), job count,
+        success rate, avg duration).
+B5. dashboard.mjs: surface a compact insights column/panel (tokens or cost +
+      attempts) for finished jobs. Keep it read-only from meta (agent-queue.sh
+      stays the single source of truth).
+B6. PRIVACY: never write prompt content or secrets into meta/insights/logs beyond
+      what already exists.
+
+==================================================================
+TESTS (selftest.sh — tests are sacred; only ADD; use temp git repos + stubs)
+==================================================================
+- orphan recovery: craft a building/ job whose meta pid is a dead PID → a `run`
+  startup recovers it to inbox/ with attempts incremented; running recovery twice
+  recovers exactly once.
+- wip checkpoint (git): job with a git-repo cwd that creates a file → after the
+  run, branch aq/wip/<job> exists and contains a commit with the change; main
+  branch untouched. Non-git cwd → skipped cleanly (no error).
+- wip resume: a recovered job whose aq/wip/<job> has a prior commit → the relaunch
+  checks out that branch (assert HEAD is on aq/wip/<job> when the agent runs).
+- retry policy: verify-fail job with retry.max=1 on=[verify_failed] → requeued once
+  (attempts=2) then → failed/ result=retries_exhausted; backoff next_eligible
+  respected (job not picked before its delay — use a tiny backoff like 1s).
+- retry on crash: agent rc!=0 with on=[crash] retries; without `crash` in `on`,
+  it goes straight to failed/ (no retry).
+- insights parse: feed a stub engine log containing a known usage line →
+  parse_usage extracts tokens/cost into meta; `aq insights <job>` prints them;
+  a no-usage log → fields omitted/estimated, no crash.
+- insights aggregate: two finished jobs → `aq insights` prints a per-engine rollup
+  with correct totals + success rate.
+- numstat deltas: a run that adds N lines → lines_added recorded.
+- REGRESSION: all existing selftest cases (Slice 0 + Slice 1) still green.
+
+==================================================================
+DOCS
+==================================================================
+- README: new "Resilience" section (orphan recovery, WIP checkpoint/resume, retry)
+  and "Insights" section (metrics, `aq insights`, token caveat) + document the
+  `retry` frontmatter (now active) and the new result= values
+  (retries_exhausted). Update the manifest table: move `retry` from RESERVED to ACTIVE.
+- docs/GIGAFACTORY_ROADMAP.md: tick the single-host items you fully completed in
+  §11 (retry/dead-letter stand-in), §25 (orphan/WIP/retry — note "single-host
+  subset"), §26 (capture/insights — single-host subset); bump §0 Phase 1 %.
+
+==================================================================
+CONSTRAINTS
+==================================================================
+- bash style consistent with the existing script; no new runtime deps; mac+linux
+  safe (no GNU-only flags without a fallback — note macOS has BSD date/stat);
+  no emojis in code; no leftover debug noise; conventional commits.
+- Be careful with `set -euo pipefail` + traps so the WIP-on-exit checkpoint always
+  runs even on failure/timeout.
+
+VERIFY GATE (must pass before finishing):
+- bash agent-queue/selftest.sh → fully green (existing + all new cases).
+- bash -n agent-queue/agent-queue.sh ; node --check agent-queue/dashboard.mjs → both exit 0.
+- shellcheck --severity=error agent-queue/agent-queue.sh (if available) → clean.
+
+FINAL OUTPUT — print the implementation report in EXACTLY this format:
+
+## Implementation Report — Phase 1 Slice 3
+### Branch & commits
+- branch / based-on: <name> (based on main | feat/gigafactory-p1-slice1)
+- commits: <sha> <message> (one per line)
+- PR: <url or "opened, not merged">
+### Files changed
+- <path>: <one-line summary>
+### What was implemented (A1-A4, B1-B6)
+- <item>: <how, key functions added/changed>
+### Tests added
+- <test name>: <what it asserts>  (plus selftest.sh PASS/FAIL summary)
+### Verify gate results
+- selftest.sh: <PASS/FAIL + counts>
+- bash -n / node --check / shellcheck: <result>
+### Deviations / assumptions
+- <anything changed from spec and why; which engines have real token parsing vs TODO>
+### Suggested next slice
+- <what should come next (likely: tracker adapter aq from-tracker/to-tracker, P2)>
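
Review note on A3/A4: the backoff bookkeeping could be sketched as below. This is a minimal sketch, not the existing agent-queue.sh code. It assumes the meta file is a list of `key=value` lines where the last occurrence of a key wins, which is one way to satisfy the append-only rule in A4. The helper names (`meta_get`, `meta_set`, `duration_to_seconds`, `schedule_retry`, `is_eligible`) are hypothetical, and only `s`, `m`, and `h` suffixes are handled.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for retry backoff; names and meta layout are assumptions.
set -euo pipefail

# meta_get FILE KEY: print the latest value of KEY (last line wins), or an
# empty line when the key or the file is absent.
meta_get() {
  local file=$1 key=$2 line val=""
  if [ -f "$file" ]; then
    while IFS= read -r line || [ -n "$line" ]; do
      case $line in
        "$key="*) val=${line#"$key="} ;;
      esac
    done < "$file"
  fi
  printf '%s\n' "$val"
}

# meta_set FILE KEY VALUE: append only; never rewrite or truncate the file.
meta_set() {
  printf '%s=%s\n' "$2" "$3" >> "$1"
}

# duration_to_seconds DURATION: accepts plain seconds or an s/m/h suffix.
# Pure shell arithmetic, so it behaves the same with BSD and GNU userlands.
duration_to_seconds() {
  local d=$1 n unit
  case $d in
    *[smh]) n=${d%?}; unit=${d#"$n"} ;;
    *)      n=$d;     unit=s ;;
  esac
  case $n in
    ''|*[!0-9]*) echo "bad duration: $d" >&2; return 1 ;;
  esac
  # 10# forces base 10 so a value like 08 is not parsed as octal.
  case $unit in
    s) echo $((10#$n)) ;;
    m) echo $((10#$n * 60)) ;;
    h) echo $((10#$n * 3600)) ;;
  esac
}

# schedule_retry FILE BACKOFF [NOW]: record the epoch before which selection
# must skip this job. NOW defaults to the current time and is injectable for tests.
schedule_retry() {
  local file=$1 backoff=$2 now=${3:-$(date +%s)} delay
  delay=$(duration_to_seconds "$backoff")
  meta_set "$file" next_eligible "$((now + delay))"
}

# is_eligible FILE [NOW]: succeed when no backoff is pending for the job.
is_eligible() {
  local file=$1 now=${2:-$(date +%s)} next
  next=$(meta_get "$file" next_eligible)
  [ -z "$next" ] || [ "$now" -ge "$next" ]
}
```

A selection loop would call `is_eligible "$meta" || continue` before picking a job from inbox/. Because `next_eligible` is only ever appended, a crash between requeue and selection leaves the state re-derivable from the meta plus the folder location.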