--- engine: devin cwd: /Users/sd9235/code/mygh/learning_ai_devops_tools yolo: true lock: devops-tools timeout: 4h --- ROLE: Senior engineer. Implement Phase 1 — Slice 3: RESILIENCE & INSIGHTS (single host). This is a LARGE, fully self-contained slice (git + log parsing only — NO network, NO external service, NO credentials) so it runs end-to-end without blockers. SOURCE OF TRUTH: agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md (read §11 lifecycle/retry, §25 durability/crash-recovery, §26 execution insights, §17 observability, §14 Phase 1). Implement the SINGLE-HOST bash equivalents of §25 and §26. PREREQUISITE / BRANCHING: - Builds on Slice 1 (PR #1, branch feat/gigafactory-p1-slice1). - Base on `main` IF PR #1 (and PR #2 if present) are merged; otherwise branch off feat/gigafactory-p1-slice1. Do NOT revert or duplicate earlier slice code. - This slice is INDEPENDENT of Slice 2 (profiles/deps) — do not depend on it. - New branch: feat/gigafactory-p1-slice3. Commit in logical steps, push, open a PR. DO NOT merge (human gate). STRICT SCOPE: - Edit ONLY under agent-queue/ (agent-queue.sh, selftest.sh, README.md, docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md). No other repo. - DO NOT modify/delete anything under agent-queue/queue/ (live jobs). DO NOT run `agent-queue.sh run` against the real queue. selftest.sh uses its own temp AGENT_QUEUE_ROOT and temp git repos only. - bash, single host, macOS + Linux safe, zero new runtime deps. ================================================================== A. CRASH RECOVERY & WORK PRESERVATION (single-host §25) ================================================================== A1. ORPHAN RECOVERY: On `run` startup (and at the top of each run loop), detect jobs stuck in building/ whose worker is no longer alive — i.e. the meta has a `pid=` whose process is dead (and `pidstart` mismatch guards PID reuse), or no live pid at all. Such a job is an ORPHAN from a previous crash/power-off. Recover it deterministically (never lose or strand it): - increment an `attempts=` counter in the meta, - log a clear recovery line, - move it back to inbox/ for re-selection (subject to retry policy A3), - recovery MUST be idempotent (running it twice recovers once). A2. WIP CHECKPOINTING (work preservation): when a job's `cwd` is inside a git repo, the worker preserves partial work on a dedicated branch so a crash never loses it: - at START: ensure/create branch `aq/wip/` (from current HEAD), record `wip_branch=` + `wip_base=` in meta. NEVER touch main/protected branches. - on EVERY exit path (success, failure, timeout, signal/trap): commit any changes in cwd to `aq/wip/` with a message like "aq wip: ()" and record `wip_commit=` in meta. - use a trap so even SIGTERM/SIGINT/timeout still checkpoints. - if cwd is NOT a git repo: skip cleanly (log "wip: cwd not a git repo"). RESUME: when an orphan/retry of a job whose `aq/wip/` branch exists is relaunched, check out / fast-forward that branch first so the agent continues from the checkpoint instead of from zero. Document the resume behavior. A3. RETRY POLICY (make the reserved `retry` field FUNCTIONAL): parse `retry: { max: N, backoff: 5m, on: [timeout, verify_failed, crash] }`. On a failure whose class is in `on` (agent rc!=0 => crash/agent_error, timeout => timeout, verify fail => verify_failed), requeue to inbox/ with the backoff delay honored (record `next_eligible=` epoch; selection skips until then) up to `max` attempts. On exhaustion → failed/ with result=retries_exhausted (single-host stand-in for dead_letter), preserving the wip branch + full diagnostics in the log. Default when `retry` absent = no retry (current behavior). A4. STATE INTEGRITY: keep all meta writes append-only (as today); never truncate a live meta. Recovery/retry/backoff bookkeeping must be crash-safe (re-derivable from meta + folder location). ================================================================== B. EXECUTION INSIGHTS & TOKEN ACCOUNTING (single-host §26) ================================================================== B1. PER-RUN METRICS: on completion, record into the job meta: duration_s, exit, result, attempts, and repo deltas for the run — files_changed, lines_added, lines_deleted (from `git -C diff --numstat` against wip_base, or against HEAD~ if applicable). B2. TOKEN/COST CAPTURE (best-effort, honest): add a single extensible adapter `parse_usage ` that extracts, when present in the engine's output: model, tokens_in, tokens_out, tokens_cached, cost_usd, turns, tool_calls. Where the engine does not expose usage, omit the field or set an `estimated=true` marker — DO NOT fabricate precise numbers. Centralize all per-engine patterns in this one function (devin/claude/codex/copilot stubs; real patterns where known, TODO-commented otherwise). B3. SURFACE in `status`: add an insights sub-line per finished/running job (duration, attempts, tokens/cost if known, +/- lines). B4. NEW COMMAND `aq insights [job]`: - with a job id: print that job's full metrics. - without: print a table of recent finished jobs + an AGGREGATE rollup by engine (total tokens, total cost (mark if any estimated), job count, success rate, avg duration). B5. dashboard.mjs: surface a compact insights column/panel (tokens or cost + attempts) for finished jobs. Keep it read-only from meta (agent-queue.sh stays the single source of truth). B6. PRIVACY: never write prompt content or secrets into meta/insights/logs beyond what already exists. ================================================================== TESTS (selftest.sh — tests are sacred; only ADD; use temp git repos + stubs) ================================================================== - orphan recovery: craft a building/ job whose meta pid is a dead PID → a `run` startup recovers it to inbox/ with attempts incremented; running recovery twice recovers exactly once. - wip checkpoint (git): job with a git-repo cwd that creates a file → after the run, branch aq/wip/ exists and contains a commit with the change; main branch untouched. Non-git cwd → skipped cleanly (no error). - wip resume: a recovered job whose aq/wip/ has a prior commit → the relaunch checks out that branch (assert HEAD is on aq/wip/ when the agent runs). - retry policy: verify-fail job with retry.max=1 on=[verify_failed] → requeued once (attempts=2) then → failed/ result=retries_exhausted; backoff next_eligible respected (job not picked before its delay — use a tiny backoff like 1s). - retry on crash: agent rc!=0 with on=[crash] retries; without `crash` in `on`, it goes straight to failed/ (no retry). - insights parse: feed a stub engine log containing a known usage line → parse_usage extracts tokens/cost into meta; `aq insights ` prints them; a no-usage log → fields omitted/estimated, no crash. - insights aggregate: two finished jobs → `aq insights` prints a per-engine rollup with correct totals + success rate. - numstat deltas: a run that adds N lines → lines_added recorded. - REGRESSION: all existing selftest cases (Slice 0 + Slice 1) still green. ================================================================== DOCS ================================================================== - README: new "Resilience" section (orphan recovery, WIP checkpoint/resume, retry) and "Insights" section (metrics, `aq insights`, token caveat) + document the `retry` frontmatter (now active) and the new result= values (retries_exhausted). Update the manifest table: move `retry` from RESERVED to ACTIVE. - docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md: tick the single-host items you fully completed in §11 (retry/dead-letter stand-in), §25 (orphan/WIP/retry — note "single-host subset"), §26 (capture/insights — single-host subset); bump §0 Phase 1 %. ================================================================== CONSTRAINTS ================================================================== - bash style consistent with the existing script; no new runtime deps; mac+linux safe (no GNU-only flags without a fallback — note macOS has BSD date/stat); no emojis in code; no leftover debug noise; conventional commits. - Be careful with `set -euo pipefail` + traps so the WIP-on-exit checkpoint always runs even on failure/timeout. VERIFY GATE (must pass before finishing): - bash agent-queue/selftest.sh → fully green (existing + all new cases). - bash -n agent-queue/agent-queue.sh ; node --check agent-queue/dashboard.mjs. - shellcheck --severity=error agent-queue/agent-queue.sh (if available) → clean. FINAL OUTPUT — print the implementation report in EXACTLY this format: ## Implementation Report — Phase 1 Slice 3 ### Branch & commits - branch / based-on: (based on main | feat/gigafactory-p1-slice1) - commits: (one per line) - PR: ### Files changed - : ### What was implemented (A1-A4, B1-B6) - : ### Tests added - : (plus selftest.sh PASS/FAIL summary) ### Verify gate results - selftest.sh: - bash -n / node --check / shellcheck: ### Deviations / assumptions - ### Suggested next slice -