docs(agent-queue): Resilience + Insights docs; tick §11/§25/§26 single-host (P1-S3)

README: Resilience + Insights sections, retry frontmatter active (manifest table), retries_exhausted/recovered result values, recover/insights commands, honest token caveat. Roadmap: tick fully-completed single-host boxes in §11/§25/§26 with annotations; bump §0 Phase 1 to 55%.
2026-05-29 18:43:38 -07:00 · 2026-05-29 18:43:38 -07:00 · 87a4bf591a
commit 87a4bf591a
parent f46dd38adb
2 changed files with 77 additions and 21 deletions
--- a/agent-queue/README.md
+++ b/agent-queue/README.md
@ -113,7 +113,7 @@ are otherwise **no-ops until a later phase** (they do not yet affect execution).
 | `prefers` | RESERVED | _(none)_ | soft routing/affinity hints (e.g. `[factory:mac-2]`) |
 | `budget` | RESERVED | _(none)_ | `{ usd, tokens, wall }` ceilings (`wall` enforcement is a later slice) |
 | `deps` / `deps-mode` | RESERVED | _(none)_ | DAG dependencies (single-host blocking is a later slice) |
-| `retry` | RESERVED | _(none)_ | `{ max, backoff, on }` retry policy |
+| `retry` | **active** | _(none)_ | `{ max: N, backoff: 5m, on: [timeout, verify_failed, crash] }` — requeue failures with backoff up to `max`, then `retries_exhausted` (see **Resilience**) |
 | `review-policy` | RESERVED | _(none)_ | `auto\|manual\|reviewers:[…]` |
 | `artifacts` | RESERVED | _(none)_ | extra outputs to capture (coverage, screenshots) |
 | `tracker-item` | RESERVED | _(none)_ | link back to the originating tracker task |
@ -160,8 +160,10 @@ misparsed as a flag.
 | `init` | create the `queue/` folders |
 | `add <file> [--engine E] [--cwd P] [--yolo\|--no-yolo]` | queue a prompt into `inbox/` |
 | `run [--max N] [--engine E] [--once]` | process the inbox (foreground loop) |
-| `status` | kanban counts + running-worker table (marks `⚠ stalled` workers) |
+| `status` | kanban counts + running-worker table (marks `⚠ stalled`; per-job insights sub-line) |
 | `watch [interval]` | live `status` (bash), redrawn every N seconds (default 2) |
+| `insights [job]` | per-job metrics, or a recent-jobs table + per-engine token/cost/success rollup (see **Insights**) |
+| `recover` | reclaim orphaned `building/` jobs (dead worker) back to `inbox/` (see **Resilience**) |
 | `dash [--interval N]` | **interactive Node dashboard** — navigable numbered job list with single-key actions (see below) |
 | `stop` | kill running workers + the run loop |
 | `logs <job> [-f]` | print / follow a job's log |
@ -225,7 +227,59 @@ queue/
 **`result=` values** written to `<job>.meta`: `review`, `testing`, `shipped`,
 `failed`, `timeout`, `verify_failed`, `rejected`, `requeued`, `capability_mismatch`
 (host missing a required capability — agent never launched), `no_engine`
-(an `engine-class` had no available engine).
+(an `engine-class` had no available engine), `retries_exhausted` (failed after
+`retry.max` attempts — single-host dead-letter stand-in), `retry_scheduled`
+(transient: requeued for another attempt), `recovered` (transient: an orphan was
+reclaimed to `inbox/`).
+
+## Resilience (crash recovery & work preservation)
+
+Single-host implementations of the durability model (roadmap §25):
+
+- **Orphan recovery.** A job left in `building/` whose worker process is dead (no
+  live `pid`, PID-reuse-guarded by `pidstart`) is an orphan from a previous
+  crash/power-off. On `run` startup and on every loop iteration (or on demand via
+  `agent-queue.sh recover`) it is moved back to `inbox/` with `attempts`
+  incremented. Recovery is **idempotent** — once moved out of `building/` it is
+  never recovered twice.
+- **WIP checkpointing.** When a job's `cwd` is a git repo, the worker creates/checks
+  out a dedicated branch **`aq/wip/<job>`** at start and commits any changes to it
+  on **every** exit path — success, failure, timeout, and SIGTERM/SIGINT (via a
+  trap). It **never** commits to `main`/your current branch. Non-git `cwd` is
+  skipped cleanly. `wip_branch` / `wip_base` / `wip_commit` are recorded in the meta.
+- **Resume.** When an orphan/retry of a job whose `aq/wip/<job>` branch already
+  exists is relaunched, that branch is checked out first so the agent **continues
+  from the checkpoint** instead of from zero.
+- **Retry policy** (`retry` frontmatter, now active). On a failure whose class is in
+  `on` (`crash`/`agent_error` for a non-zero agent exit, `timeout`, `verify_failed`)
+  the job is requeued to `inbox/` honoring `backoff` (selection skips it until
+  `next_eligible`) up to `max` attempts; on exhaustion it lands in `failed/` with
+  `result=retries_exhausted`, preserving the WIP branch + full log. No `retry` =
+  no retry (Phase-0 behavior).
+
+All bookkeeping (`attempts`, `next_eligible`, `wip_*`) is append-only in the meta
+and re-derivable from the meta + folder location, so recovery is crash-safe.
+
+## Insights (metrics & token accounting)
+
+Each finished run records into `<job>.meta`: `duration_s`, `exit`, `result`,
+`attempts`, and — for a git `cwd` — `files_changed` / `lines_added` /
+`lines_deleted` (diffed `wip_base..HEAD`). A single `parse_usage <engine> <log>`
+adapter extracts `model` / `tokens_in` / `tokens_out` / `tokens_cached` /
+`cost_usd` / `turns` / `tool_calls` when the engine exposes them.
+
+```bash
+agent-queue.sh insights <job>   # full metrics for one job
+agent-queue.sh insights         # recent-jobs table + per-engine rollup
+```
+
+> **Token caveat (honest):** real usage is captured only where the engine surfaces
+> it. A cooperating wrapper may emit a machine-readable `AQ_USAGE key=value …` line;
+> otherwise per-engine heuristics apply (Claude/Codex token fields parsed; Devin
+> session metrics + Copilot are API-only and currently TODO in `parse_usage`). When
+> a value is not provider-reported it is **omitted or flagged `usage_estimated`** —
+> numbers are never fabricated. The per-engine rollup marks totals that include any
+> estimated value with `*`.

 ## Config (env overrides)

--- a/agent-queue/docs/GIGAFACTORY_ROADMAP.md
+++ b/agent-queue/docs/GIGAFACTORY_ROADMAP.md
@ -11,7 +11,7 @@
 | Phase | Theme | Status | % | Gate |
 | ----- | ----- | ------ | - | ---- |
 | **0** | Baseline (today) | ✅ shipped | 100% | `selftest.sh` green |
-| **1** | Manifest + profiles + capabilities + tracker adapter (single host) | ◐ in progress | 35% | adapter e2e + selftest |
+| **1** | Manifest + profiles + capabilities + tracker adapter (single host) | ◐ in progress | 55% | adapter e2e + selftest |
 | **2** | Coordinator as platform-service module + Cosmos + multi-factory leasing | ☐ not started | 0% | fleet e2e + module tests |
 | **3** | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ☐ not started | 0% | web e2e + router tests |
 | **4** | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite |
@ -289,10 +289,10 @@ Three transports were evaluated. **Decision: platform-service-native coordinator
 - [ ] Canonical stages enforced server-side: `queued → assigned → building → review → testing → shipped` (+ `blocked`, `failed`, `dead_letter`); transitions validated (illegal transition → 409).
 - [ ] Per-profile default `verify`; per-job override; verify runs at the factory, result reported as an event.
 - [ ] Human gates: `review-policy` routes to reviewers; multi-reviewer support (P3).
- [ ] **Dead-letter**: after `retry.max` exhausted, job → `dead_letter` with full diagnostics; never silently dropped.
+- [x] **Dead-letter**: after `retry.max` exhausted, job → `dead_letter` with full diagnostics; never silently dropped. *(P1-S3 single-host stand-in: `failed/` `result=retries_exhausted`, WIP branch + full log preserved.)*
 - [ ] **Backpressure**: when no factory can take more, jobs stay `queued` (no thrash); SLA timers visible.
 - [ ] **Ship semantics** are profile-configurable (merged+green vs `pr-opened`, §10); `shipped` is terminal-success, `dead_letter` terminal-failure; `blocked` (unmet deps) is distinct from `queued`.
- [ ] **Retry vs idempotency**: a retry creates a new `fleet_runs` attempt under the same job/`idempotency-key` (never a duplicate job); backoff honored; `retry.on` filters which failure classes retry.
+- [x] **Retry vs idempotency**: a retry creates a new `fleet_runs` attempt under the same job/`idempotency-key` (never a duplicate job); backoff honored; `retry.on` filters which failure classes retry. *(P1-S3 single-host: `attempts` counter survives requeue; `backoff`→`next_eligible` gates selection; `on` filters timeout/verify_failed/crash.)*
 - **Acceptance:** a perpetually-failing job lands in `dead_letter` after configured retries; a passing one auto-advances to `testing` then waits for human `ship`; an illegal transition is rejected.
 - **Verify gate:** lifecycle state-machine unit tests (all transitions + illegal-transition rejection + retry/dead-letter path).

@ -340,14 +340,16 @@ Each phase: **Goal → checklist → Exit criteria**. Don't start a phase until
 ### Phase 1 — Manifest + profiles + capabilities + tracker adapter (single host)
 **Goal:** richer single-host runner that understands profiles/capabilities and bridges to tracker — no distributed infra yet.

-> **Slice progress — P1-S1 (this commit):** manifest parsing (all §5 fields, defaulted + backward-compatible), `priority` ordering, capability detection+match gate, `engine-class` resolution, and `idempotency-key` dedupe are **done** on the bash runner. Profiles, `deps` DAG, `retry`/`budget.wall`, `allowed-scope`, the tracker adapter, and dashboard surfacing remain **for later slices**.
+> **Slice progress — P1-S1:** manifest parsing (all §5 fields, defaulted + backward-compatible), `priority` ordering, capability detection+match gate, `engine-class` resolution, and `idempotency-key` dedupe are **done** on the bash runner.
+>
+> **Slice progress — P1-S3 (resilience & insights, single host):** crash recovery (`recover_orphans` + `aq recover`), git WIP checkpoint/resume (`aq/wip/<job>`), functional `retry` policy (backoff + `retries_exhausted`), and execution insights (`parse_usage`, per-run metrics in meta, `aq insights`, `status`/`dash` insights) are **done** — see §11/§25/§26. Profiles, `deps` DAG, `budget.wall`, `allowed-scope`, and the tracker adapter remain **for later slices**.

 - [x] Extend `agent-queue.sh` frontmatter parsing for all new manifest fields (§5), defaulted + backward-compatible. *(P1-S1)*
 - [ ] Add `profiles/` directory + profile resolution (persona injection, default verify/caps/scope) (§6).
 - [x] Local capability detection + a job/factory capability match check before launch (§8 subset). *(P1-S1: `detect_capabilities` + `caps_match`; mismatch ⇒ `failed/` `result=capability_mismatch`, agent never launched.)*
 - [x] `priority` ordering in the inbox pick (replace pure FIFO with priority-then-age). *(P1-S1: `inbox_sorted`; per-lock serialization preserved.)*
 - [ ] `deps` (DAG) blocking on a single host; `idempotency-key` dedupe on `add`. *(P1-S1: `idempotency-key` dedupe DONE; `deps` DAG blocking still pending.)*
- [ ] `retry` with backoff into `failed`/requeue; `budget.wall` enforced (extends `timeout`).
+- [ ] `retry` with backoff into `failed`/requeue; `budget.wall` enforced (extends `timeout`). *(P1-S3: `retry` with backoff + `retries_exhausted` DONE; `budget.wall` still pending.)*
 - [ ] `allowed-scope` guardrail (warn-only this phase) + post-run diff report.
 - [ ] **Tracker adapter** `aq from-tracker <ITEM>` + `aq to-tracker` event poster (§10 P1).
 - [ ] Dashboard shows profile + priority + capability tags + tracker-item link. *(P1-S1: `status` shows priority/profile/caps/tracker-item; Node `dash` surfacing pending.)*
@ -618,16 +620,16 @@ Composite work obeys the same SoT discipline as the core contract (§4 immutable
 - [ ] A factory only ever holds a **transient materialized copy** (temp prompt file) fetched from the API — losing the factory loses nothing. On the offline edge, the `.md` file on disk is the durable copy and reconciles on reconnect (§9).

 ### 25.2 Work-in-progress is preserved (checkpointing)
- [ ] For a git-repo `cwd`, the worker commits **WIP to a dedicated branch `aq/wip/<jobId>`** at start and on every exit path (success, failure, timeout, signal) — partial work is never lost to a crash. Never commits to `main`/protected branches (§12 push policy).
- [ ] `fleet_jobs.checkpoint` records the WIP branch + last commit so any worker can find it.
- [ ] Long agents checkpoint periodically where the engine supports it; otherwise the start/exit commits bound the loss window.
+- [x] For a git-repo `cwd`, the worker commits **WIP to a dedicated branch `aq/wip/<jobId>`** at start and on every exit path (success, failure, timeout, signal) — partial work is never lost to a crash. Never commits to `main`/protected branches (§12 push policy). *(P1-S3: `_wip_start`/`_wip_checkpoint` + EXIT/INT/TERM trap; non-git cwd skipped.)*
+- [ ] `fleet_jobs.checkpoint` records the WIP branch + last commit so any worker can find it. *(P2 Cosmos; single-host records `wip_branch`/`wip_base`/`wip_commit` in `<job>.meta`.)*
+- [x] Long agents checkpoint periodically where the engine supports it; otherwise the start/exit commits bound the loss window. *(P1-S3: start + every-exit-path commits bound the loss window.)*

 ### 25.3 Recovery is automatic, resumable, and fenced
- [ ] **Orphan detection:** on coordinator/runner startup (and continuously), a job in `building/assigned` whose worker is dead (no live lease / dead pid) is an **orphan**; it is recovered, not stranded.
- [ ] **Resume vs restart:** recovery starts a **new `fleet_runs` attempt**; if `aq/wip/<jobId>` exists, the new worker **resumes from the checkpoint** instead of restarting from zero.
- [ ] **Fencing (§4):** the reclaimed run gets a higher `leaseEpoch`; the dead/zombie worker's late commits/ship reports are rejected — no double-execution of *visible* outcomes.
- [ ] **Retry policy** (`retry.max/backoff/on`): agent `rc≠0` / `timeout` / `verify_failed` requeue with backoff up to `max`; on exhaustion → `dead_letter` (P2) / `failed` (P1 stand-in) with full diagnostics — never silently dropped.
- [ ] **State integrity:** all run state is **append-only / optimistic-concurrency guarded** (§13); recovery is idempotent (running it twice yields one recovery).
+- [x] **Orphan detection:** on coordinator/runner startup (and continuously), a job in `building/assigned` whose worker is dead (no live lease / dead pid) is an **orphan**; it is recovered, not stranded. *(P1-S3: `recover_orphans` on `run` startup + each loop, and `agent-queue.sh recover`; dead-pid + `pidstart` reuse guard.)*
+- [x] **Resume vs restart:** recovery starts a **new `fleet_runs` attempt**; if `aq/wip/<jobId>` exists, the new worker **resumes from the checkpoint** instead of restarting from zero. *(P1-S3: relaunch checks out `aq/wip/<job>`; `attempts` incremented.)*
+- [ ] **Fencing (§4):** the reclaimed run gets a higher `leaseEpoch`; the dead/zombie worker's late commits/ship reports are rejected — no double-execution of *visible* outcomes. *(P2 — distributed leasing; out of single-host scope.)*
+- [x] **Retry policy** (`retry.max/backoff/on`): agent `rc≠0` / `timeout` / `verify_failed` requeue with backoff up to `max`; on exhaustion → `dead_letter` (P2) / `failed` (P1 stand-in) with full diagnostics — never silently dropped. *(P1-S3 single-host.)*
+- [x] **State integrity:** all run state is **append-only / optimistic-concurrency guarded** (§13); recovery is idempotent (running it twice yields one recovery). *(P1-S3 single-host: meta is append-only + re-derivable from folder location; `_etag` guard is P2.)*

 ### 25.4 Crash taxonomy (all handled)
 | Failure | Detection | Recovery |
@ -646,11 +648,11 @@ Composite work obeys the same SoT discipline as the core contract (§4 immutable

 **Goal:** per-job/run visibility into **token usage, cost, model, latency, and tool activity** — to drive budgets (§5/§12), cost burndown (§17), and learned routing (§14 P5).

- [ ] **Per-run telemetry record** (in `fleet_runs`, streamed as `fleet_events`): engine, model, **tokensIn/Out (+cached)**, **cost USD** (`estimated:true` when not provider-reported), wall + CPU time, **turn count, tool-call counts**, verify pass/fail, **filesChanged, linesAdded/Deleted**, attempt number, retries.
- [ ] **Token source (honest feasibility):** capture real usage where the engine/provider exposes it (Claude/Codex/OpenAI usage in responses; Devin session metrics); otherwise **estimate** from log heuristics and mark `estimated` — same caveat as `budget.usd/tokens` (§5). A single `parse_usage(engine, log)` adapter centralizes per-engine extraction.
- [ ] **Aggregation/rollups:** per job, roadmap (§24), product, factory, engine, profile, and day. Powers cost burndown (§17) and the learned-routing eval (§14).
- [ ] **Surfacing:** control-plane panels (tokens, cost, success/first-pass/human-edit rates) + a CLI insights summary at the edge; reuse the platform-service telemetry module where present.
- [ ] **Privacy:** telemetry carries metrics + pointers only — **never prompt content or secrets** (redaction §12).
+- [x] **Per-run telemetry record** (in `fleet_runs`, streamed as `fleet_events`): engine, model, **tokensIn/Out (+cached)**, **cost USD** (`estimated:true` when not provider-reported), wall + CPU time, **turn count, tool-call counts**, verify pass/fail, **filesChanged, linesAdded/Deleted**, attempt number, retries. *(P1-S3 single-host: recorded in `<job>.meta` — `duration_s`, `files_changed`/`lines_added`/`lines_deleted`, tokens/cost/turns/tool_calls, `attempts`; CPU time not captured.)*
+- [x] **Token source (honest feasibility):** capture real usage where the engine/provider exposes it (Claude/Codex/OpenAI usage in responses; Devin session metrics); otherwise **estimate** from log heuristics and mark `estimated` — same caveat as `budget.usd/tokens` (§5). A single `parse_usage(engine, log)` adapter centralizes per-engine extraction. *(P1-S3: `parse_usage` adapter; generic `AQ_USAGE` line + Claude/Codex heuristics; Devin/Copilot TODO; `usage_estimated` flag, never fabricated.)*
+- [ ] **Aggregation/rollups:** per job, roadmap (§24), product, factory, engine, profile, and day. Powers cost burndown (§17) and the learned-routing eval (§14). *(P1-S3 partial: `aq insights` does per-job + per-engine rollup; product/factory/profile/day are P2/P3.)*
+- [ ] **Surfacing:** control-plane panels (tokens, cost, success/first-pass/human-edit rates) + a CLI insights summary at the edge; reuse the platform-service telemetry module where present. *(P1-S3 partial: edge CLI `aq insights` + `status`/`dash` insights line done; web control-plane panels are P3.)*
+- [x] **Privacy:** telemetry carries metrics + pointers only — **never prompt content or secrets** (redaction §12). *(P1-S3: insights/meta record only metrics; no prompt body or secrets added.)*
 - **Acceptance:** after a run, its `fleet_runs` carries token/cost/duration/tool/diff metrics (real where metered, flagged `estimated` otherwise); dashboards show per-engine and per-profile cost + token totals; a budget breach is detectable from telemetry alone.
 - **Verify gate:** telemetry unit tests (capture + rollup); a metered-engine run records real tokens; an unmetered run records estimated + flagged; aggregation totals verified.