diff --git a/agent-queue/docs/GIGAFACTORY_ROADMAP.md b/agent-queue/docs/GIGAFACTORY_ROADMAP.md index 16c066b..7367b01 100644 --- a/agent-queue/docs/GIGAFACTORY_ROADMAP.md +++ b/agent-queue/docs/GIGAFACTORY_ROADMAP.md @@ -318,8 +318,8 @@ Three transports were evaluated. **Decision: platform-service-native coordinator Each container partitioned sensibly; every doc has `productId`. -- [ ] `fleet_jobs` (pk `/productId`) — manifest snapshot, current stage, idempotency-key, tracker-item link. -- [ ] `fleet_runs` (pk `/jobId`) — one per execution attempt: factory, engine, profile snapshot, timings, cost, exit, verify result. +- [ ] `fleet_jobs` (pk `/productId`) — manifest snapshot **+ the full instruction body verbatim as markdown (`bodyMd`)**, current stage, idempotency-key, tracker-item link, `checkpoint` pointer (WIP branch/commit). This is the **durable source of truth for instructions** — a factory holds only a transient materialized copy, so a machine going down loses nothing (§25). +- [ ] `fleet_runs` (pk `/jobId`) — one per execution attempt: factory, engine, profile snapshot, timings, exit, verify result, **and execution insights: model, tokensIn/Out (+cached), cost (`estimated` flag), turns, tool-call counts, filesChanged, linesAdded/Deleted, attempt number** (§26). - [ ] `fleet_leases` (pk `/jobId`) — holder factory, `expiresAt`, **`leaseEpoch` (fencing)**, renewals. **Reclaim via a coordinator reaper** that scans `expiresAt < now` — Cosmos TTL only garbage-collects stale rows, it **cannot trigger reclaim logic**. Claim guarded by `_etag`/`If-Match`. - [ ] `fleet_factories` (pk `/productId`) — descriptor, capabilities, health, load, last heartbeat, seat limits. - [ ] `fleet_profiles` (pk `/productId`) — versioned profile snapshots (immutable per version). @@ -427,6 +427,8 @@ Each phase: **Goal → checklist → Exit criteria**. Don't start a phase until | Capacity planning & RU/cost | P2→ | §22 | | Ownership & RACI / on-call | all | §23 | | Work hierarchy & composite delegation (roadmap/epic) | P3 (manual) → P5 (planner) | §24 | +| Durability, crash recovery & work preservation | P1 (orphan/retry/WIP) → P2 (lease/resume) | §25 | +| Execution insights & token accounting | P1 (capture) → P3 (rollup UI) | §26 | --- @@ -605,5 +607,52 @@ Composite work obeys the same SoT discipline as the core contract (§4 immutable --- +## 25. Durability, crash recovery & work preservation + +**Goal:** a machine power-off, daemon/agent crash, or network partition **never loses the job, its instructions, or in-progress work**, and never corrupts state. Recovery is automatic and idempotent. + +### 25.1 Instructions are durable (markdown in Cosmos) +- [ ] The **full job instruction body is persisted verbatim as markdown** in `fleet_jobs.bodyMd` (§13), alongside the structured manifest. The originating tracker `Item.description` also retains the human instruction text; the two are linked by `tracker-item`, never duplicated as competing truth (§24.5). +- [ ] A factory only ever holds a **transient materialized copy** (temp prompt file) fetched from the API — losing the factory loses nothing. On the offline edge, the `.md` file on disk is the durable copy and reconciles on reconnect (§9). + +### 25.2 Work-in-progress is preserved (checkpointing) +- [ ] For a git-repo `cwd`, the worker commits **WIP to a dedicated branch `aq/wip/`** at start and on every exit path (success, failure, timeout, signal) — partial work is never lost to a crash. Never commits to `main`/protected branches (§12 push policy). +- [ ] `fleet_jobs.checkpoint` records the WIP branch + last commit so any worker can find it. +- [ ] Long agents checkpoint periodically where the engine supports it; otherwise the start/exit commits bound the loss window. + +### 25.3 Recovery is automatic, resumable, and fenced +- [ ] **Orphan detection:** on coordinator/runner startup (and continuously), a job in `building/assigned` whose worker is dead (no live lease / dead pid) is an **orphan**; it is recovered, not stranded. +- [ ] **Resume vs restart:** recovery starts a **new `fleet_runs` attempt**; if `aq/wip/` exists, the new worker **resumes from the checkpoint** instead of restarting from zero. +- [ ] **Fencing (§4):** the reclaimed run gets a higher `leaseEpoch`; the dead/zombie worker's late commits/ship reports are rejected — no double-execution of *visible* outcomes. +- [ ] **Retry policy** (`retry.max/backoff/on`): agent `rc≠0` / `timeout` / `verify_failed` requeue with backoff up to `max`; on exhaustion → `dead_letter` (P2) / `failed` (P1 stand-in) with full diagnostics — never silently dropped. +- [ ] **State integrity:** all run state is **append-only / optimistic-concurrency guarded** (§13); recovery is idempotent (running it twice yields one recovery). + +### 25.4 Crash taxonomy (all handled) +| Failure | Detection | Recovery | +| ------- | --------- | -------- | +| Agent process crash (`rc≠0`) | exit code | retry policy → requeue or `failed`/`dead_letter` | +| Daemon/runner crash | lease not renewed | reaper reclaims → resume from checkpoint | +| Machine power-off / partition | missed heartbeats + lease expiry | reaper + fencing + WIP resume elsewhere | +| Coordinator restart | state in Cosmos | leases survive; in-flight reconciled on boot | + +- **Acceptance:** SIGKILL an agent and power-off a factory mid-run → another worker **resumes from the last checkpoint (not from zero)** and ships; instructions intact (read back from Cosmos `bodyMd`); **zero duplicate commits/merges**; a retry-exhausted job lands in `dead_letter`/`failed` with diagnostics. +- **Verify gate:** chaos tests — kill agent, kill runner, simulate partition; assert resume-from-checkpoint, fencing rejection of the stale worker, instruction integrity, and no double-merge. + +--- + +## 26. Execution insights & token accounting + +**Goal:** per-job/run visibility into **token usage, cost, model, latency, and tool activity** — to drive budgets (§5/§12), cost burndown (§17), and learned routing (§14 P5). + +- [ ] **Per-run telemetry record** (in `fleet_runs`, streamed as `fleet_events`): engine, model, **tokensIn/Out (+cached)**, **cost USD** (`estimated:true` when not provider-reported), wall + CPU time, **turn count, tool-call counts**, verify pass/fail, **filesChanged, linesAdded/Deleted**, attempt number, retries. +- [ ] **Token source (honest feasibility):** capture real usage where the engine/provider exposes it (Claude/Codex/OpenAI usage in responses; Devin session metrics); otherwise **estimate** from log heuristics and mark `estimated` — same caveat as `budget.usd/tokens` (§5). A single `parse_usage(engine, log)` adapter centralizes per-engine extraction. +- [ ] **Aggregation/rollups:** per job, roadmap (§24), product, factory, engine, profile, and day. Powers cost burndown (§17) and the learned-routing eval (§14). +- [ ] **Surfacing:** control-plane panels (tokens, cost, success/first-pass/human-edit rates) + a CLI insights summary at the edge; reuse the platform-service telemetry module where present. +- [ ] **Privacy:** telemetry carries metrics + pointers only — **never prompt content or secrets** (redaction §12). +- **Acceptance:** after a run, its `fleet_runs` carries token/cost/duration/tool/diff metrics (real where metered, flagged `estimated` otherwise); dashboards show per-engine and per-profile cost + token totals; a budget breach is detectable from telemetry alone. +- **Verify gate:** telemetry unit tests (capture + rollup); a metered-engine run records real tokens; an unmetered run records estimated + flagged; aggregation totals verified. + +--- + *This document is the single source of truth for the gigafactory build. Keep the §0 table and per-phase checkboxes updated; a phase ships only when its Exit criteria and the §16 Definition-of-Done rubric are fully green.*