docs(agent-queue): add durability/crash-recovery (§25) + execution insights/token accounting (§26)
- §13: fleet_jobs stores instruction bodyMd (durable md SoT) + checkpoint; fleet_runs carries token/cost/model/tool/diff metrics - §25: instructions durable in Cosmos md, WIP checkpoint branch aq/wip/<jobId>, orphan detection, resume-vs-restart, fencing, retry->dead_letter, crash taxonomy - §26: per-run token/cost/latency/tool insights, honest metered-vs-estimated capture, rollups, control-plane surfacing, secret redaction - feature-catalog rows for §25 and §26
This commit is contained in:
parent
470b2ce8d0
commit
beb225162a
@ -318,8 +318,8 @@ Three transports were evaluated. **Decision: platform-service-native coordinator
|
||||
|
||||
Each container partitioned sensibly; every doc has `productId`.
|
||||
|
||||
- [ ] `fleet_jobs` (pk `/productId`) — manifest snapshot, current stage, idempotency-key, tracker-item link.
|
||||
- [ ] `fleet_runs` (pk `/jobId`) — one per execution attempt: factory, engine, profile snapshot, timings, cost, exit, verify result.
|
||||
- [ ] `fleet_jobs` (pk `/productId`) — manifest snapshot **+ the full instruction body verbatim as markdown (`bodyMd`)**, current stage, idempotency-key, tracker-item link, `checkpoint` pointer (WIP branch/commit). This is the **durable source of truth for instructions** — a factory holds only a transient materialized copy, so a machine going down loses nothing (§25).
|
||||
- [ ] `fleet_runs` (pk `/jobId`) — one per execution attempt: factory, engine, profile snapshot, timings, exit, verify result, **and execution insights: model, tokensIn/Out (+cached), cost (`estimated` flag), turns, tool-call counts, filesChanged, linesAdded/Deleted, attempt number** (§26).
|
||||
- [ ] `fleet_leases` (pk `/jobId`) — holder factory, `expiresAt`, **`leaseEpoch` (fencing)**, renewals. **Reclaim via a coordinator reaper** that scans `expiresAt < now` — Cosmos TTL only garbage-collects stale rows, it **cannot trigger reclaim logic**. Claim guarded by `_etag`/`If-Match`.
|
||||
- [ ] `fleet_factories` (pk `/productId`) — descriptor, capabilities, health, load, last heartbeat, seat limits.
|
||||
- [ ] `fleet_profiles` (pk `/productId`) — versioned profile snapshots (immutable per version).
|
||||
@ -427,6 +427,8 @@ Each phase: **Goal → checklist → Exit criteria**. Don't start a phase until
|
||||
| Capacity planning & RU/cost | P2→ | §22 |
|
||||
| Ownership & RACI / on-call | all | §23 |
|
||||
| Work hierarchy & composite delegation (roadmap/epic) | P3 (manual) → P5 (planner) | §24 |
|
||||
| Durability, crash recovery & work preservation | P1 (orphan/retry/WIP) → P2 (lease/resume) | §25 |
|
||||
| Execution insights & token accounting | P1 (capture) → P3 (rollup UI) | §26 |
|
||||
|
||||
---
|
||||
|
||||
@ -605,5 +607,52 @@ Composite work obeys the same SoT discipline as the core contract (§4 immutable
|
||||
|
||||
---
|
||||
|
||||
## 25. Durability, crash recovery & work preservation
|
||||
|
||||
**Goal:** a machine power-off, daemon/agent crash, or network partition **never loses the job, its instructions, or in-progress work**, and never corrupts state. Recovery is automatic and idempotent.
|
||||
|
||||
### 25.1 Instructions are durable (markdown in Cosmos)
|
||||
- [ ] The **full job instruction body is persisted verbatim as markdown** in `fleet_jobs.bodyMd` (§13), alongside the structured manifest. The originating tracker `Item.description` also retains the human instruction text; the two are linked by `tracker-item`, never duplicated as competing truth (§24.5).
|
||||
- [ ] A factory only ever holds a **transient materialized copy** (temp prompt file) fetched from the API — losing the factory loses nothing. On the offline edge, the `.md` file on disk is the durable copy and reconciles on reconnect (§9).
|
||||
|
||||
### 25.2 Work-in-progress is preserved (checkpointing)
|
||||
- [ ] For a git-repo `cwd`, the worker commits **WIP to a dedicated branch `aq/wip/<jobId>`** at start and on every exit path (success, failure, timeout, signal) — partial work is never lost to a crash. Never commits to `main`/protected branches (§12 push policy).
|
||||
- [ ] `fleet_jobs.checkpoint` records the WIP branch + last commit so any worker can find it.
|
||||
- [ ] Long agents checkpoint periodically where the engine supports it; otherwise the start/exit commits bound the loss window.
|
||||
|
||||
### 25.3 Recovery is automatic, resumable, and fenced
|
||||
- [ ] **Orphan detection:** on coordinator/runner startup (and continuously), a job in `building/assigned` whose worker is dead (no live lease / dead pid) is an **orphan**; it is recovered, not stranded.
|
||||
- [ ] **Resume vs restart:** recovery starts a **new `fleet_runs` attempt**; if `aq/wip/<jobId>` exists, the new worker **resumes from the checkpoint** instead of restarting from zero.
|
||||
- [ ] **Fencing (§4):** the reclaimed run gets a higher `leaseEpoch`; the dead/zombie worker's late commits/ship reports are rejected — no double-execution of *visible* outcomes.
|
||||
- [ ] **Retry policy** (`retry.max/backoff/on`): agent `rc≠0` / `timeout` / `verify_failed` requeue with backoff up to `max`; on exhaustion → `dead_letter` (P2) / `failed` (P1 stand-in) with full diagnostics — never silently dropped.
|
||||
- [ ] **State integrity:** all run state is **append-only / optimistic-concurrency guarded** (§13); recovery is idempotent (running it twice yields one recovery).
|
||||
|
||||
### 25.4 Crash taxonomy (all handled)
|
||||
| Failure | Detection | Recovery |
|
||||
| ------- | --------- | -------- |
|
||||
| Agent process crash (`rc≠0`) | exit code | retry policy → requeue or `failed`/`dead_letter` |
|
||||
| Daemon/runner crash | lease not renewed | reaper reclaims → resume from checkpoint |
|
||||
| Machine power-off / partition | missed heartbeats + lease expiry | reaper + fencing + WIP resume elsewhere |
|
||||
| Coordinator restart | state in Cosmos | leases survive; in-flight reconciled on boot |
|
||||
|
||||
- **Acceptance:** SIGKILL an agent and power-off a factory mid-run → another worker **resumes from the last checkpoint (not from zero)** and ships; instructions intact (read back from Cosmos `bodyMd`); **zero duplicate commits/merges**; a retry-exhausted job lands in `dead_letter`/`failed` with diagnostics.
|
||||
- **Verify gate:** chaos tests — kill agent, kill runner, simulate partition; assert resume-from-checkpoint, fencing rejection of the stale worker, instruction integrity, and no double-merge.
|
||||
|
||||
---
|
||||
|
||||
## 26. Execution insights & token accounting
|
||||
|
||||
**Goal:** per-job/run visibility into **token usage, cost, model, latency, and tool activity** — to drive budgets (§5/§12), cost burndown (§17), and learned routing (§14 P5).
|
||||
|
||||
- [ ] **Per-run telemetry record** (in `fleet_runs`, streamed as `fleet_events`): engine, model, **tokensIn/Out (+cached)**, **cost USD** (`estimated:true` when not provider-reported), wall + CPU time, **turn count, tool-call counts**, verify pass/fail, **filesChanged, linesAdded/Deleted**, attempt number, retries.
|
||||
- [ ] **Token source (honest feasibility):** capture real usage where the engine/provider exposes it (Claude/Codex/OpenAI usage in responses; Devin session metrics); otherwise **estimate** from log heuristics and mark `estimated` — same caveat as `budget.usd/tokens` (§5). A single `parse_usage(engine, log)` adapter centralizes per-engine extraction.
|
||||
- [ ] **Aggregation/rollups:** per job, roadmap (§24), product, factory, engine, profile, and day. Powers cost burndown (§17) and the learned-routing eval (§14).
|
||||
- [ ] **Surfacing:** control-plane panels (tokens, cost, success/first-pass/human-edit rates) + a CLI insights summary at the edge; reuse the platform-service telemetry module where present.
|
||||
- [ ] **Privacy:** telemetry carries metrics + pointers only — **never prompt content or secrets** (redaction §12).
|
||||
- **Acceptance:** after a run, its `fleet_runs` carries token/cost/duration/tool/diff metrics (real where metered, flagged `estimated` otherwise); dashboards show per-engine and per-profile cost + token totals; a budget breach is detectable from telemetry alone.
|
||||
- **Verify gate:** telemetry unit tests (capture + rollup); a metered-engine run records real tokens; an unmetered run records estimated + flagged; aggregation totals verified.
|
||||
|
||||
---
|
||||
|
||||
*This document is the single source of truth for the gigafactory build. Keep the §0 table and per-phase checkboxes updated; a phase ships only when its Exit criteria and the §16 Definition-of-Done rubric are fully green.*
|
||||
|
||||
|
||||
Loading…
Reference in New Issue
Block a user