From 3ad9500623a7e2f3b44aecc8f6b2bdc1b1d4da63 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Fri, 29 May 2026 17:15:28 -0700
Subject: [PATCH] docs(agent-queue): harden gigafactory roadmap after principal review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fix correctness/distributed-systems bugs and fill gaps in place:
- atomic claim (optimistic concurrency/_etag), fencing token (leaseEpoch), coordinator-authoritative time added to core contract + scheduler + factory
- lease reclaim via coordinator reaper, not Cosmos TTL (TTL only GCs rows)
- split-brain/partition safety: fencing + distributed lock + quarantine
- budget: wall is the only hard ceiling; usd/tokens best-effort (provider metering)
- SSE live logs cannot use the buffering tracker proxy; use a streaming route + blob log storage (fleet_artifacts container)
- manifest: capability grammar, engine-class enum, idempotency 409 + deps-satisfied semantics, dep cycle detection
- tracker status mapping table + PR-flow ship semantics (merged+green vs pr-opened)
- station/seat capacity, factory health definition, enrollment/bootstrap auth
- Cosmos RU/indexing + claim-loop poll cost; add new sections: rollout/rollback & data migration (§21), capacity planning & cost (§22), ownership & RACI (§23)
- success metrics now carry provisional SLO targets; Phase 2 checklist + index synced
---
 agent-queue/docs/GIGAFACTORY_ROADMAP.md | 197 ++++++++++++++++++------
 1 file changed, 147 insertions(+), 50 deletions(-)

diff --git a/agent-queue/docs/GIGAFACTORY_ROADMAP.md b/agent-queue/docs/GIGAFACTORY_ROADMAP.md
index 357e489..818d029 100644
--- a/agent-queue/docs/GIGAFACTORY_ROADMAP.md
+++ b/agent-queue/docs/GIGAFACTORY_ROADMAP.md
@@ -17,7 +17,7 @@
 | **4** | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite |
 | **5** | Self-optimizing / learned routing | ☐ not started | 0% | offline eval + A/B |

-Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary.
+Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary. **Owners per phase: §23 · rollout/rollback: §21 · capacity & SLOs: §22/§19.**

 ---

@@ -97,8 +97,11 @@ Today's `agent-queue.sh` + `dashboard.mjs` (single host, zero-dep bash + Node):

 - [ ] Every job has a stable **id**, an immutable **manifest**, and an append-only **event log**.
 - [ ] Every Cosmos document carries `productId` (ByteLyst rule).
-- [ ] A job in flight is always covered by exactly one **lease**; no lease → reclaimable.
-- [ ] Lifecycle stages are canonical and shared: `queued → assigned → building → review → testing → shipped` (+ `failed`, `dead_letter`).
+- [ ] A job in flight is always covered by exactly one **lease**; no live lease → reclaimable.
+- [ ] **Atomic claim:** a job is assigned to exactly one worker via optimistic concurrency (Cosmos `_etag`/`If-Match` or a conditional `fleet_leases` insert keyed by `jobId`). Concurrent claimers — exactly one wins; losers retry the next candidate.
+- [ ] **Fencing token:** every lease carries a monotonic `leaseEpoch`. Every report/commit/ship carries its epoch; the coordinator **rejects writes from a stale epoch**, so a partitioned or zombie worker cannot corrupt state after its lease was reclaimed.
+- [ ] **Coordinator-authoritative time:** all lease/TTL/SLA math uses server timestamps, never factory clocks (clock-skew safety).
+- [ ] Lifecycle stages are canonical and shared: `queued → assigned → building → review → testing → shipped` (+ `blocked`, `failed`, `dead_letter`).
 - [ ] The bash runner and the service speak the **same manifest + event vocabulary** (one schema, two transports).
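
The atomic-claim rule in the core contract above can be illustrated with a small in-memory sketch. This is not the platform-service implementation: `LeaseStore`, `replaceIfMatch`, and `tryClaim` are hypothetical names, and the numeric `etag` only imitates the compare-and-swap behavior that Cosmos `_etag`/`If-Match` would provide. The point is that two claimers who read the same lease snapshot cannot both write it, and that each successful claim issues a fresh `leaseEpoch`.

```typescript
// Hypothetical in-memory stand-in for the `fleet_leases` container.
// `etag` imitates Cosmos `_etag`: it changes on every successful write.
type Lease = {
  jobId: string;
  holder: string | null;
  leaseEpoch: number;
  expiresAt: number; // coordinator time, in ms
  etag: number;
};

class LeaseStore {
  private leases = new Map<string, Lease>();

  seed(jobId: string): void {
    this.leases.set(jobId, { jobId, holder: null, leaseEpoch: 0, expiresAt: 0, etag: 1 });
  }

  read(jobId: string): Lease {
    const lease = this.leases.get(jobId);
    if (!lease) throw new Error(`unknown job ${jobId}`);
    return { ...lease }; // hand back a copy, like a fetched document
  }

  // Conditional write: succeeds only if the caller saw the latest etag.
  replaceIfMatch(next: Lease, ifMatch: number): boolean {
    const current = this.leases.get(next.jobId);
    if (!current || current.etag !== ifMatch) return false; // someone else wrote first
    this.leases.set(next.jobId, { ...next, etag: current.etag + 1 });
    return true;
  }
}

// `now` is supplied by the coordinator, never read from a factory clock.
function tryClaim(
  store: LeaseStore,
  jobId: string,
  factoryId: string,
  now: number,
  ttlMs: number,
): Lease | null {
  const seen = store.read(jobId);
  if (seen.holder !== null && seen.expiresAt > now) return null; // live lease exists
  const next: Lease = {
    ...seen,
    holder: factoryId,
    leaseEpoch: seen.leaseEpoch + 1, // fresh fencing token for this claim
    expiresAt: now + ttlMs,
  };
  return store.replaceIfMatch(next, seen.etag) ? store.read(jobId) : null;
}
```

A loser of the race gets `null` and would move on to the next candidate job, which matches the "losers retry the next candidate" wording above.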

 ---
@@ -122,7 +125,9 @@ engine-class: agentic-coder # abstract; scheduler picks a concrete engine i
 capabilities: [os:any, node>=20] # hard requirements a factory MUST satisfy
 prefers: [factory:mac-2] # soft routing hints (affinity)
 priority: high # critical|high|medium|low → SLA + preemption
-budget: { usd: 5, tokens: 2M, wall: 4h } # hard ceilings; exceed → pause/fail
+budget: { usd: 5, tokens: 2M, wall: 4h } # wall = HARD ceiling (always enforceable). usd/tokens = best-effort
+                                         # caps: enforced only where the engine/provider exposes live metering;
+                                         # otherwise estimated from provider usage APIs post-hoc + alerted.
 deps: [job-123, job-456] # DAG: don't start until these reach `shipped`/`testing`
 idempotency-key: nomgap-ux-2 # dedupe: a second identical submit is a no-op
 retry: { max: 2, backoff: 5m, on: [timeout, verify_failed] }
@@ -134,10 +139,12 @@ tracker-item: ITEM-789 # link back to the originating tracker task

 - [ ] Define the manifest schema (Zod in the service; documented YAML for `.md`).
 - [ ] Backward-compat: a Phase-0 `.md` (only `engine/cwd/yolo`) parses with all new fields defaulted.
-- [ ] `idempotency-key` dedupe semantics specified (same key + same content hash = no-op).
-- [ ] `deps` DAG semantics specified (blocked state, cycle detection, fan-in/out).
-- **Acceptance:** a manifest fixture suite parses/validates; invalid manifests fail with precise errors.
-- **Verify gate:** schema unit tests (≥ 1 per field incl. defaults + 5 invalid cases).
+- [ ] **Capability grammar** defined: tokens are `key` (presence, e.g. `has:xcode`), `key:value` (e.g. `os:mac`, `engine:devin`), or `key<op>version` with `op ∈ {>=,>,=,<=,<}` (e.g. `node>=20`). `os:any` is a wildcard that matches every factory. A job matches a factory iff every required token is satisfied by the factory descriptor.
+- [ ] **`engine-class` taxonomy** defined as an enum (`agentic-coder`, `chat-coder`, `review-only`) with a documented engine→class map (`devin,claude,codex → agentic-coder`; `copilot → chat-coder`). If `engine` is set it wins; else the scheduler picks any free engine in the class honoring `prefers-engine`.
+- [ ] **`idempotency-key` semantics:** `key + content-hash` identical ⇒ no-op (returns existing job). Same `key`, **different** content ⇒ **rejected with 409** unless the prior job is still `queued`/`blocked` (then it is superseded). A re-`run`/`retry` of an existing job is **not** a new submit and never trips dedupe.
+- [ ] **`deps` semantics:** a dep is satisfied when it reaches `shipped` (default) or `testing` if `deps-mode: soft`. Submit-time **cycle detection** rejects cyclic graphs; unmet deps put the job in `blocked` (not `queued`). Cross-factory deps require the coordinator (P2); single-host deps work in P1.
+- **Acceptance:** a manifest fixture suite parses/validates; invalid manifests fail with precise errors; capability-grammar + dep-cycle + idempotency-conflict cases covered.
+- **Verify gate:** schema unit tests (≥ 1 per field incl. defaults + 5 invalid cases + grammar/cycle/409 cases).

 ---

@@ -161,7 +168,7 @@ review-policy: manual
 ```

 - [ ] Author starter catalog: `developer`, `backend-engineer`, `frontend-engineer`, `ux-designer`, `ui-designer`, `qa`, `reviewer`, `docs-writer`.
-- [ ] Persona overlay is **prepended** to the job body before the agent runs (and stripped from logs of secrets).
+- [ ] Persona overlay is **prepended** to the job body before the agent runs; secrets are never written to logs or the event stream (redaction at the source).
 - [ ] Profile supplies default `verify`, `capabilities`, `engine-class`, `allowed-scope` when the job omits them.
 - [ ] Profile versioning: changing a profile doesn't mutate in-flight jobs (snapshot at assign time).
 - [ ] `allowed-scope` enforced as a guardrail (warn in P1, enforce/deny in P2 via pre-flight diff check).
@@ -182,8 +189,10 @@ Given a `queued` job and the current fleet, choose `(factory, station/engine, pr
 3. **Score** each candidate factory:
    `score = w1·capabilityFit + w2·affinity(prefers, repo-stickiness) + w3·(1/load) + w4·costFit(budget) + w5·health − w6·starvationPenalty`
 4. **Tie-break:** highest priority job first; then oldest; then lowest cost class.
-5. **Assign** → write lease (TTL), set job `assigned`, decrement station capacity, bump fairness counter.
-6. **Preemption (P3+):** a `critical` job may pause a `low` job at a needed station (checkpoint + requeue).
+5. **Assign atomically** → create the lease under an optimistic-concurrency guard (`_etag`/`If-Match` or conditional insert keyed by `jobId`) **with a fresh `leaseEpoch`**; on conflict another factory won → retry the next candidate. Set job `assigned`, decrement station/seat capacity, bump fairness counter. Use **coordinator-authoritative timestamps** only.
+6. **Preemption (P3+):** a `critical` job may pause a `low` job at a needed station (checkpoint + requeue, bumping the preempted job's `leaseEpoch`).
+
+> **Phasing:** Phase 2 ships the deterministic **filter + atomic-assign core** (fixed weights). Phase 3 adds **tunable weights, preemption, and the explainability UI**. Phase 5 learns the weights (§14).

 - [ ] Implement pure, unit-testable scoring function (no I/O) with configurable weights.
 - [ ] Hard-filter correctness: never assign a job to a factory missing a required capability.
@@ -191,8 +200,11 @@ Given a `queued` job and the current fleet, choose `(factory, station/engine, pr
 - [ ] Fairness: no factory or product starves under sustained load (counter + penalty).
 - [ ] Explainability: every assignment records *why* (matched caps, score breakdown) in the event log.
 - [ ] Determinism: same inputs → same decision (seeded tie-breaks) for testability.
-- **Acceptance:** scenario fixtures (10+) produce expected assignments incl. starvation + capability-miss + budget-exceed.
-- **Verify gate:** router unit suite ≥ 95% branch coverage on the scoring/filter core.
+- [ ] Define **factory health** ∈ [0,1] = f(heartbeat freshness, recent run failure-rate, resource pressure); factories below a health floor are **filtered out**, not merely down-weighted.
+- [ ] **Station/seat capacity:** a factory's free stations = `min(host slots, per-engine seat limits)` (e.g. licensed Devin/Claude seats); the scheduler never over-subscribes a seat-limited engine.
+- [ ] **Distributed lock:** the Phase-0 local `lock` becomes a **coordinator-held lock** so same-`lock` jobs serialize across the whole fleet (prevents two factories pushing the same repo concurrently).
+- **Acceptance:** scenario fixtures (10+) produce expected assignments incl. starvation, capability-miss, seat-exhaustion, unhealthy-factory exclusion, and budget-exceed; a concurrent-claim race test proves exactly one winner.
+- **Verify gate:** router unit suite ≥ 95% branch coverage on the scoring/filter core; atomic-claim race test.

 ---

@@ -201,13 +213,15 @@ Given a `queued` job and the current fleet, choose `(factory, station/engine, pr
 Each machine runs a **factory agent** (the evolved `agent-queue` runner) that registers, heartbeats, claims jobs, and reports events.

 - [ ] **Capability auto-detection** at boot: OS, installed engines (devin/claude/codex/copilot), tool probes (xcode, figma-cli, docker, gpu), node/pnpm versions, available creds (presence only, never values).
+- [ ] **Enrollment / bootstrap trust**: first registration authenticates with a one-time enrollment secret (or an operator-issued platform JWT). The factory then receives a **scoped, rotatable factory token** (`jose` JWT); decommission = revoke. No standing shared secret in the queue.
 - [ ] **Registration**: `POST /fleet/factories` with descriptor → receives a factory id + token.
-- [ ] **Heartbeat**: periodic `PUT /fleet/factories/:id/heartbeat` (load, free stations, health); missed N → factory marked `offline`, its leases reclaimed.
-- [ ] **Claim loop**: `POST /fleet/leases/claim` advertising capabilities/free stations; receives a job + lease TTL.
-- [ ] **Report**: stream stage/log/event back (`POST /fleet/runs/:id/events`); renew lease while alive.
+- [ ] **Heartbeat**: periodic `PUT /fleet/factories/:id/heartbeat` (load, free stations, health). A **coordinator lease reaper** (not Cosmos TTL) sweeps `expiresAt < now` and reclaims, **bumping `leaseEpoch`** so the dead/zombie worker is fenced; a factory missing N heartbeats is marked `offline` and all its leases reclaimed.
+- [ ] **Claim loop**: `POST /fleet/leases/claim` advertising capabilities/free stations; atomic (exactly one winner, §4); receives a job + lease TTL + `leaseEpoch`. Use **claim backoff / long-poll** to bound Cosmos RU under many idle factories (see §22); broker push replaces polling in P4.
+- [ ] **Report**: stream stage/log/event back (`POST /fleet/runs/:id/events`), **echoing `leaseEpoch`** (stale epoch → 409, worker self-aborts); renew lease while alive.
+- [ ] **Environment prep**: before `verify`, the factory ensures deps are installed (cold checkout → `pnpm install`); prep time counts against `budget.wall`.
 - [ ] **Graceful drain**: factory can stop claiming, finish in-flight, deregister.
-- **Acceptance:** a factory registers, claims a matching job, heartbeats, completes, and a killed factory's job is reclaimed by another within the lease TTL.
-- **Verify gate:** factory-agent integration test against a mock coordinator; crash-recovery test.
+- **Acceptance:** a factory enrolls, claims a matching job, heartbeats, completes; a killed factory's job is reclaimed by another within the lease TTL and the killed worker's late report is **rejected by fencing**.
+- **Verify gate:** factory-agent integration test against a mock coordinator; crash-recovery + fencing-rejection test.

 ---

@@ -223,9 +237,11 @@ Three transports were evaluated. **Decision: platform-service-native coordinator

 - [ ] Document the decision + rationale in-repo (this section is the canonical record).
 - [ ] Define the **claim/lease protocol** once; both git-queue (poll) and service (API) implement it.
-- [ ] Offline-degrade: a factory cut off from the coordinator falls back to its local git-queue and reconciles on reconnect (idempotency-key prevents double-execution).
-- **Acceptance:** the same job manifest runs identically through the bash/git path and the service path.
-- **Verify gate:** contract test asserting protocol parity (git vs service).
+- [ ] **Split-brain / network-partition safety:** a partitioned factory may keep running and even `git push`. `idempotency-key` dedupes *submits* but cannot undo *side-effects*. Mitigation: **fencing** — the coordinator rejects `ship`/merge reports from a stale `leaseEpoch`, and the distributed `lock` (§7) prevents a reclaimed-job's twin from pushing the same repo. Residual risk (a stale push to a feature branch) is contained by the PR-merge ship gate (§10) and surfaced for human triage.
+- [ ] **Offline-degrade**: a factory cut off from the coordinator falls back to its local git-queue and reconciles on reconnect; on reconnect it presents its `leaseEpoch` — if reclaimed, its results are quarantined, not auto-merged.
+- [ ] **Poll cost**: bound claim-loop RU via long-poll/backoff (§22); migrate to broker push at P4.
+- **Acceptance:** the same job manifest runs identically through the bash/git path and the service path; a simulated partition does not double-merge (fencing test).
+- **Verify gate:** contract test asserting protocol parity (git vs service) + partition/fencing test.

 ---

@@ -241,6 +257,17 @@ Three transports were evaluated. **Decision: platform-service-native coordinator
 - **Acceptance:** filing a tracker task, marking it `agent:run`, results in a queued job; on ship, the item flips to `done` with a SHA comment.
 - **Verify gate:** adapter e2e against a tracker-service test instance (or mock); round-trip assertion.

+**Stage → tracker status mapping** (tracker's enum is coarser than the fleet's; keep fine-grained stage in a label + comment so no detail is lost):
+
+| Fleet stage | Tracker `status` | Extra |
+| ----------- | ---------------- | ----- |
+| `queued` / `assigned` / `blocked` | `in_progress` | label `fleet:<stage>` |
+| `building` / `review` / `testing` | `in_progress` | label `fleet:<stage>` + progress comment |
+| `shipped` | `done` | comment with SHA(s)/PR link/verify result |
+| `failed` / `dead_letter` | `in_progress` + label `needs-triage` | never auto-`closed`/`wont_fix` (humans decide) |
+
+**Ship semantics (PR flow):** `shipped` = change **merged to target branch with CI green** (default), OR `pr-opened` when `review-policy` defers merge to humans/CI — configurable per profile. This honors the non-goal that CI still gates merges (§3); the agent never bypasses branch protection.
+
 ### Phase 2 — Native spine
 - [ ] Stand up a `fleet` (a.k.a. `orchestrator`) module **inside platform-service**, sibling to `tracker-service`: pattern `types.ts → repository.ts → routes.ts`, ESM, Cosmos, `productId`, `req.log`.
 - [ ] Endpoints: jobs CRUD, claim/lease, events/report, factories register/heartbeat, profiles, stats.
@@ -249,7 +276,8 @@ Three transports were evaluated. **Decision: platform-service-native coordinator
 - **Verify gate:** module test suite (repository + routes) using the shared `@bytelyst/testing` inject helpers.
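
The stage → tracker status mapping table in this section could be expressed as a pure lookup in the adapter. The sketch below is illustrative only: `toTrackerUpdate` and the `TrackerUpdate` shape are hypothetical, the tracker enum is reduced to the two values the table uses, and the `fleet:<stage>` label format is an assumption about how the fine-grained stage is encoded in the label.

```typescript
type FleetStage =
  | "queued" | "assigned" | "blocked"
  | "building" | "review" | "testing"
  | "shipped" | "failed" | "dead_letter";

// Only the tracker statuses the mapping table uses; the real enum is larger.
type TrackerStatus = "in_progress" | "done";

interface TrackerUpdate {
  status: TrackerStatus;
  labels: string[];
  comment?: string;
}

// Pure function: same stage in, same tracker update out.
function toTrackerUpdate(stage: FleetStage, detail?: string): TrackerUpdate {
  switch (stage) {
    case "queued":
    case "assigned":
    case "blocked":
      return { status: "in_progress", labels: [`fleet:${stage}`] };
    case "building":
    case "review":
    case "testing":
      // Fine-grained stage goes into a label plus a progress comment.
      return {
        status: "in_progress",
        labels: [`fleet:${stage}`],
        comment: detail ?? `stage: ${stage}`,
      };
    case "shipped":
      // `detail` would carry the SHA(s), PR link, and verify result.
      return { status: "done", labels: [], comment: detail };
    case "failed":
    case "dead_letter":
      // Never auto-closed: a human decides what happens next.
      return { status: "in_progress", labels: ["needs-triage"] };
  }
}
```

Keeping the mapping free of I/O makes the round-trip assertion in the verify gate straightforward to write against a mock tracker.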

 ### Phase 3 — Unified control plane
-- [ ] Add a **Fleet** surface to `tracker-web` reusing auth/Primitives/DataTable/product switcher: fleet map (factories + load/health), job table, job DAG, **live log streaming (SSE)**, lease/heartbeat status, cost burndown, approve/ship buttons.
+- [ ] Add a **Fleet** surface to `tracker-web` reusing auth/Primitives/DataTable/product switcher: fleet map (factories + load/health), job table, job DAG, **live log streaming**, lease/heartbeat status, cost burndown, approve/ship buttons.
+- [ ] **Streaming caveat (correctness):** live logs **must not** use the existing buffering catch-all proxy `/api/tracker/[...path]` — it does `res.text()` and would never stream. Use a **dedicated Next.js Route Handler returning a `ReadableStream` (SSE)** or a direct SSE/WebSocket to platform-service. Full logs are shipped to blob storage (§17); the endpoint serves stored tail + live append.
 - [ ] The Node TUI dashboard becomes a thin client of the same `/fleet` API (parity with web).
 - **Acceptance:** an operator can watch all factories + tail any job log + ship from the browser.
 - **Verify gate:** web e2e (Playwright) covering fleet map render, live log, and a ship action.
@@ -258,13 +286,15 @@ Three transports were evaluated. **Decision: platform-service-native coordinator

 ## 11. Lifecycle & gates at scale (feature)

-- [ ] Canonical stages enforced server-side: `queued → assigned → building → review → testing → shipped` (+ `failed`, `dead_letter`).
+- [ ] Canonical stages enforced server-side: `queued → assigned → building → review → testing → shipped` (+ `blocked`, `failed`, `dead_letter`); transitions validated (illegal transition → 409).
 - [ ] Per-profile default `verify`; per-job override; verify runs at the factory, result reported as an event.
 - [ ] Human gates: `review-policy` routes to reviewers; multi-reviewer support (P3).
 - [ ] **Dead-letter**: after `retry.max` exhausted, job → `dead_letter` with full diagnostics; never silently dropped.
 - [ ] **Backpressure**: when no factory can take more, jobs stay `queued` (no thrash); SLA timers visible.
-- **Acceptance:** a perpetually-failing job lands in `dead_letter` after configured retries; a passing one auto-advances to `testing` then waits for human `ship`.
-- **Verify gate:** lifecycle state-machine unit tests (all transitions + illegal-transition rejection).
+- [ ] **Ship semantics** are profile-configurable (merged+green vs `pr-opened`, §10); `shipped` is terminal-success, `dead_letter` terminal-failure; `blocked` (unmet deps) is distinct from `queued`.
+- [ ] **Retry vs idempotency**: a retry creates a new `fleet_runs` attempt under the same job/`idempotency-key` (never a duplicate job); backoff honored; `retry.on` filters which failure classes retry.
+- **Acceptance:** a perpetually-failing job lands in `dead_letter` after configured retries; a passing one auto-advances to `testing` then waits for human `ship`; an illegal transition is rejected.
+- **Verify gate:** lifecycle state-machine unit tests (all transitions + illegal-transition rejection + retry/dead-letter path).

 ---

@@ -290,12 +320,15 @@ Each container partitioned sensibly; every doc has `productId`.

 - [ ] `fleet_jobs` (pk `/productId`) — manifest snapshot, current stage, idempotency-key, tracker-item link.
 - [ ] `fleet_runs` (pk `/jobId`) — one per execution attempt: factory, engine, profile snapshot, timings, cost, exit, verify result.
-- [ ] `fleet_leases` (pk `/jobId`) — holder factory, TTL, renewals; TTL index for auto-expiry.
-- [ ] `fleet_factories` (pk `/productId`) — descriptor, capabilities, health, load, last heartbeat.
-- [ ] `fleet_profiles` (pk `/productId`) — versioned profile snapshots.
-- [ ] `fleet_events` (pk `/jobId`) — append-only audit/event stream (stage changes, logs ptr, cost ticks, decisions).
+- [ ] `fleet_leases` (pk `/jobId`) — holder factory, `expiresAt`, **`leaseEpoch` (fencing)**, renewals. **Reclaim via a coordinator reaper** that scans `expiresAt < now` — Cosmos TTL only garbage-collects stale rows, it **cannot trigger reclaim logic**. Claim guarded by `_etag`/`If-Match`.
+- [ ] `fleet_factories` (pk `/productId`) — descriptor, capabilities, health, load, last heartbeat, seat limits.
+- [ ] `fleet_profiles` (pk `/productId`) — versioned profile snapshots (immutable per version).
+- [ ] `fleet_events` (pk `/jobId`) — append-only audit/event stream (stage changes, log pointers, cost ticks, scheduler decisions).
+- [ ] `fleet_artifacts` (pk `/jobId`) — pointers to **blob-stored** logs + artifacts (coverage, screenshots, build output). Large logs live in `@bytelyst/blob`, **never** inline in Cosmos (doc-size + RU limits).
 - [ ] Relate to existing tracker `Item` via `tracker-item` (no duplication of planning data).
-- **Acceptance:** repository CRUD + query tests per container; lease TTL expiry verified.
+- [ ] **Optimistic concurrency** (`_etag`) on every job stage transition + lease claim to prevent lost updates / double-assignment.
+- [ ] **Indexing/RU**: the claim query is hot — index `stage`, `priority`, `capabilities`; avoid cross-partition fan-out; provision RU/s per §22.
+- **Acceptance:** repository CRUD + query tests per container; **atomic-claim race test (N concurrent claimers → exactly one wins)**; reaper-reclaim + fencing-rejection test; lease-expiry verified via reaper (not TTL).
 - **Verify gate:** repository unit/integration tests (memory + Cosmos provider via `DB_PROVIDER`).

 ---
@@ -325,14 +358,15 @@ Each phase: **Goal → checklist → Exit criteria**. Don't start a phase until

 - [ ] Scaffold `fleet`/`orchestrator` module in `platform-service` (`types/repository/routes`, Zod, ESM, `productId`).
 - [ ] Cosmos containers (§13) + repository layer (memory + Cosmos providers).
-- [ ] Claim/lease protocol endpoints + TTL expiry + reclaim (§8, §9).
-- [ ] Port `agent-queue` runner to a **factory agent** API client (register/heartbeat/claim/report) while keeping git-queue fallback.
-- [ ] Scheduler/router core (§7) as a pure module + wired into assignment.
+- [ ] **Atomic claim** (optimistic concurrency / `_etag`) + **lease reaper** + **fencing (`leaseEpoch`)** endpoints (§4/§8/§9) — *not* Cosmos-TTL-driven reclaim.
+- [ ] Port `agent-queue` runner to a **factory agent** API client (enroll/register/heartbeat/claim/report, fencing-aware) while keeping git-queue fallback.
+- [ ] Scheduler/router core (§7) as a pure module (fixed weights) + wired into atomic assignment.
 - [ ] Tracker adapter calls the module directly (not just file export).
-- [ ] Auth: factory tokens; scoped; secret isolation enforced (§12 subset).
-- [ ] Module test suite (repository + routes via `@bytelyst/testing`); crash-recovery + lease-expiry tests.
+- [ ] Auth: factory enrollment + scoped rotatable tokens; secret isolation enforced (§12 subset).
+- [ ] **Feature flags** (`fleet.enabled`, `fleet.route_via_service`) + **shadow/dual-run** vs P1 before cutover (§21).
+- [ ] Module test suite (repository + routes via `@bytelyst/testing`); **atomic-claim race**, crash-recovery, fencing-rejection, reaper-reclaim tests.
 - [ ] Two-factory demo (e.g. mac + ubuntu) running 3 parallel jobs end-to-end.
-- **Exit criteria:** all boxes ✅; `pnpm --filter @lysnrai/platform-service test` green; killing a factory mid-job → another reclaims and completes; all state in Cosmos with `productId`.
+- **Exit criteria:** all boxes ✅; `pnpm --filter @lysnrai/platform-service test` green; killing a factory mid-job → another reclaims and completes **and the dead worker's late report is fenced**; concurrent claimers never double-assign; all state in Cosmos with `productId`; **flag-off rollback verified** (§21).
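
The Phase 2 checklist above names a lease reaper and fencing as separate pieces from the claim itself. A minimal sketch of how the two might fit together is shown here, assuming an in-memory `Map` in place of `fleet_leases`; `reapExpired` and `handleReport` are hypothetical names, and the 404 for an unknown job is an assumption (the roadmap only specifies 409 for a stale epoch).

```typescript
interface LeaseRow {
  jobId: string;
  holder: string | null;
  leaseEpoch: number;
  expiresAt: number; // coordinator time, in ms
}

// Sweep leases whose `expiresAt` is in the past, using the coordinator's clock.
// Bumping `leaseEpoch` is what fences the previous holder; a storage TTL alone
// would only delete the row and could not run this logic.
function reapExpired(leases: Map<string, LeaseRow>, coordinatorNow: number): string[] {
  const reclaimed: string[] = [];
  leases.forEach((lease) => {
    if (lease.holder !== null && lease.expiresAt < coordinatorNow) {
      lease.holder = null;
      lease.leaseEpoch += 1;
      reclaimed.push(lease.jobId);
    }
  });
  return reclaimed;
}

// Accept a worker report only if it echoes the current epoch of a live lease.
// Returns an HTTP-style status code.
function handleReport(
  leases: Map<string, LeaseRow>,
  jobId: string,
  reportedEpoch: number,
): 200 | 404 | 409 {
  const lease = leases.get(jobId);
  if (!lease) return 404; // assumption: unknown job is a 404
  const current = lease.holder !== null && lease.leaseEpoch === reportedEpoch;
  return current ? 200 : 409; // stale epoch: worker should self-abort
}
```

With this shape, the "dead worker's late report is fenced" exit criterion reduces to asserting that a report carrying the pre-reap epoch gets a 409.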

 ### Phase 3 — Fleet control plane in tracker-web + DAG + budgets + scoring router
 **Goal:** one browser control plane; smart routing + budgets live.
@@ -388,6 +422,10 @@ Each phase: **Goal → checklist → Exit criteria**. Don't start a phase until
 | Security/scope/secret isolation | P1→P2 | §12 |
 | Broker + autoscaling | P4 | §14 |
 | Learned routing | P5 | §14 |
+| Atomic claim + fencing + distributed lock | P2 | §4/§7/§9 |
+| Rollout / rollback / feature flags | P2→ | §21 |
+| Capacity planning & RU/cost | P2→ | §22 |
+| Ownership & RACI / on-call | all | §23 |

 ---

@@ -410,21 +448,26 @@ A feature/phase is **not done** until **every** item below is true (this is the

 ## 17. Observability & control plane details

-- [ ] **Live logs** via SSE from factory → coordinator → web/TUI (single stream contract).
-- [ ] **Metrics**: queue depth, assign latency, run duration, verify pass-rate, cost, factory utilization, fairness.
-- [ ] **Alerting**: stall (no log N min), failure spikes, budget breach, factory offline, dead-letter.
-- [ ] **Tracing**: a job's full timeline (queued→…→shipped) reconstructable from `fleet_events`.
+- [ ] **Log transport/storage**: factory ships logs to blob (`@bytelyst/blob`); `fleet_events` carries pointers + a recent-tail buffer. The control plane serves stored tail + live append (via the streaming route, **not** the buffering proxy — §10).
+- [ ] **Live logs** via SSE (single stream contract) from the streaming endpoint to web/TUI.
+- [ ] **Metrics**: queue depth, `blocked` count, assign latency, claim-loop RU/s, run duration, verify pass-rate, cost, factory utilization, fairness, reclaim/fencing-rejection counts.
+- [ ] **Alerting**: stall (no log N min), failure spikes, budget breach, factory offline, dead-letter, **claim-race anomalies**, RU throttling (Cosmos 429s).
+- [ ] **Tracing**: a job's full timeline (queued→…→shipped) reconstructable from `fleet_events` (immutable, ordered).
 - [ ] **Cost burndown** per job/product/day with budget overlays.
+- [ ] **SLOs defined + dashboarded** (see §19 targets); error budget tracked per SLO.

 ---

 ## 18. Risks & gaps explicitly tracked (expert call-outs)

-- [ ] **Duplicate execution** across transports (git fallback + service) — mitigated by `idempotency-key` + lease.
-- [ ] **Crash recovery** — lease TTL + reclaim; checkpoint long jobs where engines allow.
-- [ ] **Shared-package conflicts** — two jobs editing `@bytelyst/*` simultaneously → lock + reviewer gate.
+- [ ] **Duplicate execution** across transports (git fallback + service) — `idempotency-key` (submit) + atomic lease (assign) + **fencing token** (side-effect) + distributed `lock` (push).
+- [ ] **Crash recovery** — coordinator **lease reaper + fencing** (not Cosmos TTL); checkpoint long jobs where engines allow.
+- [ ] **Split-brain / partition** — fencing rejects stale `leaseEpoch` writes; reclaimed-job results quarantined, not auto-merged (§9).
+- [ ] **Shared-package conflicts** — two jobs editing `@bytelyst/*` simultaneously → fleet-wide `lock` + reviewer gate.
 - [ ] **Starvation/fairness** — per-product + per-factory counters with penalty.
-- [ ] **Cost runaway** — hard budgets + global kill switch.
+- [ ] **Cost runaway** — `budget.wall` hard ceiling everywhere; `usd`/`tokens` best-effort (provider metering) + global kill switch.
+- [ ] **Cosmos RU throttling (429)** — hot claim path; bound via long-poll/backoff + indexing (§13/§22); broker offload at P4.
+- [ ] **Clock skew** — coordinator-authoritative timestamps for all lease/SLA math (§4).
 - [ ] **Tool-version drift / reproducibility** — record engine + tool versions per run; pin where possible.
 - [ ] **Windows quirks** — path/shell differences in the factory agent; capability-gate Windows-only work.
 - [ ] **Human-review bottleneck** — auto-verify as much as possible; batch review UI; reviewer routing.
@@ -437,12 +480,19 @@ A feature/phase is **not done** until **every** item below is true (this is the

 ## 19. Success metrics

-- Throughput: jobs shipped/day; parallel utilization (% of fleet busy).
-- Quality: % auto-verified, first-pass success rate, escaped-defect rate, human-edit rate post-agent.
-- Speed: mean time queued→shipped; assign latency.
-- Cost: $/shipped job; budget-breach rate.
-- Reliability: lease-reclaim success, dead-letter rate, factory uptime.
-- Fairness: max/min product wait-time ratio.
+Each metric has a **provisional SLO target** (tune with real data; tracked with an error budget):
+
+| Dimension | Metric | Provisional SLO target |
+| --------- | ------ | ---------------------- |
+| Throughput | jobs shipped/day; parallel utilization | utilization ≥ 60% under backlog |
+| Quality | % auto-verified; first-pass success; escaped-defect; post-agent human-edit rate | first-pass ≥ 70%; escaped-defect < 2% |
+| Speed | assign latency; time queued→shipped (excl. human gate) | assign p95 < 5s; queue-wait p95 < 2m at target load |
+| Cost | $/shipped job; budget-breach rate | budget-breach < 1% of jobs |
+| Reliability | lease-reclaim success; dead-letter rate; factory uptime; double-execution incidents | reclaim success ≥ 99.9%; **double-merge = 0**; dead-letter < 5% |
+| Fairness | max/min product wait-time ratio | ratio < 3× |
+| Correctness | atomic-claim violations; fencing rejections functioning | claim violations = 0 |
+
+> Targets are starting points; the §0 owners ratify per-phase SLOs before that phase's exit.

 ---

@@ -457,5 +507,52 @@ A feature/phase is **not done** until **every** item below is true (this is the

 ---

+## 21. Rollout, rollback & data migration
+
+Each phase ships behind controls so it can be turned off without losing work.
+
+- [ ] **Feature-flagged rollout**: gate each phase's new path behind a platform feature flag (`fleet.enabled`, `fleet.route_via_service`, `fleet.tracker_sync`); default off; enable per-product first.
+- [ ] **Dual-run / shadow**: P2 coordinator runs in shadow (assign decisions logged, not executed) alongside the P0/P1 path before cutover; compare decisions.
+- [ ] **Cutover is reversible**: a factory can fall back from service-claim to git-queue via flag; no schema-destructive step on the rollback path.
+- [ ] **Data migration**: introducing Cosmos containers (P2) is **additive** — no migration of existing tracker data; backfill is read-only (link `tracker-item`, don't mutate). Container creation is idempotent (registered in `cosmos-init`).
+- [ ] **Backward-compat gate**: every phase re-runs Phase-0 `selftest.sh` + a corpus of legacy `.md` files (regression).
+- [ ] **Rollback drill**: each phase's exit includes a tested rollback (flag off → prior behavior, in-flight jobs drain or requeue cleanly).
+- **Acceptance:** flipping `fleet.*` flags off returns the system to the prior phase's behavior with zero data loss; in-flight jobs either complete or requeue.
+- **Verify gate:** rollout/rollback drill documented + a flag-off regression run is green.
+
+---
+
+## 22. Capacity planning & cost
+
+- [ ] **Concurrency model**: fleet throughput = Σ factory free-stations, bounded by per-engine **seat limits** (e.g. N Devin seats) — document seat inventory per engine before P2.
+- [ ] **Cosmos RU budgeting**: the claim/heartbeat paths are the hot loops. Estimate RU/s = (factories × claim-poll rate × query RU) + (factories × heartbeat rate × upsert RU); pick **long-poll interval** to keep steady-state RU within a provisioned budget; enable autoscale RU with a ceiling + 429 alerting.
+- [ ] **Polling vs push**: at F factories the poll RU grows linearly — define the F threshold that triggers the P4 broker migration.
+- [ ] **Blob storage**: logs/artifacts sizing + lifecycle (hot → cool → delete) per retention policy (§18).
+- [ ] **Factory sizing**: per-OS resource baseline (CPU/RAM/disk for N concurrent agent sessions + warm checkouts); disk pressure as a health input.
+- [ ] **Cost guardrails**: per-product spend caps + alerts; ties to `budget` and the global kill-switch.
+- **Acceptance:** a documented capacity sheet (seats, RU/s, blob GB, factory specs) sized for the target steady-state + 2× burst.
+- **Verify gate:** load test sustains target throughput within the RU/cost budget (no 429 storms).
+
+---
+
+## 23. Ownership & RACI
+
+Owners are roles, not names — assign before each phase starts (this removes the "undefined owner" gap).
+
+| Area | Responsible (R) | Accountable (A) | Consulted (C) | Informed (I) |
+| ---- | --------------- | --------------- | ------------- | ------------ |
+| Runner / factory agent (bash) | DevOps eng | Platform lead | — | All |
+| Coordinator module (platform-service) | Backend eng | Platform lead | Security | All |
+| Scheduler/router | Distributed-systems eng | Platform lead | Backend | All |
+| Control plane (tracker-web Fleet) | Frontend eng | Platform lead | UX | All |
+| Security/governance | Security eng | Security lead | Platform | All |
+| Capacity/cost & SLOs | SRE | Platform lead | Finance | All |
+| Profiles & persona governance | Eng leads | Platform lead | — | All |
+
+- [ ] Each phase names its R/A before kickoff; SLOs (§19) ratified by A.
+- [ ] On-call + runbooks established before the fleet runs unattended `yolo` workloads (Phase 2+).
+
+---
+
 *This document is the single source of truth for the gigafactory build. Keep the §0 table and per-phase checkboxes updated; a phase ships only when its Exit criteria and the §16 Definition-of-Done rubric are fully green.*
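
The RU/s estimate from §22 can be worked through as a small calculation. The sketch below is only an illustration: the function and field names are hypothetical, rates are expressed as intervals (rate = 1 / interval), and the RU-per-operation figures in the usage note are placeholders to be replaced with measured values.

```typescript
interface RuInputs {
  factories: number;
  claimPollIntervalSec: number; // seconds between claim polls per factory
  claimQueryRu: number;         // RU charged per claim query (measure this)
  heartbeatIntervalSec: number; // seconds between heartbeats per factory
  heartbeatUpsertRu: number;    // RU charged per heartbeat upsert (measure this)
}

// RU/s = (factories × claim-poll rate × query RU)
//      + (factories × heartbeat rate × upsert RU)
function estimateRuPerSec(i: RuInputs): number {
  const claimRu = (i.factories / i.claimPollIntervalSec) * i.claimQueryRu;
  const heartbeatRu = (i.factories / i.heartbeatIntervalSec) * i.heartbeatUpsertRu;
  return claimRu + heartbeatRu;
}

// Smallest poll interval that keeps steady-state load within `budgetRuPerSec`.
// Returns Infinity when heartbeats alone already consume the budget.
function minPollIntervalSec(
  i: Omit<RuInputs, "claimPollIntervalSec">,
  budgetRuPerSec: number,
): number {
  const heartbeatRu = (i.factories / i.heartbeatIntervalSec) * i.heartbeatUpsertRu;
  const remaining = budgetRuPerSec - heartbeatRu;
  if (remaining <= 0) return Infinity;
  return (i.factories * i.claimQueryRu) / remaining;
}
```

For example, with 20 factories polling every 5 s at an assumed 3 RU per query and heartbeating every 10 s at an assumed 10 RU per upsert, the estimate is 12 + 20 = 32 RU/s. Because the poll term grows linearly with the factory count, the same function can be used to locate the F threshold at which the P4 broker migration becomes worthwhile.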