docs(gigafactory): reconcile overview + roadmap to current reality

- System overview: mark Phase 4 in-progress (M0 RU gate shipped), add
  fleet_queue_state container + GET /fleet/queue-state, document the heartbeat
  cadence vs 90s stale gotcha, the tracker-web caps=build form bug, the missing
  deregister API, and the ended=-race fix; drop the now-false "roadmap §0 stale"
  and "boxes 384/386 unticked" claims (both reconciled); link the redesign doc.
- Roadmap: §0 Phase 4 -> in progress (M0); align the Phase-2 §8 spec endpoint
  sketches to the as-built API (/fleet/factories/enroll, /factories/heartbeat,
  /fleet/claim) + note the heartbeat cadence and the M0 gate.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
saravanakumardb1 2026-06-01 00:02:45 -07:00
parent fa1f1d1b30
commit 2993994273
2 changed files with 67 additions and 30 deletions

@@ -14,7 +14,7 @@
 | **1** | Manifest + profiles + capabilities + tracker adapter (single host) | ✅ done | ~98% | adapter e2e + selftest |
 | **2** | Coordinator as platform-service module + Cosmos + multi-factory leasing | ✅ done | ~98% | fleet e2e + module tests |
 | **3** | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ✅ done | 100% | web e2e + router tests |
-| **4** | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite |
+| **4** | Message bus + autoscaling + cross-OS capability marketplace | ◐ in progress | ~10% | load/chaos suite — **M0 RU gate shipped** (`fleet_queue_state` + `GET /fleet/queue-state` + `AQ_FLEET_GATE`); broker/M1+ per `FLEET_DISPATCH_REDESIGN.md` |
 | **5** | Self-optimizing / learned routing | ☐ not started | 0% | offline eval + A/B |
 Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary. **Owners per phase: §23 · rollout/rollback: §21 · capacity & SLOs: §22/§19.** For the full current-state architecture, diagrams, code map, next steps and known gaps see **`GIGAFACTORY_SYSTEM_OVERVIEW.md`** (companion doc).
@@ -217,9 +217,13 @@ Each machine runs a **factory agent** (the evolved `agent-queue` runner) that re
 - [ ] **Capability auto-detection** at boot: OS, installed engines (devin/claude/codex/copilot), tool probes (xcode, figma-cli, docker, gpu), node/pnpm versions, available creds (presence only, never values).
 - [ ] **Enrollment / bootstrap trust**: first registration authenticates with a one-time enrollment secret (or an operator-issued platform JWT). The factory then receives a **scoped, rotatable factory token** (`jose` JWT); decommission = revoke. No standing shared secret in the queue.
-- [ ] **Registration**: `POST /fleet/factories` with descriptor → receives a factory id + token.
-- [ ] **Heartbeat**: periodic `PUT /fleet/factories/:id/heartbeat` (load, free stations, health). A **coordinator lease reaper** (not Cosmos TTL) sweeps `expiresAt < now` and reclaims, **bumping `leaseEpoch`** so the dead/zombie worker is fenced; a factory missing N heartbeats is marked `offline` and all its leases reclaimed.
-- [ ] **Claim loop**: `POST /fleet/leases/claim` advertising capabilities/free stations; atomic (exactly one winner, §4); receives a job + lease TTL + `leaseEpoch`. Use **claim backoff / long-poll** to bound Cosmos RU under many idle factories (see §22); broker push replaces polling in P4.
+- [ ] **Registration**: `POST /fleet/factories/enroll` with descriptor → receives a factory id + one-time token (built as: registration == first heartbeat; enroll mints the scoped token).
+- [ ] **Heartbeat**: periodic `POST /fleet/factories/heartbeat` (load, free stations, health). A **coordinator lease reaper** (not Cosmos TTL) sweeps `expiresAt < now` and reclaims, **bumping `leaseEpoch`** so the dead/zombie worker is fenced; a factory missing N heartbeats is marked `offline` and all its leases reclaimed. **Cadence must be < the 90s stale threshold** (`AQ_FLEET_LEASE_RENEW_SEC`; fleet launcher uses 30s).
+- [ ] **Claim loop**: `POST /fleet/claim` advertising capabilities/free stations; atomic (exactly one winner, §4); receives a job + lease TTL + `leaseEpoch`. Use **claim backoff / long-poll** to bound Cosmos RU under many idle factories (see §22); **Phase-4 M0 adds the `AQ_FLEET_GATE` skip** (`GET /fleet/queue-state`), and broker push replaces polling in M1+.
+> The endpoint paths above are the **as-built** API (`/fleet/factories/enroll`,
+> `/fleet/factories/heartbeat`, `/fleet/claim`) — see `GIGAFACTORY_SYSTEM_OVERVIEW.md`
+> §9 and the fleet module README for the authoritative list.
 - [ ] **Report**: stream stage/log/event back (`POST /fleet/runs/:id/events`), **echoing `leaseEpoch`** (stale epoch → 409, worker self-aborts); renew lease while alive.
 - [ ] **Environment prep**: before `verify`, the factory ensures deps are installed (cold checkout → `pnpm install`); prep time counts against `budget.wall`.
 - [ ] **Graceful drain**: factory can stop claiming, finish in-flight, deregister.
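
The reaper and epoch-echo rules in the Heartbeat and Report items above can be sketched in memory as follows. This is an illustration under stated assumptions, not the coordinator's code: the `Lease` shape and the `reapExpired` / `checkReportEpoch` helpers are hypothetical names. Only `expiresAt`, `leaseEpoch`, and the 409-on-stale-epoch behavior come from the checklist.

```typescript
// Hypothetical lease record; only expiresAt and leaseEpoch are named in the spec.
interface Lease {
  jobId: string;
  factoryId: string | null; // null once the lease has been reclaimed
  expiresAt: number;        // epoch milliseconds
  leaseEpoch: number;
}

// Reclaim every held lease with expiresAt < now and bump its epoch so the
// previous holder is fenced out. Returns the leases that were reclaimed.
function reapExpired(leases: Lease[], now: number): Lease[] {
  const reclaimed: Lease[] = [];
  for (const lease of leases) {
    if (lease.factoryId !== null && lease.expiresAt < now) {
      lease.factoryId = null;
      lease.leaseEpoch += 1;
      reclaimed.push(lease);
    }
  }
  return reclaimed;
}

// HTTP status a report endpoint would answer with for an echoed epoch:
// 200 when the epoch is current, 409 when the worker has been fenced.
function checkReportEpoch(lease: Lease, echoedEpoch: number): 200 | 409 {
  return echoedEpoch === lease.leaseEpoch ? 200 : 409;
}
```

A worker that receives 409 would stop reporting and abort, because its lease now belongs to a newer epoch.
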

@@ -3,7 +3,11 @@
 > Companion to `GIGAFACTORY_ROADMAP.md` (the source-of-truth spec & checklists).
 > This document describes **what is actually built today**, how the pieces fit
 > together, the architecture diagrams, the code map across both repos, the next
-> steps, and the known bugs/gaps. Last reviewed: **2026-05-30**.
+> steps, and the known bugs/gaps. Last reviewed: **2026-05-31**.
+>
+> The **Phase-4 plan + the as-built M0 RU gate** live in
+> [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md) — read it for the
+> broker-backed dispatch design and the migration checklist.
 ---
@@ -32,12 +36,14 @@ API.
 | **1** | Manifest + profiles + capabilities + tracker adapter | ✅ ~98% | Only leftover: Node `dash` field surfacing — **now also done** via fleet-dash tags. Effectively complete |
 | **2** | Coordinator module + Cosmos + multi-factory leasing | ✅ ~98% | Scheduler wiring, enrollment+tokens, tracker-bridge are **done in code** but boxes 384/386 unticked in roadmap (see §11 Gaps) |
 | **3** | Fleet control plane (web + TUI) + DAG + budgets + scoring | ✅ 100% (all boxes ticked) | Pending: Playwright e2e wired into CI; live multi-host operator run |
-| **4** | Message bus + autoscaling + capability marketplace | ☐ 0% | Not started — next major frontier |
+| **4** | Message bus + autoscaling + capability marketplace | 🟡 in progress | **M0 (RU gate) shipped** — see below. Broker (M1+) not started. Plan: [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md) |
 | **5** | Self-optimizing / learned routing | ☐ 0% | Not started |
-> ⚠️ The **`GIGAFACTORY_ROADMAP.md` §0 progress table is stale** — it shows
-> Phase 3 as "0% not started" although every Phase-3 box below it is ticked. See
-> §11 (Bugs & Gaps) — this should be corrected.
+> **Phase-4 M0 (RU gate) is live (2026-05-31):** a per-product `fleet_queue_state`
+> doc holds a monotonic `version` (bumped on job create + every stage change);
+> factories with `AQ_FLEET_GATE=1` point-read `GET /fleet/queue-state` (~1 RU) and
+> skip the expensive claim while nothing changed — cutting idle Cosmos RU without
+> raising the local poll interval. Default OFF; the live fleet runs it on.
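
The gate described above reduces to a small decision rule: read the queue version first, and only issue the claim when that version differs from the one last seen while idle. A minimal sketch, assuming the factory keeps that one number in memory; `GateState`, `shouldClaim`, and `afterClaim` are hypothetical names, and the real agent may track more than this.

```typescript
// Hypothetical gate state: the queue version observed before the most recent
// claim that came back empty, or null when no such version is known.
interface GateState {
  lastIdleVersion: number | null;
}

// Decide whether the expensive claim is worth issuing for this poll tick.
// currentVersion is the value point-read from the queue-state document.
function shouldClaim(state: GateState, currentVersion: number): boolean {
  return state.lastIdleVersion !== currentVersion;
}

// Update the gate after a claim attempt. The version must be the one read
// BEFORE the claim, so a bump that races with the claim is never hidden.
function afterClaim(
  state: GateState,
  versionBeforeClaim: number,
  gotJob: boolean,
): GateState {
  // After a successful claim more work may remain, so keep the gate open.
  return { lastIdleVersion: gotJob ? null : versionBeforeClaim };
}
```

Reading the version before the claim is the conservative order: if a job is created between the read and the claim, the stored version is already out of date and the next tick claims again.
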
 ---
@@ -181,6 +187,7 @@ sequenceDiagram
 | `fleet_profiles` | `/productId` | immutable, versioned persona/capability profile snapshot |
 | `fleet_events` | `/jobId` | append-only audit stream (monotonic `seq`) — powers SSE |
 | `fleet_artifacts` | `/jobId` | **pointers** to blob-stored artifacts (no inline logs) |
+| `fleet_queue_state` | `/productId` | **Phase-4 M0 RU gate**: monotonic `version` bumped on job create + every stage change; read via `GET /fleet/queue-state` so a factory can cheaply detect "work changed" |
 Every document carries `productId`. Containers registered in `lib/cosmos-init.ts`.
@@ -239,7 +246,7 @@ Budgets GET /fleet/budgets/:productId · /burndown
 DAG POST /fleet/jobs/:id/children · GET /fleet/jobs/:id/dag
 Artifacts POST /fleet/jobs/:id/artifacts · GET (list) · GET/DELETE /fleet/artifacts/:id
 Tracker POST /fleet/tracker/ingest · /fleet/tracker/echo
-Metrics GET /fleet/metrics
+Metrics GET /fleet/metrics · GET /fleet/queue-state (Phase-4 M0 RU gate)
``` ```
--- ---
@@ -334,9 +341,10 @@ agent-queue/
 ## 13. Next steps
 **Immediate (close Phase 1–3 to a clean 100%):**
-1. **Fix the stale roadmap §0 table** and tick Phase-2 boxes 384 (scheduler wired —
-`selectJob` is used in `claimNextJob`) and 386 (enrollment + scoped tokens —
-`enrollment.ts` + `enforceFactoryToken` are wired). (See §11 Gaps.)
+1. **Validate the Cosmos `_etag`/`If-Match` CAS path under true contention** and
+**live blob-backed `fleet_artifacts`** — the two items the roadmap marks as
+"remaining for a hard 100%" on Phase 2/3 (tests today use the memory provider +
+pointer-only artifacts).
 2. **Wire `e2e/fleet.spec.ts` into CI** (Playwright install + a `verify` job) so the
 Phase-3 exit criterion ("web verify incl. e2e green") is enforced, not just
 present.
@@ -344,11 +352,14 @@ agent-queue/
 3-repo parallel workload from the browser, including a budget pause + resume
 against a real platform-service, not the stub).
-**Phase 4 (scale-out) — the next major frontier:**
-4. Introduce a **broker** (NATS/Redis) for push dispatch + backpressure; coordinator
-publishes, factories subscribe by capability (fallback to poll on outage).
-5. **Autoscaling hooks** — spin ephemeral factories (cloud VM/container) keyed to
-queue depth + SLA.
+**Phase 4 (scale-out) — in progress; see [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md):**
+- ✅ **M0 (done)** — RU gate: `fleet_queue_state` + `GET /fleet/queue-state` +
+`AQ_FLEET_GATE`; factories skip the claim while the queue version is unchanged.
+4. **M1+: broker** (the redesign picks **Azure Service Bus**, not NATS/Redis, for
+subscription filters + DLQ) for push dispatch + backpressure in a
+coordinator-owns-scheduling / broker-owns-delivery hybrid (keeps the scorer).
+5. **M3: autoscaling** — scale-to-zero ephemeral factories (KEDA/Container Apps)
+keyed to subscription depth.
 6. **Capability marketplace** — route rare-capability jobs (xcode/figma/gpu) to the
 few factories that have them; cross-product queueing fairness.
 7. **Load + chaos suite** — factory churn, broker outage, thundering herd.
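
To make the capability-routing item concrete, here is a minimal sketch that treats eligibility as plain set containment: a factory can take a job only if it advertises every token the job requires. That rule is an assumption, since the source does not spell out the matching semantics. The `Factory` shape, the helper names, and the specific tokens (`has:xcode`, `engine:claude`, and so on) are hypothetical, loosely following the `os:*`/`engine:*`/`has:*` families that the agent is said to emit.

```typescript
// Hypothetical factory descriptor: an id plus the capability tokens it advertises.
interface Factory {
  id: string;
  capabilities: string[];
}

// A job is runnable on a factory when every required token is advertised.
function canRun(required: string[], advertised: string[]): boolean {
  const have = new Set(advertised);
  return required.every((cap) => have.has(cap));
}

// Ids of the factories that could take a job with the given requirements.
function eligibleFactories(required: string[], fleet: Factory[]): string[] {
  return fleet.filter((f) => canRun(required, f.capabilities)).map((f) => f.id);
}

// Illustrative fleet; the ids and tokens are made up for the example.
const exampleFleet: Factory[] = [
  { id: "mac-1", capabilities: ["os:darwin", "engine:claude", "has:xcode"] },
  { id: "linux-1", capabilities: ["os:linux", "engine:codex", "has:docker"] },
];
```

Under this rule a job requiring `has:xcode` is eligible only on `mac-1`, while a job requiring a token that no factory advertises (such as a bare `build`) has no eligible factory and would wait in the queue indefinitely.
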
@@ -362,15 +373,13 @@ agent-queue/
 ## 14. Bugs, gaps & risks (be honest)
-**Documentation drift (highest-signal, easy to fix):**
-- `GIGAFACTORY_ROADMAP.md` **§0 progress table is wrong** — shows Phase 3 "0% not
-started" while all Phase-3 boxes are ticked, and Phase 1/2 percentages (95%/80%)
-understate reality.
-- **Phase-2 boxes 384 & 386 are unticked but done in code.** `coordinator.ts`
-imports/uses `selectJob` + `selectPreemptionVictim` in `claimNextJob`; `routes.ts`
-enforces `enrollment.enforceFactoryToken` on claim/heartbeat and exposes
-enroll/rotate/revoke. The roadmap's "remaining for 100%" note on line 390 is
-outdated.
+**Documentation status (reconciled 2026-05-31):**
+- `GIGAFACTORY_ROADMAP.md` §0 now reads Phase 0 ✅100% · 1 ✅~98% · 2 ✅~98% ·
+3 ✅100% · **4 ◐ in progress (~10%, M0 shipped)** · 5 ☐. Phase-2 boxes for the
+scheduler core and factory enrollment/scoped tokens are ticked (`scheduler.ts`
+`selectJob`/`selectPreemptionVictim` wired into `claimNextJob`; `enrollment.ts`
+`enforceFactoryToken` gating claim/heartbeat). The earlier "stale §0 table"
+warning no longer applies.
 **Runtime / correctness gaps:**
 - **SSE is poll-fallback based, not a push-only contract.** `subscribeJobEvents`
@@ -397,14 +406,38 @@ agent-queue/
 - TUI fleet mode has **no write path for budgets/preemption** — it's read + job
 actions only; budget pause/resume is web-only.
-**Operational / not-yet-built (expected, Phase 4+):**
-- **No message bus** — dispatch is poll-based; no push/backpressure yet.
-- **No autoscaling** — factory fleet is static/manually run.
+**Operational gotchas (verified on the live fleet — get these right):**
+- **Heartbeat cadence MUST be < the 90s stale threshold.** `fleet_metrics` marks a
+factory stale after `DEFAULT_STALE_FACTORY_MS = 90_000`, but the factory only
+heartbeats every `AQ_FLEET_LEASE_RENEW_SEC` (**default 300s**). Left at the
+default, a healthy factory flaps to "stale"/"no live factory" between beats. The
+fleet launcher sets `AQ_FLEET_LEASE_RENEW_SEC=30` to stay well inside the window.
+- **The tracker-web New-Job form is misconfigured:** it hardcodes factories
+`mac-1`/`mac-2` and defaults `capabilities=["build"]` — a token **no agent-queue
+factory advertises** (`detect_capabilities` emits `os:*`/`engine:*`/`node:*`/`has:*`).
+So a default UI submission is unroutable (queues forever → `queue_starvation`).
+Fix tracked in the redesign doc's routing-model section.
+- **No factory deregister API.** Only heartbeat/enroll/rotate/revoke exist, so a
+dead factory's doc lingers and shows as `stale` until pruned out-of-band
+(currently a manual Cosmos delete). A prune/deregister path is a Phase-4 item.
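
The heartbeat gotcha above comes down to one inequality: the renew interval in milliseconds must be strictly below the stale threshold. A small startup check along these lines would catch the 300s default; `checkHeartbeatCadence` and `CadenceCheck` are hypothetical names, and the missed-beat figure assumes a factory is reported stale once the gap since its last heartbeat reaches the threshold.

```typescript
const DEFAULT_STALE_FACTORY_MS = 90_000; // stale threshold quoted in the text

interface CadenceCheck {
  ok: boolean;                  // true when the interval fits inside the window
  intervalMs: number;
  missedBeatsTolerated: number; // consecutive lost beats before going stale
}

// Hypothetical helper: validate a renew interval (in seconds) against the
// stale threshold (in milliseconds).
function checkHeartbeatCadence(
  renewSec: number,
  staleMs: number = DEFAULT_STALE_FACTORY_MS,
): CadenceCheck {
  if (!(renewSec > 0)) throw new RangeError("renewSec must be positive");
  const intervalMs = renewSec * 1000;
  const ok = intervalMs < staleMs;
  // With k consecutive missed beats, the gap between successful beats is
  // (k + 1) * intervalMs, which has to stay below staleMs.
  const missedBeatsTolerated = Math.max(0, Math.ceil(staleMs / intervalMs) - 2);
  return { ok, intervalMs, missedBeatsTolerated };
}
```

With these numbers, `checkHeartbeatCadence(300).ok` is `false` (the default interval overshoots the window), while `checkHeartbeatCadence(30)` passes and leaves room for one lost heartbeat.
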
+**Not-yet-built (expected, Phase 4+):**
+- **No message bus yet** — dispatch is still poll-based, but the **M0 RU gate now
+skips the claim while idle** (so idle Cosmos RU is near-flat). Broker push/
+backpressure is M1+.
+- **No autoscaling** — factory fleet is static/manually run (M3 target).
 - **No capability marketplace / cross-product fairness** under contention.
 - **No load/chaos test suite** — resilience is unit-proven, not load-proven.
 - **Artifacts blob wiring** (`fleet_artifacts` → real blob storage) should be
 validated against a live storage account (tests use memory/pointer only).
+**Recently fixed (2026-05-31):**
+- **`run --once` could return before a backgrounded worker finished the PR/report.**
+`_meta_end` (which writes `ended=`) was called right after the `testing/` move,
+*before* PR open/merge + coordinator reports, so the slot freed early and `--once`
+could exit (and a caller could observe completion) mid-PR. Now `ended=` is written
+last; the selftest PR-mode case is deterministic again.
 ---
 ## 15. TL;DR