From 2993994273a8b86d7ebde53b128efd7ba261ede8 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Mon, 1 Jun 2026 00:02:45 -0700
Subject: [PATCH] docs(gigafactory): reconcile overview + roadmap to current reality
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- System overview: mark Phase 4 in-progress (M0 RU gate shipped), add
  fleet_queue_state container + GET /fleet/queue-state, document the heartbeat
  cadence vs 90s stale gotcha, the tracker-web caps=build form bug, the missing
  deregister API, and the ended=-race fix; drop the now-false "roadmap §0 stale"
  and "boxes 384/386 unticked" claims (both reconciled); link the redesign doc.
- Roadmap: §0 Phase 4 -> in progress (M0); align the Phase-2 §8 spec endpoint
  sketches to the as-built API (/fleet/factories/enroll, /factories/heartbeat,
  /fleet/claim) + note the heartbeat cadence and the M0 gate.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
---
 .../docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md   | 12 ++-
 .../GIGAFACTORY_SYSTEM_OVERVIEW.md            | 85 +++++++++++++------
 2 files changed, 67 insertions(+), 30 deletions(-)

diff --git a/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md b/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md
index 42ee181..a44dbdd 100644
--- a/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md
+++ b/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md
@@ -14,7 +14,7 @@
 | **1** | Manifest + profiles + capabilities + tracker adapter (single host) | ✅ done | ~98% | adapter e2e + selftest |
 | **2** | Coordinator as platform-service module + Cosmos + multi-factory leasing | ✅ done | ~98% | fleet e2e + module tests |
 | **3** | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ✅ done | 100% | web e2e + router tests |
-| **4** | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite |
+| **4** | Message bus + autoscaling + cross-OS capability marketplace | ◐ in progress | ~10% | load/chaos suite — **M0 RU gate shipped** (`fleet_queue_state` + `GET /fleet/queue-state` + `AQ_FLEET_GATE`); broker/M1+ per `FLEET_DISPATCH_REDESIGN.md` |
 | **5** | Self-optimizing / learned routing | ☐ not started | 0% | offline eval + A/B |
 
 Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary. **Owners per phase: §23 · rollout/rollback: §21 · capacity & SLOs: §22/§19.** For the full current-state architecture, diagrams, code map, next steps and known gaps see **`GIGAFACTORY_SYSTEM_OVERVIEW.md`** (companion doc).
@@ -217,9 +217,13 @@ Each machine runs a **factory agent** (the evolved `agent-queue` runner) that re
 
 - [ ] **Capability auto-detection** at boot: OS, installed engines (devin/claude/codex/copilot), tool probes (xcode, figma-cli, docker, gpu), node/pnpm versions, available creds (presence only, never values).
 - [ ] **Enrollment / bootstrap trust**: first registration authenticates with a one-time enrollment secret (or an operator-issued platform JWT). The factory then receives a **scoped, rotatable factory token** (`jose` JWT); decommission = revoke. No standing shared secret in the queue.
-- [ ] **Registration**: `POST /fleet/factories` with descriptor → receives a factory id + token.
-- [ ] **Heartbeat**: periodic `PUT /fleet/factories/:id/heartbeat` (load, free stations, health). A **coordinator lease reaper** (not Cosmos TTL) sweeps `expiresAt < now` and reclaims, **bumping `leaseEpoch`** so the dead/zombie worker is fenced; a factory missing N heartbeats is marked `offline` and all its leases reclaimed.
-- [ ] **Claim loop**: `POST /fleet/leases/claim` advertising capabilities/free stations; atomic (exactly one winner, §4); receives a job + lease TTL + `leaseEpoch`. Use **claim backoff / long-poll** to bound Cosmos RU under many idle factories (see §22); broker push replaces polling in P4.
+- [ ] **Registration**: `POST /fleet/factories/enroll` with descriptor → receives a factory id + one-time token (built as: registration == first heartbeat; enroll mints the scoped token).
+- [ ] **Heartbeat**: periodic `POST /fleet/factories/heartbeat` (load, free stations, health). A **coordinator lease reaper** (not Cosmos TTL) sweeps `expiresAt < now` and reclaims, **bumping `leaseEpoch`** so the dead/zombie worker is fenced; a factory missing N heartbeats is marked `offline` and all its leases reclaimed. **Cadence must be < the 90s stale threshold** (`AQ_FLEET_LEASE_RENEW_SEC`; fleet launcher uses 30s).
+- [ ] **Claim loop**: `POST /fleet/claim` advertising capabilities/free stations; atomic (exactly one winner, §4); receives a job + lease TTL + `leaseEpoch`. Use **claim backoff / long-poll** to bound Cosmos RU under many idle factories (see §22); **Phase-4 M0 adds the `AQ_FLEET_GATE` skip** (`GET /fleet/queue-state`), and broker push replaces polling in M1+.
+
+> The endpoint paths above are the **as-built** API (`/fleet/factories/enroll`,
+> `/fleet/factories/heartbeat`, `/fleet/claim`) — see `GIGAFACTORY_SYSTEM_OVERVIEW.md`
+> §9 and the fleet module README for the authoritative list.
 - [ ] **Report**: stream stage/log/event back (`POST /fleet/runs/:id/events`), **echoing `leaseEpoch`** (stale epoch → 409, worker self-aborts); renew lease while alive.
 - [ ] **Environment prep**: before `verify`, the factory ensures deps are installed (cold checkout → `pnpm install`); prep time counts against `budget.wall`.
 - [ ] **Graceful drain**: factory can stop claiming, finish in-flight, deregister.
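
Illustration of the M0 gate named in the Claim loop item above: a factory compares the queue `version` it last saw with the one returned by `GET /fleet/queue-state` and only pays for `POST /fleet/claim` when the two differ. This is a minimal sketch, not the as-built code. `GateState`, `GateDecision`, and `decideClaim` are hypothetical names, and the assumption that the response exposes a numeric `version` comes only from the description of `fleet_queue_state` in this patch.

```typescript
// Sketch only: names and shapes here are assumptions, not the as-built API.
interface GateState {
  lastSeenVersion: number | null; // null until the first queue-state read
}

interface GateDecision {
  shouldClaim: boolean; // true => go on to the (expensive) claim call
  next: GateState;      // state to carry into the next poll tick
}

// Pure decision step: the caller performs the cheap point read of the queue
// version, passes it in, and issues the claim only when shouldClaim is true.
function decideClaim(
  state: GateState,
  currentVersion: number,
  gateEnabled: boolean, // stands in for AQ_FLEET_GATE=1 (default OFF)
): GateDecision {
  if (!gateEnabled) {
    // Gate off: behave like the plain poll loop and always attempt a claim.
    return { shouldClaim: true, next: state };
  }
  const unchanged =
    state.lastSeenVersion !== null && state.lastSeenVersion === currentVersion;
  return {
    shouldClaim: !unchanged,
    next: { lastSeenVersion: currentVersion },
  };
}
```

One consequence of this shape is that the first tick always claims, because no version has been seen yet. A real factory would probably also want to bypass the gate when its own free-station count changes, since that is not reflected in the queue version; that case is left out here.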
diff --git a/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_SYSTEM_OVERVIEW.md b/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_SYSTEM_OVERVIEW.md
index 81e3edf..1052882 100644
--- a/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_SYSTEM_OVERVIEW.md
+++ b/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_SYSTEM_OVERVIEW.md
@@ -3,7 +3,11 @@
 > Companion to `GIGAFACTORY_ROADMAP.md` (the source-of-truth spec & checklists).
 > This document describes **what is actually built today**, how the pieces fit
 > together, the architecture diagrams, the code map across both repos, the next
-> steps, and the known bugs/gaps. Last reviewed: **2026-05-30**.
+> steps, and the known bugs/gaps. Last reviewed: **2026-05-31**.
+>
+> The **Phase-4 plan + the as-built M0 RU gate** live in
+> [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md) — read it for the
+> broker-backed dispatch design and the migration checklist.
 
 ---
 
@@ -32,12 +36,14 @@ API.
 | **1** | Manifest + profiles + capabilities + tracker adapter | ✅ ~98% | Only leftover: Node `dash` field surfacing — **now also done** via fleet-dash tags. Effectively complete |
 | **2** | Coordinator module + Cosmos + multi-factory leasing | ✅ ~98% | Scheduler wiring, enrollment+tokens, tracker-bridge are **done in code** but boxes 384/386 unticked in roadmap (see §11 Gaps) |
 | **3** | Fleet control plane (web + TUI) + DAG + budgets + scoring | ✅ 100% (all boxes ticked) | Pending: Playwright e2e wired into CI; live multi-host operator run |
-| **4** | Message bus + autoscaling + capability marketplace | ☐ 0% | Not started — next major frontier |
+| **4** | Message bus + autoscaling + capability marketplace | 🟡 in progress | **M0 (RU gate) shipped** — see below. Broker (M1+) not started. Plan: [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md) |
 | **5** | Self-optimizing / learned routing | ☐ 0% | Not started |
 
-> ⚠️ The **`GIGAFACTORY_ROADMAP.md` §0 progress table is stale** — it shows
-> Phase 3 as "0% not started" although every Phase-3 box below it is ticked. See
-> §11 (Bugs & Gaps) — this should be corrected.
+> **Phase-4 M0 (RU gate) is live (2026-05-31):** a per-product `fleet_queue_state`
+> doc holds a monotonic `version` (bumped on job create + every stage change);
+> factories with `AQ_FLEET_GATE=1` point-read `GET /fleet/queue-state` (~1 RU) and
+> skip the expensive claim while nothing changed — cutting idle Cosmos RU without
+> raising the local poll interval. Default OFF; the live fleet runs it on.
 
 ---
 
@@ -181,6 +187,7 @@ sequenceDiagram
 | `fleet_profiles` | `/productId` | immutable, versioned persona/capability profile snapshot |
 | `fleet_events` | `/jobId` | append-only audit stream (monotonic `seq`) — powers SSE |
 | `fleet_artifacts` | `/jobId` | **pointers** to blob-stored artifacts (no inline logs) |
+| `fleet_queue_state` | `/productId` | **Phase-4 M0 RU gate**: monotonic `version` bumped on job create + every stage change; read via `GET /fleet/queue-state` so a factory can cheaply detect "work changed" |
 
 Every document carries `productId`. Containers registered in `lib/cosmos-init.ts`.
 
@@ -239,7 +246,7 @@ Budgets GET /fleet/budgets/:productId · /burndown
 DAG POST /fleet/jobs/:id/children · GET /fleet/jobs/:id/dag
 Artifacts POST /fleet/jobs/:id/artifacts · GET (list) · GET/DELETE /fleet/artifacts/:id
 Tracker POST /fleet/tracker/ingest · /fleet/tracker/echo
-Metrics GET /fleet/metrics
+Metrics GET /fleet/metrics · GET /fleet/queue-state (Phase-4 M0 RU gate)
 ```
 
 ---
@@ -334,9 +341,10 @@ agent-queue/
 ## 13. Next steps
 
 **Immediate (close Phase 1–3 to a clean 100%):**
-1. **Fix the stale roadmap §0 table** and tick Phase-2 boxes 384 (scheduler wired —
-   `selectJob` is used in `claimNextJob`) and 386 (enrollment + scoped tokens —
-   `enrollment.ts` + `enforceFactoryToken` are wired). (See §11 Gaps.)
+1. **Validate the Cosmos `_etag`/`If-Match` CAS path under true contention** and
+   **live blob-backed `fleet_artifacts`** — the two items the roadmap marks as
+   "remaining for a hard 100%" on Phase 2/3 (tests today use the memory provider +
+   pointer-only artifacts).
 2. **Wire `e2e/fleet.spec.ts` into CI** (Playwright install + a `verify` job) so
    the Phase-3 exit criterion ("web verify incl. e2e green") is enforced, not just
    present.
@@ -344,11 +352,14 @@ agent-queue/
    3-repo parallel workload from the browser, including a budget pause + resume
    against a real platform-service, not the stub).
 
-**Phase 4 (scale-out) — the next major frontier:**
-4. Introduce a **broker** (NATS/Redis) for push dispatch + backpressure; coordinator
-   publishes, factories subscribe by capability (fallback to poll on outage).
-5. **Autoscaling hooks** — spin ephemeral factories (cloud VM/container) keyed to
-   queue depth + SLA.
+**Phase 4 (scale-out) — in progress; see [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md):**
+- ✅ **M0 (done)** — RU gate: `fleet_queue_state` + `GET /fleet/queue-state` +
+  `AQ_FLEET_GATE`; factories skip the claim while the queue version is unchanged.
+4. **M1+: broker** (the redesign picks **Azure Service Bus**, not NATS/Redis, for
+   subscription filters + DLQ) for push dispatch + backpressure in a
+   coordinator-owns-scheduling / broker-owns-delivery hybrid (keeps the scorer).
+5. **M3: autoscaling** — scale-to-zero ephemeral factories (KEDA/Container Apps)
+   keyed to subscription depth.
 6. **Capability marketplace** — route rare-capability jobs (xcode/figma/gpu) to the
    few factories that have them; cross-product queueing fairness.
 7. **Load + chaos suite** — factory churn, broker outage, thundering herd.
@@ -362,15 +373,13 @@ agent-queue/
 
 ## 14. Bugs, gaps & risks (be honest)
 
-**Documentation drift (highest-signal, easy to fix):**
-- `GIGAFACTORY_ROADMAP.md` **§0 progress table is wrong** — shows Phase 3 "0% not
-  started" while all Phase-3 boxes are ticked, and Phase 1/2 percentages (95%/80%)
-  understate reality.
-- **Phase-2 boxes 384 & 386 are unticked but done in code.** `coordinator.ts`
-  imports/uses `selectJob` + `selectPreemptionVictim` in `claimNextJob`; `routes.ts`
-  enforces `enrollment.enforceFactoryToken` on claim/heartbeat and exposes
-  enroll/rotate/revoke. The roadmap's "remaining for 100%" note on line 390 is
-  outdated.
+**Documentation status (reconciled 2026-05-31):**
+- `GIGAFACTORY_ROADMAP.md` §0 now reads Phase 0 ✅100% · 1 ✅~98% · 2 ✅~98% ·
+  3 ✅100% · **4 ◐ in progress (~10%, M0 shipped)** · 5 ☐. Phase-2 boxes for the
+  scheduler core and factory enrollment/scoped tokens are ticked (`scheduler.ts`
+  `selectJob`/`selectPreemptionVictim` wired into `claimNextJob`; `enrollment.ts`
+  `enforceFactoryToken` gating claim/heartbeat). The earlier "stale §0 table"
+  warning no longer applies.
 
 **Runtime / correctness gaps:**
 - **SSE is poll-fallback based, not a push-only contract.** `subscribeJobEvents`
@@ -397,14 +406,38 @@ agent-queue/
 - TUI fleet mode has **no write path for budgets/preemption** — it's read + job
   actions only; budget pause/resume is web-only.
 
-**Operational / not-yet-built (expected, Phase 4+):**
-- **No message bus** — dispatch is poll-based; no push/backpressure yet.
-- **No autoscaling** — factory fleet is static/manually run.
+**Operational gotchas (verified on the live fleet — get these right):**
+- **Heartbeat cadence MUST be < the 90s stale threshold.** `fleet_metrics` marks a
+  factory stale after `DEFAULT_STALE_FACTORY_MS = 90_000`, but the factory only
+  heartbeats every `AQ_FLEET_LEASE_RENEW_SEC` (**default 300s**). Left at the
+  default, a healthy factory flaps to "stale"/"no live factory" between beats. The
+  fleet launcher sets `AQ_FLEET_LEASE_RENEW_SEC=30` to stay well inside the window.
+- **The tracker-web New-Job form is misconfigured:** it hardcodes factories
+  `mac-1`/`mac-2` and defaults `capabilities=["build"]` — a token **no agent-queue
+  factory advertises** (`detect_capabilities` emits `os:*`/`engine:*`/`node:*`/`has:*`).
+  So a default UI submission is unroutable (queues forever → `queue_starvation`).
+  Fix tracked in the redesign doc's routing-model section.
+- **No factory deregister API.** Only heartbeat/enroll/rotate/revoke exist, so a
+  dead factory's doc lingers and shows as `stale` until pruned out-of-band
+  (currently a manual Cosmos delete). A prune/deregister path is a Phase-4 item.
+
+**Not-yet-built (expected, Phase 4+):**
+- **No message bus yet** — dispatch is still poll-based, but the **M0 RU gate now
+  skips the claim while idle** (so idle Cosmos RU is near-flat). Broker push/
+  backpressure is M1+.
+- **No autoscaling** — factory fleet is static/manually run (M3 target).
 - **No capability marketplace / cross-product fairness** under contention.
 - **No load/chaos test suite** — resilience is unit-proven, not load-proven.
 - **Artifacts blob wiring** (`fleet_artifacts` → real blob storage) should be
   validated against a live storage account (tests use memory/pointer only).
 
+**Recently fixed (2026-05-31):**
+- **`run --once` could return before a backgrounded worker finished the PR/report.**
+  `_meta_end` (which writes `ended=`) was called right after the `testing/` move,
+  *before* PR open/merge + coordinator reports, so the slot freed early and `--once`
+  could exit (and a caller could observe completion) mid-PR. Now `ended=` is written
+  last; the selftest PR-mode case is deterministic again.
+
 ---
 
 ## 15. TL;DR
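
Illustration of the two operational gotchas documented in the last hunk (heartbeat cadence vs. the 90s stale window, and the unroutable `capabilities=["build"]` default). Both reduce to small preflight checks. This is a hedged sketch: `heartbeatWouldFlap` and `isRoutable` are hypothetical helpers, the exact-match capability rule is an assumption about how routing works, and the sample capability tokens are invented values that merely follow the `os:*`/`engine:*`/`node:*`/`has:*` pattern named in the patch.

```typescript
// Value quoted in the patch; everything else in this block is illustrative.
const DEFAULT_STALE_FACTORY_MS = 90_000;

// True when a factory heartbeating every `renewSec` seconds would be marked
// stale between beats, i.e. the cadence is not strictly inside the window.
function heartbeatWouldFlap(
  renewSec: number,
  staleMs: number = DEFAULT_STALE_FACTORY_MS,
): boolean {
  return renewSec * 1000 >= staleMs;
}

// Assumed routing rule: a job is routable if at least one factory advertises
// every capability token the job requires (exact string match).
function isRoutable(required: string[], fleet: string[][]): boolean {
  return fleet.some((caps) => required.every((token) => caps.includes(token)));
}

// Hypothetical advertised capabilities for two factories.
const fleetCaps: string[][] = [
  ["os:darwin", "engine:devin", "node:20", "has:docker"],
  ["os:linux", "engine:claude", "node:20"],
];

const defaultFlaps = heartbeatWouldFlap(300);             // 300s default cadence
const launcherFlaps = heartbeatWouldFlap(30);             // launcher's 30s cadence
const formDefault = isRoutable(["build"], fleetCaps);     // tracker-web default
const osTargeted = isRoutable(["os:darwin"], fleetCaps);  // token a factory emits
```

Under these assumptions the 300s default flaps and the 30s launcher setting does not, while a job requiring `build` matches no factory and one requiring `os:darwin` does. A check like this could run at factory start-up or on job submission to reject the misconfiguration early instead of letting the job sit in the queue.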