From 71e5ad69233db158dd3591637e9413cb4c18ed7d Mon Sep 17 00:00:00 2001 From: Saravanakumar D Date: Sat, 30 May 2026 20:11:02 -0700 Subject: [PATCH] docs(gigafactory): add system overview with architecture diagrams; sync roadmap status MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add GIGAFACTORY_SYSTEM_OVERVIEW.md — a current-state companion to the roadmap spec covering: what the Agent Gigafactory is, a completion snapshot, three Mermaid diagrams (component architecture, job-lifecycle state machine, atomic claim + lease-fencing sequence), the Cosmos data model, the scoring router, subsystem map, full /fleet REST surface, feature flags, the two control planes, a cross-repo code map, test coverage, next steps (Phase 4/5), and an honest bugs/gaps/risks section. All three Mermaid blocks validated with mermaid.parse. Also correct documentation drift in GIGAFACTORY_ROADMAP.md found during the review: - §0 progress table showed Phase 3 as "0% not started" while every Phase-3 box is ticked; updated phases 1-3 to done with realistic percentages. - Phase-2 boxes "scheduler/router wired into assignment", "tracker adapter direct call", and "factory enrollment + scoped tokens" are implemented in common-plat (coordinator.ts uses selectJob; routes.ts enforces enrollment.enforceFactoryToken; tracker-bridge.ts) but were left unticked — ticked with evidence and refreshed the stale "remaining for 100%" notes. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- agent-queue/docs/GIGAFACTORY_ROADMAP.md | 23 +- .../docs/GIGAFACTORY_SYSTEM_OVERVIEW.md | 418 ++++++++++++++++++ 2 files changed, 430 insertions(+), 11 deletions(-) create mode 100644 agent-queue/docs/GIGAFACTORY_SYSTEM_OVERVIEW.md diff --git a/agent-queue/docs/GIGAFACTORY_ROADMAP.md b/agent-queue/docs/GIGAFACTORY_ROADMAP.md index 5c343ca..42ee181 100644 --- a/agent-queue/docs/GIGAFACTORY_ROADMAP.md +++ b/agent-queue/docs/GIGAFACTORY_ROADMAP.md @@ -11,13 +11,13 @@ | Phase | Theme | Status | % | Gate | | ----- | ----- | ------ | - | ---- | | **0** | Baseline (today) | ✅ shipped | 100% | `selftest.sh` green | -| **1** | Manifest + profiles + capabilities + tracker adapter (single host) | ◐ in progress | 95% | adapter e2e + selftest | -| **2** | Coordinator as platform-service module + Cosmos + multi-factory leasing | ◐ in progress | 80% | fleet e2e + module tests | -| **3** | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ☐ not started | 0% | web e2e + router tests | +| **1** | Manifest + profiles + capabilities + tracker adapter (single host) | ✅ done | ~98% | adapter e2e + selftest | +| **2** | Coordinator as platform-service module + Cosmos + multi-factory leasing | ✅ done | ~98% | fleet e2e + module tests | +| **3** | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ✅ done | 100% | web e2e + router tests | | **4** | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite | | **5** | Self-optimizing / learned routing | ☐ not started | 0% | offline eval + A/B | -Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary. **Owners per phase: §23 · rollout/rollback: §21 · capacity & SLOs: §22/§19.** +Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary. 
**Owners per phase: §23 · rollout/rollback: §21 · capacity & SLOs: §22/§19.** For the full current-state architecture, diagrams, code map, next steps and known gaps see **`GIGAFACTORY_SYSTEM_OVERVIEW.md`** (companion doc). --- @@ -373,21 +373,22 @@ Each phase: **Goal → checklist → Exit criteria**. Don't start a phase until > reports **fenced** stage transitions with WIP checkpoints, renews/releases leases, and on > a stale `leaseEpoch` (reclaimed) **self-aborts + quarantines** the local result. Coordinator > 5xx/connection errors **degrade** (finish locally) rather than abandon work. When `AQ_FLEET` -> is off the offline git-queue path is byte-for-byte unchanged. Remaining P2: scheduler/router -> core, direct tracker→module calls, factory enrollment + scoped tokens, `fleet.*` feature -> flags + shadow/dual-run, and the two-factory parallel demo (the Phase-2 exit criteria). +> is off the offline git-queue path is byte-for-byte unchanged. The remaining P2 items — +> scheduler/router core, direct tracker→module calls, factory enrollment + scoped tokens, +> `fleet.*` feature flags + shadow/dual-run, and the two-factory parallel demo — are now all +> landed in common-plat (`scheduler.ts`, `tracker-bridge.ts`, `enrollment.ts`). - [x] Scaffold `fleet`/`orchestrator` module in `platform-service` (`types/repository/routes`, Zod, ESM, `productId`). *(PR #28)* - [x] Cosmos containers (§13) + repository layer (memory + Cosmos providers). *(PR #28; `fleet_artifacts` blob wiring still pending.)* - [x] **Atomic claim** (optimistic concurrency / `_etag`) + **lease reaper** + **fencing (`leaseEpoch`)** endpoints (§4/§8/§9) — *not* Cosmos-TTL-driven reclaim. *(common-plat PR #28 + #29; truly atomic via `updateIfMatch`.)* - [x] Port `agent-queue` runner to a **factory agent** API client (enroll/register/heartbeat/claim/report, fencing-aware) while keeping git-queue fallback. 
*(P2-S3: `lib/fleet-client.sh` behind `AQ_FLEET`; registers via heartbeat, claims into inbox, reports fenced stage transitions, renews leases, quarantines on stale-epoch; offline git-queue unchanged when the flag is off.)* -- [ ] Scheduler/router core (§7) as a pure module (fixed weights) + wired into atomic assignment. -- [ ] Tracker adapter calls the module directly (not just file export). -- [ ] Auth: factory enrollment + scoped rotatable tokens; secret isolation enforced (§12 subset). +- [x] Scheduler/router core (§7) as a pure module (fixed weights) + wired into atomic assignment. *(common-plat `fleet/scheduler.ts` pure `selectJob`/`scoreCandidate`/`selectPreemptionVictim`; `coordinator.ts` `claimNextJob` ranks candidates via `selectJob` after the capability hard-filter.)* +- [x] Tracker adapter calls the module directly (not just file export). *(common-plat `fleet/tracker-bridge.ts` + `POST /fleet/tracker/ingest` / `/fleet/tracker/echo`: idempotent ingest of a tracker item → job and one-way status echo, in-module.)* +- [x] Auth: factory enrollment + scoped rotatable tokens; secret isolation enforced (§12 subset). *(common-plat `fleet/enrollment.ts`: `enrollFactory`/`rotateToken`/`revokeToken` issue a plaintext token once, store it hashed, scope it to `{productId, factoryId, capabilities}`; `enforceFactoryToken` gates `claim`/`heartbeat` in `routes.ts`.)* - [x] **Feature flags** (`fleet.enabled`, `fleet.route_via_service`) + **shadow/dual-run** vs P1 before cutover (§21). *(agent-queue runner: `AQ_FLEET` / `AQ_FLEET_ROUTE` / `AQ_FLEET_SHADOW` with documented precedence; shadow claim/compare/report is side-effect-free (isolated `-shadow` factoryId + dryRun, never materializes/ships); `fleet-shadow-report` summarizes AGREE/DIVERGE/COORD_EMPTY/LOCAL_EMPTY + agreement; 60→68 selftest checks.)* - [x] Module test suite (repository + routes via `@bytelyst/testing`); **atomic-claim race**, crash-recovery, fencing-rejection, reaper-reclaim tests. 
*(PR #28 + #29: 53 fleet + 48 datastore tests, incl. true-concurrency claim.)* - [x] Two-factory demo (e.g. mac + ubuntu) running 3 parallel jobs end-to-end. *(`agent-queue/demo/two-factory-demo.sh` + `coordinator-stub.sh`: two real `run` daemons (mac-1 + ubuntu-1, separate queues/cwds) compete through one coordinator; asserts (a) no double-assign, (b) kill-mid-job → reaper reclaim → survivor completes → zombie report fenced (409), (c) concurrent parallelism. Dual-mode: CI-safe stateful stub by default, live platform-service when `AQ_FLEET_API`/`AQ_FLEET_TOKEN` set. Headless checks in `selftest.sh` → 68→71 green.)* -- **Exit criteria:** all boxes ✅; `pnpm --filter @lysnrai/platform-service test` green; killing a factory mid-job → another reclaims and completes **and the dead worker's late report is fenced**; concurrent claimers never double-assign; all state in Cosmos with `productId`; **flag-off rollback verified** (§21). — _Runtime exit guarantees **demonstrated** by the two-factory demo (no double-assign + reclaim/fence + parallelism) and flag-off rollback verified (§21). **Remaining for 100%:** scheduler/router core wired into assignment (common-plat PR #31, open), tracker adapter direct call, and factory enrollment + scoped tokens._ +- **Exit criteria:** all boxes ✅; `pnpm --filter @lysnrai/platform-service test` green; killing a factory mid-job → another reclaims and completes **and the dead worker's late report is fenced**; concurrent claimers never double-assign; all state in Cosmos with `productId`; **flag-off rollback verified** (§21). — _Runtime exit guarantees **demonstrated** by the two-factory demo (no double-assign + reclaim/fence + parallelism) and flag-off rollback verified (§21). Scheduler/router core, tracker-module direct calls, and factory enrollment + scoped tokens are now all wired in (see boxes above) — Phase 2 is effectively complete. 
**Remaining for a hard 100%:** validate the Cosmos `_etag` CAS path under true production contention + live blob-backed `fleet_artifacts`._ ### Phase 3 — Fleet control plane in tracker-web + DAG + budgets + scoring router **Goal:** one browser control plane; smart routing + budgets live. diff --git a/agent-queue/docs/GIGAFACTORY_SYSTEM_OVERVIEW.md b/agent-queue/docs/GIGAFACTORY_SYSTEM_OVERVIEW.md new file mode 100644 index 0000000..32edf32 --- /dev/null +++ b/agent-queue/docs/GIGAFACTORY_SYSTEM_OVERVIEW.md @@ -0,0 +1,418 @@ +# Agent Gigafactory — System Overview (current picture) + +> Companion to `GIGAFACTORY_ROADMAP.md` (the source-of-truth spec & checklists). +> This document describes **what is actually built today**, how the pieces fit +> together, the architecture diagrams, the code map across both repos, the next +> steps, and the known bugs/gaps. Last reviewed: **2026-05-30**. + +--- + +## 1. What it is (in one paragraph) + +The **Agent Gigafactory** turns a single-host "folder queue" agent runner into a +**distributed fleet** of agent "factories" (machines: mac/ubuntu/windows) that +claim and execute coding jobs in parallel, coordinated by a durable, +product-agnostic service. A job is a markdown manifest (persona + capabilities + +budget + deps); the **coordinator** assigns each job to the best-fit factory via a +deterministic scoring router, guarantees **exactly-once assignment** through +optimistic-concurrency claims + **leases with epoch fencing**, recovers crashed +work automatically (reaper + WIP checkpoints), enforces **per-product budgets**, +supports **DAG decomposition** (composite → child jobs), and exposes the whole +fleet through **two control planes**: a browser UI (`tracker-web`) and a terminal +TUI (`agent-queue` dashboard). Both control planes talk to the same `/fleet` REST +API. + +--- + +## 2. 
Completion snapshot (reality, not the stale table) + +| Phase | Theme | Real status | Notes | +| ----- | ----- | ----------- | ----- | +| **0** | Single-host baseline | ✅ 100% | `agent-queue.sh` folder queue, selftest green | +| **1** | Manifest + profiles + capabilities + tracker adapter | ✅ ~98% | Only leftover was Node `dash` field surfacing, which is **now also done** via fleet-dash tags. Effectively complete | +| **2** | Coordinator module + Cosmos + multi-factory leasing | ✅ ~98% | Scheduler wiring, enrollment+tokens, tracker-bridge are **done in code** but boxes 384/386 unticked in roadmap (see §14 Gaps) | +| **3** | Fleet control plane (web + TUI) + DAG + budgets + scoring | ✅ 100% (all boxes ticked) | Pending: Playwright e2e wired into CI; live multi-host operator run | +| **4** | Message bus + autoscaling + capability marketplace | ☐ 0% | Not started; next major frontier | +| **5** | Self-optimizing / learned routing | ☐ 0% | Not started | + +> ⚠️ The **`GIGAFACTORY_ROADMAP.md` §0 progress table is stale**: it shows +> Phase 3 as "0% not started" although every Phase-3 box below it is ticked. See +> §14 (Bugs, gaps & risks); this should be corrected. + +--- + +## 3. System architecture + +```mermaid +graph TB + subgraph CP["Control planes (operators)"] + WEB["tracker-web Fleet UI
(Next.js, /dashboard/fleet/*)"] + TUI["agent-queue TUI
(dashboard.mjs, AQ_FLEET_DASH=1)"] + end + + subgraph SVC["platform-service — fleet module (the spine)"] + ROUTES["routes.ts
/fleet REST + SSE"] + COORD["coordinator.ts
claim · lease · fence · reaper
preemption · budgets · DAG · review"] + SCHED["scheduler.ts
pure scoring router (§7)"] + ENROLL["enrollment.ts
factory tokens (scoped, rotatable)"] + BRIDGE["tracker-bridge.ts
job ↔ tracker item"] + ARTIF["artifacts.ts / artifacts-blob.ts
pointer + blob bytes"] + REPO["repository.ts
CAS (rev/_etag) CRUD"] + end + + subgraph DATA["@bytelyst/datastore (Cosmos / memory)"] + JOBS[("fleet_jobs")] + RUNS[("fleet_runs")] + LEASES[("fleet_leases")] + FAC[("fleet_factories")] + PROF[("fleet_profiles")] + EVENTS[("fleet_events")] + ARTDOCS[("fleet_artifacts")] + end + + subgraph FLEET["Factory agents (workers, N hosts)"] + F1["agent-queue.sh + lib/fleet-client.sh
(AQ_FLEET=1) — mac-1"] + F2["agent-queue.sh + lib/fleet-client.sh
ubuntu-1"] + ENGINES["engines: claude · codex · devin"] + end + + WEB -->|/api/fleet proxy| ROUTES + TUI -->|lib/fleet-dash.mjs| ROUTES + ROUTES --> COORD + COORD --> SCHED + ROUTES --> ENROLL + ROUTES --> BRIDGE + ROUTES --> ARTIF + COORD --> REPO + ENROLL --> REPO + BRIDGE --> REPO + ARTIF --> ARTDOCS + REPO --> JOBS & RUNS & LEASES & FAC & PROF & EVENTS + + F1 -->|heartbeat · claim · patch fenced · renew| ROUTES + F2 -->|heartbeat · claim · patch fenced · renew| ROUTES + F1 --> ENGINES + F2 --> ENGINES +``` + +**Layering principle:** `scheduler.ts` is **pure** (no I/O — all inputs passed +in), `coordinator.ts` is the orchestration core, `repository.ts` is the only thing +that touches the datastore, and `routes.ts` is the only thing that touches HTTP. +Factories never touch the DB directly — they only call REST. + +--- + +## 4. Job lifecycle (stages) + +```mermaid +stateDiagram-v2 + [*] --> queued: submitJob + queued --> blocked: unmet deps + blocked --> queued: deps satisfied (reaper/unblock) + queued --> assigned: claimNextJob (CAS win + lease) + assigned --> building: factory starts (patch fenced) + building --> review: rc=0 → review gate + building --> testing: verify-pass (auto) + review --> testing: approve / requestReview quorum + testing --> shipped: ship (manual gate) + building --> failed: verify-fail / budget_exceeded / timeout + review --> failed: reject + assigned --> queued: lease expired (reaper, +epoch, keep checkpoint) + building --> queued: preempted (critical job, checkpoint + epoch bump) + failed --> queued: requeue (operator) + failed --> dead_letter: retries exhausted + shipped --> [*] + dead_letter --> [*] +``` + +Stages (`types.ts`): `queued · blocked · assigned · building · review · testing · +shipped · failed · dead_letter`. The TUI/local board collapse these onto kanban +buckets (`inbox/building/review/testing/shipped/failed`) for parity. + +--- + +## 5. 
The core guarantee — atomic claim + lease fencing + +This is the heart of "no double-assignment, ever" and "a dead worker can never +corrupt a reassigned job." + +```mermaid +sequenceDiagram + participant FA as Factory A + participant FB as Factory B + participant CO as coordinator + participant DB as fleet_jobs / fleet_leases + + FA->>CO: POST /fleet/claim (caps) + FB->>CO: POST /fleet/claim (caps) + CO->>DB: selectJob() → job J (rev=5) + CO->>DB: revUpdate J: queued→assigned IF rev==5 (CAS) + DB-->>CO: A wins (rev→6, leaseEpoch=1) + CO->>DB: revUpdate J IF rev==5 (B's CAS) + DB-->>CO: conflict (B re-selects) + CO-->>FA: assigned J (leaseEpoch=1) + CO-->>FB: conflict → next job + + Note over FA: A crashes mid-build + CO->>DB: reapExpiredLeases(): lease expired → J back to queued,
leaseEpoch=2, checkpoint preserved + FB->>CO: claim → J (leaseEpoch=2) + FA-->>CO: (zombie) PATCH J stage=shipped leaseEpoch=1 + CO-->>FA: 409 fenced (1 < 2) — rejected +``` + +- **CAS:** `repository.revUpdateJob/revUpdateLease` write only if stored `rev` + matches (Cosmos `_etag`/`If-Match`; memory provider re-reads `rev`). +- **Fencing:** every worker mutation carries `leaseEpoch`; epoch `< job.leaseEpoch` + ⇒ `fenced` (409). +- **Reaper:** `reapExpiredLeases(now)` requeues expired-lease jobs, **bumps the + epoch**, and **keeps the `checkpoint`** (WIP git branch pointer) so work resumes + rather than restarts. Cosmos TTL cannot do this — the reaper owns recovery. + +--- + +## 6. Data model (Cosmos containers) + +| Container | PK | Purpose | +| --------- | -- | ------- | +| `fleet_jobs` | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `depsMode`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, `kind`, `parentId` | +| `fleet_runs` | `/jobId` | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff) | +| `fleet_leases` | `/jobId` | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status` | +| `fleet_factories` | `/productId` | worker host: `capabilities[]`, `health`, `load`, `seatLimit`, `lastHeartbeatAt` | +| `fleet_profiles` | `/productId` | immutable, versioned persona/capability profile snapshot | +| `fleet_events` | `/jobId` | append-only audit stream (monotonic `seq`) — powers SSE | +| `fleet_artifacts` | `/jobId` | **pointers** to blob-stored artifacts (no inline logs) | + +Every document carries `productId`. Containers registered in `lib/cosmos-init.ts`. + +--- + +## 7. The scheduler / scoring router (`scheduler.ts`) + +Pure, deterministic, fixed-weight (tunable per-product in Phase 3, learned in +Phase 5). 
Filter → score → rank: + +``` +score = w1·capabilityFit + w2·affinity + w3·(1/(1+load)) + + w4·costFit(budget) + w5·health − w6·starvationPenalty(age) +``` + +Default weights (`DEFAULT_WEIGHTS`): `capabilityFit 1.0 · affinity 0.5 · load 1.0 +· costFit 0.75 · health 1.0 · starvation 1.5`. Capability is a **hard filter** +(subset check); `down` factories are filtered out, not scored; aging fully +de-penalises after ~30 min (anti-starvation). `scoreCandidate` returns a per-term +breakdown that powers the **explainability** panel (`GET /fleet/jobs/:id/explain` +→ `ExplainPanel`). `selectPreemptionVictim` picks the lowest-priority running job a +critical job may evict (under `FLEET_PREEMPTION`). + +--- + +## 8. Subsystems at a glance + +| Subsystem | File(s) | What it does | Flag | +| --------- | ------- | ------------ | ---- | +| Claim / lease / fence / reaper | `coordinator.ts` | exactly-once assignment, recovery | — | +| Scoring router + preemption | `scheduler.ts`, `coordinator.ts` | best-fit assignment, evict low-pri for critical | `FLEET_PREEMPTION` | +| Per-product budgets | `coordinator.ts` (`accrueSpend`, `pause/resume`) | ceiling + auto-pause kill-switch; burndown | `FLEET_BUDGETS` | +| DAG decomposition | `coordinator.ts` (`submitChildren`, `getDagSubtree`, `maybeUnblockParent`) | composite job fans out to children; deps gate parent | — | +| Review gate | `coordinator.ts` (`requestReview`, `submitReview`) | multi-reviewer quorum before ship | — | +| Factory enrollment | `enrollment.ts` | scoped, rotatable, hashed tokens; auth on claim/heartbeat | — | +| Tracker bridge | `tracker-bridge.ts` | idempotent ingest of tracker item → job; one-way status echo | — | +| Artifacts | `artifacts.ts`, `artifacts-blob.ts` | pointer docs in Cosmos, bytes in blob (SAS) | — | +| Live events | `routes.ts` SSE + `fleet_events` | `GET /fleet/jobs/:id/events/stream` | — | +| Metrics / alerts | `coordinator.ts` (`fleetMetrics`) | utilization, health rollup, starvation alerts | — | 
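
To make the §7 formula concrete, here is a minimal TypeScript sketch of the filter-then-score step. It is only an illustration under stated assumptions: the weight values are copied from the `DEFAULT_WEIGHTS` list above, but the `Weights` and `Terms` shapes and the helpers `hasCapabilities` and `scoreTerms` are hypothetical names, not the real `scheduler.ts` signatures. Each term is assumed to arrive already normalised, and the starvation penalty is passed in rather than derived from job age.

```typescript
// Hypothetical shapes: the real scheduler.ts types are not shown in this document.
interface Weights {
  capabilityFit: number;
  affinity: number;
  load: number;
  costFit: number;
  health: number;
  starvation: number;
}

// Values taken from the DEFAULT_WEIGHTS list in section 7.
const DEFAULT_WEIGHTS: Weights = {
  capabilityFit: 1.0,
  affinity: 0.5,
  load: 1.0,
  costFit: 0.75,
  health: 1.0,
  starvation: 1.5,
};

// Pre-computed inputs for one (job, factory) pair (assumed already normalised).
interface Terms {
  capabilityFit: number;
  affinity: number;
  load: number; // current factory load, >= 0
  costFit: number;
  health: number;
  starvationPenalty: number; // assumed to shrink as the job ages
}

// Hard filter: every required capability must be offered (subset check).
function hasCapabilities(required: string[], offered: string[]): boolean {
  return required.every((cap) => offered.indexOf(cap) !== -1);
}

// Weighted sum from section 7, returned with a per-term breakdown
// (the kind of data an explainability panel could display).
function scoreTerms(t: Terms, w: Weights = DEFAULT_WEIGHTS) {
  const breakdown = {
    capabilityFit: w.capabilityFit * t.capabilityFit,
    affinity: w.affinity * t.affinity,
    load: w.load * (1 / (1 + t.load)),
    costFit: w.costFit * t.costFit,
    health: w.health * t.health,
    starvation: -w.starvation * t.starvationPenalty,
  };
  const total =
    breakdown.capabilityFit +
    breakdown.affinity +
    breakdown.load +
    breakdown.costFit +
    breakdown.health +
    breakdown.starvation;
  return { total, breakdown };
}
```

In this sketch a caller would drop factories that fail `hasCapabilities` (and any marked `down`) before calling `scoreTerms`, then rank the survivors by `total`. Keeping the function free of I/O mirrors the "pure scheduler" layering described in §3.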
+ +--- + +## 9. REST API surface (`/fleet`, under `/api`, auth + `x-product-id`) + +``` +Jobs POST /fleet/jobs · GET /fleet/jobs · GET /fleet/jobs/:id + PATCH /fleet/jobs/:id (fenced) · POST /fleet/jobs/:id/actions/:action +Claim POST /fleet/claim +Lease POST /fleet/jobs/:id/lease/renew · /lease/release +Factories POST /fleet/factories/heartbeat · /enroll + POST /fleet/factories/:id/token/rotate · /token/revoke +Runs/Events GET /fleet/jobs/:id/runs · /events · /events/stream (SSE) · /explain +Review POST /fleet/jobs/:id/review/request · /review +Budgets GET /fleet/budgets/:productId · /burndown + PUT /fleet/budgets/:productId · POST /pause · /resume +DAG POST /fleet/jobs/:id/children · GET /fleet/jobs/:id/dag +Artifacts POST /fleet/jobs/:id/artifacts · GET (list) · GET/DELETE /fleet/artifacts/:id +Tracker POST /fleet/tracker/ingest · /fleet/tracker/echo +Metrics GET /fleet/metrics +``` + +--- + +## 10. The two control planes & feature flags + +**Browser (`tracker-web`)** — `dashboards/tracker-web/src/`: +- `app/dashboard/fleet/page.tsx` — fleet map (factory cards, health/load/caps, metrics + alerts) +- `app/dashboard/fleet/jobs/page.tsx` — stage-filtered job table +- `app/dashboard/fleet/jobs/[id]/page.tsx` — job detail: SSE event timeline, runs, artifacts, **DAG view**, **ExplainPanel**, **ReviewGateCard**, ship/requeue/reject +- `app/dashboard/fleet/budget/page.tsx` — burndown chart + pause/resume kill-switch +- `lib/fleet-client.ts` — typed client; `subscribeJobEvents` (fetch-based SSE w/ auth + `Last-Event-ID` resume + poll fallback); graceful 404 → null +- `app/api/fleet/[...path]/route.ts` — proxy to platform-service + +**Terminal (`agent-queue`)** — `learning_ai_devops_tools/agent-queue/`: +- `dashboard.mjs` (`AQ_FLEET_DASH=1`) → `lib/fleet-dash.mjs` adapter: board counts, factories (per-factory rows or metrics aggregate), alerts, running, actionable JOBS w/ tags, recent, per-job events log; ship/requeue/reject via `/fleet`. 
Local folder-queue mode byte-for-byte unchanged when the flag is off. + +**Feature flags** + +| Flag | Where | Effect | +| ---- | ----- | ------ | +| `FLEET_PREEMPTION` | platform-service | enable critical-job preemption + seat limits | +| `FLEET_BUDGETS` | platform-service | enable budget enforcement + auto-pause | +| `AQ_FLEET` | factory runner | runner becomes a coordinator factory (claim/report) | +| `AQ_FLEET_ROUTE` / `AQ_FLEET_SHADOW` | factory runner | route via service / side-effect-free shadow compare | +| `AQ_FLEET_DASH` | TUI | dashboard sources board from `/fleet` API | +| `AQ_FLEET_API` / `AQ_FLEET_TOKEN` / `AQ_PRODUCT_ID` | both | base URL / bearer / `x-product-id` | + +All flags default **off** → the system is byte-for-byte the prior single-host tool. + +--- + +## 11. Code map (where everything lives) + +**`learning_ai_common_plat` (the durable spine):** +``` +services/platform-service/src/modules/fleet/ + types.ts Zod schemas + canonical model (stages, lease, budget, DAG, events) + repository.ts per-container CRUD + revUpdate CAS, appendEvent, listChildrenByParent + coordinator.ts submit/claim/lease/fence/reaper, preemption, budgets, DAG, review, metrics + scheduler.ts pure scoring router + selectPreemptionVictim + scoreCandidate (explain) + enrollment.ts factory enroll / rotate / revoke / enforceFactoryToken + tracker-bridge.ts ingest tracker item → job; one-way status echo + artifacts.ts artifact pointer mgmt + artifacts-blob.ts blob upload/download/delete (SAS) + routes.ts all /fleet REST + SSE + *.test.ts coordinator/scheduler/repository/routes/enrollment/tracker/artifacts/types +dashboards/tracker-web/src/ + app/dashboard/fleet/** the browser control plane (pages above) + lib/fleet-client.ts typed client + SSE + app/api/fleet/[...path]/route.ts proxy + e2e/fleet.spec.ts Playwright specs +lib/cosmos-init.ts container registration +docs/gigafactory-phase3-progress.md / docs/FLEET_CONTROL_PLANE.md +``` + +**`learning_ai_devops_tools` (the factory 
agent + TUI + spec):** +``` +agent-queue/ + agent-queue.sh single-host runner + factory agent (AQ_FLEET); budget.wall, retry, recover + lib/fleet-client.sh curl-only coordinator client (register/claim/report/renew, fencing-aware) + lib/fleet-dash.mjs TUI fleet-mode adapter over /fleet (+ fleet-dash.test.mjs, 22 assertions) + dashboard.mjs the TUI (local + fleet modes) + profiles/*.md persona+capability catalog + demo/two-factory-demo.sh + coordinator-stub.sh parallel-fleet demo + selftest.sh ~75 dependency-light checks + docs/GIGAFACTORY_ROADMAP.md source-of-truth spec & checklists + docs/GIGAFACTORY_SYSTEM_OVERVIEW.md (this file) +``` + +--- + +## 12. Test coverage (what's verified) + +- **platform-service fleet** (~134+ tests): atomic-claim race (true concurrency, no + double-assign), fencing rejection, reaper reclaim + checkpoint, scheduler scoring + / tie-breaks / starvation / preemption-victim, DAG fan-out/unblock/subtree, + budgets + burndown + auto-pause, review-gate quorum, enrollment/token lifecycle + + auth enforcement, tracker ingest/echo idempotency, routes (incl. SSE + explain), + schema validation. +- **tracker-web** (~198 tests): fleet-client unit tests + page render; SSE + parse/resume/fallback; graceful 404 degradation. +- **tracker-web e2e** (`e2e/fleet.spec.ts`): fleet map, live log, ship, budget-pause, + review-gate (Playwright — needs CI wiring). +- **agent-queue** (`selftest.sh`, ~75 checks): manifest/profiles/caps/priority/deps/ + idempotency, retry/recover/insights, tracker round-trip, `AQ_FLEET` register/claim/ + fenced-patch/reaper-reclaim/quarantine, shadow AGREE/DIVERGE, two-factory demo, + **budget.wall enforcement**, **fleet-dash adapter (22 assertions)**. + +--- + +## 13. Next steps + +**Immediate (close Phase 1–3 to a clean 100%):** +1. 
**Fix the stale roadmap §0 table** and tick Phase-2 boxes 384 (scheduler wired: + `selectJob` is used in `claimNextJob`) and 386 (enrollment + scoped tokens: + `enrollment.ts` + `enforceFactoryToken` are wired). (See §14 Gaps.) +2. **Wire `e2e/fleet.spec.ts` into CI** (Playwright install + a `verify` job) so the + Phase-3 exit criterion ("web verify incl. e2e green") is enforced, not just + present. +3. **Live multi-host operator run** end-to-end (the Phase-3 acceptance: drive the + 3-repo parallel workload from the browser, including a budget pause + resume + against a real platform-service, not the stub). + +**Phase 4 (scale-out), the next major frontier:** +4. Introduce a **broker** (NATS/Redis) for push dispatch + backpressure; coordinator + publishes, factories subscribe by capability (fallback to poll on outage). +5. **Autoscaling hooks**: spin ephemeral factories (cloud VM/container) keyed to + queue depth + SLA. +6. **Capability marketplace**: route rare-capability jobs (xcode/figma/gpu) to the + few factories that have them; cross-product queueing fairness. +7. **Load + chaos suite**: factory churn, broker outage, thundering herd. + +**Phase 5 (learned routing):** +8. Capture per-run outcome features → offline eval harness (learned vs heuristic) → + shadow/A-B with guardrails → surface recommendations ("route NomGap UX jobs to + claude on mac-2: 23% faster"). + +--- + +## 14. Bugs, gaps & risks (be honest) + +**Documentation drift (highest-signal, easy to fix):** +- `GIGAFACTORY_ROADMAP.md` **§0 progress table is wrong**: it shows Phase 3 "0% not + started" while all Phase-3 boxes are ticked, and Phase 1/2 percentages (95%/80%) + understate reality. +- **Phase-2 boxes 384 & 386 are unticked but done in code.** `coordinator.ts` + imports/uses `selectJob` + `selectPreemptionVictim` in `claimNextJob`; `routes.ts` + enforces `enrollment.enforceFactoryToken` on claim/heartbeat and exposes + enroll/rotate/revoke. 
The roadmap's "remaining for 100%" note on line 390 is + outdated. + +**Runtime / correctness gaps:** +- **SSE is poll-fallback based, not a push-only contract.** `subscribeJobEvents` + falls back to `getJobEvents()` polling on stream error — fine for resilience, but + "live" can silently degrade to polling without a visible operator signal. +- **UI pages degrade silently on some errors** (empty states / `null`), which can + mask a real backend outage as "nothing happening." +- **Budget page assumes `ceilingUsd` exists** when rendering the spend bar — a + budget doc without a ceiling could render a broken/NaN bar. Guard it. +- **Dashboard `patchJob` only sends `{stage, leaseEpoch}`** — other fenced-transition + fields (e.g. `checkpoint`) aren't exposed in the web UI, so operator-driven + transitions can't carry a checkpoint. +- **`rev` CAS on the memory provider** is exact only for the sequential calls the + coordinator/tests make (re-read `rev` before write). Real concurrency safety + depends on Cosmos `_etag`/`If-Match` in production — verify the Cosmos path under + true contention before relying on it at scale. + +**TUI-specific (this repo):** +- Fleet **utilization %** only renders in the metrics-aggregate fallback branch, not + when per-factory rows are present — a minor inconsistency in the TUI board. +- The **budget.wall live selftest is timing-sensitive** (races a 2s wall ceiling) and + can flake under heavy disk/CPU load; the code is correct but the test could be made + more robust (e.g. inject the clock). +- TUI fleet mode has **no write path for budgets/preemption** — it's read + job + actions only; budget pause/resume is web-only. + +**Operational / not-yet-built (expected, Phase 4+):** +- **No message bus** — dispatch is poll-based; no push/backpressure yet. +- **No autoscaling** — factory fleet is static/manually run. +- **No capability marketplace / cross-product fairness** under contention. 
+- **No load/chaos test suite** — resilience is unit-proven, not load-proven. +- **Artifacts blob wiring** (`fleet_artifacts` → real blob storage) should be + validated against a live storage account (tests use memory/pointer only). + +--- + +## 15. TL;DR + +Phases 0–3 are functionally **complete and well-tested**: a durable coordinator with +exactly-once leasing + fencing + crash recovery, a deterministic scoring router with +preemption + explainability, per-product budgets, DAG decomposition, a multi-reviewer +gate, factory enrollment with scoped tokens, and **two** control planes (browser + +TUI) over one `/fleet` API. The remaining work is (a) trivial doc corrections, (b) +CI-enforcing the existing e2e, and (c) the genuinely new Phase-4 scale-out frontier +(broker, autoscaling, marketplace, chaos) and Phase-5 learned routing.
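
The "exactly-once leasing + fencing + crash recovery" guarantee summarised above (and drawn in the §5 sequence diagram) can be sketched in a few lines of TypeScript. This is a hedged, in-memory illustration only: `MemoryJobs`, `claim`, `reap`, and `report` are hypothetical names rather than the real `repository.ts` or `coordinator.ts` API, `rev` stands in for the Cosmos `_etag`, and a single-threaded map cannot prove anything about production contention.

```typescript
// Illustrative in-memory model; none of these names are the real module API.
type Stage = "queued" | "assigned" | "building" | "shipped";

interface Job {
  id: string;
  stage: Stage;
  rev: number; // bumped on every successful write (stands in for _etag)
  leaseEpoch: number; // bumped by the reaper when a lease is reclaimed
  holder?: string;
  checkpoint?: string; // WIP pointer that must survive a reclaim
}

class MemoryJobs {
  private jobs: { [id: string]: Job } = {};

  put(job: Job): void {
    this.jobs[job.id] = { ...job };
  }

  get(id: string): Job {
    const job = this.jobs[id];
    if (!job) throw new Error("job not found: " + id);
    return { ...job };
  }

  // Compare-and-swap: the write lands only if the stored rev is unchanged.
  revUpdate(id: string, expectedRev: number, patch: Partial<Job>): boolean {
    const cur = this.jobs[id];
    if (!cur || cur.rev !== expectedRev) return false;
    this.jobs[id] = { ...cur, ...patch, rev: cur.rev + 1 };
    return true;
  }
}

// Claim: read, then CAS queued -> assigned. A losing claimer gets null.
function claim(store: MemoryJobs, id: string, factoryId: string): Job | null {
  const seen = store.get(id);
  if (seen.stage !== "queued") return null;
  const won = store.revUpdate(id, seen.rev, { stage: "assigned", holder: factoryId });
  return won ? store.get(id) : null;
}

// Reaper: requeue, bump the epoch, leave the checkpoint untouched.
function reap(store: MemoryJobs, id: string): boolean {
  const seen = store.get(id);
  return store.revUpdate(id, seen.rev, {
    stage: "queued",
    holder: undefined,
    leaseEpoch: seen.leaseEpoch + 1,
  });
}

// Worker report: a stale epoch is fenced (the 409 case in section 5).
function report(
  store: MemoryJobs,
  id: string,
  leaseEpoch: number,
  stage: Stage
): "ok" | "fenced" | "conflict" {
  const seen = store.get(id);
  if (leaseEpoch < seen.leaseEpoch) return "fenced";
  return store.revUpdate(id, seen.rev, { stage }) ? "ok" : "conflict";
}
```

Replaying the §5 scenario against this model: two claimers that both read `rev=5` race on `revUpdate` and only the first write lands; after `reap`, the job carries `leaseEpoch=2` with its checkpoint intact, so a late report stamped with epoch 1 returns `"fenced"` while the new holder's report at epoch 2 is accepted.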