# Gigafactory — Roadmap Completion Audit

> Source of truth: `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md`
> Audit date: 2026-05-30 · Auditor: Principal Full-Stack review
> Scope: `services/platform-service/src/modules/fleet/**` + `dashboards/tracker-web/src/app/dashboard/fleet/**`

## 1. Product understanding

The **Agent Gigafactory** is a distributed system that turns work items (tracker Items, manifests) into **jobs** executed in parallel by a fleet of **factories** (agent runners on mac/ubuntu/windows). A coordinator (the `fleet` module in `platform-service`) owns durable job state in Cosmos, hands jobs to factories via **atomic leases** with **fencing tokens**, recovers crashed work via a **reaper**, and echoes status back to the tracker. A browser **control plane** in `tracker-web` lets operators watch and steer the fleet.

## 2. Current architecture

```
tracker Item ──ingest──▶ fleet coordinator ──lease/claim──▶ factory agents (bash runner, AQ_FLEET)
                         │ (platform-service)                   │
                         ├─ Cosmos: jobs/runs/leases/           ├─ heartbeat / renew / release
                         │   factories/events/artifacts/budgets └─ fenced stage transitions
                         └─ tracker-bridge ──echo──▶ tracker Item

tracker-web /dashboard/fleet ──/api/fleet proxy──▶ fleet REST (24 endpoints)
```

- **Module layout** (canonical pattern): `types.ts → repository.ts → coordinator.ts → scheduler.ts → routes.ts` plus `tracker-bridge.ts`, `enrollment.ts`, `artifacts.ts`.
- **Concurrency core:** optimistic `rev`/`_etag` updates, `leaseEpoch` fencing, lease reaper.
- **Scheduler:** pure `selectJob` / `scoreCandidate` with capability hard-filter + weighted scoring; Phase-3 `resolveWeights` (per-product/request) + `selectPreemptionVictim`.
- **Feature flags (default OFF):** `FLEET_PREEMPTION`, `FLEET_BUDGETS`, `FLEET_TRACKER_ECHO`, `FLEET_REQUIRE_FACTORY_TOKEN`.

## 3. Verified test/build status (baseline)

| Check                 | Command                                                                     | Result                   |
| --------------------- | --------------------------------------------------------------------------- | ------------------------ |
| Fleet module tests    | `pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet` | ✅ 134 passing (8 files) |
| Platform-service full | `pnpm --filter @lysnrai/platform-service test`                              | ✅ 1646 passing          |
| tracker-web tests     | `pnpm --filter @bytelyst/tracker-web test`                                  | ✅ 198 passing           |
| Monorepo build        | `pnpm build`                                                                | ✅ green                 |

## 4. Completed features (verified in code)

### Phase 1 — single-host runner (95% per roadmap §0)

- ✅ Manifest parsing, priority ordering, capability match, engine-class, idempotency dedupe
- ✅ Profiles + resolution, deps/DAG blocking + cycle detection, warn-only allowed-scope
- ✅ Crash recovery (`recover_orphans`), WIP checkpoint/resume, retry w/ backoff, insights
- ✅ Tracker adapter (`from-tracker`/`to-tracker`, idempotent, non-fatal echo)

### Phase 2 — coordinator module (roadmap says 80%, code shows ~95%)

- ✅ `fleet` module scaffolded; Cosmos containers + repository (memory + Cosmos providers)
- ✅ Atomic claim (`revUpdateJob`/`_etag`) + lease reaper + fencing (`leaseEpoch`)
- ✅ Factory-agent API client (`lib/fleet-client.sh` behind `AQ_FLEET`)
- ✅ **Scheduler/router core wired** — `coordinator.claimNextJob` calls `selectJob` (coordinator.ts:502)
- ✅ **Tracker adapter direct call** — `tracker-bridge.ts` `ingestItemAsJob`/`echoJobToItem`
- ✅ **Factory enrollment + scoped tokens** — `enrollment.ts` + `/fleet/factories/enroll|rotate|revoke`
- ✅ Feature flags + shadow/dual-run; two-factory demo; module test suite

### Phase 3 — control plane + DAG + budgets + scoring (roadmap says 0%, code shows ~90%)

- ✅ Tunable scoring weights (`resolveWeights`, per-product registry + request override)
- ✅ Preemption behind `FLEET_PREEMPTION` (`selectPreemptionVictim`)
- ✅ DAG decomposition — `POST /fleet/jobs/:id/children`, `GET /fleet/jobs/:id/dag`, parent block/unblock
- ✅ Budgets — `FleetBudgetDoc`, GET/PUT/pause/resume, enforcement behind `FLEET_BUDGETS`
- ✅ tracker-web fleet UI — overview, jobs table, job detail, budget pages + typed client + proxy
- ✅ Operator job actions (requeue/reject/cancel) — backend + UI (no lease held; fences worker)
- ✅ Scoring explainability — `GET /fleet/jobs/:id/explain` + routing-score UI panel
- ✅ Cost burndown — per-day series endpoint + chart with ceiling overlay
- ✅ SSE live log streaming — `GET /fleet/jobs/:id/events/stream` (resumable) + `subscribeJobEvents`
- ✅ Fleet Playwright e2e — `e2e/fleet.spec.ts` (overview, jobs, job-detail, budget, review gate)
- ✅ Fleet metrics + alerting — `GET /fleet/metrics` + overview metrics/alerts panel (§17)
- ✅ Multi-reviewer routing — review-policy human gate (`requestReview`/`submitReview`) + gate UI

## 5. Partial features (started, not complete)

| Feature       | What exists                          | What's missing                                          |
| ------------- | ------------------------------------ | ------------------------------------------------------- |
| TUI dashboard | legacy TUI against single-host queue | re-point at `/fleet` API for parity (P3, separate repo) |

## 6. Missing features (not started)

- **Phase 3:** TUI re-point at `/fleet` (in `learning_ai_devops_tools`)
- **Phase 4:** message broker (NATS/Redis), autoscaling hooks, capability marketplace, load/chaos suite
- **Phase 5:** outcome feature capture, offline eval harness, A/B weight tuning, recommendations
- **Phase 1 leftovers:** `budget.wall` wall-clock enforcement; Node `dash` tag surfacing

## 7. Broken flows

None found. Build + all test suites green at audit time.

## 8. Mock / stubbed flows

- `dashboards/tracker-web/src/app/api/fleet/[...path]/route.ts` proxies to `PLATFORM_API_URL` (default `http://localhost:4003`). UI **degrades gracefully** (404 → null/empty) when the fleet module is unreachable — this is intended, not a stub to replace.
- No mock data baked into pages; all reads go through the typed `fleet-client.ts`.

## 9. Security risks

- Factory token enforcement is **flag-gated** (`FLEET_REQUIRE_FACTORY_TOKEN`, default OFF). For production, enrollment tokens should be ON. Documented in `.env.example`.
- Budget enforcement OFF by default — cost-runaway guard (§18) not active until `FLEET_BUDGETS=1`.

## 10. Deployment risks

- `PLATFORM_API_URL` must be set for the tracker-web proxy in non-local environments.
- Cosmos containers must exist with `/productId` partition keys before first write.
- No live-log transport to blob yet (§17) — operators rely on polling.

## 11. Prioritized remaining work

See `TASKS_TO_COMPLETE.md`. Highest-impact safe slices, in order:

1. **Operator job actions (requeue/reject/cancel)** — completes a Phase-3 §14 box; backend+UI; low risk
2. **Scoring explainability surfaced** — data already computed; additive endpoint + UI
3. **Cost burndown** — additive UI on existing budget data
4. **SSE live logs** — larger; needs streaming route + consumer
5. **Fleet Playwright e2e** — required for Phase-3 exit gate
6. Phase 4/5 — out of MVP scope; track as future

## 12. Note on roadmap drift

The roadmap §0 tracker is **stale**: it shows Phase 3 at 0% and Phase 2 at 80%, but the code implements nearly all Phase-2 boxes and the core Phase-3 backend + UI. A docs reconciliation (ticking §14 boxes in `learning_ai_devops_tools`) is itself a tracked task (`p2-roadmap-tick`).
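
As a reading aid for the concurrency terms used in §1 and §2 (atomic lease, `rev` optimistic updates, `leaseEpoch` fencing, reaper), the following is a minimal in-memory TypeScript sketch of how those pieces usually fit together. It is an illustration under stated assumptions, not the `fleet` module's code: `LeaseStore`, `tryClaim`, `reapExpired`, `complete`, and the job shape are hypothetical names chosen for this example, and only the `rev` and `leaseEpoch` field names are borrowed from the audit.

```typescript
// Hypothetical in-memory model. Nothing here is copied from the fleet module;
// only the field names `rev` and `leaseEpoch` come from the audit text.
type JobStatus = "queued" | "leased" | "done";

interface Job {
  id: string;
  status: JobStatus;
  rev: number; // optimistic-concurrency revision, bumped on every write
  leaseEpoch: number; // fencing token, bumped on every new lease
  leaseHolder?: string;
  leaseExpiresAt?: number; // epoch milliseconds
}

class LeaseStore {
  private jobs = new Map<string, Job>();

  add(id: string): void {
    this.jobs.set(id, { id, status: "queued", rev: 0, leaseEpoch: 0 });
  }

  // Returns a copy so callers cannot mutate stored state directly.
  get(id: string): Job | undefined {
    const job = this.jobs.get(id);
    return job ? { ...job } : undefined;
  }

  // Compare-and-swap write: succeeds only if the caller read the latest rev.
  private casWrite(expectedRev: number, next: Job): boolean {
    const current = this.jobs.get(next.id);
    if (!current || current.rev !== expectedRev) return false;
    this.jobs.set(next.id, { ...next, rev: expectedRev + 1 });
    return true;
  }

  // Claim a queued job. Returns the fencing token, or null if the claim lost.
  tryClaim(id: string, factoryId: string, now: number, ttlMs: number): number | null {
    const job = this.get(id);
    if (!job || job.status !== "queued") return null;
    const leaseEpoch = job.leaseEpoch + 1;
    const ok = this.casWrite(job.rev, {
      ...job,
      status: "leased",
      leaseEpoch,
      leaseHolder: factoryId,
      leaseExpiresAt: now + ttlMs,
    });
    return ok ? leaseEpoch : null;
  }

  // Reaper: return expired leases to the queue. The epoch is never reset,
  // so the next claim always receives a strictly larger token.
  reapExpired(now: number): string[] {
    const reaped: string[] = [];
    for (const job of Array.from(this.jobs.values())) {
      if (job.status === "leased" && (job.leaseExpiresAt ?? 0) <= now) {
        const requeued: Job = { ...job, status: "queued", leaseHolder: undefined, leaseExpiresAt: undefined };
        if (this.casWrite(job.rev, requeued)) reaped.push(job.id);
      }
    }
    return reaped;
  }

  // Fenced completion: rejected unless the caller presents the current token.
  complete(id: string, leaseEpoch: number): boolean {
    const job = this.get(id);
    if (!job || job.status !== "leased" || job.leaseEpoch !== leaseEpoch) return false;
    return this.casWrite(job.rev, { ...job, status: "done", leaseHolder: undefined, leaseExpiresAt: undefined });
  }
}

// Usage: factory-a stalls, the reaper requeues the job, factory-b reclaims it,
// and factory-a's late write is fenced off by its outdated token.
const store = new LeaseStore();
store.add("job-1");
const tokenA = store.tryClaim("job-1", "factory-a", 0, 1000);
const reaped = store.reapExpired(2000);
const tokenB = store.tryClaim("job-1", "factory-b", 2000, 1000);
const staleWrite = store.complete("job-1", tokenA ?? -1);
const freshWrite = store.complete("job-1", tokenB ?? -1);
```

In this sketch the stale worker is rejected because its token no longer matches the stored `leaseEpoch`, while the `rev` check guards against two writers racing on the same read. A durable store would presumably perform the same compare-and-swap with a conditional write rather than an in-process map.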