Gigafactory — Roadmap Completion Audit
Source of truth: `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md`
Audit date: 2026-05-30 · Auditor: Principal Full-Stack review
Scope: `services/platform-service/src/modules/fleet/**` + `dashboards/tracker-web/src/app/dashboard/fleet/**`

⚠️ Status update (2026-05-31): this audit is a point-in-time snapshot; some findings below are now superseded:
- The roadmap §0 tracker has since been reconciled; it now reads Phase 0 ✅100% · 1 ✅~98% · 2 ✅~98% · 3 ✅100% · 4 ◐ in progress · 5 ☐. The "§0 is stale / Phase 3 0% / Phase 2 80%" notes below are no longer accurate.
- Phase 4 is no longer "not started": M0 (RU gate) is shipped (`fleet_queue_state` + `GET /fleet/queue-state` + `AQ_FLEET_GATE`). The broker (M1+) design + checklist live in `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md`.
1. Product understanding
The Agent Gigafactory is a distributed system that turns work items (tracker Items, manifests)
into jobs executed in parallel by a fleet of factories (agent runners on mac/ubuntu/windows).
A coordinator (the fleet module in platform-service) owns durable job state in Cosmos, hands jobs
to factories via atomic leases with fencing tokens, recovers crashed work via a reaper,
and echoes status back to the tracker. A browser control plane in tracker-web lets operators
watch and steer the fleet.
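
The lease-plus-fencing idea described above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the fleet module's code: `JobRecord`, `claim`, and `applyTransition` are hypothetical names, and only `leaseEpoch` is taken from this audit.

```typescript
// Hypothetical in-memory model of lease fencing; only leaseEpoch is named in the audit.
interface JobRecord {
  id: string;
  stage: string;
  leaseEpoch: number; // bumped every time a new lease is granted
  leaseHolder: string | null;
}

// Grant a lease: bumping the epoch fences out any previous holder.
function claim(job: JobRecord, factoryId: string): number {
  job.leaseEpoch += 1;
  job.leaseHolder = factoryId;
  return job.leaseEpoch; // the fencing token the factory must present later
}

// Accept a stage transition only when the caller presents the current epoch.
function applyTransition(job: JobRecord, token: number, nextStage: string): boolean {
  if (token !== job.leaseEpoch) return false; // stale worker: reject the write
  job.stage = nextStage;
  return true;
}

const job: JobRecord = { id: "job-1", stage: "queued", leaseEpoch: 0, leaseHolder: null };
const tokenA = claim(job, "factory-a"); // factory-a holds epoch 1
const tokenB = claim(job, "factory-b"); // lease reassigned: factory-b holds epoch 2
const staleWrite = applyTransition(job, tokenA, "building"); // rejected
const freshWrite = applyTransition(job, tokenB, "building"); // accepted
```

The point of the token is that a factory which crashed and came back late cannot overwrite state that a newer lease holder now owns.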
2. Current architecture
tracker Item ──ingest──▶ fleet coordinator ──lease/claim──▶ factory agents (bash runner, AQ_FLEET)
                         │ (platform-service)                   │
                         ├─ Cosmos: jobs/runs/leases/           ├─ heartbeat / renew / release
                         │   factories/events/artifacts/budgets └─ fenced stage transitions
                         └─ tracker-bridge ──echo──▶ tracker Item
tracker-web /dashboard/fleet ──/api/fleet proxy──▶ fleet REST (24 endpoints)
- Module layout (canonical pattern): `types.ts → repository.ts → coordinator.ts → scheduler.ts → routes.ts` plus `tracker-bridge.ts`, `enrollment.ts`, `artifacts.ts`.
- Concurrency core: optimistic `rev/_etag` updates, `leaseEpoch` fencing, lease reaper.
- Scheduler: pure `selectJob/scoreCandidate` with capability hard-filter + weighted scoring; Phase-3 `resolveWeights` (per-product/request) + `selectPreemptionVictim`.
- Feature flags (default OFF): `FLEET_PREEMPTION`, `FLEET_BUDGETS`, `FLEET_TRACKER_ECHO`, `FLEET_REQUIRE_FACTORY_TOKEN`.
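
The scheduler bullet above (capability hard-filter, then weighted scoring) might look roughly like the following. The real `selectJob`/`scoreCandidate` signatures are not shown in this audit, so the `Candidate` and `Weights` shapes, the two scoring terms, and the weight values are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the audit does not show the real scheduler types.
interface Candidate {
  id: string;
  requiredCapabilities: string[];
  priority: number; // higher means more urgent
  ageMinutes: number; // time spent waiting in the queue
}

interface Weights {
  priority: number;
  age: number;
}

// Hard filter: the factory must offer every capability the job requires.
function isEligible(job: Candidate, factoryCapabilities: string[]): boolean {
  return job.requiredCapabilities.every((cap) => factoryCapabilities.indexOf(cap) !== -1);
}

// Weighted score; a pure function, so it is easy to unit test.
function scoreCandidate(job: Candidate, w: Weights): number {
  return w.priority * job.priority + w.age * job.ageMinutes;
}

// Pick the highest-scoring eligible job, or null when nothing matches.
function selectJob(jobs: Candidate[], factoryCapabilities: string[], w: Weights): Candidate | null {
  let best: Candidate | null = null;
  let bestScore = -Infinity;
  for (const job of jobs) {
    if (!isEligible(job, factoryCapabilities)) continue;
    const score = scoreCandidate(job, w);
    if (score > bestScore) {
      best = job;
      bestScore = score;
    }
  }
  return best;
}

const queue: Candidate[] = [
  { id: "a", requiredCapabilities: ["windows"], priority: 9, ageMinutes: 5 },
  { id: "b", requiredCapabilities: ["ubuntu"], priority: 3, ageMinutes: 10 },
  { id: "c", requiredCapabilities: ["ubuntu", "docker"], priority: 5, ageMinutes: 1 },
];
const picked = selectJob(queue, ["ubuntu", "docker"], { priority: 10, age: 1 });
```

Job `a` has the highest priority but is filtered out before scoring because the factory lacks `windows`. That ordering is the difference between a hard filter and a heavily weighted term.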
3. Verified test/build status (baseline)
| Check | Command | Result |
|---|---|---|
| Fleet module tests | `pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet` | ✅ 134 passing (8 files) |
| Platform-service full | `pnpm --filter @lysnrai/platform-service test` | ✅ 1646 passing |
| tracker-web tests | `pnpm --filter @bytelyst/tracker-web test` | ✅ 198 passing |
| Monorepo build | `pnpm build` | ✅ green |
4. Completed features (verified in code)
Phase 1 — single-host runner (95% per roadmap §0)
- ✅ Manifest parsing, priority ordering, capability match, engine-class, idempotency dedupe
- ✅ Profiles + resolution, deps/DAG blocking + cycle detection, warn-only allowed-scope
- ✅ Crash recovery (`recover_orphans`), WIP checkpoint/resume, retry w/ backoff, insights
- ✅ Tracker adapter (`from-tracker`/`to-tracker`, idempotent, non-fatal echo)
Phase 2 — coordinator module (roadmap says 80%, code shows ~95%)
- ✅ `fleet` module scaffolded; Cosmos containers + repository (memory + Cosmos providers)
- ✅ Atomic claim (`revUpdateJob`/`_etag`) + lease reaper + fencing (`leaseEpoch`)
- ✅ Factory-agent API client (`lib/fleet-client.sh` behind `AQ_FLEET`)
- ✅ Scheduler/router core wired: `coordinator.claimNextJob` calls `selectJob` (`coordinator.ts:502`)
- ✅ Tracker adapter direct call: `tracker-bridge.ts` `ingestItemAsJob`/`echoJobToItem`
- ✅ Factory enrollment + scoped tokens: `enrollment.ts` + `/fleet/factories/enroll|rotate|revoke`
- ✅ Feature flags + shadow/dual-run; two-factory demo; module test suite
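
The atomic-claim and lease-reaper items above rest on compare-and-swap writes. The sketch below models that with a `rev` counter in a plain object store. It does not use Cosmos or its `_etag`. `LeaseDoc`, `revUpdate`, `tryClaim`, `reapExpired`, and the 60-second TTL are hypothetical names and values chosen for the example.

```typescript
// Hypothetical in-memory stand-in for the repository; Cosmos _etag is not modelled.
interface LeaseDoc {
  id: string;
  rev: number; // optimistic concurrency counter
  leaseEpoch: number;
  holder: string | null;
  leaseExpiresAt: number; // epoch milliseconds
}

const store: Record<string, LeaseDoc> = {};

// Compare-and-swap: the write lands only if rev is unchanged since the read.
function revUpdate(next: LeaseDoc, expectedRev: number): boolean {
  const current = store[next.id];
  if (!current || current.rev !== expectedRev) return false;
  store[next.id] = { ...next, rev: expectedRev + 1 };
  return true;
}

// Claim an unheld job; a concurrent writer that bumped rev makes this fail.
function tryClaim(id: string, factoryId: string, now: number, ttlMs: number): boolean {
  const doc = store[id];
  if (!doc || doc.holder !== null) return false;
  const next = { ...doc, holder: factoryId, leaseEpoch: doc.leaseEpoch + 1, leaseExpiresAt: now + ttlMs };
  return revUpdate(next, doc.rev);
}

// Reaper: release leases whose expiry has passed so the job can be claimed again.
function reapExpired(now: number): string[] {
  const reaped: string[] = [];
  for (const id of Object.keys(store)) {
    const doc = store[id];
    if (doc.holder !== null && doc.leaseExpiresAt <= now) {
      if (revUpdate({ ...doc, holder: null }, doc.rev)) reaped.push(id);
    }
  }
  return reaped;
}

store["job-1"] = { id: "job-1", rev: 0, leaseEpoch: 0, holder: null, leaseExpiresAt: 0 };
const first = tryClaim("job-1", "factory-a", 1000, 60000); // succeeds
const second = tryClaim("job-1", "factory-b", 2000, 60000); // fails: already held
const reaped = reapExpired(100000); // factory-a's lease expired at 61000
const third = tryClaim("job-1", "factory-b", 100000, 60000); // succeeds after the reap
```

Each successful claim bumps `leaseEpoch`, so a write from factory-a arriving after the reap would carry a stale epoch and could be rejected.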
Phase 3 — control plane + DAG + budgets + scoring (roadmap says 0%, code shows ~90%)
- ✅ Tunable scoring weights (`resolveWeights`, per-product registry + request override)
- ✅ Preemption behind `FLEET_PREEMPTION` (`selectPreemptionVictim`)
- ✅ DAG decomposition: `POST /fleet/jobs/:id/children`, `GET /fleet/jobs/:id/dag`, parent block/unblock
- ✅ Budgets: `FleetBudgetDoc`, GET/PUT/pause/resume, enforcement behind `FLEET_BUDGETS`
- ✅ tracker-web fleet UI: overview, jobs table, job detail, budget pages + typed client + proxy
- ✅ Operator job actions (requeue/reject/cancel): backend + UI (no lease held; fences worker)
- ✅ Scoring explainability: `GET /fleet/jobs/:id/explain` + routing-score UI panel
- ✅ Cost burndown: per-day series endpoint + chart with ceiling overlay
- ✅ SSE live log streaming: `GET /fleet/jobs/:id/events/stream` (resumable) + `subscribeJobEvents`
- ✅ Fleet Playwright e2e: `e2e/fleet.spec.ts` (overview, jobs, job-detail, budget, review gate)
- ✅ Fleet metrics + alerting: `GET /fleet/metrics` + overview metrics/alerts panel (§17)
- ✅ Multi-reviewer routing: review-policy human gate (`requestReview`/`submitReview`) + gate UI
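
The audit says only that the SSE stream is "resumable". One common way to get that is the standard `Last-Event-ID` header. The sketch below assumes that mechanism, which this audit does not confirm, and shows how a consumer could track the id to resume from. `parseSseChunk` and `lastEventId` are hypothetical helpers, and the parser handles only the `id` and `data` fields of the SSE wire format.

```typescript
// Minimal SSE frame parser (id and data fields only); a simplification of the full spec.
interface ParsedEvent {
  id: string | null;
  data: string;
}

function parseSseChunk(chunk: string): ParsedEvent[] {
  const events: ParsedEvent[] = [];
  for (const frame of chunk.split("\n\n")) {
    let id: string | null = null;
    const dataLines: string[] = [];
    for (const line of frame.split("\n")) {
      const colon = line.indexOf(":");
      if (colon <= 0) continue; // skips blank lines and ":" comment lines
      const field = line.slice(0, colon);
      // The wire format allows one optional space after the colon.
      const value = line.slice(colon + 1).replace(/^ /, "");
      if (field === "id") id = value;
      if (field === "data") dataLines.push(value);
    }
    if (dataLines.length > 0) events.push({ id, data: dataLines.join("\n") });
  }
  return events;
}

// Remember the most recent id so a reconnect can send it back as Last-Event-ID.
function lastEventId(events: ParsedEvent[], previous: string | null): string | null {
  let last = previous;
  for (const e of events) {
    if (e.id !== null) last = e.id;
  }
  return last;
}

const wire = "id: 41\ndata: stage=build\n\n: keep-alive\n\nid: 42\ndata: stage=test\n\n";
const parsed = parseSseChunk(wire);
const resumeFrom = lastEventId(parsed, null); // candidate value for Last-Event-ID
```

In a browser, `EventSource` does this bookkeeping itself. Tracking the id by hand matters mainly for `fetch`-based or server-side consumers.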
5. Partial features (started, not complete)
| Feature | What exists | What's missing |
|---|---|---|
| TUI dashboard | legacy TUI against single-host queue | re-point at /fleet API for parity (P3, separate repo) |
6. Missing features (not started)
- Phase 3: TUI re-point at `/fleet` (in `learning_ai_devops_tools`)
- Phase 4: message broker (NATS/Redis), autoscaling hooks, capability marketplace, load/chaos suite
- Phase 5: outcome feature capture, offline eval harness, A/B weight tuning, recommendations
- Phase 1 leftovers: `budget.wall` wall-clock enforcement; Node `dash` tag surfacing
7. Broken flows
None found. Build + all test suites green at audit time.
8. Mock / stubbed flows
- `dashboards/tracker-web/src/app/api/fleet/[...path]/route.ts` proxies to `PLATFORM_API_URL` (default `http://localhost:4003`). UI degrades gracefully (404 → null/empty) when the fleet module is unreachable; this is intended, not a stub to replace.
- No mock data baked into pages; all reads go through the typed `fleet-client.ts`.
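
The "404 → null/empty" behaviour noted above could be expressed as a small response-unwrapping helper. The actual `fleet-client.ts` is not shown in this audit, so `unwrapFleetResponse` and its lazy `parseBody` callback are hypothetical, a sketch of the pattern rather than the client's code.

```typescript
// Hypothetical helper; illustrates mapping a 404 to null instead of an error.
function unwrapFleetResponse<T>(status: number, parseBody: () => T): T | null {
  if (status === 404) return null; // fleet module absent: degrade to an empty state
  if (status < 200 || status >= 300) {
    throw new Error(`fleet request failed with status ${status}`);
  }
  return parseBody(); // parse lazily so a 404 body is never read
}

// On a 404 the page falls back to an empty list rather than failing.
const absent = unwrapFleetResponse<string[]>(404, () => {
  throw new Error("body must not be read on 404");
});
const jobIds = absent ?? [];

// On a 2xx the parsed body is passed through unchanged.
const present = unwrapFleetResponse(200, () => ["job-1"]);
```

Other non-2xx statuses still throw, so real failures stay visible while a missing fleet module reads as "no data".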
9. Security risks
- Factory token enforcement is flag-gated (`FLEET_REQUIRE_FACTORY_TOKEN`, default OFF). For production, enrollment tokens should be ON. Documented in `.env.example`.
- Budget enforcement OFF by default: the cost-runaway guard (§18) is not active until `FLEET_BUDGETS=1`.
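
Because both risks above come from flags that default to OFF, a startup check can make the gap visible. The sketch below uses the flag names from this audit. The parsing rule (`1` or `true` enables a flag) and the choice of which flags count as required in production are assumptions, and `readFlags` and `missingProductionFlags` are hypothetical helpers.

```typescript
// Flag names come from the audit; the parsing rule below is an assumption.
const FLEET_FLAGS = [
  "FLEET_PREEMPTION",
  "FLEET_BUDGETS",
  "FLEET_TRACKER_ECHO",
  "FLEET_REQUIRE_FACTORY_TOKEN",
] as const;
type FleetFlag = (typeof FLEET_FLAGS)[number];

// Read flags from an env-like object; anything other than "1"/"true" (including unset) is OFF.
function readFlags(env: Record<string, string | undefined>): Record<FleetFlag, boolean> {
  const result = {} as Record<FleetFlag, boolean>;
  for (const name of FLEET_FLAGS) {
    const raw = (env[name] ?? "").trim().toLowerCase();
    result[name] = raw === "1" || raw === "true";
  }
  return result;
}

// Assumed production policy: token enforcement and budgets should both be ON.
function missingProductionFlags(flags: Record<FleetFlag, boolean>): FleetFlag[] {
  const required: FleetFlag[] = ["FLEET_REQUIRE_FACTORY_TOKEN", "FLEET_BUDGETS"];
  return required.filter((name) => !flags[name]);
}

const flags = readFlags({ FLEET_BUDGETS: "1" });
const missing = missingProductionFlags(flags);
```

In a Node service the argument would typically be `process.env`. Taking it as a parameter keeps the function pure and testable.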
10. Deployment risks
- `PLATFORM_API_URL` must be set for the tracker-web proxy in non-local environments.
- Cosmos containers must exist with `/productId` partition keys before first write.
- No live-log transport to blob yet (§17); operators rely on polling.
11. Prioritized remaining work
See TASKS_TO_COMPLETE.md. Highest-impact safe slices, in order:
- Operator job actions (requeue/reject/cancel) — completes a Phase-3 §14 box; backend+UI; low risk
- Scoring explainability surfaced — data already computed; additive endpoint + UI
- Cost burndown — additive UI on existing budget data
- SSE live logs — larger; needs streaming route + consumer
- Fleet Playwright e2e — required for Phase-3 exit gate
- Phase 4/5 — out of MVP scope; track as future
12. Note on roadmap drift
The roadmap §0 tracker is stale: it shows Phase 3 at 0% and Phase 2 at 80%, but the code
implements nearly all Phase-2 boxes and the core Phase-3 backend + UI. A docs reconciliation
(ticking §14 boxes in learning_ai_devops_tools) is itself a tracked task (p2-roadmap-tick).