saravanakumardb1 78c4e47460 docs(gigafactory): fix stale/incorrect fleet docs

- fleet module README: add fleet_queue_state container + GET /fleet/queue-state
  and /fleet/metrics; note the heartbeat cadence must stay under the 90s stale
  threshold (AQ_FLEET_LEASE_RENEW_SEC).
- FLEET_CONTROL_PLANE: correct wrong endpoint paths (/fleet/claim and
  /fleet/factories/heartbeat were documented as /fleet/jobs/:id/claim and
  /fleet/factories/:id/heartbeat; removed a non-existent GET /fleet/factories);
  add enroll, metrics, and the M0 queue-state endpoint.
- ROADMAP_COMPLETION_AUDIT: dated status banner — roadmap §0 now reconciled and
  Phase-4 M0 shipped, superseding the older "stale §0 / not started" findings.
- README: point to FLEET_DISPATCH_REDESIGN.md + the M0 gate.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

2026-06-01 00:03:05 -07:00

8.4 KiB


Gigafactory — Roadmap Completion Audit

Source of truth: learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md
Audit date: 2026-05-30 · Auditor: Principal Full-Stack review
Scope: services/platform-service/src/modules/fleet/** + dashboards/tracker-web/src/app/dashboard/fleet/**

⚠️ Status update (2026-05-31) — this audit is a point-in-time snapshot; some findings below are now superseded:

The roadmap §0 tracker has since been reconciled — it now reads Phase 0 ✅100% · 1 ✅~98% · 2 ✅~98% · 3 ✅100% · 4 ◐ in progress · 5 ☐. The "§0 is stale / Phase 3 0% / Phase 2 80%" notes below are no longer accurate.

Phase 4 is no longer "not started": M0 (RU gate) is shipped — fleet_queue_state + GET /fleet/queue-state + AQ_FLEET_GATE. The broker (M1+) design + checklist live in learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md.

1. Product understanding

The Agent Gigafactory is a distributed system that turns work items (tracker Items, manifests) into jobs executed in parallel by a fleet of factories (agent runners on mac/ubuntu/windows). A coordinator (the fleet module in platform-service) owns durable job state in Cosmos, hands jobs to factories via atomic leases with fencing tokens, recovers crashed work via a reaper, and echoes status back to the tracker. A browser control plane in tracker-web lets operators watch and steer the fleet.
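The lease-plus-fencing idea described above can be sketched in a few lines. This is a minimal in-memory illustration, not the fleet module's code: only `leaseEpoch` comes from this audit, while `Job`, `Lease`, `claim`, and `transition` are hypothetical names and shapes chosen for the example.

```typescript
// Hypothetical in-memory model of lease fencing.
// Only the leaseEpoch concept comes from the audit; everything else is illustrative.
interface Job {
  id: string;
  stage: string;
  leaseEpoch: number; // bumped on every successful claim
  leasedBy: string | null;
  leaseExpiresAt: number; // epoch millis
}

interface Lease {
  jobId: string;
  factoryId: string;
  epoch: number; // the fencing token handed to the factory
}

// Grant a lease only if no unexpired lease is held; each grant bumps the epoch.
function claim(job: Job, factoryId: string, now: number, ttlMs: number): Lease | null {
  const held = job.leasedBy !== null && job.leaseExpiresAt > now;
  if (held) return null;
  job.leaseEpoch += 1;
  job.leasedBy = factoryId;
  job.leaseExpiresAt = now + ttlMs;
  return { jobId: job.id, factoryId, epoch: job.leaseEpoch };
}

// Apply a stage transition only when the caller's token matches the current epoch.
// A factory whose lease expired and was re-granted elsewhere is rejected here.
function transition(job: Job, lease: Lease, nextStage: string): boolean {
  if (lease.epoch !== job.leaseEpoch || lease.factoryId !== job.leasedBy) return false;
  job.stage = nextStage;
  return true;
}
```

The point of the epoch is that a crashed or slow factory cannot overwrite state after its job has been handed to another factory: its token is simply stale.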

2. Current architecture

tracker Item ──ingest──▶ fleet coordinator ──lease/claim──▶ factory agents (bash runner, AQ_FLEET)
                          │  (platform-service)                    │
                          ├─ Cosmos: jobs/runs/leases/             ├─ heartbeat / renew / release
                          │  factories/events/artifacts/budgets    └─ fenced stage transitions
                          └─ tracker-bridge ──echo──▶ tracker Item
tracker-web /dashboard/fleet ──/api/fleet proxy──▶ fleet REST (24 endpoints)

Module layout (canonical pattern): types.ts → repository.ts → coordinator.ts → scheduler.ts → routes.ts plus tracker-bridge.ts, enrollment.ts, artifacts.ts.
Concurrency core: optimistic rev/_etag updates, leaseEpoch fencing, lease reaper.
Scheduler: pure selectJob / scoreCandidate with capability hard-filter + weighted scoring; Phase-3 resolveWeights (per-product/request) + selectPreemptionVictim.
Feature flags (default OFF): FLEET_PREEMPTION, FLEET_BUDGETS, FLEET_TRACKER_ECHO, FLEET_REQUIRE_FACTORY_TOKEN.
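A rough sketch of the scheduler shape named above (capability hard-filter first, weighted score second). The function names mirror `selectJob` and `scoreCandidate`, but the signatures, the `Candidate`/`Factory`/`Weights` types, and the two scoring terms are assumptions made for illustration; the real module may score on different inputs.

```typescript
// Illustrative only: signatures and scoring terms are assumed, not taken from scheduler.ts.
interface Candidate {
  id: string;
  requires: string[]; // capabilities the job needs
  priority: number;
  ageMs: number; // time spent waiting in the queue
}

interface Factory {
  id: string;
  capabilities: string[];
}

interface Weights {
  priority: number;
  age: number; // weight per minute of queue age
}

// Pure weighted score: higher is better.
function scoreCandidate(job: Candidate, w: Weights): number {
  return w.priority * job.priority + w.age * (job.ageMs / 60_000);
}

// Pure selection: drop jobs the factory cannot run, then take the best score.
function selectJob(jobs: Candidate[], factory: Factory, w: Weights): Candidate | null {
  const caps = new Set(factory.capabilities);
  let best: Candidate | null = null;
  let bestScore = -Infinity;
  for (const job of jobs) {
    // Hard filter: a missing capability disqualifies the job regardless of score.
    if (!job.requires.every((c) => caps.has(c))) continue;
    const score = scoreCandidate(job, w);
    if (score > bestScore) {
      best = job;
      bestScore = score;
    }
  }
  return best;
}
```

Keeping both functions pure (no I/O, no clock reads) is what makes per-product or per-request weight overrides cheap to test: the same queue can be scored under different `Weights` and compared directly.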

3. Verified test/build status (baseline)

| Check | Command | Result |
| --- | --- | --- |
| Fleet module tests | `pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet` | ✅ 134 passing (8 files) |
| Platform-service full | `pnpm --filter @lysnrai/platform-service test` | ✅ 1646 passing |
| tracker-web tests | `pnpm --filter @bytelyst/tracker-web test` | ✅ 198 passing |
| Monorepo build | `pnpm build` | ✅ green |

4. Completed features (verified in code)

Phase 1 — single-host runner (95% per roadmap §0)

✅ Manifest parsing, priority ordering, capability match, engine-class, idempotency dedupe
✅ Profiles + resolution, deps/DAG blocking + cycle detection, warn-only allowed-scope
✅ Crash recovery (recover_orphans), WIP checkpoint/resume, retry w/ backoff, insights
✅ Tracker adapter (from-tracker/to-tracker, idempotent, non-fatal echo)
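For readers unfamiliar with the deps/DAG cycle check listed above, one common way to do it is a depth-first search that tracks the nodes on the current path. The sketch below is a generic illustration in TypeScript; the Phase-1 runner is a bash runner and may implement the check differently, and `DepGraph` and `findCycle` are hypothetical names.

```typescript
// Generic DFS cycle detection over a job dependency graph.
// Keys are job ids; values are the ids each job depends on.
type DepGraph = Record<string, string[]>;

// Returns the first cycle found as a path (first node repeated at the end), or null.
function findCycle(graph: DepGraph): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const stack: string[] = [];

  const visit = (node: string): string[] | null => {
    const s = state.get(node);
    if (s === "done") return null; // already fully explored, no cycle through here
    if (s === "visiting") {
      // Node is on the current path: slice the path from its first occurrence.
      return [...stack.slice(stack.indexOf(node)), node];
    }
    state.set(node, "visiting");
    stack.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    stack.pop();
    state.set(node, "done");
    return null;
  };

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}
```

Returning the offending path rather than a boolean makes the rejection message actionable: the operator can see which dependency edge to remove.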

Phase 2 — coordinator module (roadmap says 80%, code shows ~95%)

✅ fleet module scaffolded; Cosmos containers + repository (memory + Cosmos providers)
✅ Atomic claim (revUpdateJob/_etag) + lease reaper + fencing (leaseEpoch)
✅ Factory-agent API client (lib/fleet-client.sh behind AQ_FLEET)
✅ Scheduler/router core wired — coordinator.claimNextJob calls selectJob (coordinator.ts:502)
✅ Tracker adapter direct call — tracker-bridge.ts ingestItemAsJob/echoJobToItem
✅ Factory enrollment + scoped tokens — enrollment.ts + /fleet/factories/enroll|rotate|revoke
✅ Feature flags + shadow/dual-run; two-factory demo; module test suite
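The atomic claim and lease reaper listed above both rest on a conditional write: an update succeeds only if the document's revision is unchanged since it was read. The sketch below imitates that with an in-memory store. `MemoryJobStore`, `updateIfRev`, `tryClaim`, and `reapExpired` are hypothetical names, and the numeric `rev` field stands in for whatever the repository actually compares (the audit mentions both `rev` and Cosmos `_etag`).

```typescript
// In-memory imitation of a revision-checked (compare-and-swap) job store.
interface JobDoc {
  id: string;
  status: "queued" | "leased";
  rev: number; // stand-in for the revision or etag the real store compares
  leaseExpiresAt: number | null; // epoch millis
}

class MemoryJobStore {
  private docs = new Map<string, JobDoc>();

  put(doc: JobDoc): void {
    this.docs.set(doc.id, { ...doc });
  }

  get(id: string): JobDoc | undefined {
    const doc = this.docs.get(id);
    return doc ? { ...doc } : undefined; // copy, so callers hold a snapshot
  }

  all(): JobDoc[] {
    return [...this.docs.values()].map((d) => ({ ...d }));
  }

  // Conditional write: fails if someone else updated the doc since it was read.
  updateIfRev(next: JobDoc, expectedRev: number): boolean {
    const current = this.docs.get(next.id);
    if (!current || current.rev !== expectedRev) return false;
    this.docs.set(next.id, { ...next, rev: expectedRev + 1 });
    return true;
  }
}

// Claim a queued job; at most one of several racing callers gets true.
function tryClaim(store: MemoryJobStore, id: string, now: number, ttlMs: number): boolean {
  const doc = store.get(id);
  if (!doc || doc.status !== "queued") return false;
  return store.updateIfRev({ ...doc, status: "leased", leaseExpiresAt: now + ttlMs }, doc.rev);
}

// Reaper: return jobs with expired leases to the queue, using the same conditional write.
function reapExpired(store: MemoryJobStore, now: number): string[] {
  const reaped: string[] = [];
  for (const doc of store.all()) {
    const expired = doc.status === "leased" && doc.leaseExpiresAt !== null && doc.leaseExpiresAt <= now;
    if (expired && store.updateIfRev({ ...doc, status: "queued", leaseExpiresAt: null }, doc.rev)) {
      reaped.push(doc.id);
    }
  }
  return reaped;
}
```

Because the reaper goes through the same conditional write as the claim path, a lease renewed between the reaper's read and its write makes the reap fail instead of clobbering the renewal.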

Phase 3 — control plane + DAG + budgets + scoring (roadmap says 0%, code shows ~90%)

✅ Tunable scoring weights (resolveWeights, per-product registry + request override)
✅ Preemption behind FLEET_PREEMPTION (selectPreemptionVictim)
✅ DAG decomposition — POST /fleet/jobs/:id/children, GET /fleet/jobs/:id/dag, parent block/unblock
✅ Budgets — FleetBudgetDoc, GET/PUT/pause/resume, enforcement behind FLEET_BUDGETS
✅ tracker-web fleet UI — overview, jobs table, job detail, budget pages + typed client + proxy
✅ Operator job actions (requeue/reject/cancel) — backend + UI (no lease held; fences worker)
✅ Scoring explainability — GET /fleet/jobs/:id/explain + routing-score UI panel
✅ Cost burndown — per-day series endpoint + chart with ceiling overlay
✅ SSE live log streaming — GET /fleet/jobs/:id/events/stream (resumable) + subscribeJobEvents
✅ Fleet Playwright e2e — e2e/fleet.spec.ts (overview, jobs, job-detail, budget, review gate)
✅ Fleet metrics + alerting — GET /fleet/metrics + overview metrics/alerts panel (§17)
✅ Multi-reviewer routing — review-policy human gate (requestReview/submitReview) + gate UI
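To make the cost burndown item concrete, here is one way a per-day series with a ceiling overlay could be computed. The field names (`day`, `costUsd`, `cumulativeUsd`, `overCeiling`) and the `burndown` function are assumptions for this sketch; the audit does not document the endpoint's actual response shape.

```typescript
// Hypothetical shapes; the real per-day series endpoint may differ.
interface CostEvent {
  day: string; // ISO date, YYYY-MM-DD
  costUsd: number;
}

interface BurndownPoint {
  day: string;
  spentUsd: number; // spend on that day
  cumulativeUsd: number; // running total up to and including that day
  overCeiling: boolean; // true once the running total exceeds the ceiling
}

// Aggregate raw cost events into a chronologically sorted per-day series.
function burndown(events: CostEvent[], ceilingUsd: number): BurndownPoint[] {
  const perDay = new Map<string, number>();
  for (const e of events) {
    perDay.set(e.day, (perDay.get(e.day) ?? 0) + e.costUsd);
  }
  let cumulative = 0;
  // ISO dates sort correctly as plain strings.
  return [...perDay.keys()].sort().map((day) => {
    const spentUsd = perDay.get(day) ?? 0;
    cumulative += spentUsd;
    return { day, spentUsd, cumulativeUsd: cumulative, overCeiling: cumulative > ceilingUsd };
  });
}
```

A chart can plot `cumulativeUsd` against a horizontal line at the ceiling; the first point with `overCeiling` set is the day the budget was breached.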

5. Partial features (started, not complete)

| Feature | What exists | What's missing |
| --- | --- | --- |
| TUI dashboard | legacy TUI against single-host queue | re-point at `/fleet` API for parity (P3, separate repo) |

6. Missing features (not started)

Phase 3: TUI re-point at /fleet (in learning_ai_devops_tools)
Phase 4: message broker (NATS/Redis), autoscaling hooks, capability marketplace, load/chaos suite
Phase 5: outcome feature capture, offline eval harness, A/B weight tuning, recommendations
Phase 1 leftovers: budget.wall wall-clock enforcement; Node dash tag surfacing

7. Broken flows

None found. Build + all test suites green at audit time.

8. Mock / stubbed flows

dashboards/tracker-web/src/app/api/fleet/[...path]/route.ts proxies to PLATFORM_API_URL (default http://localhost:4003). UI degrades gracefully (404 → null/empty) when the fleet module is unreachable — this is intended, not a stub to replace.
No mock data baked into pages; all reads go through the typed fleet-client.ts.
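The graceful-degradation behaviour described above (404 becomes null/empty rather than an error page) might look roughly like this in a typed client. This is a sketch under assumptions: `mapFleetResponse`, `fleetGet`, and `FetchLike` are hypothetical names, and the real fleet-client.ts may handle more status codes or network failures differently.

```typescript
// Minimal fetch-shaped dependency so the sketch does not rely on global typings.
type FetchLike = (url: string) => Promise<{ status: number; json(): Promise<unknown> }>;

// Pure mapping from HTTP status to the value the UI receives.
// 404 is treated as "fleet module not reachable": the UI renders an empty state.
function mapFleetResponse<T>(status: number, body: unknown): T | null {
  if (status === 404) return null;
  if (status < 200 || status >= 300) {
    throw new Error(`fleet API returned ${status}`);
  }
  return body as T; // trust boundary: a real client would validate the shape
}

// Thin async wrapper; path is relative to the /api/fleet proxy route.
async function fleetGet<T>(fetcher: FetchLike, baseUrl: string, path: string): Promise<T | null> {
  const res = await fetcher(`${baseUrl}/api/fleet/${path}`);
  const body = res.status === 404 ? null : await res.json();
  return mapFleetResponse<T>(res.status, body);
}
```

In a browser or Node 18+ the global `fetch` can be passed as the `fetcher` argument; injecting it keeps the status mapping testable without a running platform-service.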

9. Security risks

Factory token enforcement is flag-gated (FLEET_REQUIRE_FACTORY_TOKEN, default OFF), so by default factories are not required to present an enrollment token. Production deployments should turn the flag ON. Documented in .env.example.
Budget enforcement OFF by default — cost-runaway guard (§18) not active until FLEET_BUDGETS=1.
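Since both risks above come from flags that default to OFF, a startup check could surface them. The sketch below reads the four flags named in section 2 and reports the two that this section calls out. `readFlags` and `productionWarnings` are hypothetical helpers, and treating only the literal value `"1"` as ON is an assumption based on the `FLEET_BUDGETS=1` example; the real parser may accept other values.

```typescript
// Flag names are taken from the audit; the parsing rule is an assumption.
const FLEET_FLAGS = [
  "FLEET_PREEMPTION",
  "FLEET_BUDGETS",
  "FLEET_TRACKER_ECHO",
  "FLEET_REQUIRE_FACTORY_TOKEN",
] as const;

type FleetFlag = (typeof FLEET_FLAGS)[number];

// Every flag defaults to OFF; only the literal "1" turns it ON in this sketch.
function readFlags(env: Record<string, string | undefined>): Record<FleetFlag, boolean> {
  const flags = {} as Record<FleetFlag, boolean>;
  for (const name of FLEET_FLAGS) {
    flags[name] = env[name] === "1";
  }
  return flags;
}

// Report the security-relevant flags that are still OFF.
function productionWarnings(flags: Record<FleetFlag, boolean>): string[] {
  const warnings: string[] = [];
  if (!flags.FLEET_REQUIRE_FACTORY_TOKEN) {
    warnings.push("FLEET_REQUIRE_FACTORY_TOKEN is OFF: factory tokens are not enforced");
  }
  if (!flags.FLEET_BUDGETS) {
    warnings.push("FLEET_BUDGETS is OFF: cost-runaway guard is not active");
  }
  return warnings;
}
```

In a Node process this could be called as `productionWarnings(readFlags(process.env))` at boot and logged, or turned into a hard failure in production builds.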

10. Deployment risks

PLATFORM_API_URL must be set for the tracker-web proxy in non-local environments.
Cosmos containers must exist with /productId partition keys before first write.
No live-log transport to blob yet (§17) — operators rely on polling.
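The first two deployment risks can be checked before rollout. The following preflight sketch is illustrative: the container ids are guessed from the architecture diagram in section 2 (the real ids may carry a prefix or differ), and `ContainerInfo`, `REQUIRED_CONTAINERS`, and `preflight` are hypothetical names. How the list of existing containers is obtained (for example, from the Cosmos SDK or portal) is left out.

```typescript
// Description of a container as it currently exists in the database.
interface ContainerInfo {
  id: string;
  partitionKeyPath: string;
}

// Assumed ids, taken from the architecture diagram; verify against the repository code.
const REQUIRED_CONTAINERS = ["jobs", "runs", "leases", "factories", "events", "artifacts", "budgets"];

// Returns a list of human-readable problems; an empty list means the checks passed.
function preflight(
  env: Record<string, string | undefined>,
  isLocal: boolean,
  existing: ContainerInfo[],
): string[] {
  const problems: string[] = [];

  // The tracker-web proxy falls back to localhost, which is only valid locally.
  if (!isLocal && !env["PLATFORM_API_URL"]) {
    problems.push("PLATFORM_API_URL is not set");
  }

  const byId = new Map(existing.map((c) => [c.id, c] as const));
  for (const id of REQUIRED_CONTAINERS) {
    const found = byId.get(id);
    if (!found) {
      problems.push(`container ${id} is missing`);
    } else if (found.partitionKeyPath !== "/productId") {
      problems.push(`container ${id} is partitioned by ${found.partitionKeyPath}, expected /productId`);
    }
  }
  return problems;
}
```

Running a check like this in CI or as a deploy gate catches a wrong partition key before the first write, which matters because a Cosmos container's partition key cannot be changed in place after creation.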

11. Prioritized remaining work

See TASKS_TO_COMPLETE.md. Highest-impact safe slices, in order (the first five are already marked ✅ in section 4):

Operator job actions (requeue/reject/cancel) — completes a Phase-3 §14 box; backend+UI; low risk
Scoring explainability surfaced — data already computed; additive endpoint + UI
Cost burndown — additive UI on existing budget data
SSE live logs — larger; needs streaming route + consumer
Fleet Playwright e2e — required for Phase-3 exit gate
Phase 4/5 — out of MVP scope; track as future

12. Note on roadmap drift

At audit time the roadmap §0 tracker was stale: it showed Phase 3 at 0% and Phase 2 at 80%, while the code implemented nearly all Phase-2 boxes and the core Phase-3 backend + UI. The docs reconciliation (ticking §14 boxes in learning_ai_devops_tools) was tracked as its own task (p2-roadmap-tick).
