audit: document current Gigafactory completion state

- ROADMAP_COMPLETION_AUDIT.md: verified state vs GIGAFACTORY_ROADMAP source of truth
- TASKS_TO_COMPLETE.md: prioritized remaining work with acceptance criteria
- Key finding: roadmap §0 tracker is stale (P2 ~95%, P3 ~70% actual vs 80%/0% claimed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:06:33 -07:00
commit 0f903b935a
parent 4777b28698
2 changed files with 197 additions and 0 deletions
--- a/docs/ROADMAP_COMPLETION_AUDIT.md
+++ b/docs/ROADMAP_COMPLETION_AUDIT.md
@@ -0,0 +1,126 @@
+# Gigafactory — Roadmap Completion Audit
+
+> Source of truth: `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md`
+> Audit date: 2026-05-30 · Auditor: Principal Full-Stack review
+> Scope: `services/platform-service/src/modules/fleet/**` + `dashboards/tracker-web/src/app/dashboard/fleet/**`
+
+## 1. Product understanding
+
+The **Agent Gigafactory** is a distributed system that turns work items (tracker Items, manifests)
+into **jobs** executed in parallel by a fleet of **factories** (agent runners on mac/ubuntu/windows).
+A coordinator (the `fleet` module in `platform-service`) owns durable job state in Cosmos, hands jobs
+to factories via **atomic leases** with **fencing tokens**, recovers crashed work via a **reaper**,
+and echoes status back to the tracker. A browser **control plane** in `tracker-web` lets operators
+watch and steer the fleet.
+
+## 2. Current architecture
+
+```
+tracker Item ──ingest──▶ fleet coordinator ──lease/claim──▶ factory agents (bash runner, AQ_FLEET)
+                          │  (platform-service)                    │
+                          ├─ Cosmos: jobs/runs/leases/             ├─ heartbeat / renew / release
+                          │  factories/events/artifacts/budgets    └─ fenced stage transitions
+                          └─ tracker-bridge ──echo──▶ tracker Item
+tracker-web /dashboard/fleet ──/api/fleet proxy──▶ fleet REST (24 endpoints)
+```
+
+- **Module layout** (canonical pattern): `types.ts → repository.ts → coordinator.ts → scheduler.ts → routes.ts`
+  plus `tracker-bridge.ts`, `enrollment.ts`, `artifacts.ts`.
+- **Concurrency core:** optimistic `rev`/`_etag` updates, `leaseEpoch` fencing, lease reaper.
+- **Scheduler:** pure `selectJob` / `scoreCandidate` with capability hard-filter + weighted scoring;
+  Phase-3 `resolveWeights` (per-product/request) + `selectPreemptionVictim`.
+- **Feature flags (default OFF):** `FLEET_PREEMPTION`, `FLEET_BUDGETS`, `FLEET_TRACKER_ECHO`,
+  `FLEET_REQUIRE_FACTORY_TOKEN`.
+
+## 3. Verified test/build status (baseline)
+
+| Check                 | Command                                                                     | Result                   |
+| --------------------- | --------------------------------------------------------------------------- | ------------------------ |
+| Fleet module tests    | `pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet` | ✅ 134 passing (8 files) |
+| Platform-service full | `pnpm --filter @lysnrai/platform-service test`                              | ✅ 1646 passing          |
+| tracker-web tests     | `pnpm --filter @bytelyst/tracker-web test`                                  | ✅ 198 passing           |
+| Monorepo build        | `pnpm build`                                                                | ✅ green                 |
+
+## 4. Completed features (verified in code)
+
+### Phase 1 — single-host runner (95% per roadmap §0)
+
+- ✅ Manifest parsing, priority ordering, capability match, engine-class, idempotency dedupe
+- ✅ Profiles + resolution, deps/DAG blocking + cycle detection, warn-only allowed-scope
+- ✅ Crash recovery (`recover_orphans`), WIP checkpoint/resume, retry w/ backoff, insights
+- ✅ Tracker adapter (`from-tracker`/`to-tracker`, idempotent, non-fatal echo)
+
+### Phase 2 — coordinator module (roadmap says 80%, code shows ~95%)
+
+- ✅ `fleet` module scaffolded; Cosmos containers + repository (memory + Cosmos providers)
+- ✅ Atomic claim (`revUpdateJob`/`_etag`) + lease reaper + fencing (`leaseEpoch`)
+- ✅ Factory-agent API client (`lib/fleet-client.sh` behind `AQ_FLEET`)
+- ✅ **Scheduler/router core wired** — `coordinator.claimNextJob` calls `selectJob` (coordinator.ts:502)
+- ✅ **Tracker adapter direct call** — `tracker-bridge.ts` `ingestItemAsJob`/`echoJobToItem`
+- ✅ **Factory enrollment + scoped tokens** — `enrollment.ts` + `/fleet/factories/enroll|rotate|revoke`
+- ✅ Feature flags + shadow/dual-run; two-factory demo; module test suite
+
+### Phase 3 — control plane + DAG + budgets + scoring (roadmap says 0%, code shows ~70%)
+
+- ✅ Tunable scoring weights (`resolveWeights`, per-product registry + request override)
+- ✅ Preemption behind `FLEET_PREEMPTION` (`selectPreemptionVictim`)
+- ✅ DAG decomposition — `POST /fleet/jobs/:id/children`, `GET /fleet/jobs/:id/dag`, parent block/unblock
+- ✅ Budgets — `FleetBudgetDoc`, GET/PUT/pause/resume, enforcement behind `FLEET_BUDGETS`
+- ✅ tracker-web fleet UI — overview, jobs table, job detail, budget pages + typed client + proxy
+
+## 5. Partial features (started, not complete)
+
+| Feature                | What exists                            | What's missing                                                 |
+| ---------------------- | -------------------------------------- | -------------------------------------------------------------- |
+| Job actions            | SHIP (PATCH stage=shipped) in UI       | **requeue / reject / cancel** operator actions (no lease held) |
+| Scoring explainability | `ScoreBreakdown` computed in scheduler | not surfaced via API or UI                                     |
+| Cost burndown          | budget spend bar                       | no per-day/per-job burndown chart + overlays                   |
+| Live logs              | polling on jobs/detail pages           | **SSE** single-stream contract (§17) absent                    |
+
+## 6. Missing features (not started)
+
+- **Phase 3:** SSE live log streaming, multi-reviewer routing, TUI re-point at `/fleet`,
+  fleet metrics + alerting, Playwright fleet e2e, explainability UI
+- **Phase 4:** message broker (NATS/Redis), autoscaling hooks, capability marketplace, load/chaos suite
+- **Phase 5:** outcome feature capture, offline eval harness, A/B weight tuning, recommendations
+- **Phase 1 leftovers:** `budget.wall` wall-clock enforcement; Node `dash` tag surfacing
+
+## 7. Broken flows
+
+None found. Build + all test suites green at audit time.
+
+## 8. Mock / stubbed flows
+
+- `dashboards/tracker-web/src/app/api/fleet/[...path]/route.ts` proxies to `PLATFORM_API_URL`
+  (default `http://localhost:4003`). UI **degrades gracefully** (404 → null/empty) when the
+  fleet module is unreachable — this is intended, not a stub to replace.
+- No mock data baked into pages; all reads go through the typed `fleet-client.ts`.
+
+## 9. Security risks
+
+- Factory token enforcement is **flag-gated** (`FLEET_REQUIRE_FACTORY_TOKEN`, default OFF). In
+  production, this flag should be ON so that token enforcement is active. Documented in `.env.example`.
+- Budget enforcement OFF by default — cost-runaway guard (§18) not active until `FLEET_BUDGETS=1`.
+
+## 10. Deployment risks
+
+- `PLATFORM_API_URL` must be set for the tracker-web proxy in non-local environments.
+- Cosmos containers must exist with `/productId` partition keys before first write.
+- No live-log transport to blob yet (§17) — operators rely on polling.
+
+## 11. Prioritized remaining work
+
+See `TASKS_TO_COMPLETE.md`. Highest-impact safe slices, in order:
+
+1. **Operator job actions (requeue/reject/cancel)** — completes a Phase-3 §14 box; backend+UI; low risk
+2. **Scoring explainability surfaced** — data already computed; additive endpoint + UI
+3. **Cost burndown** — additive UI on existing budget data
+4. **SSE live logs** — larger; needs streaming route + consumer
+5. **Fleet Playwright e2e** — required for Phase-3 exit gate
+6. Phase 4/5 — out of MVP scope; track as future
+
+## 12. Note on roadmap drift
+
+The roadmap §0 tracker is **stale**: it shows Phase 3 at 0% and Phase 2 at 80%, but the code
+implements nearly all Phase-2 boxes and the core Phase-3 backend + UI. A docs reconciliation
+(ticking §14 boxes in `learning_ai_devops_tools`) is itself a tracked task (`p2-roadmap-tick`).
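
The scheduler summarized in §2 of the audit (a pure `selectJob` / `scoreCandidate` pair with a capability hard-filter, weighted scoring, and `resolveWeights` layering a request override over a per-product registry) might be sketched as follows. Only those three function names and `ScoreBreakdown` come from the audit; every type shape, field name, the `isCapable` helper, the two scoring factors, and the default weights are assumptions made for illustration, since `scheduler.ts` itself is not part of this commit.

```typescript
// Hypothetical shapes: the real types.ts is not shown in this commit.
interface Job {
  id: string;
  priority: number;
  requiredCaps: string[];
  enqueuedAt: number; // epoch milliseconds
}
interface Factory {
  id: string;
  caps: string[];
}
interface Weights {
  priority: number;
  age: number;
}
interface ScoreBreakdown {
  priority: number;
  age: number;
  total: number;
}
type Selection = { job: Job; breakdown: ScoreBreakdown };

// Assumed defaults, chosen only so the example has numbers to work with.
const DEFAULT_WEIGHTS: Weights = { priority: 1, age: 0.001 };

// Layering: defaults, then the per-product registry entry, then the request override.
function resolveWeights(
  registry: Record<string, Partial<Weights>>,
  productId: string,
  override: Partial<Weights> = {},
): Weights {
  return { ...DEFAULT_WEIGHTS, ...(registry[productId] ?? {}), ...override };
}

// Hard filter: the factory must offer every capability the job requires.
function isCapable(job: Job, factory: Factory): boolean {
  return job.requiredCaps.every((cap) => factory.caps.includes(cap));
}

// Weighted score with a per-factor breakdown kept for explainability.
function scoreCandidate(job: Job, weights: Weights, now: number): ScoreBreakdown {
  const priority = weights.priority * job.priority;
  const age = weights.age * Math.max(0, now - job.enqueuedAt);
  return { priority, age, total: priority + age };
}

// Pure selection: no I/O, the clock is passed in by the caller.
function selectJob(
  jobs: Job[],
  factory: Factory,
  weights: Weights,
  now: number,
): Selection | null {
  let best: Selection | null = null;
  for (const job of jobs) {
    if (!isCapable(job, factory)) continue;
    const breakdown = scoreCandidate(job, weights, now);
    if (best === null || breakdown.total > best.breakdown.total) {
      best = { job, breakdown };
    }
  }
  return best;
}
```

Keeping these functions free of I/O, with `now` passed in, is what makes them straightforward to unit test. Returning the breakdown next to the chosen job would also give the planned explainability endpoint something to surface instead of discarding it after selection.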
--- a/docs/TASKS_TO_COMPLETE.md
+++ b/docs/TASKS_TO_COMPLETE.md
@@ -0,0 +1,71 @@
+# Gigafactory — Tasks to Complete
+
+> Companion to `ROADMAP_COMPLETION_AUDIT.md`. Ordered by priority. Update checkboxes as work lands.
+
+---
+
+- [ ] **Operator job actions — requeue / reject / cancel**
+  - Priority: P0 (highest-impact safe slice; completes Phase-3 §14 "approve/ship/reject/requeue")
+  - Current status: SHIP exists; requeue/reject/cancel missing
+  - Files involved:
+    - `services/platform-service/src/modules/fleet/coordinator.ts` (new `operatorAction`)
+    - `services/platform-service/src/modules/fleet/routes.ts` (new `POST /fleet/jobs/:id/actions/:action`)
+    - `services/platform-service/src/modules/fleet/coordinator.test.ts` (tests)
+    - `dashboards/tracker-web/src/lib/fleet-client.ts` (client fn)
+    - `dashboards/tracker-web/src/app/dashboard/fleet/jobs/[id]/page.tsx` (buttons)
+  - Implementation plan: operator action does NOT require a held lease; it bumps `leaseEpoch`
+    to fence any current holder (mirrors the reaper), preserves checkpoint, sets stage
+    (requeue→queued/blocked, reject→dead_letter, cancel→failed), appends an event.
+  - Acceptance criteria: requeue a building job → stage queued, epoch+1, zombie report fenced (409);
+    reject → dead_letter; cancel → failed; unknown action → 400; flag-independent; all prior tests green.
+  - Verification command: `pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet`
+
+- [ ] **Scoring explainability surfaced in UI**
+  - Priority: P1 (data already computed; Phase-3 §14)
+  - Current status: `ScoreBreakdown` computed in `scheduler.ts`, discarded after selection
+  - Files involved: `scheduler.ts`, `coordinator.ts`, `routes.ts`, `fleet-client.ts`, fleet job detail page
+  - Implementation plan: add `GET /fleet/jobs/:id/explain` returning the would-be score breakdown
+    against current factories; render a "why this routes here" panel.
+  - Acceptance criteria: endpoint returns per-factor contributions; UI shows them and degrades gracefully when they are absent.
+  - Verification command: `pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet`
+
+- [ ] **Cost burndown chart**
+  - Priority: P1
+  - Current status: budget page shows a spend bar only
+  - Files involved: `dashboards/tracker-web/src/app/dashboard/fleet/budget/page.tsx`, new client fn
+  - Implementation plan: aggregate run cost by day from events/runs; render burndown vs ceiling overlay.
+  - Acceptance criteria: per-day spend visible with ceiling line; empty state when no data.
+  - Verification command: `pnpm --filter @bytelyst/tracker-web test`
+
+- [ ] **SSE live log streaming**
+  - Priority: P2 (larger; §17 single-stream contract)
+  - Current status: polling only
+  - Files involved: new streaming route in platform-service; EventSource consumer in job detail page
+  - Implementation plan: `GET /fleet/jobs/:id/events/stream` (SSE) emitting appended events;
+    UI subscribes via EventSource with polling fallback.
+  - Acceptance criteria: new events appear without refresh; reconnect + fallback work.
+  - Verification command: `pnpm --filter @lysnrai/platform-service test`
+
+- [ ] **Fleet Playwright e2e**
+  - Priority: P2 (Phase-3 exit gate)
+  - Current status: none for fleet pages
+  - Files involved: `dashboards/tracker-web/e2e/fleet.spec.ts`
+  - Implementation plan: cover fleet map render, jobs table, job detail action, budget pause/resume
+    against a mocked fleet API.
+  - Acceptance criteria: e2e suite passes under the CI configuration.
+  - Verification command: `pnpm --filter @bytelyst/tracker-web exec playwright test fleet`
+
+- [ ] **Phase-1 `budget.wall` enforcement** — P3 — `agent-queue.sh` — wall-clock ceiling extending timeout.
+- [ ] **Node `dash` tag surfacing** — P3 — `dashboard.mjs` — profile/priority/caps/tracker-item link.
+- [ ] **Roadmap §14 reconciliation** — P3 — tick Phase-2/3 boxes in `learning_ai_devops_tools`.
+- [ ] **Fleet metrics + alerting** — P3 — queue depth, assign latency, utilization, reclaim counts (§17).
+- [ ] **Multi-reviewer routing** — P3 — Phase-3 §14.
+- [ ] **TUI re-point at `/fleet`** — P3 — Phase-3 §14.
+
+### Phase 4 / 5 (post-MVP, tracked only)
+
+- [ ] Message broker (NATS/Redis) push dispatch + backpressure
+- [ ] Autoscaling hooks (ephemeral factories)
+- [ ] Capability marketplace + cross-product fairness
+- [ ] Load + chaos suite
+- [ ] Outcome feature capture · offline eval harness · A/B weight tuning · recommendations
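
The operator-action plan in `TASKS_TO_COMPLETE.md` (no held lease required, bump `leaseEpoch` to fence the current holder, preserve the checkpoint, set the stage, append an event) can be illustrated with an in-memory sketch. `operatorAction` and `leaseEpoch` are named in the plan; the `JobDoc` shape, `HttpError`, `reportStage`, the event string format, and the rule that a requeue lands in `blocked` when dependencies are still open are hypothetical. A real implementation would presumably persist through the repository with `rev`/`_etag` checks rather than return copies.

```typescript
type Stage = "queued" | "blocked" | "building" | "shipped" | "dead_letter" | "failed";
type OperatorAction = "requeue" | "reject" | "cancel";

// Hypothetical job document; the real shape lives in types.ts.
interface JobDoc {
  id: string;
  stage: Stage;
  leaseEpoch: number;
  checkpoint?: string;
  events: string[];
}

// Minimal error type carrying an HTTP status for the route layer to map.
class HttpError extends Error {
  readonly status: number;
  constructor(status: number, message: string) {
    super(message);
    this.status = status;
  }
}

// Stage mapping taken from the plan: requeue, reject, cancel.
const TARGET_STAGE: Record<OperatorAction, Stage> = {
  requeue: "queued",
  reject: "dead_letter",
  cancel: "failed",
};

function isOperatorAction(value: string): value is OperatorAction {
  return value === "requeue" || value === "reject" || value === "cancel";
}

// No lease is needed: bumping leaseEpoch invalidates whatever lease exists.
// Assumption: a requeued job with open dependencies goes to "blocked".
function operatorAction(job: JobDoc, action: string, hasOpenDeps = false): JobDoc {
  if (!isOperatorAction(action)) {
    throw new HttpError(400, `unknown action: ${action}`);
  }
  const stage: Stage = action === "requeue" && hasOpenDeps ? "blocked" : TARGET_STAGE[action];
  return {
    ...job, // the spread keeps the checkpoint untouched
    stage,
    leaseEpoch: job.leaseEpoch + 1,
    events: [...job.events, `operator:${action}`],
  };
}

// A factory reporting with an epoch older than the job's is a zombie: reject with 409.
function reportStage(job: JobDoc, heldEpoch: number, stage: Stage): JobDoc {
  if (heldEpoch !== job.leaseEpoch) {
    throw new HttpError(409, "stale lease epoch");
  }
  return { ...job, stage };
}
```

With these definitions, requeueing a `building` job at epoch 3 yields stage `queued` at epoch 4, and a later `reportStage` call that still carries epoch 3 fails with status 409, which matches the acceptance criteria listed for the task.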