learning_ai_common_plat/docs/gigafactory-phase3-progress.md
Saravanakumar D 325dfcae8e docs(fleet): Phase 3 operational guide + progress report (Slice 5)
- Created docs/FLEET_CONTROL_PLANE.md — full operational guide covering:
  - Feature flags (FLEET_PREEMPTION, FLEET_BUDGETS)
  - Tunable scoring weights + resolution order
  - Preemption rules and behavior
  - DAG job decomposition API
  - Per-product budgets with auto-pause
  - Fleet Control Plane UI pages and configuration
  - API reference summary
  - Architecture decisions
- Updated docs/gigafactory-phase3-progress.md — all 5 slices DONE with commit SHAs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:24 -07:00

# Gigafactory Phase 3 — Progress
| Slice | Name | Status | Commit | Verify Gate |
| ----- | ------------------------------------ | ------ | -------- | ----------------------------------------------- |
| 1 | Tunable scoring weights + preemption | DONE | 4a209e23 | 119 fleet tests ✅, full build ✅, pnpm test ✅ |
| 2 | DAG job decomposition | DONE | 26606c85 | 127 fleet tests ✅, full build ✅, pnpm test ✅ |
| 3 | Per-product budgets | DONE | fd1b18d7 | 134 fleet tests ✅, full build ✅, pnpm test ✅ |
| 4 | tracker-web Fleet Control Plane UI | DONE | 39ade652 | 198 tracker-web tests ✅, full build ✅ |
| 5 | Docs + roadmap | DONE | (this) | — |
## Slice 1 — Tunable scoring weights + preemption
**Key files:**
- `services/platform-service/src/modules/fleet/scheduler.ts` — added `resolveWeights()`, `selectPreemptionVictim()`, `FleetWeightRegistry`, `RunningJobView`
- `services/platform-service/src/modules/fleet/coordinator.ts` — added `isPreemptionEnabled()`, `setWeightRegistry()`, seat-limit enforcement, preemption wiring

**Flags:** `FLEET_PREEMPTION` (default OFF = byte-for-byte Phase 2 behavior)

**Tests added:** 18 (14 scheduler pure + 4 coordinator integration)
- Weight resolution: defaults, partial override, per-request precedence, backward compat
- Preemption pure: critical evicts lower, never evicts equal/higher, picks lowest victim, capability checks
- Preemption integration: flag OFF no eviction, flag ON eviction + checkpoint preserved + zombie fenced + event

**Verify gate:** `pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet` → 119/119 ✅; `pnpm build && pnpm test` → all green
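
The weight-resolution and preemption rules listed in the tests above can be sketched roughly as follows. This is a minimal illustration, not the code in `scheduler.ts`: the function names come from this report, but the signatures, the weight fields (`priority`, `age`, `cost`), the priority levels, and the `seatCapabilities` field are assumptions made for the example. The resolution order shown (defaults, then registry entry, then per-request override) is likewise assumed.

```typescript
// Illustrative sketch only; the real types in scheduler.ts may differ.
type Priority = "low" | "normal" | "high" | "critical"; // assumed levels
const RANK: Record<Priority, number> = { low: 0, normal: 1, high: 2, critical: 3 };

// Hypothetical weight fields.
interface ScoringWeights {
  priority: number;
  age: number;
  cost: number;
}

const DEFAULT_WEIGHTS: ScoringWeights = { priority: 1, age: 1, cost: 1 };

// Assumed precedence: defaults < registry entry < per-request override.
// Partial inputs only replace the fields they actually set.
function resolveWeights(
  registryEntry: Partial<ScoringWeights> = {},
  requestOverride: Partial<ScoringWeights> = {},
): ScoringWeights {
  return { ...DEFAULT_WEIGHTS, ...registryEntry, ...requestOverride };
}

// Hypothetical view of a running job and of the job that wants its seat.
interface RunningJobView {
  jobId: string;
  priority: Priority;
  seatCapabilities: string[];
}

interface IncomingJob {
  priority: Priority;
  requiredCapabilities: string[];
}

// Pure victim selection: only a critical job may evict, only strictly
// lower-priority jobs are candidates, the freed seat must satisfy the
// incoming job's capabilities, and the lowest-priority candidate wins.
function selectPreemptionVictim(
  incoming: IncomingJob,
  running: RunningJobView[],
): RunningJobView | null {
  if (incoming.priority !== "critical") return null;
  let victim: RunningJobView | null = null;
  for (const job of running) {
    const strictlyLower = RANK[job.priority] < RANK[incoming.priority];
    const seatFits = incoming.requiredCapabilities.every((cap) =>
      job.seatCapabilities.includes(cap),
    );
    if (!strictlyLower || !seatFits) continue;
    if (victim === null || RANK[job.priority] < RANK[victim.priority]) victim = job;
  }
  return victim;
}
```

Keeping both functions pure (no I/O, no flag checks) matches the "scheduler pure" test split described above; in this sketch the `FLEET_PREEMPTION` check would stay in the caller.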
## Slice 2 — DAG job decomposition
**Key files:**
- `services/platform-service/src/modules/fleet/types.ts` — `SubmitChildrenSchema`, added `children[]` to `SubmitJobSchema`
- `services/platform-service/src/modules/fleet/repository.ts` — `listChildrenByParent()`
- `services/platform-service/src/modules/fleet/coordinator.ts` — `maybeUnblockParent()`, `submitChildren()`, `getDagSubtree()`
- `services/platform-service/src/modules/fleet/routes.ts` — POST /fleet/jobs/:id/children, GET /fleet/jobs/:id/dag

**Design:** Children's idempotency keys added to parent's `deps[]`. Existing `unmetDeps()`/`stageForDeps()` logic handles blocking/unblocking. Atomic fan-out via `submitJob()` with `children[]` array.

**Tests added:** 8 (DAG fan-out submit, child unblock parent, subtree retrieval)

**Verify gate:** 127/127 fleet tests ✅; full build + test green
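
The design above (child idempotency keys appended to the parent's `deps[]`, with the existing dependency logic doing the blocking) can be illustrated with a small in-memory sketch. The function names mirror those in this report, but the signatures, the `Job` shape, the stage names (`BLOCKED`, `QUEUED`, `DONE`), the `parentKey` field, and the `completeJob()` helper are assumptions for illustration only; the real coordinator persists jobs through the repository.

```typescript
// Illustrative in-memory model; stage names and fields are hypothetical.
type Stage = "BLOCKED" | "QUEUED" | "DONE";

interface Job {
  key: string; // idempotency key
  stage: Stage;
  deps: string[]; // idempotency keys this job waits on
  parentKey?: string;
}

const jobs = new Map<string, Job>();

// A dependency is unmet until the job it names has reached DONE.
function unmetDeps(job: Job): string[] {
  return job.deps.filter((dep) => jobs.get(dep)?.stage !== "DONE");
}

function stageForDeps(job: Job): Stage {
  return unmetDeps(job).length > 0 ? "BLOCKED" : "QUEUED";
}

// Fan-out in one call: the parent is stored with its children's keys as deps,
// so it starts BLOCKED whenever it has at least one unfinished child.
function submitJob(key: string, childKeys: string[] = []): Job {
  const parent: Job = { key, stage: "QUEUED", deps: [...childKeys] };
  jobs.set(key, parent);
  for (const childKey of childKeys) {
    jobs.set(childKey, { key: childKey, stage: "QUEUED", deps: [], parentKey: key });
  }
  parent.stage = stageForDeps(parent);
  return parent;
}

// Re-evaluate a blocked parent; it is released only when no deps remain unmet.
function maybeUnblockParent(parentKey: string): void {
  const parent = jobs.get(parentKey);
  if (parent && parent.stage === "BLOCKED") parent.stage = stageForDeps(parent);
}

// Hypothetical completion hook that triggers the parent re-check.
function completeJob(key: string): void {
  const job = jobs.get(key);
  if (!job) throw new Error(`unknown job ${key}`);
  job.stage = "DONE";
  if (job.parentKey) maybeUnblockParent(job.parentKey);
}
```

Because the parent's blocking state is derived entirely from `deps[]`, no separate DAG bookkeeping is needed in this model: finishing the last child is what moves the parent back to a claimable stage.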
## Slice 3 — Per-product budgets
**Key files:**
- `services/platform-service/src/modules/fleet/types.ts` — `FleetBudgetDoc`, `UpsertBudgetSchema`
- `services/platform-service/src/modules/fleet/repository.ts` — budget CRUD (getBudget, upsertBudget, updateBudget)
- `services/platform-service/src/modules/fleet/coordinator.ts` — `isBudgetsEnabled()`, budget enforcement in `claimNextJob`, `accrueSpend()` with auto-pause
- `services/platform-service/src/modules/fleet/routes.ts` — GET/PUT /fleet/budgets/:productId, POST pause/resume

**Flags:** `FLEET_BUDGETS` (default OFF)

**Design:** Budget checked FIRST in claim loop — if paused or ceiling exceeded, immediately return null (no job scan). `accrueSpend()` auto-pauses when ceiling reached.

**Tests added:** 7

**Verify gate:** 134/134 fleet tests ✅; full build + test green
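
A rough sketch of the budget gate described above is shown below. It assumes a simplified budget shape (`ceiling`, `spent`, `paused`) and treats "ceiling reached" as `spent >= ceiling`; the real `FleetBudgetDoc` fields and the signatures of `claimNextJob` and `accrueSpend()` are not given in this report, so everything beyond the function names is hypothetical.

```typescript
// Hypothetical, simplified budget document.
interface Budget {
  productId: string;
  ceiling: number;
  spent: number;
  paused: boolean;
}

// The budget is checked before any job scan: a paused or exhausted budget
// returns null immediately and leaves the queue untouched.
function claimNextJob(
  budgetsEnabled: boolean, // stands in for the FLEET_BUDGETS flag check
  budget: Budget | null,
  queue: string[],
): string | null {
  if (budgetsEnabled && budget !== null) {
    if (budget.paused || budget.spent >= budget.ceiling) return null;
  }
  return queue.shift() ?? null;
}

// Adds spend and auto-pauses once the ceiling is reached (assumed: >=).
// Returns a new object rather than mutating the input.
function accrueSpend(budget: Budget, amount: number): Budget {
  const spent = budget.spent + amount;
  return { ...budget, spent, paused: budget.paused || spent >= budget.ceiling };
}
```

With the flag off, the gate is skipped entirely in this sketch, which is how it keeps the default-OFF path unchanged.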
## Slice 4 — tracker-web Fleet Control Plane UI
**Key files:**
- `dashboards/tracker-web/src/lib/fleet-client.ts` — Typed API client with graceful 404 → null degradation
- `dashboards/tracker-web/src/app/api/fleet/[...path]/route.ts` — Proxy route to platform-service
- `dashboards/tracker-web/src/app/dashboard/fleet/page.tsx` — Fleet overview (factory cards + recent jobs)
- `dashboards/tracker-web/src/app/dashboard/fleet/jobs/page.tsx` — Job table with stage filter tabs
- `dashboards/tracker-web/src/app/dashboard/fleet/jobs/[id]/page.tsx` — Job detail (events timeline, runs, artifacts, DAG, SHIP action)
- `dashboards/tracker-web/src/app/dashboard/fleet/budget/page.tsx` — Budget panel (ceiling/spent bar, pause/resume)
- `dashboards/tracker-web/src/app/dashboard/layout.tsx` — Added "Fleet" nav item

**UI degrades gracefully:** If platform-service fleet module returns 404 (feature flags off), pages show informational empty states.

**Tests added:** 16 (fleet-client unit tests, 198 total tracker-web)

**Verify gate:** 198 tracker-web tests ✅; full build green
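
The 404 → null degradation in the client could look something like the sketch below. It is not the code in `fleet-client.ts`: the helper name `getOrNull`, the `FetchLike` and `ResponseLike` types, and the injected fetcher are assumptions chosen so the example stays self-contained and easy to test.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
interface ResponseLike {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}

type FetchLike = (url: string) => Promise<ResponseLike>;

// Hypothetical helper: a 404 is treated as "feature not available" and mapped
// to null so pages can render an empty state; other failures still throw.
async function getOrNull<T>(fetcher: FetchLike, url: string): Promise<T | null> {
  const res = await fetcher(url);
  if (res.status === 404) return null;
  if (!res.ok) throw new Error(`Fleet API request failed with status ${res.status}`);
  // The cast trusts the server; a real client might validate the payload.
  return (await res.json()) as T;
}
```

A page would then branch on the result, showing the informational empty state when the helper returns `null` and the data view otherwise. Mapping only 404 to `null` keeps genuine server errors visible instead of hiding them behind an empty state.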
## Slice 5 — Docs + roadmap
See `docs/FLEET_CONTROL_PLANE.md` for the operational guide.
## Follow-ups
- Weight registry could be loaded from Cosmos (per-product config doc) in a later phase
- Seat limit enforcement is tied to FLEET_PREEMPTION flag; could be decoupled later
- E2E Playwright tests for fleet UI (pending Playwright setup in CI)
- Budget history/audit log endpoint
- Real-time WebSocket updates for job stage transitions in the UI