- Created docs/FLEET_CONTROL_PLANE.md — full operational guide covering:
  - Feature flags (FLEET_PREEMPTION, FLEET_BUDGETS)
  - Tunable scoring weights + resolution order
  - Preemption rules and behavior
  - DAG job decomposition API
  - Per-product budgets with auto-pause
  - Fleet Control Plane UI pages and configuration
  - API reference summary
  - Architecture decisions
- Updated docs/gigafactory-phase3-progress.md — all 5 slices DONE with commit SHAs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

# Gigafactory Phase 3 — Progress

| Slice | Name                                 | Status | Commit   | Verify Gate                                     |
| ----- | ------------------------------------ | ------ | -------- | ----------------------------------------------- |
| 1     | Tunable scoring weights + preemption | DONE   | 4a209e23 | 119 fleet tests ✅, full build ✅, pnpm test ✅ |
| 2     | DAG job decomposition                | DONE   | 26606c85 | 127 fleet tests ✅, full build ✅, pnpm test ✅ |
| 3     | Per-product budgets                  | DONE   | fd1b18d7 | 134 fleet tests ✅, full build ✅, pnpm test ✅ |
| 4     | tracker-web Fleet Control Plane UI   | DONE   | 39ade652 | 198 tracker-web tests ✅, full build ✅         |
| 5     | Docs + roadmap                       | DONE   | (this)   | —                                               |

## Slice 1 — Tunable scoring weights + preemption

**Key files:**

- `services/platform-service/src/modules/fleet/scheduler.ts` — added `resolveWeights()`, `selectPreemptionVictim()`, `FleetWeightRegistry`, `RunningJobView`
- `services/platform-service/src/modules/fleet/coordinator.ts` — added `isPreemptionEnabled()`, `setWeightRegistry()`, seat-limit enforcement, preemption wiring

**Flags:** `FLEET_PREEMPTION` (default OFF; with the flag off, behavior is byte-for-byte identical to Phase 2)

**Tests added:** 18 (14 pure scheduler tests + 4 coordinator integration tests)

- Weight resolution: defaults, partial override, per-request precedence, backward compatibility
- Preemption (pure): a critical job evicts a lower-priority one, never evicts equal or higher priority, picks the lowest-priority victim, and respects capability checks
- Preemption (integration): with the flag OFF there is no eviction; with the flag ON the victim is evicted, its checkpoint is preserved, the zombie is fenced, and an event is emitted

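
The weight-resolution cases listed above (defaults, partial override, per-request precedence) can be sketched as a layered merge. This is a minimal illustration, not the code in `scheduler.ts`: the weight names (`priority`, `age`, `affinity`), the default values, the registry shape, and the signature of `resolveWeights` shown here are all assumptions, as is the layering order (built-in defaults, then registry entry, then per-request overrides).

```typescript
// Hypothetical weight shape; the real field names in scheduler.ts may differ.
interface ScoringWeights {
  priority: number;
  age: number;
  affinity: number;
}

// Assumed built-in defaults, used when nothing overrides them.
const DEFAULT_WEIGHTS: ScoringWeights = { priority: 1, age: 0.5, affinity: 0.25 };

// Assumed registry shape: partial overrides keyed by product ID.
type WeightRegistry = Record<string, Partial<ScoringWeights>>;

// Later layers win: defaults < registry entry < per-request overrides.
function resolveWeights(
  productId: string,
  registry: WeightRegistry = {},
  requestOverrides: Partial<ScoringWeights> = {},
): ScoringWeights {
  return {
    ...DEFAULT_WEIGHTS,
    ...(registry[productId] ?? {}),
    ...requestOverrides,
  };
}

const registry: WeightRegistry = { "product-a": { age: 2 } };

// No registry entry for product-b: pure defaults.
const noOverride = resolveWeights("product-b", registry);
// Registry overrides only `age`; other fields keep their defaults.
const partial = resolveWeights("product-a", registry);
// Per-request values take precedence over the registry entry.
const perRequest = resolveWeights("product-a", registry, { age: 3, affinity: 1 });

console.log(noOverride, partial, perRequest);
```

Keeping the merge pure (no I/O, no globals beyond the defaults) is what makes this kind of logic easy to cover with plain unit tests.
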
**Verify gate:** `pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet` → 119/119 ✅; `pnpm build && pnpm test` → all green

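
The preemption rules covered by the Slice 1 tests (a critical job evicts a lower-priority one, never an equal or higher one, and the lowest-priority eligible job is chosen) can be sketched as a pure function. Everything below is an assumption made for illustration: the priority ladder, the fields on `RunningJob` (the source names a `RunningJobView` type but does not show it), the rule that only critical jobs preempt, and the reading of "capability checks" as "the victim's worker must offer what the incoming job requires".

```typescript
// Hypothetical priority ladder; the real values may differ.
type Priority = "low" | "normal" | "high" | "critical";
const RANK: Record<Priority, number> = { low: 0, normal: 1, high: 2, critical: 3 };

// Assumed minimal view of a running job.
interface RunningJob {
  jobId: string;
  priority: Priority;
  workerCapabilities: string[]; // capabilities of the worker the job occupies
}

interface IncomingJob {
  priority: Priority;
  requiredCapabilities: string[];
}

function selectVictim(incoming: IncomingJob, running: RunningJob[]): RunningJob | null {
  // Assumption: only critical jobs are allowed to preempt.
  if (incoming.priority !== "critical") return null;

  // Eligible victims are strictly lower priority and sit on a capable worker.
  const candidates = running.filter(
    (job) =>
      RANK[job.priority] < RANK[incoming.priority] &&
      incoming.requiredCapabilities.every((cap) => job.workerCapabilities.includes(cap)),
  );
  if (candidates.length === 0) return null;

  // Pick the lowest-priority candidate; the first one found wins ties.
  return candidates.reduce((lowest, job) =>
    RANK[job.priority] < RANK[lowest.priority] ? job : lowest,
  );
}

const running: RunningJob[] = [
  { jobId: "a", priority: "high", workerCapabilities: ["gpu"] },
  { jobId: "b", priority: "low", workerCapabilities: ["gpu"] },
  { jobId: "c", priority: "low", workerCapabilities: [] },
  { jobId: "d", priority: "critical", workerCapabilities: ["gpu"] },
];

const victim = selectVictim({ priority: "critical", requiredCapabilities: ["gpu"] }, running);
const noVictimForHigh = selectVictim({ priority: "high", requiredCapabilities: [] }, running);
console.log(victim?.jobId, noVictimForHigh);
```

In this sketch job `c` is skipped because its worker lacks the required capability, and job `d` is skipped because it has equal priority.
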
## Slice 2 — DAG job decomposition

**Key files:**

- `services/platform-service/src/modules/fleet/types.ts` — `SubmitChildrenSchema`, added `children[]` to `SubmitJobSchema`
- `services/platform-service/src/modules/fleet/repository.ts` — `listChildrenByParent()`
- `services/platform-service/src/modules/fleet/coordinator.ts` — `maybeUnblockParent()`, `submitChildren()`, `getDagSubtree()`
- `services/platform-service/src/modules/fleet/routes.ts` — `POST /fleet/jobs/:id/children`, `GET /fleet/jobs/:id/dag`

**Design:** Children's idempotency keys are added to the parent's `deps[]`, so the existing `unmetDeps()`/`stageForDeps()` logic handles blocking and unblocking without a separate DAG mechanism. Atomic fan-out is done via `submitJob()` with a `children[]` array.

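
The design above can be illustrated with a small in-memory sketch. It is not the coordinator's implementation: the `Job` shape, the stage names (`BLOCKED`, `QUEUED`, `DONE`), the `completeJob` helper, and the signatures of the stand-in `unmetDeps`, `stageForDeps`, and `submitJob` functions are assumptions that only borrow the names mentioned in this section.

```typescript
// Assumed stage names; the real fleet stages may differ.
type Stage = "BLOCKED" | "QUEUED" | "DONE";

interface Job {
  idempotencyKey: string;
  deps: string[]; // idempotency keys this job waits on
  stage: Stage;
}

interface SubmitRequest {
  idempotencyKey: string;
  deps?: string[];
  children?: SubmitRequest[];
}

// In-memory stand-in for the job repository.
const jobs = new Map<string, Job>();

// A dependency is unmet until the job it names has reached DONE.
function unmetDeps(job: Job): string[] {
  return job.deps.filter((key) => jobs.get(key)?.stage !== "DONE");
}

function stageForDeps(job: Job): Stage {
  return unmetDeps(job).length > 0 ? "BLOCKED" : "QUEUED";
}

function submitJob(req: SubmitRequest): Job {
  const children = req.children ?? [];
  // Fan-out: store each child, then add its key to the parent's deps.
  children.forEach((child) => submitJob(child));
  const job: Job = {
    idempotencyKey: req.idempotencyKey,
    deps: [...(req.deps ?? []), ...children.map((child) => child.idempotencyKey)],
    stage: "QUEUED",
  };
  job.stage = stageForDeps(job);
  jobs.set(job.idempotencyKey, job);
  return job;
}

// Hypothetical completion hook: re-evaluates blocked jobs that waited on `key`.
function completeJob(key: string): void {
  const job = jobs.get(key);
  if (!job) return;
  job.stage = "DONE";
  jobs.forEach((other) => {
    if (other.stage === "BLOCKED" && other.deps.includes(key)) {
      other.stage = stageForDeps(other);
    }
  });
}

const parent = submitJob({
  idempotencyKey: "build",
  children: [{ idempotencyKey: "lint" }, { idempotencyKey: "test" }],
});
const stageAfterSubmit = parent.stage; // parent waits on both children
completeJob("lint");
const stageAfterOneChild = parent.stage; // one child is still outstanding
completeJob("test");
const stageAfterAllChildren = parent.stage; // no unmet deps remain
console.log(stageAfterSubmit, stageAfterOneChild, stageAfterAllChildren);
```

Because the parent only ever sees the children as ordinary dependencies, the same dependency check covers both hand-written `deps` and fan-out children.
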
**Tests added:** 8 (DAG fan-out submit, child unblock parent, subtree retrieval)

**Verify gate:** 127/127 fleet tests ✅; full build + test green

## Slice 3 — Per-product budgets

**Key files:**

- `services/platform-service/src/modules/fleet/types.ts` — `FleetBudgetDoc`, `UpsertBudgetSchema`
- `services/platform-service/src/modules/fleet/repository.ts` — budget CRUD (`getBudget`, `upsertBudget`, `updateBudget`)
- `services/platform-service/src/modules/fleet/coordinator.ts` — `isBudgetsEnabled()`, budget enforcement in `claimNextJob`, `accrueSpend()` with auto-pause
- `services/platform-service/src/modules/fleet/routes.ts` — GET/PUT `/fleet/budgets/:productId`, POST pause/resume

**Flags:** `FLEET_BUDGETS` (default OFF)

**Design:** The budget is checked first in the claim loop: if the product is paused or its ceiling is exceeded, the claim returns null immediately, without scanning for jobs. `accrueSpend()` auto-pauses the product when the ceiling is reached.

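
The claim-time gate and the auto-pause can be sketched together as follows. This is an illustration under stated assumptions, not the coordinator code: the `Budget` fields, the in-memory queue, the per-product `claimNextJob` signature, the `BUDGETS_ENABLED` constant standing in for the `FLEET_BUDGETS` flag, and the use of `>=` for both "ceiling exceeded" and "ceiling reached" are all hypothetical.

```typescript
// Assumed budget shape; the real FleetBudgetDoc fields may differ.
interface Budget {
  productId: string;
  ceiling: number; // maximum spend allowed
  spent: number;
  paused: boolean;
}

interface QueuedJob {
  jobId: string;
  productId: string;
}

// In-memory stand-ins for the repository and the FLEET_BUDGETS flag.
const budgets = new Map<string, Budget>();
const queue: QueuedJob[] = [];
const BUDGETS_ENABLED = true;

function claimNextJob(productId: string): QueuedJob | null {
  if (BUDGETS_ENABLED) {
    const budget = budgets.get(productId);
    // Budget gate runs before any queue scan.
    if (budget && (budget.paused || budget.spent >= budget.ceiling)) return null;
  }
  const index = queue.findIndex((job) => job.productId === productId);
  if (index === -1) return null;
  return queue.splice(index, 1)[0] ?? null;
}

function accrueSpend(productId: string, amount: number): Budget | null {
  const budget = budgets.get(productId);
  if (!budget) return null;
  budget.spent += amount;
  // Auto-pause once the ceiling is reached.
  if (budget.spent >= budget.ceiling) budget.paused = true;
  return budget;
}

budgets.set("product-a", { productId: "product-a", ceiling: 10, spent: 0, paused: false });
queue.push({ jobId: "j1", productId: "product-a" }, { jobId: "j2", productId: "product-a" });

const firstClaim = claimNextJob("product-a"); // budget is healthy, so j1 is claimed
const afterSpend = accrueSpend("product-a", 10); // spend reaches the ceiling
const secondClaim = claimNextJob("product-a"); // gate short-circuits; j2 stays queued
console.log(firstClaim, afterSpend, secondClaim);
```

Putting the gate ahead of the scan means a paused product costs one budget lookup per claim attempt instead of a queue query.
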
**Tests added:** 7

**Verify gate:** 134/134 fleet tests ✅; full build + test green

## Slice 4 — tracker-web Fleet Control Plane UI

**Key files:**

- `dashboards/tracker-web/src/lib/fleet-client.ts` — Typed API client with graceful 404 → null degradation
- `dashboards/tracker-web/src/app/api/fleet/[...path]/route.ts` — Proxy route to platform-service
- `dashboards/tracker-web/src/app/dashboard/fleet/page.tsx` — Fleet overview (factory cards + recent jobs)
- `dashboards/tracker-web/src/app/dashboard/fleet/jobs/page.tsx` — Job table with stage filter tabs
- `dashboards/tracker-web/src/app/dashboard/fleet/jobs/[id]/page.tsx` — Job detail (events timeline, runs, artifacts, DAG, SHIP action)
- `dashboards/tracker-web/src/app/dashboard/fleet/budget/page.tsx` — Budget panel (ceiling/spent bar, pause/resume)
- `dashboards/tracker-web/src/app/dashboard/layout.tsx` — Added "Fleet" nav item

**UI degrades gracefully:** If the platform-service fleet module returns 404 (feature flags off), the pages show informational empty states instead of errors.

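
The 404 → null degradation can be sketched as a thin wrapper around an injected fetch function. This is a minimal illustration rather than the contents of `fleet-client.ts`: the `HttpResponse` and `Fetcher` types, the `getOrNull` helper, the `BudgetView` shape, and the URLs are hypothetical, and the fetcher is injected so the sketch runs without a server.

```typescript
// Minimal response shape so the sketch does not depend on DOM or Node typings.
interface HttpResponse {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}

type Fetcher = (url: string) => Promise<HttpResponse>;

// Returns null on 404 so callers can render an empty state; other failures throw.
async function getOrNull<T>(fetcher: Fetcher, url: string): Promise<T | null> {
  const res = await fetcher(url);
  if (res.status === 404) return null;
  if (!res.ok) throw new Error(`Fleet API request failed with status ${res.status}`);
  return (await res.json()) as T;
}

// Hypothetical view model for the budget panel.
interface BudgetView {
  ceiling: number;
  spent: number;
}

// Fake fetcher: one known budget, everything else answers 404.
const fakeFetcher: Fetcher = async (url) =>
  url.endsWith("/budgets/product-a")
    ? { status: 200, ok: true, json: async () => ({ ceiling: 100, spent: 40 }) }
    : { status: 404, ok: false, json: async () => ({}) };

async function demo(): Promise<[BudgetView | null, BudgetView | null]> {
  const found = await getOrNull<BudgetView>(fakeFetcher, "/api/fleet/budgets/product-a");
  const missing = await getOrNull<BudgetView>(fakeFetcher, "/api/fleet/budgets/unknown");
  return [found, missing];
}

demo().then(([found, missing]) => console.log(found, missing));
```

A page component can then branch on `null` to show its empty state, while genuine server errors still surface as exceptions.
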
**Tests added:** 16 fleet-client unit tests (198 tracker-web tests in total)

**Verify gate:** 198 tracker-web tests ✅; full build green

## Slice 5 — Docs + roadmap

See `docs/FLEET_CONTROL_PLANE.md` for the operational guide.

## Follow-ups

- The weight registry could be loaded from Cosmos (per-product config doc) in a later phase
- Seat-limit enforcement is tied to the `FLEET_PREEMPTION` flag; it could be decoupled later
- E2E Playwright tests for the fleet UI (pending Playwright setup in CI)
- Budget history/audit log endpoint
- Real-time WebSocket updates for job stage transitions in the UI