learning_ai_common_plat/docs/gigafactory-phase3-progress.md
Saravanakumar D 325dfcae8e docs(fleet): Phase 3 operational guide + progress report (Slice 5)
- Created docs/FLEET_CONTROL_PLANE.md — full operational guide covering:
  - Feature flags (FLEET_PREEMPTION, FLEET_BUDGETS)
  - Tunable scoring weights + resolution order
  - Preemption rules and behavior
  - DAG job decomposition API
  - Per-product budgets with auto-pause
  - Fleet Control Plane UI pages and configuration
  - API reference summary
  - Architecture decisions
- Updated docs/gigafactory-phase3-progress.md — all 5 slices DONE with commit SHAs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:24 -07:00

Gigafactory Phase 3 — Progress

| Slice | Name | Status | Commit | Verify Gate |
|-------|------|--------|--------|-------------|
| 1 | Tunable scoring weights + preemption | DONE | 4a209e23 | 119 fleet tests, full build, pnpm test |
| 2 | DAG job decomposition | DONE | 26606c85 | 127 fleet tests, full build, pnpm test |
| 3 | Per-product budgets | DONE | fd1b18d7 | 134 fleet tests, full build, pnpm test |
| 4 | tracker-web Fleet Control Plane UI | DONE | 39ade652 | 198 tracker-web tests, full build |
| 5 | Docs + roadmap | DONE | (this) | |

Slice 1 — Tunable scoring weights + preemption

Key files:

  • services/platform-service/src/modules/fleet/scheduler.ts — added resolveWeights(), selectPreemptionVictim(), FleetWeightRegistry, RunningJobView
  • services/platform-service/src/modules/fleet/coordinator.ts — added isPreemptionEnabled(), setWeightRegistry(), seat-limit enforcement, preemption wiring

Flags: FLEET_PREEMPTION (default OFF = byte-for-byte Phase 2 behavior)

Tests added: 18 (14 scheduler pure + 4 coordinator integration)

  • Weight resolution: defaults, partial override, per-request precedence, backward compat
  • Preemption pure: critical evicts lower, never evicts equal/higher, picks lowest victim, capability checks
  • Preemption integration: flag OFF no eviction, flag ON eviction + checkpoint preserved + zombie fenced + event
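
The weight-resolution cases listed above (defaults, partial override, per-request precedence) can be illustrated with a minimal sketch. The field names, default values, and the exact signature below are assumptions made for illustration; the report does not show the real shape used by resolveWeights() or FleetWeightRegistry.

```typescript
// Hypothetical weight shape and defaults; the real ones live in scheduler.ts
// and are not shown in this report.
interface ScoringWeights {
  priority: number;
  age: number;
  affinity: number;
}

const DEFAULT_WEIGHTS: ScoringWeights = { priority: 1, age: 0.1, affinity: 0.5 };

// Copy only the keys that carry a value, so an explicit `undefined`
// never erases a lower-precedence setting.
function dropUndefined(w?: Partial<ScoringWeights>): Partial<ScoringWeights> {
  const out: Partial<ScoringWeights> = {};
  for (const key of Object.keys(w ?? {}) as (keyof ScoringWeights)[]) {
    const value = w?.[key];
    if (value !== undefined) out[key] = value;
  }
  return out;
}

// Assumed resolution order: per-request override > registry entry > defaults.
function resolveWeights(
  registry?: Partial<ScoringWeights>,
  perRequest?: Partial<ScoringWeights>,
): ScoringWeights {
  return {
    ...DEFAULT_WEIGHTS,
    ...dropUndefined(registry),
    ...dropUndefined(perRequest),
  };
}

// Partial override plus per-request precedence:
// `age` comes from the request, `affinity` from the registry, `priority` from defaults.
const resolved = resolveWeights({ age: 2, affinity: 4 }, { age: 3 });
console.log(resolved);
```

Calling resolveWeights() with no arguments returns the defaults unchanged, which is one plausible reading of the "backward compat" case.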

Verify gate: pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet → 119/119; pnpm build && pnpm test → all green
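
The preemption rules exercised by the Slice 1 tests (critical evicts lower, never evicts equal or higher, picks the lowest victim, capability checks) could be expressed as a pure function along these lines. This is a sketch under assumptions: the priority ladder, the RunningJob and IncomingJob shapes, and the name selectVictim are hypothetical stand-ins, not the actual selectPreemptionVictim() or RunningJobView.

```typescript
// Hypothetical priority ladder; the real priority values are not listed in this report.
type Priority = "low" | "normal" | "high" | "critical";
const RANK: Record<Priority, number> = { low: 0, normal: 1, high: 2, critical: 3 };

// Hypothetical view of a job currently holding a seat.
interface RunningJob {
  jobId: string;
  priority: Priority;
  seatCapabilities: string[]; // capabilities of the seat the job occupies
}

interface IncomingJob {
  priority: Priority;
  requiredCapabilities: string[];
}

function selectVictim(incoming: IncomingJob, running: RunningJob[]): RunningJob | null {
  // Assumption: only critical jobs are allowed to preempt.
  if (incoming.priority !== "critical") return null;

  let victim: RunningJob | null = null;
  for (const job of running) {
    // Never evict a job of equal or higher priority.
    if (RANK[job.priority] >= RANK[incoming.priority]) continue;
    // Capability check: the freed seat must be able to run the incoming job.
    const fits = incoming.requiredCapabilities.every((c) => job.seatCapabilities.includes(c));
    if (!fits) continue;
    // Keep the lowest-priority candidate seen so far.
    if (victim === null || RANK[job.priority] < RANK[victim.priority]) victim = job;
  }
  return victim;
}
```

Keeping the selection pure (no I/O, no clock) is what makes it testable without a coordinator, which matches the "scheduler pure" versus "coordinator integration" split in the test counts.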

Slice 2 — DAG job decomposition

Key files:

  • services/platform-service/src/modules/fleet/types.ts — SubmitChildrenSchema, added children[] to SubmitJobSchema
  • services/platform-service/src/modules/fleet/repository.ts — listChildrenByParent()
  • services/platform-service/src/modules/fleet/coordinator.ts — maybeUnblockParent(), submitChildren(), getDagSubtree()
  • services/platform-service/src/modules/fleet/routes.ts — POST /fleet/jobs/:id/children, GET /fleet/jobs/:id/dag

Design: Children's idempotency keys added to parent's deps[]. Existing unmetDeps()/stageForDeps() logic handles blocking/unblocking. Atomic fan-out via submitJob() with children[] array.
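
To make the design concrete, here is a small in-memory sketch of the idea: child keys are written into the parent's deps[], and a generic unmet-dependency check decides when the parent is released. The class InMemoryFleet, the stage names, and the method signatures are illustrative assumptions, not the coordinator's actual API.

```typescript
// Hypothetical stage names; the real stage set is not listed in this report.
type Stage = "BLOCKED" | "QUEUED" | "DONE";

interface Job {
  key: string; // stands in for the job's idempotency key
  deps: string[]; // keys of jobs that must be DONE first
  stage: Stage;
}

class InMemoryFleet {
  private jobs = new Map<string, Job>();

  // Fan-out in one call: children are stored first, then the parent with
  // the children's keys as its deps.
  submit(key: string, childKeys: string[] = []): void {
    for (const childKey of childKeys) {
      this.jobs.set(childKey, { key: childKey, deps: [], stage: "QUEUED" });
    }
    const parent: Job = { key, deps: [...childKeys], stage: "QUEUED" };
    if (this.unmetDeps(parent).length > 0) parent.stage = "BLOCKED";
    this.jobs.set(key, parent);
  }

  // A dep is unmet until the job it names has reached DONE.
  unmetDeps(job: Job): string[] {
    return job.deps.filter((dep) => this.jobs.get(dep)?.stage !== "DONE");
  }

  // Completing a job re-checks every blocked job; a parent is released
  // once its last child finishes.
  complete(key: string): void {
    const job = this.jobs.get(key);
    if (!job) throw new Error(`unknown job ${key}`);
    job.stage = "DONE";
    this.jobs.forEach((other) => {
      if (other.stage === "BLOCKED" && this.unmetDeps(other).length === 0) {
        other.stage = "QUEUED";
      }
    });
  }

  stageOf(key: string): Stage | undefined {
    return this.jobs.get(key)?.stage;
  }
}

const fleet = new InMemoryFleet();
fleet.submit("parent", ["child-a", "child-b"]);
fleet.complete("child-a"); // parent still BLOCKED
fleet.complete("child-b"); // parent released to QUEUED
```

Because the parent is blocked through the ordinary dependency list, no DAG-specific scheduling path is needed, which is the point of reusing the existing unmetDeps()/stageForDeps() logic.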

Tests added: 8 (DAG fan-out submit, child unblock parent, subtree retrieval)

Verify gate: 127/127 fleet tests; full build + test green

Slice 3 — Per-product budgets

Key files:

  • services/platform-service/src/modules/fleet/types.ts — FleetBudgetDoc, UpsertBudgetSchema
  • services/platform-service/src/modules/fleet/repository.ts — budget CRUD (getBudget, upsertBudget, updateBudget)
  • services/platform-service/src/modules/fleet/coordinator.ts — isBudgetsEnabled(), budget enforcement in claimNextJob, accrueSpend() with auto-pause
  • services/platform-service/src/modules/fleet/routes.ts — GET/PUT /fleet/budgets/:productId, POST pause/resume

Flags: FLEET_BUDGETS (default OFF)

Design: Budget checked FIRST in claim loop — if paused or ceiling exceeded, immediately return null (no job scan). accrueSpend() auto-pauses when ceiling reached.
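
A minimal sketch of this budget-first ordering follows. The BudgetDoc fields, the treatment of a missing budget as "no limit", and the use of spent >= ceiling as the threshold are assumptions for illustration; the real FleetBudgetDoc and claimNextJob signatures are not shown in this report.

```typescript
// Hypothetical budget document; the real FleetBudgetDoc fields may differ.
interface BudgetDoc {
  productId: string;
  ceiling: number;
  spent: number;
  paused: boolean;
}

// True when claims for this product must be refused.
function isBlocked(budget: BudgetDoc | undefined): boolean {
  if (!budget) return false; // assumption: no budget doc means no limit
  return budget.paused || budget.spent >= budget.ceiling;
}

// The budget is checked before the queue is touched, so a blocked product
// costs one lookup instead of a job scan.
function claimNextJob<T>(
  budgetsEnabled: boolean,
  budget: BudgetDoc | undefined,
  scanQueue: () => T | null,
): T | null {
  if (budgetsEnabled && isBlocked(budget)) return null;
  return scanQueue();
}

// Adds spend and flips `paused` once the ceiling is reached.
function accrueSpend(budget: BudgetDoc, amount: number): BudgetDoc {
  const spent = budget.spent + amount;
  return { ...budget, spent, paused: budget.paused || spent >= budget.ceiling };
}

let budget: BudgetDoc = { productId: "demo", ceiling: 10, spent: 0, paused: false };
budget = accrueSpend(budget, 6); // under the ceiling, still claimable
budget = accrueSpend(budget, 4); // reaches the ceiling, auto-paused
```

With the flag off (budgetsEnabled false), the function always falls through to the queue scan, which mirrors the default-OFF behavior of FLEET_BUDGETS.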

Tests added: 7

Verify gate: 134/134 fleet tests; full build + test green

Slice 4 — tracker-web Fleet Control Plane UI

Key files:

  • dashboards/tracker-web/src/lib/fleet-client.ts — Typed API client with graceful 404 → null degradation
  • dashboards/tracker-web/src/app/api/fleet/[...path]/route.ts — Proxy route to platform-service
  • dashboards/tracker-web/src/app/dashboard/fleet/page.tsx — Fleet overview (factory cards + recent jobs)
  • dashboards/tracker-web/src/app/dashboard/fleet/jobs/page.tsx — Job table with stage filter tabs
  • dashboards/tracker-web/src/app/dashboard/fleet/jobs/[id]/page.tsx — Job detail (events timeline, runs, artifacts, DAG, SHIP action)
  • dashboards/tracker-web/src/app/dashboard/fleet/budget/page.tsx — Budget panel (ceiling/spent bar, pause/resume)
  • dashboards/tracker-web/src/app/dashboard/layout.tsx — Added "Fleet" nav item

UI degrades gracefully: If platform-service fleet module returns 404 (feature flags off), pages show informational empty states.
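
The 404 → null degradation could look roughly like the sketch below. The helper name fleetGet, the injected FetchLike type, and the /api/fleet/ URL prefix (inferred from the proxy route's location) are assumptions; the actual fleet-client.ts is not reproduced in this report.

```typescript
// Minimal response shape so the helper can be exercised without a network.
interface ResponseLike {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}
type FetchLike = (url: string) => Promise<ResponseLike>;

// Hypothetical helper: a 404 means "fleet module not available" and maps to
// null; any other non-2xx status is still surfaced as an error.
async function fleetGet<T>(path: string, fetchImpl: FetchLike): Promise<T | null> {
  const res = await fetchImpl(`/api/fleet/${path}`);
  if (res.status === 404) return null;
  if (!res.ok) throw new Error(`fleet request failed with status ${res.status}`);
  return (await res.json()) as T;
}

// A page can then branch on null to render its empty state.
async function describeJobs(fetchImpl: FetchLike): Promise<string> {
  const jobs = await fleetGet<{ id: string }[]>("jobs", fetchImpl);
  return jobs === null ? "Fleet features are not enabled." : `${jobs.length} job(s)`;
}
```

Mapping only 404 to null keeps real failures (for example a 500 from platform-service) visible instead of hiding them behind an empty state.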

Tests added: 16 fleet-client unit tests (198 total in tracker-web)

Verify gate: 198 tracker-web tests; full build green

Slice 5 — Docs + roadmap

See docs/FLEET_CONTROL_PLANE.md for the operational guide.

Follow-ups

  • Weight registry could be loaded from Cosmos (per-product config doc) in a later phase
  • Seat limit enforcement is tied to FLEET_PREEMPTION flag; could be decoupled later
  • E2E Playwright tests for fleet UI (pending Playwright setup in CI)
  • Budget history/audit log endpoint
  • Real-time WebSocket updates for job stage transitions in the UI