Saravanakumar D fc000c4c20 docs(gigafactory): consolidate gigafactory docs into docs/gigafactory/
Move ROADMAP_COMPLETION_AUDIT.md, TASKS_TO_COMPLETE.md,
gigafactory-phase3-progress.md and FLEET_CONTROL_PLANE.md under
docs/gigafactory/ so the scattered Gigafactory docs are easy to discover.
Update intra-doc and cross-repo source-of-truth references (fleet README
and types.ts comment) to the new agent-queue/docs/gigafactory/ path.
Pure docs/comment move; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:07:20 -07:00


Gigafactory Phase 3 — Progress

| Slice | Name | Status | Commit | Verify Gate |
|---|---|---|---|---|
| 1 | Tunable scoring weights + preemption | DONE | 4a209e23 | 119 fleet tests, full build, pnpm test |
| 2 | DAG job decomposition | DONE | 26606c85 | 127 fleet tests, full build, pnpm test |
| 3 | Per-product budgets | DONE | fd1b18d7 | 134 fleet tests, full build, pnpm test |
| 4 | tracker-web Fleet Control Plane UI | DONE | 39ade652 | 198 tracker-web tests, full build |
| 5 | Docs + roadmap | DONE (this) | | |

Slice 1 — Tunable scoring weights + preemption

Key files:

  • services/platform-service/src/modules/fleet/scheduler.ts — added resolveWeights(), selectPreemptionVictim(), FleetWeightRegistry, RunningJobView
  • services/platform-service/src/modules/fleet/coordinator.ts — added isPreemptionEnabled(), setWeightRegistry(), seat-limit enforcement, preemption wiring
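The weight-resolution tests listed for this slice (defaults, partial override, per-request precedence, backward compat) imply a layered merge. The TypeScript below is a minimal sketch of that layering, not the code in scheduler.ts: the weight names (priority, age, affinity), their default values, the InMemoryWeightRegistry stand-in for FleetWeightRegistry, and the resolveWeights signature are all assumptions.

```typescript
// Hypothetical weight names and defaults; the real scheduler.ts shape may differ.
interface ScoringWeights {
  priority: number;
  age: number;
  affinity: number;
}
type PartialWeights = Partial<ScoringWeights>;

const DEFAULT_WEIGHTS: ScoringWeights = { priority: 1, age: 1, affinity: 1 };

// Hypothetical stand-in for FleetWeightRegistry: per-product overrides keyed by productId.
class InMemoryWeightRegistry {
  private readonly overrides = new Map<string, PartialWeights>();

  set(productId: string, weights: PartialWeights): void {
    this.overrides.set(productId, weights);
  }

  get(productId: string): PartialWeights | undefined {
    return this.overrides.get(productId);
  }
}

// Copy only keys that carry a value, so an explicit undefined never clobbers a lower layer.
function definedOnly(weights?: PartialWeights): PartialWeights {
  const out: PartialWeights = {};
  if (!weights) return out;
  for (const key of Object.keys(weights) as (keyof ScoringWeights)[]) {
    const value = weights[key];
    if (value !== undefined) out[key] = value;
  }
  return out;
}

// Precedence, lowest to highest: defaults, the product's registry entry, the per-request override.
// With no registry and no override the result equals DEFAULT_WEIGHTS (the backward-compatible path).
function resolveWeights(
  productId: string,
  registry?: InMemoryWeightRegistry,
  perRequest?: PartialWeights,
): ScoringWeights {
  return {
    ...DEFAULT_WEIGHTS,
    ...definedOnly(registry?.get(productId)),
    ...definedOnly(perRequest),
  };
}
```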

Flags: FLEET_PREEMPTION (default OFF; with the flag off, behavior is byte-for-byte identical to Phase 2)

Tests added: 18 (14 scheduler pure + 4 coordinator integration)

  • Weight resolution: defaults, partial override, per-request precedence, backward compat
  • Preemption pure: critical evicts lower, never evicts equal/higher, picks lowest victim, capability checks
  • Preemption integration: flag OFF no eviction, flag ON eviction + checkpoint preserved + zombie fenced + event
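The pure preemption rules above (only critical work evicts, never an equal or higher priority, lowest-priority victim first, capability match required) can be sketched as a single scan over running jobs. This is an illustration under assumptions, not scheduler.ts: the four-level priority ladder, the RunningJobView fields, and the IncomingJob shape are hypothetical, and in the real coordinator this path only runs when FLEET_PREEMPTION is on.

```typescript
// Hypothetical priority ladder; a higher rank means more important.
const PRIORITY_RANK = { low: 0, normal: 1, high: 2, critical: 3 } as const;
type Priority = keyof typeof PRIORITY_RANK;

// Simplified, hypothetical view of a job currently holding a seat.
interface RunningJobView {
  jobId: string;
  priority: Priority;
  factoryCapabilities: string[]; // capabilities of the factory running this job
}

interface IncomingJob {
  priority: Priority;
  requiredCapabilities: string[];
}

function selectPreemptionVictim(
  incoming: IncomingJob,
  running: RunningJobView[],
): RunningJobView | null {
  // Only critical work may preempt anything.
  if (incoming.priority !== "critical") return null;
  const incomingRank = PRIORITY_RANK[incoming.priority];

  let victim: RunningJobView | null = null;
  for (const job of running) {
    const rank = PRIORITY_RANK[job.priority];
    // Never evict a job of equal or higher priority.
    if (rank >= incomingRank) continue;
    // The freed seat must be able to run the incoming job.
    const capable = incoming.requiredCapabilities.every((cap) =>
      job.factoryCapabilities.includes(cap),
    );
    if (!capable) continue;
    // Keep the lowest-priority candidate; on a tie the first one seen wins.
    if (victim === null || rank < PRIORITY_RANK[victim.priority]) victim = job;
  }
  return victim;
}
```

Per the integration tests listed above, the real eviction path also preserves the victim's checkpoint, fences the zombie run, and emits an event; the sketch covers only victim selection.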

Verify gate: pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet → 119/119; pnpm build && pnpm test → all green

Slice 2 — DAG job decomposition

Key files:

  • services/platform-service/src/modules/fleet/types.ts - SubmitChildrenSchema; added children[] to SubmitJobSchema
  • services/platform-service/src/modules/fleet/repository.ts - listChildrenByParent()
  • services/platform-service/src/modules/fleet/coordinator.ts - maybeUnblockParent(), submitChildren(), getDagSubtree()
  • services/platform-service/src/modules/fleet/routes.ts — POST /fleet/jobs/:id/children, GET /fleet/jobs/:id/dag

Design: each child's idempotency key is added to the parent's deps[], so the existing unmetDeps()/stageForDeps() logic keeps the parent blocked until those dependencies are met and unblocks it afterward. Fan-out is atomic via a single submitJob() call carrying a children[] array.
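The deps[]-based design can be modeled in memory to show why no new blocking logic is needed. In this sketch the Stage values, the JobDoc fields, and the InMemoryJobs class are hypothetical; unmetDeps, stageForDeps, and maybeUnblockParent are simplified stand-ins for the functions named in this slice, and the atomicity of the real repository write is not modeled.

```typescript
// Hypothetical stage names for illustration.
type Stage = "blocked" | "queued" | "done";

interface JobDoc {
  idempotencyKey: string;
  parentKey?: string;
  deps: string[]; // idempotency keys this job waits on
  stage: Stage;
}

class InMemoryJobs {
  readonly byKey = new Map<string, JobDoc>();

  // Stand-in for unmetDeps(): dependencies that are missing or not yet done.
  unmetDeps(job: JobDoc): string[] {
    return job.deps.filter((key) => this.byKey.get(key)?.stage !== "done");
  }

  // Stand-in for stageForDeps(): blocked while any dependency is unmet.
  stageForDeps(job: JobDoc): Stage {
    return this.unmetDeps(job).length > 0 ? "blocked" : "queued";
  }

  // Fan-out: the children's keys become the parent's deps, so the parent blocks via existing logic.
  submitJob(key: string, childKeys: string[] = []): JobDoc {
    const children = childKeys.map(
      (childKey): JobDoc => ({ idempotencyKey: childKey, parentKey: key, deps: [], stage: "queued" }),
    );
    const parent: JobDoc = { idempotencyKey: key, deps: [...childKeys], stage: "queued" };
    for (const doc of [...children, parent]) this.byKey.set(doc.idempotencyKey, doc);
    parent.stage = this.stageForDeps(parent);
    return parent;
  }

  // Stand-in for maybeUnblockParent(): re-evaluate a blocked parent after a child finishes.
  maybeUnblockParent(parentKey: string): void {
    const parent = this.byKey.get(parentKey);
    if (parent?.stage === "blocked") parent.stage = this.stageForDeps(parent);
  }

  completeJob(key: string): void {
    const job = this.byKey.get(key);
    if (!job) throw new Error(`unknown job ${key}`);
    job.stage = "done";
    if (job.parentKey) this.maybeUnblockParent(job.parentKey);
  }
}
```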

Tests added: 8 (DAG fan-out submit, child unblock parent, subtree retrieval)
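Subtree retrieval (getDagSubtree() on top of listChildrenByParent()) is a recursive walk down from the requested job. The sketch below works over an in-memory array instead of the repository; the FleetJob and DagNode shapes, the synchronous listChildrenByParent stand-in, and the maxDepth and visited-set guards are assumptions added for illustration.

```typescript
// Hypothetical job shape: each child records its parent's id.
interface FleetJob {
  id: string;
  parentId?: string;
  stage: string;
}

interface DagNode {
  job: FleetJob;
  children: DagNode[];
}

// In-memory stand-in for repository.listChildrenByParent().
function listChildrenByParent(jobs: FleetJob[], parentId: string): FleetJob[] {
  return jobs.filter((job) => job.parentId === parentId);
}

// Walk down from rootId; the visited set and maxDepth guard against cyclic or runaway data.
function getDagSubtree(jobs: FleetJob[], rootId: string, maxDepth = 10): DagNode | null {
  const root = jobs.find((job) => job.id === rootId);
  if (!root) return null;
  const visited = new Set<string>();

  const build = (job: FleetJob, depth: number): DagNode => {
    visited.add(job.id);
    const children =
      depth >= maxDepth
        ? []
        : listChildrenByParent(jobs, job.id)
            .filter((child) => !visited.has(child.id))
            .map((child) => build(child, depth + 1));
    return { job, children };
  };

  return build(root, 0);
}
```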

Verify gate: 127/127 fleet tests; full build + test green

Slice 3 — Per-product budgets

Key files:

  • services/platform-service/src/modules/fleet/types.ts - FleetBudgetDoc, UpsertBudgetSchema
  • services/platform-service/src/modules/fleet/repository.ts — budget CRUD (getBudget, upsertBudget, updateBudget)
  • services/platform-service/src/modules/fleet/coordinator.ts - isBudgetsEnabled(), budget enforcement in claimNextJob, accrueSpend() with auto-pause
  • services/platform-service/src/modules/fleet/routes.ts — GET/PUT /fleet/budgets/:productId, POST pause/resume

Flags: FLEET_BUDGETS (default OFF)

Design: the budget is checked FIRST in the claim loop. If the product is paused or its ceiling has been exceeded, claimNextJob immediately returns null without scanning any jobs. accrueSpend() auto-pauses the product once its ceiling is reached.
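The claim-time gate and auto-pause reduce to two small functions. This is a sketch under assumptions: the FleetBudgetDoc fields (ceiling, spent, paused), the QueuedJob shape, and passing the FLEET_BUDGETS flag in as a boolean are hypothetical, and the real claimNextJob also scores and leases jobs.

```typescript
// Simplified, hypothetical budget document; the real FleetBudgetDoc may carry more fields.
interface FleetBudgetDoc {
  productId: string;
  ceiling: number; // spend ceiling, in the same unit as `spent`
  spent: number;
  paused: boolean;
}

interface QueuedJob {
  id: string;
  productId: string;
}

function claimNextJob(
  productId: string,
  queue: QueuedJob[],
  budget: FleetBudgetDoc | undefined,
  budgetsEnabled: boolean, // stands in for FLEET_BUDGETS
): QueuedJob | null {
  // Budget gate runs first: paused or over ceiling returns null before any job is examined.
  if (budgetsEnabled && budget && (budget.paused || budget.spent > budget.ceiling)) {
    return null;
  }
  // A real claim would also score candidates and lease the winner; this takes the first match.
  return queue.find((job) => job.productId === productId) ?? null;
}

// Add spend and pause the product once the ceiling is reached (spent >= ceiling).
function accrueSpend(budget: FleetBudgetDoc, amount: number): FleetBudgetDoc {
  const spent = budget.spent + amount;
  return { ...budget, spent, paused: budget.paused || spent >= budget.ceiling };
}
```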

Tests added: 7

Verify gate: 134/134 fleet tests; full build + test green

Slice 4 — tracker-web Fleet Control Plane UI

Key files:

  • dashboards/tracker-web/src/lib/fleet-client.ts — Typed API client with graceful 404 → null degradation
  • dashboards/tracker-web/src/app/api/fleet/[...path]/route.ts — Proxy route to platform-service
  • dashboards/tracker-web/src/app/dashboard/fleet/page.tsx — Fleet overview (factory cards + recent jobs)
  • dashboards/tracker-web/src/app/dashboard/fleet/jobs/page.tsx — Job table with stage filter tabs
  • dashboards/tracker-web/src/app/dashboard/fleet/jobs/[id]/page.tsx — Job detail (events timeline, runs, artifacts, DAG, SHIP action)
  • dashboards/tracker-web/src/app/dashboard/fleet/budget/page.tsx — Budget panel (ceiling/spent bar, pause/resume)
  • dashboards/tracker-web/src/app/dashboard/layout.tsx — Added "Fleet" nav item
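The proxy route under api/fleet/[...path] forwards requests to platform-service. Below is a framework-agnostic sketch of that forwarding using only the standard Fetch API classes (Request, Response, URL) available in Node 18+. The upstream base URL parameter, the mapping onto a /fleet/... path, and which headers are forwarded are assumptions; how the catch-all segments reach the helper depends on the Next.js version, so that wiring is omitted.

```typescript
// Statuses whose responses must not carry a body (the Response constructor rejects one).
const NULL_BODY_STATUSES = new Set([204, 205, 304]);

// Map catch-all segments onto the platform-service /fleet/... path, keeping the query string.
function buildUpstreamUrl(baseUrl: string, pathSegments: string[], search: string): string {
  const base = baseUrl.replace(/\/+$/, "");
  const path = pathSegments.map((segment) => encodeURIComponent(segment)).join("/");
  return `${base}/fleet/${path}${search}`;
}

async function proxyFleetRequest(
  request: Request,
  pathSegments: string[],
  upstreamBaseUrl: string,
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  const target = buildUpstreamUrl(upstreamBaseUrl, pathSegments, new URL(request.url).search);
  const hasBody = request.method !== "GET" && request.method !== "HEAD";
  const upstream = await fetchImpl(target, {
    method: request.method,
    headers: { "content-type": request.headers.get("content-type") ?? "application/json" },
    body: hasBody ? await request.text() : undefined,
  });
  const body = NULL_BODY_STATUSES.has(upstream.status) ? null : await upstream.arrayBuffer();
  // Pass the upstream status through unchanged, including 404.
  return new Response(body, {
    status: upstream.status,
    headers: { "content-type": upstream.headers.get("content-type") ?? "application/json" },
  });
}
```

Passing the status through unchanged is the design choice that matters in this sketch: a 404 from a disabled fleet module reaches fleet-client as a 404 rather than being masked as a 500 or a 200.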

UI degrades gracefully: if the platform-service fleet module returns 404 (feature flags off), pages show informational empty states.
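The 404 → null contract behind these empty states can be sketched as a small typed GET helper. The fetch function is injected so the sketch runs without a network; the FetchLike type, the FleetJobSummary shape, the /api/fleet/jobs path, the empty-state message, and throwing on other non-OK statuses are assumptions rather than the actual fleet-client.ts API.

```typescript
// Minimal fetch-compatible signature; the global fetch satisfies it.
type FetchLike = (input: string) => Promise<{
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}>;

interface FleetJobSummary {
  id: string;
  stage: string;
}

// GET through the tracker-web proxy (relative URL, resolved against the page origin in the browser).
// A 404 means the fleet module is off, so return null instead of throwing.
async function fleetGet<T>(fetchImpl: FetchLike, path: string): Promise<T | null> {
  const res = await fetchImpl(`/api/fleet/${path}`);
  if (res.status === 404) return null;
  // Any other failure is treated as a real error rather than a disabled feature.
  if (!res.ok) throw new Error(`fleet request failed with status ${res.status}`);
  return (await res.json()) as T;
}

// A page-level caller maps null to an informational empty state.
async function describeJobs(fetchImpl: FetchLike): Promise<string> {
  const jobs = await fleetGet<FleetJobSummary[]>(fetchImpl, "jobs");
  if (jobs === null) return "Fleet is not enabled on platform-service.";
  return `${jobs.length} job(s)`;
}
```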

Tests added: 16 (fleet-client unit tests; 198 tracker-web tests in total)

Verify gate: 198 tracker-web tests; full build green

Slice 5 — Docs + roadmap

See FLEET_CONTROL_PLANE.md for the operational guide.

Follow-ups

  • Weight registry could be loaded from Cosmos (per-product config doc) in a later phase
  • Seat limit enforcement is tied to FLEET_PREEMPTION flag; could be decoupled later
  • E2E Playwright tests for fleet UI (pending Playwright setup in CI)
  • Budget history/audit log endpoint
  • Real-time WebSocket updates for job stage transitions in the UI