diff --git a/docs/FLEET_CONTROL_PLANE.md b/docs/FLEET_CONTROL_PLANE.md
new file mode 100644
index 00000000..a510c9c8
--- /dev/null
+++ b/docs/FLEET_CONTROL_PLANE.md
@@ -0,0 +1,148 @@
+# Fleet Control Plane — Operational Guide
+
+> Phase 3 of the Agent Gigafactory. Adds tunable scoring, preemption, DAG decomposition, per-product budgets, and a tracker-web UI.
+
+## Feature Flags
+
+All Phase 3 features are **gated behind environment variables** (default OFF) for safe rollout:
+
+| Flag | Default | Effect |
+| ------------------ | ------- | ----------------------------------------------------------------------------- |
+| `FLEET_PREEMPTION` | `""` | Enables seat-limit enforcement + critical-job preemption |
+| `FLEET_BUDGETS` | `""` | Enables per-product USD ceiling enforcement. Pauses jobs when budget exceeded |
+
+Set to any truthy value (`"1"`, `"true"`, `"yes"`) to enable.
+
+## Tunable Scoring Weights
+
+Scoring determines which queued job a factory picks up next. The formula:
+
+```
+score = w.age * ageMinutes + w.priority * priorityOrder + w.retries * attempts + w.capabilities * capabilityBonus
+```
+
+### Weight Resolution Order
+
+1. **Per-request override** — `weights` field in `POST /fleet/jobs/:id/claim` body
+2. **Product registry** — set via `setWeightRegistry({ [productId]: weights })`
+3. **Defaults** — `{ age: 1, priority: 10, retries: -2, capabilities: 5 }`
+
+Each level does a **per-field merge** (not full object replacement).
+
+## Preemption
+
+When `FLEET_PREEMPTION` is enabled and a factory is at its `seatLimit`:
+
+1. A critical-priority job arrives in `claimNextJob`
+2. `selectPreemptionVictim(runningJobs, incomingJob)` picks the lowest-scoring running job
+3. The victim is evicted: its lease is released with `checkpoint: true`, ensuring the job can resume
+4. The critical job takes the freed seat
+5. An event `{ type: 'preempted', victim, preemptor }` is recorded
+
+**Rules:**
+
+- Only `critical` priority can trigger preemption
+- Never preempts jobs of equal or higher priority
+- Capability mismatch disqualifies a factory from preemption
+
+## DAG Job Decomposition
+
+Submit a composite job with children for parallel fan-out:
+
+```http
+POST /fleet/jobs
+{
+  "idempotencyKey": "parent-job",
+  "kind": "composite",
+  "children": [
+    { "idempotencyKey": "child-1", "bodyMd": "..." },
+    { "idempotencyKey": "child-2", "bodyMd": "..." }
+  ]
+}
+```
+
+Or add children later:
+
+```http
+POST /fleet/jobs/:parentId/children
+{
+  "children": [
+    { "idempotencyKey": "child-3", "bodyMd": "..." }
+  ]
+}
+```
+
+**Behavior:**
+
+- Parent is automatically blocked until all children complete (children's idempotency keys become parent deps)
+- Children unblock parent via `maybeUnblockParent()` when transitioning to `shipped`/`done`
+- View the full DAG: `GET /fleet/jobs/:id/dag`
+
+## Per-Product Budgets
+
+Control spend per product with USD ceilings:
+
+```http
+PUT /fleet/budgets/:productId
+{ "ceilingUsd": 100, "window": "monthly" }
+```
+
+| Endpoint | Method | Effect |
+| ---------------------------------- | ------ | ----------------------- |
+| `/fleet/budgets/:productId` | GET | Read current budget |
+| `/fleet/budgets/:productId` | PUT | Create/update ceiling |
+| `/fleet/budgets/:productId/pause` | POST | Manually pause spending |
+| `/fleet/budgets/:productId/resume` | POST | Resume spending |
+
+**Enforcement:** When `FLEET_BUDGETS` is enabled, `claimNextJob` checks budget status FIRST. If paused or ceiling exceeded → returns null (no job scan).
+
+**Auto-pause:** `accrueSpend(productId, amount)` auto-pauses when `spentUsd >= ceilingUsd`.
+
+## Fleet Control Plane UI (tracker-web)
+
+Navigate to **Dashboard → Fleet** in tracker-web.
+
+### Pages
+
+| Route | Description |
+| ---------------------------- | ----------------------------------------------- |
+| `/dashboard/fleet` | Overview — factory health cards + recent jobs |
+| `/dashboard/fleet/jobs` | Job list with stage filter tabs |
+| `/dashboard/fleet/jobs/[id]` | Job detail — events, runs, artifacts, DAG, SHIP |
+| `/dashboard/fleet/budget` | Budget view — spend bar, pause/resume controls |
+
+### Graceful Degradation
+
+The UI calls platform-service fleet endpoints via `/api/fleet/[...path]` proxy. If the fleet module returns 404 (flags off), pages display informational empty states instead of errors.
+
+### Configuration
+
+| Env Var | Default | Purpose |
+| ------------------ | ----------------------- | ----------------------------------- |
+| `PLATFORM_API_URL` | `http://localhost:4003` | Platform-service base URL for proxy |
+
+## API Reference Summary
+
+| Endpoint | Method | Phase | Notes |
+| ---------------------------------- | ------ | ----- | -------------------------------------------------- |
+| `/fleet/jobs` | GET | 2 | List jobs (query: stage, productId, limit, offset) |
+| `/fleet/jobs` | POST | 2 | Submit job (+ optional children[] for DAG) |
+| `/fleet/jobs/:id` | GET | 2 | Get job |
+| `/fleet/jobs/:id` | PATCH | 2 | Update stage (fenced) |
+| `/fleet/jobs/:id/claim` | POST | 2 | Factory claims next job |
+| `/fleet/jobs/:id/children` | POST | 3 | Add children to existing job |
+| `/fleet/jobs/:id/dag` | GET | 3 | Get DAG subtree |
+| `/fleet/factories` | GET | 2 | List factories |
+| `/fleet/factories/:id/heartbeat` | POST | 2 | Factory heartbeat |
+| `/fleet/budgets/:productId` | GET | 3 | Get budget |
+| `/fleet/budgets/:productId` | PUT | 3 | Upsert budget |
+| `/fleet/budgets/:productId/pause` | POST | 3 | Pause budget |
+| `/fleet/budgets/:productId/resume` | POST | 3 | Resume budget |
+
+## Architecture Decisions
+
+1. **Feature flags default OFF** — zero breaking changes to Phase 2 behavior
+2. **Budget checked first** — avoids expensive job scan when budget is exhausted
+3. **DAG via deps array** — reuses existing dependency resolution; no new scheduler logic needed
+4. **Preemption requires seat limit** — only triggers when factory genuinely can't take more work
+5. **UI degrades gracefully** — all API calls handle 404 → null/empty; no hard failures
diff --git a/docs/gigafactory-phase3-progress.md b/docs/gigafactory-phase3-progress.md
index 749c7135..9bab1175 100644
--- a/docs/gigafactory-phase3-progress.md
+++ b/docs/gigafactory-phase3-progress.md
@@ -1,12 +1,12 @@
 # Gigafactory Phase 3 — Progress
 
-| Slice | Name | Status | Commit | Verify Gate |
-| ----- | ------------------------------------ | ------- | ------ | ----------------------------------------------- |
-| 1 | Tunable scoring weights + preemption | DONE | TBD | 119 fleet tests ✅, full build ✅, pnpm test ✅ |
-| 2 | DAG job decomposition | WIP | — | — |
-| 3 | Per-product budgets | pending | — | — |
-| 4 | tracker-web Fleet Control Plane UI | pending | — | — |
-| 5 | Docs + roadmap | pending | — | — |
+| Slice | Name | Status | Commit | Verify Gate |
+| ----- | ------------------------------------ | ------ | -------- | ----------------------------------------------- |
+| 1 | Tunable scoring weights + preemption | DONE | 4a209e23 | 119 fleet tests ✅, full build ✅, pnpm test ✅ |
+| 2 | DAG job decomposition | DONE | 26606c85 | 127 fleet tests ✅, full build ✅, pnpm test ✅ |
+| 3 | Per-product budgets | DONE | fd1b18d7 | 134 fleet tests ✅, full build ✅, pnpm test ✅ |
+| 4 | tracker-web Fleet Control Plane UI | DONE | 39ade652 | 198 tracker-web tests ✅, full build ✅ |
+| 5 | Docs + roadmap | DONE | (this) | — |
 
 ## Slice 1 — Tunable scoring weights + preemption
 
@@ -25,7 +25,64 @@
 
 **Verify gate:** `pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet` → 119/119 ✅; `pnpm build && pnpm test` → all green
 
+## Slice 2 — DAG job decomposition
+
+**Key files:**
+
+- `services/platform-service/src/modules/fleet/types.ts` — `SubmitChildrenSchema`, added `children[]` to `SubmitJobSchema`
+- `services/platform-service/src/modules/fleet/repository.ts` — `listChildrenByParent()`
+- `services/platform-service/src/modules/fleet/coordinator.ts` — `maybeUnblockParent()`, `submitChildren()`, `getDagSubtree()`
+- `services/platform-service/src/modules/fleet/routes.ts` — POST /fleet/jobs/:id/children, GET /fleet/jobs/:id/dag
+
+**Design:** Children's idempotency keys added to parent's `deps[]`. Existing `unmetDeps()`/`stageForDeps()` logic handles blocking/unblocking. Atomic fan-out via `submitJob()` with `children[]` array.
+
+**Tests added:** 8 (DAG fan-out submit, child unblock parent, subtree retrieval)
+
+**Verify gate:** 127/127 fleet tests ✅; full build + test green
+
+## Slice 3 — Per-product budgets
+
+**Key files:**
+
+- `services/platform-service/src/modules/fleet/types.ts` — `FleetBudgetDoc`, `UpsertBudgetSchema`
+- `services/platform-service/src/modules/fleet/repository.ts` — budget CRUD (getBudget, upsertBudget, updateBudget)
+- `services/platform-service/src/modules/fleet/coordinator.ts` — `isBudgetsEnabled()`, budget enforcement in `claimNextJob`, `accrueSpend()` with auto-pause
+- `services/platform-service/src/modules/fleet/routes.ts` — GET/PUT /fleet/budgets/:productId, POST pause/resume
+
+**Flags:** `FLEET_BUDGETS` (default OFF)
+
+**Design:** Budget checked FIRST in claim loop — if paused or ceiling exceeded, immediately return null (no job scan). `accrueSpend()` auto-pauses when ceiling reached.
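The Slice 3 design above (budget gate before the job scan, auto-pause on accrual) can be sketched roughly as follows. This is a minimal illustration and not the coordinator's code: `BudgetDoc`, `budgetBlocksClaim`, `accrueSpendSketch`, and `claimNextJobSketch` are hypothetical names with simplified signatures, and it assumes that "ceiling exceeded" uses the same `spentUsd >= ceilingUsd` test as auto-pause.

```typescript
// Hypothetical budget shape; the real FleetBudgetDoc may carry more fields.
interface BudgetDoc {
  productId: string;
  ceilingUsd: number;
  spentUsd: number;
  paused: boolean;
}

// True when a claim must be refused before any job scan happens.
// Assumption: "ceiling exceeded" means spentUsd >= ceilingUsd, as in auto-pause.
function budgetBlocksClaim(budgetsEnabled: boolean, budget: BudgetDoc | undefined): boolean {
  if (!budgetsEnabled || budget === undefined) return false;
  return budget.paused || budget.spentUsd >= budget.ceilingUsd;
}

// Adds spend and flips `paused` once the ceiling is reached (returns a new doc).
function accrueSpendSketch(budget: BudgetDoc, amountUsd: number): BudgetDoc {
  const spentUsd = budget.spentUsd + amountUsd;
  return { ...budget, spentUsd, paused: budget.paused || spentUsd >= budget.ceilingUsd };
}

// Simplified claim: budget gate first, then take the head of the queue.
function claimNextJobSketch(
  budgetsEnabled: boolean,
  budget: BudgetDoc | undefined,
  queuedJobIds: string[],
): string | null {
  if (budgetBlocksClaim(budgetsEnabled, budget)) return null; // no job scan
  return queuedJobIds[0] ?? null;
}
```

With the flag off, the gate is skipped entirely, which matches the "zero breaking changes to Phase 2 behavior" intent: the same paused budget that returns `null` with budgets enabled still yields a job when `budgetsEnabled` is `false`.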
+
+**Tests added:** 7
+
+**Verify gate:** 134/134 fleet tests ✅; full build + test green
+
+## Slice 4 — tracker-web Fleet Control Plane UI
+
+**Key files:**
+
+- `dashboards/tracker-web/src/lib/fleet-client.ts` — Typed API client with graceful 404 → null degradation
+- `dashboards/tracker-web/src/app/api/fleet/[...path]/route.ts` — Proxy route to platform-service
+- `dashboards/tracker-web/src/app/dashboard/fleet/page.tsx` — Fleet overview (factory cards + recent jobs)
+- `dashboards/tracker-web/src/app/dashboard/fleet/jobs/page.tsx` — Job table with stage filter tabs
+- `dashboards/tracker-web/src/app/dashboard/fleet/jobs/[id]/page.tsx` — Job detail (events timeline, runs, artifacts, DAG, SHIP action)
+- `dashboards/tracker-web/src/app/dashboard/fleet/budget/page.tsx` — Budget panel (ceiling/spent bar, pause/resume)
+- `dashboards/tracker-web/src/app/dashboard/layout.tsx` — Added "Fleet" nav item
+
+**UI degrades gracefully:** If platform-service fleet module returns 404 (feature flags off), pages show informational empty states.
+
+**Tests added:** 16 (fleet-client unit tests, 198 total tracker-web)
+
+**Verify gate:** 198 tracker-web tests ✅; full build green
+
+## Slice 5 — Docs + roadmap
+
+See `docs/FLEET_CONTROL_PLANE.md` for the operational guide.
+
 ## Follow-ups
 
 - Weight registry could be loaded from Cosmos (per-product config doc) in a later phase
 - Seat limit enforcement is tied to FLEET_PREEMPTION flag; could be decoupled later
+- E2E Playwright tests for fleet UI (pending Playwright setup in CI)
+- Budget history/audit log endpoint
+- Real-time WebSocket updates for job stage transitions in the UI
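For the weight-registry follow-up, it may help to see what any registry source has to feed into. The sketch below illustrates the per-field weight merge and the scoring formula from the operational guide. Only the default values and the formula come from the guide; `ScoringWeights`, `JobSignals`, `mergeWeights`, `resolveWeights`, and `scoreJob` are hypothetical names, not the fleet module's actual exports.

```typescript
// Hypothetical types; field names follow the formula in the operational guide.
interface ScoringWeights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

interface JobSignals {
  ageMinutes: number;
  priorityOrder: number;
  attempts: number;
  capabilityBonus: number;
}

// Defaults as documented in the guide.
const DEFAULT_WEIGHTS: ScoringWeights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

// Per-field merge: only fields actually set in `override` replace the base value.
function mergeWeights(base: ScoringWeights, override?: Partial<ScoringWeights>): ScoringWeights {
  const merged: ScoringWeights = { ...base };
  for (const key of Object.keys(base) as (keyof ScoringWeights)[]) {
    const value = override?.[key];
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}

// Resolution order: defaults, then product registry entry, then per-request override.
function resolveWeights(
  productWeights?: Partial<ScoringWeights>,
  requestWeights?: Partial<ScoringWeights>,
): ScoringWeights {
  return mergeWeights(mergeWeights(DEFAULT_WEIGHTS, productWeights), requestWeights);
}

// score = w.age * ageMinutes + w.priority * priorityOrder
//       + w.retries * attempts + w.capabilities * capabilityBonus
function scoreJob(w: ScoringWeights, s: JobSignals): number {
  return (
    w.age * s.ageMinutes +
    w.priority * s.priorityOrder +
    w.retries * s.attempts +
    w.capabilities * s.capabilityBonus
  );
}
```

Because the merge is per field, a product entry of `{ priority: 20 }` combined with a request override of `{ age: 2 }` keeps the default `retries` and `capabilities` values. A registry loaded from another store would only need to supply the `Partial<ScoringWeights>` for each product.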