docs(fleet): Phase 3 operational guide + progress report (Slice 5)

- Created docs/FLEET_CONTROL_PLANE.md — full operational guide covering:
  - Feature flags (FLEET_PREEMPTION, FLEET_BUDGETS)
  - Tunable scoring weights + resolution order
  - Preemption rules and behavior
  - DAG job decomposition API
  - Per-product budgets with auto-pause
  - Fleet Control Plane UI pages and configuration
  - API reference summary
  - Architecture decisions
- Updated docs/gigafactory-phase3-progress.md — all 5 slices DONE with commit SHAs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Saravanakumar D 2026-05-30 09:48:55 -07:00
parent b061cc47f3
commit 325dfcae8e
2 changed files with 212 additions and 7 deletions

docs/FLEET_CONTROL_PLANE.md Normal file

@@ -0,0 +1,148 @@
# Fleet Control Plane — Operational Guide
> Phase 3 of the Agent Gigafactory. Adds tunable scoring, preemption, DAG decomposition, per-product budgets, and a tracker-web UI.
## Feature Flags
All Phase 3 features are **gated behind environment variables** (default OFF) for safe rollout:
| Flag | Default | Effect |
| ------------------ | ------- | ----------------------------------------------------------------------------- |
| `FLEET_PREEMPTION` | `""` | Enables seat-limit enforcement + critical-job preemption |
| `FLEET_BUDGETS` | `""` | Enables per-product USD ceiling enforcement. Pauses jobs when budget exceeded |
Set to any truthy value (`"1"`, `"true"`, `"yes"`) to enable.
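A minimal sketch of how such a flag check might look, assuming that "truthy" means any non-empty string. The helper name `isFlagEnabled` and the `env` record are hypothetical; the actual parsing in platform-service is not shown in this guide and may be stricter (for example, it might reject `"0"` or `"false"`).

```typescript
// Hypothetical helper, not the service's real implementation.
// Assumption: any non-empty string enables the flag.
function isFlagEnabled(raw: string | undefined): boolean {
  // Unset or empty (the documented default "") keeps the feature off.
  return raw !== undefined && raw.trim() !== "";
}

// Stand-in for the process environment, so the sketch runs anywhere.
const env: Record<string, string | undefined> = {
  FLEET_PREEMPTION: "1",
  FLEET_BUDGETS: "",
};

const preemptionOn = isFlagEnabled(env["FLEET_PREEMPTION"]);
const budgetsOn = isFlagEnabled(env["FLEET_BUDGETS"]);
```

With these inputs, preemption is on and budgets stay off, which matches a staged rollout where each feature is enabled independently.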
## Tunable Scoring Weights
Scoring determines which queued job a factory picks up next. The formula:
```
score = w.age * ageMinutes + w.priority * priorityOrder + w.retries * attempts + w.capabilities * capabilityBonus
```
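The formula can be sketched as a pure function. The interface and function names below (`ScoringWeights`, `ScoreInputs`, `scoreJob`) are illustrative assumptions; the weight values are the defaults listed under Weight Resolution Order.

```typescript
// Illustrative types; the real fleet module may name these differently.
interface ScoringWeights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

interface ScoreInputs {
  ageMinutes: number;
  priorityOrder: number;
  attempts: number;
  capabilityBonus: number;
}

// Default weights as documented in this guide.
const DEFAULT_WEIGHTS: ScoringWeights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

// Direct transcription of the scoring formula.
function scoreJob(w: ScoringWeights, j: ScoreInputs): number {
  return (
    w.age * j.ageMinutes +
    w.priority * j.priorityOrder +
    w.retries * j.attempts +
    w.capabilities * j.capabilityBonus
  );
}

// 1*30 + 10*3 + (-2)*1 + 5*1 = 63
const exampleScore = scoreJob(DEFAULT_WEIGHTS, {
  ageMinutes: 30,
  priorityOrder: 3,
  attempts: 1,
  capabilityBonus: 1,
});
```

Note that the default `retries` weight is negative, so each additional attempt lowers a job's score.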
### Weight Resolution Order
1. **Per-request override**: `weights` field in `POST /fleet/jobs/:id/claim` body
2. **Product registry**: set via `setWeightRegistry({ [productId]: weights })`
3. **Defaults**: `{ age: 1, priority: 10, retries: -2, capabilities: 5 }`
Each level does a **per-field merge** (not full object replacement).
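A per-field merge means a level that sets only one weight leaves the others untouched. The sketch below illustrates this under assumed names (`mergeWeights`, `resolveWeights`); it is not the module's actual code.

```typescript
interface ScoringWeights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

const DEFAULT_WEIGHTS: ScoringWeights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

// Copy `base`, then overwrite only the fields that `patch` actually defines.
function mergeWeights(base: ScoringWeights, patch?: Partial<ScoringWeights>): ScoringWeights {
  const merged: ScoringWeights = { ...base };
  if (!patch) return merged;
  for (const key of Object.keys(merged) as (keyof ScoringWeights)[]) {
    const value = patch[key];
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}

// Lowest precedence first: defaults, then product registry, then per-request override.
function resolveWeights(
  registry?: Partial<ScoringWeights>,
  override?: Partial<ScoringWeights>,
): ScoringWeights {
  return mergeWeights(mergeWeights(DEFAULT_WEIGHTS, registry), override);
}

// Registry tunes `age`, the request tunes `priority`; the rest fall back to defaults.
const resolved = resolveWeights({ age: 2 }, { priority: 20 });
```

Here `resolved` is `{ age: 2, priority: 20, retries: -2, capabilities: 5 }`. A full-object replacement would instead have dropped the registry's `age` as soon as the request supplied any override.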
## Preemption
When `FLEET_PREEMPTION` is enabled and a factory is at its `seatLimit`:
1. A critical-priority job arrives in `claimNextJob`
2. `selectPreemptionVictim(runningJobs, incomingJob)` picks the lowest-scoring running job
3. The victim is evicted: its lease is released with `checkpoint: true`, ensuring the job can resume
4. The critical job takes the freed seat
5. An event `{ type: 'preempted', victim, preemptor }` is recorded
**Rules:**
- Only `critical` priority can trigger preemption
- Never preempts jobs of equal or higher priority
- Capability mismatch disqualifies a factory from preemption
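The victim-selection rules above can be sketched as follows. This is an illustration under stated assumptions, not the real `selectPreemptionVictim`: the job shapes, the precomputed `score` field, and every priority name other than `critical` are hypothetical, and the capability check is omitted.

```typescript
// Assumed priority ladder; the guide only names `critical`.
type Priority = "low" | "normal" | "high" | "critical";
const PRIORITY_RANK: Record<Priority, number> = { low: 0, normal: 1, high: 2, critical: 3 };

// Hypothetical shapes; `score` is assumed to be computed beforehand.
interface RunningJob {
  id: string;
  priority: Priority;
  score: number;
}

interface IncomingJob {
  id: string;
  priority: Priority;
}

function selectPreemptionVictim(
  runningJobs: RunningJob[],
  incomingJob: IncomingJob,
): RunningJob | null {
  // Rule: only critical jobs may trigger preemption.
  if (incomingJob.priority !== "critical") return null;

  // Rule: never preempt a job of equal or higher priority.
  const candidates = runningJobs.filter(
    (job) => PRIORITY_RANK[job.priority] < PRIORITY_RANK[incomingJob.priority],
  );
  if (candidates.length === 0) return null;

  // Pick the lowest-scoring eligible job.
  return candidates.reduce((lowest, job) => (job.score < lowest.score ? job : lowest));
}
```

Returning `null` when no eligible victim exists lets the caller fall back to normal behavior at the seat limit rather than evicting a peer-priority job.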
## DAG Job Decomposition
Submit a composite job with children for parallel fan-out:
```http
POST /fleet/jobs
Content-Type: application/json

{
  "idempotencyKey": "parent-job",
  "kind": "composite",
  "children": [
    { "idempotencyKey": "child-1", "bodyMd": "..." },
    { "idempotencyKey": "child-2", "bodyMd": "..." }
  ]
}
```
Or add children later:
```http
POST /fleet/jobs/:parentId/children
Content-Type: application/json

{
  "children": [
    { "idempotencyKey": "child-3", "bodyMd": "..." }
  ]
}
```
**Behavior:**
- Parent is automatically blocked until all children complete (children's idempotency keys become parent deps)
- Children unblock parent via `maybeUnblockParent()` when transitioning to `shipped`/`done`
- View the full DAG: `GET /fleet/jobs/:id/dag`
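The blocking behavior can be illustrated with an in-memory sketch. Only the `shipped` and `done` stages and the `deps` field come from this guide; the other stage names, the `Job` shape, and the choice of `queued` as the unblocked stage are assumptions, and the real `maybeUnblockParent()` works against the repository rather than a `Map`.

```typescript
// Assumed stage names, apart from "shipped" and "done".
type Stage = "queued" | "blocked" | "running" | "shipped" | "done";

interface Job {
  idempotencyKey: string;
  stage: Stage;
  deps: string[]; // for a parent: its children's idempotency keys
}

const COMPLETE: ReadonlySet<Stage> = new Set<Stage>(["shipped", "done"]);

// Dependencies that are missing or not yet complete.
function unmetDeps(job: Job, jobsByKey: Map<string, Job>): string[] {
  return job.deps.filter((key) => {
    const dep = jobsByKey.get(key);
    return dep === undefined || !COMPLETE.has(dep.stage);
  });
}

// Returns the parent, moved to "queued" (assumed) once every child is complete.
function maybeUnblockParent(parent: Job, jobsByKey: Map<string, Job>): Job {
  if (parent.stage !== "blocked") return parent;
  return unmetDeps(parent, jobsByKey).length === 0 ? { ...parent, stage: "queued" } : parent;
}
```

Because the parent simply carries its children in `deps`, no DAG-specific scheduler state is needed: unblocking is the same dependency check used for any other job.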
## Per-Product Budgets
Control spend per product with USD ceilings:
```http
PUT /fleet/budgets/:productId
Content-Type: application/json

{ "ceilingUsd": 100, "window": "monthly" }
```
| Endpoint | Method | Effect |
| ---------------------------------- | ------ | ----------------------- |
| `/fleet/budgets/:productId` | GET | Read current budget |
| `/fleet/budgets/:productId` | PUT | Create/update ceiling |
| `/fleet/budgets/:productId/pause` | POST | Manually pause spending |
| `/fleet/budgets/:productId/resume` | POST | Resume spending |
**Enforcement:** When `FLEET_BUDGETS` is enabled, `claimNextJob` checks budget status FIRST. If the budget is paused or the ceiling has been reached, it returns `null` without scanning for jobs.
**Auto-pause:** `accrueSpend(productId, amount)` auto-pauses when `spentUsd >= ceilingUsd`.
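A minimal sketch of the enforcement and auto-pause rules, written as pure functions over a budget record. The `Budget` shape, the `paused` field name, and the helper `canClaim` are assumptions; the real `accrueSpend(productId, amount)` takes a product ID and persists the result, which this sketch does not attempt.

```typescript
// Assumed document shape; `ceilingUsd` and `spentUsd` are named in this guide.
interface Budget {
  productId: string;
  ceilingUsd: number;
  spentUsd: number;
  paused: boolean;
}

// Pure variant of accrueSpend: add the amount, then pause once spentUsd >= ceilingUsd.
function accrueSpend(budget: Budget, amountUsd: number): Budget {
  const spentUsd = budget.spentUsd + amountUsd;
  return { ...budget, spentUsd, paused: budget.paused || spentUsd >= budget.ceilingUsd };
}

// Hypothetical gate run before any job scan in the claim path.
function canClaim(budgetsEnabled: boolean, budget: Budget | null): boolean {
  // Flag off, or no budget configured: nothing to enforce.
  if (!budgetsEnabled || budget === null) return true;
  return !budget.paused && budget.spentUsd < budget.ceilingUsd;
}

const before: Budget = { productId: "demo", ceilingUsd: 100, spentUsd: 90, paused: false };
const after = accrueSpend(before, 10); // reaches the ceiling, so it auto-pauses
```

With `FLEET_BUDGETS` on, `canClaim(true, after)` is `false`, so the claim path can return early; with the flag off, the same budget does not block claims.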
## Fleet Control Plane UI (tracker-web)
Navigate to **Dashboard → Fleet** in tracker-web.
### Pages
| Route | Description |
| ---------------------------- | ----------------------------------------------- |
| `/dashboard/fleet` | Overview — factory health cards + recent jobs |
| `/dashboard/fleet/jobs` | Job list with stage filter tabs |
| `/dashboard/fleet/jobs/[id]` | Job detail — events, runs, artifacts, DAG, SHIP |
| `/dashboard/fleet/budget` | Budget view — spend bar, pause/resume controls |
### Graceful Degradation
The UI calls platform-service fleet endpoints through the `/api/fleet/[...path]` proxy route. If the fleet module returns 404 (flags off), pages display informational empty states instead of errors.
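The 404 handling can be sketched as a small response-unwrapping step. The function names below are hypothetical and this is not the code in `fleet-client.ts`; in particular, throwing on other non-2xx statuses is an assumption, since this guide only describes the 404 case.

```typescript
// Map an HTTP status and parsed body to a value the UI can render.
// 404 means "fleet feature not available" and becomes null, not an error.
function unwrapFleetResponse<T>(status: number, body: T): T | null {
  if (status === 404) return null;
  if (status < 200 || status >= 300) {
    // Assumption: other failures still surface as errors.
    throw new Error("Fleet API request failed with status " + status);
  }
  return body;
}

// List views can collapse null into an empty list and show an empty state.
function listOrEmpty<T>(status: number, body: T[]): T[] {
  return unwrapFleetResponse(status, body) ?? [];
}

const jobsWhenEnabled = listOrEmpty(200, ["job-1", "job-2"]);
const jobsWhenDisabled = listOrEmpty(404, []);
```

A fetch wrapper would call `unwrapFleetResponse(res.status, await res.json())` for successful reads and skip body parsing on 404; pages then branch on `null` or an empty list rather than catching exceptions.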
### Configuration
| Env Var | Default | Purpose |
| ------------------ | ----------------------- | ----------------------------------- |
| `PLATFORM_API_URL` | `http://localhost:4003` | Platform-service base URL for proxy |
## API Reference Summary
| Endpoint | Method | Phase | Notes |
| ---------------------------------- | ------ | ----- | -------------------------------------------------- |
| `/fleet/jobs` | GET | 2 | List jobs (query: stage, productId, limit, offset) |
| `/fleet/jobs` | POST | 2 | Submit job (+ optional children[] for DAG) |
| `/fleet/jobs/:id` | GET | 2 | Get job |
| `/fleet/jobs/:id` | PATCH | 2 | Update stage (fenced) |
| `/fleet/jobs/:id/claim` | POST | 2 | Factory claims next job |
| `/fleet/jobs/:id/children` | POST | 3 | Add children to existing job |
| `/fleet/jobs/:id/dag` | GET | 3 | Get DAG subtree |
| `/fleet/factories` | GET | 2 | List factories |
| `/fleet/factories/:id/heartbeat` | POST | 2 | Factory heartbeat |
| `/fleet/budgets/:productId` | GET | 3 | Get budget |
| `/fleet/budgets/:productId` | PUT | 3 | Upsert budget |
| `/fleet/budgets/:productId/pause` | POST | 3 | Pause budget |
| `/fleet/budgets/:productId/resume` | POST | 3 | Resume budget |
## Architecture Decisions
1. **Feature flags default OFF** — zero breaking changes to Phase 2 behavior
2. **Budget checked first** — avoids expensive job scan when budget is exhausted
3. **DAG via deps array** — reuses existing dependency resolution; no new scheduler logic needed
4. **Preemption requires seat limit** — only triggers when factory genuinely can't take more work
5. **UI degrades gracefully** — all API calls handle 404 → null/empty; no hard failures

docs/gigafactory-phase3-progress.md

@@ -1,12 +1,12 @@
# Gigafactory Phase 3 — Progress
| Slice | Name | Status | Commit | Verify Gate |
-| ----- | ------------------------------------ | ------- | ------ | ----------------------------------------------- |
-| 1 | Tunable scoring weights + preemption | DONE | TBD | 119 fleet tests ✅, full build ✅, pnpm test ✅ |
-| 2 | DAG job decomposition | WIP | — | — |
-| 3 | Per-product budgets | pending | — | — |
-| 4 | tracker-web Fleet Control Plane UI | pending | — | — |
-| 5 | Docs + roadmap | pending | — | — |
+| ----- | ------------------------------------ | ------ | -------- | ----------------------------------------------- |
+| 1 | Tunable scoring weights + preemption | DONE | 4a209e23 | 119 fleet tests ✅, full build ✅, pnpm test ✅ |
+| 2 | DAG job decomposition | DONE | 26606c85 | 127 fleet tests ✅, full build ✅, pnpm test ✅ |
+| 3 | Per-product budgets | DONE | fd1b18d7 | 134 fleet tests ✅, full build ✅, pnpm test ✅ |
+| 4 | tracker-web Fleet Control Plane UI | DONE | 39ade652 | 198 tracker-web tests ✅, full build ✅ |
+| 5 | Docs + roadmap | DONE | (this) | — |
## Slice 1 — Tunable scoring weights + preemption
@@ -25,7 +25,64 @@
**Verify gate:** `pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet` → 119/119 ✅; `pnpm build && pnpm test` → all green
## Slice 2 — DAG job decomposition
**Key files:**
- `services/platform-service/src/modules/fleet/types.ts`: `SubmitChildrenSchema`, added `children[]` to `SubmitJobSchema`
- `services/platform-service/src/modules/fleet/repository.ts`: `listChildrenByParent()`
- `services/platform-service/src/modules/fleet/coordinator.ts`: `maybeUnblockParent()`, `submitChildren()`, `getDagSubtree()`
- `services/platform-service/src/modules/fleet/routes.ts`: POST /fleet/jobs/:id/children, GET /fleet/jobs/:id/dag
**Design:** Children's idempotency keys added to parent's `deps[]`. Existing `unmetDeps()`/`stageForDeps()` logic handles blocking/unblocking. Atomic fan-out via `submitJob()` with `children[]` array.
**Tests added:** 8 (DAG fan-out submit, child unblock parent, subtree retrieval)
**Verify gate:** 127/127 fleet tests ✅; full build + test green
## Slice 3 — Per-product budgets
**Key files:**
- `services/platform-service/src/modules/fleet/types.ts`: `FleetBudgetDoc`, `UpsertBudgetSchema`
- `services/platform-service/src/modules/fleet/repository.ts`: budget CRUD (getBudget, upsertBudget, updateBudget)
- `services/platform-service/src/modules/fleet/coordinator.ts`: `isBudgetsEnabled()`, budget enforcement in `claimNextJob`, `accrueSpend()` with auto-pause
- `services/platform-service/src/modules/fleet/routes.ts`: GET/PUT /fleet/budgets/:productId, POST pause/resume
**Flags:** `FLEET_BUDGETS` (default OFF)
**Design:** Budget checked FIRST in claim loop — if paused or ceiling exceeded, immediately return null (no job scan). `accrueSpend()` auto-pauses when ceiling reached.
**Tests added:** 7
**Verify gate:** 134/134 fleet tests ✅; full build + test green
## Slice 4 — tracker-web Fleet Control Plane UI
**Key files:**
- `dashboards/tracker-web/src/lib/fleet-client.ts` — Typed API client with graceful 404 → null degradation
- `dashboards/tracker-web/src/app/api/fleet/[...path]/route.ts` — Proxy route to platform-service
- `dashboards/tracker-web/src/app/dashboard/fleet/page.tsx` — Fleet overview (factory cards + recent jobs)
- `dashboards/tracker-web/src/app/dashboard/fleet/jobs/page.tsx` — Job table with stage filter tabs
- `dashboards/tracker-web/src/app/dashboard/fleet/jobs/[id]/page.tsx` — Job detail (events timeline, runs, artifacts, DAG, SHIP action)
- `dashboards/tracker-web/src/app/dashboard/fleet/budget/page.tsx` — Budget panel (ceiling/spent bar, pause/resume)
- `dashboards/tracker-web/src/app/dashboard/layout.tsx` — Added "Fleet" nav item
**UI degrades gracefully:** If platform-service fleet module returns 404 (feature flags off), pages show informational empty states.
**Tests added:** 16 (fleet-client unit tests, 198 total tracker-web)
**Verify gate:** 198 tracker-web tests ✅; full build green
## Slice 5 — Docs + roadmap
See `docs/FLEET_CONTROL_PLANE.md` for the operational guide.
## Follow-ups
- Weight registry could be loaded from Cosmos (per-product config doc) in a later phase
- Seat limit enforcement is tied to FLEET_PREEMPTION flag; could be decoupled later
- E2E Playwright tests for fleet UI (pending Playwright setup in CI)
- Budget history/audit log endpoint
- Real-time WebSocket updates for job stage transitions in the UI