docs(fleet): Phase 3 operational guide + progress report (Slice 5)

- Created docs/FLEET_CONTROL_PLANE.md, a full operational guide covering:
  - Feature flags (FLEET_PREEMPTION, FLEET_BUDGETS)
  - Tunable scoring weights + resolution order
  - Preemption rules and behavior
  - DAG job decomposition API
  - Per-product budgets with auto-pause
  - Fleet Control Plane UI pages and configuration
  - API reference summary
  - Architecture decisions
- Updated docs/gigafactory-phase3-progress.md: all 5 slices DONE with commit SHAs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
parent b061cc47f3
commit 325dfcae8e
148
docs/FLEET_CONTROL_PLANE.md
Normal file
@@ -0,0 +1,148 @@
# Fleet Control Plane: Operational Guide

> Phase 3 of the Agent Gigafactory. Adds tunable scoring, preemption, DAG decomposition, per-product budgets, and a tracker-web UI.

## Feature Flags

All Phase 3 features are **gated behind environment variables** (default OFF) for safe rollout:

| Flag               | Default | Effect                                                                                |
| ------------------ | ------- | ------------------------------------------------------------------------------------- |
| `FLEET_PREEMPTION` | `""`    | Enables seat-limit enforcement and critical-job preemption                            |
| `FLEET_BUDGETS`    | `""`    | Enables per-product USD ceiling enforcement; pauses jobs when the budget is exceeded  |

Set a flag to any truthy value (`"1"`, `"true"`, `"yes"`) to enable it.
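
As a rough illustration of the flag check, here is a minimal sketch. It assumes that only the three spellings listed above count as truthy and that matching is case-insensitive; the guide says "any truthy value", so the real parser may accept more. `isFlagEnabled` and the `Env` type are hypothetical names, not functions from the fleet module.

```typescript
// Hypothetical helper: not the actual implementation in the fleet module.
type Env = Record<string, string | undefined>;

// Assumption: only these spellings count as truthy (case-insensitive).
const TRUTHY = new Set(["1", "true", "yes"]);

function isFlagEnabled(env: Env, name: string): boolean {
  const raw = env[name];
  if (raw === undefined) return false; // unset means OFF
  return TRUTHY.has(raw.trim().toLowerCase());
}

// Usage with an explicit env object (a Node service would pass process.env).
const env: Env = { FLEET_PREEMPTION: "1", FLEET_BUDGETS: "" };
console.log(isFlagEnabled(env, "FLEET_PREEMPTION")); // true
console.log(isFlagEnabled(env, "FLEET_BUDGETS")); // false
```

Passing the environment in as a parameter keeps the check easy to unit test without mutating global state.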

## Tunable Scoring Weights

Scoring determines which queued job a factory picks up next. The formula:

```
score = w.age * ageMinutes + w.priority * priorityOrder + w.retries * attempts + w.capabilities * capabilityBonus
```

### Weight Resolution Order

1. **Per-request override**: `weights` field in the `POST /fleet/jobs/:id/claim` body
2. **Product registry**: set via `setWeightRegistry({ [productId]: weights })`
3. **Defaults**: `{ age: 1, priority: 10, retries: -2, capabilities: 5 }`

Each level does a **per-field merge** (not a full object replacement), so a level overrides only the fields it sets.
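
The resolution order and the formula can be sketched together. This is a minimal illustration under stated assumptions, not the fleet module's code: `ScoringWeights`, `ScoreInputs`, `resolveWeights`, and `scoreJob` are hypothetical names, and the only values taken from the guide are the default weights and the formula itself.

```typescript
// Illustrative types; the real definitions in the fleet module may differ.
interface ScoringWeights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

interface ScoreInputs {
  ageMinutes: number;
  priorityOrder: number;
  attempts: number;
  capabilityBonus: number;
}

// Defaults as documented in the guide.
const DEFAULT_WEIGHTS: ScoringWeights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

// Per-field merge: layers are applied lowest precedence first,
// and a layer overrides only the fields it actually defines.
function resolveWeights(...layers: Array<Partial<ScoringWeights> | undefined>): ScoringWeights {
  const resolved: ScoringWeights = { ...DEFAULT_WEIGHTS };
  for (const layer of layers) {
    if (!layer) continue;
    for (const key of Object.keys(resolved) as Array<keyof ScoringWeights>) {
      const value = layer[key];
      if (value !== undefined) resolved[key] = value;
    }
  }
  return resolved;
}

// Direct transcription of the scoring formula.
function scoreJob(w: ScoringWeights, j: ScoreInputs): number {
  return (
    w.age * j.ageMinutes +
    w.priority * j.priorityOrder +
    w.retries * j.attempts +
    w.capabilities * j.capabilityBonus
  );
}

// Registry sets priority, the per-request override sets age; the rest stay default.
const weights = resolveWeights({ priority: 20 }, { age: 2 });
const score = scoreJob(weights, { ageMinutes: 30, priorityOrder: 3, attempts: 1, capabilityBonus: 1 });
console.log(weights); // { age: 2, priority: 20, retries: -2, capabilities: 5 }
console.log(score); // 2*30 + 20*3 + (-2)*1 + 5*1 = 123
```

Because the default `retries` weight is negative, each extra attempt lowers a job's score in this sketch.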

## Preemption

When `FLEET_PREEMPTION` is enabled and a factory is at its `seatLimit`:

1. A critical-priority job arrives in `claimNextJob`
2. `selectPreemptionVictim(runningJobs, incomingJob)` picks the lowest-scoring running job
3. The victim is evicted: its lease is released with `checkpoint: true`, ensuring the job can resume
4. The critical job takes the freed seat
5. An event `{ type: 'preempted', victim, preemptor }` is recorded

**Rules:**

- Only `critical` priority can trigger preemption
- Never preempts jobs of equal or higher priority
- Capability mismatch disqualifies a factory from preemption
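
The victim-selection rules above can be sketched as follows. This is an assumption-laden illustration, not the real `selectPreemptionVictim`: the function name `selectVictimSketch`, the priority names other than `critical`, and the precomputed `score` field are all hypothetical, and the capability check is left out.

```typescript
// Assumption: priority names other than "critical" are illustrative.
type Priority = "low" | "normal" | "high" | "critical";

const PRIORITY_RANK: Record<Priority, number> = { low: 0, normal: 1, high: 2, critical: 3 };

interface RunningJob {
  id: string;
  priority: Priority;
  score: number; // assumed to be precomputed with the scoring formula
}

interface IncomingJob {
  id: string;
  priority: Priority;
}

// Returns the job to evict, or null when preemption is not allowed.
function selectVictimSketch(running: RunningJob[], incoming: IncomingJob): RunningJob | null {
  // Rule 1: only critical jobs can trigger preemption.
  if (incoming.priority !== "critical") return null;

  // Rule 2: never preempt jobs of equal or higher priority.
  const candidates = running.filter(
    (job) => PRIORITY_RANK[job.priority] < PRIORITY_RANK[incoming.priority],
  );
  if (candidates.length === 0) return null;

  // Pick the lowest-scoring remaining job.
  return candidates.reduce((lowest, job) => (job.score < lowest.score ? job : lowest));
}

const running: RunningJob[] = [
  { id: "a", priority: "normal", score: 40 },
  { id: "b", priority: "high", score: 10 },
  { id: "c", priority: "critical", score: 1 },
];
const victim = selectVictimSketch(running, { id: "x", priority: "critical" });
console.log(victim?.id); // "b": job "c" scores lower but is critical, so it is exempt
```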

## DAG Job Decomposition

Submit a composite job with children for parallel fan-out:

```http
POST /fleet/jobs
{
  "idempotencyKey": "parent-job",
  "kind": "composite",
  "children": [
    { "idempotencyKey": "child-1", "bodyMd": "..." },
    { "idempotencyKey": "child-2", "bodyMd": "..." }
  ]
}
```

Or add children later:

```http
POST /fleet/jobs/:parentId/children
{
  "children": [
    { "idempotencyKey": "child-3", "bodyMd": "..." }
  ]
}
```

**Behavior:**

- The parent is automatically blocked until all children complete (the children's idempotency keys become the parent's deps)
- Children unblock the parent via `maybeUnblockParent()` when they transition to `shipped`/`done`
- View the full DAG: `GET /fleet/jobs/:id/dag`
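
The blocking behavior can be illustrated with an in-memory sketch. It assumes a simplified job shape and stage names: only `shipped` and `done` come from the guide, while `blocked`, `queued`, and `running` are placeholders. `unmetDepsSketch` and `completeChild` are hypothetical stand-ins for the real dependency helpers and `maybeUnblockParent()`.

```typescript
// Assumption: only "shipped" and "done" are named in the guide; other stages are placeholders.
type Stage = "blocked" | "queued" | "running" | "shipped" | "done";

interface Job {
  idempotencyKey: string;
  stage: Stage;
  deps: string[]; // idempotency keys this job waits on
  parentKey?: string;
}

const COMPLETE: ReadonlySet<Stage> = new Set<Stage>(["shipped", "done"]);

// Deps that are missing or not yet complete.
function unmetDepsSketch(job: Job, jobs: Map<string, Job>): string[] {
  return job.deps.filter((key) => {
    const dep = jobs.get(key);
    return !dep || !COMPLETE.has(dep.stage);
  });
}

// Marks a child done, then unblocks its parent if no deps remain unmet.
function completeChild(childKey: string, jobs: Map<string, Job>): void {
  const child = jobs.get(childKey);
  if (!child) return;
  child.stage = "done";
  const parent = child.parentKey ? jobs.get(child.parentKey) : undefined;
  if (parent && parent.stage === "blocked" && unmetDepsSketch(parent, jobs).length === 0) {
    parent.stage = "queued";
  }
}

const jobs = new Map<string, Job>([
  ["parent-job", { idempotencyKey: "parent-job", stage: "blocked", deps: ["child-1", "child-2"] }],
  ["child-1", { idempotencyKey: "child-1", stage: "queued", deps: [], parentKey: "parent-job" }],
  ["child-2", { idempotencyKey: "child-2", stage: "queued", deps: [], parentKey: "parent-job" }],
]);

completeChild("child-1", jobs);
const stageAfterFirst = jobs.get("parent-job")?.stage; // still "blocked"
completeChild("child-2", jobs);
const stageAfterSecond = jobs.get("parent-job")?.stage; // now "queued"
console.log(stageAfterFirst, stageAfterSecond);
```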

## Per-Product Budgets

Control spend per product with USD ceilings:

```http
PUT /fleet/budgets/:productId
{ "ceilingUsd": 100, "window": "monthly" }
```

| Endpoint                           | Method | Effect                  |
| ---------------------------------- | ------ | ----------------------- |
| `/fleet/budgets/:productId`        | GET    | Read current budget     |
| `/fleet/budgets/:productId`        | PUT    | Create/update ceiling   |
| `/fleet/budgets/:productId/pause`  | POST   | Manually pause spending |
| `/fleet/budgets/:productId/resume` | POST   | Resume spending         |

**Enforcement:** When `FLEET_BUDGETS` is enabled, `claimNextJob` checks budget status first. If the budget is paused or its ceiling is exceeded, it returns `null` without scanning for jobs.

**Auto-pause:** `accrueSpend(productId, amount)` auto-pauses the budget when `spentUsd >= ceilingUsd`.
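
The enforcement and auto-pause rules can be sketched as two pure functions. This is an illustration only: the `Budget` shape, `accrueSpendSketch`, and `canClaim` are hypothetical, and the sketch assumes that a product without a budget document is unrestricted, which the guide does not state.

```typescript
// Illustrative shape; the real FleetBudgetDoc may carry more fields (for example the window).
interface Budget {
  productId: string;
  ceilingUsd: number;
  spentUsd: number;
  paused: boolean;
}

// Adds spend and auto-pauses once spentUsd >= ceilingUsd.
function accrueSpendSketch(budget: Budget, amountUsd: number): Budget {
  const spentUsd = budget.spentUsd + amountUsd;
  return { ...budget, spentUsd, paused: budget.paused || spentUsd >= budget.ceilingUsd };
}

// Budget gate evaluated before any job scan.
function canClaim(budget: Budget | undefined): boolean {
  if (!budget) return true; // assumption: no budget document means no restriction
  return !budget.paused && budget.spentUsd < budget.ceilingUsd;
}

const before: Budget = { productId: "prod-1", ceilingUsd: 100, spentUsd: 90, paused: false };
const after = accrueSpendSketch(before, 15);
console.log(canClaim(before)); // true
console.log(after.spentUsd, after.paused, canClaim(after)); // 105 true false
```

Returning a new object instead of mutating the input keeps the sketch easy to test; a real repository-backed version would persist the update.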

## Fleet Control Plane UI (tracker-web)

Navigate to **Dashboard → Fleet** in tracker-web.

### Pages

| Route                        | Description                                    |
| ---------------------------- | ---------------------------------------------- |
| `/dashboard/fleet`           | Overview: factory health cards + recent jobs   |
| `/dashboard/fleet/jobs`      | Job list with stage filter tabs                |
| `/dashboard/fleet/jobs/[id]` | Job detail: events, runs, artifacts, DAG, SHIP |
| `/dashboard/fleet/budget`    | Budget view: spend bar, pause/resume controls  |

### Graceful Degradation

The UI calls platform-service fleet endpoints through the `/api/fleet/[...path]` proxy route. If the fleet module returns 404 (flags off), pages display informational empty states instead of errors.
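
A client helper that degrades on 404 might look like the sketch below. It is not the contents of `fleet-client.ts`: `fleetGet`, `Fetcher`, and `MinimalResponse` are hypothetical names, and the fetch function is injected so the example runs without a network.

```typescript
// Minimal response shape so the sketch runs without a real fetch implementation.
interface MinimalResponse {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}

type Fetcher = (url: string) => Promise<MinimalResponse>;

// Resolves to null on 404 so pages can render an empty state instead of an error.
async function fleetGet<T>(fetcher: Fetcher, path: string): Promise<T | null> {
  const res = await fetcher(`/api/fleet/${path}`);
  if (res.status === 404) return null; // fleet module unavailable (flags off)
  if (!res.ok) throw new Error(`Fleet request failed with status ${res.status}`);
  return (await res.json()) as T;
}

// Stub fetchers standing in for the browser fetch.
const disabledFleet: Fetcher = async () => ({ status: 404, ok: false, json: async () => ({}) });
const enabledFleet: Fetcher = async () => ({
  status: 200,
  ok: true,
  json: async () => [{ id: "job-1" }],
});

fleetGet<Array<{ id: string }>>(disabledFleet, "jobs").then((jobs) => {
  console.log(jobs === null ? "Fleet is not enabled" : `${jobs.length} job(s)`);
});
```

Other non-2xx statuses still throw in this sketch, so genuine failures are not silently hidden behind the empty state.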

### Configuration

| Env Var            | Default                 | Purpose                             |
| ------------------ | ----------------------- | ----------------------------------- |
| `PLATFORM_API_URL` | `http://localhost:4003` | Platform-service base URL for proxy |
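
To show how the base URL might be used, here is a small sketch of upstream URL construction. It assumes that `/api/fleet/<path>` maps to `<PLATFORM_API_URL>/fleet/<path>`; that mapping is not spelled out in the guide, and `buildUpstreamUrl` is a hypothetical helper rather than code from the proxy route.

```typescript
// Documented default for PLATFORM_API_URL.
const DEFAULT_PLATFORM_API_URL = "http://localhost:4003";

// Hypothetical helper: joins the catch-all path segments onto the platform-service base URL.
function buildUpstreamUrl(
  pathSegments: string[],
  query: string = "",
  baseUrl: string = DEFAULT_PLATFORM_API_URL,
): string {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const path = pathSegments.map((segment) => encodeURIComponent(segment)).join("/");
  const suffix = query ? `?${query}` : "";
  return `${base}/fleet/${path}${suffix}`;
}

console.log(buildUpstreamUrl(["jobs"], "stage=queued&limit=20"));
// http://localhost:4003/fleet/jobs?stage=queued&limit=20
console.log(buildUpstreamUrl(["budgets", "prod-1", "pause"], "", "http://platform:4003/"));
// http://platform:4003/fleet/budgets/prod-1/pause
```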

## API Reference Summary

| Endpoint                           | Method | Phase | Notes                                              |
| ---------------------------------- | ------ | ----- | -------------------------------------------------- |
| `/fleet/jobs`                      | GET    | 2     | List jobs (query: stage, productId, limit, offset) |
| `/fleet/jobs`                      | POST   | 2     | Submit job (+ optional children[] for DAG)         |
| `/fleet/jobs/:id`                  | GET    | 2     | Get job                                            |
| `/fleet/jobs/:id`                  | PATCH  | 2     | Update stage (fenced)                              |
| `/fleet/jobs/:id/claim`            | POST   | 2     | Factory claims next job                            |
| `/fleet/jobs/:id/children`         | POST   | 3     | Add children to existing job                       |
| `/fleet/jobs/:id/dag`              | GET    | 3     | Get DAG subtree                                    |
| `/fleet/factories`                 | GET    | 2     | List factories                                     |
| `/fleet/factories/:id/heartbeat`   | POST   | 2     | Factory heartbeat                                  |
| `/fleet/budgets/:productId`        | GET    | 3     | Get budget                                         |
| `/fleet/budgets/:productId`        | PUT    | 3     | Upsert budget                                      |
| `/fleet/budgets/:productId/pause`  | POST   | 3     | Pause budget                                       |
| `/fleet/budgets/:productId/resume` | POST   | 3     | Resume budget                                      |

## Architecture Decisions

1. **Feature flags default OFF**: zero breaking changes to Phase 2 behavior
2. **Budget checked first**: avoids an expensive job scan when the budget is exhausted
3. **DAG via deps array**: reuses existing dependency resolution; no new scheduler logic needed
4. **Preemption requires seat limit**: only triggers when a factory genuinely can't take more work
5. **UI degrades gracefully**: all API calls map 404 to null/empty; no hard failures
@@ -1,12 +1,12 @@
# Gigafactory Phase 3: Progress

| Slice | Name                                 | Status  | Commit | Verify Gate                                     |
| ----- | ------------------------------------ | ------- | ------ | ----------------------------------------------- |
| 1     | Tunable scoring weights + preemption | DONE    | TBD    | 119 fleet tests ✅, full build ✅, pnpm test ✅ |
| 2     | DAG job decomposition                | WIP     | -      | -                                               |
| 3     | Per-product budgets                  | pending | -      | -                                               |
| 4     | tracker-web Fleet Control Plane UI   | pending | -      | -                                               |
| 5     | Docs + roadmap                       | pending | -      | -                                               |

| Slice | Name                                 | Status | Commit   | Verify Gate                                     |
| ----- | ------------------------------------ | ------ | -------- | ----------------------------------------------- |
| 1     | Tunable scoring weights + preemption | DONE   | 4a209e23 | 119 fleet tests ✅, full build ✅, pnpm test ✅ |
| 2     | DAG job decomposition                | DONE   | 26606c85 | 127 fleet tests ✅, full build ✅, pnpm test ✅ |
| 3     | Per-product budgets                  | DONE   | fd1b18d7 | 134 fleet tests ✅, full build ✅, pnpm test ✅ |
| 4     | tracker-web Fleet Control Plane UI   | DONE   | 39ade652 | 198 tracker-web tests ✅, full build ✅         |
| 5     | Docs + roadmap                       | DONE   | (this)   | -                                               |

## Slice 1: Tunable scoring weights + preemption

@@ -25,7 +25,64 @@

**Verify gate:** `pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet` → 119/119 ✅; `pnpm build && pnpm test` → all green

## Slice 2: DAG job decomposition

**Key files:**

- `services/platform-service/src/modules/fleet/types.ts`: `SubmitChildrenSchema`, added `children[]` to `SubmitJobSchema`
- `services/platform-service/src/modules/fleet/repository.ts`: `listChildrenByParent()`
- `services/platform-service/src/modules/fleet/coordinator.ts`: `maybeUnblockParent()`, `submitChildren()`, `getDagSubtree()`
- `services/platform-service/src/modules/fleet/routes.ts`: `POST /fleet/jobs/:id/children`, `GET /fleet/jobs/:id/dag`

**Design:** Children's idempotency keys are added to the parent's `deps[]`. Existing `unmetDeps()`/`stageForDeps()` logic handles blocking/unblocking. Atomic fan-out via `submitJob()` with a `children[]` array.

**Tests added:** 8 (DAG fan-out submit, child unblock parent, subtree retrieval)

**Verify gate:** 127/127 fleet tests ✅; full build + test green

## Slice 3: Per-product budgets

**Key files:**

- `services/platform-service/src/modules/fleet/types.ts`: `FleetBudgetDoc`, `UpsertBudgetSchema`
- `services/platform-service/src/modules/fleet/repository.ts`: budget CRUD (`getBudget`, `upsertBudget`, `updateBudget`)
- `services/platform-service/src/modules/fleet/coordinator.ts`: `isBudgetsEnabled()`, budget enforcement in `claimNextJob`, `accrueSpend()` with auto-pause
- `services/platform-service/src/modules/fleet/routes.ts`: `GET`/`PUT /fleet/budgets/:productId`, `POST` pause/resume

**Flags:** `FLEET_BUDGETS` (default OFF)

**Design:** The budget is checked first in the claim loop: if it is paused or its ceiling is exceeded, the claim immediately returns `null` (no job scan). `accrueSpend()` auto-pauses when the ceiling is reached.

**Tests added:** 7

**Verify gate:** 134/134 fleet tests ✅; full build + test green

## Slice 4: tracker-web Fleet Control Plane UI

**Key files:**

- `dashboards/tracker-web/src/lib/fleet-client.ts`: typed API client with graceful 404 → null degradation
- `dashboards/tracker-web/src/app/api/fleet/[...path]/route.ts`: proxy route to platform-service
- `dashboards/tracker-web/src/app/dashboard/fleet/page.tsx`: fleet overview (factory cards + recent jobs)
- `dashboards/tracker-web/src/app/dashboard/fleet/jobs/page.tsx`: job table with stage filter tabs
- `dashboards/tracker-web/src/app/dashboard/fleet/jobs/[id]/page.tsx`: job detail (events timeline, runs, artifacts, DAG, SHIP action)
- `dashboards/tracker-web/src/app/dashboard/fleet/budget/page.tsx`: budget panel (ceiling/spent bar, pause/resume)
- `dashboards/tracker-web/src/app/dashboard/layout.tsx`: added "Fleet" nav item

**UI degrades gracefully:** If the platform-service fleet module returns 404 (feature flags off), pages show informational empty states.

**Tests added:** 16 (fleet-client unit tests; 198 tracker-web tests total)

**Verify gate:** 198 tracker-web tests ✅; full build green

## Slice 5: Docs + roadmap

See `docs/FLEET_CONTROL_PLANE.md` for the operational guide.

## Follow-ups

- Weight registry could be loaded from Cosmos (per-product config doc) in a later phase
- Seat limit enforcement is tied to the `FLEET_PREEMPTION` flag; could be decoupled later
- E2E Playwright tests for fleet UI (pending Playwright setup in CI)
- Budget history/audit log endpoint
- Real-time WebSocket updates for job stage transitions in the UI