learning_ai_common_plat/docs/FLEET_CONTROL_PLANE.md
Saravanakumar D 325dfcae8e docs(fleet): Phase 3 operational guide + progress report (Slice 5)
- Created docs/FLEET_CONTROL_PLANE.md — full operational guide covering:
  - Feature flags (FLEET_PREEMPTION, FLEET_BUDGETS)
  - Tunable scoring weights + resolution order
  - Preemption rules and behavior
  - DAG job decomposition API
  - Per-product budgets with auto-pause
  - Fleet Control Plane UI pages and configuration
  - API reference summary
  - Architecture decisions
- Updated docs/gigafactory-phase3-progress.md — all 5 slices DONE with commit SHAs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:24 -07:00


# Fleet Control Plane — Operational Guide
> Phase 3 of the Agent Gigafactory. Adds tunable scoring, preemption, DAG decomposition, per-product budgets, and a tracker-web UI.
## Feature Flags
All Phase 3 features are **gated behind environment variables** (default OFF) for safe rollout:
| Flag | Default | Effect |
| ------------------ | ------- | ----------------------------------------------------------------------------- |
| `FLEET_PREEMPTION` | `""` | Enables seat-limit enforcement + critical-job preemption |
| `FLEET_BUDGETS` | `""` | Enables per-product USD ceiling enforcement. Pauses jobs when the budget is exceeded |
Set to any truthy value (`"1"`, `"true"`, `"yes"`) to enable.
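As a rough illustration of the flag check, here is a minimal sketch. The helper name `isFlagEnabled` is hypothetical, and restricting "truthy" to the three spellings listed above (case-insensitive, trimmed) is an assumption; the actual parser may accept other values.

```typescript
// Hypothetical helper, not the platform's actual flag parser.
// Assumption: only "1", "true", and "yes" count as truthy.
const TRUTHY_VALUES = new Set(["1", "true", "yes"]);

function isFlagEnabled(
  env: Record<string, string | undefined>,
  flagName: string,
): boolean {
  const raw = env[flagName];
  if (raw === undefined) return false; // unset means OFF
  return TRUTHY_VALUES.has(raw.trim().toLowerCase());
}

// Example environment: preemption on, budgets left at the default "".
const exampleEnv = { FLEET_PREEMPTION: "true", FLEET_BUDGETS: "" };
const preemptionOn = isFlagEnabled(exampleEnv, "FLEET_PREEMPTION"); // true
const budgetsOn = isFlagEnabled(exampleEnv, "FLEET_BUDGETS"); // false
```

Passing the environment in as an argument (rather than reading `process.env` directly) keeps the check easy to unit test.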
## Tunable Scoring Weights
Scoring determines which queued job a factory picks up next. The formula:
```
score = w.age * ageMinutes + w.priority * priorityOrder + w.retries * attempts + w.capabilities * capabilityBonus
```
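The formula can be sketched directly in code. The type and function names below are hypothetical, and the scales of `priorityOrder` and `capabilityBonus` are assumptions, since this guide does not define them. The weights used are the defaults listed in this guide.

```typescript
// Illustrative types; field names mirror the terms in the formula.
interface ScoringWeights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

interface ScoreInputs {
  ageMinutes: number;
  priorityOrder: number; // assumed: larger means more urgent
  attempts: number;
  capabilityBonus: number; // assumed: 1 when the factory matches, else 0
}

// Weighted sum, term for term as in the formula above.
function scoreJob(w: ScoringWeights, j: ScoreInputs): number {
  return (
    w.age * j.ageMinutes +
    w.priority * j.priorityOrder +
    w.retries * j.attempts +
    w.capabilities * j.capabilityBonus
  );
}

const defaultWeights: ScoringWeights = { age: 1, priority: 10, retries: -2, capabilities: 5 };
const exampleScore = scoreJob(defaultWeights, {
  ageMinutes: 30,
  priorityOrder: 2,
  attempts: 1,
  capabilityBonus: 1,
});
// 1*30 + 10*2 + (-2)*1 + 5*1 = 53
```

With the default weights, the negative `retries` weight lowers the score of jobs that have already been attempted.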
### Weight Resolution Order
1. **Per-request override**: `weights` field in `POST /fleet/jobs/:id/claim` body
2. **Product registry**: set via `setWeightRegistry({ [productId]: weights })`
3. **Defaults**: `{ age: 1, priority: 10, retries: -2, capabilities: 5 }`
Each level does a **per-field merge** (not full object replacement).
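A per-field merge across the three levels might look like the following sketch. `mergeWeights` and `resolveWeights` are hypothetical names, and the registry is modeled as a plain object rather than whatever `setWeightRegistry` stores internally.

```typescript
type Weights = { age: number; priority: number; retries: number; capabilities: number };

const DEFAULT_WEIGHTS: Weights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

// Per-field merge: a field missing from the patch keeps the base value.
function mergeWeights(base: Weights, patch: Partial<Weights> | undefined): Weights {
  return {
    age: patch?.age ?? base.age,
    priority: patch?.priority ?? base.priority,
    retries: patch?.retries ?? base.retries,
    capabilities: patch?.capabilities ?? base.capabilities,
  };
}

// Resolution order: defaults, then product registry, then per-request override.
function resolveWeights(
  registry: Record<string, Partial<Weights>>,
  productId: string,
  override?: Partial<Weights>,
): Weights {
  const withProduct = mergeWeights(DEFAULT_WEIGHTS, registry[productId]);
  return mergeWeights(withProduct, override);
}

const resolved = resolveWeights({ "product-a": { priority: 20 } }, "product-a", { age: 2 });
// resolved: { age: 2, priority: 20, retries: -2, capabilities: 5 }
```

Because each level only overrides the fields it specifies, a request that sets `age` still inherits the product's `priority` and the default `retries` and `capabilities`.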
## Preemption
When `FLEET_PREEMPTION` is enabled and a factory is at its `seatLimit`:
1. A critical-priority job arrives in `claimNextJob`
2. `selectPreemptionVictim(runningJobs, incomingJob)` picks the lowest-scoring running job
3. The victim is evicted: its lease is released with `checkpoint: true`, ensuring the job can resume
4. The critical job takes the freed seat
5. An event `{ type: 'preempted', victim, preemptor }` is recorded
**Rules:**
- Only `critical` priority can trigger preemption
- Never preempts jobs of equal or higher priority
- Capability mismatch disqualifies a factory from preemption
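The victim-selection rules can be sketched as below. This is not the actual `selectPreemptionVictim`; `pickVictim` is a hypothetical stand-in, the priority names other than `critical` and their ordering are assumptions, and capability matching is left out for brevity.

```typescript
// Assumed priority ladder; the guide only names `critical`.
type Priority = "low" | "normal" | "high" | "critical";
const PRIORITY_RANK: Record<Priority, number> = { low: 0, normal: 1, high: 2, critical: 3 };

interface RunningJob {
  id: string;
  priority: Priority;
  score: number;
}

function pickVictim(
  running: RunningJob[],
  incoming: { priority: Priority },
  seatLimit: number,
): RunningJob | null {
  // Preemption only applies when the factory is at its seat limit.
  if (running.length < seatLimit) return null;
  // Only critical jobs may trigger preemption.
  if (incoming.priority !== "critical") return null;
  // Never preempt jobs of equal or higher priority.
  const candidates = running.filter(
    (job) => PRIORITY_RANK[job.priority] < PRIORITY_RANK[incoming.priority],
  );
  if (candidates.length === 0) return null;
  // Evict the lowest-scoring candidate.
  return candidates.reduce((lowest, job) => (job.score < lowest.score ? job : lowest));
}

const runningJobs: RunningJob[] = [
  { id: "a", priority: "normal", score: 40 },
  { id: "b", priority: "low", score: 12 },
  { id: "c", priority: "critical", score: 5 },
];
const victim = pickVictim(runningJobs, { priority: "critical" }, 3);
// victim is job "b": job "c" scores lower but is itself critical, so it is exempt.
```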
## DAG Job Decomposition
Submit a composite job with children for parallel fan-out:
```http
POST /fleet/jobs
{
  "idempotencyKey": "parent-job",
  "kind": "composite",
  "children": [
    { "idempotencyKey": "child-1", "bodyMd": "..." },
    { "idempotencyKey": "child-2", "bodyMd": "..." }
  ]
}
```
Or add children later:
```http
POST /fleet/jobs/:parentId/children
{
  "children": [
    { "idempotencyKey": "child-3", "bodyMd": "..." }
  ]
}
```
**Behavior:**
- Parent is automatically blocked until all children complete (children's idempotency keys become parent deps)
- Children unblock parent via `maybeUnblockParent()` when transitioning to `shipped`/`done`
- View the full DAG: `GET /fleet/jobs/:id/dag`
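The blocking behavior can be illustrated with a small in-memory model. Everything here is a sketch under assumptions: the `Job` shape, the `blocked` and `queued` stage names, and the helper names are hypothetical, and `unblockIfReady` only approximates what `maybeUnblockParent()` is described as doing.

```typescript
interface Job {
  idempotencyKey: string;
  stage: string; // "blocked" and "queued" are assumed stage names
  deps: string[];
}

// Stages that count as complete, per the behavior notes above.
const COMPLETE_STAGES = new Set(["shipped", "done"]);

// The children's idempotency keys become the parent's deps.
function submitComposite(store: Map<string, Job>, parentKey: string, childKeys: string[]): void {
  for (const key of childKeys) {
    store.set(key, { idempotencyKey: key, stage: "queued", deps: [] });
  }
  store.set(parentKey, { idempotencyKey: parentKey, stage: "blocked", deps: [...childKeys] });
}

// Unblocks the parent once every dep has reached a complete stage.
function unblockIfReady(store: Map<string, Job>, parentKey: string): boolean {
  const parentJob = store.get(parentKey);
  if (!parentJob || parentJob.stage !== "blocked") return false;
  const allDone = parentJob.deps.every((key) => COMPLETE_STAGES.has(store.get(key)?.stage ?? ""));
  if (allDone) parentJob.stage = "queued";
  return allDone;
}

function completeChild(
  store: Map<string, Job>,
  parentKey: string,
  childKey: string,
  stage: "shipped" | "done",
): void {
  const child = store.get(childKey);
  if (!child) throw new Error(`unknown child: ${childKey}`);
  child.stage = stage;
  unblockIfReady(store, parentKey);
}

const jobStore = new Map<string, Job>();
submitComposite(jobStore, "parent-job", ["child-1", "child-2"]);
completeChild(jobStore, "parent-job", "child-1", "shipped"); // parent stays blocked
completeChild(jobStore, "parent-job", "child-2", "done"); // parent becomes queued
```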
## Per-Product Budgets
Control spend per product with USD ceilings:
```http
PUT /fleet/budgets/:productId
{ "ceilingUsd": 100, "window": "monthly" }
```
| Endpoint | Method | Effect |
| ---------------------------------- | ------ | ----------------------- |
| `/fleet/budgets/:productId` | GET | Read current budget |
| `/fleet/budgets/:productId` | PUT | Create/update ceiling |
| `/fleet/budgets/:productId/pause` | POST | Manually pause spending |
| `/fleet/budgets/:productId/resume` | POST | Resume spending |
**Enforcement:** When `FLEET_BUDGETS` is enabled, `claimNextJob` checks the product's budget status before scanning for jobs. If the budget is paused or the ceiling is exceeded, it returns `null` without scanning.
**Auto-pause:** `accrueSpend(productId, amount)` auto-pauses when `spentUsd >= ceilingUsd`.
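The enforcement and auto-pause rules might be modeled as follows. This is a sketch, not the actual `accrueSpend` or `claimNextJob`: the `Budget` shape and the helper names are hypothetical, and treating a product with no budget record as unrestricted is an assumption.

```typescript
interface Budget {
  ceilingUsd: number;
  spentUsd: number;
  paused: boolean;
}

// Adds spend and auto-pauses once spentUsd >= ceilingUsd.
function accrue(budget: Budget, amountUsd: number): Budget {
  const spentUsd = budget.spentUsd + amountUsd;
  return { ...budget, spentUsd, paused: budget.paused || spentUsd >= budget.ceilingUsd };
}

// Gate evaluated before any job scan.
function canClaim(budget: Budget | undefined, budgetsEnabled: boolean): boolean {
  if (!budgetsEnabled) return true; // flag off: Phase 2 behavior
  if (budget === undefined) return true; // assumption: no budget means no ceiling
  return !budget.paused && budget.spentUsd < budget.ceilingUsd;
}

let productBudget: Budget = { ceilingUsd: 100, spentUsd: 90, paused: false };
productBudget = accrue(productBudget, 5); // spentUsd 95, still claimable
const claimableAt95 = canClaim(productBudget, true); // true
productBudget = accrue(productBudget, 5); // spentUsd 100, auto-paused
const claimableAt100 = canClaim(productBudget, true); // false
```

Checking the gate first means a claim against an exhausted budget returns early and never pays for the job scan.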
## Fleet Control Plane UI (tracker-web)
Navigate to **Dashboard → Fleet** in tracker-web.
### Pages
| Route | Description |
| ---------------------------- | ----------------------------------------------- |
| `/dashboard/fleet` | Overview — factory health cards + recent jobs |
| `/dashboard/fleet/jobs` | Job list with stage filter tabs |
| `/dashboard/fleet/jobs/[id]` | Job detail — events, runs, artifacts, DAG, SHIP |
| `/dashboard/fleet/budget` | Budget view — spend bar, pause/resume controls |
### Graceful Degradation
The UI calls platform-service fleet endpoints through the `/api/fleet/[...path]` proxy. If the fleet module returns 404 (flags off), pages show informational empty states instead of errors.
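One way to implement the 404 handling is a thin wrapper around the proxy call, sketched below. The names `classifyFleetStatus`, `fleetGet`, and `Fetcher` are hypothetical, the fetcher is injected so the sketch stays self-contained, and throwing on non-404 failures is an assumption about how other errors are surfaced.

```typescript
// Minimal response shape so the sketch does not depend on DOM typings.
interface FleetResponse {
  status: number;
  json(): Promise<unknown>;
}
type Fetcher = (url: string) => Promise<FleetResponse>;
type FleetOutcome = "ok" | "empty" | "error";

// 404 means the fleet module is unavailable (flags off): render an empty state.
function classifyFleetStatus(code: number): FleetOutcome {
  if (code === 404) return "empty";
  return code >= 200 && code < 300 ? "ok" : "error";
}

async function fleetGet<T>(fetcher: Fetcher, path: string): Promise<T | null> {
  const res = await fetcher(`/api/fleet/${path}`);
  const outcome = classifyFleetStatus(res.status);
  if (outcome === "empty") return null; // caller shows an informational empty state
  if (outcome === "error") {
    // Assumption: other failures are raised to the caller.
    throw new Error(`fleet request failed with status ${res.status}`);
  }
  return (await res.json()) as T;
}
```

A page component would then treat a `null` result as "feature not enabled" and render its empty state rather than an error boundary.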
### Configuration
| Env Var | Default | Purpose |
| ------------------ | ----------------------- | ----------------------------------- |
| `PLATFORM_API_URL` | `http://localhost:4003` | Platform-service base URL for proxy |
## API Reference Summary
| Endpoint | Method | Phase | Notes |
| ---------------------------------- | ------ | ----- | -------------------------------------------------- |
| `/fleet/jobs` | GET | 2 | List jobs (query: stage, productId, limit, offset) |
| `/fleet/jobs` | POST | 2 | Submit job (+ optional children[] for DAG) |
| `/fleet/jobs/:id` | GET | 2 | Get job |
| `/fleet/jobs/:id` | PATCH | 2 | Update stage (fenced) |
| `/fleet/jobs/:id/claim` | POST | 2 | Factory claims next job |
| `/fleet/jobs/:id/children` | POST | 3 | Add children to existing job |
| `/fleet/jobs/:id/dag` | GET | 3 | Get DAG subtree |
| `/fleet/factories` | GET | 2 | List factories |
| `/fleet/factories/:id/heartbeat` | POST | 2 | Factory heartbeat |
| `/fleet/budgets/:productId` | GET | 3 | Get budget |
| `/fleet/budgets/:productId` | PUT | 3 | Upsert budget |
| `/fleet/budgets/:productId/pause` | POST | 3 | Pause budget |
| `/fleet/budgets/:productId/resume` | POST | 3 | Resume budget |
## Architecture Decisions
1. **Feature flags default OFF** — zero breaking changes to Phase 2 behavior
2. **Budget checked first** — avoids expensive job scan when budget is exhausted
3. **DAG via deps array** — reuses existing dependency resolution; no new scheduler logic needed
4. **Preemption requires seat limit** — only triggers when factory genuinely can't take more work
5. **UI degrades gracefully** — all API calls handle 404 → null/empty; no hard failures