# Fleet Control Plane — Operational Guide
> Phase 3 of the Agent Gigafactory. Adds tunable scoring, preemption, DAG decomposition, per-product budgets, and a tracker-web UI.

## Feature Flags

All Phase 3 features are **gated behind environment variables** (default OFF) for safe rollout:

| Flag               | Default | Effect                                                                        |
| ------------------ | ------- | ----------------------------------------------------------------------------- |
| `FLEET_PREEMPTION` | `""`    | Enables seat-limit enforcement + critical-job preemption                      |
| `FLEET_BUDGETS`    | `""`    | Enables per-product USD ceiling enforcement. Pauses jobs when budget exceeded |

Set to any truthy value (`"1"`, `"true"`, `"yes"`) to enable.

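As an illustration of the truthy-value rule, here is a minimal sketch of a flag check. The helper name `isFlagEnabled`, the case-insensitive comparison, and the choice to accept only the three listed values are assumptions; the guide does not show the actual parsing logic.

```typescript
// Hypothetical helper, not the project's real implementation.
// Assumption: only the three values listed in this guide count as truthy.
const TRUTHY_VALUES = new Set(["1", "true", "yes"]);

function isFlagEnabled(
  env: Record<string, string | undefined>,
  name: string,
): boolean {
  // Unset and empty values both fall back to "" (the documented default: OFF).
  const raw = (env[name] ?? "").trim().toLowerCase();
  return TRUTHY_VALUES.has(raw);
}

// Example environment: preemption on, budgets left at the default.
const exampleEnv: Record<string, string | undefined> = {
  FLEET_PREEMPTION: "1",
  FLEET_BUDGETS: "",
};

console.log(isFlagEnabled(exampleEnv, "FLEET_PREEMPTION")); // true
console.log(isFlagEnabled(exampleEnv, "FLEET_BUDGETS")); // false
```

Passing the environment as a plain record keeps the check easy to test; in a Node.js service the same function could presumably be called with `process.env`.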
## Tunable Scoring Weights

Scoring determines which queued job a factory picks up next. The formula:

```
score = w.age * ageMinutes + w.priority * priorityOrder + w.retries * attempts + w.capabilities * capabilityBonus
```

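The formula can be sketched directly in code. The interface names and the sample inputs below are hypothetical; the weights are the defaults this guide documents (`{ age: 1, priority: 10, retries: -2, capabilities: 5 }`). The sketch also assumes that a higher score means the job is picked sooner, which the guide implies but does not state.

```typescript
// Hypothetical type names; field names follow the formula in this guide.
interface ScoringWeights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

interface JobSignals {
  ageMinutes: number;
  priorityOrder: number;
  attempts: number;
  capabilityBonus: number;
}

// Default weights as documented in this guide.
const DEFAULT_WEIGHTS: ScoringWeights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

// Weighted sum of the four signals, term by term as in the formula.
function scoreJob(s: JobSignals, w: ScoringWeights = DEFAULT_WEIGHTS): number {
  return (
    w.age * s.ageMinutes +
    w.priority * s.priorityOrder +
    w.retries * s.attempts +
    w.capabilities * s.capabilityBonus
  );
}

// Sample job (made-up values): 30 minutes old, priority order 2, one prior attempt.
const sampleScore = scoreJob({ ageMinutes: 30, priorityOrder: 2, attempts: 1, capabilityBonus: 1 });
console.log(sampleScore); // 1*30 + 10*2 + (-2)*1 + 5*1 = 53
```

With the default weights, the negative `retries` weight lowers the score of jobs that have already been attempted, while age and priority raise it.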
### Weight Resolution Order

Weights are resolved in this order, highest precedence first:

1. **Per-request override**: `weights` field in the `POST /fleet/jobs/:id/claim` body
2. **Product registry**: set via `setWeightRegistry({ [productId]: weights })`
3. **Defaults**: `{ age: 1, priority: 10, retries: -2, capabilities: 5 }`

Each level does a **per-field merge** (not full object replacement), so a higher-precedence level overrides only the fields it actually sets.

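A minimal sketch of the per-field merge, under the assumption that each level supplies a partial weights object. The function name `resolveWeights` and the sample values are hypothetical; only the default weights and the precedence order come from this guide.

```typescript
interface ScoringWeights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

// Default weights as documented in this guide.
const DEFAULT_WEIGHTS: ScoringWeights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

const WEIGHT_FIELDS = ["age", "priority", "retries", "capabilities"] as const;

// Hypothetical resolver: request override beats registry entry beats defaults,
// decided independently for every field.
function resolveWeights(
  requestOverride: Partial<ScoringWeights> = {},
  registryEntry: Partial<ScoringWeights> = {},
): ScoringWeights {
  const resolved: ScoringWeights = { ...DEFAULT_WEIGHTS };
  for (const field of WEIGHT_FIELDS) {
    // `??` only skips null/undefined, so an explicit 0 is still honored.
    resolved[field] = requestOverride[field] ?? registryEntry[field] ?? DEFAULT_WEIGHTS[field];
  }
  return resolved;
}

// The registry sets two fields; the request overrides just one of them.
const merged = resolveWeights({ retries: 0 }, { priority: 20, retries: -5 });
console.log(merged); // { age: 1, priority: 20, retries: 0, capabilities: 5 }
```

Here `age` and `capabilities` fall through to the defaults, `priority` comes from the registry, and `retries` comes from the request, which is the behavior a full-object replacement would not give.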
## Preemption

When `FLEET_PREEMPTION` is enabled and a factory is at its `seatLimit`, preemption proceeds as follows:

1. A critical-priority job arrives in `claimNextJob`
2. `selectPreemptionVictim(runningJobs, incomingJob)` picks the lowest-scoring running job
3. The victim is evicted: its lease is released with `checkpoint: true`, so the job can resume later
4. The critical job takes the freed seat
5. An event `{ type: 'preempted', victim, preemptor }` is recorded

**Rules:**

- Only `critical`-priority jobs can trigger preemption
- Jobs of equal or higher priority are never preempted
- A capability mismatch disqualifies a factory from preemption

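The victim-selection step and the first two rules can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the project's actual `selectPreemptionVictim`: the priority names other than `critical`, the numeric ranks, and the precomputed `score` field are all hypothetical, and the capability check is left out.

```typescript
// Assumption: four priority levels; only "critical" is named in this guide.
type Priority = "low" | "normal" | "high" | "critical";

const PRIORITY_RANK: Record<Priority, number> = { low: 0, normal: 1, high: 2, critical: 3 };

interface RunningJob {
  id: string;
  priority: Priority;
  score: number; // assumed to be precomputed with the scoring formula
}

interface IncomingJob {
  id: string;
  priority: Priority;
}

// Returns the job to evict, or null when preemption is not allowed.
function selectPreemptionVictim(runningJobs: RunningJob[], incomingJob: IncomingJob): RunningJob | null {
  // Rule: only critical jobs can trigger preemption.
  if (incomingJob.priority !== "critical") return null;

  // Rule: never preempt jobs of equal or higher priority.
  const candidates = runningJobs.filter(
    (job) => PRIORITY_RANK[job.priority] < PRIORITY_RANK[incomingJob.priority],
  );
  if (candidates.length === 0) return null;

  // Pick the lowest-scoring remaining job.
  return candidates.reduce((lowest, job) => (job.score < lowest.score ? job : lowest));
}

const running: RunningJob[] = [
  { id: "a", priority: "high", score: 40 },
  { id: "b", priority: "normal", score: 12 },
  { id: "c", priority: "critical", score: 5 },
];

const victim = selectPreemptionVictim(running, { id: "urgent", priority: "critical" });
console.log(victim?.id); // "b": job "c" scores lower but is critical, so it is protected
```

Filtering by priority before comparing scores is what keeps a low-scoring critical job from being evicted by another critical job.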
## DAG Job Decomposition

Submit a composite job with children for parallel fan-out:

```http
POST /fleet/jobs
Content-Type: application/json

{
  "idempotencyKey": "parent-job",
  "kind": "composite",
  "children": [
    { "idempotencyKey": "child-1", "bodyMd": "..." },
    { "idempotencyKey": "child-2", "bodyMd": "..." }
  ]
}
```

Or add children later:

```http
POST /fleet/jobs/:parentId/children
Content-Type: application/json

{
  "children": [
    { "idempotencyKey": "child-3", "bodyMd": "..." }
  ]
}
```

**Behavior:**

- The parent is automatically blocked until all children complete (the children's idempotency keys become the parent's deps)
- Children unblock the parent via `maybeUnblockParent()` when they transition to `shipped`/`done`
- View the full DAG: `GET /fleet/jobs/:id/dag`

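The blocking behavior can be illustrated with a small in-memory sketch. Only `maybeUnblockParent`, the `deps` idea, and the `shipped`/`done` stages come from this guide; the `Job` shape, the other stage names, the `parentKey` back-reference, and the choice to move an unblocked parent to `queued` are assumptions made for the example.

```typescript
// Assumption: stage names other than "shipped" and "done" are illustrative.
type Stage = "blocked" | "queued" | "shipped" | "done";

interface Job {
  idempotencyKey: string;
  stage: Stage;
  deps: string[]; // idempotency keys this job waits on
  parentKey?: string; // hypothetical back-reference to the composite parent
}

const jobs = new Map<string, Job>();

// Register a composite parent whose deps are its children's idempotency keys.
function submitComposite(parentKey: string, childKeys: string[]): void {
  jobs.set(parentKey, { idempotencyKey: parentKey, stage: "blocked", deps: [...childKeys] });
  for (const key of childKeys) {
    jobs.set(key, { idempotencyKey: key, stage: "queued", deps: [], parentKey });
  }
}

// Unblock the parent once every dependency has reached shipped or done.
function maybeUnblockParent(childKey: string): void {
  const child = jobs.get(childKey);
  if (!child || child.parentKey === undefined) return;
  const parent = jobs.get(child.parentKey);
  if (!parent || parent.stage !== "blocked") return;
  const allComplete = parent.deps.every((dep) => {
    const stage = jobs.get(dep)?.stage;
    return stage === "shipped" || stage === "done";
  });
  if (allComplete) parent.stage = "queued"; // assumption: parent re-enters the queue
}

// Move a child to a terminal stage and re-check its parent.
function completeChild(childKey: string, stage: "shipped" | "done"): void {
  const child = jobs.get(childKey);
  if (!child) throw new Error(`Unknown job: ${childKey}`);
  child.stage = stage;
  maybeUnblockParent(childKey);
}

submitComposite("parent-job", ["child-1", "child-2"]);
completeChild("child-1", "shipped");
console.log(jobs.get("parent-job")?.stage); // "blocked": child-2 is still pending
completeChild("child-2", "done");
console.log(jobs.get("parent-job")?.stage); // "queued"
```

Because the parent simply waits on its `deps`, the same dependency check that already handles ordinary job dependencies can drive the fan-in, which matches the design note about reusing existing dependency resolution.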
## Per-Product Budgets

Control spend per product with USD ceilings:

```http
PUT /fleet/budgets/:productId
Content-Type: application/json

{ "ceilingUsd": 100, "window": "monthly" }
```

| Endpoint                           | Method | Effect                  |
| ---------------------------------- | ------ | ----------------------- |
| `/fleet/budgets/:productId`        | GET    | Read current budget     |
| `/fleet/budgets/:productId`        | PUT    | Create/update ceiling   |
| `/fleet/budgets/:productId/pause`  | POST   | Manually pause spending |
| `/fleet/budgets/:productId/resume` | POST   | Resume spending         |

**Enforcement:** When `FLEET_BUDGETS` is enabled, `claimNextJob` checks budget status first. If the budget is paused or the ceiling has been reached, it returns null without scanning for jobs.

**Auto-pause:** `accrueSpend(productId, amount)` auto-pauses the budget when `spentUsd >= ceilingUsd`.

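A minimal in-memory sketch of the auto-pause and budget-first check described above. The `accrueSpend` name, `spentUsd`, and `ceilingUsd` come from this guide; the `Budget` shape, the `paused` field, the `canClaim` helper, and the treatment of products without a budget as unrestricted are assumptions. The `window` setting is ignored here.

```typescript
// Hypothetical in-memory budget record.
interface Budget {
  ceilingUsd: number;
  spentUsd: number;
  paused: boolean;
}

const budgets = new Map<string, Budget>();

// Add spend and auto-pause once the ceiling is reached (spentUsd >= ceilingUsd).
function accrueSpend(productId: string, amount: number): Budget | undefined {
  const budget = budgets.get(productId);
  if (!budget) return undefined; // assumption: no budget configured means nothing to track
  budget.spentUsd += amount;
  if (budget.spentUsd >= budget.ceilingUsd) budget.paused = true;
  return budget;
}

// The cheap check a claim path could run before scanning for jobs.
function canClaim(productId: string): boolean {
  const budget = budgets.get(productId);
  if (!budget) return true; // assumption: products without a budget are unrestricted
  return !budget.paused && budget.spentUsd < budget.ceilingUsd;
}

budgets.set("acme", { ceilingUsd: 100, spentUsd: 0, paused: false });

accrueSpend("acme", 60);
console.log(canClaim("acme")); // true: 60 of 100 USD spent

accrueSpend("acme", 40);
console.log(canClaim("acme")); // false: ceiling reached, budget auto-paused
```

Running the budget check before any job scan means an exhausted product costs one map lookup per claim attempt rather than a full queue traversal.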
## Fleet Control Plane UI (tracker-web)

Navigate to **Dashboard → Fleet** in tracker-web.

### Pages

| Route                        | Description                                      |
| ---------------------------- | ------------------------------------------------ |
| `/dashboard/fleet`           | Overview: factory health cards + recent jobs     |
| `/dashboard/fleet/jobs`      | Job list with stage filter tabs                  |
| `/dashboard/fleet/jobs/[id]` | Job detail: events, runs, artifacts, DAG, SHIP   |
| `/dashboard/fleet/budget`    | Budget view: spend bar, pause/resume controls    |

### Graceful Degradation

The UI calls the platform-service fleet endpoints through the `/api/fleet/[...path]` proxy. If the fleet module returns 404 (flags off), pages display informational empty states instead of errors.

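The 404 handling can be sketched as a small fetch wrapper. The helper name `fetchFleet`, the `FleetResponse` and `FetchLike` types, and the decision to throw on other non-2xx statuses are assumptions; only the `/api/fleet/` proxy path and the 404-to-empty-state behavior come from this guide. The transport is injected so the example runs without a server.

```typescript
// Minimal response shape used by this sketch; the global fetch Response should satisfy it.
interface FleetResponse {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}

type FetchLike = (url: string) => Promise<FleetResponse>;

// Hypothetical helper: maps a 404 to null so a page can render an empty state.
async function fetchFleet<T>(path: string, fetchImpl: FetchLike): Promise<T | null> {
  const res = await fetchImpl(`/api/fleet/${path}`);
  if (res.status === 404) return null; // fleet module disabled (flags off)
  if (!res.ok) throw new Error(`Fleet request failed with status ${res.status}`);
  return (await res.json()) as T;
}

// Stubbed transport standing in for the proxy when the flags are off.
const flagsOff: FetchLike = async () => ({
  status: 404,
  ok: false,
  json: async () => ({}),
});

fetchFleet<unknown[]>("jobs", flagsOff).then((jobList) => {
  console.log(jobList === null ? "Fleet is not enabled" : `${jobList.length} jobs`);
});
// prints "Fleet is not enabled"
```

Returning `null` rather than throwing lets each page decide how to word its empty state, while unexpected statuses still surface as errors.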
### Configuration

| Env Var            | Default                 | Purpose                             |
| ------------------ | ----------------------- | ----------------------------------- |
| `PLATFORM_API_URL` | `http://localhost:4003` | Platform-service base URL for proxy |

## API Reference Summary

| Endpoint                           | Method | Phase | Notes                                              |
| ---------------------------------- | ------ | ----- | -------------------------------------------------- |
| `/fleet/jobs`                      | GET    | 2     | List jobs (query: stage, productId, limit, offset) |
| `/fleet/jobs`                      | POST   | 2     | Submit job (+ optional children[] for DAG)         |
| `/fleet/jobs/:id`                  | GET    | 2     | Get job                                            |
| `/fleet/jobs/:id`                  | PATCH  | 2     | Update stage (fenced)                              |
| `/fleet/jobs/:id/claim`            | POST   | 2     | Factory claims next job                            |
| `/fleet/jobs/:id/children`         | POST   | 3     | Add children to existing job                       |
| `/fleet/jobs/:id/dag`              | GET    | 3     | Get DAG subtree                                    |
| `/fleet/factories`                 | GET    | 2     | List factories                                     |
| `/fleet/factories/:id/heartbeat`   | POST   | 2     | Factory heartbeat                                  |
| `/fleet/budgets/:productId`        | GET    | 3     | Get budget                                         |
| `/fleet/budgets/:productId`        | PUT    | 3     | Upsert budget                                      |
| `/fleet/budgets/:productId/pause`  | POST   | 3     | Pause budget                                       |
| `/fleet/budgets/:productId/resume` | POST   | 3     | Resume budget                                      |

## Architecture Decisions

1. **Feature flags default OFF**: zero breaking changes to Phase 2 behavior
2. **Budget checked first**: avoids an expensive job scan when the budget is exhausted
3. **DAG via deps array**: reuses existing dependency resolution; no new scheduler logic needed
4. **Preemption requires seat limit**: only triggers when a factory genuinely can't take more work
5. **UI degrades gracefully**: all API calls map a 404 to null/empty; no hard failures