- Created docs/FLEET_CONTROL_PLANE.md, a full operational guide covering:
  - Feature flags (FLEET_PREEMPTION, FLEET_BUDGETS)
  - Tunable scoring weights + resolution order
  - Preemption rules and behavior
  - DAG job decomposition API
  - Per-product budgets with auto-pause
  - Fleet Control Plane UI pages and configuration
  - API reference summary
  - Architecture decisions
- Updated docs/gigafactory-phase3-progress.md: all 5 slices DONE with commit SHAs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fleet Control Plane — Operational Guide
Phase 3 of the Agent Gigafactory. Adds tunable scoring, preemption, DAG decomposition, per-product budgets, and a tracker-web UI.
Feature Flags
All Phase 3 features are gated behind environment variables (default OFF) for safe rollout:
| Flag | Default | Effect |
|---|---|---|
| `FLEET_PREEMPTION` | `""` | Enables seat-limit enforcement + critical-job preemption |
| `FLEET_BUDGETS` | `""` | Enables per-product USD ceiling enforcement. Pauses jobs when budget exceeded |
Set to any truthy value ("1", "true", "yes") to enable.
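As a rough illustration of how such a flag check might look, here is a minimal sketch. The helper name `isFlagEnabled` is hypothetical, and the trimming and case-insensitive matching are assumptions. The guide lists three example values and does not say whether other non-empty strings also count, so this sketch accepts only the listed ones.

```typescript
// Hypothetical helper, not the service's actual flag parser.
// Accepts only the values listed in this guide: "1", "true", "yes".
const TRUTHY_VALUES = ["1", "true", "yes"];

function isFlagEnabled(value: string | undefined): boolean {
  // Unset or empty means the feature stays OFF (the documented default).
  if (value === undefined) return false;
  // Assumption: surrounding whitespace and letter case are ignored.
  return TRUTHY_VALUES.includes(value.trim().toLowerCase());
}

// Example call site (in Node): isFlagEnabled(process.env.FLEET_PREEMPTION)
```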
Tunable Scoring Weights
Scoring determines which queued job a factory picks up next. The formula:
score = w.age * ageMinutes + w.priority * priorityOrder + w.retries * attempts + w.capabilities * capabilityBonus
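To make the formula concrete, the sketch below evaluates it for one job using the default weights listed in this guide. The type names and the `scoreJob` function are hypothetical, and how the scheduler derives `ageMinutes`, `priorityOrder`, and `capabilityBonus` is not specified here, so the input values are invented for illustration.

```typescript
interface Weights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

// Field names mirror the terms of the formula above.
interface ScoreInputs {
  ageMinutes: number;
  priorityOrder: number;
  attempts: number;
  capabilityBonus: number;
}

// Default weights as documented in this guide.
const DEFAULT_WEIGHTS: Weights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

// Hypothetical function name; a direct transcription of the formula.
function scoreJob(job: ScoreInputs, w: Weights = DEFAULT_WEIGHTS): number {
  return (
    w.age * job.ageMinutes +
    w.priority * job.priorityOrder +
    w.retries * job.attempts +
    w.capabilities * job.capabilityBonus
  );
}

// Illustrative inputs: 1*30 + 10*2 + (-2)*1 + 5*1 = 53
const exampleScore = scoreJob({ ageMinutes: 30, priorityOrder: 2, attempts: 1, capabilityBonus: 1 });
```

With the default weights, each retry lowers the score by 2, so a job that keeps failing gradually loses ground to fresher work.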
Weight Resolution Order
- Per-request override: `weights` field in the `POST /fleet/jobs/:id/claim` body
- Product registry: set via `setWeightRegistry({ [productId]: weights })`
- Defaults: `{ age: 1, priority: 10, retries: -2, capabilities: 5 }`
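Each level does a per-field merge over the levels below it rather than replacing the whole object, so an override that sets only `age` keeps the remaining weights from the product registry or the defaults.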
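The per-field merge can be sketched as follows. This is an illustration under assumptions, not the service's code: `resolveWeights` and `dropUndefined` are hypothetical names, and the registry entry and request override are passed in directly instead of being looked up.

```typescript
type Weights = { age: number; priority: number; retries: number; capabilities: number };

// Defaults as documented in this guide.
const DEFAULTS: Weights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

// Remove fields explicitly set to undefined so they cannot clobber a lower level.
function dropUndefined(partial: Partial<Weights>): Partial<Weights> {
  const out: Partial<Weights> = {};
  for (const key of Object.keys(partial) as (keyof Weights)[]) {
    if (partial[key] !== undefined) out[key] = partial[key];
  }
  return out;
}

// Hypothetical resolver: defaults < product registry < per-request override.
function resolveWeights(
  registryEntry: Partial<Weights> = {},
  requestOverride: Partial<Weights> = {}
): Weights {
  // Later spreads win field by field; untouched fields fall through.
  return { ...DEFAULTS, ...dropUndefined(registryEntry), ...dropUndefined(requestOverride) };
}

// Registry sets priority, request sets age, the rest come from defaults.
const resolved = resolveWeights({ priority: 20 }, { age: 3 });
// resolved is { age: 3, priority: 20, retries: -2, capabilities: 5 }
```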
Preemption
When FLEET_PREEMPTION is enabled and a factory is at its seatLimit:
- A critical-priority job arrives in `claimNextJob`
- `selectPreemptionVictim(runningJobs, incomingJob)` picks the lowest-scoring running job
- The victim is evicted: its lease is released with `checkpoint: true`, ensuring the job can resume
- The critical job takes the freed seat
- An event `{ type: 'preempted', victim, preemptor }` is recorded
Rules:
- Only `critical` priority can trigger preemption
- Never preempts jobs of equal or higher priority
- Capability mismatch disqualifies a factory from preemption
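The victim-selection rules above could be sketched roughly like this. The function reuses the name `selectPreemptionVictim` from the steps above, but its body, the `RunningJob` shape, and every priority name other than `critical` are assumptions made for illustration. The capability check happens at the factory level and is not modeled here.

```typescript
// Assumption: only "critical" is named in this guide; the other levels are illustrative.
type Priority = "low" | "normal" | "high" | "critical";

interface RunningJob {
  id: string;
  priority: Priority;
  score: number; // assumed to be precomputed with the scoring formula
}

const PRIORITY_RANK: Record<Priority, number> = { low: 0, normal: 1, high: 2, critical: 3 };

// Sketch only; the real implementation may differ.
function selectPreemptionVictim(
  runningJobs: RunningJob[],
  incomingJob: { priority: Priority }
): RunningJob | null {
  // Rule: only critical jobs can trigger preemption.
  if (incomingJob.priority !== "critical") return null;
  // Rule: never preempt jobs of equal or higher priority.
  const candidates = runningJobs.filter(
    (job) => PRIORITY_RANK[job.priority] < PRIORITY_RANK[incomingJob.priority]
  );
  if (candidates.length === 0) return null;
  // Pick the lowest-scoring remaining job.
  return candidates.reduce((lowest, job) => (job.score < lowest.score ? job : lowest));
}

const running: RunningJob[] = [
  { id: "a", priority: "high", score: 50 },
  { id: "b", priority: "normal", score: 10 },
  { id: "c", priority: "critical", score: 1 },
];
const victim = selectPreemptionVictim(running, { priority: "critical" });
// victim is job "b": job "c" scores lower but is itself critical, so it is exempt
```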
DAG Job Decomposition
Submit a composite job with children for parallel fan-out:
POST /fleet/jobs
{
  "idempotencyKey": "parent-job",
  "kind": "composite",
  "children": [
    { "idempotencyKey": "child-1", "bodyMd": "..." },
    { "idempotencyKey": "child-2", "bodyMd": "..." }
  ]
}
Or add children later:
POST /fleet/jobs/:parentId/children
{
  "children": [
    { "idempotencyKey": "child-3", "bodyMd": "..." }
  ]
}
Behavior:
- Parent is automatically blocked until all children complete (children's idempotency keys become parent deps)
- Children unblock the parent via `maybeUnblockParent()` when transitioning to `shipped`/`done`
- View the full DAG: `GET /fleet/jobs/:id/dag`
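The blocking rule can be illustrated with a small in-memory check. This is not the actual `maybeUnblockParent()`; `isBlocked` and the `Job` shape are hypothetical, and the sketch assumes a dependency that cannot be found still counts as unfinished.

```typescript
interface Job {
  idempotencyKey: string;
  stage: string;
  deps: string[]; // idempotency keys of the jobs this one waits on
}

// Stages that release a parent, per the behavior list above.
const COMPLETED_STAGES = ["shipped", "done"];

// Hypothetical helper: a job is blocked while any dep is missing or unfinished.
function isBlocked(job: Job, jobsByKey: Map<string, Job>): boolean {
  return job.deps.some((key) => {
    const dep = jobsByKey.get(key);
    return dep === undefined || !COMPLETED_STAGES.includes(dep.stage);
  });
}

// "running" is an assumed stage name used only for this example.
const child1: Job = { idempotencyKey: "child-1", stage: "done", deps: [] };
const child2: Job = { idempotencyKey: "child-2", stage: "running", deps: [] };
const parent: Job = { idempotencyKey: "parent-job", stage: "blocked", deps: ["child-1", "child-2"] };

const jobs = new Map<string, Job>([
  [child1.idempotencyKey, child1],
  [child2.idempotencyKey, child2],
]);

const blockedBefore = isBlocked(parent, jobs); // true: child-2 is still running
child2.stage = "shipped";
const blockedAfter = isBlocked(parent, jobs); // false: every child has completed
```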
Per-Product Budgets
Control spend per product with USD ceilings:
PUT /fleet/budgets/:productId
{ "ceilingUsd": 100, "window": "monthly" }
| Endpoint | Method | Effect |
|---|---|---|
| `/fleet/budgets/:productId` | GET | Read current budget |
| `/fleet/budgets/:productId` | PUT | Create/update ceiling |
| `/fleet/budgets/:productId/pause` | POST | Manually pause spending |
| `/fleet/budgets/:productId/resume` | POST | Resume spending |
Enforcement: When FLEET_BUDGETS is enabled, claimNextJob checks budget status first. If the budget is paused or the ceiling is exceeded, it returns null without scanning for jobs.
Auto-pause: accrueSpend(productId, amount) auto-pauses when spentUsd >= ceilingUsd.
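A minimal sketch of the auto-pause and claim-gate logic follows. The `Budget` shape, `accrue`, and `canClaim` are hypothetical stand-ins: the documented `accrueSpend(productId, amount)` takes a product ID, whereas this version operates on a plain object to stay self-contained. Treating a product with no budget as unlimited is also an assumption.

```typescript
interface Budget {
  ceilingUsd: number;
  spentUsd: number;
  paused: boolean;
}

// Hypothetical stand-in for accrueSpend: adds spend and auto-pauses at the ceiling.
function accrue(budget: Budget, amountUsd: number): Budget {
  const spentUsd = budget.spentUsd + amountUsd;
  return {
    ...budget,
    spentUsd,
    // A manual pause stays in place; otherwise pause once spentUsd >= ceilingUsd.
    paused: budget.paused || spentUsd >= budget.ceilingUsd,
  };
}

// Hypothetical gate run before any job scan.
function canClaim(budget: Budget | undefined): boolean {
  if (budget === undefined) return true; // assumption: no budget means no limit
  return !budget.paused && budget.spentUsd < budget.ceilingUsd;
}

const before: Budget = { ceilingUsd: 100, spentUsd: 90, paused: false };
const after = accrue(before, 10);
// after.spentUsd is 100 and after.paused is true, so canClaim(after) is false
```

Running the gate before the job scan matches the "budget checked first" decision: an exhausted budget costs one lookup instead of a queue traversal.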
Fleet Control Plane UI (tracker-web)
Navigate to Dashboard → Fleet in tracker-web.
Pages
| Route | Description |
|---|---|
| `/dashboard/fleet` | Overview: factory health cards + recent jobs |
| `/dashboard/fleet/jobs` | Job list with stage filter tabs |
| `/dashboard/fleet/jobs/[id]` | Job detail: events, runs, artifacts, DAG, SHIP |
| `/dashboard/fleet/budget` | Budget view: spend bar, pause/resume controls |
Graceful Degradation
The UI calls platform-service fleet endpoints through the `/api/fleet/[...path]` proxy. If the fleet module returns 404 (flags off), pages display informational empty states instead of errors.
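One way a page could implement this 404 handling is sketched below. `fetchFleetOrNull`, `Fetcher`, and `MinimalResponse` are hypothetical names, and the fetch function is injected so the sketch does not depend on any particular runtime. Only the `/api/fleet/` path prefix comes from this guide.

```typescript
// Minimal response surface this sketch needs; the standard fetch Response satisfies it.
interface MinimalResponse {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}

type Fetcher = (url: string) => Promise<MinimalResponse>;

// Hypothetical helper: null on 404 so callers can render an empty state.
async function fetchFleetOrNull<T>(path: string, fetcher: Fetcher): Promise<T | null> {
  const res = await fetcher(`/api/fleet/${path}`);
  if (res.status === 404) return null; // fleet module absent or flags off
  if (!res.ok) throw new Error(`fleet request failed with status ${res.status}`);
  return (await res.json()) as T;
}

// Stub fetchers standing in for the real proxy.
const notFound: Fetcher = async () => ({ status: 404, ok: false, json: async () => ({}) });
const found: Fetcher = async () => ({ status: 200, ok: true, json: async () => [{ id: "job-1" }] });

const missing = fetchFleetOrNull<unknown[]>("jobs", notFound); // resolves to null
const present = fetchFleetOrNull<{ id: string }[]>("jobs", found); // resolves to the job list
```

In a browser, the injected fetcher could simply be `(url) => fetch(url)`. Other non-2xx statuses still throw here, so genuine failures are not silently shown as empty states.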
Configuration
| Env Var | Default | Purpose |
|---|---|---|
| `PLATFORM_API_URL` | `http://localhost:4003` | Platform-service base URL for proxy |
API Reference Summary
| Endpoint | Method | Phase | Notes |
|---|---|---|---|
| `/fleet/jobs` | GET | 2 | List jobs (query: stage, productId, limit, offset) |
| `/fleet/jobs` | POST | 2 | Submit job (+ optional children[] for DAG) |
| `/fleet/jobs/:id` | GET | 2 | Get job |
| `/fleet/jobs/:id` | PATCH | 2 | Update stage (fenced) |
| `/fleet/jobs/:id/claim` | POST | 2 | Factory claims next job |
| `/fleet/jobs/:id/children` | POST | 3 | Add children to existing job |
| `/fleet/jobs/:id/dag` | GET | 3 | Get DAG subtree |
| `/fleet/factories` | GET | 2 | List factories |
| `/fleet/factories/:id/heartbeat` | POST | 2 | Factory heartbeat |
| `/fleet/budgets/:productId` | GET | 3 | Get budget |
| `/fleet/budgets/:productId` | PUT | 3 | Upsert budget |
| `/fleet/budgets/:productId/pause` | POST | 3 | Pause budget |
| `/fleet/budgets/:productId/resume` | POST | 3 | Resume budget |
Architecture Decisions
- Feature flags default OFF — zero breaking changes to Phase 2 behavior
- Budget checked first — avoids expensive job scan when budget is exhausted
- DAG via deps array — reuses existing dependency resolution; no new scheduler logic needed
- Preemption requires seat limit — only triggers when factory genuinely can't take more work
- UI degrades gracefully — all API calls handle 404 → null/empty; no hard failures