learning_ai_common_plat/docs/FLEET_CONTROL_PLANE.md
Saravanakumar D 325dfcae8e docs(fleet): Phase 3 operational guide + progress report (Slice 5)
2026-05-30 09:49:24 -07:00


Fleet Control Plane — Operational Guide

Phase 3 of the Agent Gigafactory. Adds tunable scoring, preemption, DAG decomposition, per-product budgets, and a tracker-web UI.

Feature Flags

All Phase 3 features are gated behind environment variables (default OFF) for safe rollout:

| Flag | Default | Effect |
|------|---------|--------|
| FLEET_PREEMPTION | "" | Enables seat-limit enforcement + critical-job preemption |
| FLEET_BUDGETS | "" | Enables per-product USD ceiling enforcement. Pauses jobs when budget exceeded |

Set to any truthy value ("1", "true", "yes") to enable.
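
As an illustration of the flag check, here is a minimal sketch. The helper name isFlagEnabled, the Env type, and the exact accepted set (case-insensitive "1", "true", "yes") are assumptions; the guide only says "any truthy value", so the real parser may accept more.

```typescript
// Hypothetical helper, not taken from the fleet module.
// Assumption: only "1", "true", and "yes" (case-insensitive) count as truthy.
const TRUTHY_VALUES = new Set(["1", "true", "yes"]);

type Env = Record<string, string | undefined>;

function isFlagEnabled(env: Env, name: string): boolean {
  const raw = env[name];
  // Unset and empty-string values both leave the feature OFF (the documented default).
  if (raw === undefined) return false;
  return TRUTHY_VALUES.has(raw.trim().toLowerCase());
}

// Usage: pass the environment explicitly so the check is easy to test.
const preemptionOn = isFlagEnabled({ FLEET_PREEMPTION: "1" }, "FLEET_PREEMPTION");
const budgetsOn = isFlagEnabled({ FLEET_BUDGETS: "" }, "FLEET_BUDGETS");
```

In a Node process the first argument would typically be process.env.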

Tunable Scoring Weights

Scoring determines which queued job a factory picks up next. The formula:

score = w.age * ageMinutes + w.priority * priorityOrder + w.retries * attempts + w.capabilities * capabilityBonus
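
The formula can be written out as a small function. This is a sketch only: the interface names and the sample job values are assumptions, and the weights used are the defaults listed in this guide.

```typescript
// Field names mirror the formula above; the interfaces themselves are hypothetical.
interface ScoringWeights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

interface ScoringInput {
  ageMinutes: number;
  priorityOrder: number;
  attempts: number;
  capabilityBonus: number;
}

function scoreJob(w: ScoringWeights, job: ScoringInput): number {
  return (
    w.age * job.ageMinutes +
    w.priority * job.priorityOrder +
    w.retries * job.attempts +
    w.capabilities * job.capabilityBonus
  );
}

// Worked example with the documented default weights and made-up job values:
// 1*30 + 10*3 + (-2)*2 + 5*1 = 61
const exampleScore = scoreJob(
  { age: 1, priority: 10, retries: -2, capabilities: 5 },
  { ageMinutes: 30, priorityOrder: 3, attempts: 2, capabilityBonus: 1 },
);
```

With the default weights, each retry subtracts 2 from the score, so repeatedly failing jobs drift down the queue while older jobs drift up.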

Weight Resolution Order

  1. Per-request override: weights field in the POST /fleet/jobs/:id/claim body
  2. Product registry: set via setWeightRegistry({ [productId]: weights })
  3. Defaults: { age: 1, priority: 10, retries: -2, capabilities: 5 }

Each level does a per-field merge (not full object replacement).
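
To make the per-field merge concrete, here is a minimal sketch. The function name resolveWeights and its signature are assumptions and may not match the actual module; only the precedence order and the default values come from this guide.

```typescript
interface ScoringWeights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

const DEFAULT_WEIGHTS: ScoringWeights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

// Hypothetical resolver: request override beats registry, registry beats defaults,
// and each field is resolved independently.
function resolveWeights(
  registryWeights: Partial<ScoringWeights> = {},
  requestOverride: Partial<ScoringWeights> = {},
): ScoringWeights {
  const pick = (key: keyof ScoringWeights): number =>
    requestOverride[key] ?? registryWeights[key] ?? DEFAULT_WEIGHTS[key];
  return {
    age: pick("age"),
    priority: pick("priority"),
    retries: pick("retries"),
    capabilities: pick("capabilities"),
  };
}

// The registry sets only priority and the request sets only age;
// retries and capabilities fall through to the defaults.
const resolved = resolveWeights({ priority: 20 }, { age: 3 });
```

Because the merge is per field, a request that overrides a single weight does not discard the other weights configured in the product registry.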

Preemption

When FLEET_PREEMPTION is enabled and a factory is at its seatLimit:

  1. A critical-priority job arrives in claimNextJob
  2. selectPreemptionVictim(runningJobs, incomingJob) picks the lowest-scoring running job
  3. The victim is evicted: its lease is released with checkpoint: true, ensuring the job can resume
  4. The critical job takes the freed seat
  5. An event { type: 'preempted', victim, preemptor } is recorded

Rules:

  • Only critical priority can trigger preemption
  • Never preempts jobs of equal or higher priority
  • Capability mismatch disqualifies a factory from preemption
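
The victim-selection rules can be sketched as a pure function. This is an illustration, not the module's selectPreemptionVictim: the priority names other than critical, their ranking, and the job shapes are all assumptions, and factory-level capability matching is left out.

```typescript
// Assumed priority levels; only "critical" is named in this guide.
type Priority = "low" | "normal" | "high" | "critical";

const PRIORITY_RANK: Record<Priority, number> = { low: 0, normal: 1, high: 2, critical: 3 };

interface RunningJob {
  id: string;
  priority: Priority;
  score: number; // assumed to be precomputed with the scoring formula
}

interface IncomingJob {
  id: string;
  priority: Priority;
}

function selectPreemptionVictimSketch(
  runningJobs: RunningJob[],
  incomingJob: IncomingJob,
): RunningJob | null {
  // Only critical jobs can trigger preemption.
  if (incomingJob.priority !== "critical") return null;

  // Never preempt jobs of equal or higher priority.
  const candidates = runningJobs.filter(
    (job) => PRIORITY_RANK[job.priority] < PRIORITY_RANK[incomingJob.priority],
  );
  if (candidates.length === 0) return null;

  // Pick the lowest-scoring remaining job.
  return candidates.reduce((lowest, job) => (job.score < lowest.score ? job : lowest));
}

const running: RunningJob[] = [
  { id: "a", priority: "normal", score: 40 },
  { id: "b", priority: "low", score: 12 },
  { id: "c", priority: "critical", score: 1 },
];
const victim = selectPreemptionVictimSketch(running, { id: "x", priority: "critical" });
```

In this example job c has the lowest score but is itself critical, so it is excluded and job b is selected.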

DAG Job Decomposition

Submit a composite job with children for parallel fan-out:

POST /fleet/jobs
{
  "idempotencyKey": "parent-job",
  "kind": "composite",
  "children": [
    { "idempotencyKey": "child-1", "bodyMd": "..." },
    { "idempotencyKey": "child-2", "bodyMd": "..." }
  ]
}

Or add children later:

POST /fleet/jobs/:parentId/children
{
  "children": [
    { "idempotencyKey": "child-3", "bodyMd": "..." }
  ]
}

Behavior:

  • Parent is automatically blocked until all children complete (children's idempotency keys become parent deps)
  • Children unblock parent via maybeUnblockParent() when transitioning to shipped/done
  • View the full DAG: GET /fleet/jobs/:id/dag
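
The unblocking rule can be illustrated with a small check. This is a sketch under assumptions: the FleetJob shape, the deps field holding child idempotency keys, and the helper name isParentUnblocked are hypothetical, and the real maybeUnblockParent() may also update state rather than just answer yes or no.

```typescript
interface FleetJob {
  idempotencyKey: string;
  stage: string;
  deps: string[]; // assumed: idempotency keys of the jobs this job waits on
}

// Stages that count as "complete" for unblocking, per the behavior notes above.
const COMPLETE_STAGES = new Set(["shipped", "done"]);

function isParentUnblocked(parent: FleetJob, jobsByKey: Map<string, FleetJob>): boolean {
  // Every dependency must exist and be in a complete stage.
  return parent.deps.every((key) => {
    const child = jobsByKey.get(key);
    return child !== undefined && COMPLETE_STAGES.has(child.stage);
  });
}

const parent: FleetJob = { idempotencyKey: "parent-job", stage: "blocked", deps: ["child-1", "child-2"] };
const jobs = new Map<string, FleetJob>([
  ["child-1", { idempotencyKey: "child-1", stage: "done", deps: [] }],
  ["child-2", { idempotencyKey: "child-2", stage: "queued", deps: [] }],
]);

const beforeShip = isParentUnblocked(parent, jobs); // child-2 is still queued
jobs.set("child-2", { idempotencyKey: "child-2", stage: "shipped", deps: [] });
const afterShip = isParentUnblocked(parent, jobs);
```

The stage names "blocked" and "queued" are placeholders for whatever non-terminal stages the job model uses.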

Per-Product Budgets

Control spend per product with USD ceilings:

PUT /fleet/budgets/:productId
{ "ceilingUsd": 100, "window": "monthly" }

| Endpoint | Method | Effect |
|----------|--------|--------|
| /fleet/budgets/:productId | GET | Read current budget |
| /fleet/budgets/:productId | PUT | Create/update ceiling |
| /fleet/budgets/:productId/pause | POST | Manually pause spending |
| /fleet/budgets/:productId/resume | POST | Resume spending |

Enforcement: When FLEET_BUDGETS is enabled, claimNextJob checks the product's budget status first. If the budget is paused or its ceiling has been reached, claimNextJob returns null without scanning for jobs.

Auto-pause: accrueSpend(productId, amount) auto-pauses when spentUsd >= ceilingUsd.
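
The enforcement and auto-pause rules can be sketched together. The Budget shape, the immutable-update style, and the helper names accrueSpendSketch and canClaim are assumptions; the real accrueSpend(productId, amount) looks budgets up by product and presumably persists the result.

```typescript
interface Budget {
  ceilingUsd: number;
  spentUsd: number;
  paused: boolean;
}

// Adds spend and auto-pauses once spentUsd >= ceilingUsd.
function accrueSpendSketch(budget: Budget, amountUsd: number): Budget {
  const spentUsd = budget.spentUsd + amountUsd;
  return { ...budget, spentUsd, paused: budget.paused || spentUsd >= budget.ceilingUsd };
}

// Gate evaluated before any job scan.
// Assumption: a product with no budget record is treated as unlimited.
function canClaim(budget: Budget | undefined, budgetsEnabled: boolean): boolean {
  if (!budgetsEnabled || budget === undefined) return true;
  return !budget.paused && budget.spentUsd < budget.ceilingUsd;
}

const initial: Budget = { ceilingUsd: 100, spentUsd: 90, paused: false };
const afterSpend = accrueSpendSketch(initial, 10); // reaches the ceiling exactly

const allowedBefore = canClaim(initial, true);
const allowedAfter = canClaim(afterSpend, true);
const allowedWhenFlagOff = canClaim(afterSpend, false);
```

Running the gate before the job scan means an exhausted budget costs one cheap check per claim attempt rather than a full queue traversal.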

Fleet Control Plane UI (tracker-web)

Navigate to Dashboard → Fleet in tracker-web.

Pages

| Route | Description |
|-------|-------------|
| /dashboard/fleet | Overview: factory health cards + recent jobs |
| /dashboard/fleet/jobs | Job list with stage filter tabs |
| /dashboard/fleet/jobs/[id] | Job detail: events, runs, artifacts, DAG, SHIP |
| /dashboard/fleet/budget | Budget view: spend bar, pause/resume controls |

Graceful Degradation

The UI calls platform-service fleet endpoints via /api/fleet/[...path] proxy. If the fleet module returns 404 (flags off), pages display informational empty states instead of errors.
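
A 404-tolerant client helper might look like the sketch below. The helper name fetchFleetOrNull and the injected Fetcher type are assumptions made so the example is self-contained; tracker-web's actual data-fetching code is not shown in this guide.

```typescript
// Minimal response shape, a subset of what the standard fetch API returns.
interface MinimalResponse {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}

type Fetcher = (url: string) => Promise<MinimalResponse>;

// Returns null on 404 (fleet module unavailable) so pages can render an empty state.
// Other non-2xx responses still throw, so real failures stay visible.
async function fetchFleetOrNull(fetcher: Fetcher, path: string): Promise<unknown> {
  const res = await fetcher(`/api/fleet/${path}`);
  if (res.status === 404) return null;
  if (!res.ok) throw new Error(`Fleet request failed with status ${res.status}`);
  return res.json();
}

// Stub fetcher that simulates the flags-off case.
const flagsOffFetcher: Fetcher = async () => ({
  status: 404,
  ok: false,
  json: async () => ({}),
});

const pendingJobs = fetchFleetOrNull(flagsOffFetcher, "jobs");
```

In the browser the first argument could be a thin wrapper around the global fetch. A page component would then branch on null to show its informational empty state.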

Configuration

| Env Var | Default | Purpose |
|---------|---------|---------|
| PLATFORM_API_URL | http://localhost:4003 | Platform-service base URL for proxy |

API Reference Summary

| Endpoint | Method | Phase | Notes |
|----------|--------|-------|-------|
| /fleet/jobs | GET | 2 | List jobs (query: stage, productId, limit, offset) |
| /fleet/jobs | POST | 2 | Submit job (+ optional children[] for DAG) |
| /fleet/jobs/:id | GET | 2 | Get job |
| /fleet/jobs/:id | PATCH | 2 | Update stage (fenced) |
| /fleet/jobs/:id/claim | POST | 2 | Factory claims next job |
| /fleet/jobs/:id/children | POST | 3 | Add children to existing job |
| /fleet/jobs/:id/dag | GET | 3 | Get DAG subtree |
| /fleet/factories | GET | 2 | List factories |
| /fleet/factories/:id/heartbeat | POST | 2 | Factory heartbeat |
| /fleet/budgets/:productId | GET | 3 | Get budget |
| /fleet/budgets/:productId | PUT | 3 | Upsert budget |
| /fleet/budgets/:productId/pause | POST | 3 | Pause budget |
| /fleet/budgets/:productId/resume | POST | 3 | Resume budget |

Architecture Decisions

  1. Feature flags default OFF — zero breaking changes to Phase 2 behavior
  2. Budget checked first — avoids expensive job scan when budget is exhausted
  3. DAG via deps array — reuses existing dependency resolution; no new scheduler logic needed
  4. Preemption requires seat limit — only triggers when factory genuinely can't take more work
  5. UI degrades gracefully — all API calls handle 404 → null/empty; no hard failures