learning_ai_common_plat/docs/GIGAFACTORY/FLEET_CONTROL_PLANE.md
saravanakumardb1 883cf329e5 feat(fleet): PR deliverables — jobs target a repo, factory opens a PR, link recorded
Make "shipped" produce a real artifact. A job can now carry an optional repo
(owner/name or clone URL) + baseBranch; the factory's PR mode runs the agent in an
isolated checkout, opens a PR, and records the link.

Backend:
- SubmitJobSchema + FleetJobDoc: optional repo/baseBranch (recorded on submit).
- FleetRunDoc: optional prUrl/branch.
- ReleaseLease report carries prUrl/branch -> stored on the run.
- +2 coordinator tests.

UI (tracker-web):
- New Job form gains optional Repo + Base branch fields (and fixes the priority
  options to the valid critical/high/medium/low; "normal" was rejected by the API).
- Job detail Runs table shows a PR ↗ link from run.prUrl.
- fleet-client: submitJob repo/baseBranch; FleetRun prUrl/branch; OperatorAction +ship.

Docs: FLEET_CONTROL_PLANE.md "PR deliverable (PR mode)" section.

Verified: tsc clean; fleet suite 182; tracker-web 230.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 05:27:11 -07:00


Fleet Control Plane — Operational Guide

Phase 3 of the Agent Gigafactory. Adds tunable scoring, preemption, DAG decomposition, per-product budgets, and a tracker-web UI.

Feature Flags

All Phase 3 features are gated behind environment variables (default OFF) for safe rollout:

| Flag | Default | Effect |
|---|---|---|
| FLEET_PREEMPTION | "" | Enables seat-limit enforcement + critical-job preemption |
| FLEET_BUDGETS | "" | Enables per-product USD ceiling enforcement. Pauses jobs when budget exceeded |

Set to any truthy value ("1", "true", "yes") to enable.
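
The flag check can be sketched as below. This is a minimal illustration, not the platform's actual parser: the helper name isFlagEnabled is hypothetical, and treating only "1", "true", and "yes" (case-insensitive) as truthy is an assumption drawn from the examples above.

```typescript
// Hypothetical helper; the real flag parser is not shown in this guide.
// Assumption: only these three spellings count as truthy.
const TRUTHY_VALUES = new Set(["1", "true", "yes"]);

function isFlagEnabled(
  env: Record<string, string | undefined>,
  name: string,
): boolean {
  // Unset and empty-string flags both resolve to OFF (the documented default).
  const raw = (env[name] ?? "").trim().toLowerCase();
  return TRUTHY_VALUES.has(raw);
}
```

Passing the environment in as a plain record keeps the check easy to test; a caller would typically hand it the process environment.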

Tunable Scoring Weights

Scoring determines which queued job a factory picks up next. The formula:

score = w.age * ageMinutes + w.priority * priorityOrder + w.retries * attempts + w.capabilities * capabilityBonus

Weight Resolution Order

  1. Per-request override: weights field in the POST /fleet/jobs/:id/claim body
  2. Product registry: set via setWeightRegistry({ [productId]: weights })
  3. Defaults: { age: 1, priority: 10, retries: -2, capabilities: 5 }

Each level does a per-field merge (not full object replacement).
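
The per-field merge and the scoring formula can be sketched as follows. The names ScoringWeights, ScoreInputs, resolveWeights, and scoreJob are illustrative assumptions; only the default values, the resolution order, and the formula come from this guide.

```typescript
interface ScoringWeights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

// Defaults as documented above.
const DEFAULT_WEIGHTS: ScoringWeights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

// Per-field merge: request override wins, then product registry, then defaults.
function resolveWeights(
  registry: Partial<ScoringWeights> = {},
  override: Partial<ScoringWeights> = {},
): ScoringWeights {
  const pick = (k: keyof ScoringWeights): number =>
    override[k] ?? registry[k] ?? DEFAULT_WEIGHTS[k];
  return {
    age: pick("age"),
    priority: pick("priority"),
    retries: pick("retries"),
    capabilities: pick("capabilities"),
  };
}

interface ScoreInputs {
  ageMinutes: number;
  priorityOrder: number;
  attempts: number;
  capabilityBonus: number;
}

// Direct transcription of the scoring formula.
function scoreJob(w: ScoringWeights, j: ScoreInputs): number {
  return (
    w.age * j.ageMinutes +
    w.priority * j.priorityOrder +
    w.retries * j.attempts +
    w.capabilities * j.capabilityBonus
  );
}
```

For example, a registry entry of { priority: 20 } combined with a request override of { age: 2 } resolves to { age: 2, priority: 20, retries: -2, capabilities: 5 }: fields not mentioned at a higher level fall through to the next one.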

Preemption

When FLEET_PREEMPTION is enabled and a factory is at its seatLimit:

  1. A critical-priority job arrives in claimNextJob
  2. selectPreemptionVictim(runningJobs, incomingJob) picks the lowest-scoring running job
  3. The victim is evicted: its lease is released with checkpoint: true, ensuring the job can resume
  4. The critical job takes the freed seat
  5. An event { type: 'preempted', victim, preemptor } is recorded

Rules:

  • Only critical priority can trigger preemption
  • Never preempts jobs of equal or higher priority
  • Capability mismatch disqualifies a factory from preemption
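
The victim-selection rules above can be sketched like this. The function name matches the one mentioned in the steps, but the body, the RunningJob/IncomingJob shapes, and the numeric priority ranking are assumptions for illustration; the capability check is omitted for brevity.

```typescript
type Priority = "critical" | "high" | "medium" | "low";

// Assumed ranking: a larger number means a higher priority.
const PRIORITY_RANK: Record<Priority, number> = { low: 0, medium: 1, high: 2, critical: 3 };

interface RunningJob {
  id: string;
  priority: Priority;
  score: number; // scheduling score, as computed by the scoring formula
}

interface IncomingJob {
  id: string;
  priority: Priority;
}

function selectPreemptionVictim(
  runningJobs: RunningJob[],
  incomingJob: IncomingJob,
): RunningJob | null {
  // Only critical jobs may trigger preemption.
  if (incomingJob.priority !== "critical") return null;

  // Never preempt jobs of equal or higher priority.
  const candidates = runningJobs.filter(
    (j) => PRIORITY_RANK[j.priority] < PRIORITY_RANK[incomingJob.priority],
  );
  if (candidates.length === 0) return null;

  // Evict the lowest-scoring eligible job.
  return candidates.reduce((lowest, j) => (j.score < lowest.score ? j : lowest));
}
```

Returning null when no eligible victim exists lets the caller fall back to leaving the critical job queued until a seat frees up.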

DAG Job Decomposition

Submit a composite job with children for parallel fan-out:

POST /fleet/jobs
{
  "idempotencyKey": "parent-job",
  "kind": "composite",
  "children": [
    { "idempotencyKey": "child-1", "bodyMd": "..." },
    { "idempotencyKey": "child-2", "bodyMd": "..." }
  ]
}

Or add children later:

POST /fleet/jobs/:parentId/children
{
  "children": [
    { "idempotencyKey": "child-3", "bodyMd": "..." }
  ]
}

Behavior:

  • Parent is automatically blocked until all children complete (children's idempotency keys become parent deps)
  • Children unblock parent via maybeUnblockParent() when transitioning to shipped/done
  • View the full DAG: GET /fleet/jobs/:id/dag
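
The unblocking rule can be sketched as a pure function. The JobNode shape and the choice to return an unblocked parent to queued are assumptions made for illustration; the real maybeUnblockParent() is not shown in this guide and may differ.

```typescript
interface JobNode {
  idempotencyKey: string;
  stage: string;
  deps: string[]; // idempotency keys of the children this job waits on
}

// Stages that count as "complete" for dependency purposes.
const COMPLETE_STAGES = new Set(["shipped", "done"]);

function maybeUnblockParent(parent: JobNode, jobsByKey: Map<string, JobNode>): JobNode {
  if (parent.stage !== "blocked") return parent;

  // A missing child is treated as incomplete, so the parent stays blocked.
  const allComplete = parent.deps.every((key) => {
    const child = jobsByKey.get(key);
    return child !== undefined && COMPLETE_STAGES.has(child.stage);
  });

  // Assumption: an unblocked parent goes back to the queue.
  return allComplete ? { ...parent, stage: "queued" } : parent;
}
```

Because the check runs over the parent's deps array, children added later via the children endpoint are covered as long as their keys are appended to deps.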

Per-Product Budgets

Control spend per product with USD ceilings:

PUT /fleet/budgets/:productId
{ "ceilingUsd": 100, "window": "monthly" }

| Endpoint | Method | Effect |
|---|---|---|
| /fleet/budgets/:productId | GET | Read current budget |
| /fleet/budgets/:productId | PUT | Create/update ceiling |
| /fleet/budgets/:productId/pause | POST | Manually pause spending |
| /fleet/budgets/:productId/resume | POST | Resume spending |

Enforcement: When FLEET_BUDGETS is enabled, claimNextJob checks budget status FIRST. If paused or ceiling exceeded → returns null (no job scan).

Auto-pause: accrueSpend(productId, amount) auto-pauses when spentUsd >= ceilingUsd.
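
The enforcement and auto-pause rules can be sketched as two pure functions. The Budget shape and the names applySpend and canClaim are hypothetical stand-ins; the real accrueSpend(productId, amount) takes a product ID and presumably persists the result, which this sketch does not model.

```typescript
interface Budget {
  productId: string;
  ceilingUsd: number;
  spentUsd: number;
  paused: boolean;
}

// Stand-in for accrueSpend: adds spend and auto-pauses at or above the ceiling.
function applySpend(budget: Budget, amountUsd: number): Budget {
  const spentUsd = budget.spentUsd + amountUsd;
  return { ...budget, spentUsd, paused: budget.paused || spentUsd >= budget.ceilingUsd };
}

// Budget gate evaluated before any job scan.
function canClaim(budget: Budget | undefined, budgetsEnabled: boolean): boolean {
  // Flag off, or no budget configured for the product: no enforcement.
  if (!budgetsEnabled || budget === undefined) return true;
  return !budget.paused && budget.spentUsd < budget.ceilingUsd;
}
```

A claim handler would call canClaim first and return null immediately when it is false, which is what skips the job scan.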

Fleet Control Plane UI (tracker-web)

Navigate to Dashboard → Fleet in tracker-web.

Pages

| Route | Description |
|---|---|
| /dashboard/fleet | Overview: factory health cards + recent jobs |
| /dashboard/fleet/jobs | Job list with stage filter tabs |
| /dashboard/fleet/jobs/[id] | Job detail: events, runs, artifacts, DAG, SHIP |
| /dashboard/fleet/budget | Budget view: spend bar, pause/resume controls |

Graceful Degradation

The UI calls platform-service fleet endpoints via /api/fleet/[...path] proxy. If the fleet module returns 404 (flags off), pages display informational empty states instead of errors.
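
The 404 handling can be sketched as a small response-unwrapping helper. ProxyResponse, unwrapFleetResponse, and listOrEmpty are hypothetical names, not part of the actual fleet-client; the sketch only illustrates mapping a 404 to null so pages can render an empty state.

```typescript
// Simplified view of a proxy response: status code plus an already-parsed body.
interface ProxyResponse<T> {
  status: number;
  body?: T;
}

// 404 (fleet module not mounted, flags off) becomes null; other failures throw.
function unwrapFleetResponse<T>(res: ProxyResponse<T>): T | null {
  if (res.status === 404) return null;
  if (res.status < 200 || res.status >= 300) {
    throw new Error(`fleet request failed with status ${res.status}`);
  }
  return res.body ?? null;
}

// List pages can treat "no data" as an empty list rather than an error.
function listOrEmpty<T>(res: ProxyResponse<T[]>): T[] {
  return unwrapFleetResponse(res) ?? [];
}
```

Keeping the 404 mapping in one place means individual pages only need to handle the null or empty case.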

Configuration

| Env Var | Default | Purpose |
|---|---|---|
| PLATFORM_API_URL | http://localhost:4003 | Platform-service base URL for proxy |

Job Lifecycle & Shipping (testing → shipped)

Stages: queued → assigned → building → review → testing → shipped (plus blocked, failed, dead_letter). A factory drives assigned → building → review, then runs its local verify gate.

There are two ways a job reaches the terminal shipped stage. After the review gate, the testing → shipped transition has no claimable lease holder, so it is driven by one of the following:

  1. Factory autoship (AQ_FLEET_AUTOSHIP=1 on the agent-queue factory): when the factory's local verify passes it reports testing, then advances the coordinator job testing → shipped autonomously (the factory's verify is the test phase). This is the autonomous submit → … → shipped path. Default off.
  2. ship operator action (POST /fleet/jobs/:id/actions/:action with ship): an operator/controller marks a non-terminal job shipped. Lease-free (works after the human review gate), idempotent, and retries on optimistic-concurrency conflict.

With AQ_FLEET_AUTOSHIP=0 (default) a verify-passing job rests at testing for the human review gate (review/request + multi-reviewer review approve) or a manual ship.

Whenever a job reaches shipped (autoship PATCH, ship action, or a terminal lease release), the coordinator mirrors the outcome onto the latest run (result = 'shipped', endedAt set) and — if budgets are enabled — accrues that run's insights.costUsd. So the dashboard's per-run result/cost/tokens stay consistent with the job stage.
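
The idempotent, retry-on-conflict behavior of the ship action can be sketched as follows. JobRecord, JobStore, shipJob, the version field, and the retry limit are all assumptions for illustration, as is treating dead_letter as a terminal stage that cannot be shipped; the coordinator's actual storage layer is not described in this guide.

```typescript
interface JobRecord {
  id: string;
  stage: string;
  version: number; // assumed optimistic-concurrency token
}

interface JobStore {
  read(id: string): JobRecord;
  // Writes `next` only if the stored version equals expectedVersion.
  compareAndSet(next: JobRecord, expectedVersion: number): boolean;
}

function shipJob(store: JobStore, id: string, maxAttempts = 3): JobRecord {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const current = store.read(id);

    // Idempotent: shipping an already-shipped job is a no-op.
    if (current.stage === "shipped") return current;

    // Assumption: dead-lettered jobs are terminal and cannot be shipped.
    if (current.stage === "dead_letter") {
      throw new Error(`job ${id} is terminal (${current.stage})`);
    }

    const next: JobRecord = { ...current, stage: "shipped", version: current.version + 1 };
    if (store.compareAndSet(next, current.version)) return next;
    // Conflict: another writer got in first; re-read and try again.
  }
  throw new Error(`ship of job ${id} failed after ${maxAttempts} conflicting writes`);
}
```

Re-reading the record on every attempt is what makes the retry safe: if a concurrent writer already shipped the job, the next iteration returns it unchanged.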

PR deliverable (PR mode)

A job may carry an optional repo (owner/name or a clone URL) + baseBranch. When the factory runs with AQ_FLEET_PR=1, it runs the agent in an isolated checkout on branch aq/job/<jobId>, then commits, pushes, and opens a PR via gh. The PR URL and branch are reported on lease release and recorded on the run (run.prUrl, run.branch); the dashboard shows a PR ↗ link in the job's Runs table. Submit repo/baseBranch from the dashboard "New Job" form or the POST /fleet/jobs body. Currently the factory only opens the PR: merging stays a human/CI step, and opt-in auto-merge is a planned follow-up.
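
The PR-mode steps can be sketched as a command plan. This is not the factory's actual implementation: PrJob, prBranchName, cloneUrl, and planPrCommands are hypothetical names, expanding owner/name to a github.com HTTPS URL is an assumption, and so is the "main" fallback when baseBranch is omitted. Only the aq/job/<jobId> branch pattern and the use of gh come from this guide.

```typescript
interface PrJob {
  jobId: string;
  repo: string; // owner/name shorthand or a full clone URL
  baseBranch?: string;
}

function prBranchName(jobId: string): string {
  return `aq/job/${jobId}`;
}

// Assumption: owner/name shorthand refers to a GitHub repository.
function cloneUrl(repo: string): string {
  return /^[\w.-]+\/[\w.-]+$/.test(repo) ? `https://github.com/${repo}.git` : repo;
}

// Returns the argv of each command in order; nothing is executed here.
function planPrCommands(job: PrJob, title: string): string[][] {
  const branch = prBranchName(job.jobId);
  const base = job.baseBranch ?? "main"; // assumed default
  return [
    ["git", "clone", "--branch", base, cloneUrl(job.repo), "checkout"],
    ["git", "checkout", "-b", branch],
    // The agent edits files in the isolated checkout between these steps.
    ["git", "add", "--all"],
    ["git", "commit", "--message", title],
    ["git", "push", "--set-upstream", "origin", branch],
    ["gh", "pr", "create", "--base", base, "--head", branch,
      "--title", title, "--body", `Fleet job ${job.jobId}`],
  ];
}
```

Building the plan as data keeps the sketch side-effect free; a runner would execute each argv inside the checkout directory and report the resulting PR URL and branch on lease release.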

API Reference Summary

| Endpoint | Method | Phase | Notes |
|---|---|---|---|
| /fleet/jobs | GET | 2 | List jobs (query: stage, productId, limit, offset) |
| /fleet/jobs | POST | 2 | Submit job (+ optional children[] for DAG) |
| /fleet/jobs/:id | GET | 2 | Get job |
| /fleet/jobs/:id | PATCH | 2 | Update stage (fenced) |
| /fleet/jobs/:id/actions/:action | POST | 3 | Operator action: requeue / reject / cancel / ship |
| /fleet/jobs/:id/lease/release | POST | 2 | Release lease (optional stage, insights, result) |
| /fleet/jobs/:id/claim | POST | 2 | Factory claims next job |
| /fleet/jobs/:id/children | POST | 3 | Add children to existing job |
| /fleet/jobs/:id/dag | GET | 3 | Get DAG subtree |
| /fleet/factories | GET | 2 | List factories |
| /fleet/factories/:id/heartbeat | POST | 2 | Factory heartbeat |
| /fleet/budgets/:productId | GET | 3 | Get budget |
| /fleet/budgets/:productId | PUT | 3 | Upsert budget |
| /fleet/budgets/:productId/pause | POST | 3 | Pause budget |
| /fleet/budgets/:productId/resume | POST | 3 | Resume budget |

Architecture Decisions

  1. Feature flags default OFF — zero breaking changes to Phase 2 behavior
  2. Budget checked first — avoids expensive job scan when budget is exhausted
  3. DAG via deps array — reuses existing dependency resolution; no new scheduler logic needed
  4. Preemption requires seat limit — only triggers when factory genuinely can't take more work
  5. UI degrades gracefully — all API calls handle 404 → null/empty; no hard failures