diff --git a/docs/GIGAFACTORY/FLEET_CONTROL_PLANE.md b/docs/GIGAFACTORY/FLEET_CONTROL_PLANE.md
index e6c7df9f..300998b7 100644
--- a/docs/GIGAFACTORY/FLEET_CONTROL_PLANE.md
+++ b/docs/GIGAFACTORY/FLEET_CONTROL_PLANE.md
@@ -23,7 +23,7 @@ score = w.age * ageMinutes + w.priority * priorityOrder + w.retries * attempts +
 
 ### Weight Resolution Order
 
-1. **Per-request override** — `weights` field in `POST /fleet/jobs/:id/claim` body
+1. **Per-request override** — `weights` field in the `POST /fleet/claim` body
 2. **Product registry** — set via `setWeightRegistry({ [productId]: weights })`
 3. **Defaults** — `{ age: 1, priority: 10, retries: -2, capabilities: 5 }`
 
@@ -172,11 +172,13 @@ on branch `aq/job/`, then commits, pushes, and opens a PR via `gh`. The P
 | `/fleet/jobs/:id` | PATCH | 2 | Update stage (fenced) |
 | `/fleet/jobs/:id/actions/:action` | POST | 3 | Operator action: `requeue` / `reject` / `cancel` / `ship` |
 | `/fleet/jobs/:id/lease/release` | POST | 2 | Release lease (optional `stage`, `insights`, `result`) |
-| `/fleet/jobs/:id/claim` | POST | 2 | Factory claims next job |
+| `/fleet/claim` | POST | 2 | Factory claims the next best-fit job |
 | `/fleet/jobs/:id/children` | POST | 3 | Add children to existing job |
 | `/fleet/jobs/:id/dag` | GET | 3 | Get DAG subtree |
-| `/fleet/factories` | GET | 2 | List factories |
-| `/fleet/factories/:id/heartbeat` | POST | 2 | Factory heartbeat |
+| `/fleet/factories/heartbeat` | POST | 2 | Factory heartbeat (register == first heartbeat) |
+| `/fleet/factories/enroll` | POST | 2 | Enroll a factory → one-time scoped token |
+| `/fleet/metrics` | GET | 3 | Utilization, health rollup, queue/starvation alerts |
+| `/fleet/queue-state` | GET | 4 | M0 RU gate: per-product monotonic queue `version` |
 | `/fleet/budgets/:productId` | GET | 3 | Get budget |
 | `/fleet/budgets/:productId` | PUT | 3 | Upsert budget |
 | `/fleet/budgets/:productId/pause` | POST | 3 | Pause budget |
diff --git a/docs/GIGAFACTORY/README.md b/docs/GIGAFACTORY/README.md
index 13a608fb..50c5ef55 100644
--- a/docs/GIGAFACTORY/README.md
+++ b/docs/GIGAFACTORY/README.md
@@ -6,16 +6,17 @@ the `dashboards/tracker-web` UI).
 
 ## Contents
 
-| Doc | What it is |
-| --- | --- |
-| [`ROADMAP_COMPLETION_AUDIT.md`](ROADMAP_COMPLETION_AUDIT.md) | Audit of the current build state against the roadmap: completed / partial / missing features, risks, and prioritized remaining work. |
-| [`TASKS_TO_COMPLETE.md`](TASKS_TO_COMPLETE.md) | The actionable, priority-ordered completion checklist that companions the audit. |
-| [`gigafactory-phase3-progress.md`](gigafactory-phase3-progress.md) | Phase-3 progress log: per-slice end-state. |
-| [`FLEET_CONTROL_PLANE.md`](FLEET_CONTROL_PLANE.md) | Operational guide for running and using the fleet control plane. |
+| Doc | What it is |
+| ------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------ |
+| [`ROADMAP_COMPLETION_AUDIT.md`](ROADMAP_COMPLETION_AUDIT.md) | Audit of the current build state against the roadmap: completed / partial / missing features, risks, and prioritized remaining work. |
+| [`TASKS_TO_COMPLETE.md`](TASKS_TO_COMPLETE.md) | The actionable, priority-ordered completion checklist that companions the audit. |
+| [`gigafactory-phase3-progress.md`](gigafactory-phase3-progress.md) | Phase-3 progress log: per-slice end-state. |
+| [`FLEET_CONTROL_PLANE.md`](FLEET_CONTROL_PLANE.md) | Operational guide for running and using the fleet control plane. |
 
 ## Source of truth
 
 The canonical spec and system overview live in the runner repo,
 `learning_ai_devops_tools`, under `agent-queue/docs/GIGAFACTORY/`
 (`GIGAFACTORY_ROADMAP.md`,
-`GIGAFACTORY_SYSTEM_OVERVIEW.md`).
+`GIGAFACTORY_SYSTEM_OVERVIEW.md`, and `FLEET_DISPATCH_REDESIGN.md` — the Phase-4
+broker design + the as-built **M0 RU gate** `fleet_queue_state` / `GET /fleet/queue-state`).
diff --git a/docs/GIGAFACTORY/ROADMAP_COMPLETION_AUDIT.md b/docs/GIGAFACTORY/ROADMAP_COMPLETION_AUDIT.md
index 863a37be..d970cf7c 100644
--- a/docs/GIGAFACTORY/ROADMAP_COMPLETION_AUDIT.md
+++ b/docs/GIGAFACTORY/ROADMAP_COMPLETION_AUDIT.md
@@ -3,6 +3,17 @@
 > Source of truth: `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md`
 > Audit date: 2026-05-30 · Auditor: Principal Full-Stack review
 > Scope: `services/platform-service/src/modules/fleet/**` + `dashboards/tracker-web/src/app/dashboard/fleet/**`
+>
+> **⚠️ Status update (2026-05-31) — this audit is a point-in-time snapshot; some
+> findings below are now superseded:**
+>
+> - The roadmap §0 tracker has since been **reconciled** — it now reads Phase 0
+> ✅100% · 1 ✅~98% · 2 ✅~98% · 3 ✅100% · 4 ◐ in progress · 5 ☐. The
+> "§0 is stale / Phase 3 0% / Phase 2 80%" notes below are **no longer accurate**.
+> - **Phase 4 is no longer "not started": M0 (RU gate) is shipped** —
+> `fleet_queue_state` + `GET /fleet/queue-state` + `AQ_FLEET_GATE`. The broker
+> (M1+) design + checklist live in
+> `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md`.
 
 ## 1. Product understanding
 
diff --git a/services/platform-service/src/modules/fleet/README.md b/services/platform-service/src/modules/fleet/README.md
index e21552db..d89ed7bc 100644
--- a/services/platform-service/src/modules/fleet/README.md
+++ b/services/platform-service/src/modules/fleet/README.md
@@ -11,15 +11,16 @@ Spec: `../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROAD
 
 ## Containers (partition keys)
 
-| Container | PK | Purpose |
-| ----------------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
-| `fleet_jobs` | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, … |
-| `fleet_runs` | `/jobId` | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff) |
-| `fleet_leases` | `/jobId` | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status` |
-| `fleet_factories` | `/productId` | registered worker host: `capabilities`, `health`, `load`, `seatLimit`, `lastHeartbeatAt` |
-| `fleet_profiles` | `/productId` | immutable, versioned profile snapshot |
-| `fleet_events` | `/jobId` | append-only audit/event stream (monotonic `seq`) |
-| `fleet_artifacts` | `/jobId` | pointers to blob-stored artifacts (no inline logs) |
+| Container | PK | Purpose |
+| ------------------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
+| `fleet_jobs` | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, … |
+| `fleet_runs` | `/jobId` | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff) |
+| `fleet_leases` | `/jobId` | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status` |
+| `fleet_factories` | `/productId` | registered worker host: `capabilities`, `health`, `load`, `seatLimit`, `lastHeartbeatAt` |
+| `fleet_profiles` | `/productId` | immutable, versioned profile snapshot |
+| `fleet_events` | `/jobId` | append-only audit/event stream (monotonic `seq`) |
+| `fleet_artifacts` | `/jobId` | pointers to blob-stored artifacts (no inline logs) |
+| `fleet_queue_state` | `/productId` | Phase-4 M0 RU gate: monotonic `version` bumped on job create + every stage change (cheap "work changed" signal) |
 
 Every document carries `productId`. Containers are registered in
 `lib/cosmos-init.ts`.
@@ -45,7 +46,10 @@ worker mutation (`patchJobFenced`, `renewLease`, `releaseLease`) carries its
 stale/zombie worker can never overwrite a reassigned job.
 
 **Heartbeat.** `heartbeat(factory)` upserts `lastHeartbeatAt` + health/load;
-`isFactoryStale` detects a missed-heartbeat factory.
+`isFactoryStale` detects a missed-heartbeat factory (stale after
+`DEFAULT_STALE_FACTORY_MS = 90_000`). A factory's heartbeat cadence
+(`AQ_FLEET_LEASE_RENEW_SEC` on the runner) MUST be well under 90s or a healthy
+factory flaps to "stale" between beats — the fleet launcher uses 30s.
 
 **Reaper.** `reapExpiredLeases(now)` scans `held` leases with `expiresAt < now`,
 bumps `leaseEpoch` (fencing the dead holder), returns the job to `queued` (or
@@ -69,7 +73,8 @@ so the reaper — not TTL — owns recovery.
 `PATCH /fleet/jobs/:id` (fenced) · `POST /fleet/claim` ·
 `POST /fleet/jobs/:id/lease/renew` · `POST /fleet/jobs/:id/lease/release` ·
 `POST /fleet/factories/heartbeat` · `GET /fleet/jobs/:id/runs` ·
-`GET /fleet/jobs/:id/events`.
+`GET /fleet/jobs/:id/events` · `GET /fleet/metrics` ·
+`GET /fleet/queue-state` (Phase-4 M0 RU gate).
 
 ## Files
 
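
Review note on the **Heartbeat** hunk: the new wording ties the runner's heartbeat cadence to the 90s stale threshold, and a small sketch may make the "well under 90s" requirement concrete. Only `DEFAULT_STALE_FACTORY_MS = 90_000`, the name `isFactoryStale`, and the 30s launcher cadence come from the diff. The `FactoryHeartbeat` shape, the function signature, the strict `>` comparison, and the `missedBeatsTolerated` helper are assumptions made for illustration and may not match the real module.

```typescript
// Threshold quoted in the fleet README diff.
const DEFAULT_STALE_FACTORY_MS = 90_000;

// Hypothetical shape: the real factory document carries more fields.
interface FactoryHeartbeat {
  lastHeartbeatAt: number; // epoch milliseconds of the last heartbeat
}

// Assumed semantics: stale once strictly more than `staleMs` has elapsed.
function isFactoryStale(
  factory: FactoryHeartbeat,
  nowMs: number,
  staleMs: number = DEFAULT_STALE_FACTORY_MS,
): boolean {
  return nowMs - factory.lastHeartbeatAt > staleMs;
}

// Hypothetical helper: how many consecutive heartbeats can be lost before
// the factory reads as stale, assuming beats land exactly on schedule.
// A negative result means the factory flaps even when no beat is lost.
function missedBeatsTolerated(
  cadenceSec: number,
  staleMs: number = DEFAULT_STALE_FACTORY_MS,
): number {
  return Math.floor(staleMs / (cadenceSec * 1000)) - 1;
}
```

Under these assumptions a 30s cadence survives two lost beats, a 60s cadence survives none, and a 120s cadence marks a healthy factory stale between beats, which is the flapping the diff warns about.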