docs(gigafactory): fix stale/incorrect fleet docs

- fleet module README: add fleet_queue_state container + GET /fleet/queue-state
  and /fleet/metrics; note the heartbeat cadence must stay under the 90s stale
  threshold (AQ_FLEET_LEASE_RENEW_SEC).
- FLEET_CONTROL_PLANE: correct wrong endpoint paths (/fleet/claim and
  /fleet/factories/heartbeat were documented as /fleet/jobs/:id/claim and
  /fleet/factories/:id/heartbeat; removed a non-existent GET /fleet/factories);
  add enroll, metrics, and the M0 queue-state endpoint.
- ROADMAP_COMPLETION_AUDIT: dated status banner — roadmap §0 now reconciled and
  Phase-4 M0 shipped, superseding the older "stale §0 / not started" findings.
- README: point to FLEET_DISPATCH_REDESIGN.md + the M0 gate.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
This commit is contained in:
saravanakumardb1 2026-06-01 00:03:05 -07:00
parent ba7db0008d
commit 78c4e47460
4 changed files with 41 additions and 22 deletions


@@ -23,7 +23,7 @@ score = w.age * ageMinutes + w.priority * priorityOrder + w.retries * attempts +
### Weight Resolution Order
-1. **Per-request override** — `weights` field in `POST /fleet/jobs/:id/claim` body
+1. **Per-request override** — `weights` field in the `POST /fleet/claim` body
2. **Product registry** — set via `setWeightRegistry({ [productId]: weights })`
3. **Defaults** — `{ age: 1, priority: 10, retries: -2, capabilities: 5 }`
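
The resolution order above can be sketched as a first-match lookup. This is a minimal illustration, not the module's code: `ClaimWeights`, `WeightRegistry`, and `resolveWeights` are hypothetical names, and it assumes each level replaces the lower ones wholesale rather than merging field by field.

```typescript
// Sketch only: type and function names are assumptions, not the fleet module's API.
interface ClaimWeights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

// Defaults as documented in step 3 of the list above.
const DEFAULT_WEIGHTS: ClaimWeights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

// Hypothetical stand-in for whatever setWeightRegistry() stores, keyed by productId.
type WeightRegistry = Partial<Record<string, ClaimWeights>>;

function resolveWeights(
  productId: string,
  registry: WeightRegistry,
  requestWeights?: ClaimWeights,
): ClaimWeights {
  // 1. per-request override, 2. product registry, 3. defaults.
  // Assumption: the first source that is present wins as a whole object.
  return requestWeights ?? registry[productId] ?? DEFAULT_WEIGHTS;
}

const registry: WeightRegistry = {
  "product-a": { age: 2, priority: 1, retries: 0, capabilities: 0 },
};

console.log(resolveWeights("product-a", registry).age); // registry entry is used
console.log(resolveWeights("product-b", registry).priority); // falls back to defaults
```

If the real implementation merges partial overrides into the defaults, the `??` chain would become an object spread instead; the source does not say which behaviour applies.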
@@ -172,11 +172,13 @@ on branch `aq/job/<jobId>`, then commits, pushes, and opens a PR via `gh`. The P
| `/fleet/jobs/:id` | PATCH | 2 | Update stage (fenced) |
| `/fleet/jobs/:id/actions/:action` | POST | 3 | Operator action: `requeue` / `reject` / `cancel` / `ship` |
| `/fleet/jobs/:id/lease/release` | POST | 2 | Release lease (optional `stage`, `insights`, `result`) |
-| `/fleet/jobs/:id/claim` | POST | 2 | Factory claims next job |
+| `/fleet/claim` | POST | 2 | Factory claims the next best-fit job |
| `/fleet/jobs/:id/children` | POST | 3 | Add children to existing job |
| `/fleet/jobs/:id/dag` | GET | 3 | Get DAG subtree |
-| `/fleet/factories` | GET | 2 | List factories |
-| `/fleet/factories/:id/heartbeat` | POST | 2 | Factory heartbeat |
+| `/fleet/factories/heartbeat` | POST | 2 | Factory heartbeat (register == first heartbeat) |
+| `/fleet/factories/enroll` | POST | 2 | Enroll a factory → one-time scoped token |
+| `/fleet/metrics` | GET | 3 | Utilization, health rollup, queue/starvation alerts |
+| `/fleet/queue-state` | GET | 4 | M0 RU gate: per-product monotonic queue `version` |
| `/fleet/budgets/:productId` | GET | 3 | Get budget |
| `/fleet/budgets/:productId` | PUT | 3 | Upsert budget |
| `/fleet/budgets/:productId/pause` | POST | 3 | Pause budget |


@@ -6,16 +6,17 @@ the `dashboards/tracker-web` UI).
## Contents
-| Doc | What it is |
-| --- | --- |
-| [`ROADMAP_COMPLETION_AUDIT.md`](ROADMAP_COMPLETION_AUDIT.md) | Audit of the current build state against the roadmap: completed / partial / missing features, risks, and prioritized remaining work. |
-| [`TASKS_TO_COMPLETE.md`](TASKS_TO_COMPLETE.md) | The actionable, priority-ordered completion checklist that companions the audit. |
-| [`gigafactory-phase3-progress.md`](gigafactory-phase3-progress.md) | Phase-3 progress log: per-slice end-state. |
-| [`FLEET_CONTROL_PLANE.md`](FLEET_CONTROL_PLANE.md) | Operational guide for running and using the fleet control plane. |
+| Doc | What it is |
+| ------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------ |
+| [`ROADMAP_COMPLETION_AUDIT.md`](ROADMAP_COMPLETION_AUDIT.md) | Audit of the current build state against the roadmap: completed / partial / missing features, risks, and prioritized remaining work. |
+| [`TASKS_TO_COMPLETE.md`](TASKS_TO_COMPLETE.md) | The actionable, priority-ordered completion checklist that companions the audit. |
+| [`gigafactory-phase3-progress.md`](gigafactory-phase3-progress.md) | Phase-3 progress log: per-slice end-state. |
+| [`FLEET_CONTROL_PLANE.md`](FLEET_CONTROL_PLANE.md) | Operational guide for running and using the fleet control plane. |
## Source of truth
The canonical spec and system overview live in the runner repo,
`learning_ai_devops_tools`, under
`agent-queue/docs/GIGAFACTORY/` (`GIGAFACTORY_ROADMAP.md`,
-`GIGAFACTORY_SYSTEM_OVERVIEW.md`).
+`GIGAFACTORY_SYSTEM_OVERVIEW.md`, and `FLEET_DISPATCH_REDESIGN.md` — the Phase-4
+broker design + the as-built **M0 RU gate** `fleet_queue_state` / `GET /fleet/queue-state`).


@@ -3,6 +3,17 @@
> Source of truth: `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md`
> Audit date: 2026-05-30 · Auditor: Principal Full-Stack review
> Scope: `services/platform-service/src/modules/fleet/**` + `dashboards/tracker-web/src/app/dashboard/fleet/**`
+>
+> **⚠️ Status update (2026-05-31) — this audit is a point-in-time snapshot; some
+> findings below are now superseded:**
+>
+> - The roadmap §0 tracker has since been **reconciled** — it now reads Phase 0
+> ✅100% · 1 ✅~98% · 2 ✅~98% · 3 ✅100% · 4 ◐ in progress · 5 ☐. The
+> "§0 is stale / Phase 3 0% / Phase 2 80%" notes below are **no longer accurate**.
+> - **Phase 4 is no longer "not started": M0 (RU gate) is shipped**
+> `fleet_queue_state` + `GET /fleet/queue-state` + `AQ_FLEET_GATE`. The broker
+> (M1+) design + checklist live in
+> `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md`.
## 1. Product understanding


@@ -11,15 +11,16 @@ Spec: `../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROAD
## Containers (partition keys)
-| Container | PK | Purpose |
-| ----------------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
-| `fleet_jobs` | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, … |
-| `fleet_runs` | `/jobId` | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff) |
-| `fleet_leases` | `/jobId` | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status` |
-| `fleet_factories` | `/productId` | registered worker host: `capabilities`, `health`, `load`, `seatLimit`, `lastHeartbeatAt` |
-| `fleet_profiles` | `/productId` | immutable, versioned profile snapshot |
-| `fleet_events` | `/jobId` | append-only audit/event stream (monotonic `seq`) |
-| `fleet_artifacts` | `/jobId` | pointers to blob-stored artifacts (no inline logs) |
+| Container | PK | Purpose |
+| ------------------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
+| `fleet_jobs` | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, … |
+| `fleet_runs` | `/jobId` | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff) |
+| `fleet_leases` | `/jobId` | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status` |
+| `fleet_factories` | `/productId` | registered worker host: `capabilities`, `health`, `load`, `seatLimit`, `lastHeartbeatAt` |
+| `fleet_profiles` | `/productId` | immutable, versioned profile snapshot |
+| `fleet_events` | `/jobId` | append-only audit/event stream (monotonic `seq`) |
+| `fleet_artifacts` | `/jobId` | pointers to blob-stored artifacts (no inline logs) |
+| `fleet_queue_state` | `/productId` | Phase-4 M0 RU gate: monotonic `version` bumped on job create + every stage change (cheap "work changed" signal) |
Every document carries `productId`. Containers are registered in
`lib/cosmos-init.ts`.
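
As a rough illustration of how a runner might use the `fleet_queue_state` version as a cheap "work changed" signal, the sketch below remembers the last version seen per product and only reports work when the version has moved. The `QueueState` shape and the `QueueGate` class are assumptions made for this example; the actual payload of `GET /fleet/queue-state` and the runner-side gate may look different.

```typescript
// Hypothetical response shape; the real GET /fleet/queue-state payload may differ.
interface QueueState {
  productId: string;
  version: number; // monotonic: bumped on job create and on every stage change
}

// Hypothetical runner-side gate: skip the expensive claim when nothing changed.
class QueueGate {
  private lastSeen = new Map<string, number>();

  // Returns true on first sight of a product or when its version has advanced.
  shouldClaim(state: QueueState): boolean {
    const previous = this.lastSeen.get(state.productId);
    this.lastSeen.set(state.productId, state.version);
    return previous === undefined || state.version > previous;
  }
}

const gate = new QueueGate();
console.log(gate.shouldClaim({ productId: "product-a", version: 7 })); // true (first sight)
console.log(gate.shouldClaim({ productId: "product-a", version: 7 })); // false (unchanged)
console.log(gate.shouldClaim({ productId: "product-a", version: 8 })); // true (work changed)
```

This sketch records the version before any claim is attempted, so a failed claim would not be retried until the next bump; a real gate would probably need a periodic fallback poll or would only record the version after a successful claim.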
@@ -45,7 +46,10 @@ worker mutation (`patchJobFenced`, `renewLease`, `releaseLease`) carries its
stale/zombie worker can never overwrite a reassigned job.
**Heartbeat.** `heartbeat(factory)` upserts `lastHeartbeatAt` + health/load;
-`isFactoryStale` detects a missed-heartbeat factory.
+`isFactoryStale` detects a missed-heartbeat factory (stale after
+`DEFAULT_STALE_FACTORY_MS = 90_000`). A factory's heartbeat cadence
+(`AQ_FLEET_LEASE_RENEW_SEC` on the runner) MUST be well under 90s or a healthy
+factory flaps to "stale" between beats — the fleet launcher uses 30s.
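
To make the cadence constraint concrete, here is a small sketch of the arithmetic. `isStaleSketch` is a hypothetical stand-in for `isFactoryStale`, not its real signature; it assumes `lastHeartbeatAt` is available as epoch milliseconds and that the comparison is strict. `worstCaseAgeMs` is likewise an illustrative helper.

```typescript
// Threshold quoted in the text above.
const DEFAULT_STALE_FACTORY_MS = 90_000;

// Hypothetical stand-in for isFactoryStale; assumes epoch-millisecond timestamps
// and a strict "older than the threshold" comparison.
function isStaleSketch(
  lastHeartbeatAtMs: number,
  nowMs: number,
  staleMs: number = DEFAULT_STALE_FACTORY_MS,
): boolean {
  return nowMs - lastHeartbeatAtMs > staleMs;
}

// Oldest a heartbeat can get just before the next one lands,
// if `missedBeats` consecutive beats were lost along the way.
function worstCaseAgeMs(cadenceSec: number, missedBeats: number): number {
  return cadenceSec * 1000 * (missedBeats + 1);
}

// 30s cadence (the fleet launcher's value): healthy, even with one lost beat.
console.log(isStaleSketch(0, worstCaseAgeMs(30, 0))); // false
console.log(isStaleSketch(0, worstCaseAgeMs(30, 1))); // false

// 120s cadence: a perfectly healthy factory is reported stale between beats.
console.log(isStaleSketch(0, worstCaseAgeMs(120, 0))); // true
```

Under these assumptions a 30s cadence leaves room for a lost beat or two before the factory is flagged, which is the margin the "well under 90s" requirement is asking for.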
**Reaper.** `reapExpiredLeases(now)` scans `held` leases with `expiresAt < now`,
bumps `leaseEpoch` (fencing the dead holder), returns the job to `queued` (or
@@ -69,7 +73,8 @@ so the reaper — not TTL — owns recovery.
`PATCH /fleet/jobs/:id` (fenced) · `POST /fleet/claim` ·
`POST /fleet/jobs/:id/lease/renew` · `POST /fleet/jobs/:id/lease/release` ·
`POST /fleet/factories/heartbeat` · `GET /fleet/jobs/:id/runs` ·
-`GET /fleet/jobs/:id/events`.
+`GET /fleet/jobs/:id/events` · `GET /fleet/metrics` ·
+`GET /fleet/queue-state` (Phase-4 M0 RU gate).
## Files