docs(gigafactory): fix stale/incorrect fleet docs

- fleet module README: add fleet_queue_state container + GET /fleet/queue-state and /fleet/metrics; note the heartbeat cadence must stay under the 90s stale threshold (AQ_FLEET_LEASE_RENEW_SEC).
- FLEET_CONTROL_PLANE: correct wrong endpoint paths (/fleet/claim and /fleet/factories/heartbeat were documented as /fleet/jobs/:id/claim and /fleet/factories/:id/heartbeat; removed a non-existent GET /fleet/factories); add enroll, metrics, and the M0 queue-state endpoint.
- ROADMAP_COMPLETION_AUDIT: dated status banner — roadmap §0 now reconciled and Phase-4 M0 shipped, superseding the older "stale §0 / not started" findings.
- README: point to FLEET_DISPATCH_REDESIGN.md + the M0 gate.

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

parent ba7db0008d
commit 78c4e47460

@@ -23,7 +23,7 @@ score = w.age * ageMinutes + w.priority * priorityOrder + w.retries * attempts +

 ### Weight Resolution Order

-1. **Per-request override** — `weights` field in `POST /fleet/jobs/:id/claim` body
+1. **Per-request override** — `weights` field in the `POST /fleet/claim` body
 2. **Product registry** — set via `setWeightRegistry({ [productId]: weights })`
 3. **Defaults** — `{ age: 1, priority: 10, retries: -2, capabilities: 5 }`

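The resolution order in the hunk above can be sketched in a few lines of TypeScript. This is only an illustration: `Weights`, `DEFAULT_WEIGHTS`, and `resolveWeights` are hypothetical names, and the sketch assumes the first source that is present wins as a whole object rather than being merged field by field, which the diff does not state.

```typescript
// Sketch only. The three-step order and the default values come from the doc;
// every identifier below is a hypothetical name chosen for illustration.
interface Weights {
  age: number;
  priority: number;
  retries: number;
  capabilities: number;
}

const DEFAULT_WEIGHTS: Weights = { age: 1, priority: 10, retries: -2, capabilities: 5 };

function resolveWeights(
  productId: string,
  requestWeights: Weights | undefined,
  registry: Partial<Record<string, Weights>>,
): Weights {
  // Assumption: no merging across levels; the first defined source is used as-is.
  // 1. per-request override, 2. product registry entry, 3. defaults.
  return requestWeights ?? registry[productId] ?? DEFAULT_WEIGHTS;
}
```

If the real implementation merges partial overrides onto the defaults, the same order would apply per field instead of per object.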
@@ -172,11 +172,13 @@ on branch `aq/job/<jobId>`, then commits, pushes, and opens a PR via `gh`. The P
 | `/fleet/jobs/:id` | PATCH | 2 | Update stage (fenced) |
 | `/fleet/jobs/:id/actions/:action` | POST | 3 | Operator action: `requeue` / `reject` / `cancel` / `ship` |
 | `/fleet/jobs/:id/lease/release` | POST | 2 | Release lease (optional `stage`, `insights`, `result`) |
-| `/fleet/jobs/:id/claim` | POST | 2 | Factory claims next job |
+| `/fleet/claim` | POST | 2 | Factory claims the next best-fit job |
 | `/fleet/jobs/:id/children` | POST | 3 | Add children to existing job |
 | `/fleet/jobs/:id/dag` | GET | 3 | Get DAG subtree |
-| `/fleet/factories` | GET | 2 | List factories |
-| `/fleet/factories/:id/heartbeat` | POST | 2 | Factory heartbeat |
+| `/fleet/factories/heartbeat` | POST | 2 | Factory heartbeat (register == first heartbeat) |
+| `/fleet/factories/enroll` | POST | 2 | Enroll a factory → one-time scoped token |
+| `/fleet/metrics` | GET | 3 | Utilization, health rollup, queue/starvation alerts |
+| `/fleet/queue-state` | GET | 4 | M0 RU gate: per-product monotonic queue `version` |
 | `/fleet/budgets/:productId` | GET | 3 | Get budget |
 | `/fleet/budgets/:productId` | PUT | 3 | Upsert budget |
 | `/fleet/budgets/:productId/pause` | POST | 3 | Pause budget |

@@ -6,16 +6,17 @@ the `dashboards/tracker-web` UI).

 ## Contents

-| Doc | What it is |
-| --- | --- |
-| [`ROADMAP_COMPLETION_AUDIT.md`](ROADMAP_COMPLETION_AUDIT.md) | Audit of the current build state against the roadmap: completed / partial / missing features, risks, and prioritized remaining work. |
-| [`TASKS_TO_COMPLETE.md`](TASKS_TO_COMPLETE.md) | The actionable, priority-ordered completion checklist that companions the audit. |
-| [`gigafactory-phase3-progress.md`](gigafactory-phase3-progress.md) | Phase-3 progress log: per-slice end-state. |
-| [`FLEET_CONTROL_PLANE.md`](FLEET_CONTROL_PLANE.md) | Operational guide for running and using the fleet control plane. |
+| Doc | What it is |
+| ------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------ |
+| [`ROADMAP_COMPLETION_AUDIT.md`](ROADMAP_COMPLETION_AUDIT.md) | Audit of the current build state against the roadmap: completed / partial / missing features, risks, and prioritized remaining work. |
+| [`TASKS_TO_COMPLETE.md`](TASKS_TO_COMPLETE.md) | The actionable, priority-ordered completion checklist that companions the audit. |
+| [`gigafactory-phase3-progress.md`](gigafactory-phase3-progress.md) | Phase-3 progress log: per-slice end-state. |
+| [`FLEET_CONTROL_PLANE.md`](FLEET_CONTROL_PLANE.md) | Operational guide for running and using the fleet control plane. |

 ## Source of truth

 The canonical spec and system overview live in the runner repo,
 `learning_ai_devops_tools`, under
 `agent-queue/docs/GIGAFACTORY/` (`GIGAFACTORY_ROADMAP.md`,
-`GIGAFACTORY_SYSTEM_OVERVIEW.md`).
+`GIGAFACTORY_SYSTEM_OVERVIEW.md`, and `FLEET_DISPATCH_REDESIGN.md` — the Phase-4
+broker design + the as-built **M0 RU gate** `fleet_queue_state` / `GET /fleet/queue-state`).

@@ -3,6 +3,17 @@
 > Source of truth: `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md`
 > Audit date: 2026-05-30 · Auditor: Principal Full-Stack review
 > Scope: `services/platform-service/src/modules/fleet/**` + `dashboards/tracker-web/src/app/dashboard/fleet/**`
+>
+> **⚠️ Status update (2026-05-31) — this audit is a point-in-time snapshot; some
+> findings below are now superseded:**
+>
+> - The roadmap §0 tracker has since been **reconciled** — it now reads Phase 0
+> ✅100% · 1 ✅~98% · 2 ✅~98% · 3 ✅100% · 4 ◐ in progress · 5 ☐. The
+> "§0 is stale / Phase 3 0% / Phase 2 80%" notes below are **no longer accurate**.
+> - **Phase 4 is no longer "not started": M0 (RU gate) is shipped** —
+> `fleet_queue_state` + `GET /fleet/queue-state` + `AQ_FLEET_GATE`. The broker
+> (M1+) design + checklist live in
+> `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md`.

 ## 1. Product understanding

@@ -11,15 +11,16 @@ Spec: `../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROAD

 ## Containers (partition keys)

-| Container | PK | Purpose |
-| ----------------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
-| `fleet_jobs` | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, … |
-| `fleet_runs` | `/jobId` | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff) |
-| `fleet_leases` | `/jobId` | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status` |
-| `fleet_factories` | `/productId` | registered worker host: `capabilities`, `health`, `load`, `seatLimit`, `lastHeartbeatAt` |
-| `fleet_profiles` | `/productId` | immutable, versioned profile snapshot |
-| `fleet_events` | `/jobId` | append-only audit/event stream (monotonic `seq`) |
-| `fleet_artifacts` | `/jobId` | pointers to blob-stored artifacts (no inline logs) |
+| Container | PK | Purpose |
+| ------------------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
+| `fleet_jobs` | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, … |
+| `fleet_runs` | `/jobId` | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff) |
+| `fleet_leases` | `/jobId` | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status` |
+| `fleet_factories` | `/productId` | registered worker host: `capabilities`, `health`, `load`, `seatLimit`, `lastHeartbeatAt` |
+| `fleet_profiles` | `/productId` | immutable, versioned profile snapshot |
+| `fleet_events` | `/jobId` | append-only audit/event stream (monotonic `seq`) |
+| `fleet_artifacts` | `/jobId` | pointers to blob-stored artifacts (no inline logs) |
+| `fleet_queue_state` | `/productId` | Phase-4 M0 RU gate: monotonic `version` bumped on job create + every stage change (cheap "work changed" signal) |

 Every document carries `productId`. Containers are registered in
 `lib/cosmos-init.ts`.
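One way a consumer might use the monotonic `version` as a cheap "work changed" signal is sketched below. This is an assumption about client-side usage, not the shipped gate: the `QueueGate` class and its method are hypothetical, the shape of the `GET /fleet/queue-state` response is not shown in this diff, and the sketch takes the already-fetched `version` as a plain number.

```typescript
// Sketch only: hypothetical client-side gate. It remembers the last queue
// version seen per product and reports whether anything changed since then.
class QueueGate {
  private lastSeen = new Map<string, number>();

  // Returns true when the caller should run the (more expensive) claim path.
  shouldClaim(productId: string, version: number): boolean {
    const previous = this.lastSeen.get(productId);
    this.lastSeen.set(productId, version);
    // First observation, or the version moved forward: work may have changed.
    return previous === undefined || version > previous;
  }
}
```

A gate like this skips the claim query when the version is unchanged; whether the real runner also re-claims on a timer or after finishing a job is outside what the diff documents.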
@@ -45,7 +46,10 @@ worker mutation (`patchJobFenced`, `renewLease`, `releaseLease`) carries its
 stale/zombie worker can never overwrite a reassigned job.

 **Heartbeat.** `heartbeat(factory)` upserts `lastHeartbeatAt` + health/load;
-`isFactoryStale` detects a missed-heartbeat factory.
+`isFactoryStale` detects a missed-heartbeat factory (stale after
+`DEFAULT_STALE_FACTORY_MS = 90_000`). A factory's heartbeat cadence
+(`AQ_FLEET_LEASE_RENEW_SEC` on the runner) MUST be well under 90s or a healthy
+factory flaps to "stale" between beats — the fleet launcher uses 30s.

 **Reaper.** `reapExpiredLeases(now)` scans `held` leases with `expiresAt < now`,
 bumps `leaseEpoch` (fencing the dead holder), returns the job to `queued` (or
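The staleness rule described in the heartbeat paragraph can be illustrated with a short TypeScript sketch. Only the names `isFactoryStale` and `DEFAULT_STALE_FACTORY_MS` and the 90_000 ms value come from the doc; the signature, the ISO-string timestamp, and the strict `>` comparison are assumptions made for the example.

```typescript
// Sketch only: the real signature of isFactoryStale is not shown in the diff.
const DEFAULT_STALE_FACTORY_MS = 90_000;

function isFactoryStale(
  lastHeartbeatAt: string, // assumed to be an ISO-8601 timestamp
  now: Date,
  staleMs: number = DEFAULT_STALE_FACTORY_MS,
): boolean {
  // Stale once the gap since the last heartbeat exceeds the threshold.
  return now.getTime() - Date.parse(lastHeartbeatAt) > staleMs;
}
```

With a 30s cadence, a factory can lose a beat or two and still stay inside the 90s window, while a cadence at or near 90s leaves no margin for jitter.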
@@ -69,7 +73,8 @@ so the reaper — not TTL — owns recovery.
 `PATCH /fleet/jobs/:id` (fenced) · `POST /fleet/claim` ·
 `POST /fleet/jobs/:id/lease/renew` · `POST /fleet/jobs/:id/lease/release` ·
 `POST /fleet/factories/heartbeat` · `GET /fleet/jobs/:id/runs` ·
-`GET /fleet/jobs/:id/events`.
+`GET /fleet/jobs/:id/events` · `GET /fleet/metrics` ·
+`GET /fleet/queue-state` (Phase-4 M0 RU gate).

 ## Files
