# Agent Gigafactory — System Overview (current picture)

> Companion to `GIGAFACTORY_ROADMAP.md` (the source-of-truth spec & checklists).
> This document describes **what is actually built today**, how the pieces fit
> together, the architecture diagrams, the code map across both repos, the next
> steps, and the known bugs/gaps. Last reviewed: **2026-05-31**.
>
> The **Phase-4 plan + the as-built M0 RU gate** live in
> [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md) — read it for the
> broker-backed dispatch design and the migration checklist.

---

## 1. What it is (in one paragraph)

The **Agent Gigafactory** turns a single-host "folder queue" agent runner into a
**distributed fleet** of agent "factories" (machines: mac/ubuntu/windows) that
claim and execute coding jobs in parallel, coordinated by a durable,
product-agnostic service. A job is a markdown manifest (persona + capabilities +
budget + deps); the **coordinator** assigns each job to the best-fit factory via a
deterministic scoring router, guarantees **exactly-once assignment** through
optimistic-concurrency claims + **leases with epoch fencing**, recovers crashed
work automatically (reaper + WIP checkpoints), enforces **per-product budgets**,
supports **DAG decomposition** (composite → child jobs), and exposes the whole
fleet through **two control planes**: a browser UI (`tracker-web`) and a terminal
TUI (`agent-queue` dashboard). Both control planes talk to the same `/fleet` REST
API.

---

## 2. Completion snapshot (reality, not the stale table)

| Phase | Theme | Real status | Notes |
| ----- | ----- | ----------- | ----- |
| **0** | Single-host baseline | ✅ 100% | `agent-queue.sh` folder queue, selftest green |
| **1** | Manifest + profiles + capabilities + tracker adapter | ✅ ~98% | Only leftover: Node `dash` field surfacing — **now also done** via fleet-dash tags. Effectively complete |
| **2** | Coordinator module + Cosmos + multi-factory leasing | ✅ ~98% | Scheduler wiring, enrollment + tokens, and tracker-bridge are **done in code**; the scheduler and enrollment boxes are now ticked in the roadmap (see §14) |
| **3** | Fleet control plane (web + TUI) + DAG + budgets + scoring | ✅ 100% (all boxes ticked) | Pending: Playwright e2e wired into CI; live multi-host operator run |
| **4** | Message bus + autoscaling + capability marketplace | 🟡 in progress | **M0 (RU gate) shipped** — see below. Broker (M1+) not started. Plan: [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md) |
| **5** | Self-optimizing / learned routing | ☐ 0% | Not started |

> **Phase-4 M0 (RU gate) is live (2026-05-31):** a per-product `fleet_queue_state`
> doc holds a monotonic `version` (bumped on job create + every stage change);
> factories with `AQ_FLEET_GATE=1` point-read `GET /fleet/queue-state` (~1 RU) and
> skip the expensive claim while nothing changed — cutting idle Cosmos RU without
> raising the local poll interval. Default off; the live fleet runs with it on.

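The gate decision described above can be sketched as a small pure function. This is a minimal illustration, not the factory's actual code (the real client is the shell script `lib/fleet-client.sh`): only the `version` field, the `GET /fleet/queue-state` endpoint and the `AQ_FLEET_GATE` flag come from this document, while the `QueueState`/`GateState` types and the `shouldClaim`/`tick` names are assumptions. Because a successful claim is itself a stage change, it bumps the version, so the factory claims again on its next tick.

```typescript
// Hypothetical sketch of the M0 RU gate decision on the factory side.
// Only `version` and the endpoint name come from this document; the
// types and function names are illustrative assumptions.

interface QueueState {
  version: number; // monotonic, bumped on job create + every stage change
}

interface GateState {
  lastSeenVersion: number | null; // null until the first successful read
}

/** Returns true when the factory should pay for a full claim call. */
function shouldClaim(gate: GateState, current: QueueState, gateEnabled: boolean): boolean {
  if (!gateEnabled) return true; // AQ_FLEET_GATE off: always claim
  if (gate.lastSeenVersion === null) return true; // first poll: no baseline yet
  return current.version !== gate.lastSeenVersion;
}

/** One poll tick: decide, then remember the version that was just read. */
function tick(gate: GateState, current: QueueState, gateEnabled: boolean): boolean {
  const claim = shouldClaim(gate, current, gateEnabled);
  gate.lastSeenVersion = current.version;
  return claim;
}

const gate: GateState = { lastSeenVersion: null };
const decisions: boolean[] = [
  tick(gate, { version: 7 }, true), // first read: claim
  tick(gate, { version: 7 }, true), // unchanged: skip the claim
  tick(gate, { version: 8 }, true), // job created or stage changed: claim
];
console.log(decisions);
```
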
---
## 3. System architecture

```mermaid
graph TB
  subgraph CP["Control planes (operators)"]
    WEB["tracker-web Fleet UI<br/>(Next.js, /dashboard/fleet/*)"]
    TUI["agent-queue TUI<br/>(dashboard.mjs, AQ_FLEET_DASH=1)"]
  end

  subgraph SVC["platform-service — fleet module (the spine)"]
    ROUTES["routes.ts<br/>/fleet REST + SSE"]
    COORD["coordinator.ts<br/>claim · lease · fence · reaper<br/>preemption · budgets · DAG · review"]
    SCHED["scheduler.ts<br/>pure scoring router (§7)"]
    ENROLL["enrollment.ts<br/>factory tokens (scoped, rotatable)"]
    BRIDGE["tracker-bridge.ts<br/>job ↔ tracker item"]
    ARTIF["artifacts.ts / artifacts-blob.ts<br/>pointer + blob bytes"]
    REPO["repository.ts<br/>CAS (rev/_etag) CRUD"]
  end

  subgraph DATA["@bytelyst/datastore (Cosmos / memory)"]
    JOBS[("fleet_jobs")]
    RUNS[("fleet_runs")]
    LEASES[("fleet_leases")]
    FAC[("fleet_factories")]
    PROF[("fleet_profiles")]
    EVENTS[("fleet_events")]
    ARTDOCS[("fleet_artifacts")]
  end

  subgraph FLEET["Factory agents (workers, N hosts)"]
    F1["agent-queue.sh + lib/fleet-client.sh<br/>(AQ_FLEET=1) — mac-1"]
    F2["agent-queue.sh + lib/fleet-client.sh<br/>ubuntu-1"]
    ENGINES["engines: claude · codex · devin"]
  end

  WEB -->|/api/fleet proxy| ROUTES
  TUI -->|lib/fleet-dash.mjs| ROUTES
  ROUTES --> COORD
  COORD --> SCHED
  ROUTES --> ENROLL
  ROUTES --> BRIDGE
  ROUTES --> ARTIF
  COORD --> REPO
  ENROLL --> REPO
  BRIDGE --> REPO
  ARTIF --> ARTDOCS
  REPO --> JOBS & RUNS & LEASES & FAC & PROF & EVENTS

  F1 -->|heartbeat · claim · patch fenced · renew| ROUTES
  F2 -->|heartbeat · claim · patch fenced · renew| ROUTES
  F1 --> ENGINES
  F2 --> ENGINES
```

**Layering principle:** `scheduler.ts` is **pure** (no I/O — all inputs passed
in), `coordinator.ts` is the orchestration core, `repository.ts` is the only thing
that touches the datastore, and `routes.ts` is the only thing that touches HTTP.
Factories never touch the DB directly — they only call REST.

---

## 4. Job lifecycle (stages)

```mermaid
stateDiagram-v2
  [*] --> queued: submitJob
  queued --> blocked: unmet deps
  blocked --> queued: deps satisfied (reaper/unblock)
  queued --> assigned: claimNextJob (CAS win + lease)
  assigned --> building: factory starts (patch fenced)
  building --> review: rc=0 → review gate
  building --> testing: verify-pass (auto)
  review --> testing: approve / requestReview quorum
  testing --> shipped: ship (manual gate)
  building --> failed: verify-fail / budget_exceeded / timeout
  review --> failed: reject
  assigned --> queued: lease expired (reaper, +epoch, keep checkpoint)
  building --> queued: preempted (critical job, checkpoint + epoch bump)
  failed --> queued: requeue (operator)
  failed --> dead_letter: retries exhausted
  shipped --> [*]
  dead_letter --> [*]
```

Stages (`types.ts`): `queued · blocked · assigned · building · review · testing ·
shipped · failed · dead_letter`. The TUI and the local board collapse these onto
kanban buckets (`inbox/building/review/testing/shipped/failed`) for parity.

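As a reading aid, the state diagram above can be transcribed into a lookup table. The stage names are the ones listed from `types.ts`, and the edges are copied from the diagram; the `TRANSITIONS` constant and the `canTransition` helper are hypothetical names for this sketch and are not claimed to exist in the codebase.

```typescript
// Stage names come from `types.ts` as listed in this document; the table is
// transcribed from the state diagram. `canTransition` is a hypothetical helper.

type Stage =
  | "queued" | "blocked" | "assigned" | "building" | "review"
  | "testing" | "shipped" | "failed" | "dead_letter";

const TRANSITIONS: Record<Stage, Stage[]> = {
  queued: ["blocked", "assigned"],
  blocked: ["queued"], // deps satisfied
  assigned: ["building", "queued"], // queued = lease expired (reaper)
  building: ["review", "testing", "failed", "queued"], // queued = preempted
  review: ["testing", "failed"],
  testing: ["shipped"], // manual ship gate
  failed: ["queued", "dead_letter"], // operator requeue or retries exhausted
  shipped: [], // terminal
  dead_letter: [], // terminal
};

/** True when the diagram has an edge from `from` to `to`. */
function canTransition(from: Stage, to: Stage): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

console.log(canTransition("queued", "assigned")); // edge exists in the diagram
console.log(canTransition("shipped", "queued")); // shipped is terminal
```
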
---
## 5. The core guarantee — atomic claim + lease fencing

This is the heart of "no double-assignment, ever" and "a dead worker can never
corrupt a reassigned job."

```mermaid
sequenceDiagram
  participant FA as Factory A
  participant FB as Factory B
  participant CO as coordinator
  participant DB as fleet_jobs / fleet_leases

  FA->>CO: POST /fleet/claim (caps)
  FB->>CO: POST /fleet/claim (caps)
  CO->>DB: selectJob() → job J (rev=5)
  CO->>DB: revUpdate J: queued→assigned IF rev==5 (CAS)
  DB-->>CO: A wins (rev→6, leaseEpoch=1)
  CO->>DB: revUpdate J IF rev==5 (B's CAS)
  DB-->>CO: conflict (B re-selects)
  CO-->>FA: assigned J (leaseEpoch=1)
  CO-->>FB: conflict → next job

  Note over FA: A crashes mid-build
  CO->>DB: reapExpiredLeases(): lease expired → J back to queued,<br/>leaseEpoch=2, checkpoint preserved
  FB->>CO: claim → J (leaseEpoch=2)
  FA-->>CO: (zombie) PATCH J stage=shipped leaseEpoch=1
  CO-->>FA: 409 fenced (1 < 2) — rejected
```

- **CAS:** `repository.revUpdateJob/revUpdateLease` write only if stored `rev`
  matches (Cosmos `_etag`/`If-Match`; memory provider re-reads `rev`).
- **Fencing:** every worker mutation carries `leaseEpoch`; epoch `< job.leaseEpoch`
  ⇒ `fenced` (409).
- **Reaper:** `reapExpiredLeases(now)` requeues expired-lease jobs, **bumps the
  epoch**, and **keeps the `checkpoint`** (WIP git branch pointer) so work resumes
  rather than restarts. Cosmos TTL cannot do this — the reaper owns recovery.

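The three mechanisms above (CAS, fencing, reaper) can be replayed in memory to see why the zombie write is rejected. This is a simplified sketch under stated assumptions: the field names `rev`, `leaseEpoch`, `stage` and `checkpoint` follow the data model in this document, but the functions are hypothetical stand-ins for the coordinator and repository code, the job is assumed to start at `leaseEpoch = 1`, and the lease fields are folded into the job object for brevity (the real system keeps them in `fleet_leases`).

```typescript
// In-memory sketch of rev-based CAS, epoch fencing and lease reaping.
// Simplified, hypothetical stand-ins; not copies of coordinator.ts.

interface Job {
  id: string;
  stage: "queued" | "assigned" | "building" | "shipped";
  rev: number;
  leaseEpoch: number;
  holder: string | null;
  leaseExpiresAt: number; // epoch ms; 0 when unleased
  checkpoint: string | null; // WIP git branch pointer
}

type Result = "ok" | "conflict" | "fenced";

/** Compare-and-swap: mutate only if the caller saw the current rev. */
function revUpdate(job: Job, expectedRev: number, mutate: (j: Job) => void): Result {
  if (job.rev !== expectedRev) return "conflict";
  mutate(job);
  job.rev += 1;
  return "ok";
}

/** Claim = CAS from queued to assigned plus a fresh lease. */
function claim(job: Job, seenRev: number, factoryId: string, now: number, ttlMs: number): Result {
  return revUpdate(job, seenRev, (j) => {
    j.stage = "assigned";
    j.holder = factoryId;
    j.leaseExpiresAt = now + ttlMs;
  });
}

/** Worker mutation: rejected when the caller's epoch is older than the job's. */
function fencedPatch(job: Job, callerEpoch: number, stage: Job["stage"]): Result {
  if (callerEpoch < job.leaseEpoch) return "fenced"; // HTTP 409 in the API
  job.stage = stage;
  job.rev += 1;
  return "ok";
}

/** Reaper: requeue an expired lease, bump the epoch, keep the checkpoint. */
function reapExpired(job: Job, now: number): boolean {
  const leased = job.stage === "assigned" || job.stage === "building";
  if (!leased || job.leaseExpiresAt > now) return false;
  job.stage = "queued";
  job.holder = null;
  job.leaseExpiresAt = 0;
  job.leaseEpoch += 1; // fences the previous holder
  job.rev += 1;
  return true;
}

// Replay of the sequence diagram: two factories race on rev 5.
const job: Job = {
  id: "J", stage: "queued", rev: 5, leaseEpoch: 1,
  holder: null, leaseExpiresAt: 0, checkpoint: "wip/J",
};
const claimA = claim(job, 5, "factory-a", 0, 1000); // A wins the CAS
const claimB = claim(job, 5, "factory-b", 0, 1000); // B still holds rev 5
const reaped = reapExpired(job, 2000); // A crashed; lease expired
const reclaimB = claim(job, job.rev, "factory-b", 2000, 1000);
const zombie = fencedPatch(job, 1, "shipped"); // A's late write, epoch 1
console.log(claimA, claimB, reaped, reclaimB, zombie, job.leaseEpoch, job.checkpoint);
```
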
---
## 6. Data model (Cosmos containers)

| Container | PK | Purpose |
| --------- | -- | ------- |
| `fleet_jobs` | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `depsMode`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, `kind`, `parentId` |
| `fleet_runs` | `/jobId` | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff) |
| `fleet_leases` | `/jobId` | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status` |
| `fleet_factories` | `/productId` | worker host: `capabilities[]`, `health`, `load`, `seatLimit`, `lastHeartbeatAt` |
| `fleet_profiles` | `/productId` | immutable, versioned persona/capability profile snapshot |
| `fleet_events` | `/jobId` | append-only audit stream (monotonic `seq`) — powers SSE |
| `fleet_artifacts` | `/jobId` | **pointers** to blob-stored artifacts (no inline logs) |
| `fleet_queue_state` | `/productId` | **Phase-4 M0 RU gate**: monotonic `version` bumped on job create + every stage change; read via `GET /fleet/queue-state` so a factory can cheaply detect "work changed" |

Every document carries `productId`. Containers are registered in `lib/cosmos-init.ts`.

---

## 7. The scheduler / scoring router (`scheduler.ts`)

Pure, deterministic, fixed-weight (tunable per-product in Phase 3, learned in
Phase 5). Filter → score → rank:

```
score = w1·capabilityFit + w2·affinity + w3·(1/(1+load))
      + w4·costFit(budget) + w5·health − w6·starvationPenalty(age)
```

Default weights (`DEFAULT_WEIGHTS`): `capabilityFit 1.0 · affinity 0.5 · load 1.0
· costFit 0.75 · health 1.0 · starvation 1.5`. Capability is a **hard filter**
(subset check); `down` factories are filtered out, not scored; aging fully
de-penalises after ~30 min (anti-starvation). `scoreCandidate` returns a per-term
breakdown that powers the **explainability** panel (`GET /fleet/jobs/:id/explain`
→ `ExplainPanel`). `selectPreemptionVictim` picks the lowest-priority running job a
critical job may evict (under `FLEET_PREEMPTION`).

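The filter-then-score step can be illustrated for a single job/factory pair. The weights below are the `DEFAULT_WEIGHTS` values quoted above, and the subset check and the "down factories are filtered, not scored" rule come from this document. Everything else is an assumption made for the sketch: the `Candidate` shape, the `healthy`/`degraded` health values, treating `capabilityFit` as 1 once the filter passes, and a linear starvation penalty that reaches 0 at 30 minutes. The actual term definitions live in `scheduler.ts` and may differ.

```typescript
// Sketch of filter -> score for one job/factory pair. Weights are the
// DEFAULT_WEIGHTS quoted in this document; the shape of each term and every
// type/function name here is an illustrative assumption.

interface Weights {
  capabilityFit: number; affinity: number; load: number;
  costFit: number; health: number; starvation: number;
}

const DEFAULT_WEIGHTS: Weights = {
  capabilityFit: 1.0, affinity: 0.5, load: 1.0,
  costFit: 0.75, health: 1.0, starvation: 1.5,
};

interface Candidate {
  capabilities: string[];
  health: "healthy" | "degraded" | "down"; // only `down` is named in the doc
  load: number; // jobs currently running on the factory
  affinity: number; // assumed 0..1
  costFit: number; // assumed 0..1
}

/** Hard filter: every required capability must be advertised. */
function hasCapabilities(required: string[], advertised: string[]): boolean {
  return required.every((cap) => advertised.indexOf(cap) !== -1);
}

/** Assumed penalty: 1 for a brand-new job, decaying linearly to 0 at 30 min. */
function starvationPenalty(ageMs: number): number {
  const windowMs = 30 * 60 * 1000;
  return Math.max(0, 1 - ageMs / windowMs);
}

/** Returns null when the factory is filtered out, else the weighted score. */
function score(
  required: string[], c: Candidate, ageMs: number, w: Weights = DEFAULT_WEIGHTS,
): number | null {
  if (c.health === "down" || !hasCapabilities(required, c.capabilities)) return null;
  const healthTerm = c.health === "healthy" ? 1 : 0.5;
  return (
    w.capabilityFit * 1 + // subset check passed
    w.affinity * c.affinity +
    w.load * (1 / (1 + c.load)) +
    w.costFit * c.costFit +
    w.health * healthTerm -
    w.starvation * starvationPenalty(ageMs)
  );
}

const mac: Candidate = {
  capabilities: ["os:mac", "engine:claude", "has:git"],
  health: "healthy", load: 1, affinity: 0, costFit: 1,
};
const agedScore = score(["os:mac", "engine:claude"], mac, 30 * 60 * 1000);
const freshScore = score(["os:mac", "engine:claude"], mac, 0);
const unroutable = score(["build"], mac, 0); // no factory advertises "build"
console.log(agedScore, freshScore, unroutable);
```
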
---
## 8. Subsystems at a glance

| Subsystem | File(s) | What it does | Flag |
| --------- | ------- | ------------ | ---- |
| Claim / lease / fence / reaper | `coordinator.ts` | exactly-once assignment, recovery | — |
| Scoring router + preemption | `scheduler.ts`, `coordinator.ts` | best-fit assignment, evict low-pri for critical | `FLEET_PREEMPTION` |
| Per-product budgets | `coordinator.ts` (`accrueSpend`, `pause/resume`) | ceiling + auto-pause kill-switch; burndown | `FLEET_BUDGETS` |
| DAG decomposition | `coordinator.ts` (`submitChildren`, `getDagSubtree`, `maybeUnblockParent`) | composite job fans out to children; deps gate parent | — |
| Review gate | `coordinator.ts` (`requestReview`, `submitReview`) | multi-reviewer quorum before ship | — |
| Factory enrollment | `enrollment.ts` | scoped, rotatable, hashed tokens; auth on claim/heartbeat | — |
| Tracker bridge | `tracker-bridge.ts` | idempotent ingest of tracker item → job; one-way status echo | — |
| Artifacts | `artifacts.ts`, `artifacts-blob.ts` | pointer docs in Cosmos, bytes in blob (SAS) | — |
| Live events | `routes.ts` SSE + `fleet_events` | `GET /fleet/jobs/:id/events/stream` | — |
| Metrics / alerts | `coordinator.ts` (`fleetMetrics`) | utilization, health rollup, starvation alerts | — |

---

## 9. REST API surface (`/fleet`, under `/api`, auth + `x-product-id`)

```
Jobs         POST /fleet/jobs · GET /fleet/jobs · GET /fleet/jobs/:id
             PATCH /fleet/jobs/:id (fenced) · POST /fleet/jobs/:id/actions/:action
Claim        POST /fleet/claim
Lease        POST /fleet/jobs/:id/lease/renew · /lease/release
Factories    POST /fleet/factories/heartbeat · /enroll
             POST /fleet/factories/:id/token/rotate · /token/revoke
Runs/Events  GET /fleet/jobs/:id/runs · /events · /events/stream (SSE) · /explain
Review       POST /fleet/jobs/:id/review/request · /review
Budgets      GET /fleet/budgets/:productId · /burndown
             PUT /fleet/budgets/:productId · POST /pause · /resume
DAG          POST /fleet/jobs/:id/children · GET /fleet/jobs/:id/dag
Artifacts    POST /fleet/jobs/:id/artifacts · GET (list) · GET/DELETE /fleet/artifacts/:id
Tracker      POST /fleet/tracker/ingest · /fleet/tracker/echo
Metrics      GET /fleet/metrics · GET /fleet/queue-state (Phase-4 M0 RU gate)
```

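Every call in the listing above carries a bearer token and an `x-product-id` header. The helper below sketches how a client might assemble such a request without sending it. The endpoint paths and the three environment variable names (`AQ_FLEET_API`, `AQ_FLEET_TOKEN`, `AQ_PRODUCT_ID`) are from this document; the `fleetRequest` helper, the `{ capabilities: [...] }` body shape, the example host, and the assumption that the base URL already ends in `/api` are all hypothetical, so check `routes.ts` for the real contract.

```typescript
// Hypothetical helper showing the auth + product headers a /fleet call carries.
// The JSON body shape and the base URL layout are assumptions.

interface FleetEnv {
  apiBase: string; // AQ_FLEET_API
  token: string; // AQ_FLEET_TOKEN
  productId: string; // AQ_PRODUCT_ID
}

interface FleetRequest {
  method: "GET" | "POST";
  url: string;
  headers: { [name: string]: string };
  body?: string;
}

/** Builds (but does not send) a request description for a /fleet endpoint. */
function fleetRequest(
  env: FleetEnv, method: "GET" | "POST", path: string, payload?: unknown,
): FleetRequest {
  const base = env.apiBase.replace(/\/+$/, ""); // tolerate a trailing slash
  const req: FleetRequest = {
    method,
    url: base + path,
    headers: {
      authorization: "Bearer " + env.token,
      "x-product-id": env.productId,
    },
  };
  if (payload !== undefined) {
    req.headers["content-type"] = "application/json";
    req.body = JSON.stringify(payload);
  }
  return req;
}

// Placeholder values for illustration only.
const env: FleetEnv = {
  apiBase: "https://platform.example/api/",
  token: "example-token",
  productId: "demo-product",
};
const claimReq = fleetRequest(env, "POST", "/fleet/claim", {
  capabilities: ["os:mac", "engine:claude"],
});
const gateReq = fleetRequest(env, "GET", "/fleet/queue-state");
console.log(claimReq.url, gateReq.url);
```
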
---
## 10. The two control planes & feature flags

**Browser (`tracker-web`)** — `dashboards/tracker-web/src/`:

- `app/dashboard/fleet/page.tsx` — fleet map (factory cards, health/load/caps, metrics + alerts)
- `app/dashboard/fleet/jobs/page.tsx` — stage-filtered job table
- `app/dashboard/fleet/jobs/[id]/page.tsx` — job detail: SSE event timeline, runs, artifacts, **DAG view**, **ExplainPanel**, **ReviewGateCard**, ship/requeue/reject
- `app/dashboard/fleet/budget/page.tsx` — burndown chart + pause/resume kill-switch
- `lib/fleet-client.ts` — typed client; `subscribeJobEvents` (fetch-based SSE w/ auth + `Last-Event-ID` resume + poll fallback); graceful 404 → null
- `app/api/fleet/[...path]/route.ts` — proxy to platform-service

**Terminal (`agent-queue`)** — `learning_ai_devops_tools/agent-queue/`:

- `dashboard.mjs` (`AQ_FLEET_DASH=1`) → `lib/fleet-dash.mjs` adapter: board counts, factories (per-factory rows or metrics aggregate), alerts, running, actionable JOBS w/ tags, recent, per-job events log; ship/requeue/reject via `/fleet`. Local folder-queue mode byte-for-byte unchanged when the flag is off.

**Feature flags**

| Flag | Where | Effect |
| ---- | ----- | ------ |
| `FLEET_PREEMPTION` | platform-service | enable critical-job preemption + seat limits |
| `FLEET_BUDGETS` | platform-service | enable budget enforcement + auto-pause |
| `AQ_FLEET` | factory runner | runner becomes a coordinator factory (claim/report) |
| `AQ_FLEET_ROUTE` / `AQ_FLEET_SHADOW` | factory runner | route via service / side-effect-free shadow compare |
| `AQ_FLEET_GATE` | factory runner | point-read `GET /fleet/queue-state` and skip the claim while the queue `version` is unchanged (Phase-4 M0) |
| `AQ_FLEET_DASH` | TUI | dashboard sources board from `/fleet` API |
| `AQ_FLEET_API` / `AQ_FLEET_TOKEN` / `AQ_PRODUCT_ID` | both | base URL / bearer / `x-product-id` |

All flags default **off** → the system is byte-for-byte the prior single-host tool.

---

## 11. Code map (where everything lives)

**`learning_ai_common_plat` (the durable spine):**

```
services/platform-service/src/modules/fleet/
  types.ts            Zod schemas + canonical model (stages, lease, budget, DAG, events)
  repository.ts       per-container CRUD + revUpdate CAS, appendEvent, listChildrenByParent
  coordinator.ts      submit/claim/lease/fence/reaper, preemption, budgets, DAG, review, metrics
  scheduler.ts        pure scoring router + selectPreemptionVictim + scoreCandidate (explain)
  enrollment.ts       factory enroll / rotate / revoke / enforceFactoryToken
  tracker-bridge.ts   ingest tracker item → job; one-way status echo
  artifacts.ts        artifact pointer mgmt
  artifacts-blob.ts   blob upload/download/delete (SAS)
  routes.ts           all /fleet REST + SSE
  *.test.ts           coordinator/scheduler/repository/routes/enrollment/tracker/artifacts/types
dashboards/tracker-web/src/
  app/dashboard/fleet/**             the browser control plane (pages above)
  lib/fleet-client.ts                typed client + SSE
  app/api/fleet/[...path]/route.ts   proxy
  e2e/fleet.spec.ts                  Playwright specs
lib/cosmos-init.ts    container registration
docs/GIGAFACTORY/gigafactory-phase3-progress.md / docs/GIGAFACTORY/FLEET_CONTROL_PLANE.md
```

**`learning_ai_devops_tools` (the factory agent + TUI + spec):**

```
agent-queue/
  agent-queue.sh        single-host runner + factory agent (AQ_FLEET); budget.wall, retry, recover
  lib/fleet-client.sh   curl-only coordinator client (register/claim/report/renew, fencing-aware)
  lib/fleet-dash.mjs    TUI fleet-mode adapter over /fleet (+ fleet-dash.test.mjs, 22 assertions)
  dashboard.mjs         the TUI (local + fleet modes)
  profiles/*.md         persona+capability catalog
  demo/two-factory-demo.sh + coordinator-stub.sh   parallel-fleet demo
  selftest.sh           ~75 dependency-light checks
docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md           source-of-truth spec & checklists
docs/GIGAFACTORY/GIGAFACTORY_SYSTEM_OVERVIEW.md   (this file)
```

---

## 12. Test coverage (what's verified)

- **platform-service fleet** (~134+ tests): atomic-claim race (true concurrency, no
  double-assign), fencing rejection, reaper reclaim + checkpoint, scheduler scoring
  / tie-breaks / starvation / preemption-victim, DAG fan-out/unblock/subtree,
  budgets + burndown + auto-pause, review-gate quorum, enrollment/token lifecycle +
  auth enforcement, tracker ingest/echo idempotency, routes (incl. SSE + explain),
  schema validation.
- **tracker-web** (~198 tests): fleet-client unit tests + page render; SSE
  parse/resume/fallback; graceful 404 degradation.
- **tracker-web e2e** (`e2e/fleet.spec.ts`): fleet map, live log, ship, budget-pause,
  review-gate (Playwright — needs CI wiring).
- **agent-queue** (`selftest.sh`, ~75 checks): manifest/profiles/caps/priority/deps/
  idempotency, retry/recover/insights, tracker round-trip, `AQ_FLEET` register/claim/
  fenced-patch/reaper-reclaim/quarantine, shadow AGREE/DIVERGE, two-factory demo,
  **budget.wall enforcement**, **fleet-dash adapter (22 assertions)**.

---

## 13. Next steps

**Immediate (close Phase 1–3 to a clean 100%):**

1. **Validate the Cosmos `_etag`/`If-Match` CAS path under true contention** and
   **live blob-backed `fleet_artifacts`** — the two items the roadmap marks as
   "remaining for a hard 100%" on Phase 2/3 (tests today use the memory provider +
   pointer-only artifacts).
2. **Wire `e2e/fleet.spec.ts` into CI** (Playwright install + a `verify` job) so the
   Phase-3 exit criterion ("web verify incl. e2e green") is enforced, not just
   present.
3. **Live multi-host operator run** end-to-end (the Phase-3 acceptance: drive the
   3-repo parallel workload from the browser, including a budget pause + resume
   against a real platform-service, not the stub).

**Phase 4 (scale-out) — in progress; see [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md):**

- ✅ **M0 (done)** — RU gate: `fleet_queue_state` + `GET /fleet/queue-state` +
  `AQ_FLEET_GATE`; factories skip the claim while the queue version is unchanged.

4. **M1+: broker** (the redesign picks **Azure Service Bus**, not NATS/Redis, for
   subscription filters + DLQ) for push dispatch + backpressure in a
   coordinator-owns-scheduling / broker-owns-delivery hybrid (keeps the scorer).
5. **M3: autoscaling** — scale-to-zero ephemeral factories (KEDA/Container Apps)
   keyed to subscription depth.
6. **Capability marketplace** — route rare-capability jobs (xcode/figma/gpu) to the
   few factories that have them; cross-product queueing fairness.
7. **Load + chaos suite** — factory churn, broker outage, thundering herd.

**Phase 5 (learned routing):**

8. Capture per-run outcome features → offline eval harness (learned vs heuristic) →
   shadow/A-B with guardrails → surface recommendations ("route NomGap UX jobs to
   claude on mac-2: 23% faster").

---

## 14. Bugs, gaps & risks (be honest)

**Documentation status (reconciled 2026-05-31):**

- `GIGAFACTORY_ROADMAP.md` §0 now reads Phase 0 ✅100% · 1 ✅~98% · 2 ✅~98% ·
  3 ✅100% · **4 ◐ in progress (~10%, M0 shipped)** · 5 ☐. Phase-2 boxes for the
  scheduler core and factory enrollment/scoped tokens are ticked (`scheduler.ts`
  `selectJob`/`selectPreemptionVictim` wired into `claimNextJob`; `enrollment.ts`
  `enforceFactoryToken` gating claim/heartbeat). The earlier "stale §0 table"
  warning no longer applies.

**Runtime / correctness gaps:**

- **SSE is poll-fallback based, not a push-only contract.** `subscribeJobEvents`
  falls back to `getJobEvents()` polling on stream error — fine for resilience, but
  "live" can silently degrade to polling without a visible operator signal.
- **UI pages degrade silently on some errors** (empty states / `null`), which can
  mask a real backend outage as "nothing happening."
- **Budget page assumes `ceilingUsd` exists** when rendering the spend bar — a
  budget doc without a ceiling could render a broken/NaN bar. Guard it.
- **Dashboard `patchJob` only sends `{stage, leaseEpoch}`** — other fenced-transition
  fields (e.g. `checkpoint`) aren't exposed in the web UI, so operator-driven
  transitions can't carry a checkpoint.
- **`rev` CAS on the memory provider** is exact only for the sequential calls the
  coordinator/tests make (re-read `rev` before write). Real concurrency safety
  depends on Cosmos `_etag`/`If-Match` in production — verify the Cosmos path under
  true contention before relying on it at scale.

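For the `ceilingUsd` gap in the list above, one possible guard is to compute the bar fraction in a helper that returns `null` when no usable ceiling exists, so the page can render a "no ceiling set" state instead of a NaN-width bar. This is a sketch, not the page's current code: `ceilingUsd` is the field named above, while `spentUsd`, `BudgetDoc` and `spendFraction` are assumed names.

```typescript
// Hypothetical guard for the budget spend bar. Returns a 0..1 fraction, or
// null when there is no usable ceiling. `spentUsd` is an assumed field name.

interface BudgetDoc {
  spentUsd: number;
  ceilingUsd?: number | null;
}

function spendFraction(b: BudgetDoc): number | null {
  const ceiling = b.ceilingUsd;
  // Missing, null, NaN, infinite, zero or negative ceilings cannot scale a bar.
  if (typeof ceiling !== "number" || !isFinite(ceiling) || ceiling <= 0) return null;
  const spent = isFinite(b.spentUsd) ? Math.max(0, b.spentUsd) : 0;
  return Math.min(1, spent / ceiling); // clamp overspend to a full bar
}

console.log(spendFraction({ spentUsd: 25, ceilingUsd: 100 })); // quarter-full bar
console.log(spendFraction({ spentUsd: 5 })); // no ceiling: null
```
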
**TUI-specific (this repo):**

- Fleet **utilization %** only renders in the metrics-aggregate fallback branch, not
  when per-factory rows are present — a minor inconsistency in the TUI board.
- The **budget.wall live selftest is timing-sensitive** (races a 2s wall ceiling) and
  can flake under heavy disk/CPU load; the code is correct but the test could be made
  more robust (e.g. inject the clock).
- TUI fleet mode has **no write path for budgets/preemption** — it's read + job
  actions only; budget pause/resume is web-only.

**Operational gotchas (verified on the live fleet — get these right):**

- **Heartbeat cadence MUST be < the 90s stale threshold.** `fleet_metrics` marks a
  factory stale after `DEFAULT_STALE_FACTORY_MS = 90_000`, but the factory only
  heartbeats every `AQ_FLEET_LEASE_RENEW_SEC` (**default 300s**). Left at the
  default, a healthy factory flaps to "stale"/"no live factory" between beats. The
  fleet launcher sets `AQ_FLEET_LEASE_RENEW_SEC=30` to stay well inside the window.
- **The tracker-web New-Job form is misconfigured:** it hardcodes factories
  `mac-1`/`mac-2` and defaults `capabilities=["build"]` — a token **no agent-queue
  factory advertises** (`detect_capabilities` emits `os:*`/`engine:*`/`node:*`/`has:*`).
  So a default UI submission is unroutable (queues forever → `queue_starvation`).
  Fix tracked in the redesign doc's routing-model section.
- **No factory deregister API.** Only heartbeat/enroll/rotate/revoke exist, so a
  dead factory's doc lingers and shows as `stale` until pruned out-of-band
  (currently a manual Cosmos delete). A prune/deregister path is a Phase-4 item.

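The heartbeat arithmetic in the first gotcha above is easy to check mechanically: a 300s interval against a 90s threshold leaves the factory stale for most of each cycle, while 30s fits three beats inside the window. The helper below is a hypothetical pre-flight check a launcher could run. `DEFAULT_STALE_FACTORY_MS` and `AQ_FLEET_LEASE_RENEW_SEC` are from this document; the `heartbeatStatus` name and the "one missed beat" margin used for the `tight` verdict are assumptions.

```typescript
// Hypothetical launcher pre-flight check for heartbeat cadence.
// The threshold value is from this document; the helper and its
// safety margin are illustrative assumptions.

const DEFAULT_STALE_FACTORY_MS = 90_000;

type Cadence = "ok" | "tight" | "flaps";

/** Classifies AQ_FLEET_LEASE_RENEW_SEC against the stale threshold. */
function heartbeatStatus(renewSec: number, staleMs: number = DEFAULT_STALE_FACTORY_MS): Cadence {
  const intervalMs = renewSec * 1000;
  if (intervalMs >= staleMs) return "flaps"; // stale between every pair of beats
  if (intervalMs * 2 > staleMs) return "tight"; // a single missed beat goes stale
  return "ok";
}

console.log(heartbeatStatus(300)); // the default interval
console.log(heartbeatStatus(30)); // the fleet launcher's setting
```
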
**Not-yet-built (expected, Phase 4+):**

- **No message bus yet** — dispatch is still poll-based, but the **M0 RU gate now
  skips the claim while idle** (so idle Cosmos RU is near-flat). Broker push/
  backpressure is M1+.
- **No autoscaling** — factory fleet is static/manually run (M3 target).
- **No capability marketplace / cross-product fairness** under contention.
- **No load/chaos test suite** — resilience is unit-proven, not load-proven.
- **Artifacts blob wiring** (`fleet_artifacts` → real blob storage) should be
  validated against a live storage account (tests use memory/pointer only).

**Recently fixed (2026-05-31):**

- **`run --once` could return before a backgrounded worker finished the PR/report.**
  `_meta_end` (which writes `ended=`) was called right after the `testing/` move,
  *before* PR open/merge + coordinator reports, so the slot freed early and `--once`
  could exit (and a caller could observe completion) mid-PR. Now `ended=` is written
  last; the selftest PR-mode case is deterministic again.

---

## 15. TL;DR

Phases 0–3 are functionally **complete and well-tested**: a durable coordinator with
exactly-once leasing + fencing + crash recovery, a deterministic scoring router with
preemption + explainability, per-product budgets, DAG decomposition, a multi-reviewer
gate, factory enrollment with scoped tokens, and **two** control planes (browser +
TUI) over one `/fleet` API. The remaining work is (a) trivial doc corrections, (b)
CI-enforcing the existing e2e, and (c) the genuinely new Phase-4 scale-out frontier
(broker, autoscaling, marketplace, chaos) and Phase-5 learned routing.