bytelyst-devops-tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_SYSTEM_OVERVIEW.md
saravanakumardb1 2993994273 docs(gigafactory): reconcile overview + roadmap to current reality
- System overview: mark Phase 4 in-progress (M0 RU gate shipped), add
  fleet_queue_state container + GET /fleet/queue-state, document the heartbeat
  cadence vs 90s stale gotcha, the tracker-web caps=build form bug, the missing
  deregister API, and the ended=-race fix; drop the now-false "roadmap §0 stale"
  and "boxes 384/386 unticked" claims (both reconciled); link the redesign doc.
- Roadmap: §0 Phase 4 -> in progress (M0); align the Phase-2 §8 spec endpoint
  sketches to the as-built API (/fleet/factories/enroll, /factories/heartbeat,
  /fleet/claim) + note the heartbeat cadence and the M0 gate.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 00:02:45 -07:00

# Agent Gigafactory — System Overview (current picture)
> Companion to `GIGAFACTORY_ROADMAP.md` (the source-of-truth spec & checklists).
> This document describes **what is actually built today**, how the pieces fit
> together, the architecture diagrams, the code map across both repos, the next
> steps, and the known bugs/gaps. Last reviewed: **2026-05-31**.
>
> The **Phase-4 plan + the as-built M0 RU gate** live in
> [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md) — read it for the
> broker-backed dispatch design and the migration checklist.
---
## 1. What it is (in one paragraph)
The **Agent Gigafactory** turns a single-host "folder queue" agent runner into a
**distributed fleet** of agent "factories" (machines: mac/ubuntu/windows) that
claim and execute coding jobs in parallel, coordinated by a durable,
product-agnostic service. A job is a markdown manifest (persona + capabilities +
budget + deps); the **coordinator** assigns each job to the best-fit factory via a
deterministic scoring router, guarantees **exactly-once assignment** through
optimistic-concurrency claims + **leases with epoch fencing**, recovers crashed
work automatically (reaper + WIP checkpoints), enforces **per-product budgets**,
supports **DAG decomposition** (composite → child jobs), and exposes the whole
fleet through **two control planes**: a browser UI (`tracker-web`) and a terminal
TUI (`agent-queue` dashboard). Both control planes talk to the same `/fleet` REST
API.
---
## 2. Completion snapshot (reality, not the stale table)
| Phase | Theme | Real status | Notes |
| ----- | ----- | ----------- | ----- |
| **0** | Single-host baseline | ✅ 100% | `agent-queue.sh` folder queue, selftest green |
| **1** | Manifest + profiles + capabilities + tracker adapter | ✅ ~98% | The one leftover, Node `dash` field surfacing, is **now also done** via fleet-dash tags; effectively complete |
| **2** | Coordinator module + Cosmos + multi-factory leasing | ✅ ~98% | Scheduler wiring, enrollment+tokens, tracker-bridge are **done in code**; the previously unticked roadmap boxes are now reconciled (see §14) |
| **3** | Fleet control plane (web + TUI) + DAG + budgets + scoring | ✅ 100% (all boxes ticked) | Pending: Playwright e2e wired into CI; live multi-host operator run |
| **4** | Message bus + autoscaling + capability marketplace | 🟡 in progress | **M0 (RU gate) shipped** — see below. Broker (M1+) not started. Plan: [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md) |
| **5** | Self-optimizing / learned routing | ☐ 0% | Not started |
> **Phase-4 M0 (RU gate) is live (2026-05-31):** a per-product `fleet_queue_state`
> doc holds a monotonic `version` (bumped on job create + every stage change);
> factories with `AQ_FLEET_GATE=1` point-read `GET /fleet/queue-state` (~1 RU) and
> skip the expensive claim while nothing changed — cutting idle Cosmos RU without
> raising the local poll interval. Default OFF; the live fleet runs it on.
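
A minimal sketch of the factory-side gate decision, under stated assumptions: only the `version` field and the `AQ_FLEET_GATE` flag come from this document, while the `gateDecision` helper, its parameters, and caching the last seen version in memory are hypothetical. The caller would feed it the result of the cheap `GET /fleet/queue-state` read on each poll tick and only issue the claim when `attemptClaim` is true.

```typescript
// Hypothetical shape of the queue-state doc; only `version` is taken from the overview.
interface QueueState {
  version: number; // monotonic, bumped on job create and on every stage change
}

interface GateDecision {
  attemptClaim: boolean;
  lastSeenVersion: number | null;
}

// Decide whether this poll tick should pay for the expensive claim call.
// `lastSeenVersion` stays null until the factory has observed a version.
function gateDecision(
  gateEnabled: boolean,
  lastSeenVersion: number | null,
  current: QueueState,
): GateDecision {
  if (!gateEnabled) {
    // Gate off (the default): behave as before and always try to claim.
    return { attemptClaim: true, lastSeenVersion };
  }
  const changed = lastSeenVersion === null || current.version !== lastSeenVersion;
  return { attemptClaim: changed, lastSeenVersion: current.version };
}
```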
---
## 3. System architecture
```mermaid
graph TB
subgraph CP["Control planes (operators)"]
WEB["tracker-web Fleet UI<br/>(Next.js, /dashboard/fleet/*)"]
TUI["agent-queue TUI<br/>(dashboard.mjs, AQ_FLEET_DASH=1)"]
end
subgraph SVC["platform-service — fleet module (the spine)"]
ROUTES["routes.ts<br/>/fleet REST + SSE"]
COORD["coordinator.ts<br/>claim · lease · fence · reaper<br/>preemption · budgets · DAG · review"]
SCHED["scheduler.ts<br/>pure scoring router (§7)"]
ENROLL["enrollment.ts<br/>factory tokens (scoped, rotatable)"]
BRIDGE["tracker-bridge.ts<br/>job ↔ tracker item"]
ARTIF["artifacts.ts / artifacts-blob.ts<br/>pointer + blob bytes"]
REPO["repository.ts<br/>CAS (rev/_etag) CRUD"]
end
subgraph DATA["@bytelyst/datastore (Cosmos / memory)"]
JOBS[("fleet_jobs")]
RUNS[("fleet_runs")]
LEASES[("fleet_leases")]
FAC[("fleet_factories")]
PROF[("fleet_profiles")]
EVENTS[("fleet_events")]
ARTDOCS[("fleet_artifacts")]
end
subgraph FLEET["Factory agents (workers, N hosts)"]
F1["agent-queue.sh + lib/fleet-client.sh<br/>(AQ_FLEET=1) — mac-1"]
F2["agent-queue.sh + lib/fleet-client.sh<br/>ubuntu-1"]
ENGINES["engines: claude · codex · devin"]
end
WEB -->|/api/fleet proxy| ROUTES
TUI -->|lib/fleet-dash.mjs| ROUTES
ROUTES --> COORD
COORD --> SCHED
ROUTES --> ENROLL
ROUTES --> BRIDGE
ROUTES --> ARTIF
COORD --> REPO
ENROLL --> REPO
BRIDGE --> REPO
ARTIF --> ARTDOCS
REPO --> JOBS & RUNS & LEASES & FAC & PROF & EVENTS
F1 -->|heartbeat · claim · patch fenced · renew| ROUTES
F2 -->|heartbeat · claim · patch fenced · renew| ROUTES
F1 --> ENGINES
F2 --> ENGINES
```
**Layering principle:** `scheduler.ts` is **pure** (no I/O — all inputs passed
in), `coordinator.ts` is the orchestration core, `repository.ts` is the only thing
that touches the datastore, and `routes.ts` is the only thing that touches HTTP.
Factories never touch the DB directly — they only call REST.
---
## 4. Job lifecycle (stages)
```mermaid
stateDiagram-v2
[*] --> queued: submitJob
queued --> blocked: unmet deps
blocked --> queued: deps satisfied (reaper/unblock)
queued --> assigned: claimNextJob (CAS win + lease)
assigned --> building: factory starts (patch fenced)
building --> review: rc=0 → review gate
building --> testing: verify-pass (auto)
review --> testing: approve / requestReview quorum
testing --> shipped: ship (manual gate)
building --> failed: verify-fail / budget_exceeded / timeout
review --> failed: reject
assigned --> queued: lease expired (reaper, +epoch, keep checkpoint)
building --> queued: preempted (critical job, checkpoint + epoch bump)
failed --> queued: requeue (operator)
failed --> dead_letter: retries exhausted
shipped --> [*]
dead_letter --> [*]
```
Stages (`types.ts`): `queued · blocked · assigned · building · review · testing ·
shipped · failed · dead_letter`. The TUI/local board collapse these onto kanban
buckets (`inbox/building/review/testing/shipped/failed`) for parity.
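
The state diagram above can be written down as a transition table. This is an illustrative sketch, not the contents of `types.ts`: the stage names and edges are copied from the diagram, while `TRANSITIONS`, `canTransition`, and `isTerminal` are hypothetical names.

```typescript
type Stage =
  | "queued" | "blocked" | "assigned" | "building" | "review"
  | "testing" | "shipped" | "failed" | "dead_letter";

// Allowed edges, one entry per arrow in the state diagram.
const TRANSITIONS: Record<Stage, readonly Stage[]> = {
  queued: ["blocked", "assigned"],
  blocked: ["queued"],
  assigned: ["building", "queued"], // queued = lease expired, reaper requeue
  building: ["review", "testing", "failed", "queued"], // queued = preempted
  review: ["testing", "failed"],
  testing: ["shipped"],
  failed: ["queued", "dead_letter"],
  shipped: [],
  dead_letter: [],
};

function canTransition(from: Stage, to: Stage): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Terminal stages have no outgoing edges.
function isTerminal(stage: Stage): boolean {
  return TRANSITIONS[stage].length === 0;
}
```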
---
## 5. The core guarantee — atomic claim + lease fencing
This is the heart of "no double-assignment, ever" and "a dead worker can never
corrupt a reassigned job."
```mermaid
sequenceDiagram
participant FA as Factory A
participant FB as Factory B
participant CO as coordinator
participant DB as fleet_jobs / fleet_leases
FA->>CO: POST /fleet/claim (caps)
FB->>CO: POST /fleet/claim (caps)
CO->>DB: selectJob() → job J (rev=5)
CO->>DB: revUpdate J: queued→assigned IF rev==5 (CAS)
DB-->>CO: A wins (rev→6, leaseEpoch=1)
CO->>DB: revUpdate J IF rev==5 (B's CAS)
DB-->>CO: conflict (B re-selects)
CO-->>FA: assigned J (leaseEpoch=1)
CO-->>FB: conflict → next job
Note over FA: A crashes mid-build
CO->>DB: reapExpiredLeases(): lease expired → J back to queued,<br/>leaseEpoch=2, checkpoint preserved
FB->>CO: claim → J (leaseEpoch=2)
FA-->>CO: (zombie) PATCH J stage=shipped leaseEpoch=1
CO-->>FA: 409 fenced (1 < 2) — rejected
```
- **CAS:** `repository.revUpdateJob/revUpdateLease` write only if stored `rev`
matches (Cosmos `_etag`/`If-Match`; memory provider re-reads `rev`).
- **Fencing:** every worker mutation carries `leaseEpoch`; epoch `< job.leaseEpoch`
→ `fenced` (409).
- **Reaper:** `reapExpiredLeases(now)` requeues expired-lease jobs, **bumps the
epoch**, and **keeps the `checkpoint`** (WIP git branch pointer) so work resumes
rather than restarts. Cosmos TTL cannot do this — the reaper owns recovery.
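
The three mechanisms above fit in a small in-memory model. This is a sketch in the spirit of the memory provider, not the code in `coordinator.ts` or `repository.ts`: `MemoryJobStore`, `tryClaim`, and `fencedPatch` are hypothetical, field types are assumed, and it assumes a job is created with `leaseEpoch = 1` and that only the reaper bumps the epoch. Replaying the sequence diagram against it gives: A claims at rev 5, B's claim on the same rev loses, the reaper requeues with epoch 2, and A's late patch with epoch 1 gets 409.

```typescript
type Stage = "queued" | "assigned" | "building" | "shipped";

interface Job {
  id: string;
  stage: Stage;
  rev: number; // CAS token, bumped on every successful write
  leaseEpoch: number; // fencing token, bumped by the reaper
  holder: string | null;
  leaseExpiresAt: number | null; // epoch millis
  checkpoint: string | null; // WIP branch pointer, survives requeue
}

class MemoryJobStore {
  private jobs: Record<string, Job> = {};

  put(job: Job): void {
    this.jobs[job.id] = { ...job };
  }

  get(id: string): Job | null {
    const job = this.jobs[id];
    return job ? { ...job } : null;
  }

  ids(): string[] {
    return Object.keys(this.jobs);
  }

  // Compare-and-swap: write only if the stored rev still matches.
  revUpdate(id: string, expectedRev: number, patch: Partial<Job>): Job | null {
    const current = this.jobs[id];
    if (!current || current.rev !== expectedRev) return null;
    const next: Job = { ...current, ...patch, id, rev: current.rev + 1 };
    this.jobs[id] = next;
    return { ...next };
  }
}

// The stage check is a fast path; the rev comparison makes the write atomic.
function tryClaim(
  store: MemoryJobStore, id: string, seenRev: number,
  factoryId: string, now: number, leaseMs: number,
): Job | null {
  const job = store.get(id);
  if (!job || job.stage !== "queued") return null;
  return store.revUpdate(id, seenRev, {
    stage: "assigned", holder: factoryId, leaseExpiresAt: now + leaseMs,
  });
}

// Worker mutations carry their leaseEpoch; a stale epoch is fenced with 409.
function fencedPatch(
  store: MemoryJobStore, id: string, workerEpoch: number, stage: Stage,
): 200 | 404 | 409 {
  const job = store.get(id);
  if (!job) return 404;
  if (workerEpoch < job.leaseEpoch) return 409;
  return store.revUpdate(id, job.rev, { stage }) ? 200 : 409;
}

// Requeue expired leases, bump the epoch, leave the checkpoint untouched.
function reapExpiredLeases(store: MemoryJobStore, now: number): string[] {
  const requeued: string[] = [];
  for (const id of store.ids()) {
    const job = store.get(id);
    if (!job || job.leaseExpiresAt === null || job.leaseExpiresAt > now) continue;
    if (job.stage !== "assigned" && job.stage !== "building") continue;
    const updated = store.revUpdate(id, job.rev, {
      stage: "queued", holder: null, leaseExpiresAt: null,
      leaseEpoch: job.leaseEpoch + 1,
    });
    if (updated) requeued.push(id);
  }
  return requeued;
}
```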
---
## 6. Data model (Cosmos containers)
| Container | PK | Purpose |
| --------- | -- | ------- |
| `fleet_jobs` | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `depsMode`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, `kind`, `parentId` |
| `fleet_runs` | `/jobId` | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff) |
| `fleet_leases` | `/jobId` | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status` |
| `fleet_factories` | `/productId` | worker host: `capabilities[]`, `health`, `load`, `seatLimit`, `lastHeartbeatAt` |
| `fleet_profiles` | `/productId` | immutable, versioned persona/capability profile snapshot |
| `fleet_events` | `/jobId` | append-only audit stream (monotonic `seq`) — powers SSE |
| `fleet_artifacts` | `/jobId` | **pointers** to blob-stored artifacts (no inline logs) |
| `fleet_queue_state` | `/productId` | **Phase-4 M0 RU gate**: monotonic `version` bumped on job create + every stage change; read via `GET /fleet/queue-state` so a factory can cheaply detect "work changed" |
Every document carries `productId`. Containers registered in `lib/cosmos-init.ts`.
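
As a small illustration of the partitioning in the table, the sketch below resolves the partition-key value for a document. The key paths are copied from the table; `PARTITION_KEYS`, `FleetDoc`, and `partitionValue` are hypothetical names, and the real registration in `lib/cosmos-init.ts` may be organised differently.

```typescript
// Partition-key paths copied from the data-model table.
const PARTITION_KEYS = {
  fleet_jobs: "/productId",
  fleet_runs: "/jobId",
  fleet_leases: "/jobId",
  fleet_factories: "/productId",
  fleet_profiles: "/productId",
  fleet_events: "/jobId",
  fleet_artifacts: "/jobId",
  fleet_queue_state: "/productId",
} as const;

type Container = keyof typeof PARTITION_KEYS;

// Minimal envelope: productId is always present, jobId only on job-scoped docs.
interface FleetDoc {
  productId: string;
  jobId?: string;
}

function partitionValue(container: Container, doc: FleetDoc): string {
  if (PARTITION_KEYS[container] === "/productId") return doc.productId;
  if (doc.jobId === undefined) {
    throw new Error(container + " documents are partitioned by jobId");
  }
  return doc.jobId;
}
```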
---
## 7. The scheduler / scoring router (`scheduler.ts`)
Pure, deterministic, fixed-weight (tunable per-product in Phase 3, learned in
Phase 5). Filter → score → rank:
```
score = w1·capabilityFit + w2·affinity + w3·(1/(1+load))
      + w4·costFit(budget) + w5·health - w6·starvationPenalty(age)
```
Default weights (`DEFAULT_WEIGHTS`): `capabilityFit 1.0 · affinity 0.5 · load 1.0
· costFit 0.75 · health 1.0 · starvation 1.5`. Capability is a **hard filter**
(subset check); `down` factories are filtered out, not scored; aging fully
de-penalises after ~30 min (anti-starvation). `scoreCandidate` returns a per-term
breakdown that powers the **explainability** panel (`GET /fleet/jobs/:id/explain`
→ `ExplainPanel`). `selectPreemptionVictim` picks the lowest-priority running job a
critical job may evict (under `FLEET_PREEMPTION`).
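
A hedged sketch of the filter → score → rank pipeline follows. It is not the code in `scheduler.ts`: the default weights are the ones listed above, but the `Pairing` shape, the assumption that the fit/affinity/cost/health terms arrive pre-normalised to [0, 1], the linear 30-minute decay of the starvation penalty, the minus sign on that term, and the tie-break on ids are all illustrative assumptions.

```typescript
interface Weights {
  capabilityFit: number; affinity: number; load: number;
  costFit: number; health: number; starvation: number;
}

// Default weights as listed in the overview.
const DEFAULT_WEIGHTS: Weights = {
  capabilityFit: 1.0, affinity: 0.5, load: 1.0,
  costFit: 0.75, health: 1.0, starvation: 1.5,
};

// Assumed input: one (job, factory) pairing with pre-normalised terms.
interface Pairing {
  jobId: string;
  factoryId: string;
  requiredCapabilities: string[];
  factoryCapabilities: string[];
  factoryDown: boolean;
  factoryLoad: number;
  capabilityFit: number;
  affinity: number;
  costFit: number;
  health: number;
  jobAgeMs: number;
}

interface ScoreBreakdown {
  jobId: string;
  factoryId: string;
  terms: Record<keyof Weights, number>;
  total: number;
}

const STARVATION_WINDOW_MS = 30 * 60 * 1000;

// Assumed shape: the penalty decays linearly from 1 to 0 over ~30 minutes.
function starvationPenalty(ageMs: number): number {
  return Math.max(0, 1 - ageMs / STARVATION_WINDOW_MS);
}

// Pure: everything it needs is passed in, and it returns a per-term breakdown.
function scoreCandidate(p: Pairing, w: Weights = DEFAULT_WEIGHTS): ScoreBreakdown {
  const terms: Record<keyof Weights, number> = {
    capabilityFit: w.capabilityFit * p.capabilityFit,
    affinity: w.affinity * p.affinity,
    load: w.load * (1 / (1 + p.factoryLoad)),
    costFit: w.costFit * p.costFit,
    health: w.health * p.health,
    starvation: -w.starvation * starvationPenalty(p.jobAgeMs),
  };
  const total = terms.capabilityFit + terms.affinity + terms.load
    + terms.costFit + terms.health + terms.starvation;
  return { jobId: p.jobId, factoryId: p.factoryId, terms, total };
}

// Hard filter (capability subset, factory not down), then score, then rank.
function rankPairings(pairings: Pairing[]): ScoreBreakdown[] {
  return pairings
    .filter((p) => !p.factoryDown && p.requiredCapabilities.every(
      (cap) => p.factoryCapabilities.indexOf(cap) !== -1))
    .map((p) => scoreCandidate(p))
    .sort((a, b) => b.total - a.total
      || (a.jobId + "/" + a.factoryId < b.jobId + "/" + b.factoryId ? -1 : 1));
}
```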
---
## 8. Subsystems at a glance
| Subsystem | File(s) | What it does | Flag |
| --------- | ------- | ------------ | ---- |
| Claim / lease / fence / reaper | `coordinator.ts` | exactly-once assignment, recovery | — |
| Scoring router + preemption | `scheduler.ts`, `coordinator.ts` | best-fit assignment, evict low-pri for critical | `FLEET_PREEMPTION` |
| Per-product budgets | `coordinator.ts` (`accrueSpend`, `pause/resume`) | ceiling + auto-pause kill-switch; burndown | `FLEET_BUDGETS` |
| DAG decomposition | `coordinator.ts` (`submitChildren`, `getDagSubtree`, `maybeUnblockParent`) | composite job fans out to children; deps gate parent | — |
| Review gate | `coordinator.ts` (`requestReview`, `submitReview`) | multi-reviewer quorum before ship | — |
| Factory enrollment | `enrollment.ts` | scoped, rotatable, hashed tokens; auth on claim/heartbeat | — |
| Tracker bridge | `tracker-bridge.ts` | idempotent ingest of tracker item → job; one-way status echo | — |
| Artifacts | `artifacts.ts`, `artifacts-blob.ts` | pointer docs in Cosmos, bytes in blob (SAS) | — |
| Live events | `routes.ts` SSE + `fleet_events` | `GET /fleet/jobs/:id/events/stream` | — |
| Metrics / alerts | `coordinator.ts` (`fleetMetrics`) | utilization, health rollup, starvation alerts | — |
---
## 9. REST API surface (`/fleet`, under `/api`, auth + `x-product-id`)
```
Jobs POST /fleet/jobs · GET /fleet/jobs · GET /fleet/jobs/:id
PATCH /fleet/jobs/:id (fenced) · POST /fleet/jobs/:id/actions/:action
Claim POST /fleet/claim
Lease POST /fleet/jobs/:id/lease/renew · /lease/release
Factories POST /fleet/factories/heartbeat · /enroll
POST /fleet/factories/:id/token/rotate · /token/revoke
Runs/Events GET /fleet/jobs/:id/runs · /events · /events/stream (SSE) · /explain
Review POST /fleet/jobs/:id/review/request · /review
Budgets GET /fleet/budgets/:productId · /burndown
PUT /fleet/budgets/:productId · POST /pause · /resume
DAG POST /fleet/jobs/:id/children · GET /fleet/jobs/:id/dag
Artifacts POST /fleet/jobs/:id/artifacts · GET (list) · GET/DELETE /fleet/artifacts/:id
Tracker POST /fleet/tracker/ingest · /fleet/tracker/echo
Metrics GET /fleet/metrics · GET /fleet/queue-state (Phase-4 M0 RU gate)
```
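
To make the request shape concrete, here is a sketch that only builds request descriptors and performs no I/O. The bearer token and `x-product-id` header follow the flag table in §10, and the `{stage, leaseEpoch}` patch body matches what the dashboard's `patchJob` sends. The `FleetConfig` and `HttpRequest` types, the helper names, and the assumption that `AQ_FLEET_API` is the service origin (so paths are prefixed with `/api/fleet`) are hypothetical.

```typescript
interface FleetConfig {
  apiBase: string; // AQ_FLEET_API (assumed to be the service origin)
  token: string; // AQ_FLEET_TOKEN
  productId: string; // AQ_PRODUCT_ID
}

interface HttpRequest {
  method: "GET" | "POST" | "PATCH";
  url: string;
  headers: Record<string, string>;
  body: string | null;
}

// Build a request descriptor for a /fleet route; the caller decides how to send it.
function fleetRequest(
  cfg: FleetConfig,
  method: HttpRequest["method"],
  path: string,
  payload: unknown = null,
): HttpRequest {
  const base = cfg.apiBase.replace(/\/+$/, "");
  const headers: Record<string, string> = {
    authorization: "Bearer " + cfg.token,
    "x-product-id": cfg.productId,
  };
  if (payload !== null) headers["content-type"] = "application/json";
  return {
    method,
    url: base + "/api/fleet" + path,
    headers,
    body: payload === null ? null : JSON.stringify(payload),
  };
}

// Fenced stage change: the worker always sends the leaseEpoch it was assigned.
function fencedPatchRequest(
  cfg: FleetConfig, jobId: string, stage: string, leaseEpoch: number,
): HttpRequest {
  return fleetRequest(cfg, "PATCH", "/jobs/" + encodeURIComponent(jobId), { stage, leaseEpoch });
}

// Cheap point read used by the M0 RU gate.
function queueStateRequest(cfg: FleetConfig): HttpRequest {
  return fleetRequest(cfg, "GET", "/queue-state");
}
```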
---
## 10. The two control planes & feature flags
**Browser (`tracker-web`)**, under `dashboards/tracker-web/src/`:
- `app/dashboard/fleet/page.tsx` — fleet map (factory cards, health/load/caps, metrics + alerts)
- `app/dashboard/fleet/jobs/page.tsx` — stage-filtered job table
- `app/dashboard/fleet/jobs/[id]/page.tsx` — job detail: SSE event timeline, runs, artifacts, **DAG view**, **ExplainPanel**, **ReviewGateCard**, ship/requeue/reject
- `app/dashboard/fleet/budget/page.tsx` — burndown chart + pause/resume kill-switch
- `lib/fleet-client.ts` — typed client; `subscribeJobEvents` (fetch-based SSE w/ auth + `Last-Event-ID` resume + poll fallback); graceful 404 → null
- `app/api/fleet/[...path]/route.ts` — proxy to platform-service
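
For the fetch-based SSE path with `Last-Event-ID` resume, the core is a frame parser plus a record of the last id seen. The sketch below is an assumption about how such a client could be structured, not the code in `lib/fleet-client.ts`: `parseSse`, `lastEventId`, and `resumeHeaders` are hypothetical names, and it assumes each chunk is already decoded text with complete line endings.

```typescript
interface SseEvent {
  id: string | null;
  event: string;
  data: string;
}

// Split buffered text into complete SSE frames; return any trailing partial frame.
function parseSse(buffer: string): { events: SseEvent[]; rest: string } {
  const blocks = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = blocks.pop() ?? ""; // the last piece has no terminating blank line yet
  const events: SseEvent[] = [];
  for (const block of blocks) {
    let id: string | null = null;
    let event = "message"; // default event type per the SSE format
    const data: string[] = [];
    for (const line of block.split("\n")) {
      if (line === "" || line.charAt(0) === ":") continue; // blank or comment line
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.charAt(0) === " ") value = value.slice(1);
      if (field === "id") id = value;
      else if (field === "event") event = value;
      else if (field === "data") data.push(value);
    }
    if (data.length > 0) events.push({ id, event, data: data.join("\n") });
  }
  return { events, rest };
}

// Track the newest id so a reconnect can resume where the stream left off.
function lastEventId(events: SseEvent[], previous: string | null): string | null {
  let last = previous;
  for (const e of events) {
    if (e.id !== null) last = e.id;
  }
  return last;
}

// Headers to add on reconnect; empty on the first connection.
function resumeHeaders(last: string | null): Record<string, string> {
  const headers: Record<string, string> = {};
  if (last !== null) headers["Last-Event-ID"] = last;
  return headers;
}
```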
**Terminal (`agent-queue`)**, under `learning_ai_devops_tools/agent-queue/`:
- `dashboard.mjs` (`AQ_FLEET_DASH=1`) → `lib/fleet-dash.mjs` adapter: board counts, factories (per-factory rows or metrics aggregate), alerts, running, actionable JOBS w/ tags, recent, per-job events log; ship/requeue/reject via `/fleet`. Local folder-queue mode byte-for-byte unchanged when the flag is off.
**Feature flags**
| Flag | Where | Effect |
| ---- | ----- | ------ |
| `FLEET_PREEMPTION` | platform-service | enable critical-job preemption + seat limits |
| `FLEET_BUDGETS` | platform-service | enable budget enforcement + auto-pause |
| `AQ_FLEET` | factory runner | runner becomes a coordinator factory (claim/report) |
| `AQ_FLEET_ROUTE` / `AQ_FLEET_SHADOW` | factory runner | route via service / side-effect-free shadow compare |
| `AQ_FLEET_DASH` | TUI | dashboard sources board from `/fleet` API |
| `AQ_FLEET_API` / `AQ_FLEET_TOKEN` / `AQ_PRODUCT_ID` | both | base URL / bearer / `x-product-id` |
All flags default **off** → the system is byte-for-byte the prior single-host tool.
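
A sketch of the default-off rule for the runner and TUI flags, including `AQ_FLEET_GATE` from §2. The real runner is a shell script, so this is only a model of the behaviour: `readFleetFlags` and `FleetFlags` are hypothetical, and treating `"1"` as the only "on" value is an assumption based on how the flags are written in this document (`AQ_FLEET=1`, `AQ_FLEET_DASH=1`).

```typescript
interface FleetFlags {
  fleet: boolean; // AQ_FLEET
  route: boolean; // AQ_FLEET_ROUTE
  shadow: boolean; // AQ_FLEET_SHADOW
  dash: boolean; // AQ_FLEET_DASH
  gate: boolean; // AQ_FLEET_GATE
  apiBase: string | null; // AQ_FLEET_API
  token: string | null; // AQ_FLEET_TOKEN
  productId: string | null; // AQ_PRODUCT_ID
}

// Assumption: only the literal "1" switches a flag on; unset or anything else is off.
function isOn(value: string | undefined): boolean {
  return value === "1";
}

function readFleetFlags(env: Record<string, string | undefined>): FleetFlags {
  return {
    fleet: isOn(env["AQ_FLEET"]),
    route: isOn(env["AQ_FLEET_ROUTE"]),
    shadow: isOn(env["AQ_FLEET_SHADOW"]),
    dash: isOn(env["AQ_FLEET_DASH"]),
    gate: isOn(env["AQ_FLEET_GATE"]),
    apiBase: env["AQ_FLEET_API"] ?? null,
    token: env["AQ_FLEET_TOKEN"] ?? null,
    productId: env["AQ_PRODUCT_ID"] ?? null,
  };
}
```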
---
## 11. Code map (where everything lives)
**`learning_ai_common_plat` (the durable spine):**
```
services/platform-service/src/modules/fleet/
types.ts Zod schemas + canonical model (stages, lease, budget, DAG, events)
repository.ts per-container CRUD + revUpdate CAS, appendEvent, listChildrenByParent
coordinator.ts submit/claim/lease/fence/reaper, preemption, budgets, DAG, review, metrics
scheduler.ts pure scoring router + selectPreemptionVictim + scoreCandidate (explain)
enrollment.ts factory enroll / rotate / revoke / enforceFactoryToken
tracker-bridge.ts ingest tracker item → job; one-way status echo
artifacts.ts artifact pointer mgmt
artifacts-blob.ts blob upload/download/delete (SAS)
routes.ts all /fleet REST + SSE
*.test.ts coordinator/scheduler/repository/routes/enrollment/tracker/artifacts/types
dashboards/tracker-web/src/
app/dashboard/fleet/** the browser control plane (pages above)
lib/fleet-client.ts typed client + SSE
app/api/fleet/[...path]/route.ts proxy
e2e/fleet.spec.ts Playwright specs
lib/cosmos-init.ts container registration
docs/GIGAFACTORY/gigafactory-phase3-progress.md / docs/GIGAFACTORY/FLEET_CONTROL_PLANE.md
```
**`learning_ai_devops_tools` (the factory agent + TUI + spec):**
```
agent-queue/
agent-queue.sh single-host runner + factory agent (AQ_FLEET); budget.wall, retry, recover
lib/fleet-client.sh curl-only coordinator client (register/claim/report/renew, fencing-aware)
lib/fleet-dash.mjs TUI fleet-mode adapter over /fleet (+ fleet-dash.test.mjs, 22 assertions)
dashboard.mjs the TUI (local + fleet modes)
profiles/*.md persona+capability catalog
demo/two-factory-demo.sh + coordinator-stub.sh parallel-fleet demo
selftest.sh ~75 dependency-light checks
docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md source-of-truth spec & checklists
docs/GIGAFACTORY/GIGAFACTORY_SYSTEM_OVERVIEW.md (this file)
```
---
## 12. Test coverage (what's verified)
- **platform-service fleet** (~134+ tests): atomic-claim race (true concurrency, no
double-assign), fencing rejection, reaper reclaim + checkpoint, scheduler scoring
/ tie-breaks / starvation / preemption-victim, DAG fan-out/unblock/subtree,
budgets + burndown + auto-pause, review-gate quorum, enrollment/token lifecycle +
auth enforcement, tracker ingest/echo idempotency, routes (incl. SSE + explain),
schema validation.
- **tracker-web** (~198 tests): fleet-client unit tests + page render; SSE
parse/resume/fallback; graceful 404 degradation.
- **tracker-web e2e** (`e2e/fleet.spec.ts`): fleet map, live log, ship, budget-pause,
review-gate (Playwright — needs CI wiring).
- **agent-queue** (`selftest.sh`, ~75 checks): manifest/profiles/caps/priority/deps/
idempotency, retry/recover/insights, tracker round-trip, `AQ_FLEET` register/claim/
fenced-patch/reaper-reclaim/quarantine, shadow AGREE/DIVERGE, two-factory demo,
**budget.wall enforcement**, **fleet-dash adapter (22 assertions)**.
---
## 13. Next steps
**Immediate (close Phases 1–3 to a clean 100%):**
1. **Validate the Cosmos `_etag`/`If-Match` CAS path under true contention** and
**live blob-backed `fleet_artifacts`** — the two items the roadmap marks as
"remaining for a hard 100%" on Phase 2/3 (tests today use the memory provider +
pointer-only artifacts).
2. **Wire `e2e/fleet.spec.ts` into CI** (Playwright install + a `verify` job) so the
Phase-3 exit criterion ("web verify incl. e2e green") is enforced, not just
present.
3. **Live multi-host operator run** end-to-end (the Phase-3 acceptance: drive the
3-repo parallel workload from the browser, including a budget pause + resume
against a real platform-service, not the stub).
**Phase 4 (scale-out) — in progress; see [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md):**
- **M0 (done):** RU gate: `fleet_queue_state` + `GET /fleet/queue-state` +
`AQ_FLEET_GATE`; factories skip the claim while the queue version is unchanged.
4. **M1+: broker** (the redesign picks **Azure Service Bus**, not NATS/Redis, for
subscription filters + DLQ) for push dispatch + backpressure in a
coordinator-owns-scheduling / broker-owns-delivery hybrid (keeps the scorer).
5. **M3: autoscaling** — scale-to-zero ephemeral factories (KEDA/Container Apps)
keyed to subscription depth.
6. **Capability marketplace** — route rare-capability jobs (xcode/figma/gpu) to the
few factories that have them; cross-product queueing fairness.
7. **Load + chaos suite** — factory churn, broker outage, thundering herd.
**Phase 5 (learned routing):**
8. Capture per-run outcome features → offline eval harness (learned vs heuristic) →
shadow/A-B with guardrails → surface recommendations ("route NomGap UX jobs to
claude on mac-2: 23% faster").
---
## 14. Bugs, gaps & risks (be honest)
**Documentation status (reconciled 2026-05-31):**
- `GIGAFACTORY_ROADMAP.md` §0 now reads Phase 0 ✅100% · 1 ✅~98% · 2 ✅~98% ·
3 ✅100% · **4 ◐ in progress (~10%, M0 shipped)** · 5 ☐. Phase-2 boxes for the
scheduler core and factory enrollment/scoped tokens are ticked (`scheduler.ts`
`selectJob`/`selectPreemptionVictim` wired into `claimNextJob`; `enrollment.ts`
`enforceFactoryToken` gating claim/heartbeat). The earlier "stale §0 table"
warning no longer applies.
**Runtime / correctness gaps:**
- **SSE is poll-fallback based, not a push-only contract.** `subscribeJobEvents`
falls back to `getJobEvents()` polling on stream error — fine for resilience, but
"live" can silently degrade to polling without a visible operator signal.
- **UI pages degrade silently on some errors** (empty states / `null`), which can
mask a real backend outage as "nothing happening."
- **Budget page assumes `ceilingUsd` exists** when rendering the spend bar — a
budget doc without a ceiling could render a broken/NaN bar. Guard it.
- **Dashboard `patchJob` only sends `{stage, leaseEpoch}`** — other fenced-transition
fields (e.g. `checkpoint`) aren't exposed in the web UI, so operator-driven
transitions can't carry a checkpoint.
- **`rev` CAS on the memory provider** is exact only for the sequential calls the
coordinator/tests make (re-read `rev` before write). Real concurrency safety
depends on Cosmos `_etag`/`If-Match` in production — verify the Cosmos path under
true contention before relying on it at scale.
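
The budget-bar guard suggested above could look roughly like this. `ceilingUsd` is the field named in this document; `spentUsd` and the `spendFraction` helper are hypothetical, and returning `null` (so the page can render "no ceiling set" instead of a bar) is one possible design choice among several.

```typescript
// Returns the filled fraction of the spend bar in [0, 1],
// or null when no usable ceiling exists and no bar should be drawn.
function spendFraction(
  spentUsd: number,
  ceilingUsd: number | null | undefined,
): number | null {
  if (ceilingUsd === null || ceilingUsd === undefined) return null;
  if (!isFinite(ceilingUsd) || ceilingUsd <= 0 || !isFinite(spentUsd)) return null;
  return Math.min(1, Math.max(0, spentUsd / ceilingUsd));
}
```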
**TUI-specific (this repo):**
- Fleet **utilization %** only renders in the metrics-aggregate fallback branch, not
when per-factory rows are present — a minor inconsistency in the TUI board.
- The **budget.wall live selftest is timing-sensitive** (races a 2s wall ceiling) and
can flake under heavy disk/CPU load; the code is correct but the test could be made
more robust (e.g. inject the clock).
- TUI fleet mode has **no write path for budgets/preemption** — it's read + job
actions only; budget pause/resume is web-only.
**Operational gotchas (verified on the live fleet — get these right):**
- **Heartbeat cadence MUST be < the 90s stale threshold.** `fleetMetrics` marks a
factory stale after `DEFAULT_STALE_FACTORY_MS = 90_000`, but the factory only
heartbeats every `AQ_FLEET_LEASE_RENEW_SEC` (**default 300s**). Left at the
default, a healthy factory flaps to "stale"/"no live factory" between beats. The
fleet launcher sets `AQ_FLEET_LEASE_RENEW_SEC=30` to stay well inside the window.
- **The tracker-web New-Job form is misconfigured:** it hardcodes factories
`mac-1`/`mac-2` and defaults to `capabilities=["build"]`, a token **no agent-queue
factory advertises** (`detect_capabilities` emits `os:*`/`engine:*`/`node:*`/`has:*`).
So a default UI submission is unroutable (queues forever → `queue_starvation`).
Fix tracked in the redesign doc's routing-model section.
- **No factory deregister API.** Only heartbeat/enroll/rotate/revoke exist, so a
dead factory's doc lingers and shows as `stale` until pruned out-of-band
(currently a manual Cosmos delete). A prune/deregister path is a Phase-4 item.
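
The first two gotchas reduce to two small checks, sketched here. `DEFAULT_STALE_FACTORY_MS` and the 300s/30s renew values come from the text above; the helper names, the strict `>` comparison in `isStale`, and the concrete capability strings in the usage note (only the `os:*`/`engine:*` prefixes are documented) are assumptions.

For example, `heartbeatKeepsFactoryLive(300)` is false while `heartbeatKeepsFactoryLive(30)` is true, and `isRoutable(["build"], ["os:darwin", "engine:claude"])` is false because no advertised token equals `build`.

```typescript
const DEFAULT_STALE_FACTORY_MS = 90_000; // stale threshold quoted above

// The renew/heartbeat interval must be strictly inside the stale window.
function heartbeatKeepsFactoryLive(
  renewSec: number,
  staleMs: number = DEFAULT_STALE_FACTORY_MS,
): boolean {
  return renewSec * 1000 < staleMs;
}

// A factory looks stale once its last heartbeat is older than the threshold.
function isStale(
  lastHeartbeatAtMs: number,
  nowMs: number,
  staleMs: number = DEFAULT_STALE_FACTORY_MS,
): boolean {
  return nowMs - lastHeartbeatAtMs > staleMs;
}

// Capability is a hard subset filter: every required token must be advertised.
function isRoutable(required: string[], advertised: string[]): boolean {
  return required.every((cap) => advertised.indexOf(cap) !== -1);
}
```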
**Not-yet-built (expected, Phase 4+):**
- **No message bus yet:** dispatch is still poll-based, but the **M0 RU gate now
skips the claim while idle** (so idle Cosmos RU is near-flat). Broker push/
backpressure is M1+.
- **No autoscaling:** the factory fleet is static/manually run (M3 target).
- **No capability marketplace / cross-product fairness** under contention.
- **No load/chaos test suite:** resilience is unit-proven, not load-proven.
- **Artifacts blob wiring** (`fleet_artifacts` real blob storage) should be
validated against a live storage account (tests use memory/pointer only).
**Recently fixed (2026-05-31):**
- **`run --once` could return before a backgrounded worker finished the PR/report.**
`_meta_end` (which writes `ended=`) was called right after the `testing/` move,
*before* PR open/merge + coordinator reports, so the slot freed early and `--once`
could exit (and a caller could observe completion) mid-PR. Now `ended=` is written
last; the selftest PR-mode case is deterministic again.
---
## 15. TL;DR
Phases 0–3 are functionally **complete and well-tested**: a durable coordinator with
exactly-once leasing + fencing + crash recovery, a deterministic scoring router with
preemption + explainability, per-product budgets, DAG decomposition, a multi-reviewer
gate, factory enrollment with scoped tokens, and **two** control planes (browser +
TUI) over one `/fleet` API. The remaining work is (a) trivial doc corrections, (b)
CI-enforcing the existing e2e, and (c) the genuinely new Phase-4 scale-out frontier
(broker, autoscaling, marketplace, chaos) and Phase-5 learned routing.