Agent Gigafactory — System Overview (current picture)
Companion to GIGAFACTORY_ROADMAP.md (the source-of-truth spec & checklists). This document describes what is actually built today, how the pieces fit together, the architecture diagrams, the code map across both repos, the next steps, and the known bugs/gaps. Last reviewed: 2026-05-31.
The Phase-4 plan and the as-built M0 RU gate live in FLEET_DISPATCH_REDESIGN.md; read it for the broker-backed dispatch design and the migration checklist.
1. What it is (in one paragraph)
The Agent Gigafactory turns a single-host "folder queue" agent runner into a
distributed fleet of agent "factories" (machines: mac/ubuntu/windows) that
claim and execute coding jobs in parallel, coordinated by a durable,
product-agnostic service. A job is a markdown manifest (persona + capabilities +
budget + deps); the coordinator assigns each job to the best-fit factory via a
deterministic scoring router, guarantees exactly-once assignment through
optimistic-concurrency claims + leases with epoch fencing, recovers crashed
work automatically (reaper + WIP checkpoints), enforces per-product budgets,
supports DAG decomposition (composite → child jobs), and exposes the whole
fleet through two control planes: a browser UI (tracker-web) and a terminal
TUI (agent-queue dashboard). Both control planes talk to the same /fleet REST
API.
2. Completion snapshot (reality, as built)
| Phase | Theme | Real status | Notes |
|---|---|---|---|
| 0 | Single-host baseline | ✅ 100% | agent-queue.sh folder queue, selftest green |
| 1 | Manifest + profiles + capabilities + tracker adapter | ✅ ~98% | Only leftover: Node dash field surfacing — now also done via fleet-dash tags. Effectively complete |
| 2 | Coordinator module + Cosmos + multi-factory leasing | ✅ ~98% | Scheduler wiring, enrollment+tokens, tracker-bridge are done in code; the roadmap boxes for the scheduler core and factory enrollment are now ticked (see §14 Bugs, gaps & risks) |
| 3 | Fleet control plane (web + TUI) + DAG + budgets + scoring | ✅ 100% (all boxes ticked) | Pending: Playwright e2e wired into CI; live multi-host operator run |
| 4 | Message bus + autoscaling + capability marketplace | 🟡 in progress | M0 (RU gate) shipped — see below. Broker (M1+) not started. Plan: FLEET_DISPATCH_REDESIGN.md |
| 5 | Self-optimizing / learned routing | ☐ 0% | Not started |
Phase-4 M0 (RU gate) is live (2026-05-31): a per-product fleet_queue_state doc holds a monotonic version (bumped on job create + every stage change); factories with AQ_FLEET_GATE=1 point-read GET /fleet/queue-state (~1 RU) and skip the expensive claim while nothing has changed, cutting idle Cosmos RU without raising the local poll interval. Default OFF; the live fleet runs with it on.
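The gate described above can be pictured as a version compare on the factory side. The following is a minimal sketch, not the shipped code: the `RuGate` class, the `QueueState` shape, and the method name are hypothetical, and only the idea of skipping the claim while the monotonic version is unchanged comes from the text.

```typescript
// Hypothetical shape of the point-read result; only `version` is taken from the text.
type QueueState = { version: number };

class RuGate {
  // Version observed on the previous poll; null until the first read.
  private lastSeenVersion: number | null = null;

  // Returns true when the factory should run the expensive claim.
  shouldClaim(state: QueueState): boolean {
    if (this.lastSeenVersion === null || state.version !== this.lastSeenVersion) {
      this.lastSeenVersion = state.version;
      return true;
    }
    // Version unchanged: nothing was created and no stage changed, so skip the claim.
    return false;
  }
}
```

A factory would call `shouldClaim` once per local poll with the result of the cheap queue-state read and only hit the claim endpoint when it returns true. A production version would probably record the new version only after the claim call itself succeeded; that refinement is left out here.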
3. System architecture
graph TB
subgraph CP["Control planes (operators)"]
WEB["tracker-web Fleet UI<br/>(Next.js, /dashboard/fleet/*)"]
TUI["agent-queue TUI<br/>(dashboard.mjs, AQ_FLEET_DASH=1)"]
end
subgraph SVC["platform-service — fleet module (the spine)"]
ROUTES["routes.ts<br/>/fleet REST + SSE"]
COORD["coordinator.ts<br/>claim · lease · fence · reaper<br/>preemption · budgets · DAG · review"]
SCHED["scheduler.ts<br/>pure scoring router (§7)"]
ENROLL["enrollment.ts<br/>factory tokens (scoped, rotatable)"]
BRIDGE["tracker-bridge.ts<br/>job ↔ tracker item"]
ARTIF["artifacts.ts / artifacts-blob.ts<br/>pointer + blob bytes"]
REPO["repository.ts<br/>CAS (rev/_etag) CRUD"]
end
subgraph DATA["@bytelyst/datastore (Cosmos / memory)"]
JOBS[("fleet_jobs")]
RUNS[("fleet_runs")]
LEASES[("fleet_leases")]
FAC[("fleet_factories")]
PROF[("fleet_profiles")]
EVENTS[("fleet_events")]
ARTDOCS[("fleet_artifacts")]
end
subgraph FLEET["Factory agents (workers, N hosts)"]
F1["agent-queue.sh + lib/fleet-client.sh<br/>(AQ_FLEET=1) — mac-1"]
F2["agent-queue.sh + lib/fleet-client.sh<br/>ubuntu-1"]
ENGINES["engines: claude · codex · devin"]
end
WEB -->|/api/fleet proxy| ROUTES
TUI -->|lib/fleet-dash.mjs| ROUTES
ROUTES --> COORD
COORD --> SCHED
ROUTES --> ENROLL
ROUTES --> BRIDGE
ROUTES --> ARTIF
COORD --> REPO
ENROLL --> REPO
BRIDGE --> REPO
ARTIF --> ARTDOCS
REPO --> JOBS & RUNS & LEASES & FAC & PROF & EVENTS
F1 -->|heartbeat · claim · patch fenced · renew| ROUTES
F2 -->|heartbeat · claim · patch fenced · renew| ROUTES
F1 --> ENGINES
F2 --> ENGINES
Layering principle: scheduler.ts is pure (no I/O — all inputs passed
in), coordinator.ts is the orchestration core, repository.ts is the only thing
that touches the datastore, and routes.ts is the only thing that touches HTTP.
Factories never touch the DB directly — they only call REST.
4. Job lifecycle (stages)
stateDiagram-v2
[*] --> queued: submitJob
queued --> blocked: unmet deps
blocked --> queued: deps satisfied (reaper/unblock)
queued --> assigned: claimNextJob (CAS win + lease)
assigned --> building: factory starts (patch fenced)
building --> review: rc=0 → review gate
building --> testing: verify-pass (auto)
review --> testing: approve / requestReview quorum
testing --> shipped: ship (manual gate)
building --> failed: verify-fail / budget_exceeded / timeout
review --> failed: reject
assigned --> queued: lease expired (reaper, +epoch, keep checkpoint)
building --> queued: preempted (critical job, checkpoint + epoch bump)
failed --> queued: requeue (operator)
failed --> dead_letter: retries exhausted
shipped --> [*]
dead_letter --> [*]
Stages (types.ts): queued · blocked · assigned · building · review · testing · shipped · failed · dead_letter. The TUI/local board collapse these onto kanban
buckets (inbox/building/review/testing/shipped/failed) for parity.
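The state diagram above can be encoded as a transition table. This is a sketch under the assumption that the diagram lists every legal edge; the names `TRANSITIONS`, `canTransition`, and `isTerminal` are illustrative and are not claimed to exist in types.ts.

```typescript
type Stage =
  | "queued" | "blocked" | "assigned" | "building" | "review"
  | "testing" | "shipped" | "failed" | "dead_letter";

// Allowed edges, copied from the lifecycle diagram.
const TRANSITIONS: Record<Stage, readonly Stage[]> = {
  queued: ["blocked", "assigned"],
  blocked: ["queued"],
  assigned: ["building", "queued"],                      // queued = lease expired
  building: ["review", "testing", "failed", "queued"],   // queued = preempted
  review: ["testing", "failed"],
  testing: ["shipped"],
  failed: ["queued", "dead_letter"],
  shipped: [],
  dead_letter: [],
};

// True when the diagram has an edge from `from` to `to`.
function canTransition(from: Stage, to: Stage): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Terminal stages have no outgoing edges.
function isTerminal(stage: Stage): boolean {
  return TRANSITIONS[stage].length === 0;
}
```

Keeping the edges in one table makes it easy to reject an illegal stage change before it reaches the fenced PATCH path.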
5. The core guarantee — atomic claim + lease fencing
This is the heart of "no double-assignment, ever" and "a dead worker can never corrupt a reassigned job."
sequenceDiagram
participant FA as Factory A
participant FB as Factory B
participant CO as coordinator
participant DB as fleet_jobs / fleet_leases
FA->>CO: POST /fleet/claim (caps)
FB->>CO: POST /fleet/claim (caps)
CO->>DB: selectJob() → job J (rev=5)
CO->>DB: revUpdate J: queued→assigned IF rev==5 (CAS)
DB-->>CO: A wins (rev→6, leaseEpoch=1)
CO->>DB: revUpdate J IF rev==5 (B's CAS)
DB-->>CO: conflict (B re-selects)
CO-->>FA: assigned J (leaseEpoch=1)
CO-->>FB: conflict → next job
Note over FA: A crashes mid-build
CO->>DB: reapExpiredLeases(): lease expired → J back to queued,<br/>leaseEpoch=2, checkpoint preserved
FB->>CO: claim → J (leaseEpoch=2)
FA-->>CO: (zombie) PATCH J stage=shipped leaseEpoch=1
CO-->>FA: 409 fenced (1 < 2) — rejected
- CAS: repository.revUpdateJob / revUpdateLease write only if the stored rev matches (Cosmos _etag/If-Match; memory provider re-reads rev).
- Fencing: every worker mutation carries leaseEpoch; epoch < job.leaseEpoch ⇒ fenced (409).
- Reaper: reapExpiredLeases(now) requeues expired-lease jobs, bumps the epoch, and keeps the checkpoint (WIP git branch pointer) so work resumes rather than restarts. Cosmos TTL cannot do this; the reaper owns recovery.
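The three rules above can be exercised with a small in-memory model. This is a sketch only: `MemoryJobStore`, `claim`, `reap`, and `fencedPatch` are hypothetical names, and only `rev` and `leaseEpoch` come from the text. It assumes, following the sequence diagram, that a claim keeps the current epoch and that only the reaper bumps it.

```typescript
type Stage = "queued" | "assigned" | "building" | "shipped";

interface Job {
  id: string;
  stage: Stage;
  rev: number;        // bumped on every successful write
  leaseEpoch: number; // bumped by the reaper on every reclaim
  holder: string | null;
}

type Conflict = { conflict: true };

class MemoryJobStore {
  private jobs: Record<string, Job> = {};

  put(job: Job): void {
    this.jobs[job.id] = { ...job };
  }

  get(id: string): Job | undefined {
    const job = this.jobs[id];
    return job ? { ...job } : undefined;
  }

  // Compare-and-swap: apply the patch only if the stored rev still matches.
  revUpdate(id: string, expectedRev: number, patch: Partial<Job>): Job | Conflict {
    const current = this.jobs[id];
    if (!current || current.rev !== expectedRev) return { conflict: true };
    const next: Job = { ...current, ...patch, id, rev: current.rev + 1 };
    this.jobs[id] = next;
    return { ...next };
  }
}

function isConflict(r: Job | Conflict): r is Conflict {
  return (r as Conflict).conflict === true;
}

// Claim: queued -> assigned, guarded by the rev the caller read earlier.
function claim(store: MemoryJobStore, seen: Job, factoryId: string): Job | Conflict {
  if (seen.stage !== "queued") return { conflict: true };
  return store.revUpdate(seen.id, seen.rev, { stage: "assigned", holder: factoryId });
}

// Reaper: return an expired job to the queue and bump the epoch.
function reap(store: MemoryJobStore, id: string): Job | Conflict {
  const current = store.get(id);
  if (!current) return { conflict: true };
  return store.revUpdate(id, current.rev, {
    stage: "queued",
    holder: null,
    leaseEpoch: current.leaseEpoch + 1,
  });
}

// Fenced worker mutation: a stale epoch is rejected with 409.
function fencedPatch(
  store: MemoryJobStore, id: string, leaseEpoch: number, stage: Stage,
): { status: 200 | 404 | 409 } {
  const current = store.get(id);
  if (!current) return { status: 404 };
  if (leaseEpoch < current.leaseEpoch) return { status: 409 };
  const result = store.revUpdate(id, current.rev, { stage });
  return { status: isConflict(result) ? 409 : 200 };
}
```

Two claimers that both read rev 5 race on `revUpdate`: the first write moves the job to rev 6 and the second gets a conflict. After `reap` bumps the epoch to 2, a zombie PATCH carrying epoch 1 returns 409. Like the memory provider described in this document, the model is only exact for sequential calls; real concurrency safety comes from the datastore's conditional write.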
6. Data model (Cosmos containers)
| Container | PK | Purpose |
|---|---|---|
| fleet_jobs | /productId | durable job: manifestSnapshot, verbatim bodyMd, stage, idempotencyKey, deps, depsMode, checkpoint, priority, rev, leaseEpoch, kind, parentId |
| fleet_runs | /jobId | one execution attempt: engine, timings, result, insights (tokens/cost/diff) |
| fleet_leases | /jobId | single-holder lease: holderFactoryId, expiresAt, leaseEpoch, status |
| fleet_factories | /productId | worker host: capabilities[], health, load, seatLimit, lastHeartbeatAt |
| fleet_profiles | /productId | immutable, versioned persona/capability profile snapshot |
| fleet_events | /jobId | append-only audit stream (monotonic seq); powers SSE |
| fleet_artifacts | /jobId | pointers to blob-stored artifacts (no inline logs) |
| fleet_queue_state | /productId | Phase-4 M0 RU gate: monotonic version bumped on job create + every stage change; read via GET /fleet/queue-state so a factory can cheaply detect "work changed" |
Every document carries productId. Containers registered in lib/cosmos-init.ts.
7. The scheduler / scoring router (scheduler.ts)
Pure, deterministic, fixed-weight (tunable per-product in Phase 3, learned in Phase 5). Filter → score → rank:
score = w1·capabilityFit + w2·affinity + w3·(1/(1+load))
+ w4·costFit(budget) + w5·health − w6·starvationPenalty(age)
Default weights (DEFAULT_WEIGHTS): capabilityFit 1.0 · affinity 0.5 · load 1.0 · costFit 0.75 · health 1.0 · starvation 1.5. Capability is a hard filter
(subset check); down factories are filtered out, not scored; aging fully
de-penalises after ~30 min (anti-starvation). scoreCandidate returns a per-term
breakdown that powers the explainability panel (GET /fleet/jobs/:id/explain
→ ExplainPanel). selectPreemptionVictim picks the lowest-priority running job a
critical job may evict (under FLEET_PREEMPTION).
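The filter → score → rank flow can be sketched as follows. The weights and the formula are the ones quoted above; everything else is an assumption made for illustration: the `Factory` and `JobRequest` shapes, the linear starvation penalty that reaches zero at 30 minutes, the tie-break on factory id, and the example capability tokens. The real scheduler.ts may compute each term differently.

```typescript
interface Weights {
  capabilityFit: number; affinity: number; load: number;
  costFit: number; health: number; starvation: number;
}

// Default weights as listed in this section.
const DEFAULT_WEIGHTS: Weights = {
  capabilityFit: 1.0, affinity: 0.5, load: 1.0, costFit: 0.75, health: 1.0, starvation: 1.5,
};

// Hypothetical candidate shape: per-term inputs are assumed to be pre-normalised to 0..1.
interface Factory {
  id: string;
  capabilities: string[];
  down: boolean;
  load: number; // 0 = idle
  capabilityFit: number;
  affinity: number;
  costFit: number;
  health: number;
}

interface JobRequest { requiredCapabilities: string[]; ageMs: number }

const FULL_AGING_MS = 30 * 60 * 1000;

// Hard filter: every required capability must be offered (subset check).
function hasAllCapabilities(required: string[], offered: string[]): boolean {
  return required.every((cap) => offered.indexOf(cap) !== -1);
}

// Assumed shape: full penalty for a fresh job, linearly down to zero at ~30 min.
function starvationPenalty(ageMs: number): number {
  return Math.max(0, 1 - ageMs / FULL_AGING_MS);
}

interface ScoreBreakdown { factoryId: string; terms: Record<string, number>; total: number }

function scoreCandidate(job: JobRequest, f: Factory, w: Weights = DEFAULT_WEIGHTS): ScoreBreakdown {
  const terms = {
    capabilityFit: w.capabilityFit * f.capabilityFit,
    affinity: w.affinity * f.affinity,
    load: w.load * (1 / (1 + f.load)),
    costFit: w.costFit * f.costFit,
    health: w.health * f.health,
    starvation: -w.starvation * starvationPenalty(job.ageMs),
  };
  const total = terms.capabilityFit + terms.affinity + terms.load
    + terms.costFit + terms.health + terms.starvation;
  return { factoryId: f.id, terms, total };
}

// Down or incapable factories are filtered out, never scored.
function rankFactories(job: JobRequest, factories: Factory[], w: Weights = DEFAULT_WEIGHTS): ScoreBreakdown[] {
  return factories
    .filter((f) => !f.down && hasAllCapabilities(job.requiredCapabilities, f.capabilities))
    .map((f) => scoreCandidate(job, f, w))
    .sort((a, b) =>
      b.total - a.total || (a.factoryId < b.factoryId ? -1 : a.factoryId > b.factoryId ? 1 : 0));
}
```

Because capability is a hard filter, a job that requires a token no factory advertises yields an empty ranking and is never assigned. The per-term `terms` object plays the role of the breakdown that an explain view would render.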
8. Subsystems at a glance
| Subsystem | File(s) | What it does | Flag |
|---|---|---|---|
| Claim / lease / fence / reaper | coordinator.ts | exactly-once assignment, recovery | none |
| Scoring router + preemption | scheduler.ts, coordinator.ts | best-fit assignment, evict low-pri for critical | FLEET_PREEMPTION |
| Per-product budgets | coordinator.ts (accrueSpend, pause/resume) | ceiling + auto-pause kill-switch; burndown | FLEET_BUDGETS |
| DAG decomposition | coordinator.ts (submitChildren, getDagSubtree, maybeUnblockParent) | composite job fans out to children; deps gate parent | none |
| Review gate | coordinator.ts (requestReview, submitReview) | multi-reviewer quorum before ship | none |
| Factory enrollment | enrollment.ts | scoped, rotatable, hashed tokens; auth on claim/heartbeat | none |
| Tracker bridge | tracker-bridge.ts | idempotent ingest of tracker item → job; one-way status echo | none |
| Artifacts | artifacts.ts, artifacts-blob.ts | pointer docs in Cosmos, bytes in blob (SAS) | none |
| Live events | routes.ts SSE + fleet_events | GET /fleet/jobs/:id/events/stream | none |
| Metrics / alerts | coordinator.ts (fleetMetrics) | utilization, health rollup, starvation alerts | none |
9. REST API surface (/fleet, under /api, auth + x-product-id)
Jobs POST /fleet/jobs · GET /fleet/jobs · GET /fleet/jobs/:id
PATCH /fleet/jobs/:id (fenced) · POST /fleet/jobs/:id/actions/:action
Claim POST /fleet/claim
Lease POST /fleet/jobs/:id/lease/renew · /lease/release
Factories POST /fleet/factories/heartbeat · /enroll
POST /fleet/factories/:id/token/rotate · /token/revoke
Runs/Events GET /fleet/jobs/:id/runs · /events · /events/stream (SSE) · /explain
Review POST /fleet/jobs/:id/review/request · /review
Budgets GET /fleet/budgets/:productId · /burndown
PUT /fleet/budgets/:productId · POST /pause · /resume
DAG POST /fleet/jobs/:id/children · GET /fleet/jobs/:id/dag
Artifacts POST /fleet/jobs/:id/artifacts · GET (list) · GET/DELETE /fleet/artifacts/:id
Tracker POST /fleet/tracker/ingest · /fleet/tracker/echo
Metrics GET /fleet/metrics · GET /fleet/queue-state (Phase-4 M0 RU gate)
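A caller of this surface needs a base URL, a bearer token, and an x-product-id header. The sketch below only builds request descriptors and sends nothing. `fleetRequest`, `claimRequest`, and `fencedPatchRequest` are hypothetical helpers, and the claim body field name `capabilities` is an assumption; the `{stage, leaseEpoch}` PATCH body matches what this document says the dashboard sends.

```typescript
interface FleetClientConfig {
  baseUrl: string;   // e.g. the value of AQ_FLEET_API, including any /api prefix
  token: string;     // bearer token (AQ_FLEET_TOKEN)
  productId: string; // sent as x-product-id (AQ_PRODUCT_ID)
}

interface RequestSpec {
  method: "GET" | "POST" | "PATCH";
  url: string;
  headers: Record<string, string>;
  body?: string;
}

// Builds a request descriptor; the caller decides how to send it.
function fleetRequest(
  cfg: FleetClientConfig, method: RequestSpec["method"], path: string, body?: unknown,
): RequestSpec {
  const base = cfg.baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const headers: Record<string, string> = {
    authorization: `Bearer ${cfg.token}`,
    "x-product-id": cfg.productId,
  };
  const spec: RequestSpec = { method, url: base + path, headers };
  if (body !== undefined) {
    headers["content-type"] = "application/json";
    spec.body = JSON.stringify(body);
  }
  return spec;
}

// POST /fleet/claim; the body shape is assumed, not confirmed by the source.
function claimRequest(cfg: FleetClientConfig, capabilities: string[]): RequestSpec {
  return fleetRequest(cfg, "POST", "/fleet/claim", { capabilities });
}

// PATCH /fleet/jobs/:id carrying the lease epoch so the coordinator can fence it.
function fencedPatchRequest(
  cfg: FleetClientConfig, jobId: string, stage: string, leaseEpoch: number,
): RequestSpec {
  return fleetRequest(cfg, "PATCH", `/fleet/jobs/${encodeURIComponent(jobId)}`, { stage, leaseEpoch });
}
```

Separating descriptor construction from transport keeps the header and fencing logic testable without a running platform-service.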
10. The two control planes & feature flags
Browser (tracker-web) — dashboards/tracker-web/src/:
- app/dashboard/fleet/page.tsx: fleet map (factory cards, health/load/caps, metrics + alerts)
- app/dashboard/fleet/jobs/page.tsx: stage-filtered job table
- app/dashboard/fleet/jobs/[id]/page.tsx: job detail (SSE event timeline, runs, artifacts, DAG view, ExplainPanel, ReviewGateCard, ship/requeue/reject)
- app/dashboard/fleet/budget/page.tsx: burndown chart + pause/resume kill-switch
- lib/fleet-client.ts: typed client; subscribeJobEvents (fetch-based SSE w/ auth + Last-Event-ID resume + poll fallback); graceful 404 → null
- app/api/fleet/[...path]/route.ts: proxy to platform-service
Terminal (agent-queue) — learning_ai_devops_tools/agent-queue/:
- dashboard.mjs (AQ_FLEET_DASH=1) → lib/fleet-dash.mjs adapter: board counts, factories (per-factory rows or metrics aggregate), alerts, running, actionable JOBS w/ tags, recent, per-job events log; ship/requeue/reject via /fleet. Local folder-queue mode is byte-for-byte unchanged when the flag is off.
Feature flags
| Flag | Where | Effect |
|---|---|---|
| FLEET_PREEMPTION | platform-service | enable critical-job preemption + seat limits |
| FLEET_BUDGETS | platform-service | enable budget enforcement + auto-pause |
| AQ_FLEET | factory runner | runner becomes a coordinator factory (claim/report) |
| AQ_FLEET_ROUTE / AQ_FLEET_SHADOW | factory runner | route via service / side-effect-free shadow compare |
| AQ_FLEET_GATE | factory runner | point-read GET /fleet/queue-state and skip the claim while the queue version is unchanged (Phase-4 M0 RU gate) |
| AQ_FLEET_DASH | TUI | dashboard sources board from /fleet API |
| AQ_FLEET_API / AQ_FLEET_TOKEN / AQ_PRODUCT_ID | both | base URL / bearer / x-product-id |
All flags default off → the system is byte-for-byte the prior single-host tool.
11. Code map (where everything lives)
learning_ai_common_plat (the durable spine):
services/platform-service/src/modules/fleet/
types.ts Zod schemas + canonical model (stages, lease, budget, DAG, events)
repository.ts per-container CRUD + revUpdate CAS, appendEvent, listChildrenByParent
coordinator.ts submit/claim/lease/fence/reaper, preemption, budgets, DAG, review, metrics
scheduler.ts pure scoring router + selectPreemptionVictim + scoreCandidate (explain)
enrollment.ts factory enroll / rotate / revoke / enforceFactoryToken
tracker-bridge.ts ingest tracker item → job; one-way status echo
artifacts.ts artifact pointer mgmt
artifacts-blob.ts blob upload/download/delete (SAS)
routes.ts all /fleet REST + SSE
*.test.ts coordinator/scheduler/repository/routes/enrollment/tracker/artifacts/types
dashboards/tracker-web/src/
app/dashboard/fleet/** the browser control plane (pages above)
lib/fleet-client.ts typed client + SSE
app/api/fleet/[...path]/route.ts proxy
e2e/fleet.spec.ts Playwright specs
lib/cosmos-init.ts container registration
docs/GIGAFACTORY/gigafactory-phase3-progress.md / docs/GIGAFACTORY/FLEET_CONTROL_PLANE.md
learning_ai_devops_tools (the factory agent + TUI + spec):
agent-queue/
agent-queue.sh single-host runner + factory agent (AQ_FLEET); budget.wall, retry, recover
lib/fleet-client.sh curl-only coordinator client (register/claim/report/renew, fencing-aware)
lib/fleet-dash.mjs TUI fleet-mode adapter over /fleet (+ fleet-dash.test.mjs, 22 assertions)
dashboard.mjs the TUI (local + fleet modes)
profiles/*.md persona+capability catalog
demo/two-factory-demo.sh + coordinator-stub.sh parallel-fleet demo
selftest.sh ~75 dependency-light checks
docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md source-of-truth spec & checklists
docs/GIGAFACTORY/GIGAFACTORY_SYSTEM_OVERVIEW.md (this file)
12. Test coverage (what's verified)
- platform-service fleet (~134+ tests): atomic-claim race (true concurrency, no double-assign), fencing rejection, reaper reclaim + checkpoint, scheduler scoring / tie-breaks / starvation / preemption-victim, DAG fan-out/unblock/subtree, budgets + burndown + auto-pause, review-gate quorum, enrollment/token lifecycle + auth enforcement, tracker ingest/echo idempotency, routes (incl. SSE + explain), schema validation.
- tracker-web (~198 tests): fleet-client unit tests + page render; SSE parse/resume/fallback; graceful 404 degradation.
- tracker-web e2e (e2e/fleet.spec.ts): fleet map, live log, ship, budget-pause, review-gate (Playwright; needs CI wiring).
- agent-queue (selftest.sh, ~75 checks): manifest/profiles/caps/priority/deps/idempotency, retry/recover/insights, tracker round-trip, AQ_FLEET register/claim/fenced-patch/reaper-reclaim/quarantine, shadow AGREE/DIVERGE, two-factory demo, budget.wall enforcement, fleet-dash adapter (22 assertions).
13. Next steps
Immediate (close Phase 1–3 to a clean 100%):
- Validate the Cosmos _etag/If-Match CAS path under true contention and live blob-backed fleet_artifacts: the two items the roadmap marks as "remaining for a hard 100%" on Phase 2/3 (tests today use the memory provider + pointer-only artifacts).
- Wire e2e/fleet.spec.ts into CI (Playwright install + a verify job) so the Phase-3 exit criterion ("web verify incl. e2e green") is enforced, not just present.
- Live multi-host operator run end-to-end (the Phase-3 acceptance: drive the 3-repo parallel workload from the browser, including a budget pause + resume against a real platform-service, not the stub).
Phase 4 (scale-out) — in progress; see FLEET_DISPATCH_REDESIGN.md:
- ✅ M0 (done), RU gate: fleet_queue_state + GET /fleet/queue-state + AQ_FLEET_GATE; factories skip the claim while the queue version is unchanged.
- M1+: broker (the redesign picks Azure Service Bus, not NATS/Redis, for subscription filters + DLQ) for push dispatch + backpressure in a coordinator-owns-scheduling / broker-owns-delivery hybrid (keeps the scorer).
- M3: autoscaling — scale-to-zero ephemeral factories (KEDA/Container Apps) keyed to subscription depth.
- Capability marketplace — route rare-capability jobs (xcode/figma/gpu) to the few factories that have them; cross-product queueing fairness.
- Load + chaos suite — factory churn, broker outage, thundering herd.
Phase 5 (learned routing):
- Capture per-run outcome features → offline eval harness (learned vs heuristic) → shadow/A-B with guardrails → surface recommendations ("route NomGap UX jobs to claude on mac-2: 23% faster").
14. Bugs, gaps & risks (be honest)
Documentation status (reconciled 2026-05-31):
GIGAFACTORY_ROADMAP.md §0 now reads Phase 0 ✅100% · 1 ✅~98% · 2 ✅~98% · 3 ✅100% · 4 ◐ in progress (~10%, M0 shipped) · 5 ☐. Phase-2 boxes for the scheduler core and factory enrollment/scoped tokens are ticked (scheduler.ts selectJob/selectPreemptionVictim wired into claimNextJob; enrollment.ts enforceFactoryToken gating claim/heartbeat). The earlier "stale §0 table" warning no longer applies.
Runtime / correctness gaps:
- SSE is poll-fallback based, not a push-only contract. subscribeJobEvents falls back to getJobEvents() polling on stream error: fine for resilience, but "live" can silently degrade to polling without a visible operator signal.
- UI pages degrade silently on some errors (empty states / null), which can mask a real backend outage as "nothing happening."
- Budget page assumes ceilingUsd exists when rendering the spend bar; a budget doc without a ceiling could render a broken/NaN bar. Guard it.
- Dashboard patchJob only sends {stage, leaseEpoch}; other fenced-transition fields (e.g. checkpoint) aren't exposed in the web UI, so operator-driven transitions can't carry a checkpoint.
- rev CAS on the memory provider is exact only for the sequential calls the coordinator/tests make (re-read rev before write). Real concurrency safety depends on Cosmos _etag/If-Match in production; verify the Cosmos path under true contention before relying on it at scale.
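The ceilingUsd guard suggested in the list above could look like the sketch below. It is illustrative only: `spendBarPercent` and the `spentUsd` parameter name are hypothetical, and returning null to mean "render no bar" is one possible convention, not the page's actual contract.

```typescript
// Returns the bar fill in percent, or null when no meaningful bar can be drawn.
function spendBarPercent(spentUsd: number, ceilingUsd?: number | null): number | null {
  // A missing, non-finite, or non-positive ceiling would otherwise yield NaN or Infinity.
  if (typeof ceilingUsd !== "number" || !isFinite(ceilingUsd) || ceilingUsd <= 0) {
    return null;
  }
  // Treat bad or negative spend as an empty bar rather than propagating NaN.
  if (!isFinite(spentUsd) || spentUsd < 0) {
    return 0;
  }
  // Clamp overspend so the bar never overflows its track.
  return Math.min(100, (spentUsd / ceilingUsd) * 100);
}
```

The caller can then render an explicit "no ceiling set" state on null instead of a broken bar.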
TUI-specific (this repo):
- Fleet utilization % only renders in the metrics-aggregate fallback branch, not when per-factory rows are present — a minor inconsistency in the TUI board.
- The budget.wall live selftest is timing-sensitive (races a 2s wall ceiling) and can flake under heavy disk/CPU load; the code is correct but the test could be made more robust (e.g. inject the clock).
- TUI fleet mode has no write path for budgets/preemption — it's read + job actions only; budget pause/resume is web-only.
Operational gotchas (verified on the live fleet — get these right):
- Heartbeat cadence MUST be < the 90s stale threshold. fleet_metrics marks a factory stale after DEFAULT_STALE_FACTORY_MS = 90_000, but the factory only heartbeats every AQ_FLEET_LEASE_RENEW_SEC (default 300s). Left at the default, a healthy factory flaps to "stale"/"no live factory" between beats. The fleet launcher sets AQ_FLEET_LEASE_RENEW_SEC=30 to stay well inside the window.
- The tracker-web New-Job form is misconfigured: it hardcodes factories mac-1/mac-2 and defaults capabilities=["build"], a token no agent-queue factory advertises (detect_capabilities emits os:*/engine:*/node:*/has:*). So a default UI submission is unroutable (queues forever → queue_starvation). Fix tracked in the redesign doc's routing-model section.
- No factory deregister API. Only heartbeat/enroll/rotate/revoke exist, so a dead factory's doc lingers and shows as stale until pruned out-of-band (currently a manual Cosmos delete). A prune/deregister path is a Phase-4 item.
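The heartbeat gotcha above reduces to simple arithmetic, which a launcher could check at startup. This is a sketch: `checkHeartbeatCadence` and its return shape are hypothetical, while the 90_000 ms threshold and the 300 s / 30 s values come from the text.

```typescript
const DEFAULT_STALE_FACTORY_MS = 90_000;

interface CadenceCheck {
  ok: boolean;             // true when the interval is strictly inside the stale window
  beatsPerWindow: number;  // whole heartbeats that fit in one stale window
  worstCaseStaleMs: number; // time per cycle a healthy factory would appear stale
}

function checkHeartbeatCadence(
  renewSec: number, staleMs: number = DEFAULT_STALE_FACTORY_MS,
): CadenceCheck {
  const intervalMs = renewSec * 1000;
  return {
    ok: intervalMs < staleMs,
    beatsPerWindow: Math.floor(staleMs / intervalMs),
    worstCaseStaleMs: Math.max(0, intervalMs - staleMs),
  };
}
```

With the default of 300 s, no heartbeat lands inside a 90 s window and a healthy factory would look stale for up to 210 s per cycle; at 30 s, three beats fit in every window.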
Not-yet-built (expected, Phase 4+):
- No message bus yet — dispatch is still poll-based, but the M0 RU gate now skips the claim while idle (so idle Cosmos RU is near-flat). Broker push/ backpressure is M1+.
- No autoscaling — factory fleet is static/manually run (M3 target).
- No capability marketplace / cross-product fairness under contention.
- No load/chaos test suite — resilience is unit-proven, not load-proven.
- Artifacts blob wiring (fleet_artifacts → real blob storage) should be validated against a live storage account (tests use memory/pointer only).
Recently fixed (2026-05-31):
- run --once could return before a backgrounded worker finished the PR/report. _meta_end (which writes ended=) was called right after the testing/ move, before PR open/merge + coordinator reports, so the slot freed early and --once could exit (and a caller could observe completion) mid-PR. Now ended= is written last; the selftest PR-mode case is deterministic again.
15. TL;DR
Phases 0–3 are functionally complete and well-tested: a durable coordinator with
exactly-once leasing + fencing + crash recovery, a deterministic scoring router with
preemption + explainability, per-product budgets, DAG decomposition, a multi-reviewer
gate, factory enrollment with scoped tokens, and two control planes (browser +
TUI) over one /fleet API. The remaining work is (a) validating the Cosmos CAS and blob-artifact paths against live services, (b)
CI-enforcing the existing e2e, and (c) the genuinely new Phase-4 scale-out frontier
(broker, autoscaling, marketplace, chaos) and Phase-5 learned routing.