Add GIGAFACTORY_SYSTEM_OVERVIEW.md, a current-state companion to the roadmap spec covering: what the Agent Gigafactory is, a completion snapshot, three Mermaid diagrams (component architecture, job-lifecycle state machine, atomic claim + lease-fencing sequence), the Cosmos data model, the scoring router, subsystem map, full /fleet REST surface, feature flags, the two control planes, a cross-repo code map, test coverage, next steps (Phase 4/5), and an honest bugs/gaps/risks section. All three Mermaid blocks validated with mermaid.parse.

Also correct documentation drift in GIGAFACTORY_ROADMAP.md found during the review:
- §0 progress table showed Phase 3 as "0% not started" while every Phase-3 box is ticked; updated phases 1-3 to done with realistic percentages.
- Phase-2 boxes "scheduler/router wired into assignment", "tracker adapter direct call", and "factory enrollment + scoped tokens" are implemented in common-plat (coordinator.ts uses selectJob; routes.ts enforces enrollment.enforceFactoryToken; tracker-bridge.ts) but were left unticked; ticked with evidence and refreshed the stale "remaining for 100%" notes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent Gigafactory — System Overview (current picture)
Companion to `GIGAFACTORY_ROADMAP.md` (the source-of-truth spec & checklists). This document describes what is actually built today, how the pieces fit together, the architecture diagrams, the code map across both repos, the next steps, and the known bugs/gaps. Last reviewed: 2026-05-30.
1. What it is (in one paragraph)
The Agent Gigafactory turns a single-host "folder queue" agent runner into a
distributed fleet of agent "factories" (machines: mac/ubuntu/windows) that
claim and execute coding jobs in parallel, coordinated by a durable,
product-agnostic service. A job is a markdown manifest (persona + capabilities +
budget + deps); the coordinator assigns each job to the best-fit factory via a
deterministic scoring router, guarantees exactly-once assignment through
optimistic-concurrency claims + leases with epoch fencing, recovers crashed
work automatically (reaper + WIP checkpoints), enforces per-product budgets,
supports DAG decomposition (composite → child jobs), and exposes the whole
fleet through two control planes: a browser UI (tracker-web) and a terminal
TUI (agent-queue dashboard). Both control planes talk to the same /fleet REST
API.
2. Completion snapshot (reality, not the stale table)
| Phase | Theme | Real status | Notes |
|---|---|---|---|
| 0 | Single-host baseline | ✅ 100% | agent-queue.sh folder queue, selftest green |
| 1 | Manifest + profiles + capabilities + tracker adapter | ✅ ~98% | Only leftover: Node dash field surfacing — now also done via fleet-dash tags. Effectively complete |
| 2 | Coordinator module + Cosmos + multi-factory leasing | ✅ ~98% | Scheduler wiring, enrollment+tokens, tracker-bridge are done in code, but boxes 384/386 are still unticked in the roadmap (see §14, Bugs, gaps & risks) |
| 3 | Fleet control plane (web + TUI) + DAG + budgets + scoring | ✅ 100% (all boxes ticked) | Pending: Playwright e2e wired into CI; live multi-host operator run |
| 4 | Message bus + autoscaling + capability marketplace | ☐ 0% | Not started — next major frontier |
| 5 | Self-optimizing / learned routing | ☐ 0% | Not started |
⚠️ The `GIGAFACTORY_ROADMAP.md` §0 progress table is stale: it shows Phase 3 as "0% not started" although every Phase-3 box below it is ticked. This should be corrected; see §14 (Bugs, gaps & risks).
3. System architecture
graph TB
subgraph CP["Control planes (operators)"]
WEB["tracker-web Fleet UI<br/>(Next.js, /dashboard/fleet/*)"]
TUI["agent-queue TUI<br/>(dashboard.mjs, AQ_FLEET_DASH=1)"]
end
subgraph SVC["platform-service — fleet module (the spine)"]
ROUTES["routes.ts<br/>/fleet REST + SSE"]
COORD["coordinator.ts<br/>claim · lease · fence · reaper<br/>preemption · budgets · DAG · review"]
SCHED["scheduler.ts<br/>pure scoring router (§7)"]
ENROLL["enrollment.ts<br/>factory tokens (scoped, rotatable)"]
BRIDGE["tracker-bridge.ts<br/>job ↔ tracker item"]
ARTIF["artifacts.ts / artifacts-blob.ts<br/>pointer + blob bytes"]
REPO["repository.ts<br/>CAS (rev/_etag) CRUD"]
end
subgraph DATA["@bytelyst/datastore (Cosmos / memory)"]
JOBS[("fleet_jobs")]
RUNS[("fleet_runs")]
LEASES[("fleet_leases")]
FAC[("fleet_factories")]
PROF[("fleet_profiles")]
EVENTS[("fleet_events")]
ARTDOCS[("fleet_artifacts")]
end
subgraph FLEET["Factory agents (workers, N hosts)"]
F1["agent-queue.sh + lib/fleet-client.sh<br/>(AQ_FLEET=1) — mac-1"]
F2["agent-queue.sh + lib/fleet-client.sh<br/>ubuntu-1"]
ENGINES["engines: claude · codex · devin"]
end
WEB -->|/api/fleet proxy| ROUTES
TUI -->|lib/fleet-dash.mjs| ROUTES
ROUTES --> COORD
COORD --> SCHED
ROUTES --> ENROLL
ROUTES --> BRIDGE
ROUTES --> ARTIF
COORD --> REPO
ENROLL --> REPO
BRIDGE --> REPO
ARTIF --> ARTDOCS
REPO --> JOBS & RUNS & LEASES & FAC & PROF & EVENTS
F1 -->|heartbeat · claim · patch fenced · renew| ROUTES
F2 -->|heartbeat · claim · patch fenced · renew| ROUTES
F1 --> ENGINES
F2 --> ENGINES
Layering principle: scheduler.ts is pure (no I/O — all inputs passed
in), coordinator.ts is the orchestration core, repository.ts is the only thing
that touches the datastore, and routes.ts is the only thing that touches HTTP.
Factories never touch the DB directly — they only call REST.
4. Job lifecycle (stages)
stateDiagram-v2
[*] --> queued: submitJob
queued --> blocked: unmet deps
blocked --> queued: deps satisfied (reaper/unblock)
queued --> assigned: claimNextJob (CAS win + lease)
assigned --> building: factory starts (patch fenced)
building --> review: rc=0 → review gate
building --> testing: verify-pass (auto)
review --> testing: approve / requestReview quorum
testing --> shipped: ship (manual gate)
building --> failed: verify-fail / budget_exceeded / timeout
review --> failed: reject
assigned --> queued: lease expired (reaper, +epoch, keep checkpoint)
building --> queued: preempted (critical job, checkpoint + epoch bump)
failed --> queued: requeue (operator)
failed --> dead_letter: retries exhausted
shipped --> [*]
dead_letter --> [*]
Stages (types.ts): queued · blocked · assigned · building · review · testing · shipped · failed · dead_letter. The TUI and the local board collapse these onto kanban
buckets (inbox/building/review/testing/shipped/failed) for parity.
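The diagram can also be read as a transition table. The following is a minimal sketch, not the contents of `types.ts`: the stage names come from the list above and the edges are transcribed from the state diagram, while `TRANSITIONS`, `canTransition` and `isTerminal` are hypothetical names introduced for illustration.

```typescript
// Stage names as listed in this overview.
type Stage =
  | "queued" | "blocked" | "assigned" | "building" | "review"
  | "testing" | "shipped" | "failed" | "dead_letter";

// Allowed edges, transcribed from the state diagram in this section.
// TRANSITIONS is a hypothetical name, not an export of types.ts.
const TRANSITIONS: Record<Stage, readonly Stage[]> = {
  queued: ["blocked", "assigned"],
  blocked: ["queued"],
  assigned: ["building", "queued"],                    // queued: lease expired
  building: ["review", "testing", "failed", "queued"], // queued: preempted
  review: ["testing", "failed"],
  testing: ["shipped"],
  shipped: [],
  failed: ["queued", "dead_letter"],
  dead_letter: [],
};

// True when the diagram contains a direct edge from `from` to `to`.
function canTransition(from: Stage, to: Stage): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Terminal stages have no outgoing edges (shipped, dead_letter).
function isTerminal(stage: Stage): boolean {
  return TRANSITIONS[stage].length === 0;
}
```

A table like this makes illegal moves (for example `shipped` back to `queued`) a lookup failure rather than something each caller has to remember.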
5. The core guarantee — atomic claim + lease fencing
This is the heart of "no double-assignment, ever" and "a dead worker can never corrupt a reassigned job."
sequenceDiagram
participant FA as Factory A
participant FB as Factory B
participant CO as coordinator
participant DB as fleet_jobs / fleet_leases
FA->>CO: POST /fleet/claim (caps)
FB->>CO: POST /fleet/claim (caps)
CO->>DB: selectJob() → job J (rev=5)
CO->>DB: revUpdate J: queued→assigned IF rev==5 (CAS)
DB-->>CO: A wins (rev→6, leaseEpoch=1)
CO->>DB: revUpdate J IF rev==5 (B's CAS)
DB-->>CO: conflict (B re-selects)
CO-->>FA: assigned J (leaseEpoch=1)
CO-->>FB: conflict → next job
Note over FA: A crashes mid-build
CO->>DB: reapExpiredLeases(): lease expired → J back to queued,<br/>leaseEpoch=2, checkpoint preserved
FB->>CO: claim → J (leaseEpoch=2)
FA-->>CO: (zombie) PATCH J stage=shipped leaseEpoch=1
CO-->>FA: 409 fenced (1 < 2) — rejected
- CAS: `repository.revUpdateJob` / `revUpdateLease` write only if the stored `rev` matches (Cosmos `_etag` / `If-Match`; the memory provider re-reads `rev`).
- Fencing: every worker mutation carries `leaseEpoch`; epoch < `job.leaseEpoch` ⇒ `fenced` (409).
- Reaper: `reapExpiredLeases(now)` requeues expired-lease jobs, bumps the epoch, and keeps the `checkpoint` (WIP git branch pointer) so work resumes rather than restarts. Cosmos TTL cannot do this; the reaper owns recovery.
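To make the three mechanisms concrete, here is a small in-memory sketch that replays the sequence above. It is an illustration under assumptions, not the code in `coordinator.ts` or `repository.ts`: `MemoryJobs`, `claimSketch`, `fencedPatchSketch` and `reapSketch` are hypothetical names, the job record is reduced to a handful of fields, and the sketch assumes a job starts at `leaseEpoch` 1 and only the reaper bumps it (one reading of the diagram).

```typescript
// Hypothetical, reduced job record; the real model lives in types.ts.
interface SketchJob {
  id: string;
  stage: "queued" | "assigned" | "building" | "shipped";
  rev: number;                   // bumped on every successful write
  leaseEpoch: number;            // bumped by the reaper on requeue
  leaseExpiresAt: number | null; // epoch millis, null when unleased
  holder: string | null;
  checkpoint: string | null;     // e.g. a WIP branch pointer
}

class MemoryJobs {
  private jobs = new Map<string, SketchJob>();

  put(job: SketchJob): void { this.jobs.set(job.id, { ...job }); }

  get(id: string): SketchJob | undefined {
    const found = this.jobs.get(id);
    return found ? { ...found } : undefined;
  }

  // Compare-and-swap: write only if the stored rev still equals expectedRev.
  revUpdate(id: string, expectedRev: number, patch: Partial<SketchJob>): boolean {
    const cur = this.jobs.get(id);
    if (!cur || cur.rev !== expectedRev) return false;
    this.jobs.set(id, { ...cur, ...patch, id, rev: cur.rev + 1 });
    return true;
  }
}

// Claim from a previously read snapshot; returns the lease epoch or null on a lost race.
function claimSketch(store: MemoryJobs, seen: SketchJob, factoryId: string,
                     now: number, ttlMs: number): number | null {
  if (seen.stage !== "queued") return null;
  const won = store.revUpdate(seen.id, seen.rev, {
    stage: "assigned", holder: factoryId, leaseExpiresAt: now + ttlMs,
  });
  return won ? seen.leaseEpoch : null;
}

// Worker mutation: rejected as "fenced" when the worker's epoch is stale.
function fencedPatchSketch(store: MemoryJobs, id: string, workerEpoch: number,
                           patch: Partial<SketchJob>): "ok" | "fenced" | "conflict" {
  const job = store.get(id);
  if (!job) return "conflict";
  if (workerEpoch < job.leaseEpoch) return "fenced"; // surfaced as 409 in this overview
  return store.revUpdate(id, job.rev, patch) ? "ok" : "conflict";
}

// Requeue expired leases and bump the epoch; checkpoint is left untouched.
function reapSketch(store: MemoryJobs, ids: readonly string[], now: number): number {
  let reaped = 0;
  for (const id of ids) {
    const job = store.get(id);
    if (!job || job.leaseExpiresAt === null || job.leaseExpiresAt > now) continue;
    if (job.stage !== "assigned" && job.stage !== "building") continue;
    const requeued = store.revUpdate(id, job.rev, {
      stage: "queued", holder: null, leaseExpiresAt: null,
      leaseEpoch: job.leaseEpoch + 1,
    });
    if (requeued) reaped++;
  }
  return reaped;
}
```

Because both factories claim from the same `rev` snapshot, only the first `revUpdate` can succeed; the second sees a changed `rev` and must re-select. The fencing check is independent of the CAS: even a write that would win the CAS is refused if it carries an old epoch.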
6. Data model (Cosmos containers)
| Container | PK | Purpose |
|---|---|---|
| `fleet_jobs` | `/productId` | durable job: manifestSnapshot, verbatim bodyMd, stage, idempotencyKey, deps, depsMode, checkpoint, priority, rev, leaseEpoch, kind, parentId |
| `fleet_runs` | `/jobId` | one execution attempt: engine, timings, result, insights (tokens/cost/diff) |
| `fleet_leases` | `/jobId` | single-holder lease: holderFactoryId, expiresAt, leaseEpoch, status |
| `fleet_factories` | `/productId` | worker host: capabilities[], health, load, seatLimit, lastHeartbeatAt |
| `fleet_profiles` | `/productId` | immutable, versioned persona/capability profile snapshot |
| `fleet_events` | `/jobId` | append-only audit stream (monotonic seq); powers SSE |
| `fleet_artifacts` | `/jobId` | pointers to blob-stored artifacts (no inline logs) |
Every document carries productId. Containers registered in lib/cosmos-init.ts.
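The partition-key column above can be captured as a lookup so that callers never hard-code which field partitions which container. This is a sketch only: the container names and key paths are taken from the table, while `PARTITION_KEY_FIELD`, `FleetDocBase` and `partitionKeyValue` are hypothetical names that are not claimed to exist in `repository.ts` or `lib/cosmos-init.ts`.

```typescript
// Container -> partition-key field, copied from the data-model table.
const PARTITION_KEY_FIELD = {
  fleet_jobs: "productId",
  fleet_runs: "jobId",
  fleet_leases: "jobId",
  fleet_factories: "productId",
  fleet_profiles: "productId",
  fleet_events: "jobId",
  fleet_artifacts: "jobId",
} as const;

type FleetContainer = keyof typeof PARTITION_KEY_FIELD;

// Minimal shape shared by all documents in this sketch.
// productId is always present; jobId only on job-scoped documents.
interface FleetDocBase {
  id: string;
  productId: string;
  jobId?: string;
}

// Returns the value to use as the partition key for `doc` in `container`.
function partitionKeyValue(container: FleetContainer, doc: FleetDocBase): string {
  const field = PARTITION_KEY_FIELD[container];
  const value = doc[field];
  if (value === undefined) {
    throw new Error(`${container} document ${doc.id} is missing ${field}`);
  }
  return value;
}
```

Job-scoped containers partition by `/jobId`, so reads for a single job's runs, lease, events and artifacts each stay within one partition of their container.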
7. The scheduler / scoring router (scheduler.ts)
Pure, deterministic, fixed-weight (tunable per-product in Phase 3, learned in Phase 5). Filter → score → rank:
score = w1·capabilityFit + w2·affinity + w3·(1/(1+load))
+ w4·costFit(budget) + w5·health − w6·starvationPenalty(age)
Default weights (DEFAULT_WEIGHTS): capabilityFit 1.0 · affinity 0.5 · load 1.0 · costFit 0.75 · health 1.0 · starvation 1.5. Capability is a hard filter
(subset check); down factories are filtered out, not scored; aging fully
de-penalises after ~30 min (anti-starvation). scoreCandidate returns a per-term
breakdown that powers the explainability panel (GET /fleet/jobs/:id/explain
→ ExplainPanel). selectPreemptionVictim picks the lowest-priority running job a
critical job may evict (under FLEET_PREEMPTION).
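The filter, score, rank flow can be sketched as below. The weights are the `DEFAULT_WEIGHTS` values quoted above and the formula follows the one in this section; everything else is an assumption made for illustration: `scoreSketch` is a hypothetical name (not the real `scoreCandidate`), each input term is assumed to be pre-normalised to 0..1, `capabilityFit` is taken as 1 once the hard filter passes, and the starvation penalty is assumed to decay linearly to zero over 30 minutes.

```typescript
interface Weights {
  capabilityFit: number; affinity: number; load: number;
  costFit: number; health: number; starvation: number;
}

// Default weights as quoted in this overview.
const DEFAULT_WEIGHTS: Weights = {
  capabilityFit: 1.0, affinity: 0.5, load: 1.0,
  costFit: 0.75, health: 1.0, starvation: 1.5,
};

// Hypothetical inputs; the 0..1 ranges are an assumption of this sketch.
interface FactoryCandidate {
  capabilities: string[];
  up: boolean;
  affinity: number; // 0..1
  load: number;     // running jobs, >= 0
  costFit: number;  // 0..1
  health: number;   // 0..1
}
interface JobRequest { requiredCapabilities: string[]; ageMs: number; }
interface ScoreBreakdown { total: number; terms: Record<keyof Weights, number>; }

const STARVATION_WINDOW_MS = 30 * 60 * 1000;

// Assumed shape: full penalty for a brand-new job, none after ~30 minutes.
function starvationPenalty(ageMs: number): number {
  return Math.max(0, 1 - ageMs / STARVATION_WINDOW_MS);
}

// Returns null when the factory is filtered out, else a per-term breakdown.
function scoreSketch(job: JobRequest, factory: FactoryCandidate,
                     w: Weights = DEFAULT_WEIGHTS): ScoreBreakdown | null {
  if (!factory.up) return null; // down factories are filtered, not scored
  for (const cap of job.requiredCapabilities) {
    if (factory.capabilities.indexOf(cap) === -1) return null; // hard subset filter
  }
  const terms = {
    capabilityFit: w.capabilityFit * 1,
    affinity: w.affinity * factory.affinity,
    load: w.load * (1 / (1 + factory.load)),
    costFit: w.costFit * factory.costFit,
    health: w.health * factory.health,
    starvation: -w.starvation * starvationPenalty(job.ageMs),
  };
  const total = terms.capabilityFit + terms.affinity + terms.load
    + terms.costFit + terms.health + terms.starvation;
  return { total, terms };
}
```

Returning the per-term breakdown alongside the total is what lets an explain view show why one factory outranked another, rather than only the final number.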
8. Subsystems at a glance
| Subsystem | File(s) | What it does | Flag |
|---|---|---|---|
| Claim / lease / fence / reaper | `coordinator.ts` | exactly-once assignment, recovery | - |
| Scoring router + preemption | `scheduler.ts`, `coordinator.ts` | best-fit assignment, evict low-pri for critical | `FLEET_PREEMPTION` |
| Per-product budgets | `coordinator.ts` (`accrueSpend`, pause/resume) | ceiling + auto-pause kill-switch; burndown | `FLEET_BUDGETS` |
| DAG decomposition | `coordinator.ts` (`submitChildren`, `getDagSubtree`, `maybeUnblockParent`) | composite job fans out to children; deps gate parent | - |
| Review gate | `coordinator.ts` (`requestReview`, `submitReview`) | multi-reviewer quorum before ship | - |
| Factory enrollment | `enrollment.ts` | scoped, rotatable, hashed tokens; auth on claim/heartbeat | - |
| Tracker bridge | `tracker-bridge.ts` | idempotent ingest of tracker item → job; one-way status echo | - |
| Artifacts | `artifacts.ts`, `artifacts-blob.ts` | pointer docs in Cosmos, bytes in blob (SAS) | - |
| Live events | `routes.ts` SSE + `fleet_events` | `GET /fleet/jobs/:id/events/stream` | - |
| Metrics / alerts | `coordinator.ts` (`fleetMetrics`) | utilization, health rollup, starvation alerts | - |
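As an illustration of the DAG row, the sketch below shows one way a composite parent could be gated on its children. It is not the logic of `maybeUnblockParent` or `getDagSubtree`: `DagJob`, `parentCanUnblock` and `subtreeIds` are hypothetical names, and the rule "a parent is ready once every child has shipped" is an assumption of this sketch.

```typescript
// Hypothetical reduced job shape: only the fields the sketch needs.
interface DagJob {
  id: string;
  parentId: string | null;
  stage: string;
}

// Direct children of `parentId`, in input order.
function childrenOf(parentId: string, jobs: readonly DagJob[]): DagJob[] {
  return jobs.filter((job) => job.parentId === parentId);
}

// Assumed rule: a composite parent may leave "blocked" once it has
// at least one child and every child has reached "shipped".
function parentCanUnblock(parentId: string, jobs: readonly DagJob[]): boolean {
  const children = childrenOf(parentId, jobs);
  return children.length > 0 && children.every((child) => child.stage === "shipped");
}

// Depth-first list of ids under (and including) `rootId`.
function subtreeIds(rootId: string, jobs: readonly DagJob[]): string[] {
  const out: string[] = [];
  const visit = (id: string): void => {
    if (out.indexOf(id) !== -1) return; // guard against accidental cycles
    out.push(id);
    for (const child of childrenOf(id, jobs)) visit(child.id);
  };
  visit(rootId);
  return out;
}
```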
9. REST API surface (/fleet, under /api, auth + x-product-id)
Jobs POST /fleet/jobs · GET /fleet/jobs · GET /fleet/jobs/:id
PATCH /fleet/jobs/:id (fenced) · POST /fleet/jobs/:id/actions/:action
Claim POST /fleet/claim
Lease POST /fleet/jobs/:id/lease/renew · /lease/release
Factories POST /fleet/factories/heartbeat · /enroll
POST /fleet/factories/:id/token/rotate · /token/revoke
Runs/Events GET /fleet/jobs/:id/runs · /events · /events/stream (SSE) · /explain
Review POST /fleet/jobs/:id/review/request · /review
Budgets GET /fleet/budgets/:productId · /burndown
PUT /fleet/budgets/:productId · POST /pause · /resume
DAG POST /fleet/jobs/:id/children · GET /fleet/jobs/:id/dag
Artifacts POST /fleet/jobs/:id/artifacts · GET (list) · GET/DELETE /fleet/artifacts/:id
Tracker POST /fleet/tracker/ingest · /fleet/tracker/echo
Metrics GET /fleet/metrics
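As a usage illustration for the surface above, the sketch below builds (but does not send) two requests: a claim and a fenced patch. The paths, the bearer auth and the `x-product-id` header come from this document, and the `{stage, leaseEpoch}` patch body matches what the dashboard is described as sending. The claim body shape (`capabilities`), the helper names, and the assumption that `apiBase` already ends in the `/api` prefix are all hypothetical.

```typescript
type HttpMethod = "GET" | "POST" | "PATCH";

interface FleetClientConfig {
  apiBase: string;   // assumed to already include the /api prefix
  token: string;     // bearer token
  productId: string; // sent as x-product-id
}

interface FleetRequest {
  url: string;
  method: HttpMethod;
  headers: Record<string, string>;
  body: string | undefined;
}

// Builds a request description; sending it (fetch, curl, ...) is left to the caller.
function fleetRequest(cfg: FleetClientConfig, method: HttpMethod, path: string,
                      payload?: unknown): FleetRequest {
  const headers: Record<string, string> = {
    authorization: `Bearer ${cfg.token}`,
    "x-product-id": cfg.productId,
  };
  let body: string | undefined;
  if (payload !== undefined) {
    headers["content-type"] = "application/json";
    body = JSON.stringify(payload);
  }
  return { url: cfg.apiBase.replace(/\/+$/, "") + path, method, headers, body };
}

// POST /fleet/claim; the body field name is an assumption of this sketch.
function claimRequest(cfg: FleetClientConfig, capabilities: string[]): FleetRequest {
  return fleetRequest(cfg, "POST", "/fleet/claim", { capabilities });
}

// PATCH /fleet/jobs/:id carrying the worker's lease epoch for fencing.
function fencedPatchRequest(cfg: FleetClientConfig, jobId: string,
                            stage: string, leaseEpoch: number): FleetRequest {
  const path = `/fleet/jobs/${encodeURIComponent(jobId)}`;
  return fleetRequest(cfg, "PATCH", path, { stage, leaseEpoch });
}
```

Keeping request construction separate from transport makes the fencing contract easy to unit-test: the test only has to check that every mutating request carries `leaseEpoch`.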
10. The two control planes & feature flags
Browser (tracker-web) — dashboards/tracker-web/src/:
- `app/dashboard/fleet/page.tsx`: fleet map (factory cards, health/load/caps, metrics + alerts)
- `app/dashboard/fleet/jobs/page.tsx`: stage-filtered job table
- `app/dashboard/fleet/jobs/[id]/page.tsx`: job detail: SSE event timeline, runs, artifacts, DAG view, ExplainPanel, ReviewGateCard, ship/requeue/reject
- `app/dashboard/fleet/budget/page.tsx`: burndown chart + pause/resume kill-switch
- `lib/fleet-client.ts`: typed client; `subscribeJobEvents` (fetch-based SSE w/ auth + `Last-Event-ID` resume + poll fallback); graceful 404 → null
- `app/api/fleet/[...path]/route.ts`: proxy to platform-service
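A fetch-based SSE client has to parse the event stream itself and remember the last event id for resume. The sketch below shows that part in isolation, following the standard `id:` / `event:` / `data:` framing with a blank line between events. It is a simplified illustration, not the body of `subscribeJobEvents`: `parseSseChunk` and `resumeHeaders` are hypothetical names, and details such as `retry:` handling are omitted.

```typescript
interface SseEvent {
  id: string | undefined;
  event: string; // defaults to "message" when no event: line is present
  data: string;
}

// Parses every complete frame in `buffer` and returns the unparsed tail,
// which the caller prepends to the next network chunk.
function parseSseChunk(buffer: string): { events: SseEvent[]; rest: string } {
  const frames = buffer.split(/\r?\n\r?\n/);
  const tail = frames.pop();
  const events: SseEvent[] = [];
  for (const frame of frames) {
    let id: string | undefined;
    let eventName = "message";
    const dataLines: string[] = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line === "" || line.charAt(0) === ":") continue; // comment / keep-alive
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.charAt(0) === " ") value = value.slice(1); // one optional space
      if (field === "id") id = value;
      else if (field === "event") eventName = value;
      else if (field === "data") dataLines.push(value);
    }
    if (dataLines.length > 0) {
      events.push({ id, event: eventName, data: dataLines.join("\n") });
    }
  }
  return { events, rest: tail === undefined ? "" : tail };
}

// Header to send when reconnecting so the server can replay missed events.
function resumeHeaders(lastEventId: string | undefined): Record<string, string> {
  return lastEventId === undefined ? {} : { "Last-Event-ID": lastEventId };
}
```

On reconnect (or when falling back to polling) the caller would pass the `id` of the last event it processed, so the timeline continues from that point instead of replaying from the start.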
Terminal (agent-queue) — learning_ai_devops_tools/agent-queue/:
- `dashboard.mjs` (`AQ_FLEET_DASH=1`) → `lib/fleet-dash.mjs` adapter: board counts, factories (per-factory rows or metrics aggregate), alerts, running, actionable JOBS w/ tags, recent, per-job events log; ship/requeue/reject via `/fleet`. Local folder-queue mode is byte-for-byte unchanged when the flag is off.
Feature flags
| Flag | Where | Effect |
|---|---|---|
| `FLEET_PREEMPTION` | platform-service | enable critical-job preemption + seat limits |
| `FLEET_BUDGETS` | platform-service | enable budget enforcement + auto-pause |
| `AQ_FLEET` | factory runner | runner becomes a coordinator factory (claim/report) |
| `AQ_FLEET_ROUTE` / `AQ_FLEET_SHADOW` | factory runner | route via service / side-effect-free shadow compare |
| `AQ_FLEET_DASH` | TUI | dashboard sources board from `/fleet` API |
| `AQ_FLEET_API` / `AQ_FLEET_TOKEN` / `AQ_PRODUCT_ID` | both | base URL / bearer / x-product-id |
All flags default off → the system is byte-for-byte the prior single-host tool.
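The "default off" rule can be expressed as a small reader over an environment-like map. This is a sketch under assumptions: the flag names come from the table above and `=1` is the enabling value shown for `AQ_FLEET` and `AQ_FLEET_DASH`, but treating `"1"` as the only enabling value for every flag is a guess, and `readFleetFlags` is a hypothetical helper rather than code from either repo.

```typescript
// Anything shaped like process.env; passed in so the reader stays pure.
type EnvLike = Record<string, string | undefined>;

interface FleetFlags {
  preemption: boolean;
  budgets: boolean;
  fleet: boolean;
  route: boolean;
  shadow: boolean;
  dash: boolean;
  apiBase: string | undefined;
  token: string | undefined;
  productId: string | undefined;
}

// Assumption of this sketch: a flag is on only when set to exactly "1".
function flagOn(env: EnvLike, key: string): boolean {
  return env[key] === "1";
}

function readFleetFlags(env: EnvLike): FleetFlags {
  return {
    preemption: flagOn(env, "FLEET_PREEMPTION"),
    budgets: flagOn(env, "FLEET_BUDGETS"),
    fleet: flagOn(env, "AQ_FLEET"),
    route: flagOn(env, "AQ_FLEET_ROUTE"),
    shadow: flagOn(env, "AQ_FLEET_SHADOW"),
    dash: flagOn(env, "AQ_FLEET_DASH"),
    apiBase: env["AQ_FLEET_API"],
    token: env["AQ_FLEET_TOKEN"],
    productId: env["AQ_PRODUCT_ID"],
  };
}
```

With an empty environment every boolean is `false`, which corresponds to the single-host behaviour described above.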
11. Code map (where everything lives)
learning_ai_common_plat (the durable spine):
services/platform-service/src/modules/fleet/
types.ts Zod schemas + canonical model (stages, lease, budget, DAG, events)
repository.ts per-container CRUD + revUpdate CAS, appendEvent, listChildrenByParent
coordinator.ts submit/claim/lease/fence/reaper, preemption, budgets, DAG, review, metrics
scheduler.ts pure scoring router + selectPreemptionVictim + scoreCandidate (explain)
enrollment.ts factory enroll / rotate / revoke / enforceFactoryToken
tracker-bridge.ts ingest tracker item → job; one-way status echo
artifacts.ts artifact pointer mgmt
artifacts-blob.ts blob upload/download/delete (SAS)
routes.ts all /fleet REST + SSE
*.test.ts coordinator/scheduler/repository/routes/enrollment/tracker/artifacts/types
dashboards/tracker-web/src/
app/dashboard/fleet/** the browser control plane (pages above)
lib/fleet-client.ts typed client + SSE
app/api/fleet/[...path]/route.ts proxy
e2e/fleet.spec.ts Playwright specs
lib/cosmos-init.ts container registration
docs/gigafactory-phase3-progress.md / docs/FLEET_CONTROL_PLANE.md
learning_ai_devops_tools (the factory agent + TUI + spec):
agent-queue/
agent-queue.sh single-host runner + factory agent (AQ_FLEET); budget.wall, retry, recover
lib/fleet-client.sh curl-only coordinator client (register/claim/report/renew, fencing-aware)
lib/fleet-dash.mjs TUI fleet-mode adapter over /fleet (+ fleet-dash.test.mjs, 22 assertions)
dashboard.mjs the TUI (local + fleet modes)
profiles/*.md persona+capability catalog
demo/two-factory-demo.sh + coordinator-stub.sh parallel-fleet demo
selftest.sh ~75 dependency-light checks
docs/GIGAFACTORY_ROADMAP.md source-of-truth spec & checklists
docs/GIGAFACTORY_SYSTEM_OVERVIEW.md (this file)
12. Test coverage (what's verified)
- platform-service fleet (~134+ tests): atomic-claim race (true concurrency, no double-assign), fencing rejection, reaper reclaim + checkpoint, scheduler scoring / tie-breaks / starvation / preemption-victim, DAG fan-out/unblock/subtree, budgets + burndown + auto-pause, review-gate quorum, enrollment/token lifecycle + auth enforcement, tracker ingest/echo idempotency, routes (incl. SSE + explain), schema validation.
- tracker-web (~198 tests): fleet-client unit tests + page render; SSE parse/resume/fallback; graceful 404 degradation.
- tracker-web e2e (`e2e/fleet.spec.ts`): fleet map, live log, ship, budget-pause, review-gate (Playwright; needs CI wiring).
- agent-queue (`selftest.sh`, ~75 checks): manifest/profiles/caps/priority/deps/idempotency, retry/recover/insights, tracker round-trip, `AQ_FLEET` register/claim/fenced-patch/reaper-reclaim/quarantine, shadow AGREE/DIVERGE, two-factory demo, budget.wall enforcement, fleet-dash adapter (22 assertions).
13. Next steps
Immediate (close Phase 1–3 to a clean 100%):
1. Fix the stale roadmap §0 table and tick Phase-2 boxes 384 (scheduler wired: `selectJob` is used in `claimNextJob`) and 386 (enrollment + scoped tokens: `enrollment.ts` + `enforceFactoryToken` are wired). (See §14, Bugs, gaps & risks.)
2. Wire `e2e/fleet.spec.ts` into CI (Playwright install + a `verify` job) so the Phase-3 exit criterion ("web verify incl. e2e green") is enforced, not just present.
3. Live multi-host operator run end-to-end (the Phase-3 acceptance: drive the 3-repo parallel workload from the browser, including a budget pause + resume against a real platform-service, not the stub).
Phase 4 (scale-out), the next major frontier:
4. Introduce a broker (NATS/Redis) for push dispatch + backpressure; coordinator publishes, factories subscribe by capability (fallback to poll on outage).
5. Autoscaling hooks: spin ephemeral factories (cloud VM/container) keyed to queue depth + SLA.
6. Capability marketplace: route rare-capability jobs (xcode/figma/gpu) to the few factories that have them; cross-product queueing fairness.
7. Load + chaos suite: factory churn, broker outage, thundering herd.
Phase 5 (learned routing):
8. Capture per-run outcome features → offline eval harness (learned vs heuristic) → shadow/A-B with guardrails → surface recommendations ("route NomGap UX jobs to claude on mac-2: 23% faster").
14. Bugs, gaps & risks (be honest)
Documentation drift (highest-signal, easy to fix):
- `GIGAFACTORY_ROADMAP.md` §0 progress table is wrong: it shows Phase 3 "0% not started" while all Phase-3 boxes are ticked, and Phase 1/2 percentages (95%/80%) understate reality.
- Phase-2 boxes 384 & 386 are unticked but done in code. `coordinator.ts` imports/uses `selectJob` + `selectPreemptionVictim` in `claimNextJob`; `routes.ts` enforces `enrollment.enforceFactoryToken` on claim/heartbeat and exposes enroll/rotate/revoke. The roadmap's "remaining for 100%" note on line 390 is outdated.
Runtime / correctness gaps:
- SSE is poll-fallback based, not a push-only contract. `subscribeJobEvents` falls back to `getJobEvents()` polling on stream error. That is fine for resilience, but "live" can silently degrade to polling without a visible operator signal.
- UI pages degrade silently on some errors (empty states / `null`), which can mask a real backend outage as "nothing happening."
- Budget page assumes `ceilingUsd` exists when rendering the spend bar. A budget doc without a ceiling could render a broken/NaN bar. Guard it.
- Dashboard `patchJob` only sends `{stage, leaseEpoch}`. Other fenced-transition fields (e.g. `checkpoint`) aren't exposed in the web UI, so operator-driven transitions can't carry a checkpoint.
- `rev` CAS on the memory provider is exact only for the sequential calls the coordinator/tests make (re-read `rev` before write). Real concurrency safety depends on Cosmos `_etag` / `If-Match` in production. Verify the Cosmos path under true contention before relying on it at scale.
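For the budget-bar item, a guard along the following lines would keep a missing or zero ceiling from producing `NaN` or `Infinity`. This is a sketch, not the page's code: `ceilingUsd` is the field named above, while `spentUsd` and `spendFraction` are hypothetical names.

```typescript
// Returns the fill fraction for the spend bar in [0, 1],
// or null when no meaningful bar can be drawn.
function spendFraction(
  spentUsd: number,
  ceilingUsd: number | null | undefined,
): number | null {
  // Missing, non-finite, zero or negative ceilings cannot be divided by.
  if (typeof ceilingUsd !== "number" || !isFinite(ceilingUsd) || ceilingUsd <= 0) {
    return null;
  }
  if (!isFinite(spentUsd) || spentUsd < 0) return null;
  // Clamp so overspend renders as a full bar rather than overflowing it.
  return Math.min(1, spentUsd / ceilingUsd);
}
```

A `null` result gives the page an explicit "no ceiling set" state to render, which is also distinguishable from a genuine 0% spend.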
TUI-specific (this repo):
- Fleet utilization % only renders in the metrics-aggregate fallback branch, not when per-factory rows are present — a minor inconsistency in the TUI board.
- The budget.wall live selftest is timing-sensitive (races a 2s wall ceiling) and can flake under heavy disk/CPU load; the code is correct but the test could be made more robust (e.g. inject the clock).
- TUI fleet mode has no write path for budgets/preemption — it's read + job actions only; budget pause/resume is web-only.
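The clock-injection idea mentioned for the budget.wall selftest can be illustrated as follows. The actual runner is a shell script, so this TypeScript sketch only demonstrates the technique; `WallBudget` and `Clock` are hypothetical names, and the 2-second limit mirrors the ceiling quoted above.

```typescript
// A clock is just a function returning epoch milliseconds.
type Clock = () => number;

class WallBudget {
  private readonly limitMs: number;
  private readonly now: Clock;
  private readonly startedAt: number;

  constructor(limitMs: number, now: Clock) {
    this.limitMs = limitMs;
    this.now = now;
    this.startedAt = now(); // the budget starts counting at construction
  }

  elapsedMs(): number {
    return this.now() - this.startedAt;
  }

  // True once the wall-clock ceiling has been reached.
  exceeded(): boolean {
    return this.elapsedMs() >= this.limitMs;
  }
}

// Production wiring would pass the real clock: new WallBudget(2000, Date.now).
// A test passes a fake clock and advances it explicitly, so nothing sleeps.
```

With a fake clock the test moves time forward by assignment, so the outcome no longer depends on how loaded the machine is while the test runs.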
Operational / not-yet-built (expected, Phase 4+):
- No message bus — dispatch is poll-based; no push/backpressure yet.
- No autoscaling — factory fleet is static/manually run.
- No capability marketplace / cross-product fairness under contention.
- No load/chaos test suite — resilience is unit-proven, not load-proven.
- Artifacts blob wiring (`fleet_artifacts` → real blob storage) should be validated against a live storage account (tests use memory/pointer only).
15. TL;DR
Phases 0–3 are functionally complete and well-tested: a durable coordinator with
exactly-once leasing + fencing + crash recovery, a deterministic scoring router with
preemption + explainability, per-product budgets, DAG decomposition, a multi-reviewer
gate, factory enrollment with scoped tokens, and two control planes (browser +
TUI) over one /fleet API. The remaining work is (a) trivial doc corrections, (b)
CI-enforcing the existing e2e, and (c) the genuinely new Phase-4 scale-out frontier
(broker, autoscaling, marketplace, chaos) and Phase-5 learned routing.