bytelyst-devops-tools/agent-queue/docs/GIGAFACTORY_SYSTEM_OVERVIEW.md
Saravanakumar D 71e5ad6923 docs(gigafactory): add system overview with architecture diagrams; sync roadmap status
Add GIGAFACTORY_SYSTEM_OVERVIEW.md — a current-state companion to the roadmap
spec covering: what the Agent Gigafactory is, a completion snapshot, three
Mermaid diagrams (component architecture, job-lifecycle state machine, atomic
claim + lease-fencing sequence), the Cosmos data model, the scoring router,
subsystem map, full /fleet REST surface, feature flags, the two control planes,
a cross-repo code map, test coverage, next steps (Phase 4/5), and an honest
bugs/gaps/risks section. All three Mermaid blocks validated with mermaid.parse.

Also correct documentation drift in GIGAFACTORY_ROADMAP.md found during the
review:
- §0 progress table showed Phase 3 as "0% not started" while every Phase-3 box
  is ticked; updated phases 1-3 to done with realistic percentages.
- Phase-2 boxes "scheduler/router wired into assignment", "tracker adapter
  direct call", and "factory enrollment + scoped tokens" are implemented in
  common-plat (coordinator.ts uses selectJob; routes.ts enforces
  enrollment.enforceFactoryToken; tracker-bridge.ts) but were left unticked —
  ticked with evidence and refreshed the stale "remaining for 100%" notes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 20:11:02 -07:00

# Agent Gigafactory — System Overview (current picture)
> Companion to `GIGAFACTORY_ROADMAP.md` (the source-of-truth spec & checklists).
> This document describes **what is actually built today**, how the pieces fit
> together, the architecture diagrams, the code map across both repos, the next
> steps, and the known bugs/gaps. Last reviewed: **2026-05-30**.
---
## 1. What it is (in one paragraph)
The **Agent Gigafactory** turns a single-host "folder queue" agent runner into a
**distributed fleet** of agent "factories" (machines: mac/ubuntu/windows) that
claim and execute coding jobs in parallel, coordinated by a durable,
product-agnostic service. A job is a markdown manifest (persona + capabilities +
budget + deps); the **coordinator** assigns each job to the best-fit factory via a
deterministic scoring router, guarantees **exactly-once assignment** through
optimistic-concurrency claims + **leases with epoch fencing**, recovers crashed
work automatically (reaper + WIP checkpoints), enforces **per-product budgets**,
supports **DAG decomposition** (composite → child jobs), and exposes the whole
fleet through **two control planes**: a browser UI (`tracker-web`) and a terminal
TUI (`agent-queue` dashboard). Both control planes talk to the same `/fleet` REST
API.
---
## 2. Completion snapshot (reality, not the stale table)
| Phase | Theme | Real status | Notes |
| ----- | ----- | ----------- | ----- |
| **0** | Single-host baseline | ✅ 100% | `agent-queue.sh` folder queue, selftest green |
| **1** | Manifest + profiles + capabilities + tracker adapter | ✅ ~98% | Only leftover: Node `dash` field surfacing — **now also done** via fleet-dash tags. Effectively complete |
| **2** | Coordinator module + Cosmos + multi-factory leasing | ✅ ~98% | Scheduler wiring, enrollment+tokens, tracker-bridge are **done in code** but boxes 384/386 unticked in roadmap (see §14, Bugs, gaps & risks) |
| **3** | Fleet control plane (web + TUI) + DAG + budgets + scoring | ✅ 100% (all boxes ticked) | Pending: Playwright e2e wired into CI; live multi-host operator run |
| **4** | Message bus + autoscaling + capability marketplace | ☐ 0% | Not started — next major frontier |
| **5** | Self-optimizing / learned routing | ☐ 0% | Not started |
> ⚠️ The **`GIGAFACTORY_ROADMAP.md` §0 progress table is stale**: it shows
> Phase 3 as "0% not started" although every Phase-3 box below it is ticked.
> This should be corrected; see §14 (Bugs, gaps & risks).
---
## 3. System architecture
```mermaid
graph TB
subgraph CP["Control planes (operators)"]
WEB["tracker-web Fleet UI<br/>(Next.js, /dashboard/fleet/*)"]
TUI["agent-queue TUI<br/>(dashboard.mjs, AQ_FLEET_DASH=1)"]
end
subgraph SVC["platform-service — fleet module (the spine)"]
ROUTES["routes.ts<br/>/fleet REST + SSE"]
COORD["coordinator.ts<br/>claim · lease · fence · reaper<br/>preemption · budgets · DAG · review"]
SCHED["scheduler.ts<br/>pure scoring router (§7)"]
ENROLL["enrollment.ts<br/>factory tokens (scoped, rotatable)"]
BRIDGE["tracker-bridge.ts<br/>job ↔ tracker item"]
ARTIF["artifacts.ts / artifacts-blob.ts<br/>pointer + blob bytes"]
REPO["repository.ts<br/>CAS (rev/_etag) CRUD"]
end
subgraph DATA["@bytelyst/datastore (Cosmos / memory)"]
JOBS[("fleet_jobs")]
RUNS[("fleet_runs")]
LEASES[("fleet_leases")]
FAC[("fleet_factories")]
PROF[("fleet_profiles")]
EVENTS[("fleet_events")]
ARTDOCS[("fleet_artifacts")]
end
subgraph FLEET["Factory agents (workers, N hosts)"]
F1["agent-queue.sh + lib/fleet-client.sh<br/>(AQ_FLEET=1) — mac-1"]
F2["agent-queue.sh + lib/fleet-client.sh<br/>ubuntu-1"]
ENGINES["engines: claude · codex · devin"]
end
WEB -->|/api/fleet proxy| ROUTES
TUI -->|lib/fleet-dash.mjs| ROUTES
ROUTES --> COORD
COORD --> SCHED
ROUTES --> ENROLL
ROUTES --> BRIDGE
ROUTES --> ARTIF
COORD --> REPO
ENROLL --> REPO
BRIDGE --> REPO
ARTIF --> ARTDOCS
REPO --> JOBS & RUNS & LEASES & FAC & PROF & EVENTS
F1 -->|heartbeat · claim · patch fenced · renew| ROUTES
F2 -->|heartbeat · claim · patch fenced · renew| ROUTES
F1 --> ENGINES
F2 --> ENGINES
```
**Layering principle:** `scheduler.ts` is **pure** (no I/O — all inputs passed
in), `coordinator.ts` is the orchestration core, `repository.ts` is the only thing
that touches the datastore, and `routes.ts` is the only thing that touches HTTP.
Factories never touch the DB directly — they only call REST.
---
## 4. Job lifecycle (stages)
```mermaid
stateDiagram-v2
[*] --> queued: submitJob
queued --> blocked: unmet deps
blocked --> queued: deps satisfied (reaper/unblock)
queued --> assigned: claimNextJob (CAS win + lease)
assigned --> building: factory starts (patch fenced)
building --> review: rc=0 → review gate
building --> testing: verify-pass (auto)
review --> testing: approve / requestReview quorum
testing --> shipped: ship (manual gate)
building --> failed: verify-fail / budget_exceeded / timeout
review --> failed: reject
assigned --> queued: lease expired (reaper, +epoch, keep checkpoint)
building --> queued: preempted (critical job, checkpoint + epoch bump)
failed --> queued: requeue (operator)
failed --> dead_letter: retries exhausted
shipped --> [*]
dead_letter --> [*]
```
Stages (`types.ts`): `queued · blocked · assigned · building · review · testing ·
shipped · failed · dead_letter`. The TUI and the local board collapse these onto kanban
buckets (`inbox/building/review/testing/shipped/failed`) for parity.
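
The state diagram above can be transcribed into a transition table, which is one way a client or test could reject impossible stage changes before calling the API. This is a minimal sketch: the `Stage` names come from this document, while `TRANSITIONS` and `canTransition` are hypothetical names, not identifiers from `types.ts` or `coordinator.ts`.

```typescript
// Stage names as listed in this document for types.ts.
type Stage =
  | "queued" | "blocked" | "assigned" | "building" | "review"
  | "testing" | "shipped" | "failed" | "dead_letter";

// Allowed transitions, copied edge by edge from the state diagram.
// TRANSITIONS is a hypothetical name used only for this sketch.
const TRANSITIONS: Record<Stage, readonly Stage[]> = {
  queued: ["blocked", "assigned"],
  blocked: ["queued"],
  assigned: ["building", "queued"],                    // queued = lease expired
  building: ["review", "testing", "failed", "queued"], // queued = preempted
  review: ["testing", "failed"],
  testing: ["shipped"],
  failed: ["queued", "dead_letter"],
  shipped: [],      // terminal
  dead_letter: [],  // terminal
};

// True when the diagram contains an edge from `from` to `to`.
function canTransition(from: Stage, to: Stage): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```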
---
## 5. The core guarantee — atomic claim + lease fencing
This is the heart of "no double-assignment, ever" and "a dead worker can never
corrupt a reassigned job."
```mermaid
sequenceDiagram
participant FA as Factory A
participant FB as Factory B
participant CO as coordinator
participant DB as fleet_jobs / fleet_leases
FA->>CO: POST /fleet/claim (caps)
FB->>CO: POST /fleet/claim (caps)
CO->>DB: selectJob() → job J (rev=5)
CO->>DB: revUpdate J: queued→assigned IF rev==5 (CAS)
DB-->>CO: A wins (rev→6, leaseEpoch=1)
CO->>DB: revUpdate J IF rev==5 (B's CAS)
DB-->>CO: conflict (B re-selects)
CO-->>FA: assigned J (leaseEpoch=1)
CO-->>FB: conflict → next job
Note over FA: A crashes mid-build
CO->>DB: reapExpiredLeases(): lease expired → J back to queued,<br/>leaseEpoch=2, checkpoint preserved
FB->>CO: claim → J (leaseEpoch=2)
FA-->>CO: (zombie) PATCH J stage=shipped leaseEpoch=1
CO-->>FA: 409 fenced (1 < 2) — rejected
```
- **CAS:** `repository.revUpdateJob/revUpdateLease` write only if stored `rev`
matches (Cosmos `_etag`/`If-Match`; memory provider re-reads `rev`).
- **Fencing:** every worker mutation carries `leaseEpoch`; a mutation whose epoch is
`< job.leaseEpoch` is rejected as `fenced` (409).
- **Reaper:** `reapExpiredLeases(now)` requeues expired-lease jobs, **bumps the
epoch**, and **keeps the `checkpoint`** (WIP git branch pointer) so work resumes
rather than restarts. Cosmos TTL cannot do this — the reaper owns recovery.
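
The three mechanisms above (CAS, fencing, reaper) can be illustrated with a small in-memory model that replays the sequence diagram. It is a sketch under assumptions, not the code in `coordinator.ts` or `repository.ts`: the field names follow this document, but `revUpdate`, `tryClaim`, `fencedPatch`, and `reapExpired` are hypothetical, and the sketch assumes a job is created at `leaseEpoch` 1 and that only the reaper bumps the epoch, which matches the epochs shown in the diagram.

```typescript
// Hypothetical in-memory model; field names follow the document.
interface JobDoc {
  id: string;
  stage: "queued" | "assigned" | "building" | "shipped";
  rev: number;         // optimistic-concurrency revision
  leaseEpoch: number;  // fencing token, bumped by the reaper
  holder?: string;
  leaseExpiresAt?: number;
  checkpoint?: string; // WIP branch pointer, preserved across requeues
}

const jobs: Record<string, JobDoc> = {};

// CAS: write only if the stored rev still equals the rev the caller read.
function revUpdate(id: string, expectedRev: number, patch: Partial<JobDoc>): boolean {
  const cur = jobs[id];
  if (cur === undefined || cur.rev !== expectedRev) return false; // lost the race
  jobs[id] = { ...cur, ...patch, rev: cur.rev + 1 };
  return true;
}

// `seen` is the snapshot the caller read earlier; a stale snapshot loses the CAS.
function tryClaim(seen: JobDoc, factoryId: string, now: number, ttlMs: number): boolean {
  if (seen.stage !== "queued") return false;
  return revUpdate(seen.id, seen.rev, {
    stage: "assigned", holder: factoryId, leaseExpiresAt: now + ttlMs,
  });
}

// Worker mutation: rejected when the caller's epoch is older than the job's.
function fencedPatch(id: string, epoch: number, patch: Partial<JobDoc>): "ok" | "fenced" | "conflict" {
  const cur = jobs[id];
  if (cur === undefined) return "conflict";
  if (epoch < cur.leaseEpoch) return "fenced"; // the document maps this to HTTP 409
  return revUpdate(id, cur.rev, patch) ? "ok" : "conflict";
}

// Requeue jobs whose lease expired, bump the epoch, keep the checkpoint.
function reapExpired(now: number): number {
  let reclaimed = 0;
  for (const id of Object.keys(jobs)) {
    const j = jobs[id];
    if (j === undefined) continue;
    const held = j.stage === "assigned" || j.stage === "building";
    if (held && j.leaseExpiresAt !== undefined && j.leaseExpiresAt <= now) {
      const ok = revUpdate(id, j.rev, {
        stage: "queued", holder: undefined, leaseExpiresAt: undefined,
        leaseEpoch: j.leaseEpoch + 1,
      });
      if (ok) reclaimed++;
    }
  }
  return reclaimed;
}
```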
---
## 6. Data model (Cosmos containers)
| Container | PK | Purpose |
| --------- | -- | ------- |
| `fleet_jobs` | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `depsMode`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, `kind`, `parentId` |
| `fleet_runs` | `/jobId` | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff) |
| `fleet_leases` | `/jobId` | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status` |
| `fleet_factories` | `/productId` | worker host: `capabilities[]`, `health`, `load`, `seatLimit`, `lastHeartbeatAt` |
| `fleet_profiles` | `/productId` | immutable, versioned persona/capability profile snapshot |
| `fleet_events` | `/jobId` | append-only audit stream (monotonic `seq`) — powers SSE |
| `fleet_artifacts` | `/jobId` | **pointers** to blob-stored artifacts (no inline logs) |

Every document carries `productId`. Containers are registered in `lib/cosmos-init.ts`.
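
As a quick reference, the partition-key column of the table above can be expressed as a lookup. This is an illustrative sketch only: `PARTITION_KEY` and `partitionKeyValue` are hypothetical helpers, not part of `repository.ts` or `lib/cosmos-init.ts`, and the document shape is reduced to the two key fields.

```typescript
type FleetContainer =
  | "fleet_jobs" | "fleet_runs" | "fleet_leases" | "fleet_factories"
  | "fleet_profiles" | "fleet_events" | "fleet_artifacts";

// Partition-key field per container, copied from the table.
const PARTITION_KEY: Record<FleetContainer, "productId" | "jobId"> = {
  fleet_jobs: "productId",
  fleet_runs: "jobId",
  fleet_leases: "jobId",
  fleet_factories: "productId",
  fleet_profiles: "productId",
  fleet_events: "jobId",
  fleet_artifacts: "jobId",
};

// Returns the value a point read or write would use as its partition key.
function partitionKeyValue(
  container: FleetContainer,
  doc: { productId: string; jobId?: string },
): string {
  const field = PARTITION_KEY[container];
  const value = doc[field];
  if (value === undefined) {
    throw new Error(container + " document is missing " + field);
  }
  return value;
}
```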
---
## 7. The scheduler / scoring router (`scheduler.ts`)
Pure, deterministic, fixed-weight (tunable per-product in Phase 3, learned in
Phase 5). Filter → score → rank:
```
score = w1·capabilityFit + w2·affinity + w3·(1/(1+load))
      + w4·costFit(budget) + w5·health - w6·starvationPenalty(age)
```
Default weights (`DEFAULT_WEIGHTS`): `capabilityFit 1.0 · affinity 0.5 · load 1.0
· costFit 0.75 · health 1.0 · starvation 1.5`. Capability is a **hard filter**
(subset check); `down` factories are filtered out, not scored; aging fully
de-penalises after ~30 min (anti-starvation). `scoreCandidate` returns a per-term
breakdown that powers the **explainability** panel (`GET /fleet/jobs/:id/explain` →
`ExplainPanel`). `selectPreemptionVictim` picks the lowest-priority running job a
critical job may evict (under `FLEET_PREEMPTION`).
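
The filter and score steps can be sketched as follows. This is not the code in `scheduler.ts`. The weights are the `DEFAULT_WEIGHTS` values quoted above, but several things are assumptions made for illustration: every term arrives pre-normalised to the range 0 to 1, the starvation term is subtracted, and the penalty decays linearly to zero at 30 minutes (the document only says it is fully removed after about 30 minutes). `PairTerms`, `eligible`, and `score` are hypothetical names.

```typescript
interface Weights {
  capabilityFit: number; affinity: number; load: number;
  costFit: number; health: number; starvation: number;
}

// Default weights as quoted in this document.
const DEFAULT_WEIGHTS: Weights = {
  capabilityFit: 1.0, affinity: 0.5, load: 1.0,
  costFit: 0.75, health: 1.0, starvation: 1.5,
};

// Inputs for one (job, factory) pair. Assumed to be pre-normalised to 0..1,
// except `load` (a non-negative count) and `ageMinutes`.
interface PairTerms {
  capabilityFit: number; affinity: number; load: number;
  costFit: number; health: number; ageMinutes: number;
}

// Hard filter: the factory must be up and offer every required capability.
function eligible(required: string[], factoryCaps: string[], factoryDown: boolean): boolean {
  if (factoryDown) return false;
  return required.every((cap) => factoryCaps.indexOf(cap) !== -1);
}

// Assumed shape: full penalty for a fresh job, none once it has waited 30 minutes.
function starvationPenalty(ageMinutes: number): number {
  return Math.max(0, 1 - ageMinutes / 30);
}

// Weighted sum from the formula above; pure, so the same inputs give the same score.
function score(t: PairTerms, w: Weights = DEFAULT_WEIGHTS): number {
  return (
    w.capabilityFit * t.capabilityFit +
    w.affinity * t.affinity +
    w.load * (1 / (1 + t.load)) +
    w.costFit * t.costFit +
    w.health * t.health -
    w.starvation * starvationPenalty(t.ageMinutes)
  );
}
```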
---
## 8. Subsystems at a glance
| Subsystem | File(s) | What it does | Flag |
| --------- | ------- | ------------ | ---- |
| Claim / lease / fence / reaper | `coordinator.ts` | exactly-once assignment, recovery | — |
| Scoring router + preemption | `scheduler.ts`, `coordinator.ts` | best-fit assignment, evict low-pri for critical | `FLEET_PREEMPTION` |
| Per-product budgets | `coordinator.ts` (`accrueSpend`, `pause/resume`) | ceiling + auto-pause kill-switch; burndown | `FLEET_BUDGETS` |
| DAG decomposition | `coordinator.ts` (`submitChildren`, `getDagSubtree`, `maybeUnblockParent`) | composite job fans out to children; deps gate parent | — |
| Review gate | `coordinator.ts` (`requestReview`, `submitReview`) | multi-reviewer quorum before ship | — |
| Factory enrollment | `enrollment.ts` | scoped, rotatable, hashed tokens; auth on claim/heartbeat | — |
| Tracker bridge | `tracker-bridge.ts` | idempotent ingest of tracker item → job; one-way status echo | — |
| Artifacts | `artifacts.ts`, `artifacts-blob.ts` | pointer docs in Cosmos, bytes in blob (SAS) | — |
| Live events | `routes.ts` SSE + `fleet_events` | `GET /fleet/jobs/:id/events/stream` | — |
| Metrics / alerts | `coordinator.ts` (`fleetMetrics`) | utilization, health rollup, starvation alerts | — |
---
## 9. REST API surface (`/fleet`, under `/api`, auth + `x-product-id`)
```
Jobs POST /fleet/jobs · GET /fleet/jobs · GET /fleet/jobs/:id
PATCH /fleet/jobs/:id (fenced) · POST /fleet/jobs/:id/actions/:action
Claim POST /fleet/claim
Lease POST /fleet/jobs/:id/lease/renew · /lease/release
Factories POST /fleet/factories/heartbeat · /enroll
POST /fleet/factories/:id/token/rotate · /token/revoke
Runs/Events GET /fleet/jobs/:id/runs · /events · /events/stream (SSE) · /explain
Review POST /fleet/jobs/:id/review/request · /review
Budgets GET /fleet/budgets/:productId · /burndown
PUT /fleet/budgets/:productId · POST /pause · /resume
DAG POST /fleet/jobs/:id/children · GET /fleet/jobs/:id/dag
Artifacts POST /fleet/jobs/:id/artifacts · GET (list) · GET/DELETE /fleet/artifacts/:id
Tracker POST /fleet/tracker/ingest · /fleet/tracker/echo
Metrics GET /fleet/metrics
```
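
A worker or dashboard call against this surface might be assembled as in the sketch below. It only builds the request description and performs no network I/O. The header names (`authorization` bearer token, `x-product-id`) and the `{stage, leaseEpoch}` patch body come from this document; `FleetConfig`, `fleetCall`, and `fencedStagePatch` are hypothetical names, and the sketch assumes the configured base URL already ends in `/api`.

```typescript
interface FleetConfig {
  apiBase: string;   // e.g. the AQ_FLEET_API value; assumed to already include /api
  token: string;     // bearer token (AQ_FLEET_TOKEN)
  productId: string; // sent as x-product-id (AQ_PRODUCT_ID)
}

interface FleetCall {
  method: "GET" | "POST" | "PATCH";
  url: string;
  headers: Record<string, string>;
  body?: string;
}

// Builds a request description for any /fleet path.
function fleetCall(
  cfg: FleetConfig,
  method: FleetCall["method"],
  path: string,
  payload?: unknown,
): FleetCall {
  const call: FleetCall = {
    method,
    url: cfg.apiBase.replace(/\/+$/, "") + "/fleet" + path, // drop trailing slashes
    headers: {
      authorization: "Bearer " + cfg.token,
      "x-product-id": cfg.productId,
    },
  };
  if (payload !== undefined) {
    call.headers["content-type"] = "application/json";
    call.body = JSON.stringify(payload);
  }
  return call;
}

// Fenced stage patch: the worker echoes the leaseEpoch it was assigned.
function fencedStagePatch(cfg: FleetConfig, jobId: string, stage: string, leaseEpoch: number): FleetCall {
  return fleetCall(cfg, "PATCH", "/jobs/" + encodeURIComponent(jobId), { stage, leaseEpoch });
}
```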
---
## 10. The two control planes & feature flags
**Browser (`tracker-web`)**, under `dashboards/tracker-web/src/`:
- `app/dashboard/fleet/page.tsx` — fleet map (factory cards, health/load/caps, metrics + alerts)
- `app/dashboard/fleet/jobs/page.tsx` — stage-filtered job table
- `app/dashboard/fleet/jobs/[id]/page.tsx` — job detail: SSE event timeline, runs, artifacts, **DAG view**, **ExplainPanel**, **ReviewGateCard**, ship/requeue/reject
- `app/dashboard/fleet/budget/page.tsx` — burndown chart + pause/resume kill-switch
- `lib/fleet-client.ts` — typed client; `subscribeJobEvents` (fetch-based SSE w/ auth + `Last-Event-ID` resume + poll fallback); graceful 404 → null
- `app/api/fleet/[...path]/route.ts` — proxy to platform-service
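
The `Last-Event-ID` resume mentioned for `subscribeJobEvents` relies on the standard server-sent-events framing: `id:`, `event:`, and `data:` fields, with a blank line ending each event. The sketch below parses a text buffer in that format and tracks the last seen id, which a client would send back as the `Last-Event-ID` header when reconnecting. It is not the code in `lib/fleet-client.ts`; `SseFrame` and `parseSse` are hypothetical names, and the event name in the usage shown in the test is made up.

```typescript
interface SseFrame {
  id: string | undefined; // last event id seen up to and including this frame
  event: string;          // defaults to "message" per the SSE format
  data: string;
}

// Parses complete events out of `buffer`; the unfinished tail is returned as `rest`
// so the caller can prepend it to the next chunk.
function parseSse(buffer: string, lastEventId?: string) {
  const parts = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = parts.pop() || "";
  const frames: SseFrame[] = [];
  let lastId = lastEventId;
  for (const block of parts) {
    let event = "message";
    const data: string[] = [];
    for (const line of block.split("\n")) {
      if (line === "" || line.charAt(0) === ":") continue; // blank or comment line
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.charAt(0) === " ") value = value.slice(1); // strip one leading space
      if (field === "event") event = value;
      else if (field === "data") data.push(value);
      else if (field === "id") lastId = value;
    }
    // Events without data are not dispatched, but their id is still remembered.
    if (data.length > 0) frames.push({ id: lastId, event, data: data.join("\n") });
  }
  return { frames, rest, lastEventId: lastId };
}
```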
**Terminal (`agent-queue`)**, under `learning_ai_devops_tools/agent-queue/`:
- `dashboard.mjs` (`AQ_FLEET_DASH=1`) → `lib/fleet-dash.mjs` adapter: board counts, factories (per-factory rows or metrics aggregate), alerts, running, actionable JOBS w/ tags, recent, per-job events log; ship/requeue/reject via `/fleet`. Local folder-queue mode byte-for-byte unchanged when the flag is off.
**Feature flags**
| Flag | Where | Effect |
| ---- | ----- | ------ |
| `FLEET_PREEMPTION` | platform-service | enable critical-job preemption + seat limits |
| `FLEET_BUDGETS` | platform-service | enable budget enforcement + auto-pause |
| `AQ_FLEET` | factory runner | runner becomes a coordinator factory (claim/report) |
| `AQ_FLEET_ROUTE` / `AQ_FLEET_SHADOW` | factory runner | route via service / side-effect-free shadow compare |
| `AQ_FLEET_DASH` | TUI | dashboard sources board from `/fleet` API |
| `AQ_FLEET_API` / `AQ_FLEET_TOKEN` / `AQ_PRODUCT_ID` | both | base URL / bearer / `x-product-id` |
All flags default **off** → the system is byte-for-byte the prior single-host tool.
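
The default-off behaviour can be sketched as a small reader over an environment map, for example `readFleetFlags(process.env)` in Node. The variable names are the ones in the table; `FleetFlags` and `readFleetFlags` are hypothetical, and treating only the value `1` as "on" is an assumption based on the `AQ_FLEET=1` and `AQ_FLEET_DASH=1` examples in this document.

```typescript
interface FleetFlags {
  fleet: boolean;   // AQ_FLEET
  route: boolean;   // AQ_FLEET_ROUTE
  shadow: boolean;  // AQ_FLEET_SHADOW
  dash: boolean;    // AQ_FLEET_DASH
  apiBase: string | undefined;   // AQ_FLEET_API
  productId: string | undefined; // AQ_PRODUCT_ID
}

// Every boolean flag is off unless the variable is exactly "1" (assumed convention).
function readFleetFlags(env: Record<string, string | undefined>): FleetFlags {
  const on = (name: string): boolean => env[name] === "1";
  return {
    fleet: on("AQ_FLEET"),
    route: on("AQ_FLEET_ROUTE"),
    shadow: on("AQ_FLEET_SHADOW"),
    dash: on("AQ_FLEET_DASH"),
    apiBase: env["AQ_FLEET_API"],
    productId: env["AQ_PRODUCT_ID"],
  };
}
```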
---
## 11. Code map (where everything lives)
**`learning_ai_common_plat` (the durable spine):**
```
services/platform-service/src/modules/fleet/
  types.ts            Zod schemas + canonical model (stages, lease, budget, DAG, events)
  repository.ts       per-container CRUD + revUpdate CAS, appendEvent, listChildrenByParent
  coordinator.ts      submit/claim/lease/fence/reaper, preemption, budgets, DAG, review, metrics
  scheduler.ts        pure scoring router + selectPreemptionVictim + scoreCandidate (explain)
  enrollment.ts       factory enroll / rotate / revoke / enforceFactoryToken
  tracker-bridge.ts   ingest tracker item → job; one-way status echo
  artifacts.ts        artifact pointer mgmt
  artifacts-blob.ts   blob upload/download/delete (SAS)
  routes.ts           all /fleet REST + SSE
  *.test.ts           coordinator/scheduler/repository/routes/enrollment/tracker/artifacts/types
dashboards/tracker-web/src/
  app/dashboard/fleet/**             the browser control plane (pages above)
  lib/fleet-client.ts                typed client + SSE
  app/api/fleet/[...path]/route.ts   proxy
e2e/fleet.spec.ts     Playwright specs
lib/cosmos-init.ts    container registration
docs/gigafactory-phase3-progress.md / docs/FLEET_CONTROL_PLANE.md
```
**`learning_ai_devops_tools` (the factory agent + TUI + spec):**
```
agent-queue/
  agent-queue.sh                                    single-host runner + factory agent (AQ_FLEET); budget.wall, retry, recover
  lib/fleet-client.sh                               curl-only coordinator client (register/claim/report/renew, fencing-aware)
  lib/fleet-dash.mjs                                TUI fleet-mode adapter over /fleet (+ fleet-dash.test.mjs, 22 assertions)
  dashboard.mjs                                     the TUI (local + fleet modes)
  profiles/*.md                                     persona+capability catalog
  demo/two-factory-demo.sh + coordinator-stub.sh    parallel-fleet demo
  selftest.sh                                       ~75 dependency-light checks
  docs/GIGAFACTORY_ROADMAP.md                       source-of-truth spec & checklists
  docs/GIGAFACTORY_SYSTEM_OVERVIEW.md               (this file)
```
---
## 12. Test coverage (what's verified)
- **platform-service fleet** (~134+ tests): atomic-claim race (true concurrency, no
double-assign), fencing rejection, reaper reclaim + checkpoint, scheduler scoring
/ tie-breaks / starvation / preemption-victim, DAG fan-out/unblock/subtree,
budgets + burndown + auto-pause, review-gate quorum, enrollment/token lifecycle +
auth enforcement, tracker ingest/echo idempotency, routes (incl. SSE + explain),
schema validation.
- **tracker-web** (~198 tests): fleet-client unit tests + page render; SSE
parse/resume/fallback; graceful 404 degradation.
- **tracker-web e2e** (`e2e/fleet.spec.ts`): fleet map, live log, ship, budget-pause,
review-gate (Playwright — needs CI wiring).
- **agent-queue** (`selftest.sh`, ~75 checks): manifest/profiles/caps/priority/deps/
idempotency, retry/recover/insights, tracker round-trip, `AQ_FLEET` register/claim/
fenced-patch/reaper-reclaim/quarantine, shadow AGREE/DIVERGE, two-factory demo,
**budget.wall enforcement**, **fleet-dash adapter (22 assertions)**.
---
## 13. Next steps
**Immediate (close Phases 1-3 to a clean 100%):**
1. **Fix the stale roadmap §0 table** and tick Phase-2 boxes 384 (scheduler wired:
`selectJob` is used in `claimNextJob`) and 386 (enrollment + scoped tokens:
`enrollment.ts` + `enforceFactoryToken` are wired). (See §14, Bugs, gaps & risks.)
2. **Wire `e2e/fleet.spec.ts` into CI** (Playwright install + a `verify` job) so the
Phase-3 exit criterion ("web verify incl. e2e green") is enforced, not just
present.
3. **Live multi-host operator run** end-to-end (the Phase-3 acceptance: drive the
3-repo parallel workload from the browser, including a budget pause + resume
against a real platform-service, not the stub).
**Phase 4 (scale-out) — the next major frontier:**
4. Introduce a **broker** (NATS/Redis) for push dispatch + backpressure; coordinator
publishes, factories subscribe by capability (fallback to poll on outage).
5. **Autoscaling hooks** — spin ephemeral factories (cloud VM/container) keyed to
queue depth + SLA.
6. **Capability marketplace** — route rare-capability jobs (xcode/figma/gpu) to the
few factories that have them; cross-product queueing fairness.
7. **Load + chaos suite** — factory churn, broker outage, thundering herd.
**Phase 5 (learned routing):**
8. Capture per-run outcome features → offline eval harness (learned vs heuristic) →
shadow/A-B with guardrails → surface recommendations ("route NomGap UX jobs to
claude on mac-2: 23% faster").
---
## 14. Bugs, gaps & risks (be honest)
**Documentation drift (highest-signal, easy to fix):**
- `GIGAFACTORY_ROADMAP.md` **§0 progress table is wrong** — shows Phase 3 "0% not
started" while all Phase-3 boxes are ticked, and Phase 1/2 percentages (95%/80%)
understate reality.
- **Phase-2 boxes 384 & 386 are unticked but done in code.** `coordinator.ts`
imports/uses `selectJob` + `selectPreemptionVictim` in `claimNextJob`; `routes.ts`
enforces `enrollment.enforceFactoryToken` on claim/heartbeat and exposes
enroll/rotate/revoke. The roadmap's "remaining for 100%" note on line 390 is
outdated.
**Runtime / correctness gaps:**
- **SSE is poll-fallback based, not a push-only contract.** `subscribeJobEvents`
falls back to `getJobEvents()` polling on stream error — fine for resilience, but
"live" can silently degrade to polling without a visible operator signal.
- **UI pages degrade silently on some errors** (empty states / `null`), which can
mask a real backend outage as "nothing happening."
- **Budget page assumes `ceilingUsd` exists** when rendering the spend bar — a
budget doc without a ceiling could render a broken/NaN bar. Guard it.
- **Dashboard `patchJob` only sends `{stage, leaseEpoch}`** — other fenced-transition
fields (e.g. `checkpoint`) aren't exposed in the web UI, so operator-driven
transitions can't carry a checkpoint.
- **`rev` CAS on the memory provider** is exact only for the sequential calls the
coordinator/tests make (re-read `rev` before write). Real concurrency safety
depends on Cosmos `_etag`/`If-Match` in production — verify the Cosmos path under
true contention before relying on it at scale.
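
For the budget-bar gap listed above, one possible guard is to compute the bar fraction through a helper that returns `null` when no usable ceiling exists, so the page can render an explicit "no ceiling set" state instead of a NaN-width bar. This is a sketch, not the existing page code: `ceilingUsd` is the field named in this document, while `spendFraction` and the `spentUsd` parameter are hypothetical.

```typescript
// Returns a value in 0..1 for the spend bar, or null when there is no usable ceiling.
function spendFraction(spentUsd: number, ceilingUsd?: number | null): number | null {
  if (ceilingUsd === undefined || ceilingUsd === null) return null; // no ceiling on the doc
  if (!isFinite(ceilingUsd) || ceilingUsd <= 0) return null;        // unusable ceiling
  if (!isFinite(spentUsd) || spentUsd <= 0) return 0;               // nothing (valid) spent yet
  return Math.min(1, spentUsd / ceilingUsd);                        // clamp overspend to a full bar
}
```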
**TUI-specific (this repo):**
- Fleet **utilization %** only renders in the metrics-aggregate fallback branch, not
when per-factory rows are present — a minor inconsistency in the TUI board.
- The **budget.wall live selftest is timing-sensitive** (races a 2s wall ceiling) and
can flake under heavy disk/CPU load; the code is correct but the test could be made
more robust (e.g. inject the clock).
- TUI fleet mode has **no write path for budgets/preemption** — it's read + job
actions only; budget pause/resume is web-only.
**Operational / not-yet-built (expected, Phase 4+):**
- **No message bus** — dispatch is poll-based; no push/backpressure yet.
- **No autoscaling** — factory fleet is static/manually run.
- **No capability marketplace / cross-product fairness** under contention.
- **No load/chaos test suite** — resilience is unit-proven, not load-proven.
- **Artifacts blob wiring** (`fleet_artifacts` → real blob storage) should be
validated against a live storage account (tests use memory/pointer only).
---
## 15. TL;DR
Phases 0-3 are functionally **complete and well-tested**: a durable coordinator with
exactly-once leasing + fencing + crash recovery, a deterministic scoring router with
preemption + explainability, per-product budgets, DAG decomposition, a multi-reviewer
gate, factory enrollment with scoped tokens, and **two** control planes (browser +
TUI) over one `/fleet` API. The remaining work is (a) trivial doc corrections, (b)
CI-enforcing the existing e2e, and (c) the genuinely new Phase-4 scale-out frontier
(broker, autoscaling, marketplace, chaos) and Phase-5 learned routing.