bytelyst-devops-tools/agent-queue/docs/gigafactory/GIGAFACTORY_SYSTEM_OVERVIEW.md
Saravanakumar D 257efcb4bc docs(gigafactory): consolidate gigafactory docs into docs/gigafactory/
Move GIGAFACTORY_ROADMAP.md and GIGAFACTORY_SYSTEM_OVERVIEW.md under
agent-queue/docs/gigafactory/ so the scattered top-level docs are easy to
discover. Update the README links, the overview code-map, and all phase
job-spec source-of-truth paths to the new location. Pure docs move; no
behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:01:23 -07:00


Agent Gigafactory — System Overview (current picture)

Companion to GIGAFACTORY_ROADMAP.md (the source-of-truth spec & checklists). This document describes what is actually built today, how the pieces fit together, the architecture diagrams, the code map across both repos, the next steps, and the known bugs/gaps. Last reviewed: 2026-05-30.


1. What it is (in one paragraph)

The Agent Gigafactory turns a single-host "folder queue" agent runner into a distributed fleet of agent "factories" (machines: mac/ubuntu/windows) that claim and execute coding jobs in parallel, coordinated by a durable, product-agnostic service. A job is a markdown manifest (persona + capabilities + budget + deps); the coordinator assigns each job to the best-fit factory via a deterministic scoring router, guarantees exactly-once assignment through optimistic-concurrency claims + leases with epoch fencing, recovers crashed work automatically (reaper + WIP checkpoints), enforces per-product budgets, supports DAG decomposition (composite → child jobs), and exposes the whole fleet through two control planes: a browser UI (tracker-web) and a terminal TUI (agent-queue dashboard). Both control planes talk to the same /fleet REST API.


2. Completion snapshot (reality, not the stale table)

| Phase | Theme | Real status | Notes |
|---|---|---|---|
| 0 | Single-host baseline | 100% | agent-queue.sh folder queue, selftest green |
| 1 | Manifest + profiles + capabilities + tracker adapter | ~98% | Only leftover: Node dash field surfacing, now also done via fleet-dash tags. Effectively complete |
| 2 | Coordinator module + Cosmos + multi-factory leasing | ~98% | Scheduler wiring, enrollment+tokens, tracker-bridge are done in code but boxes 384/386 are unticked in the roadmap (see §14, Bugs, gaps & risks) |
| 3 | Fleet control plane (web + TUI) + DAG + budgets + scoring | 100% (all boxes ticked) | Pending: Playwright e2e wired into CI; live multi-host operator run |
| 4 | Message bus + autoscaling + capability marketplace | ☐ 0% | Not started; next major frontier |
| 5 | Self-optimizing / learned routing | ☐ 0% | Not started |

⚠️ The GIGAFACTORY_ROADMAP.md §0 progress table is stale: it shows Phase 3 as "0% not started" although every Phase-3 box below it is ticked. This should be corrected; see §14 (Bugs, gaps & risks).


3. System architecture

graph TB
  subgraph CP["Control planes (operators)"]
    WEB["tracker-web Fleet UI<br/>(Next.js, /dashboard/fleet/*)"]
    TUI["agent-queue TUI<br/>(dashboard.mjs, AQ_FLEET_DASH=1)"]
  end

  subgraph SVC["platform-service — fleet module (the spine)"]
    ROUTES["routes.ts<br/>/fleet REST + SSE"]
    COORD["coordinator.ts<br/>claim · lease · fence · reaper<br/>preemption · budgets · DAG · review"]
    SCHED["scheduler.ts<br/>pure scoring router (§7)"]
    ENROLL["enrollment.ts<br/>factory tokens (scoped, rotatable)"]
    BRIDGE["tracker-bridge.ts<br/>job ↔ tracker item"]
    ARTIF["artifacts.ts / artifacts-blob.ts<br/>pointer + blob bytes"]
    REPO["repository.ts<br/>CAS (rev/_etag) CRUD"]
  end

  subgraph DATA["@bytelyst/datastore (Cosmos / memory)"]
    JOBS[("fleet_jobs")]
    RUNS[("fleet_runs")]
    LEASES[("fleet_leases")]
    FAC[("fleet_factories")]
    PROF[("fleet_profiles")]
    EVENTS[("fleet_events")]
    ARTDOCS[("fleet_artifacts")]
  end

  subgraph FLEET["Factory agents (workers, N hosts)"]
    F1["agent-queue.sh + lib/fleet-client.sh<br/>(AQ_FLEET=1) — mac-1"]
    F2["agent-queue.sh + lib/fleet-client.sh<br/>ubuntu-1"]
    ENGINES["engines: claude · codex · devin"]
  end

  WEB -->|/api/fleet proxy| ROUTES
  TUI -->|lib/fleet-dash.mjs| ROUTES
  ROUTES --> COORD
  COORD --> SCHED
  ROUTES --> ENROLL
  ROUTES --> BRIDGE
  ROUTES --> ARTIF
  COORD --> REPO
  ENROLL --> REPO
  BRIDGE --> REPO
  ARTIF --> ARTDOCS
  REPO --> JOBS & RUNS & LEASES & FAC & PROF & EVENTS

  F1 -->|heartbeat · claim · patch fenced · renew| ROUTES
  F2 -->|heartbeat · claim · patch fenced · renew| ROUTES
  F1 --> ENGINES
  F2 --> ENGINES

Layering principle: scheduler.ts is pure (no I/O — all inputs passed in), coordinator.ts is the orchestration core, repository.ts is the only thing that touches the datastore, and routes.ts is the only thing that touches HTTP. Factories never touch the DB directly — they only call REST.


4. Job lifecycle (stages)

stateDiagram-v2
  [*] --> queued: submitJob
  queued --> blocked: unmet deps
  blocked --> queued: deps satisfied (reaper/unblock)
  queued --> assigned: claimNextJob (CAS win + lease)
  assigned --> building: factory starts (patch fenced)
  building --> review: rc=0 → review gate
  building --> testing: verify-pass (auto)
  review --> testing: approve / requestReview quorum
  testing --> shipped: ship (manual gate)
  building --> failed: verify-fail / budget_exceeded / timeout
  review --> failed: reject
  assigned --> queued: lease expired (reaper, +epoch, keep checkpoint)
  building --> queued: preempted (critical job, checkpoint + epoch bump)
  failed --> queued: requeue (operator)
  failed --> dead_letter: retries exhausted
  shipped --> [*]
  dead_letter --> [*]

Stages (types.ts): queued · blocked · assigned · building · review · testing · shipped · failed · dead_letter. The TUI/local board collapse these onto kanban buckets (inbox/building/review/testing/shipped/failed) for parity.
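The state diagram can be read as a transition table. The sketch below is a hand-copied reading of that diagram, not the contents of types.ts; `canTransition`, `toBucket`, and the stage-to-bucket mapping are illustrative assumptions (the overview names the buckets but not which stage lands in which).

```typescript
// Stage names come from the overview. The table mirrors the state diagram;
// it is an illustration, not the service's real validation logic.
type Stage =
  | "queued" | "blocked" | "assigned" | "building" | "review"
  | "testing" | "shipped" | "failed" | "dead_letter";

const TRANSITIONS: Record<Stage, Stage[]> = {
  queued: ["blocked", "assigned"],
  blocked: ["queued"],
  assigned: ["building", "queued"],                     // queued = lease expired
  building: ["review", "testing", "failed", "queued"],  // queued = preempted
  review: ["testing", "failed"],
  testing: ["shipped"],
  failed: ["queued", "dead_letter"],
  shipped: [],
  dead_letter: [],
};

// True when the diagram has an edge from `from` to `to`.
function canTransition(from: Stage, to: Stage): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

type Bucket = "inbox" | "building" | "review" | "testing" | "shipped" | "failed";

// Assumed mapping onto the kanban buckets (one plausible choice).
function toBucket(stage: Stage): Bucket {
  switch (stage) {
    case "queued":
    case "blocked":
      return "inbox";
    case "assigned":
    case "building":
      return "building";
    case "dead_letter":
      return "failed";
    default:
      return stage; // review, testing, shipped, failed map to themselves
  }
}
```

Keeping the edges in one table makes it cheap to reject an illegal stage change before it reaches the datastore.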


5. The core guarantee — atomic claim + lease fencing

This is the heart of "no double-assignment, ever" and "a dead worker can never corrupt a reassigned job."

sequenceDiagram
  participant FA as Factory A
  participant FB as Factory B
  participant CO as coordinator
  participant DB as fleet_jobs / fleet_leases

  FA->>CO: POST /fleet/claim (caps)
  FB->>CO: POST /fleet/claim (caps)
  CO->>DB: selectJob() → job J (rev=5)
  CO->>DB: revUpdate J: queued→assigned IF rev==5 (CAS)
  DB-->>CO: A wins (rev→6, leaseEpoch=1)
  CO->>DB: revUpdate J IF rev==5 (B's CAS)
  DB-->>CO: conflict (B re-selects)
  CO-->>FA: assigned J (leaseEpoch=1)
  CO-->>FB: conflict → next job

  Note over FA: A crashes mid-build
  CO->>DB: reapExpiredLeases(): lease expired → J back to queued,<br/>leaseEpoch=2, checkpoint preserved
  FB->>CO: claim → J (leaseEpoch=2)
  FA-->>CO: (zombie) PATCH J stage=shipped leaseEpoch=1
  CO-->>FA: 409 fenced (1 < 2) — rejected
  • CAS: repository.revUpdateJob/revUpdateLease write only if stored rev matches (Cosmos _etag/If-Match; memory provider re-reads rev).
  • Fencing: every worker mutation carries leaseEpoch; epoch < job.leaseEpoch → fenced (409).
  • Reaper: reapExpiredLeases(now) requeues expired-lease jobs, bumps the epoch, and keeps the checkpoint (WIP git branch pointer) so work resumes rather than restarts. Cosmos TTL cannot do this — the reaper owns recovery.
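
The three mechanisms above can be sketched together against an in-memory map. This is a minimal illustration, not the coordinator's code: the field names rev, leaseEpoch and checkpoint come from this overview, while the `Job` shape, the lease stored inline on the job, and the function signatures are assumptions. Following the sequence diagram, a claim keeps the current epoch and only the reaper bumps it.

```typescript
// In-memory sketch of CAS claim + epoch fencing + reaper.
// Assumption: the lease expiry lives on the job (the real system has fleet_leases).
interface Job {
  id: string;
  stage: "queued" | "assigned" | "building" | "shipped";
  rev: number;
  leaseEpoch: number;
  leaseExpiresAt: number | null;
  checkpoint?: string;
}

const jobs = new Map<string, Job>();

// Compare-and-swap: write only if the stored rev still equals expectedRev.
function revUpdate(id: string, expectedRev: number, patch: Partial<Job>): Job | null {
  const cur = jobs.get(id);
  if (!cur || cur.rev !== expectedRev) return null; // lost the race
  const next: Job = { ...cur, ...patch, rev: cur.rev + 1 };
  jobs.set(id, next);
  return next;
}

// A claim succeeds only for the caller whose seenRev is still current.
function claim(id: string, seenRev: number, now: number, ttlMs: number): Job | null {
  const cur = jobs.get(id);
  if (!cur || cur.stage !== "queued") return null;
  return revUpdate(id, seenRev, { stage: "assigned", leaseExpiresAt: now + ttlMs });
}

// Worker mutations carry the epoch they were granted; stale epochs get 409.
function fencedPatch(id: string, workerEpoch: number, patch: Partial<Job>): number {
  const cur = jobs.get(id);
  if (!cur) return 404;
  if (workerEpoch < cur.leaseEpoch) return 409; // fenced
  return revUpdate(id, cur.rev, patch) ? 200 : 409;
}

// Requeue jobs whose lease has expired, bump the epoch, keep the checkpoint.
function reapExpiredLeases(now: number): string[] {
  const reaped: string[] = [];
  jobs.forEach((job) => {
    const held = job.stage === "assigned" || job.stage === "building";
    if (held && job.leaseExpiresAt !== null && job.leaseExpiresAt <= now) {
      revUpdate(job.id, job.rev, {
        stage: "queued",
        leaseEpoch: job.leaseEpoch + 1,
        leaseExpiresAt: null,
      });
      reaped.push(job.id);
    }
  });
  return reaped;
}
```

Replaying the diagram against this sketch: two claims with rev 5 yield one winner, the reaper moves the epoch from 1 to 2, and a late patch carrying epoch 1 is answered with 409 while the checkpoint survives the requeue.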

6. Data model (Cosmos containers)

| Container | PK | Purpose |
|---|---|---|
| fleet_jobs | /productId | durable job: manifestSnapshot, verbatim bodyMd, stage, idempotencyKey, deps, depsMode, checkpoint, priority, rev, leaseEpoch, kind, parentId |
| fleet_runs | /jobId | one execution attempt: engine, timings, result, insights (tokens/cost/diff) |
| fleet_leases | /jobId | single-holder lease: holderFactoryId, expiresAt, leaseEpoch, status |
| fleet_factories | /productId | worker host: capabilities[], health, load, seatLimit, lastHeartbeatAt |
| fleet_profiles | /productId | immutable, versioned persona/capability profile snapshot |
| fleet_events | /jobId | append-only audit stream (monotonic seq); powers SSE |
| fleet_artifacts | /jobId | pointers to blob-stored artifacts (no inline logs) |

Every document carries productId. Containers registered in lib/cosmos-init.ts.


7. The scheduler / scoring router (scheduler.ts)

Pure, deterministic, fixed-weight (tunable per-product in Phase 3, learned in Phase 5). Filter → score → rank:

score = w1·capabilityFit + w2·affinity + w3·(1/(1+load))
      + w4·costFit(budget) + w5·health - w6·starvationPenalty(age)

Default weights (DEFAULT_WEIGHTS): capabilityFit 1.0 · affinity 0.5 · load 1.0 · costFit 0.75 · health 1.0 · starvation 1.5. Capability is a hard filter (subset check); down factories are filtered out, not scored; aging fully de-penalises after ~30 min (anti-starvation). scoreCandidate returns a per-term breakdown that powers the explainability panel (GET /fleet/jobs/:id/explain → ExplainPanel). selectPreemptionVictim picks the lowest-priority running job a critical job may evict (under FLEET_PREEMPTION).
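
A minimal sketch of the filter-then-score step, using the default weights quoted above. How each term is computed and scaled is not specified here, so the `CandidateTerms` inputs are assumed to arrive precomputed; the starvation term is treated as a penalty and subtracted, which is this sketch's reading of the formula. The function bodies are illustrative, not the contents of scheduler.ts.

```typescript
interface Weights {
  capabilityFit: number;
  affinity: number;
  load: number;
  costFit: number;
  health: number;
  starvation: number;
}

// Values quoted in the overview.
const DEFAULT_WEIGHTS: Weights = {
  capabilityFit: 1.0, affinity: 0.5, load: 1.0,
  costFit: 0.75, health: 1.0, starvation: 1.5,
};

// Assumed precomputed inputs; `load` is the raw load fed into 1/(1+load).
interface CandidateTerms {
  capabilityFit: number;
  affinity: number;
  load: number;
  costFit: number;
  health: number;
  starvationPenalty: number;
}

// Hard filter: every required capability must be offered by the factory.
function hasCapabilities(required: string[], offered: string[]): boolean {
  return required.every((cap) => offered.indexOf(cap) !== -1);
}

// Pure scoring: returns the total plus the per-term breakdown.
function scoreCandidate(t: CandidateTerms, w: Weights = DEFAULT_WEIGHTS) {
  const breakdown = {
    capabilityFit: w.capabilityFit * t.capabilityFit,
    affinity: w.affinity * t.affinity,
    load: w.load * (1 / (1 + t.load)),
    costFit: w.costFit * t.costFit,
    health: w.health * t.health,
    starvation: -w.starvation * t.starvationPenalty,
  };
  const total =
    breakdown.capabilityFit + breakdown.affinity + breakdown.load +
    breakdown.costFit + breakdown.health + breakdown.starvation;
  return { total, breakdown };
}
```

Returning the breakdown alongside the total is what makes an explain view possible without re-deriving the score on the client.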


8. Subsystems at a glance

| Subsystem | File(s) | What it does | Flag |
|---|---|---|---|
| Claim / lease / fence / reaper | coordinator.ts | exactly-once assignment, recovery | |
| Scoring router + preemption | scheduler.ts, coordinator.ts | best-fit assignment, evict low-pri for critical | FLEET_PREEMPTION |
| Per-product budgets | coordinator.ts (accrueSpend, pause/resume) | ceiling + auto-pause kill-switch; burndown | FLEET_BUDGETS |
| DAG decomposition | coordinator.ts (submitChildren, getDagSubtree, maybeUnblockParent) | composite job fans out to children; deps gate parent | |
| Review gate | coordinator.ts (requestReview, submitReview) | multi-reviewer quorum before ship | |
| Factory enrollment | enrollment.ts | scoped, rotatable, hashed tokens; auth on claim/heartbeat | |
| Tracker bridge | tracker-bridge.ts | idempotent ingest of tracker item → job; one-way status echo | |
| Artifacts | artifacts.ts, artifacts-blob.ts | pointer docs in Cosmos, bytes in blob (SAS) | |
| Live events | routes.ts SSE + fleet_events | GET /fleet/jobs/:id/events/stream | |
| Metrics / alerts | coordinator.ts (fleetMetrics) | utilization, health rollup, starvation alerts | |
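
The DAG decomposition row says that deps gate the parent. A rough sketch of that gate follows, assuming a dependency counts as satisfied only when it reaches shipped; the overview does not define this, nor the semantics of depsMode, so `DagJob` and `unblockReady` are hypothetical names and not the coordinator's maybeUnblockParent.

```typescript
// Assumption: a blocked job is released once every dep is in stage "shipped".
interface DagJob {
  id: string;
  stage: "queued" | "blocked" | "building" | "shipped" | "failed";
  deps: string[];
}

// Moves every ready blocked job to queued and returns the ids it released.
function unblockReady(jobs: DagJob[]): string[] {
  const byId: Record<string, DagJob> = {};
  jobs.forEach((j) => { byId[j.id] = j; });

  const unblocked: string[] = [];
  jobs.forEach((job) => {
    if (job.stage !== "blocked") return;
    const ready = job.deps.every((dep) => {
      const d = byId[dep];
      return d !== undefined && d.stage === "shipped"; // unknown dep keeps the job blocked
    });
    if (ready) {
      job.stage = "queued";
      unblocked.push(job.id);
    }
  });
  return unblocked;
}
```

Treating an unknown dependency as unsatisfied is the conservative choice: a typo in deps leaves the job visibly blocked instead of running early.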

9. REST API surface (/fleet, under /api, auth + x-product-id)

Jobs       POST /fleet/jobs · GET /fleet/jobs · GET /fleet/jobs/:id
           PATCH /fleet/jobs/:id (fenced) · POST /fleet/jobs/:id/actions/:action
Claim      POST /fleet/claim
Lease      POST /fleet/jobs/:id/lease/renew · /lease/release
Factories  POST /fleet/factories/heartbeat · /enroll
           POST /fleet/factories/:id/token/rotate · /token/revoke
Runs/Events GET /fleet/jobs/:id/runs · /events · /events/stream (SSE) · /explain
Review     POST /fleet/jobs/:id/review/request · /review
Budgets    GET /fleet/budgets/:productId · /burndown
           PUT /fleet/budgets/:productId · POST /pause · /resume
DAG        POST /fleet/jobs/:id/children · GET /fleet/jobs/:id/dag
Artifacts  POST /fleet/jobs/:id/artifacts · GET (list) · GET/DELETE /fleet/artifacts/:id
Tracker    POST /fleet/tracker/ingest · /fleet/tracker/echo
Metrics    GET /fleet/metrics
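
As an illustration of the fenced PATCH route, the sketch below builds (but does not send) a request description. The bearer token, the x-product-id header, and the {stage, leaseEpoch} body follow this overview; `FleetConfig`, `HttpRequest`, `buildFencedPatch`, and the assumption that the base URL already ends in /api are hypothetical.

```typescript
interface FleetConfig {
  apiBase: string;   // assumed to include the /api prefix
  token: string;     // bearer token
  productId: string; // sent as x-product-id
}

// Plain description of the request, so the sketch needs no HTTP library.
interface HttpRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds PATCH /fleet/jobs/:id carrying the caller's lease epoch for fencing.
function buildFencedPatch(
  cfg: FleetConfig,
  jobId: string,
  stage: string,
  leaseEpoch: number,
): HttpRequest {
  const base = cfg.apiBase.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    method: "PATCH",
    url: base + "/fleet/jobs/" + encodeURIComponent(jobId),
    headers: {
      "authorization": "Bearer " + cfg.token,
      "x-product-id": cfg.productId,
      "content-type": "application/json",
    },
    body: JSON.stringify({ stage, leaseEpoch }),
  };
}
```

A caller would hand the result to whatever HTTP client it already uses and treat a 409 response as "fenced: stop working on this job".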

10. The two control planes & feature flags

Browser (tracker-web), under dashboards/tracker-web/src/:

  • app/dashboard/fleet/page.tsx — fleet map (factory cards, health/load/caps, metrics + alerts)
  • app/dashboard/fleet/jobs/page.tsx — stage-filtered job table
  • app/dashboard/fleet/jobs/[id]/page.tsx — job detail: SSE event timeline, runs, artifacts, DAG view, ExplainPanel, ReviewGateCard, ship/requeue/reject
  • app/dashboard/fleet/budget/page.tsx — burndown chart + pause/resume kill-switch
  • lib/fleet-client.ts — typed client; subscribeJobEvents (fetch-based SSE w/ auth + Last-Event-ID resume + poll fallback); graceful 404 → null
  • app/api/fleet/[...path]/route.ts — proxy to platform-service
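
The fetch-based SSE client mentioned in the list has to parse the event-stream text itself and remember the last event id for resume. The sketch below shows that parsing step under the standard text/event-stream framing (fields `id`, `event`, `data`; blank line between events; lines starting with `:` are comments). `parseSseFrames` and `lastEventId` are hypothetical helpers, not functions from lib/fleet-client.ts, and the input is assumed to contain only complete frames.

```typescript
interface SseEvent {
  id: string | undefined;
  event: string;
  data: string;
}

// Parses complete SSE frames; a frame without any data line is skipped.
function parseSseFrames(text: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const frame of text.split(/\r?\n\r?\n/)) {
    let id: string | undefined;
    let event = "message"; // default event type
    const data: string[] = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line === "" || line.charAt(0) === ":") continue; // blank or comment
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.charAt(0) === " ") value = value.slice(1); // strip one leading space
      if (field === "id") id = value;
      else if (field === "event") event = value;
      else if (field === "data") data.push(value);
    }
    if (data.length > 0) events.push({ id, event, data: data.join("\n") });
  }
  return events;
}

// The value a client would send back as Last-Event-ID when reconnecting.
function lastEventId(events: SseEvent[]): string | undefined {
  for (let i = events.length - 1; i >= 0; i--) {
    const id = events[i].id;
    if (id !== undefined) return id;
  }
  return undefined;
}
```

A real client would also buffer a trailing partial frame between reads; that is omitted here to keep the parsing logic visible.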

Terminal (agent-queue), under learning_ai_devops_tools/agent-queue/:

  • dashboard.mjs (AQ_FLEET_DASH=1) → lib/fleet-dash.mjs adapter: board counts, factories (per-factory rows or metrics aggregate), alerts, running, actionable JOBS w/ tags, recent, per-job events log; ship/requeue/reject via /fleet. Local folder-queue mode byte-for-byte unchanged when the flag is off.

Feature flags

| Flag | Where | Effect |
|---|---|---|
| FLEET_PREEMPTION | platform-service | enable critical-job preemption + seat limits |
| FLEET_BUDGETS | platform-service | enable budget enforcement + auto-pause |
| AQ_FLEET | factory runner | runner becomes a coordinator factory (claim/report) |
| AQ_FLEET_ROUTE / AQ_FLEET_SHADOW | factory runner | route via service / side-effect-free shadow compare |
| AQ_FLEET_DASH | TUI | dashboard sources board from /fleet API |
| AQ_FLEET_API / AQ_FLEET_TOKEN / AQ_PRODUCT_ID | both | base URL / bearer / x-product-id |

All flags default off → the system is byte-for-byte the prior single-host tool.
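
A small sketch of the "default off" rule, assuming a flag is on only when its variable is exactly "1" (the form this overview uses for AQ_FLEET=1 and AQ_FLEET_DASH=1). Whether the service-side flags accept other truthy spellings is not stated, and `readFleetFlags` and the `FleetFlags` field names are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

interface FleetFlags {
  preemption: boolean;  // FLEET_PREEMPTION
  budgets: boolean;     // FLEET_BUDGETS
  factoryMode: boolean; // AQ_FLEET
  route: boolean;       // AQ_FLEET_ROUTE
  shadow: boolean;      // AQ_FLEET_SHADOW
  dash: boolean;        // AQ_FLEET_DASH
}

// Assumption: only the literal "1" enables a flag; anything else is off.
const isOn = (value: string | undefined): boolean => value === "1";

function readFleetFlags(env: Env): FleetFlags {
  return {
    preemption: isOn(env["FLEET_PREEMPTION"]),
    budgets: isOn(env["FLEET_BUDGETS"]),
    factoryMode: isOn(env["AQ_FLEET"]),
    route: isOn(env["AQ_FLEET_ROUTE"]),
    shadow: isOn(env["AQ_FLEET_SHADOW"]),
    dash: isOn(env["AQ_FLEET_DASH"]),
  };
}
```

Passing the environment in as an argument keeps the function pure, so the all-off baseline is easy to assert in a test.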


11. Code map (where everything lives)

learning_ai_common_plat (the durable spine):

services/platform-service/src/modules/fleet/
  types.ts            Zod schemas + canonical model (stages, lease, budget, DAG, events)
  repository.ts       per-container CRUD + revUpdate CAS, appendEvent, listChildrenByParent
  coordinator.ts      submit/claim/lease/fence/reaper, preemption, budgets, DAG, review, metrics
  scheduler.ts        pure scoring router + selectPreemptionVictim + scoreCandidate (explain)
  enrollment.ts       factory enroll / rotate / revoke / enforceFactoryToken
  tracker-bridge.ts   ingest tracker item → job; one-way status echo
  artifacts.ts        artifact pointer mgmt
  artifacts-blob.ts   blob upload/download/delete (SAS)
  routes.ts           all /fleet REST + SSE
  *.test.ts           coordinator/scheduler/repository/routes/enrollment/tracker/artifacts/types
dashboards/tracker-web/src/
  app/dashboard/fleet/**          the browser control plane (pages above)
  lib/fleet-client.ts             typed client + SSE
  app/api/fleet/[...path]/route.ts proxy
  e2e/fleet.spec.ts               Playwright specs
lib/cosmos-init.ts                container registration
docs/gigafactory/gigafactory-phase3-progress.md / docs/gigafactory/FLEET_CONTROL_PLANE.md

learning_ai_devops_tools (the factory agent + TUI + spec):

agent-queue/
  agent-queue.sh      single-host runner + factory agent (AQ_FLEET); budget.wall, retry, recover
  lib/fleet-client.sh curl-only coordinator client (register/claim/report/renew, fencing-aware)
  lib/fleet-dash.mjs  TUI fleet-mode adapter over /fleet (+ fleet-dash.test.mjs, 22 assertions)
  dashboard.mjs       the TUI (local + fleet modes)
  profiles/*.md       persona+capability catalog
  demo/two-factory-demo.sh + coordinator-stub.sh  parallel-fleet demo
  selftest.sh         ~75 dependency-light checks
  docs/gigafactory/GIGAFACTORY_ROADMAP.md   source-of-truth spec & checklists
  docs/gigafactory/GIGAFACTORY_SYSTEM_OVERVIEW.md   (this file)

12. Test coverage (what's verified)

  • platform-service fleet (~134+ tests): atomic-claim race (true concurrency, no double-assign), fencing rejection, reaper reclaim + checkpoint, scheduler scoring / tie-breaks / starvation / preemption-victim, DAG fan-out/unblock/subtree, budgets + burndown + auto-pause, review-gate quorum, enrollment/token lifecycle + auth enforcement, tracker ingest/echo idempotency, routes (incl. SSE + explain), schema validation.
  • tracker-web (~198 tests): fleet-client unit tests + page render; SSE parse/resume/fallback; graceful 404 degradation.
  • tracker-web e2e (e2e/fleet.spec.ts): fleet map, live log, ship, budget-pause, review-gate (Playwright — needs CI wiring).
  • agent-queue (selftest.sh, ~75 checks): manifest/profiles/caps/priority/deps/idempotency, retry/recover/insights, tracker round-trip, AQ_FLEET register/claim/fenced-patch/reaper-reclaim/quarantine, shadow AGREE/DIVERGE, two-factory demo, budget.wall enforcement, fleet-dash adapter (22 assertions).

13. Next steps

Immediate (close Phase 1-3 to a clean 100%):

  1. Fix the stale roadmap §0 table and tick Phase-2 boxes 384 (scheduler wired: selectJob is used in claimNextJob) and 386 (enrollment + scoped tokens: enrollment.ts + enforceFactoryToken are wired). (See §14, Bugs, gaps & risks.)
  2. Wire e2e/fleet.spec.ts into CI (Playwright install + a verify job) so the Phase-3 exit criterion ("web verify incl. e2e green") is enforced, not just present.
  3. Live multi-host operator run end-to-end (the Phase-3 acceptance: drive the 3-repo parallel workload from the browser, including a budget pause + resume against a real platform-service, not the stub).

Phase 4 (scale-out), the next major frontier:

  4. Introduce a broker (NATS/Redis) for push dispatch + backpressure; coordinator publishes, factories subscribe by capability (fallback to poll on outage).
  5. Autoscaling hooks: spin ephemeral factories (cloud VM/container) keyed to queue depth + SLA.
  6. Capability marketplace: route rare-capability jobs (xcode/figma/gpu) to the few factories that have them; cross-product queueing fairness.
  7. Load + chaos suite: factory churn, broker outage, thundering herd.

Phase 5 (learned routing):

  8. Capture per-run outcome features → offline eval harness (learned vs heuristic) → shadow/A-B with guardrails → surface recommendations ("route NomGap UX jobs to claude on mac-2: 23% faster").


14. Bugs, gaps & risks (be honest)

Documentation drift (highest-signal, easy to fix):

  • GIGAFACTORY_ROADMAP.md §0 progress table is wrong — shows Phase 3 "0% not started" while all Phase-3 boxes are ticked, and Phase 1/2 percentages (95%/80%) understate reality.
  • Phase-2 boxes 384 & 386 are unticked but done in code. coordinator.ts imports/uses selectJob + selectPreemptionVictim in claimNextJob; routes.ts enforces enrollment.enforceFactoryToken on claim/heartbeat and exposes enroll/rotate/revoke. The roadmap's "remaining for 100%" note on line 390 is outdated.

Runtime / correctness gaps:

  • SSE is poll-fallback based, not a push-only contract. subscribeJobEvents falls back to getJobEvents() polling on stream error — fine for resilience, but "live" can silently degrade to polling without a visible operator signal.
  • UI pages degrade silently on some errors (empty states / null), which can mask a real backend outage as "nothing happening."
  • Budget page assumes ceilingUsd exists when rendering the spend bar — a budget doc without a ceiling could render a broken/NaN bar. Guard it.
  • Dashboard patchJob only sends {stage, leaseEpoch} — other fenced-transition fields (e.g. checkpoint) aren't exposed in the web UI, so operator-driven transitions can't carry a checkpoint.
  • rev CAS on the memory provider is exact only for the sequential calls the coordinator/tests make (re-read rev before write). Real concurrency safety depends on Cosmos _etag/If-Match in production — verify the Cosmos path under true contention before relying on it at scale.
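
For the budget-bar gap listed above, one possible guard is to compute the bar fraction in a helper that returns null when there is no usable ceiling, so the page can render an explicit "no ceiling set" state. Only ceilingUsd is named in this overview; `spentUsd`, `BudgetDoc`, and `spendFraction` are hypothetical.

```typescript
interface BudgetDoc {
  spentUsd: number;           // hypothetical field name
  ceilingUsd?: number | null; // may be absent on a budget doc
}

// Returns a fraction in [0, 1], or null when no valid ceiling exists.
function spendFraction(budget: BudgetDoc): number | null {
  const ceiling = budget.ceilingUsd;
  if (ceiling === undefined || ceiling === null) return null;
  if (!isFinite(ceiling) || ceiling <= 0) return null; // avoids NaN and Infinity
  if (!isFinite(budget.spentUsd)) return null;
  const fraction = budget.spentUsd / ceiling;
  return Math.min(1, Math.max(0, fraction)); // clamp for rendering
}
```

Returning null instead of 0 keeps "no ceiling configured" distinguishable from "nothing spent yet".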

TUI-specific (this repo):

  • Fleet utilization % only renders in the metrics-aggregate fallback branch, not when per-factory rows are present — a minor inconsistency in the TUI board.
  • The budget.wall live selftest is timing-sensitive (races a 2s wall ceiling) and can flake under heavy disk/CPU load; the code is correct but the test could be made more robust (e.g. inject the clock).
  • TUI fleet mode has no write path for budgets/preemption — it's read + job actions only; budget pause/resume is web-only.
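
The "inject the clock" suggestion for the budget.wall test can be illustrated in isolation. The real enforcement lives in the shell runner, so this TypeScript sketch only shows the idea: a wall budget that takes its time source as a parameter can be tested with a fake clock and no sleeping. `Clock` and `makeWallBudget` are hypothetical names.

```typescript
// A clock is any function returning the current time in milliseconds.
type Clock = () => number;

// Wall-time budget whose notion of "now" is supplied by the caller.
function makeWallBudget(limitMs: number, now: Clock) {
  const startedAt = now();
  return {
    elapsedMs: (): number => now() - startedAt,
    exceeded: (): boolean => now() - startedAt >= limitMs,
  };
}
```

In production the caller would pass `Date.now`; in a test it passes a function that reads a variable the test advances, so the 2s ceiling is crossed deterministically regardless of disk or CPU load.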

Operational / not-yet-built (expected, Phase 4+):

  • No message bus — dispatch is poll-based; no push/backpressure yet.
  • No autoscaling — factory fleet is static/manually run.
  • No capability marketplace / cross-product fairness under contention.
  • No load/chaos test suite — resilience is unit-proven, not load-proven.
  • Artifacts blob wiring (fleet_artifacts → real blob storage) should be validated against a live storage account (tests use memory/pointer only).

15. TL;DR

Phases 0-3 are functionally complete and well-tested: a durable coordinator with exactly-once leasing + fencing + crash recovery, a deterministic scoring router with preemption + explainability, per-product budgets, DAG decomposition, a multi-reviewer gate, factory enrollment with scoped tokens, and two control planes (browser + TUI) over one /fleet API. The remaining work is (a) trivial doc corrections, (b) CI-enforcing the existing e2e, and (c) the genuinely new Phase-4 scale-out frontier (broker, autoscaling, marketplace, chaos) and Phase-5 learned routing.