bytelyst-devops-tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_SYSTEM_OVERVIEW.md
saravanakumardb1 2993994273 docs(gigafactory): reconcile overview + roadmap to current reality
- System overview: mark Phase 4 in-progress (M0 RU gate shipped), add
  fleet_queue_state container + GET /fleet/queue-state, document the heartbeat
  cadence vs 90s stale gotcha, the tracker-web caps=build form bug, the missing
  deregister API, and the ended=-race fix; drop the now-false "roadmap §0 stale"
  and "boxes 384/386 unticked" claims (both reconciled); link the redesign doc.
- Roadmap: §0 Phase 4 -> in progress (M0); align the Phase-2 §8 spec endpoint
  sketches to the as-built API (/fleet/factories/enroll, /factories/heartbeat,
  /fleet/claim) + note the heartbeat cadence and the M0 gate.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 00:02:45 -07:00


Agent Gigafactory — System Overview (current picture)

Companion to GIGAFACTORY_ROADMAP.md (the source-of-truth spec & checklists). This document describes what is actually built today, how the pieces fit together, the architecture diagrams, the code map across both repos, the next steps, and the known bugs/gaps. Last reviewed: 2026-05-31.

The Phase-4 plan + the as-built M0 RU gate live in FLEET_DISPATCH_REDESIGN.md — read it for the broker-backed dispatch design and the migration checklist.


1. What it is (in one paragraph)

The Agent Gigafactory turns a single-host "folder queue" agent runner into a distributed fleet of agent "factories" (machines: mac/ubuntu/windows) that claim and execute coding jobs in parallel, coordinated by a durable, product-agnostic service. A job is a markdown manifest (persona + capabilities + budget + deps); the coordinator assigns each job to the best-fit factory via a deterministic scoring router, guarantees exactly-once assignment through optimistic-concurrency claims + leases with epoch fencing, recovers crashed work automatically (reaper + WIP checkpoints), enforces per-product budgets, supports DAG decomposition (composite → child jobs), and exposes the whole fleet through two control planes: a browser UI (tracker-web) and a terminal TUI (agent-queue dashboard). Both control planes talk to the same /fleet REST API.


2. Completion snapshot

| Phase | Theme | Real status | Notes |
|---|---|---|---|
| 0 | Single-host baseline | 100% | agent-queue.sh folder queue, selftest green |
| 1 | Manifest + profiles + capabilities + tracker adapter | ~98% | Only leftover was Node dash field surfacing, now also done via fleet-dash tags. Effectively complete |
| 2 | Coordinator module + Cosmos + multi-factory leasing | ~98% | Scheduler wiring, enrollment + tokens, and tracker-bridge are done in code; the matching roadmap boxes are now ticked (see §14) |
| 3 | Fleet control plane (web + TUI) + DAG + budgets + scoring | 100% (all boxes ticked) | Pending: Playwright e2e wired into CI; live multi-host operator run |
| 4 | Message bus + autoscaling + capability marketplace | 🟡 in progress | M0 (RU gate) shipped; see below. Broker (M1+) not started. Plan: FLEET_DISPATCH_REDESIGN.md |
| 5 | Self-optimizing / learned routing | ☐ 0% | Not started |

Phase-4 M0 (RU gate) is live (2026-05-31): a per-product fleet_queue_state doc holds a monotonic version (bumped on job create + every stage change); factories with AQ_FLEET_GATE=1 point-read GET /fleet/queue-state (~1 RU) and skip the expensive claim while nothing changed — cutting idle Cosmos RU without raising the local poll interval. Default OFF; the live fleet runs it on.
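The gate logic described above can be sketched as a small version comparison. This is a minimal illustration, not the shipped client: the `QueueState` shape beyond its `version` field, the `QueueGate` class, and its method names are all hypothetical.

```typescript
// Hypothetical shape: this overview only states that the doc holds a monotonic `version`.
interface QueueState {
  productId: string;
  version: number; // bumped on job create and on every stage change
}

// Remembers the queue version at which the last full claim round-trip completed.
class QueueGate {
  private lastClaimedVersion: number | null = null;

  // True when the expensive claim should run: first poll, or the version moved.
  shouldClaim(state: QueueState): boolean {
    return this.lastClaimedVersion === null || state.version !== this.lastClaimedVersion;
  }

  // Call after a claim attempt finished without a transport error.
  recordClaimAttempt(state: QueueState): void {
    this.lastClaimedVersion = state.version;
  }
}
```

A factory loop would point-read the queue state on each poll, call `shouldClaim`, and only then issue the claim. Recording the version after the attempt (rather than before) means a failed claim is retried on the next poll.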


3. System architecture

graph TB
  subgraph CP["Control planes (operators)"]
    WEB["tracker-web Fleet UI<br/>(Next.js, /dashboard/fleet/*)"]
    TUI["agent-queue TUI<br/>(dashboard.mjs, AQ_FLEET_DASH=1)"]
  end

  subgraph SVC["platform-service — fleet module (the spine)"]
    ROUTES["routes.ts<br/>/fleet REST + SSE"]
    COORD["coordinator.ts<br/>claim · lease · fence · reaper<br/>preemption · budgets · DAG · review"]
    SCHED["scheduler.ts<br/>pure scoring router (§7)"]
    ENROLL["enrollment.ts<br/>factory tokens (scoped, rotatable)"]
    BRIDGE["tracker-bridge.ts<br/>job ↔ tracker item"]
    ARTIF["artifacts.ts / artifacts-blob.ts<br/>pointer + blob bytes"]
    REPO["repository.ts<br/>CAS (rev/_etag) CRUD"]
  end

  subgraph DATA["@bytelyst/datastore (Cosmos / memory)"]
    JOBS[("fleet_jobs")]
    RUNS[("fleet_runs")]
    LEASES[("fleet_leases")]
    FAC[("fleet_factories")]
    PROF[("fleet_profiles")]
    EVENTS[("fleet_events")]
    ARTDOCS[("fleet_artifacts")]
  end

  subgraph FLEET["Factory agents (workers, N hosts)"]
    F1["agent-queue.sh + lib/fleet-client.sh<br/>(AQ_FLEET=1) — mac-1"]
    F2["agent-queue.sh + lib/fleet-client.sh<br/>ubuntu-1"]
    ENGINES["engines: claude · codex · devin"]
  end

  WEB -->|/api/fleet proxy| ROUTES
  TUI -->|lib/fleet-dash.mjs| ROUTES
  ROUTES --> COORD
  COORD --> SCHED
  ROUTES --> ENROLL
  ROUTES --> BRIDGE
  ROUTES --> ARTIF
  COORD --> REPO
  ENROLL --> REPO
  BRIDGE --> REPO
  ARTIF --> ARTDOCS
  REPO --> JOBS & RUNS & LEASES & FAC & PROF & EVENTS

  F1 -->|heartbeat · claim · patch fenced · renew| ROUTES
  F2 -->|heartbeat · claim · patch fenced · renew| ROUTES
  F1 --> ENGINES
  F2 --> ENGINES

Layering principle: scheduler.ts is pure (no I/O — all inputs passed in), coordinator.ts is the orchestration core, repository.ts is the only thing that touches the datastore, and routes.ts is the only thing that touches HTTP. Factories never touch the DB directly — they only call REST.


4. Job lifecycle (stages)

stateDiagram-v2
  [*] --> queued: submitJob
  queued --> blocked: unmet deps
  blocked --> queued: deps satisfied (reaper/unblock)
  queued --> assigned: claimNextJob (CAS win + lease)
  assigned --> building: factory starts (patch fenced)
  building --> review: rc=0 → review gate
  building --> testing: verify-pass (auto)
  review --> testing: approve / requestReview quorum
  testing --> shipped: ship (manual gate)
  building --> failed: verify-fail / budget_exceeded / timeout
  review --> failed: reject
  assigned --> queued: lease expired (reaper, +epoch, keep checkpoint)
  building --> queued: preempted (critical job, checkpoint + epoch bump)
  failed --> queued: requeue (operator)
  failed --> dead_letter: retries exhausted
  shipped --> [*]
  dead_letter --> [*]

Stages (types.ts): queued · blocked · assigned · building · review · testing · shipped · failed · dead_letter. The TUI/local board collapse these onto kanban buckets (inbox/building/review/testing/shipped/failed) for parity.
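The diagram above can be read as a transition table. The sketch below encodes exactly the edges drawn in the diagram; the names `TRANSITIONS` and `canTransition` are illustrative and are not taken from types.ts.

```typescript
type Stage =
  | "queued" | "blocked" | "assigned" | "building" | "review"
  | "testing" | "shipped" | "failed" | "dead_letter";

// One entry per stage; each list holds the stages reachable in a single step.
const TRANSITIONS: Record<Stage, readonly Stage[]> = {
  queued: ["blocked", "assigned"],
  blocked: ["queued"],                               // deps satisfied
  assigned: ["building", "queued"],                  // start, or lease expired
  building: ["review", "testing", "failed", "queued"], // queued = preempted
  review: ["testing", "failed"],
  testing: ["shipped"],
  shipped: [],                                       // terminal
  failed: ["queued", "dead_letter"],
  dead_letter: [],                                   // terminal
};

// True when the diagram has a direct edge from `from` to `to`.
function canTransition(from: Stage, to: Stage): boolean {
  return TRANSITIONS[from].includes(to);
}
```

For example, `canTransition("building", "shipped")` is false because shipping always passes through `testing`.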


5. The core guarantee — atomic claim + lease fencing

This is the heart of "no double-assignment, ever" and "a dead worker can never corrupt a reassigned job."

sequenceDiagram
  participant FA as Factory A
  participant FB as Factory B
  participant CO as coordinator
  participant DB as fleet_jobs / fleet_leases

  FA->>CO: POST /fleet/claim (caps)
  FB->>CO: POST /fleet/claim (caps)
  CO->>DB: selectJob() → job J (rev=5)
  CO->>DB: revUpdate J: queued→assigned IF rev==5 (CAS)
  DB-->>CO: A wins (rev→6, leaseEpoch=1)
  CO->>DB: revUpdate J IF rev==5 (B's CAS)
  DB-->>CO: conflict (B re-selects)
  CO-->>FA: assigned J (leaseEpoch=1)
  CO-->>FB: conflict → next job

  Note over FA: A crashes mid-build
  CO->>DB: reapExpiredLeases(): lease expired → J back to queued,<br/>leaseEpoch=2, checkpoint preserved
  FB->>CO: claim → J (leaseEpoch=2)
  FA-->>CO: (zombie) PATCH J stage=shipped leaseEpoch=1
  CO-->>FA: 409 fenced (1 < 2) — rejected
  • CAS: repository.revUpdateJob/revUpdateLease write only if stored rev matches (Cosmos _etag/If-Match; memory provider re-reads rev).
  • Fencing: every worker mutation carries leaseEpoch; a request whose epoch < job.leaseEpoch is fenced (409).
  • Reaper: reapExpiredLeases(now) requeues expired-lease jobs, bumps the epoch, and keeps the checkpoint (WIP git branch pointer) so work resumes rather than restarts. Cosmos TTL cannot do this — the reaper owns recovery.
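The three mechanisms above can be reproduced against an in-memory store. This is a sketch under stated assumptions, not the coordinator's code: `MemoryJobs`, `claimAtRev`, `fencedPatch`, and `reapExpired` are hypothetical names, and the sketch assumes a job starts at leaseEpoch 1 with only the reaper bumping it, which matches the numbers in the sequence diagram but is not confirmed by this overview.

```typescript
type JobStage = "queued" | "assigned" | "building" | "shipped";

interface Job {
  id: string;
  stage: JobStage;
  rev: number;
  leaseEpoch: number;
  leaseExpiresAt: number | null;
  checkpoint: string | null; // e.g. a WIP branch pointer
}

class MemoryJobs {
  private readonly jobs = new Map<string, Job>();

  put(job: Job): void { this.jobs.set(job.id, { ...job }); }

  get(id: string): Job | undefined {
    const job = this.jobs.get(id);
    return job ? { ...job } : undefined;
  }

  all(): Job[] { return Array.from(this.jobs.values()).map((j) => ({ ...j })); }

  // Compare-and-swap: apply the patch only if the stored rev still equals expectedRev.
  revUpdate(id: string, expectedRev: number, patch: Partial<Job>): Job | null {
    const current = this.jobs.get(id);
    if (!current || current.rev !== expectedRev) return null;
    const next: Job = { ...current, ...patch, id, rev: current.rev + 1 };
    this.jobs.set(id, next);
    return { ...next };
  }
}

// The caller read the job as queued at seenRev; the CAS fails if anyone wrote since.
function claimAtRev(store: MemoryJobs, id: string, seenRev: number, now: number, leaseMs: number): Job | null {
  return store.revUpdate(id, seenRev, { stage: "assigned", leaseExpiresAt: now + leaseMs });
}

// Worker mutation: rejected as "fenced" when the caller holds an older epoch.
function fencedPatch(store: MemoryJobs, id: string, callerEpoch: number, patch: Partial<Job>): "ok" | "fenced" | "conflict" {
  const current = store.get(id);
  if (!current) return "conflict";
  if (callerEpoch < current.leaseEpoch) return "fenced";
  return store.revUpdate(id, current.rev, patch) ? "ok" : "conflict";
}

// Requeue jobs whose lease expired, bump the epoch, leave the checkpoint untouched.
function reapExpired(store: MemoryJobs, now: number): number {
  let reaped = 0;
  for (const job of store.all()) {
    const held = job.stage === "assigned" || job.stage === "building";
    if (held && job.leaseExpiresAt !== null && job.leaseExpiresAt <= now) {
      const updated = store.revUpdate(job.id, job.rev, {
        stage: "queued",
        leaseEpoch: job.leaseEpoch + 1,
        leaseExpiresAt: null,
      });
      if (updated) reaped += 1;
    }
  }
  return reaped;
}
```

Two callers that both read rev 5 and call `claimAtRev` get one winner and one `null`; after `reapExpired` runs, a `fencedPatch` carrying the old epoch returns `"fenced"`, which corresponds to the 409 in the diagram.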

6. Data model (Cosmos containers)

| Container | PK | Purpose |
|---|---|---|
| fleet_jobs | /productId | durable job: manifestSnapshot, verbatim bodyMd, stage, idempotencyKey, deps, depsMode, checkpoint, priority, rev, leaseEpoch, kind, parentId |
| fleet_runs | /jobId | one execution attempt: engine, timings, result, insights (tokens/cost/diff) |
| fleet_leases | /jobId | single-holder lease: holderFactoryId, expiresAt, leaseEpoch, status |
| fleet_factories | /productId | worker host: capabilities[], health, load, seatLimit, lastHeartbeatAt |
| fleet_profiles | /productId | immutable, versioned persona/capability profile snapshot |
| fleet_events | /jobId | append-only audit stream (monotonic seq); powers SSE |
| fleet_artifacts | /jobId | pointers to blob-stored artifacts (no inline logs) |
| fleet_queue_state | /productId | Phase-4 M0 RU gate: monotonic version bumped on job create + every stage change; read via GET /fleet/queue-state so a factory can cheaply detect "work changed" |

Every document carries productId. Containers registered in lib/cosmos-init.ts.


7. The scheduler / scoring router (scheduler.ts)

Pure, deterministic, fixed-weight (tunable per-product in Phase 3, learned in Phase 5). Filter → score → rank:

score = w1·capabilityFit + w2·affinity + w3·(1/(1+load))
      + w4·costFit(budget) + w5·health − w6·starvationPenalty(age)

Default weights (DEFAULT_WEIGHTS): capabilityFit 1.0 · affinity 0.5 · load 1.0 · costFit 0.75 · health 1.0 · starvation 1.5. Capability is a hard filter (subset check); down factories are filtered out, not scored; aging fully de-penalises after ~30 min (anti-starvation). scoreCandidate returns a per-term breakdown that powers the explainability panel (GET /fleet/jobs/:id/explain → ExplainPanel). selectPreemptionVictim picks the lowest-priority running job a critical job may evict (under FLEET_PREEMPTION).
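A worked sketch of the score with the default weights follows. It assumes the starvation term is subtracted as a penalty, that each input term is normalised to [0, 1], and that the penalty decays linearly to zero over 30 minutes; none of these details are confirmed by this overview, and `scoreSketch` is a hypothetical stand-in rather than the real `scoreCandidate`.

```typescript
interface Weights {
  capabilityFit: number;
  affinity: number;
  load: number;
  costFit: number;
  health: number;
  starvation: number;
}

// Values as listed in this overview.
const DEFAULT_WEIGHTS: Weights = {
  capabilityFit: 1.0, affinity: 0.5, load: 1.0, costFit: 0.75, health: 1.0, starvation: 1.5,
};

// Assumption: every term except `load` is already normalised to [0, 1].
interface CandidateInputs {
  capabilityFit: number;
  affinity: number;
  load: number; // raw load, mapped through 1/(1+load)
  costFit: number;
  health: number;
}

const AGING_WINDOW_MS = 30 * 60 * 1000;

// Assumption: linear decay from 1 (brand-new job) to 0 (30 minutes old or more).
function starvationPenalty(ageMs: number): number {
  return Math.min(1, Math.max(0, 1 - ageMs / AGING_WINDOW_MS));
}

interface ScoreBreakdown {
  total: number;
  terms: Record<keyof Weights, number>;
}

// Returns the total plus a per-term breakdown, in the spirit of the explain output.
function scoreSketch(c: CandidateInputs, jobAgeMs: number, w: Weights = DEFAULT_WEIGHTS): ScoreBreakdown {
  const terms: Record<keyof Weights, number> = {
    capabilityFit: w.capabilityFit * c.capabilityFit,
    affinity: w.affinity * c.affinity,
    load: w.load * (1 / (1 + c.load)),
    costFit: w.costFit * c.costFit,
    health: w.health * c.health,
    starvation: -w.starvation * starvationPenalty(jobAgeMs),
  };
  const total =
    terms.capabilityFit + terms.affinity + terms.load + terms.costFit + terms.health + terms.starvation;
  return { total, terms };
}
```

With every input at 1 and load 0, the positive terms sum to 4.25; a brand-new job then scores 4.25 − 1.5 = 2.75, and the same job scores 4.25 once it has aged past the 30-minute window.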


8. Subsystems at a glance

| Subsystem | File(s) | What it does | Flag |
|---|---|---|---|
| Claim / lease / fence / reaper | coordinator.ts | exactly-once assignment, recovery | |
| Scoring router + preemption | scheduler.ts, coordinator.ts | best-fit assignment, evict low-pri for critical | FLEET_PREEMPTION |
| Per-product budgets | coordinator.ts (accrueSpend, pause/resume) | ceiling + auto-pause kill-switch; burndown | FLEET_BUDGETS |
| DAG decomposition | coordinator.ts (submitChildren, getDagSubtree, maybeUnblockParent) | composite job fans out to children; deps gate parent | |
| Review gate | coordinator.ts (requestReview, submitReview) | multi-reviewer quorum before ship | |
| Factory enrollment | enrollment.ts | scoped, rotatable, hashed tokens; auth on claim/heartbeat | |
| Tracker bridge | tracker-bridge.ts | idempotent ingest of tracker item → job; one-way status echo | |
| Artifacts | artifacts.ts, artifacts-blob.ts | pointer docs in Cosmos, bytes in blob (SAS) | |
| Live events | routes.ts SSE + fleet_events | GET /fleet/jobs/:id/events/stream | |
| Metrics / alerts | coordinator.ts (fleetMetrics) | utilization, health rollup, starvation alerts | |

9. REST API surface (/fleet, under /api, auth + x-product-id)

Jobs       POST /fleet/jobs · GET /fleet/jobs · GET /fleet/jobs/:id
           PATCH /fleet/jobs/:id (fenced) · POST /fleet/jobs/:id/actions/:action
Claim      POST /fleet/claim
Lease      POST /fleet/jobs/:id/lease/renew · /lease/release
Factories  POST /fleet/factories/heartbeat · /enroll
           POST /fleet/factories/:id/token/rotate · /token/revoke
Runs/Events GET /fleet/jobs/:id/runs · /events · /events/stream (SSE) · /explain
Review     POST /fleet/jobs/:id/review/request · /review
Budgets    GET /fleet/budgets/:productId · /burndown
           PUT /fleet/budgets/:productId · POST /pause · /resume
DAG        POST /fleet/jobs/:id/children · GET /fleet/jobs/:id/dag
Artifacts  POST /fleet/jobs/:id/artifacts · GET (list) · GET/DELETE /fleet/artifacts/:id
Tracker    POST /fleet/tracker/ingest · /fleet/tracker/echo
Metrics    GET /fleet/metrics · GET /fleet/queue-state (Phase-4 M0 RU gate)
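As an illustration of the calling convention (bearer auth plus x-product-id), the sketch below prepares a claim request without sending it. It assumes the base URL is the service origin with routes mounted at /api/fleet, and the `{ capabilities }` body is a hypothetical payload: this overview does not document the claim request schema.

```typescript
interface FleetConnection {
  apiBase: string;   // e.g. the value of AQ_FLEET_API
  token: string;     // e.g. the value of AQ_FLEET_TOKEN
  productId: string; // e.g. the value of AQ_PRODUCT_ID
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) a POST /fleet/claim request.
function prepareClaim(conn: FleetConnection, capabilities: string[]): PreparedRequest {
  const base = conn.apiBase.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    url: `${base}/api/fleet/claim`,
    method: "POST",
    headers: {
      authorization: `Bearer ${conn.token}`,
      "x-product-id": conn.productId,
      "content-type": "application/json",
    },
    body: JSON.stringify({ capabilities }), // hypothetical body shape
  };
}
```

The result can be passed to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`; keeping construction separate from sending makes the headers easy to unit-test.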

10. The two control planes & feature flags

Browser (tracker-web), in dashboards/tracker-web/src/:

  • app/dashboard/fleet/page.tsx — fleet map (factory cards, health/load/caps, metrics + alerts)
  • app/dashboard/fleet/jobs/page.tsx — stage-filtered job table
  • app/dashboard/fleet/jobs/[id]/page.tsx — job detail: SSE event timeline, runs, artifacts, DAG view, ExplainPanel, ReviewGateCard, ship/requeue/reject
  • app/dashboard/fleet/budget/page.tsx — burndown chart + pause/resume kill-switch
  • lib/fleet-client.ts — typed client; subscribeJobEvents (fetch-based SSE w/ auth + Last-Event-ID resume + poll fallback); graceful 404 → null
  • app/api/fleet/[...path]/route.ts — proxy to platform-service
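The Last-Event-ID resume mentioned for subscribeJobEvents depends on tracking the `id:` field of each server-sent event. The parser below is a generic sketch of the standard SSE wire format, not the code in lib/fleet-client.ts; `parseSse` and `SseEvent` are hypothetical names.

```typescript
interface SseEvent {
  id: string | undefined;
  event: string;
  data: string;
}

interface SseParseResult {
  events: SseEvent[];
  lastEventId: string | null; // send back as the Last-Event-ID header on reconnect
  rest: string;               // incomplete trailing frame to prepend to the next chunk
}

// Parses complete frames (separated by a blank line) out of a text buffer.
function parseSse(buffer: string): SseParseResult {
  const frames = buffer.split(/\r?\n\r?\n/);
  const rest = frames.pop() ?? "";
  const events: SseEvent[] = [];
  let lastEventId: string | null = null;

  for (const frame of frames) {
    let id: string | undefined;
    let event = "message"; // default event type
    const data: string[] = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line === "" || line.startsWith(":")) continue; // blank or comment line
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.startsWith(" ")) value = value.slice(1); // strip one leading space
      if (field === "id") id = value;
      else if (field === "event") event = value;
      else if (field === "data") data.push(value);
    }
    if (id !== undefined) lastEventId = id;
    if (data.length > 0) events.push({ id, event, data: data.join("\n") });
  }
  return { events, lastEventId, rest };
}
```

On a stream error, a client can reconnect with `Last-Event-ID` set to the last `lastEventId` it saw, or fall back to polling from that sequence number.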

Terminal (agent-queue), in learning_ai_devops_tools/agent-queue/:

  • dashboard.mjs (AQ_FLEET_DASH=1) → lib/fleet-dash.mjs adapter: board counts, factories (per-factory rows or metrics aggregate), alerts, running, actionable JOBS w/ tags, recent, per-job events log; ship/requeue/reject via /fleet. Local folder-queue mode byte-for-byte unchanged when the flag is off.

Feature flags

| Flag | Where | Effect |
|---|---|---|
| FLEET_PREEMPTION | platform-service | enable critical-job preemption + seat limits |
| FLEET_BUDGETS | platform-service | enable budget enforcement + auto-pause |
| AQ_FLEET | factory runner | runner becomes a coordinator factory (claim/report) |
| AQ_FLEET_ROUTE / AQ_FLEET_SHADOW | factory runner | route via service / side-effect-free shadow compare |
| AQ_FLEET_GATE | factory runner | Phase-4 M0 RU gate: point-read GET /fleet/queue-state and skip the claim while the queue version is unchanged |
| AQ_FLEET_DASH | TUI | dashboard sources board from /fleet API |
| AQ_FLEET_API / AQ_FLEET_TOKEN / AQ_PRODUCT_ID | both | base URL / bearer / x-product-id |

All flags default off → the system is byte-for-byte the prior single-host tool.
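The default-off behaviour can be sketched as a small environment reader. This is illustrative only: it treats the literal value "1" as the sole way to turn a flag on (the form used in this document, e.g. AQ_FLEET=1), which is an assumption about the real runner, and `readFactoryFlags` is a hypothetical helper.

```typescript
type Env = Record<string, string | undefined>;

interface FactoryFlags {
  fleet: boolean;  // AQ_FLEET
  route: boolean;  // AQ_FLEET_ROUTE
  shadow: boolean; // AQ_FLEET_SHADOW
  dash: boolean;   // AQ_FLEET_DASH
}

// Assumption: only the exact string "1" enables a flag; unset or anything else is off.
function isOn(value: string | undefined): boolean {
  return value === "1";
}

function readFactoryFlags(env: Env): FactoryFlags {
  return {
    fleet: isOn(env["AQ_FLEET"]),
    route: isOn(env["AQ_FLEET_ROUTE"]),
    shadow: isOn(env["AQ_FLEET_SHADOW"]),
    dash: isOn(env["AQ_FLEET_DASH"]),
  };
}

// Lists the connection variables that fleet mode needs but the environment lacks.
function missingConnectionVars(env: Env): string[] {
  return ["AQ_FLEET_API", "AQ_FLEET_TOKEN", "AQ_PRODUCT_ID"].filter((name) => !env[name]);
}
```

With an empty environment every flag reads as off, which is the single-host behaviour; a launcher could call `missingConnectionVars` before enabling fleet mode to fail fast on an incomplete configuration.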


11. Code map (where everything lives)

learning_ai_common_plat (the durable spine):

services/platform-service/src/modules/fleet/
  types.ts            Zod schemas + canonical model (stages, lease, budget, DAG, events)
  repository.ts       per-container CRUD + revUpdate CAS, appendEvent, listChildrenByParent
  coordinator.ts      submit/claim/lease/fence/reaper, preemption, budgets, DAG, review, metrics
  scheduler.ts        pure scoring router + selectPreemptionVictim + scoreCandidate (explain)
  enrollment.ts       factory enroll / rotate / revoke / enforceFactoryToken
  tracker-bridge.ts   ingest tracker item → job; one-way status echo
  artifacts.ts        artifact pointer mgmt
  artifacts-blob.ts   blob upload/download/delete (SAS)
  routes.ts           all /fleet REST + SSE
  *.test.ts           coordinator/scheduler/repository/routes/enrollment/tracker/artifacts/types
dashboards/tracker-web/src/
  app/dashboard/fleet/**          the browser control plane (pages above)
  lib/fleet-client.ts             typed client + SSE
  app/api/fleet/[...path]/route.ts proxy
  e2e/fleet.spec.ts               Playwright specs
lib/cosmos-init.ts                container registration
docs/GIGAFACTORY/gigafactory-phase3-progress.md / docs/GIGAFACTORY/FLEET_CONTROL_PLANE.md

learning_ai_devops_tools (the factory agent + TUI + spec):

agent-queue/
  agent-queue.sh      single-host runner + factory agent (AQ_FLEET); budget.wall, retry, recover
  lib/fleet-client.sh curl-only coordinator client (register/claim/report/renew, fencing-aware)
  lib/fleet-dash.mjs  TUI fleet-mode adapter over /fleet (+ fleet-dash.test.mjs, 22 assertions)
  dashboard.mjs       the TUI (local + fleet modes)
  profiles/*.md       persona+capability catalog
  demo/two-factory-demo.sh + coordinator-stub.sh  parallel-fleet demo
  selftest.sh         ~75 dependency-light checks
  docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md   source-of-truth spec & checklists
  docs/GIGAFACTORY/GIGAFACTORY_SYSTEM_OVERVIEW.md   (this file)

12. Test coverage (what's verified)

  • platform-service fleet (~134+ tests): atomic-claim race (true concurrency, no double-assign), fencing rejection, reaper reclaim + checkpoint, scheduler scoring / tie-breaks / starvation / preemption-victim, DAG fan-out/unblock/subtree, budgets + burndown + auto-pause, review-gate quorum, enrollment/token lifecycle + auth enforcement, tracker ingest/echo idempotency, routes (incl. SSE + explain), schema validation.
  • tracker-web (~198 tests): fleet-client unit tests + page render; SSE parse/resume/fallback; graceful 404 degradation.
  • tracker-web e2e (e2e/fleet.spec.ts): fleet map, live log, ship, budget-pause, review-gate (Playwright — needs CI wiring).
  • agent-queue (selftest.sh, ~75 checks): manifest/profiles/caps/priority/deps/idempotency, retry/recover/insights, tracker round-trip, AQ_FLEET register/claim/fenced-patch/reaper-reclaim/quarantine, shadow AGREE/DIVERGE, two-factory demo, budget.wall enforcement, fleet-dash adapter (22 assertions).

13. Next steps

Immediate (close Phases 1–3 to a clean 100%):

  1. Validate the Cosmos _etag/If-Match CAS path under true contention and live blob-backed fleet_artifacts — the two items the roadmap marks as "remaining for a hard 100%" on Phase 2/3 (tests today use the memory provider + pointer-only artifacts).
  2. Wire e2e/fleet.spec.ts into CI (Playwright install + a verify job) so the Phase-3 exit criterion ("web verify incl. e2e green") is enforced, not just present.
  3. Live multi-host operator run end-to-end (the Phase-3 acceptance: drive the 3-repo parallel workload from the browser, including a budget pause + resume against a real platform-service, not the stub).

Phase 4 (scale-out) — in progress; see FLEET_DISPATCH_REDESIGN.md:

  • M0 (done) — RU gate: fleet_queue_state + GET /fleet/queue-state + AQ_FLEET_GATE; factories skip the claim while the queue version is unchanged.
  1. M1+: broker (the redesign picks Azure Service Bus, not NATS/Redis, for subscription filters + DLQ) for push dispatch + backpressure in a coordinator-owns-scheduling / broker-owns-delivery hybrid (keeps the scorer).
  2. M3: autoscaling — scale-to-zero ephemeral factories (KEDA/Container Apps) keyed to subscription depth.
  3. Capability marketplace — route rare-capability jobs (xcode/figma/gpu) to the few factories that have them; cross-product queueing fairness.
  4. Load + chaos suite — factory churn, broker outage, thundering herd.

Phase 5 (learned routing): Capture per-run outcome features → offline eval harness (learned vs heuristic) → shadow/A-B with guardrails → surface recommendations ("route NomGap UX jobs to claude on mac-2: 23% faster").


14. Bugs, gaps & risks (be honest)

Documentation status (reconciled 2026-05-31):

  • GIGAFACTORY_ROADMAP.md §0 now reads Phase 0 100% · 1 ~98% · 2 ~98% · 3 100% · 4 ◐ in progress (~10%, M0 shipped) · 5 ☐. Phase-2 boxes for the scheduler core and factory enrollment/scoped tokens are ticked (scheduler.ts selectJob/selectPreemptionVictim wired into claimNextJob; enrollment.ts enforceFactoryToken gating claim/heartbeat). The earlier "stale §0 table" warning no longer applies.

Runtime / correctness gaps:

  • SSE is poll-fallback based, not a push-only contract. subscribeJobEvents falls back to getJobEvents() polling on stream error — fine for resilience, but "live" can silently degrade to polling without a visible operator signal.
  • UI pages degrade silently on some errors (empty states / null), which can mask a real backend outage as "nothing happening."
  • Budget page assumes ceilingUsd exists when rendering the spend bar — a budget doc without a ceiling could render a broken/NaN bar. Guard it.
  • Dashboard patchJob only sends {stage, leaseEpoch} — other fenced-transition fields (e.g. checkpoint) aren't exposed in the web UI, so operator-driven transitions can't carry a checkpoint.
  • rev CAS on the memory provider is exact only for the sequential calls the coordinator/tests make (re-read rev before write). Real concurrency safety depends on Cosmos _etag/If-Match in production — verify the Cosmos path under true contention before relying on it at scale.
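For the budget-bar gap listed above, a guard could look like the following sketch. Only `ceilingUsd` is named in this document; the `spentUsd` field and the `spendFraction` helper are hypothetical.

```typescript
interface BudgetDoc {
  spentUsd: number;           // hypothetical field name
  ceilingUsd?: number | null; // may be absent on a budget doc
}

// Returns the fill fraction for the spend bar in [0, 1],
// or null when no meaningful ceiling exists (render "no ceiling" instead of a bar).
function spendFraction(budget: BudgetDoc): number | null {
  const ceiling = budget.ceilingUsd;
  if (ceiling === undefined || ceiling === null) return null;
  if (!Number.isFinite(ceiling) || ceiling <= 0) return null;
  if (!Number.isFinite(budget.spentUsd)) return null;
  return Math.min(1, Math.max(0, budget.spentUsd / ceiling));
}
```

Returning `null` rather than 0 lets the page distinguish "nothing spent" from "no ceiling configured", so a missing ceiling never reaches the bar as NaN.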

TUI-specific (this repo):

  • Fleet utilization % only renders in the metrics-aggregate fallback branch, not when per-factory rows are present — a minor inconsistency in the TUI board.
  • The budget.wall live selftest is timing-sensitive (races a 2s wall ceiling) and can flake under heavy disk/CPU load; the code is correct but the test could be made more robust (e.g. inject the clock).
  • TUI fleet mode has no write path for budgets/preemption — it's read + job actions only; budget pause/resume is web-only.

Operational gotchas (verified on the live fleet — get these right):

  • Heartbeat cadence MUST be < the 90s stale threshold. fleet_metrics marks a factory stale after DEFAULT_STALE_FACTORY_MS = 90_000, but the factory only heartbeats every AQ_FLEET_LEASE_RENEW_SEC (default 300s). Left at the default, a healthy factory flaps to "stale"/"no live factory" between beats. The fleet launcher sets AQ_FLEET_LEASE_RENEW_SEC=30 to stay well inside the window.
  • The tracker-web New-Job form is misconfigured: it hardcodes factories mac-1/mac-2 and defaults capabilities=["build"] — a token no agent-queue factory advertises (detect_capabilities emits os:*/engine:*/node:*/has:*). So a default UI submission is unroutable (queues forever → queue_starvation). Fix tracked in the redesign doc's routing-model section.
  • No factory deregister API. Only heartbeat/enroll/rotate/revoke exist, so a dead factory's doc lingers and shows as stale until pruned out-of-band (currently a manual Cosmos delete). A prune/deregister path is a Phase-4 item.
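The first two gotchas reduce to simple checks that a launcher or a form validator could run. The sketch below is illustrative: the helper names are hypothetical, and the concrete capability tokens in the usage note (such as `os:darwin`) are made-up examples of the `os:*`/`engine:*` families rather than values taken from detect_capabilities.

```typescript
const DEFAULT_STALE_FACTORY_MS = 90_000; // stale threshold quoted in this overview

// True when a heartbeat every `renewSec` seconds lands inside the stale window.
function heartbeatFitsStaleWindow(renewSec: number, staleMs: number = DEFAULT_STALE_FACTORY_MS): boolean {
  return renewSec * 1000 < staleMs;
}

// Hard capability filter: every required token must be advertised by the factory.
function factoryCanRun(required: string[], advertised: string[]): boolean {
  const have = new Set(advertised);
  return required.every((token) => have.has(token));
}

// A job is routable when at least one factory passes the subset check.
function isRoutable(required: string[], fleet: string[][]): boolean {
  return fleet.some((advertised) => factoryCanRun(required, advertised));
}
```

With these helpers, `heartbeatFitsStaleWindow(300)` is false and `heartbeatFitsStaleWindow(30)` is true, and a job requiring `["build"]` is unroutable against a fleet that only advertises tokens like `["os:darwin", "engine:claude"]`.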

Not-yet-built (expected, Phase 4+):

  • No message bus yet: dispatch is still poll-based, but the M0 RU gate now skips the claim while idle (so idle Cosmos RU is near-flat). Broker push/backpressure is M1+.
  • No autoscaling — factory fleet is static/manually run (M3 target).
  • No capability marketplace / cross-product fairness under contention.
  • No load/chaos test suite — resilience is unit-proven, not load-proven.
  • Artifacts blob wiring (fleet_artifacts → real blob storage) should be validated against a live storage account (tests use memory/pointer only).

Recently fixed (2026-05-31):

  • run --once could return before a backgrounded worker finished the PR/report. _meta_end (which writes ended=) was called right after the testing/ move, before PR open/merge + coordinator reports, so the slot freed early and --once could exit (and a caller could observe completion) mid-PR. Now ended= is written last; the selftest PR-mode case is deterministic again.

15. TL;DR

Phases 0–3 are functionally complete and well-tested: a durable coordinator with exactly-once leasing + fencing + crash recovery, a deterministic scoring router with preemption + explainability, per-product budgets, DAG decomposition, a multi-reviewer gate, factory enrollment with scoped tokens, and two control planes (browser + TUI) over one /fleet API. The remaining work is (a) trivial doc corrections, (b) CI-enforcing the existing e2e, and (c) the genuinely new Phase-4 scale-out frontier (broker, autoscaling, marketplace, chaos) and Phase-5 learned routing.