Move GIGAFACTORY_ROADMAP.md and GIGAFACTORY_SYSTEM_OVERVIEW.md under
agent-queue/docs/gigafactory/ so the scattered top-level docs are easy to
discover. Update the README links, the overview code-map, and all phase
job-spec source-of-truth paths to the new location. Pure docs move; no
behavior change.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add GIGAFACTORY_SYSTEM_OVERVIEW.md — a current-state companion to the roadmap
spec covering: what the Agent Gigafactory is, a completion snapshot, three
Mermaid diagrams (component architecture, job-lifecycle state machine, atomic
claim + lease-fencing sequence), the Cosmos data model, the scoring router,
subsystem map, full /fleet REST surface, feature flags, the two control planes,
a cross-repo code map, test coverage, next steps (Phase 4/5), and an honest
bugs/gaps/risks section. All three Mermaid blocks validated with mermaid.parse.
Also correct documentation drift in GIGAFACTORY_ROADMAP.md found during the
review:
- §0 progress table showed Phase 3 as "0% not started" while every Phase-3 box
  is ticked; updated phases 1-3 to done with realistic percentages.
- Phase-2 boxes "scheduler/router wired into assignment", "tracker adapter
  direct call", and "factory enrollment + scoped tokens" are implemented in
  common-plat (coordinator.ts uses selectJob; routes.ts enforces
  enrollment.enforceFactoryToken; tracker-bridge.ts) but were left unticked;
  ticked them with evidence and refreshed the stale "remaining for 100%" notes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an opt-in fleet mode to the dashboard so an operator can drive the
coordinator fleet from the same TUI used for the local folder queue.
- lib/fleet-dash.mjs: dependency-injectable read/act adapter over the
  platform-service /fleet REST surface (jobs, metrics, factories, events,
  ship/requeue/reject). Pure-ish + fully unit-testable without a live service.
- dashboard.mjs: render + act in fleet mode when AQ_FLEET_DASH=1. Shows a board
  with counts, factories (per-factory rows or metrics aggregate), alerts, running
  (by lease/factory), actionable JOBS with manifest tags, recent, and a
  per-job events log. Single-flight async refresh keeps the last good board on
  failure; ship re-GETs a fresh leaseEpoch before PATCH; run/stop/promote are
  disabled (no safe server contract). Local mode is byte-for-byte unchanged.
- lib/fleet-dash.test.mjs: 22 node:assert assertions (config, stage mapping,
  toBoard, fetch headers/timeout/errors, board assembly + graceful degradation,
  events, job actions) wired into selftest.sh.
- docs: tick the Phase 3 "TUI re-pointed at /fleet" roadmap boxes.
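The "single-flight refresh keeps the last good board on failure" behavior can be illustrated with a minimal sketch. It is written in shell rather than the dashboard's JavaScript, and refresh_board, the cache-file layout, and the mkdir lock are all hypothetical, not taken from dashboard.mjs.

```shell
# Hypothetical sketch, not the dashboard.mjs implementation.
# refresh_board <cache_file> <fetch command...>
# - single-flight: if another refresh holds the lock, return without fetching
# - last-good: the cache is replaced only when the fetch succeeds with output
refresh_board() {
  local cache="$1"; shift
  local lock="$cache.lock" tmp="$cache.tmp.$$" rc=0
  mkdir "$lock" 2>/dev/null || return 0   # a refresh is already in flight
  if "$@" > "$tmp" 2>/dev/null && [ -s "$tmp" ]; then
    mv "$tmp" "$cache"                    # publish the new board
  else
    rm -f "$tmp"; rc=1                    # keep the last good board
  fi
  rmdir "$lock"
  return $rc
}
```

A failed fetch leaves the cache file untouched, so a renderer that reads only the cache keeps showing the previous board.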
Verified: selftest.sh green (incl. new fleet-dash checks); live non-TTY render
smoke against a stub /fleet server (both factories and metrics-aggregate paths);
local mode unchanged.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Parse the wall ceiling from the budget manifest map (budget: { wall: <dur> })
and arm it alongside the per-run timeout. Whichever ceiling fires first binds;
the kill is recorded as result=timeout or result=budget_exceeded accordingly.
budget.wall is enforced independently of timeout: a job with only a budget.wall
(no timeout) is now hard-killed at the ceiling. budget_exceeded is a terminal,
non-retryable class by default and maps to the failed tracker status.
Adds _budget_wall_secs + _effective_kill helpers (pure, unit-tested) and live
selftest coverage; usd/tokens remain best-effort and are not enforced here.
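A minimal sketch of the "whichever ceiling fires first binds" rule. The function names dur_to_secs and pick_kill are hypothetical and are not the repository's _budget_wall_secs/_effective_kill; the accepted duration suffixes (s, m, h, or bare seconds) and the tie-break (timeout wins on equal values) are assumptions.

```shell
# Hypothetical sketch; the duration grammar and tie-break are assumptions.

# Convert "90", "90s", "5m", or "2h" to whole seconds; fail on anything else.
dur_to_secs() {
  local d="$1" n unit
  case "$d" in ''|*[!0-9smh]*) return 1 ;; esac
  n="${d%[smh]}"                      # strip one trailing unit letter, if any
  unit="${d#"$n"}"
  case "$n" in ''|*[!0-9]*) return 1 ;; esac
  # Drop leading zeros so the shell does not read the number as octal.
  while [ "${#n}" -gt 1 ] && [ "${n#0}" != "$n" ]; do n="${n#0}"; done
  case "$unit" in
    ''|s) echo "$n" ;;
    m)    echo $((n * 60)) ;;
    h)    echo $((n * 3600)) ;;
  esac
}

# Print "<secs> <result>" for the ceiling that fires first.
# An empty argument means "not set"; "0 none" means no ceiling is armed.
pick_kill() {
  local timeout_secs="$1" wall_secs="$2"
  if [ -z "$timeout_secs" ] && [ -z "$wall_secs" ]; then
    echo "0 none"
  elif [ -z "$wall_secs" ]; then
    echo "$timeout_secs timeout"
  elif [ -z "$timeout_secs" ] || [ "$wall_secs" -lt "$timeout_secs" ]; then
    echo "$wall_secs budget_exceeded"
  else
    echo "$timeout_secs timeout"
  fi
}
```

For example, pick_kill "" "$(dur_to_secs 2m)" yields a budget_exceeded kill at 120 seconds, which matches the case of a job that sets only budget.wall.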
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Close the final Phase-2 exit-criteria box: >=2 factories executing jobs in parallel
through one coordinator, proving the concurrency guarantees end-to-end. This is a
DEMO HARNESS over the existing runtime — agent-queue.sh and lib/fleet-client.sh are
unchanged (read + called, not modified).
demo/two-factory-demo.sh: starts two real `agent-queue.sh run` daemons (mac-1 +
ubuntu-1, separate queues/cwds) that compete ONLY through the coordinator, then
asserts: (a) no double-assign — each of 3 jobs executed by exactly one factory;
(b) fencing + reclaim — kill a factory mid-job, the reaper returns its job, the
survivor reclaims + completes it, and the dead worker's late/zombie report (stale
leaseEpoch) is FENCED (HTTP 409, never shipped); (c) parallelism — both factories
hold active jobs concurrently. Dual-mode: CI-safe stateful stub by default; live
platform-service when AQ_FLEET_API/AQ_FLEET_TOKEN set.
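The fencing guarantee in (b) comes down to comparing the reported leaseEpoch with the current one. The sketch below is an illustration only: fence_report, reclaim_job, and the one-file-per-job state layout are hypothetical and do not describe the demo's or the platform-service's actual storage.

```shell
# Hypothetical sketch; the <state_dir>/<job>.epoch layout is an assumption.

# Accept a report only if its leaseEpoch equals the stored epoch.
# Prints an HTTP-like status: 200 accepted, 409 fenced, 404 unknown job.
fence_report() {
  local state_dir="$1" job="$2" reported_epoch="$3" current
  current="$(cat "$state_dir/$job.epoch" 2>/dev/null)" || { echo 404; return 1; }
  if [ "$reported_epoch" = "$current" ]; then
    echo 200
  else
    echo 409
    return 1
  fi
}

# Reaper reclaim: bump the epoch so any late report from the previous
# holder carries a stale leaseEpoch and is fenced.
reclaim_job() {
  local state_dir="$1" job="$2" current
  current="$(cat "$state_dir/$job.epoch" 2>/dev/null || echo 0)"
  echo $((current + 1)) > "$state_dir/$job.epoch"
}
```

After reclaim_job runs, a report that still carries the old epoch gets 409, while the survivor reporting with the new epoch gets 200.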
demo/coordinator-stub.sh: stateful, mkdir-lock-guarded, file-backed coordinator
implementing claim/lease/fence/renew/release + reaper-reclaim via the existing
AQ_FLEET_API_CMD seam — the selftest stub pattern extended with shared state so
>=2 processes coordinate through one coordinator.
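A mkdir-lock-guarded, file-backed claim can be sketched as follows. This is not the code in demo/coordinator-stub.sh: with_lock, claim_unlocked, try_claim, the file names, and the retry budget are hypothetical. It relies only on mkdir failing when the directory already exists.

```shell
# Hypothetical sketch; names, file layout, and retry budget are assumptions.

# Run a command while holding a mkdir-based lock.
with_lock() {
  local lock="$1"; shift
  local tries=0 rc=0
  until mkdir "$lock" 2>/dev/null; do
    tries=$((tries + 1))
    [ "$tries" -ge 10 ] && return 75    # give up after about 10 seconds
    sleep 1
  done
  "$@" || rc=$?
  rmdir "$lock"
  return $rc
}

# Claim a job only if it has no owner. Prints the new leaseEpoch on success.
claim_unlocked() {
  local state_dir="$1" job="$2" factory="$3" epoch
  [ -e "$state_dir/$job.owner" ] && return 1
  epoch="$(cat "$state_dir/$job.epoch" 2>/dev/null || echo 0)"
  epoch=$((epoch + 1))
  echo "$epoch" > "$state_dir/$job.epoch"
  echo "$factory" > "$state_dir/$job.owner"
  echo "$epoch"
}

# try_claim <state_dir> <job> <factory>
try_claim() { with_lock "$1/.lock" claim_unlocked "$@"; }
```

Because the ownership check and the write happen under one lock, two processes racing on the same job cannot both see it as unowned.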
demo/README.md: stub + real invocations, env knobs, what each guarantee proves,
what-to-watch guide.
selftest.sh: +3 headless stub-mode checks (existing 68 unchanged byte-for-byte ->
71 total green).
docs/GIGAFACTORY_ROADMAP.md: tick the §14 two-factory-demo box; annotate Phase-2
exit criteria; bump §0 Phase 2 to 80% (remaining: scheduler-core wiring [common-plat
PR #31], tracker-direct call, factory enrollment).
bash 3.2 + awk/sed/grep/pgrep only; mac+linux safe; no new runtime deps.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Add a safe, reversible path to validate the fleet coordinator against the proven
single-host path BEFORE cutover, via three independently-toggleable flags:
  AQ_FLEET=0         pure offline (zero coordinator calls; offline path unchanged)
  AQ_FLEET_ROUTE=1   route_via_service: coordinator authoritative for claim (default = P2-S3)
  AQ_FLEET_ROUTE=0   local inbox authoritative (coordinator not used to source work)
  AQ_FLEET_SHADOW=1  dual-run (needs AQ_FLEET=1 + ROUTE=0): query coordinator in parallel,
                     record divergence, NEVER act on it
Precedence: SHADOW only when ROUTE=0; if ROUTE=1 + SHADOW=1, ROUTE wins (one-shot warning).
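The precedence rule can be sketched as one resolver function. fleet_mode_sketch is a hypothetical name, not one of the lib/fleet-client.sh functions; the defaults for AQ_FLEET and AQ_FLEET_SHADOW (both 0) are assumptions, and unlike the real one-shot warning this sketch warns on every call.

```shell
# Hypothetical sketch of the flag precedence; defaults for AQ_FLEET and
# AQ_FLEET_SHADOW are assumed to be 0, AQ_FLEET_ROUTE defaults to 1.
# Prints one of: offline | route | shadow | local.
fleet_mode_sketch() {
  local fleet="${AQ_FLEET:-0}" route="${AQ_FLEET_ROUTE:-1}" shadow="${AQ_FLEET_SHADOW:-0}"
  if [ "$fleet" != "1" ]; then
    echo offline                       # zero coordinator calls
    return
  fi
  if [ "$route" = "1" ]; then
    # ROUTE wins over SHADOW; the real runner warns only once.
    [ "$shadow" = "1" ] && echo "warning: AQ_FLEET_SHADOW ignored while AQ_FLEET_ROUTE=1" >&2
    echo route
    return
  fi
  if [ "$shadow" = "1" ]; then
    echo shadow                        # local authoritative, coordinator queried read-only
  else
    echo local                         # local inbox authoritative
  fi
}
```

Usage: AQ_FLEET=1 AQ_FLEET_ROUTE=0 AQ_FLEET_SHADOW=1 fleet_mode_sketch prints shadow, the only combination in which dual-run is active.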
lib/fleet-client.sh: fleet_route_enabled / fleet_shadow_enabled / fleet_flags_warn_once /
fleet_flags_state; fleet_shadow_claim (read-only — isolated `-shadow` factoryId +
dryRun, releases any real lease, never materializes), fleet_shadow_compare
(AGREE/DIVERGE/COORD_EMPTY/LOCAL_EMPTY → .state/fleet-shadow.log), fleet_shadow_report
(shadow:true, response never acted on), cmd_fleet_shadow_report (counts + agreement rate).
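The four comparison verdicts and the agreement rate can be illustrated with a small sketch. shadow_verdict and shadow_summary are hypothetical names; treating "both sides empty" as AGREE, the one-verdict-per-line log format, and computing the rate as AGREE over all entries are assumptions rather than the documented behavior of fleet_shadow_compare or cmd_fleet_shadow_report.

```shell
# Hypothetical sketch; verdict rules for the both-empty case, the log format,
# and the rate definition are assumptions.

# Classify one iteration from the local pick and the coordinator's pick.
shadow_verdict() {
  local local_job="$1" coord_job="$2"
  if [ -z "$coord_job" ] && [ -n "$local_job" ]; then
    echo COORD_EMPTY
  elif [ -z "$local_job" ] && [ -n "$coord_job" ]; then
    echo LOCAL_EMPTY
  elif [ "$local_job" = "$coord_job" ]; then
    echo AGREE
  else
    echo DIVERGE
  fi
}

# Summarize a log whose first field on each line is the verdict.
shadow_summary() {
  awk '{ n[$1]++; total++ }
       END {
         rate = total ? 100 * n["AGREE"] / total : 0
         printf "total=%d agree=%d diverge=%d rate=%.1f%%\n",
                total, n["AGREE"], n["DIVERGE"], rate
       }' "$1"
}
```

The summary only reads the log, in keeping with the rule that shadow results are recorded and never acted on.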
agent-queue.sh: ROUTE-gate claim sourcing (claim only when route_via_service);
shadow hook after the local authoritative decision each iteration (best-effort,
error-swallowed — shadow can never fail a real job); `fleet-shadow-report` subcommand
+ help; resolved flags surfaced in `status`/`fleet-status`. tryClaim/fence/offline
paths unchanged.
Strictly side-effect-free on real job state: shadow never ships, quarantines, or
mutates real jobs. Offline path byte-for-byte unchanged when AQ_FLEET=0.
selftest.sh: +8 checks (shadow AGREE/DIVERGE/COORD_EMPTY, non-fatal 5xx, ROUTE
precedence, ROUTE=0 local-authoritative, fleet-shadow-report summary, shadow_report
unit). 60 prior checks unchanged → 68 total green. README + GIGAFACTORY_ROADMAP
document the flag model + cutover ladder.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
New §24 + feature-catalog row:
- two delegation modes: atomic (leaf bug/feature/task) vs composite (roadmap/epic)
- introduce job kind (leaf|composite); composite routes to a planner/orchestrator
  that fans out child leaf jobs as a DAG across factories/agents/profiles
- parentId hierarchy + rollup semantics (status/budget/verify/phase-gates) +
  idempotent re-run (skip shipped children)
- source-of-truth/sync discipline (one record referenced by many; one-way echo)
- HYBRID decision recorded: model kind/parentId/rollup in the fleet layer now,
  keep shared tracker ITEM_TYPES unchanged (label kind:roadmap), promote to a
  first-class epic type later via additive migration once proven
- phasing: leaf-only P1-P2; manual composite P3; auto-decomposition planner P3->P5
- docs/GIGAFACTORY_ROADMAP.md: distributed multi-machine fleet vision
  (factory x tool x profile routing) as a checklist-driven, phased
  implementation roadmap (Phase 0-5) with acceptance criteria, verify
  gates, and a 100% Definition-of-Done rubric
- committed path: coordinator as a platform-service module + control
  plane on tracker-web, reached via a thin tracker adapter first; bash
  runner survives as the offline edge factory agent
- README: add vision/roadmap pointer
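The idempotent re-run rule for composite jobs (skip shipped children) could look like the sketch below. pending_children, the "<childId> <status>" line format, and the literal status value shipped are assumptions for illustration; the roadmap does not specify a storage format.

```shell
# Hypothetical sketch; the children file format is an assumption.
# Input file: one "<childId> <status>" pair per line.
# Output: ids of children that still need to run on a re-run of the parent.
pending_children() {
  awk '$2 != "shipped" { print $1 }' "$1"
}
```

Re-running the composite job then dispatches only the ids this prints, so a second run over an unchanged child list repeats no shipped work.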