bytelyst-devops-tools/agent-queue/demo/README.md
saravanakumardb1 0cde7def6a feat(agent-queue): two-factory parallel demo — Phase 2 exit criteria (§14)
Close the final Phase-2 exit-criteria box: >=2 factories executing jobs in parallel
through one coordinator, proving the concurrency guarantees end-to-end. This is a
DEMO HARNESS over the existing runtime — agent-queue.sh and lib/fleet-client.sh are
unchanged (read + called, not modified).

demo/two-factory-demo.sh: starts two real `agent-queue.sh run` daemons (mac-1 +
ubuntu-1, separate queues/cwds) that compete ONLY through the coordinator, then
asserts: (a) no double-assign — each of 3 jobs executed by exactly one factory;
(b) fencing + reclaim — kill a factory mid-job, the reaper returns its job, the
survivor reclaims + completes it, and the dead worker's late/zombie report (stale
leaseEpoch) is FENCED (HTTP 409, never shipped); (c) parallelism — both factories
hold active jobs concurrently. Dual-mode: CI-safe stateful stub by default; live
platform-service when AQ_FLEET_API/AQ_FLEET_TOKEN set.

demo/coordinator-stub.sh: stateful, mkdir-lock-guarded, file-backed coordinator
implementing claim/lease/fence/renew/release + reaper-reclaim via the existing
AQ_FLEET_API_CMD seam — the selftest stub pattern extended with shared state so
>=2 processes coordinate through one coordinator.

demo/README.md: stub + real invocations, env knobs, what each guarantee proves,
what-to-watch guide.

selftest.sh: +3 headless stub-mode checks (existing 68 unchanged byte-for-byte ->
71 total green).

docs/GIGAFACTORY_ROADMAP.md: tick the §14 two-factory-demo box; annotate Phase-2
exit criteria; bump §0 Phase 2 to 80% (remaining: scheduler-core wiring [common-plat
PR #31], tracker-direct call, factory enrollment).

bash 3.2 + awk/sed/grep/pgrep only; mac+linux safe; no new runtime deps.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 01:53:36 -07:00

4.2 KiB

Two-Factory Parallel Demo (Phase-2 Exit Criteria, §14)

This demo closes the final Phase-2 exit-criteria box: ≥2 factories executing jobs in parallel through one coordinator, proving the concurrency guarantees end-to-end. It is a harness over the existing runtime — it does not change agent-queue.sh or lib/fleet-client.sh; it starts two real agent-queue.sh run daemons (distinct factoryIds, separate queues/cwds) that compete only through the coordinator, then observes and asserts.

The three guarantees it proves

# Guarantee How it's shown
(a) No double-assign Each of the 3 jobs is claimed/executed by exactly one factory. The coordinator's atomic claim (lock-guarded; only a queued job is claimable) means two concurrent claimers never get the same job version.
(b) Fencing + reclaim One factory is killed mid-job. The reaper returns its in-flight job to queued with a bumped lease epoch; the surviving factory reclaims and completes it. The dead worker's late/zombie report (stale epoch) is fenced (HTTP 409) and never ships.
(c) Parallelism Both factories hold an active job simultaneously (observed in coordinator state) — work is concurrent, not serialized.

Run it

Stub mode (default, zero dependencies, CI-safe)

bash demo/two-factory-demo.sh

Drives coordinator-stub.sh — a stateful, lock-guarded, file-backed coordinator that implements the same claim / lease / fence / reaper contract as platform-service, via the existing AQ_FLEET_API_CMD test seam. No platform-service, no Cosmos, no network. This is exactly what selftest.sh runs headlessly.

Real-coordinator mode (against a live platform-service)

DEMO_MODE=real \
  AQ_FLEET_API=http://localhost:4003/api \
  AQ_FLEET_TOKEN=<bearer> \
  AQ_PRODUCT_ID=<product> \
  bash demo/two-factory-demo.sh

In real mode the demo submits via the platform-service fleet API and relies on the coordinator's own lease reaper to reclaim the killed factory's job (it waits DEMO_REAP_WAIT seconds; pair with a short AQ_FLEET_LEASE_SECONDS so the lease expires quickly). Submit endpoint is overridable via DEMO_SUBMIT_PATH (default /fleet/jobs). Real mode is observational/best-effort — the machine-checked assertions run in stub mode (and in selftest.sh).

Env knobs

Var Default Meaning
DEMO_MODE stub stub or real (auto-set to real when AQ_FLEET_API+AQ_FLEET_TOKEN are set and DEMO_MODEstub)
DEMO_JOB_SLEEP 2 per-job engine seconds — the window during which the victim is killed mid-job
DEMO_TIMEOUT 60 max seconds to wait for the survivor to drain all 3 jobs
DEMO_POLL 0.2 coordinator-state poll interval
DEMO_FACTORY_1 / DEMO_FACTORY_2 mac-1 / ubuntu-1 factory ids (F1 is the victim)
DEMO_KEEP 0 1 keeps the temp dir (queues, logs, coordinator state) for inspection
DEMO_REAP_WAIT / DEMO_DRAIN_WAIT 20 / 30 real-mode waits for the coordinator reaper / drain

What to watch

The demo prints a step-by-step trace and a final RESULTS block. The key lines:

  • PARALLELISM observed: mac-1 and ubuntu-1 both holding active jobs concurrently — guarantee (c).
  • killed factory mac-1 ... mid-job then reaper reclaimed mac-1's lease(s) — the crash + reclaim.
  • zombie report for <job> @epoch=N was FENCED (HTTP 409) — guarantee (b) fencing.
  • RESULTS shows each job's winning factory; the reclaimed job's winner is the survivor.

With DEMO_KEEP=1, inspect under the printed temp dir:

  • coord/events.log — the coordinator's audit trail: CLAIM / PATCH:<stage> / RECLAIM / FENCE events (factory + epoch on each).
  • coord/jobs/<id>.job — final per-job stage / holder / epoch.
  • log-mac-1.txt, log-ubuntu-1.txt — each factory's run-loop log (claims, the ▶ launching, the fenced/quarantine path on the killed worker).

Files

  • two-factory-demo.sh — the orchestrator (start factories, kill/reclaim/fence, assert).
  • coordinator-stub.sh — the stateful coordinator stub (claim/patch/fence/renew/release/reap, mkdir-locked).