The operational _start_fleet.sh lives in a local (untracked) sandbox, so the gate + heartbeat-cadence settings weren't version-controlled anywhere. Add demo/start-fleet.example.sh: a parameterized, sanitized launcher (one agent-queue.sh run daemon per product against a live platform-service) that ships the two settings you must get right — AQ_FLEET_GATE=1 (M0 RU gate) and AQ_FLEET_LEASE_RENEW_SEC=30 (heartbeat cadence < the 90s stale threshold). No hardcoded paths/secrets; everything env-overridable. Documented in demo/README. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com> |
||
|---|---|---|
| .. | ||
| coordinator-stub.sh | ||
| README.md | ||
| start-fleet.example.sh | ||
| two-factory-demo.sh | ||
Two-Factory Parallel Demo (Phase-2 Exit Criteria, §14)
This demo closes the final Phase-2 exit-criteria box: ≥2 factories executing jobs in
parallel through one coordinator, proving the concurrency guarantees end-to-end. It is a
harness over the existing runtime — it does not change agent-queue.sh or
lib/fleet-client.sh; it starts two real agent-queue.sh run daemons (distinct
factoryIds, separate queues/cwds) that compete only through the coordinator, then
observes and asserts.
The three guarantees it proves
| # | Guarantee | How it's shown |
|---|---|---|
| (a) | No double-assign | Each of the 3 jobs is claimed/executed by exactly one factory. The coordinator's atomic claim (lock-guarded; only a queued job is claimable) means two concurrent claimers never get the same job version. |
| (b) | Fencing + reclaim | One factory is killed mid-job. The reaper returns its in-flight job to queued with a bumped lease epoch; the surviving factory reclaims and completes it. The dead worker's late/zombie report (stale epoch) is fenced (HTTP 409) and never ships. |
| (c) | Parallelism | Both factories hold an active job simultaneously (observed in coordinator state) — work is concurrent, not serialized. |
Run it
Stub mode (default, zero dependencies, CI-safe)
bash demo/two-factory-demo.sh
Drives coordinator-stub.sh — a stateful, lock-guarded, file-backed
coordinator that implements the same claim / lease / fence / reaper contract as
platform-service, via the existing AQ_FLEET_API_CMD test seam. No platform-service, no
Cosmos, no network. This is exactly what selftest.sh runs headlessly.
Real-coordinator mode (against a live platform-service)
DEMO_MODE=real \
AQ_FLEET_API=http://localhost:4003/api \
AQ_FLEET_TOKEN=<bearer> \
AQ_PRODUCT_ID=<product> \
bash demo/two-factory-demo.sh
In real mode the demo submits via the platform-service fleet API and relies on the
coordinator's own lease reaper to reclaim the killed factory's job (it waits
DEMO_REAP_WAIT seconds; pair with a short AQ_FLEET_LEASE_SECONDS so the lease expires
quickly). Submit endpoint is overridable via DEMO_SUBMIT_PATH (default /fleet/jobs).
Real mode is observational/best-effort — the machine-checked assertions run in stub mode
(and in selftest.sh).
Env knobs
| Var | Default | Meaning |
|---|---|---|
DEMO_MODE |
stub |
stub or real (auto-set to real when AQ_FLEET_API+AQ_FLEET_TOKEN are set and DEMO_MODE ≠ stub) |
DEMO_JOB_SLEEP |
2 |
per-job engine seconds — the window during which the victim is killed mid-job |
DEMO_TIMEOUT |
60 |
max seconds to wait for the survivor to drain all 3 jobs |
DEMO_POLL |
0.2 |
coordinator-state poll interval |
DEMO_FACTORY_1 / DEMO_FACTORY_2 |
mac-1 / ubuntu-1 |
factory ids (F1 is the victim) |
DEMO_KEEP |
0 |
1 keeps the temp dir (queues, logs, coordinator state) for inspection |
DEMO_REAP_WAIT / DEMO_DRAIN_WAIT |
20 / 30 |
real-mode waits for the coordinator reaper / drain |
What to watch
The demo prints a step-by-step trace and a final RESULTS block. The key lines:
PARALLELISM observed: mac-1 and ubuntu-1 both holding active jobs concurrently— guarantee (c).killed factory mac-1 ... mid-jobthenreaper reclaimed mac-1's lease(s)— the crash + reclaim.zombie report for <job> @epoch=N was FENCED (HTTP 409)— guarantee (b) fencing.RESULTSshows each job's winning factory; the reclaimed job's winner is the survivor.
With DEMO_KEEP=1, inspect under the printed temp dir:
coord/events.log— the coordinator's audit trail:CLAIM/PATCH:<stage>/RECLAIM/FENCEevents (factory + epoch on each).coord/jobs/<id>.job— final per-jobstage/holder/epoch.log-mac-1.txt,log-ubuntu-1.txt— each factory's run-loop log (claims, the▶ launching, the fenced/quarantine path on the killed worker).
Files
two-factory-demo.sh— the orchestrator (start factories, kill/reclaim/fence, assert).coordinator-stub.sh— the stateful coordinator stub (claim/patch/fence/renew/release/reap, mkdir-locked).start-fleet.example.sh— reference launcher for a real multi-product local fleet against a live platform-service (oneagent-queue.sh rundaemon per product). Parameterized via env; ships the two settings you must get right —AQ_FLEET_GATE=1(M0 RU gate) andAQ_FLEET_LEASE_RENEW_SEC=30(heartbeat cadence < the 90s stale threshold). Copy + adjust for your sandbox.