bytelyst-devops-tools/agent-queue/demo/README.md
saravanakumardb1 0cde7def6a feat(agent-queue): two-factory parallel demo — Phase 2 exit criteria (§14)
Close the final Phase-2 exit-criteria box: >=2 factories executing jobs in parallel
through one coordinator, proving the concurrency guarantees end-to-end. This is a
DEMO HARNESS over the existing runtime — agent-queue.sh and lib/fleet-client.sh are
unchanged (read + called, not modified).

demo/two-factory-demo.sh: starts two real `agent-queue.sh run` daemons (mac-1 +
ubuntu-1, separate queues/cwds) that compete ONLY through the coordinator, then
asserts: (a) no double-assign — each of 3 jobs executed by exactly one factory;
(b) fencing + reclaim — kill a factory mid-job, the reaper returns its job, the
survivor reclaims + completes it, and the dead worker's late/zombie report (stale
leaseEpoch) is FENCED (HTTP 409, never shipped); (c) parallelism — both factories
hold active jobs concurrently. Dual-mode: CI-safe stateful stub by default; live
platform-service when AQ_FLEET_API/AQ_FLEET_TOKEN set.

demo/coordinator-stub.sh: stateful, mkdir-lock-guarded, file-backed coordinator
implementing claim/lease/fence/renew/release + reaper-reclaim via the existing
AQ_FLEET_API_CMD seam — the selftest stub pattern extended with shared state so
>=2 processes coordinate through one coordinator.

demo/README.md: stub + real invocations, env knobs, what each guarantee proves,
what-to-watch guide.

selftest.sh: +3 headless stub-mode checks (existing 68 unchanged byte-for-byte ->
71 total green).

docs/GIGAFACTORY_ROADMAP.md: tick the §14 two-factory-demo box; annotate Phase-2
exit criteria; bump §0 Phase 2 to 80% (remaining: scheduler-core wiring [common-plat
PR #31], tracker-direct call, factory enrollment).

bash 3.2 + awk/sed/grep/pgrep only; mac+linux safe; no new runtime deps.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 01:53:36 -07:00

79 lines
4.2 KiB
Markdown

# Two-Factory Parallel Demo (Phase-2 Exit Criteria, §14)
This demo closes the final Phase-2 exit-criteria box: **≥2 factories executing jobs in
parallel through one coordinator**, proving the concurrency guarantees end-to-end. It is a
**harness over the existing runtime** — it does *not* change `agent-queue.sh` or
`lib/fleet-client.sh`; it starts two real `agent-queue.sh run` daemons (distinct
factoryIds, separate queues/cwds) that compete **only** through the coordinator, then
observes and asserts.
## The three guarantees it proves
| # | Guarantee | How it's shown |
|---|-----------|----------------|
| **(a)** | **No double-assign** | Each of the 3 jobs is claimed/executed by exactly **one** factory. The coordinator's atomic claim (lock-guarded; only a `queued` job is claimable) means two concurrent claimers never get the same job version. |
| **(b)** | **Fencing + reclaim** | One factory is **killed mid-job**. The reaper returns its in-flight job to `queued` with a **bumped lease epoch**; the surviving factory **reclaims and completes** it. The dead worker's late/zombie report (stale epoch) is **fenced (HTTP 409)** and never ships. |
| **(c)** | **Parallelism** | Both factories hold an active job **simultaneously** (observed in coordinator state) — work is concurrent, not serialized. |
## Run it
### Stub mode (default, zero dependencies, CI-safe)
```bash
bash demo/two-factory-demo.sh
```
Drives [`coordinator-stub.sh`](coordinator-stub.sh) — a stateful, lock-guarded, file-backed
coordinator that implements the same claim / lease / fence / reaper contract as
platform-service, via the existing `AQ_FLEET_API_CMD` test seam. No platform-service, no
Cosmos, no network. This is exactly what `selftest.sh` runs headlessly.
### Real-coordinator mode (against a live platform-service)
```bash
DEMO_MODE=real \
AQ_FLEET_API=http://localhost:4003/api \
AQ_FLEET_TOKEN=<bearer> \
AQ_PRODUCT_ID=<product> \
bash demo/two-factory-demo.sh
```
In real mode the demo submits via the platform-service fleet API and relies on the
coordinator's **own lease reaper** to reclaim the killed factory's job (it waits
`DEMO_REAP_WAIT` seconds; pair with a short `AQ_FLEET_LEASE_SECONDS` so the lease expires
quickly). Submit endpoint is overridable via `DEMO_SUBMIT_PATH` (default `/fleet/jobs`).
Real mode is observational/best-effort — the machine-checked assertions run in stub mode
(and in `selftest.sh`).
## Env knobs
| Var | Default | Meaning |
|-----|---------|---------|
| `DEMO_MODE` | `stub` | `stub` or `real` (auto-set to `real` when `AQ_FLEET_API`+`AQ_FLEET_TOKEN` are set and `DEMO_MODE``stub`) |
| `DEMO_JOB_SLEEP` | `2` | per-job engine seconds — the window during which the victim is killed mid-job |
| `DEMO_TIMEOUT` | `60` | max seconds to wait for the survivor to drain all 3 jobs |
| `DEMO_POLL` | `0.2` | coordinator-state poll interval |
| `DEMO_FACTORY_1` / `DEMO_FACTORY_2` | `mac-1` / `ubuntu-1` | factory ids (F1 is the victim) |
| `DEMO_KEEP` | `0` | `1` keeps the temp dir (queues, logs, coordinator state) for inspection |
| `DEMO_REAP_WAIT` / `DEMO_DRAIN_WAIT` | `20` / `30` | real-mode waits for the coordinator reaper / drain |
## What to watch
The demo prints a step-by-step trace and a final `RESULTS` block. The key lines:
- `PARALLELISM observed: mac-1 and ubuntu-1 both holding active jobs concurrently` — guarantee (c).
- `killed factory mac-1 ... mid-job` then `reaper reclaimed mac-1's lease(s)` — the crash + reclaim.
- `zombie report for <job> @epoch=N was FENCED (HTTP 409)` — guarantee (b) fencing.
- `RESULTS` shows each job's winning factory; the reclaimed job's winner is the **survivor**.
With `DEMO_KEEP=1`, inspect under the printed temp dir:
- `coord/events.log` — the coordinator's audit trail: `CLAIM` / `PATCH:<stage>` / `RECLAIM` / `FENCE` events (factory + epoch on each).
- `coord/jobs/<id>.job` — final per-job `stage` / `holder` / `epoch`.
- `log-mac-1.txt`, `log-ubuntu-1.txt` — each factory's run-loop log (claims, the `▶ launching`, the fenced/quarantine path on the killed worker).
## Files
- `two-factory-demo.sh` — the orchestrator (start factories, kill/reclaim/fence, assert).
- `coordinator-stub.sh` — the stateful coordinator stub (claim/patch/fence/renew/release/reap, mkdir-locked).