bytelyst-devops-tools/agent-queue/demo/README.md
saravanakumardb1 38d8e8e5cf feat(agent-queue): add tracked example multi-product fleet launcher
The operational _start_fleet.sh lives in a local (untracked) sandbox, so the
gate + heartbeat-cadence settings weren't version-controlled anywhere. Add
demo/start-fleet.example.sh: a parameterized, sanitized launcher (one
agent-queue.sh run daemon per product against a live platform-service) that
ships the two settings you must get right — AQ_FLEET_GATE=1 (M0 RU gate) and
AQ_FLEET_LEASE_RENEW_SEC=30 (heartbeat cadence < the 90s stale threshold). No
hardcoded paths/secrets; everything env-overridable. Documented in demo/README.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 00:18:26 -07:00

# Two-Factory Parallel Demo (Phase-2 Exit Criteria, §14)
This demo closes the final Phase-2 exit-criteria box: **≥2 factories executing jobs in
parallel through one coordinator**, proving the concurrency guarantees end-to-end. It is a
**harness over the existing runtime** — it does *not* change `agent-queue.sh` or
`lib/fleet-client.sh`; it starts two real `agent-queue.sh run` daemons (distinct
factoryIds, separate queues/cwds) that compete **only** through the coordinator, then
observes and asserts.
## The three guarantees it proves
| # | Guarantee | How it's shown |
|---|-----------|----------------|
| **(a)** | **No double-assign** | Each of the 3 jobs is claimed/executed by exactly **one** factory. The coordinator's atomic claim (lock-guarded; only a `queued` job is claimable) means two concurrent claimers never get the same job version. |
| **(b)** | **Fencing + reclaim** | One factory is **killed mid-job**. The reaper returns its in-flight job to `queued` with a **bumped lease epoch**; the surviving factory **reclaims and completes** it. The dead worker's late/zombie report (stale epoch) is **fenced (HTTP 409)** and never ships. |
| **(c)** | **Parallelism** | Both factories hold an active job **simultaneously** (observed in coordinator state) — work is concurrent, not serialized. |
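
The claim and fencing rules in the table can be sketched in a few lines of bash. This is a minimal illustration, not the actual `coordinator-stub.sh`: the one-line job-file layout (`stage holder epoch`), the `COORD_DIR` variable, and every function name below are assumptions made for the example. It only shows why a `mkdir`-based lock plus an epoch check is enough for guarantees (a) and (b).

```shell
#!/usr/bin/env bash
# Hypothetical sketch only: file layout and function names are assumptions,
# not the real coordinator-stub.sh implementation.

COORD_DIR="${COORD_DIR:-$(mktemp -d)}"
mkdir -p "$COORD_DIR/jobs"

# mkdir is atomic: exactly one caller creates the directory and holds the lock.
with_lock() {
  until mkdir "$COORD_DIR/lock" 2>/dev/null; do sleep 0.05; done
  "$@"; local rc=$?
  rmdir "$COORD_DIR/lock"
  return "$rc"
}

# Each job file holds one line: <stage> <holder> <epoch>
submit_job() { echo "queued - 0" > "$COORD_DIR/jobs/$1.job"; }

_claim() {  # $1=job $2=factory; prints the lease epoch on success
  local stage holder epoch
  read -r stage holder epoch < "$COORD_DIR/jobs/$1.job"
  [ "$stage" = "queued" ] || return 1   # only a queued job is claimable
  echo "claimed $2 $epoch" > "$COORD_DIR/jobs/$1.job"
  echo "$epoch"
}
claim_job() { with_lock _claim "$@"; }

_reap() {  # $1=job; return the job to the queue with a bumped epoch
  local stage holder epoch
  read -r stage holder epoch < "$COORD_DIR/jobs/$1.job"
  echo "queued - $((epoch + 1))" > "$COORD_DIR/jobs/$1.job"
}
reap_job() { with_lock _reap "$@"; }

_report() {  # $1=job $2=factory $3=epoch; prints 409 when the lease is stale
  local stage holder epoch
  read -r stage holder epoch < "$COORD_DIR/jobs/$1.job"
  if [ "$holder" != "$2" ] || [ "$epoch" != "$3" ]; then
    echo 409
    return 1
  fi
  echo "done $2 $epoch" > "$COORD_DIR/jobs/$1.job"
  echo 200
}
report_job() { with_lock _report "$@"; }
```

In this sketch, once `mac-1` has claimed a job, a second `claim_job` for the same job fails because the stage is no longer `queued`. After `reap_job`, the survivor's claim carries the bumped epoch, so a late `report_job` from `mac-1` with the old epoch is answered with `409` and leaves the job file untouched.
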
## Run it
### Stub mode (default, zero dependencies, CI-safe)
```bash
bash demo/two-factory-demo.sh
```
Drives [`coordinator-stub.sh`](coordinator-stub.sh) — a stateful, lock-guarded, file-backed
coordinator that implements the same claim / lease / fence / reaper contract as
platform-service, via the existing `AQ_FLEET_API_CMD` test seam. No platform-service, no
Cosmos, no network. This is exactly what `selftest.sh` runs headlessly.
### Real-coordinator mode (against a live platform-service)
```bash
DEMO_MODE=real \
AQ_FLEET_API=http://localhost:4003/api \
AQ_FLEET_TOKEN=<bearer> \
AQ_PRODUCT_ID=<product> \
bash demo/two-factory-demo.sh
```
In real mode the demo submits via the platform-service fleet API and relies on the
coordinator's **own lease reaper** to reclaim the killed factory's job (it waits
`DEMO_REAP_WAIT` seconds; pair with a short `AQ_FLEET_LEASE_SECONDS` so the lease expires
quickly). The submit endpoint is overridable via `DEMO_SUBMIT_PATH` (default `/fleet/jobs`).
Real mode is observational/best-effort — the machine-checked assertions run in stub mode
(and in `selftest.sh`).
## Env knobs
| Var | Default | Meaning |
|-----|---------|---------|
| `DEMO_MODE` | `stub` | `stub` or `real` (auto-set to `real` when `AQ_FLEET_API`+`AQ_FLEET_TOKEN` are set and `DEMO_MODE` ≠ `stub`) |
| `DEMO_JOB_SLEEP` | `2` | per-job engine seconds — the window during which the victim is killed mid-job |
| `DEMO_TIMEOUT` | `60` | max seconds to wait for the survivor to drain all 3 jobs |
| `DEMO_POLL` | `0.2` | coordinator-state poll interval |
| `DEMO_FACTORY_1` / `DEMO_FACTORY_2` | `mac-1` / `ubuntu-1` | factory ids (F1 is the victim) |
| `DEMO_KEEP` | `0` | `1` keeps the temp dir (queues, logs, coordinator state) for inspection |
| `DEMO_REAP_WAIT` / `DEMO_DRAIN_WAIT` | `20` / `30` | real-mode waits for the coordinator reaper / drain |
## What to watch
The demo prints a step-by-step trace and a final `RESULTS` block. The key lines:
- `PARALLELISM observed: mac-1 and ubuntu-1 both holding active jobs concurrently` — guarantee (c).
- `killed factory mac-1 ... mid-job` then `reaper reclaimed mac-1's lease(s)` — the crash + reclaim.
- `zombie report for <job> @epoch=N was FENCED (HTTP 409)` — guarantee (b) fencing.
- `RESULTS` shows each job's winning factory; the reclaimed job's winner is the **survivor**.
With `DEMO_KEEP=1`, inspect under the printed temp dir:
- `coord/events.log` — the coordinator's audit trail: `CLAIM` / `PATCH:<stage>` / `RECLAIM` / `FENCE` events (factory + epoch on each).
- `coord/jobs/<id>.job` — final per-job `stage` / `holder` / `epoch`.
- `log-mac-1.txt`, `log-ubuntu-1.txt` — each factory's run-loop log (claims, the `▶ launching`, the fenced/quarantine path on the killed worker).
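
For a quick sanity check of a kept run, a small helper can tally the audit events. This is a rough sketch: the function name `summarize_events` is hypothetical, and it assumes only that each event name appears as a whole word on its log line. The exact line format of `events.log` is not specified here, so adjust the match if it differs.

```shell
# Hypothetical helper: counts audit events by type in a coordinator events log.
# Assumes the event name appears as a standalone word on each line.
summarize_events() {
  local log="${1:?path to coord/events.log}" ev
  for ev in CLAIM PATCH RECLAIM FENCE; do
    # grep -w matches whole words only, so CLAIM does not also count RECLAIM,
    # while PATCH still matches PATCH:<stage>.
    printf '%s=%s\n' "$ev" "$(grep -cw "$ev" "$log")"
  done
}
```

Run it as `summarize_events <tempdir>/coord/events.log`. In a run that exercised the kill path you would expect non-zero `RECLAIM` and `FENCE` counts.
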
## Files
- `two-factory-demo.sh` — the orchestrator (start factories, kill/reclaim/fence, assert).
- `coordinator-stub.sh` — the stateful coordinator stub (claim/patch/fence/renew/release/reap, mkdir-locked).
- `start-fleet.example.sh` — reference launcher for a **real** multi-product local
fleet against a live platform-service (one `agent-queue.sh run` daemon per
product). Parameterized via env; ships the two settings you must get right —
`AQ_FLEET_GATE=1` (M0 RU gate) and `AQ_FLEET_LEASE_RENEW_SEC=30` (heartbeat
cadence < the 90s stale threshold). Copy + adjust for your sandbox.
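
The cadence constraint is easy to get wrong when copying the launcher, so a pre-flight check can help. The sketch below is an illustration under assumptions: `check_heartbeat_cadence` is a hypothetical helper, not part of `start-fleet.example.sh`, and the 90s threshold is passed in by hand from the figure quoted above rather than read from the service.

```shell
# Hypothetical pre-flight check; the function name is an assumption.
# It fails when the heartbeat cadence is not strictly below the stale threshold.
check_heartbeat_cadence() {
  local renew="${1:?renew seconds}" stale="${2:?stale threshold seconds}"
  if [ "$renew" -lt "$stale" ]; then
    echo "ok: renew every ${renew}s < stale threshold ${stale}s"
  else
    echo "error: renew every ${renew}s would let leases go stale (threshold ${stale}s)" >&2
    return 1
  fi
}

# The two settings named above; the renew cadence stays env-overridable.
export AQ_FLEET_GATE=1
export AQ_FLEET_LEASE_RENEW_SEC="${AQ_FLEET_LEASE_RENEW_SEC:-30}"
check_heartbeat_cadence "$AQ_FLEET_LEASE_RENEW_SEC" 90
```

Placing a check like this before the daemons start means a mistyped cadence surfaces immediately instead of as leases that silently go stale.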