The operational _start_fleet.sh lives in a local (untracked) sandbox, so the gate + heartbeat-cadence settings weren't version-controlled anywhere. Add demo/start-fleet.example.sh: a parameterized, sanitized launcher (one agent-queue.sh run daemon per product against a live platform-service) that ships the two settings you must get right — AQ_FLEET_GATE=1 (M0 RU gate) and AQ_FLEET_LEASE_RENEW_SEC=30 (heartbeat cadence < the 90s stale threshold). No hardcoded paths/secrets; everything env-overridable. Documented in demo/README. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
84 lines
4.6 KiB
Markdown
84 lines
4.6 KiB
Markdown
# Two-Factory Parallel Demo (Phase-2 Exit Criteria, §14)
|
|
|
|
This demo closes the final Phase-2 exit-criteria box: **≥2 factories executing jobs in
|
|
parallel through one coordinator**, proving the concurrency guarantees end-to-end. It is a
|
|
**harness over the existing runtime** — it does *not* change `agent-queue.sh` or
|
|
`lib/fleet-client.sh`; it starts two real `agent-queue.sh run` daemons (distinct
|
|
factoryIds, separate queues/cwds) that compete **only** through the coordinator, then
|
|
observes and asserts.
|
|
|
|
## The three guarantees it proves
|
|
|
|
| # | Guarantee | How it's shown |
|
|
|---|-----------|----------------|
|
|
| **(a)** | **No double-assign** | Each of the 3 jobs is claimed/executed by exactly **one** factory. The coordinator's atomic claim (lock-guarded; only a `queued` job is claimable) means two concurrent claimers never get the same job version. |
|
|
| **(b)** | **Fencing + reclaim** | One factory is **killed mid-job**. The reaper returns its in-flight job to `queued` with a **bumped lease epoch**; the surviving factory **reclaims and completes** it. The dead worker's late/zombie report (stale epoch) is **fenced (HTTP 409)** and never ships. |
|
|
| **(c)** | **Parallelism** | Both factories hold an active job **simultaneously** (observed in coordinator state) — work is concurrent, not serialized. |
|
|
|
|
## Run it
|
|
|
|
### Stub mode (default, zero dependencies, CI-safe)
|
|
|
|
```bash
|
|
bash demo/two-factory-demo.sh
|
|
```
|
|
|
|
Drives [`coordinator-stub.sh`](coordinator-stub.sh) — a stateful, lock-guarded, file-backed
|
|
coordinator that implements the same claim / lease / fence / reaper contract as
|
|
platform-service, via the existing `AQ_FLEET_API_CMD` test seam. No platform-service, no
|
|
Cosmos, no network. This is exactly what `selftest.sh` runs headlessly.
|
|
|
|
### Real-coordinator mode (against a live platform-service)
|
|
|
|
```bash
|
|
DEMO_MODE=real \
|
|
AQ_FLEET_API=http://localhost:4003/api \
|
|
AQ_FLEET_TOKEN=<bearer> \
|
|
AQ_PRODUCT_ID=<product> \
|
|
bash demo/two-factory-demo.sh
|
|
```
|
|
|
|
In real mode the demo submits via the platform-service fleet API and relies on the
|
|
coordinator's **own lease reaper** to reclaim the killed factory's job (it waits
|
|
`DEMO_REAP_WAIT` seconds; pair with a short `AQ_FLEET_LEASE_SECONDS` so the lease expires
|
|
quickly). Submit endpoint is overridable via `DEMO_SUBMIT_PATH` (default `/fleet/jobs`).
|
|
Real mode is observational/best-effort — the machine-checked assertions run in stub mode
|
|
(and in `selftest.sh`).
|
|
|
|
## Env knobs
|
|
|
|
| Var | Default | Meaning |
|
|
|-----|---------|---------|
|
|
| `DEMO_MODE` | `stub` | `stub` or `real` (auto-set to `real` when `AQ_FLEET_API`+`AQ_FLEET_TOKEN` are set and `DEMO_MODE` ≠ `stub`) |
|
|
| `DEMO_JOB_SLEEP` | `2` | per-job engine seconds — the window during which the victim is killed mid-job |
|
|
| `DEMO_TIMEOUT` | `60` | max seconds to wait for the survivor to drain all 3 jobs |
|
|
| `DEMO_POLL` | `0.2` | coordinator-state poll interval |
|
|
| `DEMO_FACTORY_1` / `DEMO_FACTORY_2` | `mac-1` / `ubuntu-1` | factory ids (F1 is the victim) |
|
|
| `DEMO_KEEP` | `0` | `1` keeps the temp dir (queues, logs, coordinator state) for inspection |
|
|
| `DEMO_REAP_WAIT` / `DEMO_DRAIN_WAIT` | `20` / `30` | real-mode waits for the coordinator reaper / drain |
|
|
|
|
## What to watch
|
|
|
|
The demo prints a step-by-step trace and a final `RESULTS` block. The key lines:
|
|
|
|
- `PARALLELISM observed: mac-1 and ubuntu-1 both holding active jobs concurrently` — guarantee (c).
|
|
- `killed factory mac-1 ... mid-job` then `reaper reclaimed mac-1's lease(s)` — the crash + reclaim.
|
|
- `zombie report for <job> @epoch=N was FENCED (HTTP 409)` — guarantee (b) fencing.
|
|
- `RESULTS` shows each job's winning factory; the reclaimed job's winner is the **survivor**.
|
|
|
|
With `DEMO_KEEP=1`, inspect under the printed temp dir:
|
|
|
|
- `coord/events.log` — the coordinator's audit trail: `CLAIM` / `PATCH:<stage>` / `RECLAIM` / `FENCE` events (factory + epoch on each).
|
|
- `coord/jobs/<id>.job` — final per-job `stage` / `holder` / `epoch`.
|
|
- `log-mac-1.txt`, `log-ubuntu-1.txt` — each factory's run-loop log (claims, the `▶ launching`, the fenced/quarantine path on the killed worker).
|
|
|
|
## Files
|
|
|
|
- `two-factory-demo.sh` — the orchestrator (start factories, kill/reclaim/fence, assert).
|
|
- `coordinator-stub.sh` — the stateful coordinator stub (claim/patch/fence/renew/release/reap, mkdir-locked).
|
|
- `start-fleet.example.sh` — reference launcher for a **real** multi-product local
|
|
fleet against a live platform-service (one `agent-queue.sh run` daemon per
|
|
product). Parameterized via env; ships the two settings you must get right —
|
|
`AQ_FLEET_GATE=1` (M0 RU gate) and `AQ_FLEET_LEASE_RENEW_SEC=30` (heartbeat
|
|
cadence < the 90s stale threshold). Copy + adjust for your sandbox.
|