Close the final Phase-2 exit-criteria box: >=2 factories executing jobs in parallel through one coordinator, proving the concurrency guarantees end-to-end. This is a DEMO HARNESS over the existing runtime — agent-queue.sh and lib/fleet-client.sh are unchanged (read + called, not modified). demo/two-factory-demo.sh: starts two real `agent-queue.sh run` daemons (mac-1 + ubuntu-1, separate queues/cwds) that compete ONLY through the coordinator, then asserts: (a) no double-assign — each of 3 jobs executed by exactly one factory; (b) fencing + reclaim — kill a factory mid-job, the reaper returns its job, the survivor reclaims + completes it, and the dead worker's late/zombie report (stale leaseEpoch) is FENCED (HTTP 409, never shipped); (c) parallelism — both factories hold active jobs concurrently. Dual-mode: CI-safe stateful stub by default; live platform-service when AQ_FLEET_API/AQ_FLEET_TOKEN set. demo/coordinator-stub.sh: stateful, mkdir-lock-guarded, file-backed coordinator implementing claim/lease/fence/renew/release + reaper-reclaim via the existing AQ_FLEET_API_CMD seam — the selftest stub pattern extended with shared state so >=2 processes coordinate through one coordinator. demo/README.md: stub + real invocations, env knobs, what each guarantee proves, what-to-watch guide. selftest.sh: +3 headless stub-mode checks (existing 68 unchanged byte-for-byte -> 71 total green). docs/GIGAFACTORY_ROADMAP.md: tick the §14 two-factory-demo box; annotate Phase-2 exit criteria; bump §0 Phase 2 to 80% (remaining: scheduler-core wiring [common-plat PR #31], tracker-direct call, factory enrollment). bash 3.2 + awk/sed/grep/pgrep only; mac+linux safe; no new runtime deps. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
79 lines
4.2 KiB
Markdown
79 lines
4.2 KiB
Markdown
# Two-Factory Parallel Demo (Phase-2 Exit Criteria, §14)
|
|
|
|
This demo closes the final Phase-2 exit-criteria box: **≥2 factories executing jobs in
|
|
parallel through one coordinator**, proving the concurrency guarantees end-to-end. It is a
|
|
**harness over the existing runtime** — it does *not* change `agent-queue.sh` or
|
|
`lib/fleet-client.sh`; it starts two real `agent-queue.sh run` daemons (distinct
|
|
factoryIds, separate queues/cwds) that compete **only** through the coordinator, then
|
|
observes and asserts.
|
|
|
|
## The three guarantees it proves
|
|
|
|
| # | Guarantee | How it's shown |
|
|
|---|-----------|----------------|
|
|
| **(a)** | **No double-assign** | Each of the 3 jobs is claimed/executed by exactly **one** factory. The coordinator's atomic claim (lock-guarded; only a `queued` job is claimable) means two concurrent claimers never get the same job version. |
|
|
| **(b)** | **Fencing + reclaim** | One factory is **killed mid-job**. The reaper returns its in-flight job to `queued` with a **bumped lease epoch**; the surviving factory **reclaims and completes** it. The dead worker's late/zombie report (stale epoch) is **fenced (HTTP 409)** and never ships. |
|
|
| **(c)** | **Parallelism** | Both factories hold an active job **simultaneously** (observed in coordinator state) — work is concurrent, not serialized. |
|
|
|
|
## Run it
|
|
|
|
### Stub mode (default, zero dependencies, CI-safe)
|
|
|
|
```bash
|
|
bash demo/two-factory-demo.sh
|
|
```
|
|
|
|
Drives [`coordinator-stub.sh`](coordinator-stub.sh) — a stateful, lock-guarded, file-backed
|
|
coordinator that implements the same claim / lease / fence / reaper contract as
|
|
platform-service, via the existing `AQ_FLEET_API_CMD` test seam. No platform-service, no
|
|
Cosmos, no network. This is exactly what `selftest.sh` runs headlessly.
|
|
|
|
### Real-coordinator mode (against a live platform-service)
|
|
|
|
```bash
|
|
DEMO_MODE=real \
|
|
AQ_FLEET_API=http://localhost:4003/api \
|
|
AQ_FLEET_TOKEN=<bearer> \
|
|
AQ_PRODUCT_ID=<product> \
|
|
bash demo/two-factory-demo.sh
|
|
```
|
|
|
|
In real mode the demo submits via the platform-service fleet API and relies on the
|
|
coordinator's **own lease reaper** to reclaim the killed factory's job (it waits
|
|
`DEMO_REAP_WAIT` seconds; pair with a short `AQ_FLEET_LEASE_SECONDS` so the lease expires
|
|
quickly). Submit endpoint is overridable via `DEMO_SUBMIT_PATH` (default `/fleet/jobs`).
|
|
Real mode is observational/best-effort — the machine-checked assertions run in stub mode
|
|
(and in `selftest.sh`).
|
|
|
|
## Env knobs
|
|
|
|
| Var | Default | Meaning |
|
|
|-----|---------|---------|
|
|
| `DEMO_MODE` | `stub` | `stub` or `real` (auto-set to `real` when `AQ_FLEET_API`+`AQ_FLEET_TOKEN` are set and `DEMO_MODE` ≠ `stub`) |
|
|
| `DEMO_JOB_SLEEP` | `2` | per-job engine seconds — the window during which the victim is killed mid-job |
|
|
| `DEMO_TIMEOUT` | `60` | max seconds to wait for the survivor to drain all 3 jobs |
|
|
| `DEMO_POLL` | `0.2` | coordinator-state poll interval |
|
|
| `DEMO_FACTORY_1` / `DEMO_FACTORY_2` | `mac-1` / `ubuntu-1` | factory ids (F1 is the victim) |
|
|
| `DEMO_KEEP` | `0` | `1` keeps the temp dir (queues, logs, coordinator state) for inspection |
|
|
| `DEMO_REAP_WAIT` / `DEMO_DRAIN_WAIT` | `20` / `30` | real-mode waits for the coordinator reaper / drain |
|
|
|
|
## What to watch
|
|
|
|
The demo prints a step-by-step trace and a final `RESULTS` block. The key lines:
|
|
|
|
- `PARALLELISM observed: mac-1 and ubuntu-1 both holding active jobs concurrently` — guarantee (c).
|
|
- `killed factory mac-1 ... mid-job` then `reaper reclaimed mac-1's lease(s)` — the crash + reclaim.
|
|
- `zombie report for <job> @epoch=N was FENCED (HTTP 409)` — guarantee (b) fencing.
|
|
- `RESULTS` shows each job's winning factory; the reclaimed job's winner is the **survivor**.
|
|
|
|
With `DEMO_KEEP=1`, inspect under the printed temp dir:
|
|
|
|
- `coord/events.log` — the coordinator's audit trail: `CLAIM` / `PATCH:<stage>` / `RECLAIM` / `FENCE` events (factory + epoch on each).
|
|
- `coord/jobs/<id>.job` — final per-job `stage` / `holder` / `epoch`.
|
|
- `log-mac-1.txt`, `log-ubuntu-1.txt` — each factory's run-loop log (claims, the `▶ launching`, the fenced/quarantine path on the killed worker).
|
|
|
|
## Files
|
|
|
|
- `two-factory-demo.sh` — the orchestrator (start factories, kill/reclaim/fence, assert).
|
|
- `coordinator-stub.sh` — the stateful coordinator stub (claim/patch/fence/renew/release/reap, mkdir-locked).
|