docs(agent-queue): fleet integration section + roadmap P2-S3 ticks
README: "Fleet integration (Phase 2)" — AQ_FLEET flag, env table, claim/heartbeat/ report/fence/renew protocol, offline-degrade + quarantine, offline-vs-fleet explainer. Roadmap: tick the Phase-2 §14 factory-agent item, add a P2-S3 slice note, bump §0 Phase 2 -> 55%.
This commit is contained in:
parent
064dbf3d8f
commit
21ebf8b1b7
@ -378,6 +378,61 @@ execution**: an echo failure is logged and the job continues unchanged. With
|
||||
`AQ_TRACKER_AUTO=1` the worker echoes automatically on each transition; otherwise
|
||||
echo is manual. `status` / `insights` surface the `tracker-item` and last echoed status.
|
||||
|
||||
## Fleet integration (Phase 2)
|
||||
|
||||
Behind the `AQ_FLEET` flag, the runner becomes a **factory** that registers,
|
||||
heartbeats, claims, and reports against the platform-service `fleet` coordinator —
|
||||
so coordinator jobs run alongside local `.md` files on the same host. All
|
||||
coordinator logic lives in [`lib/fleet-client.sh`](lib/fleet-client.sh) (curl-only +
|
||||
POSIX awk, sourced by `agent-queue.sh`); the few hook points in the runner are all
|
||||
gated on `fleet_enabled`.
|
||||
|
||||
> **Offline vs fleet mode.** With `AQ_FLEET` unset/`0` (the default) the runner is
|
||||
> the pure offline git-queue described above — **zero** coordinator calls, behavior
|
||||
> byte-for-byte unchanged. With `AQ_FLEET=1` the run loop also registers + claims
|
||||
> from the coordinator, reports fenced stage transitions, renews leases, and (in
|
||||
> fleet mode) routes the outcome echo through the coordinator's `fleet_events`
|
||||
> instead of the direct tracker echo. The tracker echo remains the offline path.
|
||||
|
||||
```bash
|
||||
AQ_FLEET=1 AQ_FLEET_TOKEN=… AQ_PRODUCT_ID=… agent-queue.sh fleet-status # register + show identity
|
||||
AQ_FLEET=1 AQ_FLEET_TOKEN=… AQ_PRODUCT_ID=… agent-queue.sh run # claim + execute coordinator jobs
|
||||
```
|
||||
|
||||
### Config (env)
|
||||
|
||||
| Var | Default | Meaning |
|
||||
| --- | ------- | ------- |
|
||||
| `AQ_FLEET` | `0` | master switch — `1` enables coordinator integration; `0`/unset = offline git-queue |
|
||||
| `AQ_FLEET_API` | `http://localhost:4003/api` | coordinator base URL (already includes `/api`) |
|
||||
| `AQ_FLEET_TOKEN` | _(none)_ | bearer token — never hardcode |
|
||||
| `AQ_PRODUCT_ID` | _(none)_ | productId (sent as `X-Product-Id`; shared with the tracker config) |
|
||||
| `AQ_FACTORY_ID` | `<hostname>-<pid>` | stable factory identity for this process |
|
||||
| `AQ_FLEET_LEASE_RENEW_SEC` | `300` | heartbeat / lease-renew cadence |
|
||||
| `AQ_FLEET_CAPS` | _(auto)_ | override the auto-detected capability tokens (comma/space list) |
|
||||
| `AQ_FLEET_CWD` | `$PWD` | cwd a claimed coordinator job runs in |
|
||||
| `AQ_FLEET_API_CMD` | _(none)_ | test seam: a stub that replaces the curl HTTP entirely (selftest uses it) |
|
||||
|
||||
### Protocol (claim / heartbeat / report / fence / renew)
|
||||
|
||||
- **register / heartbeat:** `POST /fleet/factories/heartbeat {factoryId, capabilities[], health, load}` — registration *is* the first heartbeat; re-sent on `AQ_FLEET_LEASE_RENEW_SEC` cadence.
|
||||
- **claim:** `POST /fleet/claim {factoryId, capabilities[], leaseSeconds}`. A returned job (`id`, `bodyMd`, `leaseEpoch`) is materialized as a transient local `.md` (frontmatter `fleet-job-id` + `fleet-lease-epoch`) so the existing runner executes it unchanged, interleaved with local files.
|
||||
- **report (fenced):** each stage transition (`building`/`review`/`testing`/`shipped`/`failed`) is `PATCH /fleet/jobs/:id {stage, leaseEpoch, checkpoint?}`. The coordinator writes `fleet_events` server-side. The payload carries only stage/epoch/checkpoint — **never** the prompt/`bodyMd` or token.
|
||||
- **fencing (§18):** if a report/renew returns **conflict/409** (stale `leaseEpoch` → the coordinator reclaimed us), the worker **self-aborts**: it stops, does **not** ship/merge, and **quarantines** the local result to `failed/` (`result=fenced_quarantine`) for human triage. A reclaimed zombie can never corrupt coordinator state.
|
||||
- **lease renew / release:** `POST /fleet/jobs/:id/lease/renew` while building (fenced); `…/lease/release` on terminal stages.
|
||||
- **checkpoint:** the WIP `{wipBranch, wipCommit}` is sent with the building report so a reclaim can resume (§25).
|
||||
|
||||
### Offline-degrade + quarantine (§9)
|
||||
|
||||
If the coordinator is **unreachable** mid-job (5xx / connection error), the report
|
||||
is treated as *degraded* (logged, `fleet_degraded=1`): the in-flight job **finishes
|
||||
locally** rather than being abandoned. On the next reachable call the worker
|
||||
presents its `leaseEpoch`; if the coordinator now reports it **stale** (it was
|
||||
reclaimed during the outage), the local result is **quarantined** (marked, not
|
||||
auto-shipped) and surfaced for human triage — split-brain is resolved in favor of
|
||||
the coordinator without losing the work. `status` shows the factory id + per-job
|
||||
`fleet=<id>@e<epoch>`; `insights` lists the `fleet_*` fields.
|
||||
|
||||
## Config (env overrides)
|
||||
|
||||
| Var | Default | Meaning |
|
||||
|
||||
@ -12,7 +12,7 @@
|
||||
| ----- | ----- | ------ | - | ---- |
|
||||
| **0** | Baseline (today) | ✅ shipped | 100% | `selftest.sh` green |
|
||||
| **1** | Manifest + profiles + capabilities + tracker adapter (single host) | ◐ in progress | 95% | adapter e2e + selftest |
|
||||
| **2** | Coordinator as platform-service module + Cosmos + multi-factory leasing | ☐ not started | 0% | fleet e2e + module tests |
|
||||
| **2** | Coordinator as platform-service module + Cosmos + multi-factory leasing | ◐ in progress | 55% | fleet e2e + module tests |
|
||||
| **3** | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ☐ not started | 0% | web e2e + router tests |
|
||||
| **4** | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite |
|
||||
| **5** | Self-optimizing / learned routing | ☐ not started | 0% | offline eval + A/B |
|
||||
@ -367,10 +367,20 @@ Each phase: **Goal → checklist → Exit criteria**. Don't start a phase until
|
||||
### Phase 2 — Coordinator as platform-service module + Cosmos + multi-factory leasing
|
||||
**Goal:** the service spine; ≥2 real factories executing in parallel via leases.
|
||||
|
||||
> **Slice progress — P2-S3 (factory-agent integration, single host):** the bash runner
|
||||
> is now a coordinator **factory** behind `AQ_FLEET` — `lib/fleet-client.sh` (curl-only,
|
||||
> sourced) registers via heartbeat, claims jobs into inbox (interleaved with local `.md`),
|
||||
> reports **fenced** stage transitions with WIP checkpoints, renews/releases leases, and on
|
||||
> a stale `leaseEpoch` (reclaimed) **self-aborts + quarantines** the local result. Coordinator
|
||||
> 5xx/connection errors **degrade** (finish locally) rather than abandon work. When `AQ_FLEET`
|
||||
> is off the offline git-queue path is byte-for-byte unchanged. Remaining P2: scheduler/router
|
||||
> core, direct tracker→module calls, factory enrollment + scoped tokens, `fleet.*` feature
|
||||
> flags + shadow/dual-run, and the two-factory parallel demo (the Phase-2 exit criteria).
|
||||
|
||||
- [x] Scaffold `fleet`/`orchestrator` module in `platform-service` (`types/repository/routes`, Zod, ESM, `productId`). *(PR #28)*
|
||||
- [x] Cosmos containers (§13) + repository layer (memory + Cosmos providers). *(PR #28; `fleet_artifacts` blob wiring still pending.)*
|
||||
- [x] **Atomic claim** (optimistic concurrency / `_etag`) + **lease reaper** + **fencing (`leaseEpoch`)** endpoints (§4/§8/§9) — *not* Cosmos-TTL-driven reclaim. *(common-plat PR #28 + #29; truly atomic via `updateIfMatch`.)*
|
||||
- [ ] Port `agent-queue` runner to a **factory agent** API client (enroll/register/heartbeat/claim/report, fencing-aware) while keeping git-queue fallback.
|
||||
- [x] Port `agent-queue` runner to a **factory agent** API client (enroll/register/heartbeat/claim/report, fencing-aware) while keeping git-queue fallback. *(P2-S3: `lib/fleet-client.sh` behind `AQ_FLEET`; registers via heartbeat, claims into inbox, reports fenced stage transitions, renews leases, quarantines on stale-epoch; offline git-queue unchanged when the flag is off.)*
|
||||
- [ ] Scheduler/router core (§7) as a pure module (fixed weights) + wired into atomic assignment.
|
||||
- [ ] Tracker adapter calls the module directly (not just file export).
|
||||
- [ ] Auth: factory enrollment + scoped rotatable tokens; secret isolation enforced (§12 subset).
|
||||
|
||||
Loading…
Reference in New Issue
Block a user