diff --git a/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md b/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
index 5cf858d..51b7c7e 100644
--- a/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
+++ b/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
@@ -20,6 +20,12 @@
 > §5.3 with the complete-on-claim model (broker is not the redelivery path),
 > aligned §6 token scoping with per-factory subscriptions, and added the GC /
 > `POST /fleet/fail` checklist block to §12.
+> - v4 (2026-05-31): **coverage audit** — roadmap now maps 1:1 to the design via a
+> coverage matrix. Closed plan gaps: **M-prep** (decisions/§10 + schema +
+> containers + RBAC), correlation filter + dispatcher budget enforcement (M1),
+> small-messages/body-from-Cosmos + token re-check + alerting (M2), and new
+> **Testing** and **Rollback & flags** blocks. No design element is now without
+> an implementation step.
 
 ---
 
@@ -408,6 +414,39 @@ Acceptance gate for the whole effort: **idle work-find RU ≈ 0**, the
 assignment + crash recovery still hold on multi-host, and every step is
 flag-gated + reversible.
 
+### Coverage matrix (design → plan)
+
+Every design element maps to a checklist block below — no design decision is left
+without an implementation step.
+
+| Design element | §ref | Plan block |
+| --- | --- | --- |
+| Idle-poll RU bleed | §1.2 | M0 |
+| Product-as-queue / wrong-factory | §1.1, §5.4 | Routing-model fix |
+| Open questions / decisions | §10 | M-prep |
+| Schema, containers, RBAC | §4, §5, §6 | M-prep |
+| Service Bus topic + subscriptions + filters | §4.2 | M-prep, M1 |
+| Change-feed dispatcher + scheduler + targeting | §4.3, §5.1 | M1 |
+| Budget enforcement at assign | §6 | M1 |
+| Claim/fence + complete-on-claim | §5.2 | M2 |
+| Small messages (body from Cosmos) | §4.2 | M2 |
+| Token re-check at claim | §6 | M2 |
+| Metrics + alerting | §9 | M2 |
+| Failure→lease release, GC, same-repo clobber | §5.5 | Error handling & cleanup |
+| Scale-to-zero on-demand | §3, §5.1 | M3 |
+| Tests (dispatcher, CAS/fencing, GC, shadow) | §9 | Testing |
+| Rollback / flags per phase | §3 | Rollback & flags |
+| Doc updates | — | Docs to update |
+
+### M-prep — Decisions & schema (closes §10; before M1)
+- [ ] Lock dispatcher placement (platform-service loop vs separate worker) + **leader election** so a single active dispatcher avoids double-publish (§10 Q2).
+- [ ] Lock Service Bus tier (Standard default; Premium only for sessions / large messages / VNet) (§10 Q3).
+- [ ] Lock subscription model (per-factory correlation filter default; single-subscription SQL filter if factory churn is high) (§10 Q1).
+- [ ] Confirm the Cosmos poll path stays as a **permanent** flag-gated fallback (`AQ_FLEET_ROUTE=0`) (§10 Q4).
+- [ ] Confirm repo-advertisement source: `repos[]` in the heartbeat, derived from `AQ_FLEET_REPO_BASE` (§10 Q5).
+- [ ] Schema: add `targetFactoryId` to `FleetJobDoc`, `repos[]` to `FleetFactoryDoc`; register a new `fleet_queue_state` doc (`/productId`) for the M0 gate; provision the change-feed **lease container**; update the container registry / `COSMOS_AUTO_INIT`.
+- [ ] RBAC via managed identity: dispatcher = Service Bus **Sender**, factories = **Listener** on their own subscription; no shared keys committed.
+
 ### M0 — RU quick win (no new infra)
 - [ ] Add per-product `queue_version`/`pending_count` doc; bump on submit + stage change.
 - [ ] Factory loop point-reads the gate each tick; run `listJobs`/claim only when it changed.
@@ -422,16 +461,19 @@ flag-gated + reversible.
 - [ ] Add product→repo ownership validation (reject/route mismatches) — the A1 safety net.
 
 ### M1 — Broker in shadow
-- [ ] Provision Service Bus `fleet-dispatch` topic + per-factory subscriptions (managed identity, no keys).
-- [ ] Change-feed dispatcher (leader-elected) tails `fleet_jobs`, runs scheduler, publishes targeted messages (`MessageId=jobId`, dup-detection on).
+- [ ] Provision Service Bus `fleet-dispatch` topic + per-factory subscriptions, each with a **correlation filter** `targetFactoryId=''` (managed identity, no keys).
+- [ ] Change-feed dispatcher (leader-elected) tails `fleet_jobs`, runs scheduler, stamps `targetFactoryId` (CAS), publishes targeted messages (`MessageId=jobId`, dup-detection on).
+- [ ] Dispatcher enforces per-product **budget** (paused / ceiling) before publishing (relocates the `claimNextJob` budget check, §6).
 - [ ] Publish in **shadow** alongside the Cosmos claim path; record route divergence (no action taken).
 - [ ] Verify: ≥ N hours shadow with broker-route == scorer-pick within tolerance.
 
 ### M2 — Cutover delivery
 - [ ] Factories consume from Service Bus; `/fleet/accept` does the Cosmos CAS claim + returns `leaseEpoch`.
+- [ ] Messages carry `{jobId, productId, repo, caps, priority, targetFactoryId}` only; the consumer **reads the full job body from Cosmos** by `jobId` (256 KB limit, §4.2).
+- [ ] `/fleet/accept` (and `/fleet/claim`) **re-checks the §12 factory token** (productId + caps + factoryId) before granting the lease.
 - [ ] Implement **complete-on-claim** (reaper + change-feed re-dispatch owns liveness).
 - [ ] Cosmos poll path retained as flag-gated fallback (`AQ_FLEET_ROUTE=0`).
-- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag.
+- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag — **and wire alerts** (DLQ depth > 0, dispatch lag > threshold) into existing monitoring.
 - [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
 
 ### Error handling & cleanup (lands with M2) — see §5.5
@@ -445,6 +487,15 @@ flag-gated + reversible.
 - [ ] Optional warm-pool (1 small instance) if cold-start latency matters.
 - [ ] Verify: zero idle workers + zero idle RU; cold-start latency within target.
 
+### Testing (every phase — tests are sacred)
+- [ ] Unit: dispatcher scheduling + publish, claim CAS + `leaseEpoch` fencing, `/fleet/fail`, GC idempotency, the M0 gate read/skip logic.
+- [ ] Integration: shadow-divergence harness (M1), exactly-once + crash recovery (M2), scale-to-zero behavior (M3).
+- [ ] Extend `agent-queue/selftest.sh` + platform-service `vitest`; **CI green is the gate** to advance each phase.
+
+### Rollback & flags (per phase)
+- [ ] Each phase ships behind a flag with a documented one-line rollback: M0 `AQ_FLEET_GATE`, M1 shadow (publishes but never acts), M2 `AQ_FLEET_ROUTE` / broker-source toggle, M3 scaler disable.
+- [ ] Verify each rollback returns to the prior working path with **no data loss** and no stranded leases/messages.
+
 ### Docs to update on completion
 - [ ] `GIGAFACTORY_ROADMAP.md` — tick Phase 4; correct the stale §0 progress table.
 - [ ] `GIGAFACTORY_SYSTEM_OVERVIEW.md` — add the broker/dispatcher to the architecture + code map.
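
Review note on the M2 items in this diff (token re-check, CAS claim returning `leaseEpoch`) and the "claim CAS + `leaseEpoch` fencing" unit-test item: the flow might be sketched roughly as below. This is an illustration under assumptions, not the platform-service code. `JobStore`, `FactoryToken`, `acceptJob`, `isCurrentLease`, the `status`/`leaseOwner`/`etag` fields, and the 403/404 codes are hypothetical names introduced here; only `jobId`, `productId`, `caps`, `targetFactoryId`, `leaseEpoch`, and the 409 claim conflict come from the plan. An in-memory object stands in for Cosmos, with a numeric `etag` playing the role of the optimistic-concurrency check.

```typescript
// Hypothetical document shape; see the note above for which fields are assumed.
interface JobDoc {
  jobId: string;
  productId: string;
  caps: string[];
  targetFactoryId: string;
  status: "queued" | "claimed";
  leaseEpoch: number;
  leaseOwner?: string;
  etag: number; // stand-in for the store's optimistic-concurrency tag
}

// Hypothetical token scope: productId + caps + factoryId, as listed in M2.
interface FactoryToken {
  factoryId: string;
  productIds: string[];
  caps: string[];
}

type AcceptResult =
  | { status: 200; leaseEpoch: number }
  | { status: 403 | 404 | 409; reason: string };

// In-memory stand-in for the jobs container with compare-and-swap on etag.
class JobStore {
  private docs: Record<string, JobDoc> = {};

  put(doc: JobDoc): void {
    this.docs[doc.jobId] = { ...doc };
  }

  read(jobId: string): JobDoc | undefined {
    const doc = this.docs[jobId];
    return doc ? { ...doc } : undefined;
  }

  // Writes only if the stored etag still matches the one the caller read.
  replaceIfMatch(next: JobDoc, expectedEtag: number): boolean {
    const current = this.docs[next.jobId];
    if (!current || current.etag !== expectedEtag) return false;
    this.docs[next.jobId] = { ...next, etag: expectedEtag + 1 };
    return true;
  }
}

// Token re-check: the factory must be the target and cover product + caps.
function tokenAllows(token: FactoryToken, job: JobDoc): boolean {
  return (
    token.factoryId === job.targetFactoryId &&
    token.productIds.indexOf(job.productId) !== -1 &&
    job.caps.every((cap) => token.caps.indexOf(cap) !== -1)
  );
}

// Sketch of the accept step: re-check the token, then claim via CAS.
function acceptJob(store: JobStore, token: FactoryToken, jobId: string): AcceptResult {
  const job = store.read(jobId);
  if (!job) return { status: 404, reason: "unknown job" };
  if (!tokenAllows(token, job)) return { status: 403, reason: "token scope mismatch" };
  if (job.status !== "queued") return { status: 409, reason: "already claimed" };

  const claimed: JobDoc = {
    ...job,
    status: "claimed",
    leaseOwner: token.factoryId,
    leaseEpoch: job.leaseEpoch + 1,
  };
  if (!store.replaceIfMatch(claimed, job.etag)) {
    return { status: 409, reason: "lost the claim race" };
  }
  return { status: 200, leaseEpoch: claimed.leaseEpoch };
}

// Fencing: a caller holding an older epoch (or another owner) is rejected.
function isCurrentLease(store: JobStore, jobId: string, factoryId: string, leaseEpoch: number): boolean {
  const job = store.read(jobId);
  return !!job && job.leaseOwner === factoryId && job.leaseEpoch === leaseEpoch;
}
```

Checking the token before the job status means an out-of-scope factory never learns whether a job is still claimable; whether that ordering matches the intended `/fleet/accept` behavior is a design question this sketch does not settle.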