docs(gigafactory): v4 coverage audit — roadmap maps 1:1 to design, no gaps

Adds a coverage matrix + M-prep (decisions/§10, schema, containers, RBAC) and
closes plan gaps: correlation filter + dispatcher budget (M1); small messages,
token re-check, alerting (M2); plus Testing and Rollback & flags blocks.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
saravanakumardb1 2026-05-31 22:47:54 -07:00
parent 9f24a7fdd0
commit 29afe59604


@ -20,6 +20,12 @@
> §5.3 with the complete-on-claim model (broker is not the redelivery path),
> aligned §6 token scoping with per-factory subscriptions, and added the GC /
> `POST /fleet/fail` checklist block to §12.
> - v4 (2026-05-31): **coverage audit** — roadmap now maps 1:1 to the design via a
> coverage matrix. Closed plan gaps: **M-prep** (decisions/§10 + schema +
> containers + RBAC), correlation filter + dispatcher budget enforcement (M1),
> small-messages/body-from-Cosmos + token re-check + alerting (M2), and new
> **Testing** and **Rollback & flags** blocks. No design element is now without
> an implementation step.
---
@ -408,6 +414,39 @@ Acceptance gate for the whole effort: **idle work-find RU ≈ 0**, the
assignment + crash recovery still hold on multi-host, and every step is
flag-gated + reversible.
### Coverage matrix (design → plan)
Every design element maps to a checklist block below — no design decision is left
without an implementation step.
| Design element | §ref | Plan block |
| --- | --- | --- |
| Idle-poll RU bleed | §1.2 | M0 |
| Product-as-queue / wrong-factory | §1.1, §5.4 | Routing-model fix |
| Open questions / decisions | §10 | M-prep |
| Schema, containers, RBAC | §4, §5, §6 | M-prep |
| Service Bus topic + subscriptions + filters | §4.2 | M-prep, M1 |
| Change-feed dispatcher + scheduler + targeting | §4.3, §5.1 | M1 |
| Budget enforcement at assign | §6 | M1 |
| Claim/fence + complete-on-claim | §5.2 | M2 |
| Small messages (body from Cosmos) | §4.2 | M2 |
| Token re-check at claim | §6 | M2 |
| Metrics + alerting | §9 | M2 |
| Failure→lease release, GC, same-repo clobber | §5.5 | Error handling & cleanup |
| Scale-to-zero on-demand | §3, §5.1 | M3 |
| Tests (dispatcher, CAS/fencing, GC, shadow) | §9 | Testing |
| Rollback / flags per phase | §3 | Rollback & flags |
| Doc updates | — | Docs to update |
### M-prep — Decisions & schema (closes §10; before M1)
- [ ] Lock dispatcher placement (platform-service loop vs separate worker) + **leader election** so a single active dispatcher avoids double-publish (§10 Q2).
- [ ] Lock Service Bus tier (Standard default; Premium only for sessions / large messages / VNet) (§10 Q3).
- [ ] Lock subscription model (per-factory correlation filter default; single-subscription SQL filter if factory churn is high) (§10 Q1).
- [ ] Confirm the Cosmos poll path stays as a **permanent** flag-gated fallback (`AQ_FLEET_ROUTE=0`) (§10 Q4).
- [ ] Confirm repo-advertisement source: `repos[]` in the heartbeat, derived from `AQ_FLEET_REPO_BASE` (§10 Q5).
- [ ] Schema: add `targetFactoryId` to `FleetJobDoc`, `repos[]` to `FleetFactoryDoc`; register a new `fleet_queue_state` doc (`/productId`) for the M0 gate; provision the change-feed **lease container**; update the container registry / `COSMOS_AUTO_INIT`.
- [ ] RBAC via managed identity: dispatcher = Service Bus **Sender**, factories = **Listener** on their own subscription; no shared keys committed.
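
The schema item above can be pictured as a set of document shapes. This is only a sketch: `targetFactoryId`, `repos[]`, the `/productId` partition key, and `queue_version`/`pending_count` come from the plan, while every other field, the camelCase spellings, and the `eligibleFactories` helper are assumptions made for illustration.

```typescript
// Sketch of the M-prep schema additions. Fields marked "assumed" are not in the plan.
interface FleetJobDoc {
  id: string;               // assumed: the jobId doubles as the document id
  productId: string;
  repo: string;             // assumed field name for the repo a job targets
  targetFactoryId?: string; // from the plan: stamped by the dispatcher, absent until routed
}

interface FleetFactoryDoc {
  id: string;      // assumed: the factoryId doubles as the document id
  repos: string[]; // from the plan: advertised in the heartbeat
}

interface FleetQueueStateDoc {
  id: string;           // assumed
  productId: string;    // from the plan: partition key /productId
  queueVersion: number; // `queue_version` in the plan
  pendingCount: number; // `pending_count` in the plan
}

// Hypothetical helper: which factories advertise the repo a job needs?
function eligibleFactories(job: FleetJobDoc, factories: FleetFactoryDoc[]): string[] {
  return factories
    .filter((f) => f.repos.indexOf(job.repo) !== -1)
    .map((f) => f.id);
}
```

A helper like this is one way the scheduler could use `repos[]` when choosing a `targetFactoryId`; the real targeting rules live in the design sections referenced by the matrix.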
### M0 — RU quick win (no new infra)
- [ ] Add per-product `queue_version`/`pending_count` doc; bump on submit + stage change.
- [ ] Factory loop point-reads the gate each tick; run `listJobs`/claim only when it changed.
@ -422,16 +461,19 @@ flag-gated + reversible.
- [ ] Add product→repo ownership validation (reject/route mismatches) — the A1 safety net.
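
The M0 gate read/skip logic might look roughly like the following. It is a minimal sketch assuming the gate doc exposes a monotonically bumped version; `GateDoc`, `GateReader`, and `makeFactoryTick` are hypothetical names, and the reader is synchronous only to keep the example self-contained.

```typescript
// Sketch of the M0 gate: one cheap point-read per tick, full scan only on change.
interface GateDoc {
  queueVersion: number; // bumped on submit + stage change
  pendingCount: number;
}

type GateReader = (productId: string) => GateDoc;

function makeFactoryTick(
  readGate: GateReader,
  listAndClaim: (productId: string) => void, // stands in for listJobs + claim
) {
  // Last gate version this factory acted on, per product.
  const lastSeen: Record<string, number> = {};

  return function tick(productId: string): boolean {
    const gate = readGate(productId); // the point-read
    if (lastSeen[productId] === gate.queueVersion) {
      return false; // unchanged: skip the expensive query entirely
    }
    lastSeen[productId] = gate.queueVersion;
    listAndClaim(productId);
    return true;
  };
}
```

With this shape an idle queue costs one point-read per tick and no query, which is the behavior the M0 block is after.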
### M1 — Broker in shadow
- [ ] Provision Service Bus `fleet-dispatch` topic + per-factory subscriptions, each with a **correlation filter** `targetFactoryId='<id>'` (managed identity, no keys).
- [ ] Change-feed dispatcher (leader-elected) tails `fleet_jobs`, runs scheduler, stamps `targetFactoryId` (CAS), publishes targeted messages (`MessageId=jobId`, dup-detection on).
- [ ] Dispatcher enforces per-product **budget** (paused / ceiling) before publishing (relocates the `claimNextJob` budget check, §6).
- [ ] Publish in **shadow** alongside the Cosmos claim path; record route divergence (no action taken).
- [ ] Verify: ≥ N hours shadow with broker-route == scorer-pick within tolerance.
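
One dispatcher step from the M1 list (budget check, scheduler pick, CAS stamp, targeted publish) could be sketched as below. All type and function names are hypothetical, the store and bus are passed in as plain callbacks rather than real Cosmos or Service Bus clients, and treating a missing budget record as "no limit" is an assumption. The message carries `targetFactoryId` as an application property on the assumption that this is what the subscription's correlation filter matches on.

```typescript
// Sketch of one dispatcher step; in-memory stand-ins, hypothetical names throughout.
interface Job { jobId: string; productId: string; repo: string; etag: number; targetFactoryId?: string; }
interface Budget { paused: boolean; spent: number; ceiling: number; }
interface Message {
  messageId: string; // = jobId, so broker-side duplicate detection can drop re-publishes
  applicationProperties: { targetFactoryId: string };
  body: { jobId: string; productId: string; repo: string };
}

type DispatchResult = "published" | "budget_blocked" | "no_factory" | "cas_lost";

function dispatchOne(
  job: Job,
  budgets: Record<string, Budget>,
  pickFactory: (job: Job) => string | undefined, // the scheduler
  casStamp: (jobId: string, expectedEtag: number, factoryId: string) => boolean,
  publish: (m: Message) => void,
): DispatchResult {
  // 1. Budget enforcement happens before anything is published.
  const budget = budgets[job.productId];
  if (budget && (budget.paused || budget.spent >= budget.ceiling)) return "budget_blocked";

  // 2. The scheduler chooses a target factory.
  const factoryId = pickFactory(job);
  if (factoryId === undefined) return "no_factory";

  // 3. Stamp targetFactoryId with compare-and-swap; a lost race means another
  //    dispatcher instance already handled this job.
  if (!casStamp(job.jobId, job.etag, factoryId)) return "cas_lost";

  // 4. Publish a small, targeted message.
  publish({
    messageId: job.jobId,
    applicationProperties: { targetFactoryId: factoryId },
    body: { jobId: job.jobId, productId: job.productId, repo: job.repo },
  });
  return "published";
}
```

Stamping before publishing means a crash between steps 3 and 4 leaves a stamped but unpublished job, so some re-dispatch path has to pick it up; the plan assigns that to the reaper and change-feed re-dispatch.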
### M2 — Cutover delivery
- [ ] Factories consume from Service Bus; `/fleet/accept` does the Cosmos CAS claim + returns `leaseEpoch`.
- [ ] Messages carry `{jobId, productId, repo, caps, priority, targetFactoryId}` only; the consumer **reads the full job body from Cosmos** by `jobId` (256 KB limit, §4.2).
- [ ] `/fleet/accept` (and `/fleet/claim`) **re-checks the §12 factory token** (productId + caps + factoryId) before granting the lease.
- [ ] Implement **complete-on-claim** (reaper + change-feed re-dispatch owns liveness).
- [ ] Cosmos poll path retained as flag-gated fallback (`AQ_FLEET_ROUTE=0`).
- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag, **and wire alerts** (DLQ depth > 0, dispatch lag > threshold) into existing monitoring.
- [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
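
The claim path in the M2 list (token re-check, CAS claim, `leaseEpoch` returned for fencing) can be sketched as follows. This is an illustration under assumptions, not the service's handler: the token and job shapes are hypothetical, the 403 and 404 statuses are assumed (only the 409 claim conflict appears in the plan), and an in-memory object stands in for the conditional write against Cosmos.

```typescript
// Sketch of /fleet/accept: re-check the token, then claim with CAS semantics.
interface FactoryToken { factoryId: string; productIds: string[]; caps: string[]; }
interface JobRecord {
  jobId: string;
  productId: string;
  caps: string[];          // capabilities the job requires
  targetFactoryId: string;
  status: "queued" | "claimed";
  leaseEpoch: number;      // incremented on every successful claim
  ownerFactoryId?: string;
}
type AcceptResult =
  | { status: 200; leaseEpoch: number }
  | { status: 403 }
  | { status: 404 }
  | { status: 409 };

// The token must cover the factory, the product, and every required capability.
function tokenAllows(token: FactoryToken, job: JobRecord): boolean {
  return (
    token.factoryId === job.targetFactoryId &&
    token.productIds.indexOf(job.productId) !== -1 &&
    job.caps.every((c) => token.caps.indexOf(c) !== -1)
  );
}

function accept(store: Record<string, JobRecord>, token: FactoryToken, jobId: string): AcceptResult {
  const job = store[jobId];
  if (!job) return { status: 404 };
  if (!tokenAllows(token, job)) return { status: 403 };
  if (job.status !== "queued") return { status: 409 }; // someone else holds the lease
  job.status = "claimed";
  job.ownerFactoryId = token.factoryId;
  job.leaseEpoch += 1;
  return { status: 200, leaseEpoch: job.leaseEpoch };
}

// Fencing: later writes from a worker are honored only if its epoch is current.
function isCurrentLease(store: Record<string, JobRecord>, jobId: string, factoryId: string, leaseEpoch: number): boolean {
  const job = store[jobId];
  return !!job && job.status === "claimed" && job.ownerFactoryId === factoryId && job.leaseEpoch === leaseEpoch;
}
```

Because the message is completed on claim, a duplicate delivery of the same `jobId` lands on the 409 branch instead of starting a second run.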
### Error handling & cleanup (lands with M2) — see §5.5
@ -445,6 +487,15 @@ flag-gated + reversible.
- [ ] Optional warm-pool (1 small instance) if cold-start latency matters.
- [ ] Verify: zero idle workers + zero idle RU; cold-start latency within target.
### Testing (every phase — tests are sacred)
- [ ] Unit: dispatcher scheduling + publish, claim CAS + `leaseEpoch` fencing, `/fleet/fail`, GC idempotency, the M0 gate read/skip logic.
- [ ] Integration: shadow-divergence harness (M1), exactly-once + crash recovery (M2), scale-to-zero behavior (M3).
- [ ] Extend `agent-queue/selftest.sh` + platform-service `vitest`; **CI green is the gate** to advance each phase.
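
The core of the shadow-divergence harness could be as small as the sketch below. It assumes each shadow sample records both the broker's route and the scorer's pick for the same job; `ShadowSample`, `divergenceRate`, and `shadowGatePasses` are hypothetical names, and the hour and tolerance thresholds are left as parameters because the plan does not fix them.

```typescript
// Sketch of the M1 shadow comparison: how often did broker-route differ from scorer-pick?
interface ShadowSample {
  jobId: string;
  brokerRoute: string; // factory the dispatcher targeted
  scorerPick: string;  // factory the existing Cosmos claim path chose
}

function divergenceRate(samples: ShadowSample[]): number {
  if (samples.length === 0) return 0;
  const diverged = samples.filter((s) => s.brokerRoute !== s.scorerPick).length;
  return diverged / samples.length;
}

// The phase gate: enough shadow time observed, and divergence within tolerance.
function shadowGatePasses(
  samples: ShadowSample[],
  observedHours: number,
  minHours: number,
  tolerance: number,
): boolean {
  return observedHours >= minHours && divergenceRate(samples) <= tolerance;
}
```

An empty sample set yields a rate of 0, so callers may want to require a minimum sample count as well as a minimum number of hours before trusting the result.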
### Rollback & flags (per phase)
- [ ] Each phase ships behind a flag with a documented one-line rollback: M0 `AQ_FLEET_GATE`, M1 shadow (publishes but never acts), M2 `AQ_FLEET_ROUTE` / broker-source toggle, M3 scaler disable.
- [ ] Verify each rollback returns to the prior working path with **no data loss** and no stranded leases/messages.
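
Flag resolution for the per-phase rollbacks might be sketched like this. Only the flag names and the meaning of `AQ_FLEET_ROUTE=0` (Cosmos poll fallback) come from the plan; treating `"1"` as "enabled", defaulting an unset flag to the older path, and the `FleetFlags` shape are assumptions. The environment is passed in as a plain object so the function stays testable.

```typescript
// Sketch: resolve fleet flags so that unset or "0" always means the prior working path.
type WorkSource = "cosmos-poll" | "service-bus";

interface FleetFlags {
  workSource: WorkSource; // M2: where factories find work
  gateEnabled: boolean;   // M0: point-read gate before listJobs/claim
}

function resolveFlags(env: Record<string, string | undefined>): FleetFlags {
  return {
    // AQ_FLEET_ROUTE=0 is the documented fallback; "1" as the broker route is assumed.
    workSource: env["AQ_FLEET_ROUTE"] === "1" ? "service-bus" : "cosmos-poll",
    // Assumed convention: the gate is on only when explicitly set to "1".
    gateEnabled: env["AQ_FLEET_GATE"] === "1",
  };
}
```

Defaulting to the older path means a missing or mistyped flag degrades to known-good behavior, which keeps each rollback a one-line change.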
### Docs to update on completion
- [ ] `GIGAFACTORY_ROADMAP.md` — tick Phase 4; correct the stale §0 progress table.
- [ ] `GIGAFACTORY_SYSTEM_OVERVIEW.md` — add the broker/dispatcher to the architecture + code map.