docs(gigafactory): v4 coverage audit — roadmap maps 1:1 to design, no gaps
Adds a coverage matrix + M-prep (decisions/§10, schema, containers, RBAC) and closes plan gaps: correlation filter + dispatcher budget (M1); small messages, token re-check, alerting (M2); plus Testing and Rollback & flags blocks.

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
parent 9f24a7fdd0
commit 29afe59604
@@ -20,6 +20,12 @@
 > §5.3 with the complete-on-claim model (broker is not the redelivery path),
 > aligned §6 token scoping with per-factory subscriptions, and added the GC /
 > `POST /fleet/fail` checklist block to §12.
+> - v4 (2026-05-31): **coverage audit** — roadmap now maps 1:1 to the design via a
+> coverage matrix. Closed plan gaps: **M-prep** (decisions/§10 + schema +
+> containers + RBAC), correlation filter + dispatcher budget enforcement (M1),
+> small-messages/body-from-Cosmos + token re-check + alerting (M2), and new
+> **Testing** and **Rollback & flags** blocks. No design element is now without
+> an implementation step.
 
 ---
 
@@ -408,6 +414,39 @@ Acceptance gate for the whole effort: **idle work-find RU ≈ 0**, the
 assignment + crash recovery still hold on multi-host, and every step is
 flag-gated + reversible.
 
+### Coverage matrix (design → plan)
+
+Every design element maps to a checklist block below — no design decision is left
+without an implementation step.
+
+| Design element | §ref | Plan block |
+| --- | --- | --- |
+| Idle-poll RU bleed | §1.2 | M0 |
+| Product-as-queue / wrong-factory | §1.1, §5.4 | Routing-model fix |
+| Open questions / decisions | §10 | M-prep |
+| Schema, containers, RBAC | §4, §5, §6 | M-prep |
+| Service Bus topic + subscriptions + filters | §4.2 | M-prep, M1 |
+| Change-feed dispatcher + scheduler + targeting | §4.3, §5.1 | M1 |
+| Budget enforcement at assign | §6 | M1 |
+| Claim/fence + complete-on-claim | §5.2 | M2 |
+| Small messages (body from Cosmos) | §4.2 | M2 |
+| Token re-check at claim | §6 | M2 |
+| Metrics + alerting | §9 | M2 |
+| Failure→lease release, GC, same-repo clobber | §5.5 | Error handling & cleanup |
+| Scale-to-zero on-demand | §3, §5.1 | M3 |
+| Tests (dispatcher, CAS/fencing, GC, shadow) | §9 | Testing |
+| Rollback / flags per phase | §3 | Rollback & flags |
+| Doc updates | — | Docs to update |
+
+### M-prep — Decisions & schema (closes §10; before M1)
+- [ ] Lock dispatcher placement (platform-service loop vs separate worker) + **leader election** so a single active dispatcher avoids double-publish (§10 Q2).
+- [ ] Lock Service Bus tier (Standard default; Premium only for sessions / large messages / VNet) (§10 Q3).
+- [ ] Lock subscription model (per-factory correlation filter default; single-subscription SQL filter if factory churn is high) (§10 Q1).
+- [ ] Confirm the Cosmos poll path stays as a **permanent** flag-gated fallback (`AQ_FLEET_ROUTE=0`) (§10 Q4).
+- [ ] Confirm repo-advertisement source: `repos[]` in the heartbeat, derived from `AQ_FLEET_REPO_BASE` (§10 Q5).
+- [ ] Schema: add `targetFactoryId` to `FleetJobDoc`, `repos[]` to `FleetFactoryDoc`; register a new `fleet_queue_state` doc (`/productId`) for the M0 gate; provision the change-feed **lease container**; update the container registry / `COSMOS_AUTO_INIT`.
+- [ ] RBAC via managed identity: dispatcher = Service Bus **Sender**, factories = **Listener** on their own subscription; no shared keys committed.
+
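The schema bullet in M-prep could translate into doc shapes roughly like the ones below. This is a minimal sketch, not the repository's actual types: only `targetFactoryId`, `repos`, `productId`, `queue_version` and `pending_count` are named in the plan, while the `id` and `repo` fields and the helpers `eligibleFactories` and `bumpQueueState` are hypothetical names introduced for illustration.

```typescript
// Hypothetical doc shapes. Only targetFactoryId, repos, productId,
// queue_version and pending_count come from the plan; `id` and `repo` are assumed.
interface FleetJobDoc {
  id: string;
  productId: string;
  repo: string;
  targetFactoryId?: string; // unset until the dispatcher assigns a factory
}

interface FleetFactoryDoc {
  id: string;
  repos: string[]; // repos advertised in the heartbeat
}

interface FleetQueueStateDoc {
  productId: string; // partition key (/productId)
  queue_version: number;
  pending_count: number;
}

// Ids of the factories that advertise the job's repo.
function eligibleFactories(job: FleetJobDoc, factories: FleetFactoryDoc[]): string[] {
  return factories
    .filter((f) => f.repos.indexOf(job.repo) !== -1)
    .map((f) => f.id);
}

// Returns the next gate state: the version always increases by one,
// and the pending count moves by the given delta (for example +1 on submit).
function bumpQueueState(state: FleetQueueStateDoc, pendingDelta: number): FleetQueueStateDoc {
  return {
    productId: state.productId,
    queue_version: state.queue_version + 1,
    pending_count: state.pending_count + pendingDelta,
  };
}
```

Keeping `targetFactoryId` optional lets existing job docs stay valid until the dispatcher stamps them.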
 ### M0 — RU quick win (no new infra)
 - [ ] Add per-product `queue_version`/`pending_count` doc; bump on submit + stage change.
 - [ ] Factory loop point-reads the gate each tick; run `listJobs`/claim only when it changed.
@@ -422,16 +461,19 @@ flag-gated + reversible.
 - [ ] Add product→repo ownership validation (reject/route mismatches) — the A1 safety net.
 
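The M0 gate (point-read each tick, list/claim only on change) can be sketched as follows. It assumes the gate doc exposes `queue_version` and `pending_count` as described above. `makeGatedTick`, `readGate` and `listAndClaim` are hypothetical names, and the calls are synchronous only to keep the example short; real Cosmos reads would be asynchronous.

```typescript
interface QueueGate {
  queue_version: number;
  pending_count: number;
}

// Builds a tick function that skips the expensive list/claim pass
// whenever the gate's version is unchanged since the previous tick.
// The returned function reports whether the list/claim pass ran.
function makeGatedTick(readGate: () => QueueGate, listAndClaim: () => void): () => boolean {
  let lastSeenVersion: number | null = null;
  return () => {
    const gate = readGate(); // cheap point read of the per-product gate doc
    if (gate.queue_version === lastSeenVersion) {
      return false; // nothing changed: no query is issued
    }
    lastSeenVersion = gate.queue_version;
    listAndClaim(); // stands in for the listJobs/claim path
    return true;
  };
}
```

Under this sketch an idle factory costs one point read per tick, which is the behaviour the idle-RU acceptance gate is looking for.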
 ### M1 — Broker in shadow
-- [ ] Provision Service Bus `fleet-dispatch` topic + per-factory subscriptions (managed identity, no keys).
+- [ ] Provision Service Bus `fleet-dispatch` topic + per-factory subscriptions, each with a **correlation filter** `targetFactoryId='<id>'` (managed identity, no keys).
-- [ ] Change-feed dispatcher (leader-elected) tails `fleet_jobs`, runs scheduler, publishes targeted messages (`MessageId=jobId`, dup-detection on).
+- [ ] Change-feed dispatcher (leader-elected) tails `fleet_jobs`, runs scheduler, stamps `targetFactoryId` (CAS), publishes targeted messages (`MessageId=jobId`, dup-detection on).
+- [ ] Dispatcher enforces per-product **budget** (paused / ceiling) before publishing (relocates the `claimNextJob` budget check, §6).
 - [ ] Publish in **shadow** alongside the Cosmos claim path; record route divergence (no action taken).
 - [ ] Verify: ≥ N hours shadow with broker-route == scorer-pick within tolerance.
 
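The M1 dispatcher steps (budget check, targeting, `MessageId=jobId`, shadow comparison) might look like the sketch below. It is illustrative only. The `Budget` shape, `planDispatch`, `isDivergent` and the first-match targeting rule are all assumptions, and the real scheduler presumably scores factories rather than taking the first one that advertises the repo.

```typescript
interface Budget { paused: boolean; spent: number; ceiling: number }
interface PendingJob { jobId: string; productId: string; repo: string }
interface FactoryInfo { factoryId: string; repos: string[] }
interface DispatchMessage { messageId: string; targetFactoryId: string; productId: string }

// Decides whether a job may be published and to which factory.
// Returns null when the product is paused, over its ceiling,
// or when no factory advertises the job's repo.
function planDispatch(job: PendingJob, budget: Budget, factories: FactoryInfo[]): DispatchMessage | null {
  if (budget.paused || budget.spent >= budget.ceiling) {
    return null; // budget is enforced before anything is published
  }
  for (const f of factories) {
    if (f.repos.indexOf(job.repo) !== -1) {
      // messageId equals jobId so broker duplicate detection can drop re-publishes
      return { messageId: job.jobId, targetFactoryId: f.factoryId, productId: job.productId };
    }
  }
  return null;
}

// Shadow mode: compare the broker route with the scorer pick and only record the result.
function isDivergent(brokerRoute: string | null, scorerPick: string | null): boolean {
  return brokerRoute !== scorerPick;
}
```

In shadow, a divergence would only be counted. The Cosmos claim path still decides who runs the job.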
 ### M2 — Cutover delivery
 - [ ] Factories consume from Service Bus; `/fleet/accept` does the Cosmos CAS claim + returns `leaseEpoch`.
+- [ ] Messages carry `{jobId, productId, repo, caps, priority, targetFactoryId}` only; the consumer **reads the full job body from Cosmos** by `jobId` (256 KB limit, §4.2).
+- [ ] `/fleet/accept` (and `/fleet/claim`) **re-checks the §12 factory token** (productId + caps + factoryId) before granting the lease.
 - [ ] Implement **complete-on-claim** (reaper + change-feed re-dispatch owns liveness).
 - [ ] Cosmos poll path retained as flag-gated fallback (`AQ_FLEET_ROUTE=0`).
-- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag.
+- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag — **and wire alerts** (DLQ depth > 0, dispatch lag > threshold) into existing monitoring.
 - [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
 
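The M2 accept path (small envelope, token re-check, claim that returns `leaseEpoch`) can be sketched with an in-memory record standing in for Cosmos. Everything here is an assumption except the envelope field names and the 409 claim-conflict status from the checklist: the `JobRecord` shape, the 403 and 404 statuses, and the `accept` function are hypothetical, and a real claim would use a conditional (ETag) write rather than mutating an object.

```typescript
interface FactoryToken { productId: string; caps: string[]; factoryId: string }
interface DispatchEnvelope {
  jobId: string; productId: string; repo: string;
  caps: string[]; priority: number; targetFactoryId: string;
}
interface JobRecord {
  productId: string;
  caps: string[];
  targetFactoryId: string;
  leaseOwner: string | null;
  leaseEpoch: number;
  body: string; // the full job body lives in the store, not in the message
}
type AcceptResult =
  | { status: 200; leaseEpoch: number; body: string }
  | { status: 403 | 404 | 409 };

function accept(
  jobs: Record<string, JobRecord | undefined>,
  envelope: DispatchEnvelope,
  token: FactoryToken,
): AcceptResult {
  const job = jobs[envelope.jobId];
  if (job === undefined) return { status: 404 };
  // Token re-check: product, factory and capabilities must all match the stored job.
  const capsOk = job.caps.every((c) => token.caps.indexOf(c) !== -1);
  if (token.productId !== job.productId || token.factoryId !== job.targetFactoryId || !capsOk) {
    return { status: 403 };
  }
  if (job.leaseOwner !== null) return { status: 409 }; // already claimed by someone
  job.leaseOwner = token.factoryId;
  job.leaseEpoch += 1; // fencing value: stale holders carry an older epoch
  return { status: 200, leaseEpoch: job.leaseEpoch, body: job.body };
}
```

Because the claim is decided against the stored record, a redelivered or duplicated message resolves to a 409 instead of a second lease.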
 ### Error handling & cleanup (lands with M2) — see §5.5
@@ -445,6 +487,15 @@ flag-gated + reversible.
 - [ ] Optional warm-pool (1 small instance) if cold-start latency matters.
 - [ ] Verify: zero idle workers + zero idle RU; cold-start latency within target.
 
+### Testing (every phase — tests are sacred)
+- [ ] Unit: dispatcher scheduling + publish, claim CAS + `leaseEpoch` fencing, `/fleet/fail`, GC idempotency, the M0 gate read/skip logic.
+- [ ] Integration: shadow-divergence harness (M1), exactly-once + crash recovery (M2), scale-to-zero behavior (M3).
+- [ ] Extend `agent-queue/selftest.sh` + platform-service `vitest`; **CI green is the gate** to advance each phase.
+
+### Rollback & flags (per phase)
+- [ ] Each phase ships behind a flag with a documented one-line rollback: M0 `AQ_FLEET_GATE`, M1 shadow (publishes but never acts), M2 `AQ_FLEET_ROUTE` / broker-source toggle, M3 scaler disable.
+- [ ] Verify each rollback returns to the prior working path with **no data loss** and no stranded leases/messages.
+
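Flag resolution for the rollback path could be as small as the sketch below. The plan only states that `AQ_FLEET_ROUTE=0` selects the Cosmos poll fallback and that M0 is gated by `AQ_FLEET_GATE`. Treating `"1"` as the enabling value and falling back to the Cosmos poll path when the flag is unset are assumptions, as are the names `deliveryPath` and `gateEnabled`.

```typescript
type Env = Record<string, string | undefined>;
type DeliveryPath = "cosmos-poll" | "service-bus";

// Only an explicit "1" selects the broker (assumed value); "0", unset or
// anything else falls back to the Cosmos poll path, so rollback is one variable.
function deliveryPath(env: Env): DeliveryPath {
  return env["AQ_FLEET_ROUTE"] === "1" ? "service-bus" : "cosmos-poll";
}

// The M0 gate is on only when AQ_FLEET_GATE is explicitly "1" (assumed value).
function gateEnabled(env: Env): boolean {
  return env["AQ_FLEET_GATE"] === "1";
}
```

Passing the environment in as a plain record keeps both helpers easy to cover in the unit tests the Testing block calls for.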
 ### Docs to update on completion
 - [ ] `GIGAFACTORY_ROADMAP.md` — tick Phase 4; correct the stale §0 progress table.
 - [ ] `GIGAFACTORY_SYSTEM_OVERVIEW.md` — add the broker/dispatcher to the architecture + code map.