From 8217a864e9e04d4305d6d383f3fbf3c14f15e818 Mon Sep 17 00:00:00 2001
From: saravanakumardb1 <saravanakumardb1@users.noreply.github.com>
Date: Sun, 31 May 2026 22:24:31 -0700
Subject: [PATCH] docs(gigafactory): add Phase-4 fleet dispatch redesign
 (broker + on-demand)

Proposes moving fleet work-dispatch off Cosmos busy-polling onto Azure Service
Bus in a coordinator-owns-scheduling / broker-owns-delivery hybrid, fixing the
product-as-queue routing smell and the idle-poll RU cost. Includes phased
migration (M0 RU quick win -> shadow -> cutover -> scale-to-zero) with a ticked
checklist. Self-reviewed (v2) for the outbox/change-feed, message-size,
long-job lock, idempotency, and routing-model consistency issues.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
---
 .../GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md    | 392 ++++++++++++++++++
 agent-queue/docs/GIGAFACTORY/README.md        |   1 +
 2 files changed, 393 insertions(+)
 create mode 100644 agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md

diff --git a/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md b/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
new file mode 100644
index 0000000..f430972
--- /dev/null
+++ b/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
@@ -0,0 +1,392 @@
+# Fleet Dispatch Redesign — Broker-Backed, On-Demand Factories
+
+> Design proposal (no code yet). Companion to `GIGAFACTORY_SYSTEM_OVERVIEW.md`
+> (what exists today) and `GIGAFACTORY_ROADMAP.md` (source-of-truth spec). This
+> doc realizes roadmap **Phase 4** ("Message bus + autoscaling") and the
+> routing-model cleanup that comes with it. Last reviewed: **2026-05-31**.
+>
+> **Review log**
+> - v1 (2026-05-31): initial proposal.
+> - v2 (2026-05-31): self-review pass — reconciled the routing model
+>   (coordinator-targeted as primary), fixed the Cosmos outbox transactionality
+>   claim (change feed *is* the log), constrained message size (jobId + routing
+>   props only), addressed long-job vs Service Bus 5-min lock, corrected the
+>   idempotency key (`MessageId = jobId`), renamed migration steps `M0–M3` to
+>   avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a
+>   ticked roadmap checklist + auth/observability notes.
+
+---
+
+## 1. Why this doc exists (the two smells)
+
+Two structural problems surfaced while running the local fleet against
+`tracker-web` + `platform-service`:
+
+### 1.1 Product-as-queue is conflated with repo-as-work-target
+
+- `fleet_jobs` is partitioned by **`/productId`**, and a factory is bound to a
+  single product via `AQ_PRODUCT_ID`. The job's **`repo`** is just a payload
+  field (the PR target). Routing uses `productId`; the repo is orthogonal.
+- Consequence observed: a `learning_ai_notes` job submitted via the form was
+  filed under **`chronomind`** (because the form's Factory dropdown maps
+  `mac-2 → chronomind`), and would have opened a PR to the notes repo from a
+  "chronomind" factory. Nothing ties the product to the repo, and nothing
+  guarantees the chosen factory even has that repo checked out.
+- The form (`dashboards/tracker-web/.../fleet/jobs/page.tsx`) hardcodes
+  `FLEET_FACTORIES = [mac-1→lysnrai, mac-2→chronomind]` and defaults
+  `capabilities = "build"` — a capability **no agent-queue factory ever
+  advertises** (`detect_capabilities` only emits `os:*`, `engine:*`, `node:*`,
+  `has:*`). So default UI submissions are unroutable to live factories.
+
+### 1.2 Pull-poll daemons burn Cosmos RU to stay "ready"
+
+- The run loop iterates every **`POLL_SECONDS=3`**; with `AQ_FLEET_ROUTE=1`
+  (default) each iteration calls `POST /fleet/claim`.
+- `claimNextJob` runs `repo.listJobs({ productId })` — **reads every job doc in
+  the product partition, no stage filter, no limit** — on every claim, plus a
+  `getLease` point-read per active job when preemption is on.
+- One process **per product** (`_start_fleet.sh` spawns 4) ⇒ ~`4 × (1/3s)` ≈
+  **115k claim queries/day at idle**, each scaling with partition size, billed
+  continuously whether or not work exists. The machine must also stay up running
+  the loop.
+
+> **Root cause:** `productId` is doing double duty as *tenant/billing scope* and
+> *work-routing queue*, and work discovery is a busy-poll against the state store.
+
+---
+
+## 2. Goals, non-goals, constraints
+
+**Goals**
+- Eliminate idle-poll RU cost; pay (near) zero when there is no work.
+- Make a factory a **generic build worker** (host + capabilities + engines +
+  checked-out repos), not a product-bound process.
+- Route work by what actually matters (**capabilities + repo**), while keeping
+  per-product **billing, budgets, visibility, and token scoping**.
+- Preserve the existing **weighted scheduler** and **leaseEpoch fencing**
+  (exactly-once assignment, zombie-writer protection).
+- Enable later **on-demand spawn** (scale-to-zero) without re-architecting.
+
+**Non-goals (this phase)**
+- Replacing Cosmos as the system of record for job/lease/event/budget **state**.
+- Rewriting the scheduler's scoring math.
+- Multi-region / cross-cloud dispatch.
+
+**Hard constraints (ecosystem rules)**
+- Every Cosmos doc keeps a `productId` (platform rule) — product stays a
+  first-class **tag**, even when it is no longer the routing key.
+- Per-product budgets (`fleet_budgets /productId`), enrollment tokens (§12), and
+  the `tracker-web` per-product views must keep working.
+- Changes must be flag-gated and reversible (match the existing
+  `AQ_FLEET` / `AQ_FLEET_ROUTE` / `AQ_FLEET_SHADOW` cutover discipline).
+
+---
+
+## 3. Decision summary
+
+1. **Do NOT build A3 ("single shared queue") inside Cosmos.** A single logical
+   queue tempts a hot partition; scaling it forces a synthetic partition key and
+   a **cross-partition "find next job" query**, which *increases* RU — the
+   opposite of the goal. It also dissolves the per-product isolation the
+   platform's tenancy/budget/token model depends on.
+2. **Get the shared-queue behavior from a real broker (B3), not from Cosmos.**
+   Adopt **Azure Service Bus** as the dispatch substrate. Cosmos remains
+   product-partitioned for **state**; the broker owns **delivery**.
+3. **Keep the scheduler.** Use a **coordinator-owns-scheduling /
+   broker-owns-delivery hybrid** (B2 ⊕ B3): the coordinator decides *which
+   factory* should run a job and pushes a **targeted** message; the broker
+   handles transport, visibility timeout, retries, and dead-lettering.
+4. **Ship the cheap RU win first (B1) as Phase 0** — it is reversible, needs no
+   new infra, and de-risks the broker migration by removing the bleed while the
+   bigger change is built and shadowed.
+
+> Net: the shared-queue *experience* (generic workers, one work stream) comes
+> from Service Bus topics/subscriptions; Cosmos stays `/productId`-partitioned
+> for state, budgets, and visibility.
+
+---
+
+## 4. Target architecture
+
+### 4.1 Components & ownership
+
+| Concern | Owner (target) | Notes |
+| --- | --- | --- |
+| Job/lease/event/budget **state** | **Cosmos** (`/productId`, `/jobId` as today) | unchanged system of record |
+| **Scheduling** (which factory) | **Coordinator** (platform-service) | existing weighted scorer + preemption |
+| **Dispatch / delivery** | **Service Bus** | competing consumers, visibility timeout, DLQ |
+| **Fencing** (zombie writers) | **Cosmos `leaseEpoch`** | broker visibility ≠ correctness boundary |
+| Per-product billing/budgets/tokens | **Cosmos + coordinator** | enforced at submit + assign, not by partition |
+| Control planes | `tracker-web`, `agent-queue` dashboard | unchanged REST surface |
+
+### 4.2 Service Bus topology
+
+- One **topic** `fleet-dispatch`.
+- **Primary model — coordinator-targeted (preserves the scheduler):** the
+  coordinator picks the factory, then publishes a message stamped with
+  `targetFactoryId`. Each factory has its **own subscription** with a
+  **correlation filter** `targetFactoryId = '<me>'`. The broker does no policy —
+  it just delivers the scorer's decision. **This is the model the rest of this
+  doc assumes.**
+- **Fallback model — self-select (only if the scheduler is disabled):**
+  capability/repo **SQL filters** on message application properties let consumers
+  self-match. Multi-valued `capabilities` do **not** filter cleanly as one
+  string, so encode each as a boolean property (`cap_os_mac=true`,
+  `repo_learning_ai_notes=true`) rather than `LIKE '%…%'`. Subscription filters
+  are why Service Bus beats Storage Queue / SQS (which can't filter → a
+  queue-per-class sprawl).
+- **Messages stay small.** A message carries only
+  `{ jobId, productId, repo, caps, priority, targetFactoryId }` — **not**
+  `bodyMd`/manifest. The consumer reads the full job from Cosmos by `jobId`.
+  (Service Bus max message is **256 KB** Standard / 1 MB Premium; job bodies can
+  approach that — reinforcing "broker = transport, Cosmos = state".)
+- **DLQ** per subscription ⇒ maps onto `failed` / `retries_exhausted`.
+- **Sessions** (optional) keyed by `repo` to serialize same-repo work and avoid
+  worktree/branch contention on one host.
+
+### 4.3 Why this keeps the scheduler
+
+A vanilla broker is FIFO competing-consumers and does **no** weighted scoring.
+To preserve the existing scorer (`capabilityFit / affinity / load / costFit /
+health / starvation`) + preemption + seat limits, the coordinator stays in the
+decision path: it **selects the target factory** and publishes a message whose
+filter routes it to *that* factory's subscription (or a per-factory
+subscription). The broker is transport, not policy.
+
+---
+
+## 5. Key flows
+
+### 5.1 Submit → dispatch (consistency)
+
+The **Cosmos change feed on `fleet_jobs` is the durable, ordered event log**, so
+no separate outbox container is needed for the primary design:
+
+1. `submitJob` writes the `fleet_jobs` doc (`stage: queued`). That write *is* the
+   event.
+2. A single **dispatcher** (coordinator process) tails the `fleet_jobs` change
+   feed (via a lease container), runs the scheduler for each new/`queued` job,
+   stamps `assignedFactoryId` on the job (CAS), and **publishes** the targeted
+   Service Bus message.
+3. **Crash-safe & idempotent:** the change feed redelivers from the last
+   checkpoint on dispatcher restart; Service Bus **duplicate detection** keyed on
+   **`MessageId = jobId`** collapses re-publishes. The consumer is idempotent
+   because the authoritative claim is a Cosmos CAS on `leaseEpoch` — a second
+   delivery is simply fenced (`leaseEpoch` is assigned *at claim*, so it is **not**
+   a valid dedup key for the message itself).
+
+> A separate **transactional outbox** is only needed if you ever publish *inline*
+> at submit instead of via the change feed. Cross-container writes are **not
+> atomic** in Cosmos, so an outbox row would have to live in the **same container
+> + same partition** as the job and be written with a **Cosmos transactional
+> batch** — or, simpler, carried as an `outboxState` field on the job doc itself.
+> The change-feed design avoids this entirely.
+
+> Net effect: the per-factory busy-poll is replaced by one change-feed-driven
+> dispatcher. Idle cost is event-driven, not a per-3s full-partition scan.
+
+### 5.2 Deliver → claim → fence
+
+1. Factory **receives** a message (long-poll/`receiveMessages`, no RU).
+2. Factory calls `POST /fleet/claim` (or a lighter `/fleet/accept`) with
+   `{ jobId, factoryId }`. Coordinator does the **CAS lease** in Cosmos exactly
+   as today (`revUpdateJob` + `leaseEpoch` bump) and returns the new epoch.
+   409 ⇒ fenced ⇒ factory abandons the message (it goes back / to DLQ).
+3. The **broker lock** governs redelivery (a dead consumer's message reappears);
+   the **Cosmos `leaseEpoch`** governs *correctness* (a zombie writer is rejected
+   on PATCH). Two distinct mechanisms — do not collapse them.
+4. **Long-running jobs vs the broker lock.** Service Bus message lock max is
+   **5 minutes**; a coding job runs far longer. Two viable patterns:
+   - **(recommended) complete-on-claim:** complete the message immediately after
+     a successful Cosmos claim. The **Cosmos lease + reaper** then own liveness —
+     on crash the reaper sets the job back to `queued`, which is a change-feed
+     event that **re-dispatches** (§5.1). This decouples job runtime from the
+     5-min lock entirely.
+   - **renew-lock:** keep the message locked and call `renewMessageLock` on a
+     timer, reusing the existing `AQ_FLEET_LEASE_RENEW_SEC` cadence to renew
+     *both* the Cosmos lease and the broker lock. Simpler delivery semantics, but
+     couples runtime to the broker and risks redelivery storms on long jobs.
+
+### 5.3 Failure / retry / DLQ
+
+- Engine failure / verify-fail ⇒ coordinator transitions `failed`; broker
+  message completed (don't redeliver a logical failure).
+- Crash / lease-expiry ⇒ broker redelivers after visibility timeout **and** the
+  existing reaper reclaims the Cosmos lease (bumps epoch). The redelivered
+  message re-runs `/fleet/claim`, which fences the old holder.
+- Exhausted retries ⇒ broker DLQ + Cosmos `retries_exhausted`.
+
+### 5.4 Routing model (the §1.1 fix)
+
+- Job carries `repo` + required `capabilities` (real tokens: `os:*`, `engine:*`,
+  `has:git`, plus a new `repo:<name>` token).
+- The **scheduler** does the matching: it picks among factories that advertise
+  those caps **and** have the repo locally (or can clone it), then targets the
+  winner (§4.2 primary model: message stamped `targetFactoryId`, delivered via
+  that factory's correlation-filtered subscription).
+- **Product is a property/tag** used for billing/visibility and budget checks —
+  **not** the routing key. (In the self-select fallback, product/caps/repo become
+  subscription SQL filters instead.)
+- Fix the `tracker-web` form in lockstep: derive factories/repos from live data,
+  drop the bogus default `capabilities = "build"`, and stop hardcoding
+  `mac-1/mac-2`.
+
+---
+
+## 6. Per-product tenancy without product-partitioned queues
+
+- **Budgets:** checked by the coordinator at **assign time** (it already reads
+  `fleet_budgets /productId` in `claimNextJob`); unchanged, just moved to the
+  dispatcher.
+- **Tokens (§12):** the factory token still scopes `productId + capabilities +
+  factoryId`; the coordinator enforces it on `/fleet/claim`. A factory may
+  consume cross-product messages **only** for products its token covers — model
+  this as per-product (or per-token) subscriptions/filters so least-privilege is
+  preserved.
+- **Visibility:** `tracker-web` keeps querying per product (state is still
+  product-partitioned), so the UX is unchanged.
+
+---
+
+## 7. Alternatives considered
+
+| Option | Verdict | Reason |
+| --- | --- | --- |
+| **A3 shared queue in Cosmos** | ✗ | hot partition; cross-partition claim = more RU; loses tenancy isolation |
+| **A1 validate ownership only** | partial | fixes "wrong factory" but not the RU/poll model or process-per-product |
+| **Storage Queue / SQS broker** | ✗ (for now) | no subscription filters ⇒ queue-per-capability sprawl; weaker DLQ/visibility ergonomics |
+| **B2 change feed, no broker** | viable | good for dispatch signal, but still needs a transport to *reach* factories; pairs naturally with B3 |
+| **Plain competing-consumers (drop scheduler)** | ✗ | throws away weighted scoring + preemption + cost/affinity routing |
+| **B3 Service Bus + coordinator hybrid** | ✓ chosen | zero idle RU, keeps scheduler + fencing, filters give capability/repo routing, paves path to B4 |
+
+---
+
+## 8. Phased migration
+
+> Steps are labelled **M0–M3** to avoid collision with the roadmap's Phase 0–5
+> numbering; all of M0–M3 sit *inside* roadmap **Phase 4**. The ticked checklist
+> is in §12.
+
+### M0 — RU quick win (no new infra, fully reversible) — *do now*
+- Add a per-product `queue_version`/`pending_count` doc bumped on submit/stage
+  change. The factory's loop does a **1-RU point-read** of that doc each tick and
+  only runs the expensive `listJobs`/claim when it changed.
+- Raise `POLL_SECONDS` (e.g. 3 → 15–30) and add jittered backoff when idle.
+- Expected: **~10–50× fewer claim queries at idle**, behavior otherwise
+  identical. Gate behind a flag; trivially revertible.
+
+### M1 — Stand up the broker in **shadow**
+- Provision Service Bus (`fleet-dispatch` topic + subscriptions) with
+  **managed-identity** auth (no connection-string keys in env/`.env`). Coordinator
+  publishes messages **in parallel** with the existing claim path but factories
+  still source work from Cosmos. Use the existing `AQ_FLEET_SHADOW` discipline:
+  record divergence (did the broker route match the scorer's pick?) without
+  acting on it.
+
+### M2 — Cutover delivery to the broker
+- Flip a flag so factories source work from Service Bus + `/fleet/claim` for
+  fencing; Cosmos poll path becomes the fallback only. Keep the reaper + lease
+  fencing untouched. Validate exactly-once + crash recovery on multi-host.
+
+### M3 — On-demand factories (B4)
+- KEDA / Container Apps scale-to-zero on subscription depth: spin a factory only
+  when depth > 0; idle ⇒ **zero** running workers and zero RU. Warm-pool a single
+  small instance if cold-start latency matters.
+
+---
+
+## 9. Risks & mitigations
+
+| Risk | Mitigation |
+| --- | --- |
+| Dual source-of-truth (broker + Cosmos) drift | change-feed *is* the log (no separate outbox); SB duplicate-detection on `MessageId=jobId`; claim is a Cosmos CAS on `leaseEpoch` |
+| Broker lock vs `leaseEpoch` confusion | explicit rule: broker lock = *delivery*, `leaseEpoch` = *correctness*; never merge (§5.2) |
+| Long job > 5-min broker lock | **complete-on-claim** (reaper + change feed re-dispatch) or `renewMessageLock` on the lease cadence (§5.2) |
+| Message > 256 KB | message carries `jobId` + routing props only; consumer reads body from Cosmos (§4.2) |
+| Same-repo worktree contention across hosts | Service Bus **sessions** keyed by `repo` to serialize same-repo jobs |
+| Lost scheduler features under FIFO | coordinator keeps assignment; broker only transports targeted messages |
+| Token scope leak in shared subscriptions | per-factory subscription + correlation filter; coordinator re-checks the §12 token on claim |
+| Secrets in env (`.env` keys) | **managed identity** for Service Bus + Cosmos; no connection-string keys committed |
+| Blind operation | emit metrics: subscription depth, dispatch lag, claim-conflict (409) rate, DLQ count, change-feed lag — wire to existing monitoring |
+| Migration regressions | M1 shadow measures divergence before any cutover; all flag-gated |
+
+---
+
+## 10. Open questions
+
+1. **Per-factory subscription scale.** The chosen coordinator-targeted model uses
+   one subscription per factory (correlation filter on `targetFactoryId`). Service
+   Bus allows up to **2,000 subscriptions/topic**, so this scales for realistic
+   fleets. If factory churn is high, fall back to a single subscription with a
+   per-consumer `targetFactoryId` SQL filter.
+2. **Where does the dispatcher run?** A new lightweight loop in platform-service
+   vs a separate worker. A change-feed lease container is required either way; a
+   single active dispatcher (leader-elected) avoids double-publish.
+3. **Cost envelope:** Service Bus tier (Standard vs Premium). Standard likely
+   sufficient; Premium only if sessions/large messages/VNet are needed. Confirm
+   against expected message volume.
+4. **Do we keep the Cosmos poll path permanently** as an offline/degraded
+   fallback (like today's `AQ_FLEET_ROUTE=0`)? Recommend yes.
+5. **Repo advertisement.** How does a factory tell the coordinator which repos it
+   has locally (for the `repo:<name>` capability)? Extend the heartbeat payload
+   with a `repos[]` list, or derive from `AQ_FLEET_REPO_BASE`.
+
+---
+
+## 11. Appendix — idle RU cost sketch (today vs M0 vs target)
+
+| Model | Claim/work-find ops at idle (4 factories) | Notes |
+| --- | --- | --- |
+| **Today** (poll 3s) | ~115k/day full-partition `listJobs` | scales with partition size; ~`4 × 28.8k` |
+| **M0** (poll 15–30s + gate) | ~12–23k/day **1-RU point-reads** + ~0 full scans | full scan only when the gate doc changes |
+| **Target (B3)** | ~0 | long-poll receive, no RU; full scan never on the hot path |
+
+> Figures are order-of-magnitude to frame the decision, not a billing estimate.
+> A full-partition `listJobs` costs many RU and grows with partition size; a
+> point-read is ~1 RU and flat. The point: idle cost goes from "linear in
+> partition size, forever" to "≈ zero".
+
+---
+
+## 12. Roadmap & checklist (roadmap Phase 4)
+
+Acceptance gate for the whole effort: **idle work-find RU ≈ 0**, the
+"wrong-factory / ineligible-capability" stranding is gone, exactly-once
+assignment + crash recovery still hold on multi-host, and every step is
+flag-gated + reversible.
+
+### M0 — RU quick win (no new infra)
+- [ ] Add per-product `queue_version`/`pending_count` doc; bump on submit + stage change.
+- [ ] Factory loop point-reads the gate each tick; run `listJobs`/claim only when it changed.
+- [ ] Raise `POLL_SECONDS` default (3 → 15–30) + jittered idle backoff.
+- [ ] Flag-gate (e.g. `AQ_FLEET_GATE=1`) with a clean off-switch.
+- [ ] Verify: idle claim queries drop ~10–50×; functional behavior unchanged (selftest green).
+
+### Routing-model fix (lands with M0/M1)
+- [ ] Add `repo:<name>` capability token; factories advertise local repos via heartbeat (`repos[]`).
+- [ ] Scheduler matches on caps **+ repo**; product becomes a tag, not the routing key.
+- [ ] Fix `tracker-web` New-Job form: drop default `capabilities="build"`, stop hardcoding `mac-1/mac-2`, derive factories/repos from live data.
+- [ ] Add product→repo ownership validation (reject/route mismatches) — the A1 safety net.
+
+### M1 — Broker in shadow
+- [ ] Provision Service Bus `fleet-dispatch` topic + per-factory subscriptions (managed identity, no keys).
+- [ ] Change-feed dispatcher (leader-elected) tails `fleet_jobs`, runs scheduler, publishes targeted messages (`MessageId=jobId`, dup-detection on).
+- [ ] Publish in **shadow** alongside the Cosmos claim path; record route divergence (no action taken).
+- [ ] Verify: ≥ N hours shadow with broker-route == scorer-pick within tolerance.
+
+### M2 — Cutover delivery
+- [ ] Factories consume from Service Bus; `/fleet/accept` does the Cosmos CAS claim + returns `leaseEpoch`.
+- [ ] Implement **complete-on-claim** (reaper + change-feed re-dispatch owns liveness).
+- [ ] Cosmos poll path retained as flag-gated fallback (`AQ_FLEET_ROUTE=0`).
+- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag.
+- [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
+
+### M3 — On-demand factories (scale-to-zero)
+- [ ] KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers.
+- [ ] Optional warm-pool (1 small instance) if cold-start latency matters.
+- [ ] Verify: zero idle workers + zero idle RU; cold-start latency within target.
+
+### Docs to update on completion
+- [ ] `GIGAFACTORY_ROADMAP.md` — tick Phase 4; correct the stale §0 progress table.
+- [ ] `GIGAFACTORY_SYSTEM_OVERVIEW.md` — add the broker/dispatcher to the architecture + code map.
+- [ ] common_plat `docs/GIGAFACTORY/` — mirror the backend/dispatcher changes.
diff --git a/agent-queue/docs/GIGAFACTORY/README.md b/agent-queue/docs/GIGAFACTORY/README.md
index bc64419..62bc117 100644
--- a/agent-queue/docs/GIGAFACTORY/README.md
+++ b/agent-queue/docs/GIGAFACTORY/README.md
@@ -10,6 +10,7 @@ multi-host factory of autonomous coding agents.
 | --- | --- |
 | [`GIGAFACTORY_ROADMAP.md`](GIGAFACTORY_ROADMAP.md) | The canonical source-of-truth spec: architecture, the evolved job manifest, scoring formula, lifecycle/retry, enrollment, and the phased checklists (§1–§17). Job specs in `../jobs/` point here. |
 | [`GIGAFACTORY_SYSTEM_OVERVIEW.md`](GIGAFACTORY_SYSTEM_OVERVIEW.md) | A narrative overview of how the pieces fit together end-to-end, with a code-map of the relevant files across both repos. |
+| [`FLEET_DISPATCH_REDESIGN.md`](FLEET_DISPATCH_REDESIGN.md) | Phase-4 design proposal (no code): broker-backed (Azure Service Bus) dispatch + on-demand factories that fixes the product-as-queue routing smell and the idle-poll Cosmos RU cost. Phased migration starting with a zero-infra RU quick win. |
 
 ## Related docs in the other repo