# Fleet Dispatch Redesign — Broker-Backed, On-Demand Factories
> Design proposal (M0 implemented, see review log v5; no code yet for M1–M3). Companion to `GIGAFACTORY_SYSTEM_OVERVIEW.md`
> (what exists today) and `GIGAFACTORY_ROADMAP.md` (source-of-truth spec). This
> doc realizes roadmap **Phase 4** ("Message bus + autoscaling") and the
> routing-model cleanup that comes with it. Last reviewed: **2026-05-31**.
>
> **Review log**
> - v1 (2026-05-31): initial proposal.
> - v2 (2026-05-31): self-review pass — reconciled the routing model
> (coordinator-targeted as primary), fixed the Cosmos outbox transactionality
> claim (change feed *is* the log), constrained message size (jobId + routing
> props only), addressed long-job vs Service Bus 5-min lock, corrected the
> idempotency key (`MessageId = jobId`), renamed migration steps `M0–M3` to
> avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a
> ticked roadmap checklist + auth/observability notes.
> - v3 (2026-05-31): added **§5.5 Error handling & cleanup** (current behavior +
> lease-release-on-failure, branch/worktree GC, same-repo worktree clobber).
> Review fixes: unified the field name to `targetFactoryId` (§5.1), reconciled
> §5.3 with the complete-on-claim model (broker is not the redelivery path),
> aligned §6 token scoping with per-factory subscriptions, and added the GC /
> `POST /fleet/fail` checklist block to §12.
> - v4 (2026-05-31): **coverage audit** — roadmap now maps 1:1 to the design via a
> coverage matrix. Closed plan gaps: **M-prep** (decisions/§10 + schema +
> containers + RBAC), correlation filter + dispatcher budget enforcement (M1),
> small-messages/body-from-Cosmos + token re-check + alerting (M2), and new
> **Testing** and **Rollback & flags** blocks. No design element is now without
> an implementation step.
> - v5 (2026-05-31): **M0 implemented + shipped** (`fleet_queue_state` + bump
> hooks + `GET /fleet/queue-state` in common_plat; `AQ_FLEET_GATE` gate-skip in
> agent-queue). Reconciled M0 to the as-built approach (gate the *claim*; keep
> `POLL_SECONDS` for local responsiveness rather than raising it globally) and
> ticked the M0 checklist. Backend vitest + gate logic verified.
---
## 1. Why this doc exists (the two smells)
Two structural problems surfaced while running the local fleet against
`tracker-web` + `platform-service`:
### 1.1 Product-as-queue is conflated with repo-as-work-target
- `fleet_jobs` is partitioned by **`/productId`**, and a factory is bound to a
single product via `AQ_PRODUCT_ID`. The job's **`repo`** is just a payload
field (the PR target). Routing uses `productId`; the repo is orthogonal.
- Consequence observed: a `learning_ai_notes` job submitted via the form was
filed under **`chronomind`** (because the form's Factory dropdown maps
`mac-2 → chronomind`), and would have opened a PR to the notes repo from a
"chronomind" factory. Nothing ties the product to the repo, and nothing
guarantees the chosen factory even has that repo checked out.
- The form (`dashboards/tracker-web/.../fleet/jobs/page.tsx`) hardcodes
`FLEET_FACTORIES = [mac-1→lysnrai, mac-2→chronomind]` and defaults
`capabilities = "build"` — a capability **no agent-queue factory ever
advertises** (`detect_capabilities` only emits `os:*`, `engine:*`, `node:*`,
`has:*`). So default UI submissions are unroutable to live factories.
### 1.2 Pull-poll daemons burn Cosmos RU to stay "ready"
- The run loop iterates every **`POLL_SECONDS=3`**; with `AQ_FLEET_ROUTE=1`
(default) each iteration calls `POST /fleet/claim`.
- `claimNextJob` runs `repo.listJobs({ productId })` — **reads every job doc in
the product partition, no stage filter, no limit** — on every claim, plus a
`getLease` point-read per active job when preemption is on.
- One process **per product** (`_start_fleet.sh` spawns 4) ⇒ ~`4 × (1/3s)` ≈
**115k claim queries/day at idle**, each scaling with partition size, billed
continuously whether or not work exists. The machine must also stay up running
the loop.
> **Root cause:** `productId` is doing double duty as *tenant/billing scope* and
> *work-routing queue*, and work discovery is a busy-poll against the state store.
---
## 2. Goals, non-goals, constraints
**Goals**
- Eliminate idle-poll RU cost; pay (near) zero when there is no work.
- Make a factory a **generic build worker** (host + capabilities + engines +
checked-out repos), not a product-bound process.
- Route work by what actually matters (**capabilities + repo**), while keeping
per-product **billing, budgets, visibility, and token scoping**.
- Preserve the existing **weighted scheduler** and **leaseEpoch fencing**
(exactly-once assignment, zombie-writer protection).
- Enable later **on-demand spawn** (scale-to-zero) without re-architecting.
**Non-goals (this phase)**
- Replacing Cosmos as the system of record for job/lease/event/budget **state**.
- Rewriting the scheduler's scoring math.
- Multi-region / cross-cloud dispatch.
**Hard constraints (ecosystem rules)**
- Every Cosmos doc keeps a `productId` (platform rule) — product stays a
first-class **tag**, even when it is no longer the routing key.
- Per-product budgets (`fleet_budgets /productId`), enrollment tokens (§12), and
the `tracker-web` per-product views must keep working.
- Changes must be flag-gated and reversible (match the existing
`AQ_FLEET` / `AQ_FLEET_ROUTE` / `AQ_FLEET_SHADOW` cutover discipline).
---
## 3. Decision summary
1. **Do NOT build A3 ("single shared queue") inside Cosmos.** A single logical
queue tempts a hot partition; scaling it forces a synthetic partition key and
a **cross-partition "find next job" query**, which *increases* RU — the
opposite of the goal. It also dissolves the per-product isolation the
platform's tenancy/budget/token model depends on.
2. **Get the shared-queue behavior from a real broker (B3), not from Cosmos.**
Adopt **Azure Service Bus** as the dispatch substrate. Cosmos remains
product-partitioned for **state**; the broker owns **delivery**.
3. **Keep the scheduler.** Use a **coordinator-owns-scheduling /
broker-owns-delivery hybrid** (B2 ⊕ B3): the coordinator decides *which
factory* should run a job and pushes a **targeted** message; the broker
handles transport, visibility timeout, retries, and dead-lettering.
4. **Ship the cheap RU win first (B1) as step M0** — it is reversible, needs no
new infra, and de-risks the broker migration by removing the bleed while the
bigger change is built and shadowed.
> Net: the shared-queue *experience* (generic workers, one work stream) comes
> from Service Bus topics/subscriptions; Cosmos stays `/productId`-partitioned
> for state, budgets, and visibility.
---
## 4. Target architecture
### 4.1 Components & ownership
| Concern | Owner (target) | Notes |
| --- | --- | --- |
| Job/lease/event/budget **state** | **Cosmos** (`/productId`, `/jobId` as today) | unchanged system of record |
| **Scheduling** (which factory) | **Coordinator** (platform-service) | existing weighted scorer + preemption |
| **Dispatch / delivery** | **Service Bus** | competing consumers, visibility timeout, DLQ |
| **Fencing** (zombie writers) | **Cosmos `leaseEpoch`** | broker visibility ≠ correctness boundary |
| Per-product billing/budgets/tokens | **Cosmos + coordinator** | enforced at submit + assign, not by partition |
| Control planes | `tracker-web`, `agent-queue` dashboard | unchanged REST surface |
### 4.2 Service Bus topology
- One **topic** `fleet-dispatch`.
- **Primary model — coordinator-targeted (preserves the scheduler):** the
coordinator picks the factory, then publishes a message stamped with
`targetFactoryId`. Each factory has its **own subscription** with a
**correlation filter** `targetFactoryId = '<me>'`. The broker does no policy —
it just delivers the scorer's decision. **This is the model the rest of this
doc assumes.**
- **Fallback model — self-select (only if the scheduler is disabled):**
capability/repo **SQL filters** on message application properties let consumers
self-match. Multi-valued `capabilities` do **not** filter cleanly as one
string, so encode each as a boolean property (`cap_os_mac=true`,
`repo_learning_ai_notes=true`) rather than `LIKE '%…%'`. Subscription filters
are why Service Bus beats Storage Queue / SQS (which can't filter → a
queue-per-class sprawl).
- **Messages stay small.** A message carries only
`{ jobId, productId, repo, caps, priority, targetFactoryId }`, **not**
`bodyMd`/manifest. The consumer reads the full job from Cosmos by `jobId`.
(Service Bus max message is **256 KB** Standard / 1 MB Premium; job bodies can
approach that — reinforcing "broker = transport, Cosmos = state".)
- **DLQ** per subscription ⇒ maps onto `failed` / `retries_exhausted`.
- **Sessions** (optional) keyed by `repo` to serialize same-repo work and avoid
worktree/branch contention on one host.
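
The message shape and the two filter styles above can be sketched as follows. This is a minimal illustration, not the coordinator's code: the `DispatchMessage` interface mirrors the field list in this section, while `toPropertyKey` and `toApplicationProperties` are hypothetical helper names, and the key-sanitising rule is an assumption chosen so that `os:mac` becomes `cap_os_mac`.

```typescript
// Routing-only dispatch message; the job body stays in Cosmos and is read by jobId.
interface DispatchMessage {
  jobId: string;
  productId: string;
  repo: string;
  caps: string[];
  priority: number;
  targetFactoryId: string;
}

// Primary model: a per-factory correlation filter is an equality match on one property.
function matchesCorrelationFilter(msg: DispatchMessage, factoryId: string): boolean {
  return msg.targetFactoryId === factoryId;
}

// Hypothetical helper: turn a token such as "os:mac" into a property-safe key.
function toPropertyKey(prefix: string, token: string): string {
  return `${prefix}_${token.replace(/[^A-Za-z0-9]+/g, "_")}`;
}

// Fallback model: one boolean property per capability and per repo, so a SQL
// filter can test `cap_os_mac = true` instead of LIKE on a joined string.
function toApplicationProperties(
  msg: DispatchMessage,
): Record<string, boolean | string | number> {
  const props: Record<string, boolean | string | number> = {
    productId: msg.productId,
    priority: msg.priority,
    targetFactoryId: msg.targetFactoryId,
  };
  for (const cap of msg.caps) props[toPropertyKey("cap", cap)] = true;
  props[toPropertyKey("repo", msg.repo)] = true;
  return props;
}
```

Keeping the job body out of the message means the size limit constrains only these few routing fields.
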
### 4.3 Why this keeps the scheduler
A vanilla broker is FIFO competing-consumers and does **no** weighted scoring.
To preserve the existing scorer (`capabilityFit / affinity / load / costFit /
health / starvation`) + preemption + seat limits, the coordinator stays in the
decision path: it **selects the target factory** and publishes a message whose
`targetFactoryId` property routes it to *that* factory's subscription through the
correlation filter (§4.2). The broker is transport, not policy.
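
As a rough illustration of "coordinator picks, broker delivers", the sketch below scores candidates with a weighted sum and returns the factory to stamp as `targetFactoryId`. Only the six factor names come from this section; the weights, the 0..1 scales, the hard gate on `capabilityFit`, and the names `weightedScore` and `pickTargetFactory` are assumptions made for the example, not the existing scorer.

```typescript
// Factor names follow §4.3; every value is assumed to be normalised to 0..1.
interface FactorScores {
  capabilityFit: number;
  affinity: number;
  load: number; // assumption: higher means more spare capacity
  costFit: number;
  health: number;
  starvation: number;
}

interface ScoredCandidate {
  factoryId: string;
  factors: FactorScores;
}

// Illustrative weights only; the real scheduler's weights are not stated in this doc.
const ILLUSTRATIVE_WEIGHTS: FactorScores = {
  capabilityFit: 0.3, affinity: 0.2, load: 0.2,
  costFit: 0.1, health: 0.1, starvation: 0.1,
};

function weightedScore(f: FactorScores, w: FactorScores = ILLUSTRATIVE_WEIGHTS): number {
  return f.capabilityFit * w.capabilityFit + f.affinity * w.affinity +
    f.load * w.load + f.costFit * w.costFit +
    f.health * w.health + f.starvation * w.starvation;
}

// Returns the factoryId to stamp on the message, or null when nobody is eligible.
function pickTargetFactory(candidates: ScoredCandidate[]): string | null {
  let best: ScoredCandidate | null = null;
  let bestScore = -Infinity;
  for (const c of candidates) {
    if (c.factors.capabilityFit <= 0) continue; // hard gate: cannot run the job at all
    const s = weightedScore(c.factors);
    if (s > bestScore) {
      best = c;
      bestScore = s;
    }
  }
  return best === null ? null : best.factoryId;
}
```
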
---
## 5. Key flows
### 5.1 Submit → dispatch (consistency)
The **Cosmos change feed on `fleet_jobs` is the durable, ordered event log**, so
no separate outbox container is needed for the primary design:
1. `submitJob` writes the `fleet_jobs` doc (`stage: queued`). That write *is* the
event.
2. A single **dispatcher** (coordinator process) tails the `fleet_jobs` change
feed (via a lease container), runs the scheduler for each new/`queued` job,
stamps `targetFactoryId` on the job (CAS), and **publishes** the targeted
Service Bus message.
3. **Crash-safe & idempotent:** the change feed redelivers from the last
checkpoint on dispatcher restart; Service Bus **duplicate detection** keyed on
**`MessageId = jobId`** collapses re-publishes. The consumer is idempotent
because the authoritative claim is a Cosmos CAS on `leaseEpoch` — a second
delivery is simply fenced (`leaseEpoch` is assigned *at claim*, so it is **not**
a valid dedup key for the message itself).
> A separate **transactional outbox** is only needed if you ever publish *inline*
> at submit instead of via the change feed. Cross-container writes are **not
> atomic** in Cosmos, so an outbox row would have to live in the **same container
> + same partition** as the job and be written with a **Cosmos transactional
> batch** — or, simpler, carried as an `outboxState` field on the job doc itself.
> The change-feed design avoids this entirely.
> Net effect: the per-factory busy-poll is replaced by one change-feed-driven
> dispatcher. Idle cost is event-driven, not a per-3s full-partition scan.
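
The dispatch steps in §5.1 can be sketched with an in-memory stand-in for the topic. Everything here is illustrative: `FakeDedupTopic`, `dispatchChange`, and the doc shape are hypothetical names, the field assignment stands in for the CAS stamp, and no real Service Bus or Cosmos SDK call is shown. The point is only that replaying a change-feed event after a restart re-publishes with the same `messageId` and is collapsed.

```typescript
type DispatchStage = "queued" | "building" | "failed";

interface QueuedJobDoc {
  jobId: string;
  productId: string;
  stage: DispatchStage;
  targetFactoryId?: string;
}

interface OutboundMessage {
  messageId: string;
  targetFactoryId: string;
  productId: string;
}

// In-memory stand-in for a topic with duplicate detection keyed on messageId.
class FakeDedupTopic {
  private readonly seen = new Set<string>();
  readonly delivered: OutboundMessage[] = [];

  publish(msg: OutboundMessage): boolean {
    if (this.seen.has(msg.messageId)) return false; // duplicate collapsed
    this.seen.add(msg.messageId);
    this.delivered.push(msg);
    return true;
  }
}

// Handles one change-feed event; safe to call again for the same doc after a restart.
function dispatchChange(
  doc: QueuedJobDoc,
  pickFactory: (job: QueuedJobDoc) => string | null,
  topic: FakeDedupTopic,
): boolean {
  if (doc.stage !== "queued") return false; // only queued jobs are dispatched
  const target = doc.targetFactoryId ?? pickFactory(doc);
  if (target === null) return false; // no eligible factory yet
  doc.targetFactoryId = target; // stands in for the CAS stamp on the job doc
  return topic.publish({
    messageId: doc.jobId,
    targetFactoryId: target,
    productId: doc.productId,
  });
}
```

One difference from a real broker is worth keeping in mind: the fake remembers every id forever, whereas Service Bus duplicate detection is time-windowed. A job that is requeued and re-dispatched inside that window would also be collapsed under `MessageId = jobId`, so the retry path in §5.3 needs to account for the window.
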
### 5.2 Deliver → claim → fence
1. Factory **receives** a message (long-poll/`receiveMessages`, no RU).
2. Factory calls `POST /fleet/claim` (or a lighter `/fleet/accept`) with
`{ jobId, factoryId }`. Coordinator does the **CAS lease** in Cosmos exactly
as today (`revUpdateJob` + `leaseEpoch` bump) and returns the new epoch.
409 ⇒ fenced ⇒ factory abandons the message (it returns to the subscription, or moves to the DLQ once the delivery count is exhausted).
3. The **broker lock** governs redelivery (a dead consumer's message reappears);
the **Cosmos `leaseEpoch`** governs *correctness* (a zombie writer is rejected
on PATCH). Two distinct mechanisms — do not collapse them.
4. **Long-running jobs vs the broker lock.** Service Bus message lock max is
**5 minutes**; a coding job runs far longer. Two viable patterns:
- **(recommended) complete-on-claim:** complete the message immediately after
a successful Cosmos claim. The **Cosmos lease + reaper** then own liveness —
on crash the reaper sets the job back to `queued`, which is a change-feed
event that **re-dispatches** (§5.1). This decouples job runtime from the
5-min lock entirely.
- **renew-lock:** keep the message locked and call `renewMessageLock` on a
timer, reusing the existing `AQ_FLEET_LEASE_RENEW_SEC` cadence to renew
*both* the Cosmos lease and the broker lock. Simpler delivery semantics, but
couples runtime to the broker and risks redelivery storms on long jobs.
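
To make the "delivery versus correctness" split concrete, here is a small in-memory model of the claim and fencing rules. It is a sketch under assumptions: `claimJob`, `reapExpiredLease`, and `acceptProgressWrite` are hypothetical names, the mutation stands in for the Cosmos CAS, and the stage names are reduced to two.

```typescript
interface LeasedJob {
  jobId: string;
  stage: "queued" | "building";
  leaseEpoch: number;
  leaseHolder: string | null;
}

type ClaimResult = { ok: true; leaseEpoch: number } | { ok: false; httpStatus: 409 };

// Stand-in for the coordinator CAS: only a queued job can be claimed,
// and every successful claim bumps the epoch.
function claimJob(job: LeasedJob, factoryId: string): ClaimResult {
  if (job.stage !== "queued") return { ok: false, httpStatus: 409 };
  job.stage = "building";
  job.leaseHolder = factoryId;
  job.leaseEpoch += 1;
  return { ok: true, leaseEpoch: job.leaseEpoch };
}

// Reaper path: bump the epoch so the previous holder is fenced, then requeue.
function reapExpiredLease(job: LeasedJob): void {
  job.leaseEpoch += 1;
  job.leaseHolder = null;
  job.stage = "queued";
}

// Writes are accepted only from the current holder at the current epoch;
// a zombie writer presents a stale epoch and is rejected.
function acceptProgressWrite(job: LeasedJob, factoryId: string, epoch: number): boolean {
  return job.leaseHolder === factoryId && job.leaseEpoch === epoch;
}
```

In this model a second delivery of the same message simply receives the 409, and a factory that resumes after its lease was reaped cannot write, regardless of what the broker lock says.
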
### 5.3 Failure / retry / DLQ
Assumes the recommended **complete-on-claim** model (§5.2): the broker message is
completed at claim, so the broker is **not** the redelivery path — re-dispatch is
driven by Cosmos stage changes through the change feed (§5.1).
- **Logical failure** (engine error / verify-fail) ⇒ coordinator transitions
`failed` and **releases the lease immediately** (new `/fleet/fail`, see §5.5);
no redelivery (a logical failure is terminal unless a retry policy applies).
- **Retryable failure** ⇒ coordinator sets the job back to `queued` (attempts++,
backoff) ⇒ change-feed re-dispatch to the next best factory.
- **Crash / lease-expiry** ⇒ the **reaper** reclaims the Cosmos lease (bumps
`leaseEpoch`, fencing the dead holder) and returns the job to `queued` ⇒
change-feed re-dispatch. (With the alternative *renew-lock* model, broker
redelivery is the trigger instead — pick one, not both.)
- **Exhausted retries** ⇒ Cosmos `retries_exhausted`; mirror to the broker DLQ
for visibility.
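
The four failure cases above reduce to a small decision function. The sketch below is hypothetical: `resolveFailure` is not an existing function, the exponential backoff and its base are assumptions (the doc only says "backoff"), and counting a crash as an attempt is also an assumption.

```typescript
type FailureKind = "logical" | "retryable" | "crash";
type FailStage = "failed" | "queued" | "retries_exhausted";

interface FailureOutcome {
  stage: FailStage;
  attempts: number;
  releaseLease: boolean;
  backoffMs: number;
}

function resolveFailure(
  kind: FailureKind,
  attempts: number,
  maxAttempts: number,
  baseBackoffMs = 1000, // assumption: illustrative base delay
): FailureOutcome {
  // Logical failure is terminal: mark failed and free the lease immediately.
  if (kind === "logical") {
    return { stage: "failed", attempts, releaseLease: true, backoffMs: 0 };
  }
  // Retryable failures and crashes both consume an attempt in this sketch.
  const next = attempts + 1;
  if (next >= maxAttempts) {
    return { stage: "retries_exhausted", attempts: next, releaseLease: true, backoffMs: 0 };
  }
  // Back to queued; the stage change is what triggers change-feed re-dispatch.
  return {
    stage: "queued",
    attempts: next,
    releaseLease: true,
    backoffMs: baseBackoffMs * 2 ** (next - 1),
  };
}
```
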
### 5.4 Routing model (the §1.1 fix)
- Job carries `repo` + required `capabilities` (real tokens: `os:*`, `engine:*`,
`has:git`, plus a new `repo:<name>` token).
- The **scheduler** does the matching: it picks among factories that advertise
those caps **and** have the repo locally (or can clone it), then targets the
winner (§4.2 primary model: message stamped `targetFactoryId`, delivered via
that factory's correlation-filtered subscription).
- **Product is a property/tag** used for billing/visibility and budget checks —
**not** the routing key. (In the self-select fallback, product/caps/repo become
subscription SQL filters instead.)
- Fix the `tracker-web` form in lockstep: derive factories/repos from live data,
drop the bogus default `capabilities = "build"`, and stop hardcoding
`mac-1/mac-2`.
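
A minimal sketch of the eligibility step, assuming the scheduler sees each factory's advertised capabilities plus a `repos[]` list (the heartbeat extension proposed in §10). `FactoryAdvert`, `RoutableJob`, `isEligible`, and `eligibleFactories` are hypothetical names, and the "can clone it" case is left out for brevity.

```typescript
interface FactoryAdvert {
  factoryId: string;
  capabilities: string[]; // real tokens such as "os:mac", "has:git"
  repos: string[]; // repos checked out locally (assumed heartbeat field)
}

interface RoutableJob {
  repo: string;
  capabilities: string[];
}

// A factory is eligible when it advertises every required capability
// and has the job's repo locally.
function isEligible(job: RoutableJob, factory: FactoryAdvert): boolean {
  const advertised = new Set(factory.capabilities);
  const hasCaps = job.capabilities.every((c) => advertised.has(c));
  return hasCaps && factory.repos.includes(job.repo);
}

function eligibleFactories(job: RoutableJob, fleet: FactoryAdvert[]): string[] {
  return fleet.filter((f) => isEligible(job, f)).map((f) => f.factoryId);
}
```

With this check, a job that requires the form's old default `build` capability matches no factory, which is the stranding described in §1.1; product does not appear in the match at all.
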
### 5.5 Error handling & cleanup (worktrees, branches, leases)
**Today (single-host, agent-queue.sh).** The worker already handles errors well:
the stage machine routes `timeout`/`budget_exceeded`/`crash`/`verify_failed`/
`capability_mismatch`/`no_engine` through `_finish_failure` (→ `failed/`, with a
retry policy that requeues to `inbox/` with backoff); a `trap` writes a WIP
checkpoint to `aq/wip/<job>` on **every** exit path; `recover_orphans` requeues
dead-worker `building/` jobs; and a **FENCED** report (stale `leaseEpoch`)
triggers `fleet_quarantine` → `failed/` that **never ships or merges**
(split-brain guard). PR/merge cleanup: `.aq_pr.md` is removed before commit; the
PR branch `aq/job/<jid>` is deleted on auto-merge (`--delete-branch`); the repo
worktree is force-recreated at the next job for that repo.
**Gaps this redesign must close.** These are real loose ends in the current code:
1. **No client-side lease release on failure.** `_finish_failure` is
fleet-agnostic, so a failed fleet job's lease only frees on **expiry** via the
reaper — slow recovery. Target: a `POST /fleet/fail` (stage=`failed`/`queued`
+ release lease) so failure is reflected and the lease freed **immediately**.
2. **Unbounded git artifacts.** `aq/wip/<job>` branches are never GC'd; worktrees
are cleaned only on reuse; unmerged `aq/job/<jid>` branches accumulate on
origin when auto-merge is off or blocked by branch protection. Target: a
periodic **GC** sweep — delete merged `aq/job/*`, prune stale worktrees, and
sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
3. **Same-repo concurrency can clobber a worktree.** The per-repo worktree is
force-recreated, so two same-repo jobs on one host collide. Target: **Service
Bus sessions keyed by `repo`** (serialize same-repo work) plus a per-`(host,
repo)` lock as a local backstop.
**Target invariants.**
- Terminal failure ⇒ Cosmos `failed` + lease released now (no expiry wait); DLQ
mirrors `retries_exhausted` for visibility.
- Crash / fence ⇒ reaper bumps `leaseEpoch` (fences zombie) ⇒ `queued` ⇒
change-feed re-dispatch (§5.3).
- Cleanup is **explicit and idempotent** — safe to re-run, never deletes a branch
with unmerged work or a worktree with an in-flight job. (Checklist in §12.)
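
One way to keep the GC sweep idempotent is to make the delete decision a pure function of branch facts, so re-running it gives the same answer. The sketch below is an assumption-laden illustration: `shouldDeleteBranch` and `BranchInfo` are hypothetical, the stage list is a guess at the terminal set (`shipped`, `failed`, `retries_exhausted`), and how "merged" is detected is out of scope.

```typescript
type GcJobStage = "queued" | "building" | "shipped" | "failed" | "retries_exhausted";

interface BranchInfo {
  branchName: string;
  mergedIntoDefault: boolean;
  jobStage: GcJobStage | null; // null when the job is unknown to the coordinator
}

// Assumption: these are the stages after which a WIP checkpoint is no longer needed.
const TERMINAL_STAGES: ReadonlySet<GcJobStage> = new Set<GcJobStage>([
  "shipped", "failed", "retries_exhausted",
]);

function shouldDeleteBranch(b: BranchInfo): boolean {
  // PR branches: delete only once merged, never with unmerged work.
  if (b.branchName.startsWith("aq/job/")) return b.mergedIntoDefault;
  // WIP checkpoints: sweep only after the job reached a terminal/shipped stage.
  if (b.branchName.startsWith("aq/wip/")) {
    return b.jobStage !== null && TERMINAL_STAGES.has(b.jobStage);
  }
  // Anything outside the aq/ namespaces is never touched.
  return false;
}
```
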
---
## 6. Per-product tenancy without product-partitioned queues
- **Budgets:** checked by the coordinator at **assign time** (it already reads
`fleet_budgets /productId` in `claimNextJob`); unchanged, just moved to the
dispatcher.
- **Tokens (§12):** the factory token still scopes `productId + capabilities +
factoryId`. In the primary (coordinator-targeted) model the dispatcher only
ever targets a factory the scheduler deemed eligible, and the coordinator
**re-checks the token on `/fleet/claim`** — so least-privilege holds without
relying on the subscription topology. (In the self-select fallback, scope it
with per-product/per-token subscription filters instead.)
- **Visibility:** `tracker-web` keeps querying per product (state is still
product-partitioned), so the UX is unchanged.
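
The token re-check on claim can be pictured as a scope comparison. This is a hedged sketch: the three scoped fields come from the bullet above, but `FactoryToken`, `ClaimScope`, and `tokenAllowsClaim` are hypothetical names, and treating the token's capabilities as an upper bound on the job's required capabilities is an assumption about how the scope is interpreted.

```typescript
interface FactoryToken {
  productId: string;
  factoryId: string;
  capabilities: string[];
}

interface ClaimScope {
  productId: string;
  factoryId: string;
  requiredCaps: string[];
}

// Grants the claim only when product and factory match the token and every
// capability the job requires is within the token's capability scope.
function tokenAllowsClaim(token: FactoryToken, claim: ClaimScope): boolean {
  if (token.productId !== claim.productId) return false;
  if (token.factoryId !== claim.factoryId) return false;
  const scoped = new Set(token.capabilities);
  return claim.requiredCaps.every((c) => scoped.has(c));
}
```
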
---
## 7. Alternatives considered
| Option | Verdict | Reason |
| --- | --- | --- |
| **A3 shared queue in Cosmos** | ✗ | hot partition; cross-partition claim = more RU; loses tenancy isolation |
| **A1 validate ownership only** | partial | fixes "wrong factory" but not the RU/poll model or process-per-product |
| **Storage Queue / SQS broker** | ✗ (for now) | no subscription filters ⇒ queue-per-capability sprawl; weaker DLQ/visibility ergonomics |
| **B2 change feed, no broker** | viable | good for dispatch signal, but still needs a transport to *reach* factories; pairs naturally with B3 |
| **Plain competing-consumers (drop scheduler)** | ✗ | throws away weighted scoring + preemption + cost/affinity routing |
| **B3 Service Bus + coordinator hybrid** | ✓ chosen | zero idle RU, keeps scheduler + fencing, filters give capability/repo routing, paves path to B4 |
---
## 8. Phased migration
> Steps are labelled **M0–M3** to avoid collision with the roadmap's Phase 0–5
> numbering; all of M0–M3 sit *inside* roadmap **Phase 4**. The ticked checklist
> is in §12.
### M0 — RU quick win (no new infra, fully reversible) — *IMPLEMENTED*
- Per-product `fleet_queue_state` doc holds a monotonic `version`, bumped on job
create + every stage change (centralized in the repo layer, best-effort).
- The factory run loop does a **~1-RU point-read** (`GET /fleet/queue-state`) and
**skips the expensive claim** while the version is unchanged and it is not
mid-drain — rather than raising `POLL_SECONDS` globally (which would slow local,
non-fleet job pickup). A periodic safety backstop + fail-open-on-read-error
guarantee work is never stranded.
- Gated behind **`AQ_FLEET_GATE=1`** (default OFF ⇒ byte-for-byte prior behavior).
- Expected: **~10–50× fewer claim queries at idle**, local responsiveness
unchanged.
- Code: common_plat `services/platform-service/src/modules/fleet/{types,repository,routes}.ts`
+ `lib/cosmos-init.ts`; `agent-queue/lib/fleet-client.sh` (`fleet_gate_*`) + the
run-loop hook in `agent-queue.sh`. Tests: fleet vitest (repo bump + endpoint) +
selftest `39b` (gate decisions).
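
The gate decision described above fits in one pure function. The as-built logic lives in shell (`fleet_gate_*` in `fleet-client.sh`), so the TypeScript below is only a sketch of the rules, not a port: `shouldRunClaim`, `GateState`, and `GateRead` are hypothetical names, and the safety interval is passed in because its real value is not stated here.

```typescript
interface GateState {
  lastSeenVersion: number | null; // null until the first successful read
  lastClaimAtMs: number;
  midDrain: boolean; // true while the previous claim returned work
}

type GateRead = { ok: true; version: number } | { ok: false };

// Returns true when the expensive claim should run on this tick.
function shouldRunClaim(
  state: GateState,
  read: GateRead,
  nowMs: number,
  safetyIntervalMs: number,
): boolean {
  if (!read.ok) return true; // fail open: a read error never strands work
  if (state.midDrain) return true; // keep claiming while draining the queue
  if (state.lastSeenVersion === null) return true; // first tick, nothing to compare
  if (read.version !== state.lastSeenVersion) return true; // queue changed
  // Unchanged version: skip, unless the periodic safety backstop is due.
  return nowMs - state.lastClaimAtMs >= safetyIntervalMs;
}
```

Every branch except the last errs toward running the claim, which is why the flag can default to OFF and still be safe to enable: the worst case is the prior behavior.
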
### M1 — Stand up the broker in **shadow**
- Provision Service Bus (`fleet-dispatch` topic + subscriptions) with
**managed-identity** auth (no connection-string keys in env/`.env`). Coordinator
publishes messages **in parallel** with the existing claim path but factories
still source work from Cosmos. Use the existing `AQ_FLEET_SHADOW` discipline:
record divergence (did the broker route match the scorer's pick?) without
acting on it.
### M2 — Cutover delivery to the broker
- Flip a flag so factories source work from Service Bus + `/fleet/claim` for
fencing; Cosmos poll path becomes the fallback only. Keep the reaper + lease
fencing untouched. Validate exactly-once + crash recovery on multi-host.
### M3 — On-demand factories (B4)
- KEDA / Container Apps scale-to-zero on subscription depth: spin a factory only
when depth > 0; idle ⇒ **zero** running workers and zero RU. Warm-pool a single
small instance if cold-start latency matters.
---
## 9. Risks & mitigations
| Risk | Mitigation |
| --- | --- |
| Dual source-of-truth (broker + Cosmos) drift | change-feed *is* the log (no separate outbox); SB duplicate-detection on `MessageId=jobId`; claim is a Cosmos CAS on `leaseEpoch` |
| Broker lock vs `leaseEpoch` confusion | explicit rule: broker lock = *delivery*, `leaseEpoch` = *correctness*; never merge (§5.2) |
| Long job > 5-min broker lock | **complete-on-claim** (reaper + change feed re-dispatch) or `renewMessageLock` on the lease cadence (§5.2) |
| Message > 256 KB | message carries `jobId` + routing props only; consumer reads body from Cosmos (§4.2) |
| Same-repo worktree contention across hosts | Service Bus **sessions** keyed by `repo` to serialize same-repo jobs |
| Lost scheduler features under FIFO | coordinator keeps assignment; broker only transports targeted messages |
| Token scope leak in shared subscriptions | per-factory subscription + correlation filter; coordinator re-checks the §12 token on claim |
| Secrets in env (`.env` keys) | **managed identity** for Service Bus + Cosmos; no connection-string keys committed |
| Blind operation | emit metrics: subscription depth, dispatch lag, claim-conflict (409) rate, DLQ count, change-feed lag — wire to existing monitoring |
| Migration regressions | M1 shadow measures divergence before any cutover; all flag-gated |
---
## 10. Open questions
1. **Per-factory subscription scale.** The chosen coordinator-targeted model uses
one subscription per factory (correlation filter on `targetFactoryId`). Service
Bus allows up to **2,000 subscriptions/topic**, so this scales for realistic
fleets. If factory churn is high, fall back to a single subscription with a
per-consumer `targetFactoryId` SQL filter.
2. **Where does the dispatcher run?** A new lightweight loop in platform-service
vs a separate worker. A change-feed lease container is required either way; a
single active dispatcher (leader-elected) avoids double-publish.
3. **Cost envelope:** Service Bus tier (Standard vs Premium). Standard likely
sufficient (it already includes topics, sessions, and duplicate detection);
Premium only if large messages/VNet are needed. Confirm
against expected message volume.
4. **Do we keep the Cosmos poll path permanently** as an offline/degraded
fallback (like today's `AQ_FLEET_ROUTE=0`)? Recommend yes.
5. **Repo advertisement.** How does a factory tell the coordinator which repos it
has locally (for the `repo:<name>` capability)? Extend the heartbeat payload
with a `repos[]` list, or derive from `AQ_FLEET_REPO_BASE`.
---
## 11. Appendix — idle RU cost sketch (today vs M0 vs target)
| Model | Claim/work-find ops at idle (4 factories) | Notes |
| --- | --- | --- |
| **Today** (poll 3s) | ~115k/day full-partition `listJobs` | scales with partition size; ~`4 × 28.8k` |
| **M0** (poll 15–30s + gate) | ~12–23k/day **1-RU point-reads** + ~0 full scans | full scan only when the gate doc changes; with `POLL_SECONDS=3` kept as built (§8), the point-read count is ~115k/day instead |
| **Target (B3)** | ~0 | long-poll receive, no RU; full scan never on the hot path |
> Figures are order-of-magnitude to frame the decision, not a billing estimate.
> A full-partition `listJobs` costs many RU and grows with partition size; a
> point-read is ~1 RU and flat. The point: idle cost goes from "linear in
> partition size, forever" to "≈ zero".
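
The per-day figures in the table follow from one line of arithmetic: factories × (seconds per day ÷ poll interval). The helper name `idleOpsPerDay` is hypothetical and the function only reproduces the order-of-magnitude counts, not RU cost.

```typescript
const SECONDS_PER_DAY = 86400;

// Number of work-find operations per day across the fleet at a given poll interval.
function idleOpsPerDay(factories: number, pollSeconds: number): number {
  return factories * (SECONDS_PER_DAY / pollSeconds);
}

// 4 factories at 3 s   -> 115,200 (the "~115k/day" row)
// 4 factories at 15 s  -> 23,040  (upper end of "~12–23k/day")
// 4 factories at 30 s  -> 11,520  (lower end of "~12–23k/day")
const todayOps = idleOpsPerDay(4, 3);
const gatedFastOps = idleOpsPerDay(4, 15);
const gatedSlowOps = idleOpsPerDay(4, 30);
```
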
---
## 12. Roadmap & checklist (roadmap Phase 4)
Acceptance gate for the whole effort: **idle work-find RU ≈ 0**, the
"wrong-factory / ineligible-capability" stranding is gone, exactly-once
assignment + crash recovery still hold on multi-host, and every step is
flag-gated + reversible.
### Coverage matrix (design → plan)
Every design element maps to a checklist block below — no design decision is left
without an implementation step.
| Design element | §ref | Plan block |
| --- | --- | --- |
| Idle-poll RU bleed | §1.2 | M0 |
| Product-as-queue / wrong-factory | §1.1, §5.4 | Routing-model fix |
| Open questions / decisions | §10 | M-prep |
| Schema, containers, RBAC | §4, §5, §6 | M-prep |
| Service Bus topic + subscriptions + filters | §4.2 | M-prep, M1 |
| Change-feed dispatcher + scheduler + targeting | §4.3, §5.1 | M1 |
| Budget enforcement at assign | §6 | M1 |
| Claim/fence + complete-on-claim | §5.2 | M2 |
| Small messages (body from Cosmos) | §4.2 | M2 |
| Token re-check at claim | §6 | M2 |
| Metrics + alerting | §9 | M2 |
| Failure→lease release, GC, same-repo clobber | §5.5 | Error handling & cleanup |
| Scale-to-zero on-demand | §3, §5.1 | M3 |
| Tests (dispatcher, CAS/fencing, GC, shadow) | §9 | Testing |
| Rollback / flags per phase | §3 | Rollback & flags |
| Doc updates | — | Docs to update |
### M-prep — Decisions & schema (closes §10; before M1)
- [ ] Lock dispatcher placement (platform-service loop vs separate worker) + **leader election** so a single active dispatcher avoids double-publish (§10 Q2).
- [ ] Lock Service Bus tier (Standard default, which already covers sessions; Premium only for large messages / VNet) (§10 Q3).
- [ ] Lock subscription model (per-factory correlation filter default; single-subscription SQL filter if factory churn is high) (§10 Q1).
- [ ] Confirm the Cosmos poll path stays as a **permanent** flag-gated fallback (`AQ_FLEET_ROUTE=0`) (§10 Q4).
- [ ] Confirm repo-advertisement source: `repos[]` in the heartbeat, derived from `AQ_FLEET_REPO_BASE` (§10 Q5).
- [ ] Schema: add `targetFactoryId` to `FleetJobDoc`, `repos[]` to `FleetFactoryDoc`; register a new `fleet_queue_state` doc (`/productId`) for the M0 gate; provision the change-feed **lease container**; update the container registry / `COSMOS_AUTO_INIT`.
- [ ] RBAC via managed identity: dispatcher = Service Bus **Sender**, factories = **Listener** on their own subscription; no shared keys committed.
### M0 — RU quick win (no new infra) — ✅ DONE
- [x] Add per-product `fleet_queue_state` doc; bump on create + every stage change (repo layer).
- [x] Factory loop point-reads the gate each tick; run the claim only when it changed / mid-drain / safety interval.
- [x] Keep `POLL_SECONDS` for local responsiveness; gate the *claim*, with a periodic safety backstop + fail-open (instead of raising the global poll interval).
- [x] Flag-gate `AQ_FLEET_GATE=1` (default OFF) with a clean off-switch.
- [x] Tests: fleet vitest (repo bump + `GET /fleet/queue-state`) + selftest `39b` (gate decisions) green; gate logic verified standalone.
### Routing-model fix (lands with M0/M1)
- [ ] Add `repo:<name>` capability token; factories advertise local repos via heartbeat (`repos[]`).
- [ ] Scheduler matches on caps **+ repo**; product becomes a tag, not the routing key.
- [ ] Fix `tracker-web` New-Job form: drop default `capabilities="build"`, stop hardcoding `mac-1/mac-2`, derive factories/repos from live data.
- [ ] Add product→repo ownership validation (reject/route mismatches) — the A1 safety net.
### M1 — Broker in shadow
- [ ] Provision Service Bus `fleet-dispatch` topic + per-factory subscriptions, each with a **correlation filter** `targetFactoryId='<id>'` (managed identity, no keys).
- [ ] Change-feed dispatcher (leader-elected) tails `fleet_jobs`, runs scheduler, stamps `targetFactoryId` (CAS), publishes targeted messages (`MessageId=jobId`, dup-detection on).
- [ ] Dispatcher enforces per-product **budget** (paused / ceiling) before publishing (relocates the `claimNextJob` budget check, §6).
- [ ] Publish in **shadow** alongside the Cosmos claim path; record route divergence (no action taken).
- [ ] Verify: ≥ N hours shadow with broker-route == scorer-pick within tolerance.
### M2 — Cutover delivery
- [ ] Factories consume from Service Bus; `/fleet/accept` does the Cosmos CAS claim + returns `leaseEpoch`.
- [ ] Messages carry `{jobId, productId, repo, caps, priority, targetFactoryId}` only; the consumer **reads the full job body from Cosmos** by `jobId` (256 KB limit, §4.2).
- [ ] `/fleet/accept` (and `/fleet/claim`) **re-checks the §12 factory token** (productId + caps + factoryId) before granting the lease.
- [ ] Implement **complete-on-claim** (reaper + change-feed re-dispatch owns liveness).
- [ ] Cosmos poll path retained as flag-gated fallback (`AQ_FLEET_ROUTE=0`).
- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag — **and wire alerts** (DLQ depth > 0, dispatch lag > threshold) into existing monitoring.
- [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
### Error handling & cleanup (lands with M2) — see §5.5
- [ ] Add `POST /fleet/fail` so a failed job sets the coordinator stage + **releases the lease immediately** (no expiry wait); wire it into `_finish_failure` / `fleet_quarantine`.
- [ ] GC sweep (idempotent): delete merged `aq/job/*` branches, prune stale worktrees, sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
- [ ] Prevent same-repo worktree clobber: Service Bus **sessions keyed by `repo`** + a per-`(host, repo)` local lock.
- [ ] Verify: failed jobs free their lease promptly; no orphaned worktrees/branches after N jobs; GC never deletes unmerged work or an in-flight worktree.
### M3 — On-demand factories (scale-to-zero)
- [ ] KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers.
- [ ] Optional warm-pool (1 small instance) if cold-start latency matters.
- [ ] Verify: zero idle workers + zero idle RU; cold-start latency within target.
### Testing (every phase — tests are sacred)
- [ ] Unit: dispatcher scheduling + publish, claim CAS + `leaseEpoch` fencing, `/fleet/fail`, GC idempotency, the M0 gate read/skip logic.
- [ ] Integration: shadow-divergence harness (M1), exactly-once + crash recovery (M2), scale-to-zero behavior (M3).
- [ ] Extend `agent-queue/selftest.sh` + platform-service `vitest`; **CI green is the gate** to advance each phase.
### Rollback & flags (per phase)
- [ ] Each phase ships behind a flag with a documented one-line rollback: M0 `AQ_FLEET_GATE`, M1 shadow (publishes but never acts), M2 `AQ_FLEET_ROUTE` / broker-source toggle, M3 scaler disable.
- [ ] Verify each rollback returns to the prior working path with **no data loss** and no stranded leases/messages.
### Docs to update on completion
- [ ] `GIGAFACTORY_ROADMAP.md` — tick Phase 4; correct the stale §0 progress table.
- [ ] `GIGAFACTORY_SYSTEM_OVERVIEW.md` — add the broker/dispatcher to the architecture + code map.
- [ ] common_plat `docs/GIGAFACTORY/` — mirror the backend/dispatcher changes.