bytelyst-devops-tools/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
saravanakumardb1 41d8067724 feat(agent-queue): M0 RU gate — skip the claim when the queue is unchanged
Adds AQ_FLEET_GATE (default OFF): the run loop point-reads the cheap per-product
queue version (GET /fleet/queue-state) and SKIPS the expensive /fleet/claim while
the version is unchanged and it is not mid-drain, with a periodic safety backstop
and fail-open-on-read-error so work is never stranded. Keeps POLL_SECONDS for
local job responsiveness rather than raising it globally. selftest 39b covers the
gate decisions; reconciles the M0 section of the dispatch redesign doc.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 23:19:01 -07:00

30 KiB
Raw Blame History

Fleet Dispatch Redesign — Broker-Backed, On-Demand Factories

Design proposal (no code yet). Companion to GIGAFACTORY_SYSTEM_OVERVIEW.md (what exists today) and GIGAFACTORY_ROADMAP.md (source-of-truth spec). This doc realizes roadmap Phase 4 ("Message bus + autoscaling") and the routing-model cleanup that comes with it. Last reviewed: 2026-05-31.

Review log

  • v1 (2026-05-31): initial proposal.
  • v2 (2026-05-31): self-review pass — reconciled the routing model (coordinator-targeted as primary), fixed the Cosmos outbox transactionality claim (change feed is the log), constrained message size (jobId + routing props only), addressed long-job vs Service Bus 5-min lock, corrected the idempotency key (MessageId = jobId), renamed migration steps M0M3 to avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a ticked roadmap checklist + auth/observability notes.
  • v3 (2026-05-31): added §5.5 Error handling & cleanup (current behavior + lease-release-on-failure, branch/worktree GC, same-repo worktree clobber). Review fixes: unified the field name to targetFactoryId (§5.1), reconciled §5.3 with the complete-on-claim model (broker is not the redelivery path), aligned §6 token scoping with per-factory subscriptions, and added the GC / POST /fleet/fail checklist block to §12.
  • v4 (2026-05-31): coverage audit — roadmap now maps 1:1 to the design via a coverage matrix. Closed plan gaps: M-prep (decisions/§10 + schema + containers + RBAC), correlation filter + dispatcher budget enforcement (M1), small-messages/body-from-Cosmos + token re-check + alerting (M2), and new Testing and Rollback & flags blocks. No design element is now without an implementation step.
  • v5 (2026-05-31): M0 implemented + shipped (fleet_queue_state + bump hooks + GET /fleet/queue-state in common_plat; AQ_FLEET_GATE gate-skip in agent-queue). Reconciled M0 to the as-built approach (gate the claim; keep POLL_SECONDS for local responsiveness rather than raising it globally) and ticked the M0 checklist. Backend vitest + gate logic verified.

1. Why this doc exists (the two smells)

Two structural problems surfaced while running the local fleet against tracker-web + platform-service:

1.1 Product-as-queue is conflated with repo-as-work-target

  • fleet_jobs is partitioned by /productId, and a factory is bound to a single product via AQ_PRODUCT_ID. The job's repo is just a payload field (the PR target). Routing uses productId; the repo is orthogonal.
  • Consequence observed: a learning_ai_notes job submitted via the form was filed under chronomind (because the form's Factory dropdown maps mac-2 → chronomind), and would have opened a PR to the notes repo from a "chronomind" factory. Nothing ties the product to the repo, and nothing guarantees the chosen factory even has that repo checked out.
  • The form (dashboards/tracker-web/.../fleet/jobs/page.tsx) hardcodes FLEET_FACTORIES = [mac-1→lysnrai, mac-2→chronomind] and defaults capabilities = "build" — a capability no agent-queue factory ever advertises (detect_capabilities only emits os:*, engine:*, node:*, has:*). So default UI submissions are unroutable to live factories.

1.2 Pull-poll daemons burn Cosmos RU to stay "ready"

  • The run loop iterates every POLL_SECONDS=3; with AQ_FLEET_ROUTE=1 (default) each iteration calls POST /fleet/claim.
  • claimNextJob runs repo.listJobs({ productId })reads every job doc in the product partition, no stage filter, no limit — on every claim, plus a getLease point-read per active job when preemption is on.
  • One process per product (_start_fleet.sh spawns 4) ⇒ ~4 × (1/3s)115k claim queries/day at idle, each scaling with partition size, billed continuously whether or not work exists. The machine must also stay up running the loop.

Root cause: productId is doing double duty as tenant/billing scope and work-routing queue, and work discovery is a busy-poll against the state store.


2. Goals, non-goals, constraints

Goals

  • Eliminate idle-poll RU cost; pay (near) zero when there is no work.
  • Make a factory a generic build worker (host + capabilities + engines + checked-out repos), not a product-bound process.
  • Route work by what actually matters (capabilities + repo), while keeping per-product billing, budgets, visibility, and token scoping.
  • Preserve the existing weighted scheduler and leaseEpoch fencing (exactly-once assignment, zombie-writer protection).
  • Enable later on-demand spawn (scale-to-zero) without re-architecting.

Non-goals (this phase)

  • Replacing Cosmos as the system of record for job/lease/event/budget state.
  • Rewriting the scheduler's scoring math.
  • Multi-region / cross-cloud dispatch.

Hard constraints (ecosystem rules)

  • Every Cosmos doc keeps a productId (platform rule) — product stays a first-class tag, even when it is no longer the routing key.
  • Per-product budgets (fleet_budgets /productId), enrollment tokens (§12), and the tracker-web per-product views must keep working.
  • Changes must be flag-gated and reversible (match the existing AQ_FLEET / AQ_FLEET_ROUTE / AQ_FLEET_SHADOW cutover discipline).

3. Decision summary

  1. Do NOT build A3 ("single shared queue") inside Cosmos. A single logical queue tempts a hot partition; scaling it forces a synthetic partition key and a cross-partition "find next job" query, which increases RU — the opposite of the goal. It also dissolves the per-product isolation the platform's tenancy/budget/token model depends on.
  2. Get the shared-queue behavior from a real broker (B3), not from Cosmos. Adopt Azure Service Bus as the dispatch substrate. Cosmos remains product-partitioned for state; the broker owns delivery.
  3. Keep the scheduler. Use a coordinator-owns-scheduling / broker-owns-delivery hybrid (B2 ⊕ B3): the coordinator decides which factory should run a job and pushes a targeted message; the broker handles transport, visibility timeout, retries, and dead-lettering.
  4. Ship the cheap RU win first (B1) as step M0 — it is reversible, needs no new infra, and de-risks the broker migration by removing the bleed while the bigger change is built and shadowed.

Net: the shared-queue experience (generic workers, one work stream) comes from Service Bus topics/subscriptions; Cosmos stays /productId-partitioned for state, budgets, and visibility.


4. Target architecture

4.1 Components & ownership

Concern Owner (target) Notes
Job/lease/event/budget state Cosmos (/productId, /jobId as today) unchanged system of record
Scheduling (which factory) Coordinator (platform-service) existing weighted scorer + preemption
Dispatch / delivery Service Bus competing consumers, visibility timeout, DLQ
Fencing (zombie writers) Cosmos leaseEpoch broker visibility ≠ correctness boundary
Per-product billing/budgets/tokens Cosmos + coordinator enforced at submit + assign, not by partition
Control planes tracker-web, agent-queue dashboard unchanged REST surface

4.2 Service Bus topology

  • One topic fleet-dispatch.
  • Primary model — coordinator-targeted (preserves the scheduler): the coordinator picks the factory, then publishes a message stamped with targetFactoryId. Each factory has its own subscription with a correlation filter targetFactoryId = '<me>'. The broker does no policy — it just delivers the scorer's decision. This is the model the rest of this doc assumes.
  • Fallback model — self-select (only if the scheduler is disabled): capability/repo SQL filters on message application properties let consumers self-match. Multi-valued capabilities do not filter cleanly as one string, so encode each as a boolean property (cap_os_mac=true, repo_learning_ai_notes=true) rather than LIKE '%…%'. Subscription filters are why Service Bus beats Storage Queue / SQS (which can't filter → a queue-per-class sprawl).
  • Messages stay small. A message carries only { jobId, productId, repo, caps, priority, targetFactoryId }not bodyMd/manifest. The consumer reads the full job from Cosmos by jobId. (Service Bus max message is 256 KB Standard / 1 MB Premium; job bodies can approach that — reinforcing "broker = transport, Cosmos = state".)
  • DLQ per subscription ⇒ maps onto failed / retries_exhausted.
  • Sessions (optional) keyed by repo to serialize same-repo work and avoid worktree/branch contention on one host.

4.3 Why this keeps the scheduler

A vanilla broker is FIFO competing-consumers and does no weighted scoring. To preserve the existing scorer (capabilityFit / affinity / load / costFit / health / starvation) + preemption + seat limits, the coordinator stays in the decision path: it selects the target factory and publishes a message whose filter routes it to that factory's subscription (or a per-factory subscription). The broker is transport, not policy.


5. Key flows

5.1 Submit → dispatch (consistency)

The Cosmos change feed on fleet_jobs is the durable, ordered event log, so no separate outbox container is needed for the primary design:

  1. submitJob writes the fleet_jobs doc (stage: queued). That write is the event.
  2. A single dispatcher (coordinator process) tails the fleet_jobs change feed (via a lease container), runs the scheduler for each new/queued job, stamps targetFactoryId on the job (CAS), and publishes the targeted Service Bus message.
  3. Crash-safe & idempotent: the change feed redelivers from the last checkpoint on dispatcher restart; Service Bus duplicate detection keyed on MessageId = jobId collapses re-publishes. The consumer is idempotent because the authoritative claim is a Cosmos CAS on leaseEpoch — a second delivery is simply fenced (leaseEpoch is assigned at claim, so it is not a valid dedup key for the message itself).

A separate transactional outbox is only needed if you ever publish inline at submit instead of via the change feed. Cross-container writes are not atomic in Cosmos, so an outbox row would have to live in the **same container

  • same partition** as the job and be written with a Cosmos transactional batch — or, simpler, carried as an outboxState field on the job doc itself. The change-feed design avoids this entirely.

Net effect: the per-factory busy-poll is replaced by one change-feed-driven dispatcher. Idle cost is event-driven, not a per-3s full-partition scan.

5.2 Deliver → claim → fence

  1. Factory receives a message (long-poll/receiveMessages, no RU).
  2. Factory calls POST /fleet/claim (or a lighter /fleet/accept) with { jobId, factoryId }. Coordinator does the CAS lease in Cosmos exactly as today (revUpdateJob + leaseEpoch bump) and returns the new epoch. 409 ⇒ fenced ⇒ factory abandons the message (it goes back / to DLQ).
  3. The broker lock governs redelivery (a dead consumer's message reappears); the Cosmos leaseEpoch governs correctness (a zombie writer is rejected on PATCH). Two distinct mechanisms — do not collapse them.
  4. Long-running jobs vs the broker lock. Service Bus message lock max is 5 minutes; a coding job runs far longer. Two viable patterns:
    • (recommended) complete-on-claim: complete the message immediately after a successful Cosmos claim. The Cosmos lease + reaper then own liveness — on crash the reaper sets the job back to queued, which is a change-feed event that re-dispatches (§5.1). This decouples job runtime from the 5-min lock entirely.
    • renew-lock: keep the message locked and call renewMessageLock on a timer, reusing the existing AQ_FLEET_LEASE_RENEW_SEC cadence to renew both the Cosmos lease and the broker lock. Simpler delivery semantics, but couples runtime to the broker and risks redelivery storms on long jobs.

5.3 Failure / retry / DLQ

Assumes the recommended complete-on-claim model (§5.2): the broker message is completed at claim, so the broker is not the redelivery path — re-dispatch is driven by Cosmos stage changes through the change feed (§5.1).

  • Logical failure (engine error / verify-fail) ⇒ coordinator transitions failed and releases the lease immediately (new /fleet/fail, see §5.5); no redelivery (a logical failure is terminal unless a retry policy applies).
  • Retryable failure ⇒ coordinator sets the job back to queued (attempts++, backoff) ⇒ change-feed re-dispatch to the next best factory.
  • Crash / lease-expiry ⇒ the reaper reclaims the Cosmos lease (bumps leaseEpoch, fencing the dead holder) and returns the job to queued ⇒ change-feed re-dispatch. (With the alternative renew-lock model, broker redelivery is the trigger instead — pick one, not both.)
  • Exhausted retries ⇒ Cosmos retries_exhausted; mirror to the broker DLQ for visibility.

5.4 Routing model (the §1.1 fix)

  • Job carries repo + required capabilities (real tokens: os:*, engine:*, has:git, plus a new repo:<name> token).
  • The scheduler does the matching: it picks among factories that advertise those caps and have the repo locally (or can clone it), then targets the winner (§4.2 primary model: message stamped targetFactoryId, delivered via that factory's correlation-filtered subscription).
  • Product is a property/tag used for billing/visibility and budget checks — not the routing key. (In the self-select fallback, product/caps/repo become subscription SQL filters instead.)
  • Fix the tracker-web form in lockstep: derive factories/repos from live data, drop the bogus default capabilities = "build", and stop hardcoding mac-1/mac-2.

5.5 Error handling & cleanup (worktrees, branches, leases)

Today (single-host, agent-queue.sh). The worker already handles errors well: the stage machine routes timeout/budget_exceeded/crash/verify_failed/ capability_mismatch/no_engine through _finish_failure (→ failed/, with a retry policy that requeues to inbox/ with backoff); a trap writes a WIP checkpoint to aq/wip/<job> on every exit path; recover_orphans requeues dead-worker building/ jobs; and a FENCED report (stale leaseEpoch) triggers fleet_quarantinefailed/ that never ships or merges (split-brain guard). PR/merge cleanup: .aq_pr.md is removed before commit; the PR branch aq/job/<jid> is deleted on auto-merge (--delete-branch); the repo worktree is force-recreated at the next job for that repo.

Gaps this redesign must close. These are real loose ends in the current code:

  1. No client-side lease release on failure. _finish_failure is fleet-agnostic, so a failed fleet job's lease only frees on expiry via the reaper — slow recovery. Target: a POST /fleet/fail (stage=failed/queued
    • release lease) so failure is reflected and the lease freed immediately.
  2. Unbounded git artifacts. aq/wip/<job> branches are never GC'd; worktrees are cleaned only on reuse; unmerged aq/job/<jid> branches accumulate on origin when auto-merge is off or blocked by branch protection. Target: a periodic GC sweep — delete merged aq/job/*, prune stale worktrees, and sweep aq/wip/* after a job reaches a terminal/shipped stage.
  3. Same-repo concurrency can clobber a worktree. The per-repo worktree is force-recreated, so two same-repo jobs on one host collide. Target: Service Bus sessions keyed by repo (serialize same-repo work) plus a per-(host, repo) lock as a local backstop.

Target invariants.

  • Terminal failure ⇒ Cosmos failed + lease released now (no expiry wait); DLQ mirrors retries_exhausted for visibility.
  • Crash / fence ⇒ reaper bumps leaseEpoch (fences zombie) ⇒ queued ⇒ change-feed re-dispatch (§5.3).
  • Cleanup is explicit and idempotent — safe to re-run, never deletes a branch with unmerged work or a worktree with an in-flight job. (Checklist in §12.)

6. Per-product tenancy without product-partitioned queues

  • Budgets: checked by the coordinator at assign time (it already reads fleet_budgets /productId in claimNextJob); unchanged, just moved to the dispatcher.
  • Tokens (§12): the factory token still scopes productId + capabilities + factoryId. In the primary (coordinator-targeted) model the dispatcher only ever targets a factory the scheduler deemed eligible, and the coordinator re-checks the token on /fleet/claim — so least-privilege holds without relying on the subscription topology. (In the self-select fallback, scope it with per-product/per-token subscription filters instead.)
  • Visibility: tracker-web keeps querying per product (state is still product-partitioned), so the UX is unchanged.

7. Alternatives considered

Option Verdict Reason
A3 shared queue in Cosmos hot partition; cross-partition claim = more RU; loses tenancy isolation
A1 validate ownership only partial fixes "wrong factory" but not the RU/poll model or process-per-product
Storage Queue / SQS broker ✗ (for now) no subscription filters ⇒ queue-per-capability sprawl; weaker DLQ/visibility ergonomics
B2 change feed, no broker viable good for dispatch signal, but still needs a transport to reach factories; pairs naturally with B3
Plain competing-consumers (drop scheduler) throws away weighted scoring + preemption + cost/affinity routing
B3 Service Bus + coordinator hybrid ✓ chosen zero idle RU, keeps scheduler + fencing, filters give capability/repo routing, paves path to B4

8. Phased migration

Steps are labelled M0M3 to avoid collision with the roadmap's Phase 05 numbering; all of M0M3 sit inside roadmap Phase 4. The ticked checklist is in §12.

M0 — RU quick win (no new infra, fully reversible) — IMPLEMENTED

  • Per-product fleet_queue_state doc holds a monotonic version, bumped on job create + every stage change (centralized in the repo layer, best-effort).
  • The factory run loop does a ~1-RU point-read (GET /fleet/queue-state) and skips the expensive claim while the version is unchanged and it is not mid-drain — rather than raising POLL_SECONDS globally (which would slow local, non-fleet job pickup). A periodic safety backstop + fail-open-on-read-error guarantee work is never stranded.
  • Gated behind AQ_FLEET_GATE=1 (default OFF ⇒ byte-for-byte prior behavior).
  • Expected: ~1050× fewer claim queries at idle, local responsiveness unchanged.
  • Code: common_plat services/platform-service/src/modules/fleet/{types,repository,routes}.ts
    • lib/cosmos-init.ts; agent-queue/lib/fleet-client.sh (fleet_gate_*) + the run-loop hook in agent-queue.sh. Tests: fleet vitest (repo bump + endpoint) + selftest 39b (gate decisions).

M1 — Stand up the broker in shadow

  • Provision Service Bus (fleet-dispatch topic + subscriptions) with managed-identity auth (no connection-string keys in env/.env). Coordinator publishes messages in parallel with the existing claim path but factories still source work from Cosmos. Use the existing AQ_FLEET_SHADOW discipline: record divergence (did the broker route match the scorer's pick?) without acting on it.

M2 — Cutover delivery to the broker

  • Flip a flag so factories source work from Service Bus + /fleet/claim for fencing; Cosmos poll path becomes the fallback only. Keep the reaper + lease fencing untouched. Validate exactly-once + crash recovery on multi-host.

M3 — On-demand factories (B4)

  • KEDA / Container Apps scale-to-zero on subscription depth: spin a factory only when depth > 0; idle ⇒ zero running workers and zero RU. Warm-pool a single small instance if cold-start latency matters.

9. Risks & mitigations

Risk Mitigation
Dual source-of-truth (broker + Cosmos) drift change-feed is the log (no separate outbox); SB duplicate-detection on MessageId=jobId; claim is a Cosmos CAS on leaseEpoch
Broker lock vs leaseEpoch confusion explicit rule: broker lock = delivery, leaseEpoch = correctness; never merge (§5.2)
Long job > 5-min broker lock complete-on-claim (reaper + change feed re-dispatch) or renewMessageLock on the lease cadence (§5.2)
Message > 256 KB message carries jobId + routing props only; consumer reads body from Cosmos (§4.2)
Same-repo worktree contention across hosts Service Bus sessions keyed by repo to serialize same-repo jobs
Lost scheduler features under FIFO coordinator keeps assignment; broker only transports targeted messages
Token scope leak in shared subscriptions per-factory subscription + correlation filter; coordinator re-checks the §12 token on claim
Secrets in env (.env keys) managed identity for Service Bus + Cosmos; no connection-string keys committed
Blind operation emit metrics: subscription depth, dispatch lag, claim-conflict (409) rate, DLQ count, change-feed lag — wire to existing monitoring
Migration regressions M1 shadow measures divergence before any cutover; all flag-gated

10. Open questions

  1. Per-factory subscription scale. The chosen coordinator-targeted model uses one subscription per factory (correlation filter on targetFactoryId). Service Bus allows up to 2,000 subscriptions/topic, so this scales for realistic fleets. If factory churn is high, fall back to a single subscription with a per-consumer targetFactoryId SQL filter.
  2. Where does the dispatcher run? A new lightweight loop in platform-service vs a separate worker. A change-feed lease container is required either way; a single active dispatcher (leader-elected) avoids double-publish.
  3. Cost envelope: Service Bus tier (Standard vs Premium). Standard likely sufficient; Premium only if sessions/large messages/VNet are needed. Confirm against expected message volume.
  4. Do we keep the Cosmos poll path permanently as an offline/degraded fallback (like today's AQ_FLEET_ROUTE=0)? Recommend yes.
  5. Repo advertisement. How does a factory tell the coordinator which repos it has locally (for the repo:<name> capability)? Extend the heartbeat payload with a repos[] list, or derive from AQ_FLEET_REPO_BASE.

11. Appendix — idle RU cost sketch (today vs M0 vs target)

Model Claim/work-find ops at idle (4 factories) Notes
Today (poll 3s) ~115k/day full-partition listJobs scales with partition size; ~4 × 28.8k
M0 (poll 1530s + gate) ~1223k/day 1-RU point-reads + ~0 full scans full scan only when the gate doc changes
Target (B3) ~0 long-poll receive, no RU; full scan never on the hot path

Figures are order-of-magnitude to frame the decision, not a billing estimate. A full-partition listJobs costs many RU and grows with partition size; a point-read is ~1 RU and flat. The point: idle cost goes from "linear in partition size, forever" to "≈ zero".


12. Roadmap & checklist (roadmap Phase 4)

Acceptance gate for the whole effort: idle work-find RU ≈ 0, the "wrong-factory / ineligible-capability" stranding is gone, exactly-once assignment + crash recovery still hold on multi-host, and every step is flag-gated + reversible.

Coverage matrix (design → plan)

Every design element maps to a checklist block below — no design decision is left without an implementation step.

Design element §ref Plan block
Idle-poll RU bleed §1.2 M0
Product-as-queue / wrong-factory §1.1, §5.4 Routing-model fix
Open questions / decisions §10 M-prep
Schema, containers, RBAC §4, §5, §6 M-prep
Service Bus topic + subscriptions + filters §4.2 M-prep, M1
Change-feed dispatcher + scheduler + targeting §4.3, §5.1 M1
Budget enforcement at assign §6 M1
Claim/fence + complete-on-claim §5.2 M2
Small messages (body from Cosmos) §4.2 M2
Token re-check at claim §6 M2
Metrics + alerting §9 M2
Failure→lease release, GC, same-repo clobber §5.5 Error handling & cleanup
Scale-to-zero on-demand §3, §5.1 M3
Tests (dispatcher, CAS/fencing, GC, shadow) §9 Testing
Rollback / flags per phase §3 Rollback & flags
Doc updates Docs to update

M-prep — Decisions & schema (closes §10; before M1)

  • Lock dispatcher placement (platform-service loop vs separate worker) + leader election so a single active dispatcher avoids double-publish (§10 Q2).
  • Lock Service Bus tier (Standard default; Premium only for sessions / large messages / VNet) (§10 Q3).
  • Lock subscription model (per-factory correlation filter default; single-subscription SQL filter if factory churn is high) (§10 Q1).
  • Confirm the Cosmos poll path stays as a permanent flag-gated fallback (AQ_FLEET_ROUTE=0) (§10 Q4).
  • Confirm repo-advertisement source: repos[] in the heartbeat, derived from AQ_FLEET_REPO_BASE (§10 Q5).
  • Schema: add targetFactoryId to FleetJobDoc, repos[] to FleetFactoryDoc; register a new fleet_queue_state doc (/productId) for the M0 gate; provision the change-feed lease container; update the container registry / COSMOS_AUTO_INIT.
  • RBAC via managed identity: dispatcher = Service Bus Sender, factories = Listener on their own subscription; no shared keys committed.

M0 — RU quick win (no new infra) — DONE

  • Add per-product fleet_queue_state doc; bump on create + every stage change (repo layer).
  • Factory loop point-reads the gate each tick; run the claim only when it changed / mid-drain / safety interval.
  • Keep POLL_SECONDS for local responsiveness; gate the claim, with a periodic safety backstop + fail-open (instead of raising the global poll interval).
  • Flag-gate AQ_FLEET_GATE=1 (default OFF) with a clean off-switch.
  • Tests: fleet vitest (repo bump + GET /fleet/queue-state) + selftest 39b (gate decisions) green; gate logic verified standalone.

Routing-model fix (lands with M0/M1)

  • Add repo:<name> capability token; factories advertise local repos via heartbeat (repos[]).
  • Scheduler matches on caps + repo; product becomes a tag, not the routing key.
  • Fix tracker-web New-Job form: drop default capabilities="build", stop hardcoding mac-1/mac-2, derive factories/repos from live data.
  • Add product→repo ownership validation (reject/route mismatches) — the A1 safety net.

M1 — Broker in shadow

  • Provision Service Bus fleet-dispatch topic + per-factory subscriptions, each with a correlation filter targetFactoryId='<id>' (managed identity, no keys).
  • Change-feed dispatcher (leader-elected) tails fleet_jobs, runs scheduler, stamps targetFactoryId (CAS), publishes targeted messages (MessageId=jobId, dup-detection on).
  • Dispatcher enforces per-product budget (paused / ceiling) before publishing (relocates the claimNextJob budget check, §6).
  • Publish in shadow alongside the Cosmos claim path; record route divergence (no action taken).
  • Verify: ≥ N hours shadow with broker-route == scorer-pick within tolerance.

M2 — Cutover delivery

  • Factories consume from Service Bus; /fleet/accept does the Cosmos CAS claim + returns leaseEpoch.
  • Messages carry {jobId, productId, repo, caps, priority, targetFactoryId} only; the consumer reads the full job body from Cosmos by jobId (256 KB limit, §4.2).
  • /fleet/accept (and /fleet/claim) re-checks the §12 factory token (productId + caps + factoryId) before granting the lease.
  • Implement complete-on-claim (reaper + change-feed re-dispatch owns liveness).
  • Cosmos poll path retained as flag-gated fallback (AQ_FLEET_ROUTE=0).
  • Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag — and wire alerts (DLQ depth > 0, dispatch lag > threshold) into existing monitoring.
  • Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ failed/retries_exhausted mapping correct.

Error handling & cleanup (lands with M2) — see §5.5

  • Add POST /fleet/fail so a failed job sets the coordinator stage + releases the lease immediately (no expiry wait); wire it into _finish_failure / fleet_quarantine.
  • GC sweep (idempotent): delete merged aq/job/* branches, prune stale worktrees, sweep aq/wip/* after a job reaches a terminal/shipped stage.
  • Prevent same-repo worktree clobber: Service Bus sessions keyed by repo + a per-(host, repo) local lock.
  • Verify: failed jobs free their lease promptly; no orphaned worktrees/branches after N jobs; GC never deletes unmerged work or an in-flight worktree.

M3 — On-demand factories (scale-to-zero)

  • KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers.
  • Optional warm-pool (1 small instance) if cold-start latency matters.
  • Verify: zero idle workers + zero idle RU; cold-start latency within target.

Testing (every phase — tests are sacred)

  • Unit: dispatcher scheduling + publish, claim CAS + leaseEpoch fencing, /fleet/fail, GC idempotency, the M0 gate read/skip logic.
  • Integration: shadow-divergence harness (M1), exactly-once + crash recovery (M2), scale-to-zero behavior (M3).
  • Extend agent-queue/selftest.sh + platform-service vitest; CI green is the gate to advance each phase.

Rollback & flags (per phase)

  • Each phase ships behind a flag with a documented one-line rollback: M0 AQ_FLEET_GATE, M1 shadow (publishes but never acts), M2 AQ_FLEET_ROUTE / broker-source toggle, M3 scaler disable.
  • Verify each rollback returns to the prior working path with no data loss and no stranded leases/messages.

Docs to update on completion

  • GIGAFACTORY_ROADMAP.md — tick Phase 4; correct the stale §0 progress table.
  • GIGAFACTORY_SYSTEM_OVERVIEW.md — add the broker/dispatcher to the architecture + code map.
  • common_plat docs/GIGAFACTORY/ — mirror the backend/dispatcher changes.