bytelyst-devops-tools/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
saravanakumardb1 8217a864e9 docs(gigafactory): add Phase-4 fleet dispatch redesign (broker + on-demand)
Proposes moving fleet work-dispatch off Cosmos busy-polling onto Azure Service
Bus in a coordinator-owns-scheduling / broker-owns-delivery hybrid, fixing the
product-as-queue routing smell and the idle-poll RU cost. Includes phased
migration (M0 RU quick win -> shadow -> cutover -> scale-to-zero) with a ticked
checklist. Self-reviewed (v2) for the outbox/change-feed, message-size,
long-job lock, idempotency, and routing-model consistency issues.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 22:24:55 -07:00

393 lines
22 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Fleet Dispatch Redesign — Broker-Backed, On-Demand Factories
> Design proposal (no code yet). Companion to `GIGAFACTORY_SYSTEM_OVERVIEW.md`
> (what exists today) and `GIGAFACTORY_ROADMAP.md` (source-of-truth spec). This
> doc realizes roadmap **Phase 4** ("Message bus + autoscaling") and the
> routing-model cleanup that comes with it. Last reviewed: **2026-05-31**.
>
> **Review log**
> - v1 (2026-05-31): initial proposal.
> - v2 (2026-05-31): self-review pass — reconciled the routing model
> (coordinator-targeted as primary), fixed the Cosmos outbox transactionality
> claim (change feed *is* the log), constrained message size (jobId + routing
> props only), addressed long-job vs Service Bus 5-min lock, corrected the
> idempotency key (`MessageId = jobId`), renamed migration steps `M0M3` to
> avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a
> ticked roadmap checklist + auth/observability notes.
---
## 1. Why this doc exists (the two smells)
Two structural problems surfaced while running the local fleet against
`tracker-web` + `platform-service`:
### 1.1 Product-as-queue is conflated with repo-as-work-target
- `fleet_jobs` is partitioned by **`/productId`**, and a factory is bound to a
single product via `AQ_PRODUCT_ID`. The job's **`repo`** is just a payload
field (the PR target). Routing uses `productId`; the repo is orthogonal.
- Consequence observed: a `learning_ai_notes` job submitted via the form was
filed under **`chronomind`** (because the form's Factory dropdown maps
`mac-2 → chronomind`), and would have opened a PR to the notes repo from a
"chronomind" factory. Nothing ties the product to the repo, and nothing
guarantees the chosen factory even has that repo checked out.
- The form (`dashboards/tracker-web/.../fleet/jobs/page.tsx`) hardcodes
`FLEET_FACTORIES = [mac-1→lysnrai, mac-2→chronomind]` and defaults
`capabilities = "build"` — a capability **no agent-queue factory ever
advertises** (`detect_capabilities` only emits `os:*`, `engine:*`, `node:*`,
`has:*`). So default UI submissions are unroutable to live factories.
### 1.2 Pull-poll daemons burn Cosmos RU to stay "ready"
- The run loop iterates every **`POLL_SECONDS=3`**; with `AQ_FLEET_ROUTE=1`
(default) each iteration calls `POST /fleet/claim`.
- `claimNextJob` runs `repo.listJobs({ productId })` — **reads every job doc in
the product partition, no stage filter, no limit** — on every claim, plus a
`getLease` point-read per active job when preemption is on.
- One process **per product** (`_start_fleet.sh` spawns 4) ⇒ ~`4 × (1/3s)` ≈
**115k claim queries/day at idle**, each scaling with partition size, billed
continuously whether or not work exists. The machine must also stay up running
the loop.
> **Root cause:** `productId` is doing double duty as *tenant/billing scope* and
> *work-routing queue*, and work discovery is a busy-poll against the state store.
---
## 2. Goals, non-goals, constraints
**Goals**
- Eliminate idle-poll RU cost; pay (near) zero when there is no work.
- Make a factory a **generic build worker** (host + capabilities + engines +
checked-out repos), not a product-bound process.
- Route work by what actually matters (**capabilities + repo**), while keeping
per-product **billing, budgets, visibility, and token scoping**.
- Preserve the existing **weighted scheduler** and **leaseEpoch fencing**
(exactly-once assignment, zombie-writer protection).
- Enable later **on-demand spawn** (scale-to-zero) without re-architecting.
**Non-goals (this phase)**
- Replacing Cosmos as the system of record for job/lease/event/budget **state**.
- Rewriting the scheduler's scoring math.
- Multi-region / cross-cloud dispatch.
**Hard constraints (ecosystem rules)**
- Every Cosmos doc keeps a `productId` (platform rule) — product stays a
first-class **tag**, even when it is no longer the routing key.
- Per-product budgets (`fleet_budgets /productId`), enrollment tokens (§12), and
the `tracker-web` per-product views must keep working.
- Changes must be flag-gated and reversible (match the existing
`AQ_FLEET` / `AQ_FLEET_ROUTE` / `AQ_FLEET_SHADOW` cutover discipline).
---
## 3. Decision summary
1. **Do NOT build A3 ("single shared queue") inside Cosmos.** A single logical
queue tempts a hot partition; scaling it forces a synthetic partition key and
a **cross-partition "find next job" query**, which *increases* RU — the
opposite of the goal. It also dissolves the per-product isolation the
platform's tenancy/budget/token model depends on.
2. **Get the shared-queue behavior from a real broker (B3), not from Cosmos.**
Adopt **Azure Service Bus** as the dispatch substrate. Cosmos remains
product-partitioned for **state**; the broker owns **delivery**.
3. **Keep the scheduler.** Use a **coordinator-owns-scheduling /
broker-owns-delivery hybrid** (B2 ⊕ B3): the coordinator decides *which
factory* should run a job and pushes a **targeted** message; the broker
handles transport, visibility timeout, retries, and dead-lettering.
4. **Ship the cheap RU win first (B1) as Phase 0** — it is reversible, needs no
new infra, and de-risks the broker migration by removing the bleed while the
bigger change is built and shadowed.
> Net: the shared-queue *experience* (generic workers, one work stream) comes
> from Service Bus topics/subscriptions; Cosmos stays `/productId`-partitioned
> for state, budgets, and visibility.
---
## 4. Target architecture
### 4.1 Components & ownership
| Concern | Owner (target) | Notes |
| --- | --- | --- |
| Job/lease/event/budget **state** | **Cosmos** (`/productId`, `/jobId` as today) | unchanged system of record |
| **Scheduling** (which factory) | **Coordinator** (platform-service) | existing weighted scorer + preemption |
| **Dispatch / delivery** | **Service Bus** | competing consumers, visibility timeout, DLQ |
| **Fencing** (zombie writers) | **Cosmos `leaseEpoch`** | broker visibility ≠ correctness boundary |
| Per-product billing/budgets/tokens | **Cosmos + coordinator** | enforced at submit + assign, not by partition |
| Control planes | `tracker-web`, `agent-queue` dashboard | unchanged REST surface |
### 4.2 Service Bus topology
- One **topic** `fleet-dispatch`.
- **Primary model — coordinator-targeted (preserves the scheduler):** the
coordinator picks the factory, then publishes a message stamped with
`targetFactoryId`. Each factory has its **own subscription** with a
**correlation filter** `targetFactoryId = '<me>'`. The broker does no policy —
it just delivers the scorer's decision. **This is the model the rest of this
doc assumes.**
- **Fallback model — self-select (only if the scheduler is disabled):**
capability/repo **SQL filters** on message application properties let consumers
self-match. Multi-valued `capabilities` do **not** filter cleanly as one
string, so encode each as a boolean property (`cap_os_mac=true`,
`repo_learning_ai_notes=true`) rather than `LIKE '%…%'`. Subscription filters
are why Service Bus beats Storage Queue / SQS (which can't filter → a
queue-per-class sprawl).
- **Messages stay small.** A message carries only
`{ jobId, productId, repo, caps, priority, targetFactoryId }`**not**
`bodyMd`/manifest. The consumer reads the full job from Cosmos by `jobId`.
(Service Bus max message is **256 KB** Standard / 1 MB Premium; job bodies can
approach that — reinforcing "broker = transport, Cosmos = state".)
- **DLQ** per subscription ⇒ maps onto `failed` / `retries_exhausted`.
- **Sessions** (optional) keyed by `repo` to serialize same-repo work and avoid
worktree/branch contention on one host.
### 4.3 Why this keeps the scheduler
A vanilla broker is FIFO competing-consumers and does **no** weighted scoring.
To preserve the existing scorer (`capabilityFit / affinity / load / costFit /
health / starvation`) + preemption + seat limits, the coordinator stays in the
decision path: it **selects the target factory** and publishes a message whose
filter routes it to *that* factory's subscription (or a per-factory
subscription). The broker is transport, not policy.
---
## 5. Key flows
### 5.1 Submit → dispatch (consistency)
The **Cosmos change feed on `fleet_jobs` is the durable, ordered event log**, so
no separate outbox container is needed for the primary design:
1. `submitJob` writes the `fleet_jobs` doc (`stage: queued`). That write *is* the
event.
2. A single **dispatcher** (coordinator process) tails the `fleet_jobs` change
feed (via a lease container), runs the scheduler for each new/`queued` job,
stamps `assignedFactoryId` on the job (CAS), and **publishes** the targeted
Service Bus message.
3. **Crash-safe & idempotent:** the change feed redelivers from the last
checkpoint on dispatcher restart; Service Bus **duplicate detection** keyed on
**`MessageId = jobId`** collapses re-publishes. The consumer is idempotent
because the authoritative claim is a Cosmos CAS on `leaseEpoch` — a second
delivery is simply fenced (`leaseEpoch` is assigned *at claim*, so it is **not**
a valid dedup key for the message itself).
> A separate **transactional outbox** is only needed if you ever publish *inline*
> at submit instead of via the change feed. Cross-container writes are **not
> atomic** in Cosmos, so an outbox row would have to live in the **same container
> + same partition** as the job and be written with a **Cosmos transactional
> batch** — or, simpler, carried as an `outboxState` field on the job doc itself.
> The change-feed design avoids this entirely.
> Net effect: the per-factory busy-poll is replaced by one change-feed-driven
> dispatcher. Idle cost is event-driven, not a per-3s full-partition scan.
### 5.2 Deliver → claim → fence
1. Factory **receives** a message (long-poll/`receiveMessages`, no RU).
2. Factory calls `POST /fleet/claim` (or a lighter `/fleet/accept`) with
`{ jobId, factoryId }`. Coordinator does the **CAS lease** in Cosmos exactly
as today (`revUpdateJob` + `leaseEpoch` bump) and returns the new epoch.
409 ⇒ fenced ⇒ factory abandons the message (it goes back / to DLQ).
3. The **broker lock** governs redelivery (a dead consumer's message reappears);
the **Cosmos `leaseEpoch`** governs *correctness* (a zombie writer is rejected
on PATCH). Two distinct mechanisms — do not collapse them.
4. **Long-running jobs vs the broker lock.** Service Bus message lock max is
**5 minutes**; a coding job runs far longer. Two viable patterns:
- **(recommended) complete-on-claim:** complete the message immediately after
a successful Cosmos claim. The **Cosmos lease + reaper** then own liveness —
on crash the reaper sets the job back to `queued`, which is a change-feed
event that **re-dispatches** (§5.1). This decouples job runtime from the
5-min lock entirely.
- **renew-lock:** keep the message locked and call `renewMessageLock` on a
timer, reusing the existing `AQ_FLEET_LEASE_RENEW_SEC` cadence to renew
*both* the Cosmos lease and the broker lock. Simpler delivery semantics, but
couples runtime to the broker and risks redelivery storms on long jobs.
### 5.3 Failure / retry / DLQ
- Engine failure / verify-fail ⇒ coordinator transitions `failed`; broker
message completed (don't redeliver a logical failure).
- Crash / lease-expiry ⇒ broker redelivers after visibility timeout **and** the
existing reaper reclaims the Cosmos lease (bumps epoch). The redelivered
message re-runs `/fleet/claim`, which fences the old holder.
- Exhausted retries ⇒ broker DLQ + Cosmos `retries_exhausted`.
### 5.4 Routing model (the §1.1 fix)
- Job carries `repo` + required `capabilities` (real tokens: `os:*`, `engine:*`,
`has:git`, plus a new `repo:<name>` token).
- The **scheduler** does the matching: it picks among factories that advertise
those caps **and** have the repo locally (or can clone it), then targets the
winner (§4.2 primary model: message stamped `targetFactoryId`, delivered via
that factory's correlation-filtered subscription).
- **Product is a property/tag** used for billing/visibility and budget checks —
**not** the routing key. (In the self-select fallback, product/caps/repo become
subscription SQL filters instead.)
- Fix the `tracker-web` form in lockstep: derive factories/repos from live data,
drop the bogus default `capabilities = "build"`, and stop hardcoding
`mac-1/mac-2`.
---
## 6. Per-product tenancy without product-partitioned queues
- **Budgets:** checked by the coordinator at **assign time** (it already reads
`fleet_budgets /productId` in `claimNextJob`); unchanged, just moved to the
dispatcher.
- **Tokens (§12):** the factory token still scopes `productId + capabilities +
factoryId`; the coordinator enforces it on `/fleet/claim`. A factory may
consume cross-product messages **only** for products its token covers — model
this as per-product (or per-token) subscriptions/filters so least-privilege is
preserved.
- **Visibility:** `tracker-web` keeps querying per product (state is still
product-partitioned), so the UX is unchanged.
---
## 7. Alternatives considered
| Option | Verdict | Reason |
| --- | --- | --- |
| **A3 shared queue in Cosmos** | ✗ | hot partition; cross-partition claim = more RU; loses tenancy isolation |
| **A1 validate ownership only** | partial | fixes "wrong factory" but not the RU/poll model or process-per-product |
| **Storage Queue / SQS broker** | ✗ (for now) | no subscription filters ⇒ queue-per-capability sprawl; weaker DLQ/visibility ergonomics |
| **B2 change feed, no broker** | viable | good for dispatch signal, but still needs a transport to *reach* factories; pairs naturally with B3 |
| **Plain competing-consumers (drop scheduler)** | ✗ | throws away weighted scoring + preemption + cost/affinity routing |
| **B3 Service Bus + coordinator hybrid** | ✓ chosen | zero idle RU, keeps scheduler + fencing, filters give capability/repo routing, paves path to B4 |
---
## 8. Phased migration
> Steps are labelled **M0M3** to avoid collision with the roadmap's Phase 05
> numbering; all of M0M3 sit *inside* roadmap **Phase 4**. The ticked checklist
> is in §12.
### M0 — RU quick win (no new infra, fully reversible) — *do now*
- Add a per-product `queue_version`/`pending_count` doc bumped on submit/stage
change. The factory's loop does a **1-RU point-read** of that doc each tick and
only runs the expensive `listJobs`/claim when it changed.
- Raise `POLL_SECONDS` (e.g. 3 → 1530) and add jittered backoff when idle.
- Expected: **~1050× fewer claim queries at idle**, behavior otherwise
identical. Gate behind a flag; trivially revertible.
### M1 — Stand up the broker in **shadow**
- Provision Service Bus (`fleet-dispatch` topic + subscriptions) with
**managed-identity** auth (no connection-string keys in env/`.env`). Coordinator
publishes messages **in parallel** with the existing claim path but factories
still source work from Cosmos. Use the existing `AQ_FLEET_SHADOW` discipline:
record divergence (did the broker route match the scorer's pick?) without
acting on it.
### M2 — Cutover delivery to the broker
- Flip a flag so factories source work from Service Bus + `/fleet/claim` for
fencing; Cosmos poll path becomes the fallback only. Keep the reaper + lease
fencing untouched. Validate exactly-once + crash recovery on multi-host.
### M3 — On-demand factories (B4)
- KEDA / Container Apps scale-to-zero on subscription depth: spin a factory only
when depth > 0; idle ⇒ **zero** running workers and zero RU. Warm-pool a single
small instance if cold-start latency matters.
---
## 9. Risks & mitigations
| Risk | Mitigation |
| --- | --- |
| Dual source-of-truth (broker + Cosmos) drift | change-feed *is* the log (no separate outbox); SB duplicate-detection on `MessageId=jobId`; claim is a Cosmos CAS on `leaseEpoch` |
| Broker lock vs `leaseEpoch` confusion | explicit rule: broker lock = *delivery*, `leaseEpoch` = *correctness*; never merge (§5.2) |
| Long job > 5-min broker lock | **complete-on-claim** (reaper + change feed re-dispatch) or `renewMessageLock` on the lease cadence (§5.2) |
| Message > 256 KB | message carries `jobId` + routing props only; consumer reads body from Cosmos (§4.2) |
| Same-repo worktree contention across hosts | Service Bus **sessions** keyed by `repo` to serialize same-repo jobs |
| Lost scheduler features under FIFO | coordinator keeps assignment; broker only transports targeted messages |
| Token scope leak in shared subscriptions | per-factory subscription + correlation filter; coordinator re-checks the §12 token on claim |
| Secrets in env (`.env` keys) | **managed identity** for Service Bus + Cosmos; no connection-string keys committed |
| Blind operation | emit metrics: subscription depth, dispatch lag, claim-conflict (409) rate, DLQ count, change-feed lag — wire to existing monitoring |
| Migration regressions | M1 shadow measures divergence before any cutover; all flag-gated |
---
## 10. Open questions
1. **Per-factory subscription scale.** The chosen coordinator-targeted model uses
one subscription per factory (correlation filter on `targetFactoryId`). Service
Bus allows up to **2,000 subscriptions/topic**, so this scales for realistic
fleets. If factory churn is high, fall back to a single subscription with a
per-consumer `targetFactoryId` SQL filter.
2. **Where does the dispatcher run?** A new lightweight loop in platform-service
vs a separate worker. A change-feed lease container is required either way; a
single active dispatcher (leader-elected) avoids double-publish.
3. **Cost envelope:** Service Bus tier (Standard vs Premium). Standard likely
sufficient; Premium only if sessions/large messages/VNet are needed. Confirm
against expected message volume.
4. **Do we keep the Cosmos poll path permanently** as an offline/degraded
fallback (like today's `AQ_FLEET_ROUTE=0`)? Recommend yes.
5. **Repo advertisement.** How does a factory tell the coordinator which repos it
has locally (for the `repo:<name>` capability)? Extend the heartbeat payload
with a `repos[]` list, or derive from `AQ_FLEET_REPO_BASE`.
---
## 11. Appendix — idle RU cost sketch (today vs M0 vs target)
| Model | Claim/work-find ops at idle (4 factories) | Notes |
| --- | --- | --- |
| **Today** (poll 3s) | ~115k/day full-partition `listJobs` | scales with partition size; ~`4 × 28.8k` |
| **M0** (poll 1530s + gate) | ~1223k/day **1-RU point-reads** + ~0 full scans | full scan only when the gate doc changes |
| **Target (B3)** | ~0 | long-poll receive, no RU; full scan never on the hot path |
> Figures are order-of-magnitude to frame the decision, not a billing estimate.
> A full-partition `listJobs` costs many RU and grows with partition size; a
> point-read is ~1 RU and flat. The point: idle cost goes from "linear in
> partition size, forever" to "≈ zero".
---
## 12. Roadmap & checklist (roadmap Phase 4)
Acceptance gate for the whole effort: **idle work-find RU ≈ 0**, the
"wrong-factory / ineligible-capability" stranding is gone, exactly-once
assignment + crash recovery still hold on multi-host, and every step is
flag-gated + reversible.
### M0 — RU quick win (no new infra)
- [ ] Add per-product `queue_version`/`pending_count` doc; bump on submit + stage change.
- [ ] Factory loop point-reads the gate each tick; run `listJobs`/claim only when it changed.
- [ ] Raise `POLL_SECONDS` default (3 → 1530) + jittered idle backoff.
- [ ] Flag-gate (e.g. `AQ_FLEET_GATE=1`) with a clean off-switch.
- [ ] Verify: idle claim queries drop ~1050×; functional behavior unchanged (selftest green).
### Routing-model fix (lands with M0/M1)
- [ ] Add `repo:<name>` capability token; factories advertise local repos via heartbeat (`repos[]`).
- [ ] Scheduler matches on caps **+ repo**; product becomes a tag, not the routing key.
- [ ] Fix `tracker-web` New-Job form: drop default `capabilities="build"`, stop hardcoding `mac-1/mac-2`, derive factories/repos from live data.
- [ ] Add product→repo ownership validation (reject/route mismatches) — the A1 safety net.
### M1 — Broker in shadow
- [ ] Provision Service Bus `fleet-dispatch` topic + per-factory subscriptions (managed identity, no keys).
- [ ] Change-feed dispatcher (leader-elected) tails `fleet_jobs`, runs scheduler, publishes targeted messages (`MessageId=jobId`, dup-detection on).
- [ ] Publish in **shadow** alongside the Cosmos claim path; record route divergence (no action taken).
- [ ] Verify: ≥ N hours shadow with broker-route == scorer-pick within tolerance.
### M2 — Cutover delivery
- [ ] Factories consume from Service Bus; `/fleet/accept` does the Cosmos CAS claim + returns `leaseEpoch`.
- [ ] Implement **complete-on-claim** (reaper + change-feed re-dispatch owns liveness).
- [ ] Cosmos poll path retained as flag-gated fallback (`AQ_FLEET_ROUTE=0`).
- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag.
- [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
### M3 — On-demand factories (scale-to-zero)
- [ ] KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers.
- [ ] Optional warm-pool (1 small instance) if cold-start latency matters.
- [ ] Verify: zero idle workers + zero idle RU; cold-start latency within target.
### Docs to update on completion
- [ ] `GIGAFACTORY_ROADMAP.md` — tick Phase 4; correct the stale §0 progress table.
- [ ] `GIGAFACTORY_SYSTEM_OVERVIEW.md` — add the broker/dispatcher to the architecture + code map.
- [ ] common_plat `docs/GIGAFACTORY/` — mirror the backend/dispatcher changes.