Fleet Dispatch Redesign — Broker-Backed, On-Demand Factories
Design proposal (no code yet). Companion to GIGAFACTORY_SYSTEM_OVERVIEW.md (what exists today) and GIGAFACTORY_ROADMAP.md (source-of-truth spec). This doc realizes roadmap Phase 4 ("Message bus + autoscaling") and the routing-model cleanup that comes with it. Last reviewed: 2026-05-31.
Review log
- v1 (2026-05-31): initial proposal.
- v2 (2026-05-31): self-review pass. Reconciled the routing model (coordinator-targeted as primary), fixed the Cosmos outbox transactionality claim (change feed is the log), constrained message size (jobId + routing props only), addressed long-job vs Service Bus 5-min lock, corrected the idempotency key (MessageId = jobId), renamed migration steps M0–M3 to avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a ticked roadmap checklist + auth/observability notes.
- v3 (2026-05-31): added §5.5 Error handling & cleanup (current behavior + lease-release-on-failure, branch/worktree GC, same-repo worktree clobber). Review fixes: unified the field name to targetFactoryId (§5.1), reconciled §5.3 with the complete-on-claim model (broker is not the redelivery path), aligned §6 token scoping with per-factory subscriptions, and added the GC / POST /fleet/fail checklist block to §12.
1. Why this doc exists (the two smells)
Two structural problems surfaced while running the local fleet against
tracker-web + platform-service:
1.1 Product-as-queue is conflated with repo-as-work-target
- fleet_jobs is partitioned by /productId, and a factory is bound to a single product via AQ_PRODUCT_ID. The job's repo is just a payload field (the PR target). Routing uses productId; the repo is orthogonal.
- Consequence observed: a learning_ai_notes job submitted via the form was filed under chronomind (because the form's Factory dropdown maps mac-2 → chronomind), and would have opened a PR to the notes repo from a "chronomind" factory. Nothing ties the product to the repo, and nothing guarantees the chosen factory even has that repo checked out.
- The form (dashboards/tracker-web/.../fleet/jobs/page.tsx) hardcodes FLEET_FACTORIES = [mac-1→lysnrai, mac-2→chronomind] and defaults capabilities = "build", a capability no agent-queue factory ever advertises (detect_capabilities only emits os:*, engine:*, node:*, has:*). So default UI submissions are unroutable to live factories.
1.2 Pull-poll daemons burn Cosmos RU to stay "ready"
- The run loop iterates every POLL_SECONDS=3; with AQ_FLEET_ROUTE=1 (default) each iteration calls POST /fleet/claim.
- claimNextJob runs repo.listJobs({ productId }), which reads every job doc in the product partition (no stage filter, no limit) on every claim, plus a getLease point-read per active job when preemption is on.
- One process per product (_start_fleet.sh spawns 4) ⇒ 4 × (86,400 s / 3 s) ≈ 115k claim queries/day at idle, each scaling with partition size, billed continuously whether or not work exists. The machine must also stay up running the loop.
Root cause: productId is doing double duty as tenant/billing scope and work-routing queue, and work discovery is a busy-poll against the state store.
2. Goals, non-goals, constraints
Goals
- Eliminate idle-poll RU cost; pay (near) zero when there is no work.
- Make a factory a generic build worker (host + capabilities + engines + checked-out repos), not a product-bound process.
- Route work by what actually matters (capabilities + repo), while keeping per-product billing, budgets, visibility, and token scoping.
- Preserve the existing weighted scheduler and leaseEpoch fencing (exactly-once assignment, zombie-writer protection).
- Enable later on-demand spawn (scale-to-zero) without re-architecting.
Non-goals (this phase)
- Replacing Cosmos as the system of record for job/lease/event/budget state.
- Rewriting the scheduler's scoring math.
- Multi-region / cross-cloud dispatch.
Hard constraints (ecosystem rules)
- Every Cosmos doc keeps a productId (platform rule): product stays a first-class tag, even when it is no longer the routing key.
- Per-product budgets (fleet_budgets /productId), enrollment tokens (§12), and the tracker-web per-product views must keep working.
- Changes must be flag-gated and reversible (match the existing AQ_FLEET / AQ_FLEET_ROUTE / AQ_FLEET_SHADOW cutover discipline).
3. Decision summary
- Do NOT build A3 ("single shared queue") inside Cosmos. A single logical queue tempts a hot partition; scaling it forces a synthetic partition key and a cross-partition "find next job" query, which increases RU — the opposite of the goal. It also dissolves the per-product isolation the platform's tenancy/budget/token model depends on.
- Get the shared-queue behavior from a real broker (B3), not from Cosmos. Adopt Azure Service Bus as the dispatch substrate. Cosmos remains product-partitioned for state; the broker owns delivery.
- Keep the scheduler. Use a coordinator-owns-scheduling / broker-owns-delivery hybrid (B2 ⊕ B3): the coordinator decides which factory should run a job and pushes a targeted message; the broker handles transport, visibility timeout, retries, and dead-lettering.
- Ship the cheap RU win first (B1) as step M0 — it is reversible, needs no new infra, and de-risks the broker migration by removing the bleed while the bigger change is built and shadowed.
Net: the shared-queue experience (generic workers, one work stream) comes from Service Bus topics/subscriptions; Cosmos stays
/productId-partitioned for state, budgets, and visibility.
4. Target architecture
4.1 Components & ownership
| Concern | Owner (target) | Notes |
|---|---|---|
| Job/lease/event/budget state | Cosmos (/productId, /jobId as today) | unchanged system of record |
| Scheduling (which factory) | Coordinator (platform-service) | existing weighted scorer + preemption |
| Dispatch / delivery | Service Bus | competing consumers, visibility timeout, DLQ |
| Fencing (zombie writers) | Cosmos leaseEpoch | broker visibility ≠ correctness boundary |
| Per-product billing/budgets/tokens | Cosmos + coordinator | enforced at submit + assign, not by partition |
| Control planes | tracker-web, agent-queue dashboard | unchanged REST surface |
4.2 Service Bus topology
- One topic fleet-dispatch.
- Primary model, coordinator-targeted (preserves the scheduler): the coordinator picks the factory, then publishes a message stamped with targetFactoryId. Each factory has its own subscription with a correlation filter targetFactoryId = '<me>'. The broker does no policy; it just delivers the scorer's decision. This is the model the rest of this doc assumes.
- Fallback model, self-select (only if the scheduler is disabled): capability/repo SQL filters on message application properties let consumers self-match. Multi-valued capabilities do not filter cleanly as one string, so encode each as a boolean property (cap_os_mac=true, repo_learning_ai_notes=true) rather than LIKE '%…%'. Subscription filters are why Service Bus beats Storage Queue / SQS (which can't filter → a queue-per-class sprawl).
- Messages stay small. A message carries only { jobId, productId, repo, caps, priority, targetFactoryId }, not bodyMd/manifest. The consumer reads the full job from Cosmos by jobId. (Service Bus max message is 256 KB Standard / 1 MB Premium; job bodies can approach that, reinforcing "broker = transport, Cosmos = state".)
- DLQ per subscription ⇒ maps onto failed / retries_exhausted.
- Sessions (optional) keyed by repo to serialize same-repo work and avoid worktree/branch contention on one host.
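As a rough illustration of the "messages stay small" and boolean-property points above, the sketch below builds the routing envelope for one job. The helper names (property_name, build_envelope), the sanitising rule, and the sample job are assumptions made for this example; only the field list and the cap_os_mac / repo_learning_ai_notes naming come from this section.

```python
import re


def property_name(prefix: str, token: str) -> str:
    """Turn a token such as 'os:mac' into a filter-friendly name such as 'cap_os_mac'."""
    # Assumption: every run of non-alphanumeric characters collapses to one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", token).strip("_").lower()
    return f"{prefix}_{cleaned}"


def build_envelope(job: dict, target_factory_id: str) -> dict:
    """Return the small routing message; the full job body stays in Cosmos."""
    body = {
        "jobId": job["jobId"],
        "productId": job["productId"],
        "repo": job["repo"],
        "caps": sorted(job["capabilities"]),
        "priority": job.get("priority", 0),
        "targetFactoryId": target_factory_id,
    }
    # Application properties are what subscription filters can match on.
    props = {"targetFactoryId": target_factory_id, "productId": job["productId"]}
    for cap in job["capabilities"]:
        props[property_name("cap", cap)] = True
    props[property_name("repo", job["repo"])] = True
    return {"message_id": job["jobId"], "body": body, "application_properties": props}


job = {
    "jobId": "job-123",
    "productId": "chronomind",
    "repo": "learning_ai_notes",
    "capabilities": ["os:mac", "has:git"],
    "bodyMd": "# long job description ...",  # deliberately never copied into the message
}
envelope = build_envelope(job, "mac-2")
print(sorted(envelope["application_properties"]))
```

The body and the properties repeat the routing fields on purpose: filters only see application properties, while the consumer only needs jobId to load everything else.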
4.3 Why this keeps the scheduler
A vanilla broker is FIFO competing-consumers and does no weighted scoring.
To preserve the existing scorer (capabilityFit / affinity / load / costFit / health / starvation) + preemption + seat limits, the coordinator stays in the
decision path: it selects the target factory and publishes a message whose
targetFactoryId property matches the correlation filter on that factory's
subscription (§4.2). The broker is transport, not policy.
5. Key flows
5.1 Submit → dispatch (consistency)
The Cosmos change feed on fleet_jobs is the durable, ordered event log, so
no separate outbox container is needed for the primary design:
- submitJob writes the fleet_jobs doc (stage: queued). That write is the event.
- A single dispatcher (coordinator process) tails the fleet_jobs change feed (via a lease container), runs the scheduler for each new/queued job, stamps targetFactoryId on the job (CAS), and publishes the targeted Service Bus message.
- Crash-safe & idempotent: the change feed redelivers from the last checkpoint on dispatcher restart; Service Bus duplicate detection keyed on MessageId = jobId collapses re-publishes. The consumer is idempotent because the authoritative claim is a Cosmos CAS on leaseEpoch, so a second delivery is simply fenced (leaseEpoch is assigned at claim, so it is not a valid dedup key for the message itself).
A separate transactional outbox is only needed if you ever publish inline at submit instead of via the change feed. Cross-container writes are not atomic in Cosmos, so an outbox row would have to live in the **same container + same partition** as the job and be written with a Cosmos transactional batch, or, simpler, be carried as an outboxState field on the job doc itself. The change-feed design avoids this entirely.
Net effect: the per-factory busy-poll is replaced by one change-feed-driven dispatcher. Idle cost is event-driven, not a per-3s full-partition scan.
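The replay-safety argument in this section can be sketched with in-memory stand-ins. FakeTopic, dispatch, and the list-based feed are hypothetical names for this example and are not the Cosmos or Service Bus SDKs; real duplicate detection also applies only within a configured time window, which the plain set below ignores.

```python
class FakeTopic:
    """In-memory stand-in for a topic with duplicate detection on the message id."""

    def __init__(self):
        self.seen_ids = set()
        self.delivered = []

    def publish(self, message_id: str, target_factory_id: str) -> bool:
        if message_id in self.seen_ids:
            return False  # duplicate collapsed by the broker
        self.seen_ids.add(message_id)
        self.delivered.append((message_id, target_factory_id))
        return True


def dispatch(feed, checkpoint, topic, pick_factory):
    """Process change-feed entries starting at `checkpoint`; return the new checkpoint."""
    for offset in range(checkpoint, len(feed)):
        job = feed[offset]
        if job["stage"] != "queued":
            continue
        target = pick_factory(job)           # scheduler decision
        job["targetFactoryId"] = target      # stands in for the CAS stamp on the job doc
        topic.publish(job["jobId"], target)  # MessageId = jobId
    return len(feed)


feed = [
    {"jobId": "j1", "stage": "queued"},
    {"jobId": "j2", "stage": "building"},
    {"jobId": "j3", "stage": "queued"},
]
topic = FakeTopic()


def pick(job):
    return "mac-1"  # placeholder for the weighted scorer


dispatch(feed, 0, topic, pick)  # first pass publishes j1 and j3
dispatch(feed, 0, topic, pick)  # simulated crash: replay from an old checkpoint
print(topic.delivered)
```

The second pass re-reads the same entries, but nothing new reaches the topic, which is the property the dispatcher relies on after a restart.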
5.2 Deliver → claim → fence
- Factory receives a message (long-poll / receiveMessages, no RU).
- Factory calls POST /fleet/claim (or a lighter /fleet/accept) with { jobId, factoryId }. Coordinator does the CAS lease in Cosmos exactly as today (revUpdateJob + leaseEpoch bump) and returns the new epoch. 409 ⇒ fenced ⇒ factory abandons the message (it returns to the subscription, or moves to the DLQ once its delivery count is exhausted).
- The broker lock governs redelivery (a dead consumer's message reappears); the Cosmos leaseEpoch governs correctness (a zombie writer is rejected on PATCH). These are two distinct mechanisms; do not collapse them.
- Long-running jobs vs the broker lock. Service Bus message lock max is 5 minutes; a coding job runs far longer. Two viable patterns:
  - (recommended) complete-on-claim: complete the message immediately after a successful Cosmos claim. The Cosmos lease + reaper then own liveness: on crash the reaper sets the job back to queued, which is a change-feed event that re-dispatches (§5.1). This decouples job runtime from the 5-min lock entirely.
  - renew-lock: keep the message locked and call renewMessageLock on a timer, reusing the existing AQ_FLEET_LEASE_RENEW_SEC cadence to renew both the Cosmos lease and the broker lock. Simpler delivery semantics, but couples runtime to the broker and risks redelivery storms on long jobs.
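To make the "broker lock = delivery, leaseEpoch = correctness" split concrete, here is a minimal single-process model of claim, reap, and fenced writes. JobStore, Fenced, and the method names are hypothetical; the real coordinator performs the claim as a conditional Cosmos write (revUpdateJob), which this sketch only imitates with an in-memory dict.

```python
class Fenced(Exception):
    """Raised where the real API would answer 409."""


class JobStore:
    def __init__(self):
        self.jobs = {}

    def add(self, job_id):
        self.jobs[job_id] = {"stage": "queued", "leaseEpoch": 0, "holder": None}

    def claim(self, job_id, factory_id):
        """Take the lease if the job is still queued; return the new epoch."""
        job = self.jobs[job_id]
        if job["stage"] != "queued":
            raise Fenced(f"{job_id} is already held by {job['holder']}")
        job.update(stage="building", holder=factory_id, leaseEpoch=job["leaseEpoch"] + 1)
        return job["leaseEpoch"]

    def reap(self, job_id):
        """Lease expired: bump the epoch to fence the old holder, then requeue."""
        job = self.jobs[job_id]
        job.update(stage="queued", holder=None, leaseEpoch=job["leaseEpoch"] + 1)

    def patch(self, job_id, epoch, **fields):
        """Apply a write only if the caller still holds the current epoch."""
        job = self.jobs[job_id]
        if epoch != job["leaseEpoch"]:
            raise Fenced(f"stale epoch {epoch}, current is {job['leaseEpoch']}")
        job.update(fields)


store = JobStore()
store.add("j1")
epoch_a = store.claim("j1", "mac-1")      # first delivery wins the lease

try:
    store.claim("j1", "mac-2")            # duplicate delivery
    second_claim = "won"
except Fenced:
    second_claim = "fenced"

store.reap("j1")                          # mac-1 went silent; the reaper requeues
epoch_b = store.claim("j1", "mac-2")      # re-dispatched job is claimed again

try:
    store.patch("j1", epoch_a, stage="shipped")   # zombie mac-1 wakes up
    zombie_write = "accepted"
except Fenced:
    zombie_write = "rejected"

store.patch("j1", epoch_b, stage="shipped")       # current holder succeeds
print(second_claim, zombie_write, store.jobs["j1"]["stage"])
```

Nothing in the model depends on a broker lock, which is why completing the message right after a successful claim does not weaken correctness.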
5.3 Failure / retry / DLQ
Assumes the recommended complete-on-claim model (§5.2): the broker message is completed at claim, so the broker is not the redelivery path — re-dispatch is driven by Cosmos stage changes through the change feed (§5.1).
- Logical failure (engine error / verify-fail) ⇒ coordinator transitions the job to failed and releases the lease immediately (new /fleet/fail, see §5.5); no redelivery (a logical failure is terminal unless a retry policy applies).
- Retryable failure ⇒ coordinator sets the job back to queued (attempts++, backoff) ⇒ change-feed re-dispatch to the next best factory.
- Crash / lease-expiry ⇒ the reaper reclaims the Cosmos lease (bumps leaseEpoch, fencing the dead holder) and returns the job to queued ⇒ change-feed re-dispatch. (With the alternative renew-lock model, broker redelivery is the trigger instead; pick one, not both.)
- Exhausted retries ⇒ Cosmos retries_exhausted; mirror to the broker DLQ for visibility.
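The four cases above reduce to a small transition function. The sketch below is one possible reading, with several assumptions: the outcome labels, max_attempts = 3, and the exponential backoff formula are invented for illustration, and a lease expiry is assumed not to consume a retry attempt because this section does not say either way.

```python
def next_state(outcome: str, attempts: int, max_attempts: int = 3, base_backoff_s: int = 30):
    """Map a job outcome to (stage, attempts, backoff_seconds, mirror_to_dlq)."""
    if outcome == "logical_failure":
        # Terminal: lease released immediately, nothing is re-dispatched.
        return ("failed", attempts, 0, False)
    if outcome == "lease_expired":
        # Reaper path: requeue so the change feed re-dispatches the job.
        return ("queued", attempts, 0, False)
    if outcome == "retryable_failure":
        attempts += 1
        if attempts >= max_attempts:
            return ("retries_exhausted", attempts, 0, True)
        # Assumed policy: 30 s, 60 s, 120 s, ...
        return ("queued", attempts, base_backoff_s * 2 ** (attempts - 1), False)
    raise ValueError(f"unknown outcome: {outcome}")


for attempts_so_far in range(3):
    print(attempts_so_far, next_state("retryable_failure", attempts_so_far))
```

Every path that ends in queued is picked up by the change feed, so no branch needs to talk to the broker except the optional DLQ mirror.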
5.4 Routing model (the §1.1 fix)
- Job carries repo + required capabilities (real tokens: os:*, engine:*, has:git, plus a new repo:<name> token).
- The scheduler does the matching: it picks among factories that advertise those caps and have the repo locally (or can clone it), then targets the winner (§4.2 primary model: message stamped targetFactoryId, delivered via that factory's correlation-filtered subscription).
- Product is a property/tag used for billing/visibility and budget checks, not the routing key. (In the self-select fallback, product/caps/repo become subscription SQL filters instead.)
- Fix the tracker-web form in lockstep: derive factories/repos from live data, drop the bogus default capabilities = "build", and stop hardcoding mac-1/mac-2.
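The eligibility step that precedes scoring could look roughly like this. It is a sketch of the filter only, not the weighted scorer, and it assumes exact token matching and ignores the "can clone it" case; eligible_factories and the sample advertisement sets are hypothetical.

```python
def eligible_factories(job: dict, factories: dict) -> list:
    """Return ids of factories that advertise every required capability plus the repo token."""
    needed = set(job["capabilities"]) | {f"repo:{job['repo']}"}
    return sorted(fid for fid, advertised in factories.items() if needed <= advertised)


# Advertised tokens per factory, e.g. collected from heartbeats (sample data).
factories = {
    "mac-1": {"os:mac", "has:git", "repo:lysnrai"},
    "mac-2": {"os:mac", "has:git", "repo:learning_ai_notes", "repo:chronomind"},
}

notes_job = {"repo": "learning_ai_notes", "capabilities": ["os:mac", "has:git"]}
legacy_job = {"repo": "learning_ai_notes", "capabilities": ["build"]}

print(eligible_factories(notes_job, factories))   # only the factory that has the repo
print(eligible_factories(legacy_job, factories))  # the old "build" default matches nobody
```

The second call reproduces the §1.1 symptom: a job that requires "build" has no eligible factory, so rejecting it at submit time is cheaper than letting it strand in the queue.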
5.5 Error handling & cleanup (worktrees, branches, leases)
Today (single-host, agent-queue.sh). The worker already handles errors well:
the stage machine routes timeout/budget_exceeded/crash/verify_failed/
capability_mismatch/no_engine through _finish_failure (→ failed/, with a
retry policy that requeues to inbox/ with backoff); a trap writes a WIP
checkpoint to aq/wip/<job> on every exit path; recover_orphans requeues
dead-worker building/ jobs; and a FENCED report (stale leaseEpoch)
triggers fleet_quarantine → failed/ that never ships or merges
(split-brain guard). PR/merge cleanup: .aq_pr.md is removed before commit; the
PR branch aq/job/<jid> is deleted on auto-merge (--delete-branch); the repo
worktree is force-recreated at the next job for that repo.
Gaps this redesign must close. These are real loose ends in the current code:
- No client-side lease release on failure. _finish_failure is fleet-agnostic, so a failed fleet job's lease only frees on expiry via the reaper, which makes recovery slow. Target: a POST /fleet/fail (stage=failed/queued + release lease) so failure is reflected and the lease freed immediately.
- Unbounded git artifacts. aq/wip/<job> branches are never GC'd; worktrees are cleaned only on reuse; unmerged aq/job/<jid> branches accumulate on origin when auto-merge is off or blocked by branch protection. Target: a periodic GC sweep that deletes merged aq/job/*, prunes stale worktrees, and sweeps aq/wip/* after a job reaches a terminal/shipped stage.
- Same-repo concurrency can clobber a worktree. The per-repo worktree is force-recreated, so two same-repo jobs on one host collide. Target: Service Bus sessions keyed by repo (serialize same-repo work) plus a per-(host, repo) lock as a local backstop.
Target invariants.
- Terminal failure ⇒ Cosmos failed + lease released now (no expiry wait); DLQ mirrors retries_exhausted for visibility.
- Crash / fence ⇒ reaper bumps leaseEpoch (fences zombie) ⇒ queued ⇒ change-feed re-dispatch (§5.3).
- Cleanup is explicit and idempotent: safe to re-run, never deletes a branch with unmerged work or a worktree with an in-flight job. (Checklist in §12.)
6. Per-product tenancy without product-partitioned queues
- Budgets: checked by the coordinator at assign time (it already reads fleet_budgets /productId in claimNextJob); unchanged, just moved to the dispatcher.
- Tokens (§12): the factory token still scopes productId + capabilities + factoryId. In the primary (coordinator-targeted) model the dispatcher only ever targets a factory the scheduler deemed eligible, and the coordinator re-checks the token on /fleet/claim, so least-privilege holds without relying on the subscription topology. (In the self-select fallback, scope it with per-product/per-token subscription filters instead.)
- Visibility: tracker-web keeps querying per product (state is still product-partitioned), so the UX is unchanged.
7. Alternatives considered
| Option | Verdict | Reason |
|---|---|---|
| A3 shared queue in Cosmos | ✗ | hot partition; cross-partition claim = more RU; loses tenancy isolation |
| A1 validate ownership only | partial | fixes "wrong factory" but not the RU/poll model or process-per-product |
| Storage Queue / SQS broker | ✗ (for now) | no subscription filters ⇒ queue-per-capability sprawl; weaker DLQ/visibility ergonomics |
| B2 change feed, no broker | viable | good for dispatch signal, but still needs a transport to reach factories; pairs naturally with B3 |
| Plain competing-consumers (drop scheduler) | ✗ | throws away weighted scoring + preemption + cost/affinity routing |
| B3 Service Bus + coordinator hybrid | ✓ chosen | zero idle RU, keeps scheduler + fencing, filters give capability/repo routing, paves path to B4 |
8. Phased migration
Steps are labelled M0–M3 to avoid collision with the roadmap's Phase 0–5 numbering; all of M0–M3 sit inside roadmap Phase 4. The ticked checklist is in §12.
M0 — RU quick win (no new infra, fully reversible) — do now
- Add a per-product queue_version / pending_count doc bumped on submit/stage change. The factory's loop does a 1-RU point-read of that doc each tick and only runs the expensive listJobs/claim when it changed.
- Raise POLL_SECONDS (e.g. 3 → 15–30) and add jittered backoff when idle.
- Expected: ~10–50× fewer claim queries at idle, behavior otherwise identical. Gate behind a flag; trivially revertible.
M1 — Stand up the broker in shadow
- Provision Service Bus (fleet-dispatch topic + subscriptions) with managed-identity auth (no connection-string keys in env/.env). Coordinator publishes messages in parallel with the existing claim path but factories still source work from Cosmos. Use the existing AQ_FLEET_SHADOW discipline: record divergence (did the broker route match the scorer's pick?) without acting on it.
M2 — Cutover delivery to the broker
- Flip a flag so factories source work from Service Bus + /fleet/claim for fencing; Cosmos poll path becomes the fallback only. Keep the reaper + lease fencing untouched. Validate exactly-once + crash recovery on multi-host.
M3 — On-demand factories (B4)
- KEDA / Container Apps scale-to-zero on subscription depth: spin a factory only when depth > 0; idle ⇒ zero running workers and zero RU. Warm-pool a single small instance if cold-start latency matters.
9. Risks & mitigations
| Risk | Mitigation |
|---|---|
| Dual source-of-truth (broker + Cosmos) drift | change-feed is the log (no separate outbox); SB duplicate-detection on MessageId=jobId; claim is a Cosmos CAS on leaseEpoch |
| Broker lock vs leaseEpoch confusion | explicit rule: broker lock = delivery, leaseEpoch = correctness; never merge (§5.2) |
| Long job > 5-min broker lock | complete-on-claim (reaper + change feed re-dispatch) or renewMessageLock on the lease cadence (§5.2) |
| Message > 256 KB | message carries jobId + routing props only; consumer reads body from Cosmos (§4.2) |
| Same-repo worktree contention across hosts | Service Bus sessions keyed by repo to serialize same-repo jobs |
| Lost scheduler features under FIFO | coordinator keeps assignment; broker only transports targeted messages |
| Token scope leak in shared subscriptions | per-factory subscription + correlation filter; coordinator re-checks the §12 token on claim |
| Secrets in env (.env keys) | managed identity for Service Bus + Cosmos; no connection-string keys committed |
| Blind operation | emit metrics: subscription depth, dispatch lag, claim-conflict (409) rate, DLQ count, change-feed lag — wire to existing monitoring |
| Migration regressions | M1 shadow measures divergence before any cutover; all flag-gated |
10. Open questions
- Per-factory subscription scale. The chosen coordinator-targeted model uses one subscription per factory (correlation filter on targetFactoryId). Service Bus allows up to 2,000 subscriptions/topic, so this scales for realistic fleets. If factory churn is high, fall back to a single subscription with a per-consumer targetFactoryId SQL filter.
- Where does the dispatcher run? A new lightweight loop in platform-service vs a separate worker. A change-feed lease container is required either way; a single active dispatcher (leader-elected) avoids double-publish.
- Cost envelope: Service Bus tier (Standard vs Premium). Standard is likely sufficient, since it already covers topics, sessions, and duplicate detection; Premium only if larger messages or VNet isolation are needed. Confirm against expected message volume.
- Do we keep the Cosmos poll path permanently as an offline/degraded fallback (like today's AQ_FLEET_ROUTE=0)? Recommend yes.
- Repo advertisement. How does a factory tell the coordinator which repos it has locally (for the repo:<name> capability)? Extend the heartbeat payload with a repos[] list, or derive from AQ_FLEET_REPO_BASE.
11. Appendix — idle RU cost sketch (today vs M0 vs target)
| Model | Claim/work-find ops at idle (4 factories) | Notes |
|---|---|---|
| Today (poll 3s) | ~115k/day full-partition listJobs | scales with partition size; ~4 × 28.8k |
| M0 (poll 15–30s + gate) | ~12–23k/day 1-RU point-reads + ~0 full scans | full scan only when the gate doc changes |
| Target (B3) | ~0 | long-poll receive, no RU; full scan never on the hot path |
Figures are order-of-magnitude to frame the decision, not a billing estimate. A full-partition listJobs costs many RU and grows with partition size; a point-read is ~1 RU and flat. The point: idle cost goes from "linear in partition size, forever" to "≈ zero".
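The operation counts in the table follow from simple arithmetic, reproduced here so the figures can be rechecked for other fleet sizes or poll intervals. The function name is hypothetical; the inputs (4 factories, 3 s today, 15–30 s under M0) are the ones used in this doc.

```python
SECONDS_PER_DAY = 86_400


def idle_ops_per_day(factories: int, poll_seconds: float) -> int:
    """Work-find operations per day when every factory polls once per interval."""
    return round(factories * SECONDS_PER_DAY / poll_seconds)


today = idle_ops_per_day(4, 3)      # 115200 full-partition scans
m0_fast = idle_ops_per_day(4, 15)   # 23040 point reads
m0_slow = idle_ops_per_day(4, 30)   # 11520 point reads
print(today, m0_fast, m0_slow)
```

Under M0 the count drops 5–10× from the longer interval alone; the larger saving is that each remaining operation is a flat point read rather than a scan that grows with the partition.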
12. Roadmap & checklist (roadmap Phase 4)
Acceptance gate for the whole effort: idle work-find RU ≈ 0, the "wrong-factory / ineligible-capability" stranding is gone, exactly-once assignment + crash recovery still hold on multi-host, and every step is flag-gated + reversible.
M0 — RU quick win (no new infra)
- Add per-product queue_version / pending_count doc; bump on submit + stage change.
- Factory loop point-reads the gate each tick; run listJobs/claim only when it changed.
- Raise POLL_SECONDS default (3 → 15–30) + jittered idle backoff.
- Flag-gate (e.g. AQ_FLEET_GATE=1) with a clean off-switch.
- Verify: idle claim queries drop ~10–50×; functional behavior unchanged (selftest green).
Routing-model fix (lands with M0/M1)
- Add repo:<name> capability token; factories advertise local repos via heartbeat (repos[]).
- Scheduler matches on caps + repo; product becomes a tag, not the routing key.
- Fix tracker-web New-Job form: drop default capabilities="build", stop hardcoding mac-1/mac-2, derive factories/repos from live data.
- Add product→repo ownership validation (reject/route mismatches) as the A1 safety net.
M1 — Broker in shadow
- Provision Service Bus fleet-dispatch topic + per-factory subscriptions (managed identity, no keys).
- Change-feed dispatcher (leader-elected) tails fleet_jobs, runs scheduler, publishes targeted messages (MessageId=jobId, dup-detection on).
- Publish in shadow alongside the Cosmos claim path; record route divergence (no action taken).
- Verify: ≥ N hours shadow with broker-route == scorer-pick within tolerance.
M2 — Cutover delivery
- Factories consume from Service Bus; /fleet/accept does the Cosmos CAS claim + returns leaseEpoch.
- Implement complete-on-claim (reaper + change-feed re-dispatch owns liveness).
- Cosmos poll path retained as flag-gated fallback (AQ_FLEET_ROUTE=0).
- Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag.
- Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ failed/retries_exhausted mapping correct.
Error handling & cleanup (lands with M2) — see §5.5
- Add POST /fleet/fail so a failed job sets the coordinator stage + releases the lease immediately (no expiry wait); wire it into _finish_failure / fleet_quarantine.
- GC sweep (idempotent): delete merged aq/job/* branches, prune stale worktrees, sweep aq/wip/* after a job reaches a terminal/shipped stage.
- Prevent same-repo worktree clobber: Service Bus sessions keyed by repo + a per-(host, repo) local lock.
- Verify: failed jobs free their lease promptly; no orphaned worktrees/branches after N jobs; GC never deletes unmerged work or an in-flight worktree.
M3 — On-demand factories (scale-to-zero)
- KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers.
- Optional warm-pool (1 small instance) if cold-start latency matters.
- Verify: zero idle workers + zero idle RU; cold-start latency within target.
Docs to update on completion
- GIGAFACTORY_ROADMAP.md: tick Phase 4; correct the stale §0 progress table.
- GIGAFACTORY_SYSTEM_OVERVIEW.md: add the broker/dispatcher to the architecture + code map.
- common_plat docs/GIGAFACTORY/: mirror the backend/dispatcher changes.