bytelyst-devops-tools/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
saravanakumardb1 9f24a7fdd0 docs(gigafactory): add error-handling & cleanup section + v3 review fixes
Adds §5.5 (lease-release-on-failure, branch/worktree GC, same-repo worktree
clobber) with target invariants, plus a §12 checklist block. v3 review:
unify targetFactoryId, reconcile §5.3 with complete-on-claim, align §6 token
scoping with per-factory subscriptions, M0 wording.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 22:45:21 -07:00

452 lines
26 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Fleet Dispatch Redesign — Broker-Backed, On-Demand Factories
> Design proposal (no code yet). Companion to `GIGAFACTORY_SYSTEM_OVERVIEW.md`
> (what exists today) and `GIGAFACTORY_ROADMAP.md` (source-of-truth spec). This
> doc realizes roadmap **Phase 4** ("Message bus + autoscaling") and the
> routing-model cleanup that comes with it. Last reviewed: **2026-05-31**.
>
> **Review log**
> - v1 (2026-05-31): initial proposal.
> - v2 (2026-05-31): self-review pass — reconciled the routing model
> (coordinator-targeted as primary), fixed the Cosmos outbox transactionality
> claim (change feed *is* the log), constrained message size (jobId + routing
> props only), addressed long-job vs Service Bus 5-min lock, corrected the
> idempotency key (`MessageId = jobId`), renamed migration steps `M0M3` to
> avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a
> ticked roadmap checklist + auth/observability notes.
> - v3 (2026-05-31): added **§5.5 Error handling & cleanup** (current behavior +
> lease-release-on-failure, branch/worktree GC, same-repo worktree clobber).
> Review fixes: unified the field name to `targetFactoryId` (§5.1), reconciled
> §5.3 with the complete-on-claim model (broker is not the redelivery path),
> aligned §6 token scoping with per-factory subscriptions, and added the GC /
> `POST /fleet/fail` checklist block to §12.
---
## 1. Why this doc exists (the two smells)
Two structural problems surfaced while running the local fleet against
`tracker-web` + `platform-service`:
### 1.1 Product-as-queue is conflated with repo-as-work-target
- `fleet_jobs` is partitioned by **`/productId`**, and a factory is bound to a
single product via `AQ_PRODUCT_ID`. The job's **`repo`** is just a payload
field (the PR target). Routing uses `productId`; the repo is orthogonal.
- Consequence observed: a `learning_ai_notes` job submitted via the form was
filed under **`chronomind`** (because the form's Factory dropdown maps
`mac-2 → chronomind`), and would have opened a PR to the notes repo from a
"chronomind" factory. Nothing ties the product to the repo, and nothing
guarantees the chosen factory even has that repo checked out.
- The form (`dashboards/tracker-web/.../fleet/jobs/page.tsx`) hardcodes
`FLEET_FACTORIES = [mac-1→lysnrai, mac-2→chronomind]` and defaults
`capabilities = "build"` — a capability **no agent-queue factory ever
advertises** (`detect_capabilities` only emits `os:*`, `engine:*`, `node:*`,
`has:*`). So default UI submissions are unroutable to live factories.
### 1.2 Pull-poll daemons burn Cosmos RU to stay "ready"
- The run loop iterates every **`POLL_SECONDS=3`**; with `AQ_FLEET_ROUTE=1`
(default) each iteration calls `POST /fleet/claim`.
- `claimNextJob` runs `repo.listJobs({ productId })` — **reads every job doc in
the product partition, no stage filter, no limit** — on every claim, plus a
`getLease` point-read per active job when preemption is on.
- One process **per product** (`_start_fleet.sh` spawns 4) ⇒ ~`4 × (1/3s)` ≈
**115k claim queries/day at idle**, each scaling with partition size, billed
continuously whether or not work exists. The machine must also stay up running
the loop.
> **Root cause:** `productId` is doing double duty as *tenant/billing scope* and
> *work-routing queue*, and work discovery is a busy-poll against the state store.
---
## 2. Goals, non-goals, constraints
**Goals**
- Eliminate idle-poll RU cost; pay (near) zero when there is no work.
- Make a factory a **generic build worker** (host + capabilities + engines +
checked-out repos), not a product-bound process.
- Route work by what actually matters (**capabilities + repo**), while keeping
per-product **billing, budgets, visibility, and token scoping**.
- Preserve the existing **weighted scheduler** and **leaseEpoch fencing**
(exactly-once assignment, zombie-writer protection).
- Enable later **on-demand spawn** (scale-to-zero) without re-architecting.
**Non-goals (this phase)**
- Replacing Cosmos as the system of record for job/lease/event/budget **state**.
- Rewriting the scheduler's scoring math.
- Multi-region / cross-cloud dispatch.
**Hard constraints (ecosystem rules)**
- Every Cosmos doc keeps a `productId` (platform rule) — product stays a
first-class **tag**, even when it is no longer the routing key.
- Per-product budgets (`fleet_budgets /productId`), enrollment tokens (§12), and
the `tracker-web` per-product views must keep working.
- Changes must be flag-gated and reversible (match the existing
`AQ_FLEET` / `AQ_FLEET_ROUTE` / `AQ_FLEET_SHADOW` cutover discipline).
---
## 3. Decision summary
1. **Do NOT build A3 ("single shared queue") inside Cosmos.** A single logical
queue tempts a hot partition; scaling it forces a synthetic partition key and
a **cross-partition "find next job" query**, which *increases* RU — the
opposite of the goal. It also dissolves the per-product isolation the
platform's tenancy/budget/token model depends on.
2. **Get the shared-queue behavior from a real broker (B3), not from Cosmos.**
Adopt **Azure Service Bus** as the dispatch substrate. Cosmos remains
product-partitioned for **state**; the broker owns **delivery**.
3. **Keep the scheduler.** Use a **coordinator-owns-scheduling /
broker-owns-delivery hybrid** (B2 ⊕ B3): the coordinator decides *which
factory* should run a job and pushes a **targeted** message; the broker
handles transport, visibility timeout, retries, and dead-lettering.
4. **Ship the cheap RU win first (B1) as step M0** — it is reversible, needs no
new infra, and de-risks the broker migration by removing the bleed while the
bigger change is built and shadowed.
> Net: the shared-queue *experience* (generic workers, one work stream) comes
> from Service Bus topics/subscriptions; Cosmos stays `/productId`-partitioned
> for state, budgets, and visibility.
---
## 4. Target architecture
### 4.1 Components & ownership
| Concern | Owner (target) | Notes |
| --- | --- | --- |
| Job/lease/event/budget **state** | **Cosmos** (`/productId`, `/jobId` as today) | unchanged system of record |
| **Scheduling** (which factory) | **Coordinator** (platform-service) | existing weighted scorer + preemption |
| **Dispatch / delivery** | **Service Bus** | competing consumers, visibility timeout, DLQ |
| **Fencing** (zombie writers) | **Cosmos `leaseEpoch`** | broker visibility ≠ correctness boundary |
| Per-product billing/budgets/tokens | **Cosmos + coordinator** | enforced at submit + assign, not by partition |
| Control planes | `tracker-web`, `agent-queue` dashboard | unchanged REST surface |
### 4.2 Service Bus topology
- One **topic** `fleet-dispatch`.
- **Primary model — coordinator-targeted (preserves the scheduler):** the
coordinator picks the factory, then publishes a message stamped with
`targetFactoryId`. Each factory has its **own subscription** with a
**correlation filter** `targetFactoryId = '<me>'`. The broker does no policy —
it just delivers the scorer's decision. **This is the model the rest of this
doc assumes.**
- **Fallback model — self-select (only if the scheduler is disabled):**
capability/repo **SQL filters** on message application properties let consumers
self-match. Multi-valued `capabilities` do **not** filter cleanly as one
string, so encode each as a boolean property (`cap_os_mac=true`,
`repo_learning_ai_notes=true`) rather than `LIKE '%…%'`. Subscription filters
are why Service Bus beats Storage Queue / SQS (which can't filter → a
queue-per-class sprawl).
- **Messages stay small.** A message carries only
`{ jobId, productId, repo, caps, priority, targetFactoryId }`**not**
`bodyMd`/manifest. The consumer reads the full job from Cosmos by `jobId`.
(Service Bus max message is **256 KB** Standard / 1 MB Premium; job bodies can
approach that — reinforcing "broker = transport, Cosmos = state".)
- **DLQ** per subscription ⇒ maps onto `failed` / `retries_exhausted`.
- **Sessions** (optional) keyed by `repo` to serialize same-repo work and avoid
worktree/branch contention on one host.
### 4.3 Why this keeps the scheduler
A vanilla broker is FIFO competing-consumers and does **no** weighted scoring.
To preserve the existing scorer (`capabilityFit / affinity / load / costFit /
health / starvation`) + preemption + seat limits, the coordinator stays in the
decision path: it **selects the target factory** and publishes a message whose
filter routes it to *that* factory's subscription (or a per-factory
subscription). The broker is transport, not policy.
---
## 5. Key flows
### 5.1 Submit → dispatch (consistency)
The **Cosmos change feed on `fleet_jobs` is the durable, ordered event log**, so
no separate outbox container is needed for the primary design:
1. `submitJob` writes the `fleet_jobs` doc (`stage: queued`). That write *is* the
event.
2. A single **dispatcher** (coordinator process) tails the `fleet_jobs` change
feed (via a lease container), runs the scheduler for each new/`queued` job,
stamps `targetFactoryId` on the job (CAS), and **publishes** the targeted
Service Bus message.
3. **Crash-safe & idempotent:** the change feed redelivers from the last
checkpoint on dispatcher restart; Service Bus **duplicate detection** keyed on
**`MessageId = jobId`** collapses re-publishes. The consumer is idempotent
because the authoritative claim is a Cosmos CAS on `leaseEpoch` — a second
delivery is simply fenced (`leaseEpoch` is assigned *at claim*, so it is **not**
a valid dedup key for the message itself).
> A separate **transactional outbox** is only needed if you ever publish *inline*
> at submit instead of via the change feed. Cross-container writes are **not
> atomic** in Cosmos, so an outbox row would have to live in the **same container
> + same partition** as the job and be written with a **Cosmos transactional
> batch** — or, simpler, carried as an `outboxState` field on the job doc itself.
> The change-feed design avoids this entirely.
> Net effect: the per-factory busy-poll is replaced by one change-feed-driven
> dispatcher. Idle cost is event-driven, not a per-3s full-partition scan.
### 5.2 Deliver → claim → fence
1. Factory **receives** a message (long-poll/`receiveMessages`, no RU).
2. Factory calls `POST /fleet/claim` (or a lighter `/fleet/accept`) with
`{ jobId, factoryId }`. Coordinator does the **CAS lease** in Cosmos exactly
as today (`revUpdateJob` + `leaseEpoch` bump) and returns the new epoch.
409 ⇒ fenced ⇒ factory abandons the message (it goes back / to DLQ).
3. The **broker lock** governs redelivery (a dead consumer's message reappears);
the **Cosmos `leaseEpoch`** governs *correctness* (a zombie writer is rejected
on PATCH). Two distinct mechanisms — do not collapse them.
4. **Long-running jobs vs the broker lock.** Service Bus message lock max is
**5 minutes**; a coding job runs far longer. Two viable patterns:
- **(recommended) complete-on-claim:** complete the message immediately after
a successful Cosmos claim. The **Cosmos lease + reaper** then own liveness —
on crash the reaper sets the job back to `queued`, which is a change-feed
event that **re-dispatches** (§5.1). This decouples job runtime from the
5-min lock entirely.
- **renew-lock:** keep the message locked and call `renewMessageLock` on a
timer, reusing the existing `AQ_FLEET_LEASE_RENEW_SEC` cadence to renew
*both* the Cosmos lease and the broker lock. Simpler delivery semantics, but
couples runtime to the broker and risks redelivery storms on long jobs.
### 5.3 Failure / retry / DLQ
Assumes the recommended **complete-on-claim** model (§5.2): the broker message is
completed at claim, so the broker is **not** the redelivery path — re-dispatch is
driven by Cosmos stage changes through the change feed (§5.1).
- **Logical failure** (engine error / verify-fail) ⇒ coordinator transitions
`failed` and **releases the lease immediately** (new `/fleet/fail`, see §5.5);
no redelivery (a logical failure is terminal unless a retry policy applies).
- **Retryable failure** ⇒ coordinator sets the job back to `queued` (attempts++,
backoff) ⇒ change-feed re-dispatch to the next best factory.
- **Crash / lease-expiry** ⇒ the **reaper** reclaims the Cosmos lease (bumps
`leaseEpoch`, fencing the dead holder) and returns the job to `queued`
change-feed re-dispatch. (With the alternative *renew-lock* model, broker
redelivery is the trigger instead — pick one, not both.)
- **Exhausted retries** ⇒ Cosmos `retries_exhausted`; mirror to the broker DLQ
for visibility.
### 5.4 Routing model (the §1.1 fix)
- Job carries `repo` + required `capabilities` (real tokens: `os:*`, `engine:*`,
`has:git`, plus a new `repo:<name>` token).
- The **scheduler** does the matching: it picks among factories that advertise
those caps **and** have the repo locally (or can clone it), then targets the
winner (§4.2 primary model: message stamped `targetFactoryId`, delivered via
that factory's correlation-filtered subscription).
- **Product is a property/tag** used for billing/visibility and budget checks —
**not** the routing key. (In the self-select fallback, product/caps/repo become
subscription SQL filters instead.)
- Fix the `tracker-web` form in lockstep: derive factories/repos from live data,
drop the bogus default `capabilities = "build"`, and stop hardcoding
`mac-1/mac-2`.
### 5.5 Error handling & cleanup (worktrees, branches, leases)
**Today (single-host, agent-queue.sh).** The worker already handles errors well:
the stage machine routes `timeout`/`budget_exceeded`/`crash`/`verify_failed`/
`capability_mismatch`/`no_engine` through `_finish_failure` (→ `failed/`, with a
retry policy that requeues to `inbox/` with backoff); a `trap` writes a WIP
checkpoint to `aq/wip/<job>` on **every** exit path; `recover_orphans` requeues
dead-worker `building/` jobs; and a **FENCED** report (stale `leaseEpoch`)
triggers `fleet_quarantine``failed/` that **never ships or merges**
(split-brain guard). PR/merge cleanup: `.aq_pr.md` is removed before commit; the
PR branch `aq/job/<jid>` is deleted on auto-merge (`--delete-branch`); the repo
worktree is force-recreated at the next job for that repo.
**Gaps this redesign must close.** These are real loose ends in the current code:
1. **No client-side lease release on failure.** `_finish_failure` is
fleet-agnostic, so a failed fleet job's lease only frees on **expiry** via the
reaper — slow recovery. Target: a `POST /fleet/fail` (stage=`failed`/`queued`
+ release lease) so failure is reflected and the lease freed **immediately**.
2. **Unbounded git artifacts.** `aq/wip/<job>` branches are never GC'd; worktrees
are cleaned only on reuse; unmerged `aq/job/<jid>` branches accumulate on
origin when auto-merge is off or blocked by branch protection. Target: a
periodic **GC** sweep — delete merged `aq/job/*`, prune stale worktrees, and
sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
3. **Same-repo concurrency can clobber a worktree.** The per-repo worktree is
force-recreated, so two same-repo jobs on one host collide. Target: **Service
Bus sessions keyed by `repo`** (serialize same-repo work) plus a per-`(host,
repo)` lock as a local backstop.
**Target invariants.**
- Terminal failure ⇒ Cosmos `failed` + lease released now (no expiry wait); DLQ
mirrors `retries_exhausted` for visibility.
- Crash / fence ⇒ reaper bumps `leaseEpoch` (fences zombie) ⇒ `queued`
change-feed re-dispatch (§5.3).
- Cleanup is **explicit and idempotent** — safe to re-run, never deletes a branch
with unmerged work or a worktree with an in-flight job. (Checklist in §12.)
---
## 6. Per-product tenancy without product-partitioned queues
- **Budgets:** checked by the coordinator at **assign time** (it already reads
`fleet_budgets /productId` in `claimNextJob`); unchanged, just moved to the
dispatcher.
- **Tokens (§12):** the factory token still scopes `productId + capabilities +
factoryId`. In the primary (coordinator-targeted) model the dispatcher only
ever targets a factory the scheduler deemed eligible, and the coordinator
**re-checks the token on `/fleet/claim`** — so least-privilege holds without
relying on the subscription topology. (In the self-select fallback, scope it
with per-product/per-token subscription filters instead.)
- **Visibility:** `tracker-web` keeps querying per product (state is still
product-partitioned), so the UX is unchanged.
---
## 7. Alternatives considered
| Option | Verdict | Reason |
| --- | --- | --- |
| **A3 shared queue in Cosmos** | ✗ | hot partition; cross-partition claim = more RU; loses tenancy isolation |
| **A1 validate ownership only** | partial | fixes "wrong factory" but not the RU/poll model or process-per-product |
| **Storage Queue / SQS broker** | ✗ (for now) | no subscription filters ⇒ queue-per-capability sprawl; weaker DLQ/visibility ergonomics |
| **B2 change feed, no broker** | viable | good for dispatch signal, but still needs a transport to *reach* factories; pairs naturally with B3 |
| **Plain competing-consumers (drop scheduler)** | ✗ | throws away weighted scoring + preemption + cost/affinity routing |
| **B3 Service Bus + coordinator hybrid** | ✓ chosen | zero idle RU, keeps scheduler + fencing, filters give capability/repo routing, paves path to B4 |
---
## 8. Phased migration
> Steps are labelled **M0M3** to avoid collision with the roadmap's Phase 05
> numbering; all of M0M3 sit *inside* roadmap **Phase 4**. The ticked checklist
> is in §12.
### M0 — RU quick win (no new infra, fully reversible) — *do now*
- Add a per-product `queue_version`/`pending_count` doc bumped on submit/stage
change. The factory's loop does a **1-RU point-read** of that doc each tick and
only runs the expensive `listJobs`/claim when it changed.
- Raise `POLL_SECONDS` (e.g. 3 → 1530) and add jittered backoff when idle.
- Expected: **~1050× fewer claim queries at idle**, behavior otherwise
identical. Gate behind a flag; trivially revertible.
### M1 — Stand up the broker in **shadow**
- Provision Service Bus (`fleet-dispatch` topic + subscriptions) with
**managed-identity** auth (no connection-string keys in env/`.env`). Coordinator
publishes messages **in parallel** with the existing claim path but factories
still source work from Cosmos. Use the existing `AQ_FLEET_SHADOW` discipline:
record divergence (did the broker route match the scorer's pick?) without
acting on it.
### M2 — Cutover delivery to the broker
- Flip a flag so factories source work from Service Bus + `/fleet/claim` for
fencing; Cosmos poll path becomes the fallback only. Keep the reaper + lease
fencing untouched. Validate exactly-once + crash recovery on multi-host.
### M3 — On-demand factories (B4)
- KEDA / Container Apps scale-to-zero on subscription depth: spin a factory only
when depth > 0; idle ⇒ **zero** running workers and zero RU. Warm-pool a single
small instance if cold-start latency matters.
---
## 9. Risks & mitigations
| Risk | Mitigation |
| --- | --- |
| Dual source-of-truth (broker + Cosmos) drift | change-feed *is* the log (no separate outbox); SB duplicate-detection on `MessageId=jobId`; claim is a Cosmos CAS on `leaseEpoch` |
| Broker lock vs `leaseEpoch` confusion | explicit rule: broker lock = *delivery*, `leaseEpoch` = *correctness*; never merge (§5.2) |
| Long job > 5-min broker lock | **complete-on-claim** (reaper + change feed re-dispatch) or `renewMessageLock` on the lease cadence (§5.2) |
| Message > 256 KB | message carries `jobId` + routing props only; consumer reads body from Cosmos (§4.2) |
| Same-repo worktree contention across hosts | Service Bus **sessions** keyed by `repo` to serialize same-repo jobs |
| Lost scheduler features under FIFO | coordinator keeps assignment; broker only transports targeted messages |
| Token scope leak in shared subscriptions | per-factory subscription + correlation filter; coordinator re-checks the §12 token on claim |
| Secrets in env (`.env` keys) | **managed identity** for Service Bus + Cosmos; no connection-string keys committed |
| Blind operation | emit metrics: subscription depth, dispatch lag, claim-conflict (409) rate, DLQ count, change-feed lag — wire to existing monitoring |
| Migration regressions | M1 shadow measures divergence before any cutover; all flag-gated |
---
## 10. Open questions
1. **Per-factory subscription scale.** The chosen coordinator-targeted model uses
one subscription per factory (correlation filter on `targetFactoryId`). Service
Bus allows up to **2,000 subscriptions/topic**, so this scales for realistic
fleets. If factory churn is high, fall back to a single subscription with a
per-consumer `targetFactoryId` SQL filter.
2. **Where does the dispatcher run?** A new lightweight loop in platform-service
vs a separate worker. A change-feed lease container is required either way; a
single active dispatcher (leader-elected) avoids double-publish.
3. **Cost envelope:** Service Bus tier (Standard vs Premium). Standard likely
sufficient; Premium only if sessions/large messages/VNet are needed. Confirm
against expected message volume.
4. **Do we keep the Cosmos poll path permanently** as an offline/degraded
fallback (like today's `AQ_FLEET_ROUTE=0`)? Recommend yes.
5. **Repo advertisement.** How does a factory tell the coordinator which repos it
has locally (for the `repo:<name>` capability)? Extend the heartbeat payload
with a `repos[]` list, or derive from `AQ_FLEET_REPO_BASE`.
---
## 11. Appendix — idle RU cost sketch (today vs M0 vs target)
| Model | Claim/work-find ops at idle (4 factories) | Notes |
| --- | --- | --- |
| **Today** (poll 3s) | ~115k/day full-partition `listJobs` | scales with partition size; ~`4 × 28.8k` |
| **M0** (poll 1530s + gate) | ~1223k/day **1-RU point-reads** + ~0 full scans | full scan only when the gate doc changes |
| **Target (B3)** | ~0 | long-poll receive, no RU; full scan never on the hot path |
> Figures are order-of-magnitude to frame the decision, not a billing estimate.
> A full-partition `listJobs` costs many RU and grows with partition size; a
> point-read is ~1 RU and flat. The point: idle cost goes from "linear in
> partition size, forever" to "≈ zero".
---
## 12. Roadmap & checklist (roadmap Phase 4)
Acceptance gate for the whole effort: **idle work-find RU ≈ 0**, the
"wrong-factory / ineligible-capability" stranding is gone, exactly-once
assignment + crash recovery still hold on multi-host, and every step is
flag-gated + reversible.
### M0 — RU quick win (no new infra)
- [ ] Add per-product `queue_version`/`pending_count` doc; bump on submit + stage change.
- [ ] Factory loop point-reads the gate each tick; run `listJobs`/claim only when it changed.
- [ ] Raise `POLL_SECONDS` default (3 → 1530) + jittered idle backoff.
- [ ] Flag-gate (e.g. `AQ_FLEET_GATE=1`) with a clean off-switch.
- [ ] Verify: idle claim queries drop ~1050×; functional behavior unchanged (selftest green).
### Routing-model fix (lands with M0/M1)
- [ ] Add `repo:<name>` capability token; factories advertise local repos via heartbeat (`repos[]`).
- [ ] Scheduler matches on caps **+ repo**; product becomes a tag, not the routing key.
- [ ] Fix `tracker-web` New-Job form: drop default `capabilities="build"`, stop hardcoding `mac-1/mac-2`, derive factories/repos from live data.
- [ ] Add product→repo ownership validation (reject/route mismatches) — the A1 safety net.
### M1 — Broker in shadow
- [ ] Provision Service Bus `fleet-dispatch` topic + per-factory subscriptions (managed identity, no keys).
- [ ] Change-feed dispatcher (leader-elected) tails `fleet_jobs`, runs scheduler, publishes targeted messages (`MessageId=jobId`, dup-detection on).
- [ ] Publish in **shadow** alongside the Cosmos claim path; record route divergence (no action taken).
- [ ] Verify: ≥ N hours shadow with broker-route == scorer-pick within tolerance.
### M2 — Cutover delivery
- [ ] Factories consume from Service Bus; `/fleet/accept` does the Cosmos CAS claim + returns `leaseEpoch`.
- [ ] Implement **complete-on-claim** (reaper + change-feed re-dispatch owns liveness).
- [ ] Cosmos poll path retained as flag-gated fallback (`AQ_FLEET_ROUTE=0`).
- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag.
- [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
### Error handling & cleanup (lands with M2) — see §5.5
- [ ] Add `POST /fleet/fail` so a failed job sets the coordinator stage + **releases the lease immediately** (no expiry wait); wire it into `_finish_failure` / `fleet_quarantine`.
- [ ] GC sweep (idempotent): delete merged `aq/job/*` branches, prune stale worktrees, sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
- [ ] Prevent same-repo worktree clobber: Service Bus **sessions keyed by `repo`** + a per-`(host, repo)` local lock.
- [ ] Verify: failed jobs free their lease promptly; no orphaned worktrees/branches after N jobs; GC never deletes unmerged work or an in-flight worktree.
### M3 — On-demand factories (scale-to-zero)
- [ ] KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers.
- [ ] Optional warm-pool (1 small instance) if cold-start latency matters.
- [ ] Verify: zero idle workers + zero idle RU; cold-start latency within target.
### Docs to update on completion
- [ ] `GIGAFACTORY_ROADMAP.md` — tick Phase 4; correct the stale §0 progress table.
- [ ] `GIGAFACTORY_SYSTEM_OVERVIEW.md` — add the broker/dispatcher to the architecture + code map.
- [ ] common_plat `docs/GIGAFACTORY/` — mirror the backend/dispatcher changes.