Adds AQ_FLEET_GATE (default OFF): the run loop point-reads the cheap per-product queue version (GET /fleet/queue-state) and SKIPS the expensive /fleet/claim while the version is unchanged and it is not mid-drain, with a periodic safety backstop and fail-open-on-read-error so work is never stranded. Keeps POLL_SECONDS for local job responsiveness rather than raising it globally. selftest 39b covers the gate decisions; reconciles the M0 section of the dispatch redesign doc. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
516 lines
30 KiB
Markdown
516 lines
30 KiB
Markdown
# Fleet Dispatch Redesign — Broker-Backed, On-Demand Factories
|
||
|
||
> Design proposal (no code yet). Companion to `GIGAFACTORY_SYSTEM_OVERVIEW.md`
|
||
> (what exists today) and `GIGAFACTORY_ROADMAP.md` (source-of-truth spec). This
|
||
> doc realizes roadmap **Phase 4** ("Message bus + autoscaling") and the
|
||
> routing-model cleanup that comes with it. Last reviewed: **2026-05-31**.
|
||
>
|
||
> **Review log**
|
||
> - v1 (2026-05-31): initial proposal.
|
||
> - v2 (2026-05-31): self-review pass — reconciled the routing model
|
||
> (coordinator-targeted as primary), fixed the Cosmos outbox transactionality
|
||
> claim (change feed *is* the log), constrained message size (jobId + routing
|
||
> props only), addressed long-job vs Service Bus 5-min lock, corrected the
|
||
> idempotency key (`MessageId = jobId`), renamed migration steps `M0–M3` to
|
||
> avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a
|
||
> ticked roadmap checklist + auth/observability notes.
|
||
> - v3 (2026-05-31): added **§5.5 Error handling & cleanup** (current behavior +
|
||
> lease-release-on-failure, branch/worktree GC, same-repo worktree clobber).
|
||
> Review fixes: unified the field name to `targetFactoryId` (§5.1), reconciled
|
||
> §5.3 with the complete-on-claim model (broker is not the redelivery path),
|
||
> aligned §6 token scoping with per-factory subscriptions, and added the GC /
|
||
> `POST /fleet/fail` checklist block to §12.
|
||
> - v4 (2026-05-31): **coverage audit** — roadmap now maps 1:1 to the design via a
|
||
> coverage matrix. Closed plan gaps: **M-prep** (decisions/§10 + schema +
|
||
> containers + RBAC), correlation filter + dispatcher budget enforcement (M1),
|
||
> small-messages/body-from-Cosmos + token re-check + alerting (M2), and new
|
||
> **Testing** and **Rollback & flags** blocks. No design element is now without
|
||
> an implementation step.
|
||
> - v5 (2026-05-31): **M0 implemented + shipped** (`fleet_queue_state` + bump
|
||
> hooks + `GET /fleet/queue-state` in common_plat; `AQ_FLEET_GATE` gate-skip in
|
||
> agent-queue). Reconciled M0 to the as-built approach (gate the *claim*; keep
|
||
> `POLL_SECONDS` for local responsiveness rather than raising it globally) and
|
||
> ticked the M0 checklist. Backend vitest + gate logic verified.
|
||
|
||
---
|
||
|
||
## 1. Why this doc exists (the two smells)
|
||
|
||
Two structural problems surfaced while running the local fleet against
|
||
`tracker-web` + `platform-service`:
|
||
|
||
### 1.1 Product-as-queue is conflated with repo-as-work-target
|
||
|
||
- `fleet_jobs` is partitioned by **`/productId`**, and a factory is bound to a
|
||
single product via `AQ_PRODUCT_ID`. The job's **`repo`** is just a payload
|
||
field (the PR target). Routing uses `productId`; the repo is orthogonal.
|
||
- Consequence observed: a `learning_ai_notes` job submitted via the form was
|
||
filed under **`chronomind`** (because the form's Factory dropdown maps
|
||
`mac-2 → chronomind`), and would have opened a PR to the notes repo from a
|
||
"chronomind" factory. Nothing ties the product to the repo, and nothing
|
||
guarantees the chosen factory even has that repo checked out.
|
||
- The form (`dashboards/tracker-web/.../fleet/jobs/page.tsx`) hardcodes
|
||
`FLEET_FACTORIES = [mac-1→lysnrai, mac-2→chronomind]` and defaults
|
||
`capabilities = "build"` — a capability **no agent-queue factory ever
|
||
advertises** (`detect_capabilities` only emits `os:*`, `engine:*`, `node:*`,
|
||
`has:*`). So default UI submissions are unroutable to live factories.
|
||
|
||
### 1.2 Pull-poll daemons burn Cosmos RU to stay "ready"
|
||
|
||
- The run loop iterates every **`POLL_SECONDS=3`**; with `AQ_FLEET_ROUTE=1`
|
||
(default) each iteration calls `POST /fleet/claim`.
|
||
- `claimNextJob` runs `repo.listJobs({ productId })` — **reads every job doc in
|
||
the product partition, no stage filter, no limit** — on every claim, plus a
|
||
`getLease` point-read per active job when preemption is on.
|
||
- One process **per product** (`_start_fleet.sh` spawns 4) ⇒ ~`4 × (1/3s)` ≈
|
||
**115k claim queries/day at idle**, each scaling with partition size, billed
|
||
continuously whether or not work exists. The machine must also stay up running
|
||
the loop.
|
||
|
||
> **Root cause:** `productId` is doing double duty as *tenant/billing scope* and
|
||
> *work-routing queue*, and work discovery is a busy-poll against the state store.
|
||
|
||
---
|
||
|
||
## 2. Goals, non-goals, constraints
|
||
|
||
**Goals**
|
||
- Eliminate idle-poll RU cost; pay (near) zero when there is no work.
|
||
- Make a factory a **generic build worker** (host + capabilities + engines +
|
||
checked-out repos), not a product-bound process.
|
||
- Route work by what actually matters (**capabilities + repo**), while keeping
|
||
per-product **billing, budgets, visibility, and token scoping**.
|
||
- Preserve the existing **weighted scheduler** and **leaseEpoch fencing**
|
||
(exactly-once assignment, zombie-writer protection).
|
||
- Enable later **on-demand spawn** (scale-to-zero) without re-architecting.
|
||
|
||
**Non-goals (this phase)**
|
||
- Replacing Cosmos as the system of record for job/lease/event/budget **state**.
|
||
- Rewriting the scheduler's scoring math.
|
||
- Multi-region / cross-cloud dispatch.
|
||
|
||
**Hard constraints (ecosystem rules)**
|
||
- Every Cosmos doc keeps a `productId` (platform rule) — product stays a
|
||
first-class **tag**, even when it is no longer the routing key.
|
||
- Per-product budgets (`fleet_budgets /productId`), enrollment tokens (§12), and
|
||
the `tracker-web` per-product views must keep working.
|
||
- Changes must be flag-gated and reversible (match the existing
|
||
`AQ_FLEET` / `AQ_FLEET_ROUTE` / `AQ_FLEET_SHADOW` cutover discipline).
|
||
|
||
---
|
||
|
||
## 3. Decision summary
|
||
|
||
1. **Do NOT build A3 ("single shared queue") inside Cosmos.** A single logical
|
||
queue tempts a hot partition; scaling it forces a synthetic partition key and
|
||
a **cross-partition "find next job" query**, which *increases* RU — the
|
||
opposite of the goal. It also dissolves the per-product isolation the
|
||
platform's tenancy/budget/token model depends on.
|
||
2. **Get the shared-queue behavior from a real broker (B3), not from Cosmos.**
|
||
Adopt **Azure Service Bus** as the dispatch substrate. Cosmos remains
|
||
product-partitioned for **state**; the broker owns **delivery**.
|
||
3. **Keep the scheduler.** Use a **coordinator-owns-scheduling /
|
||
broker-owns-delivery hybrid** (B2 ⊕ B3): the coordinator decides *which
|
||
factory* should run a job and pushes a **targeted** message; the broker
|
||
handles transport, visibility timeout, retries, and dead-lettering.
|
||
4. **Ship the cheap RU win first (B1) as step M0** — it is reversible, needs no
|
||
new infra, and de-risks the broker migration by removing the bleed while the
|
||
bigger change is built and shadowed.
|
||
|
||
> Net: the shared-queue *experience* (generic workers, one work stream) comes
|
||
> from Service Bus topics/subscriptions; Cosmos stays `/productId`-partitioned
|
||
> for state, budgets, and visibility.
|
||
|
||
---
|
||
|
||
## 4. Target architecture
|
||
|
||
### 4.1 Components & ownership
|
||
|
||
| Concern | Owner (target) | Notes |
|
||
| --- | --- | --- |
|
||
| Job/lease/event/budget **state** | **Cosmos** (`/productId`, `/jobId` as today) | unchanged system of record |
|
||
| **Scheduling** (which factory) | **Coordinator** (platform-service) | existing weighted scorer + preemption |
|
||
| **Dispatch / delivery** | **Service Bus** | competing consumers, visibility timeout, DLQ |
|
||
| **Fencing** (zombie writers) | **Cosmos `leaseEpoch`** | broker visibility ≠ correctness boundary |
|
||
| Per-product billing/budgets/tokens | **Cosmos + coordinator** | enforced at submit + assign, not by partition |
|
||
| Control planes | `tracker-web`, `agent-queue` dashboard | unchanged REST surface |
|
||
|
||
### 4.2 Service Bus topology
|
||
|
||
- One **topic** `fleet-dispatch`.
|
||
- **Primary model — coordinator-targeted (preserves the scheduler):** the
|
||
coordinator picks the factory, then publishes a message stamped with
|
||
`targetFactoryId`. Each factory has its **own subscription** with a
|
||
**correlation filter** `targetFactoryId = '<me>'`. The broker does no policy —
|
||
it just delivers the scorer's decision. **This is the model the rest of this
|
||
doc assumes.**
|
||
- **Fallback model — self-select (only if the scheduler is disabled):**
|
||
capability/repo **SQL filters** on message application properties let consumers
|
||
self-match. Multi-valued `capabilities` do **not** filter cleanly as one
|
||
string, so encode each as a boolean property (`cap_os_mac=true`,
|
||
`repo_learning_ai_notes=true`) rather than `LIKE '%…%'`. Subscription filters
|
||
are why Service Bus beats Storage Queue / SQS (which can't filter → a
|
||
queue-per-class sprawl).
|
||
- **Messages stay small.** A message carries only
|
||
`{ jobId, productId, repo, caps, priority, targetFactoryId }` — **not**
|
||
`bodyMd`/manifest. The consumer reads the full job from Cosmos by `jobId`.
|
||
(Service Bus max message is **256 KB** Standard / 1 MB Premium; job bodies can
|
||
approach that — reinforcing "broker = transport, Cosmos = state".)
|
||
- **DLQ** per subscription ⇒ maps onto `failed` / `retries_exhausted`.
|
||
- **Sessions** (optional) keyed by `repo` to serialize same-repo work and avoid
|
||
worktree/branch contention on one host.
|
||
|
||
### 4.3 Why this keeps the scheduler
|
||
|
||
A vanilla broker is FIFO competing-consumers and does **no** weighted scoring.
|
||
To preserve the existing scorer (`capabilityFit / affinity / load / costFit /
|
||
health / starvation`) + preemption + seat limits, the coordinator stays in the
|
||
decision path: it **selects the target factory** and publishes a message whose
|
||
filter routes it to *that* factory's subscription (or a per-factory
|
||
subscription). The broker is transport, not policy.
|
||
|
||
---
|
||
|
||
## 5. Key flows
|
||
|
||
### 5.1 Submit → dispatch (consistency)
|
||
|
||
The **Cosmos change feed on `fleet_jobs` is the durable, ordered event log**, so
|
||
no separate outbox container is needed for the primary design:
|
||
|
||
1. `submitJob` writes the `fleet_jobs` doc (`stage: queued`). That write *is* the
|
||
event.
|
||
2. A single **dispatcher** (coordinator process) tails the `fleet_jobs` change
|
||
feed (via a lease container), runs the scheduler for each new/`queued` job,
|
||
stamps `targetFactoryId` on the job (CAS), and **publishes** the targeted
|
||
Service Bus message.
|
||
3. **Crash-safe & idempotent:** the change feed redelivers from the last
|
||
checkpoint on dispatcher restart; Service Bus **duplicate detection** keyed on
|
||
**`MessageId = jobId`** collapses re-publishes. The consumer is idempotent
|
||
because the authoritative claim is a Cosmos CAS on `leaseEpoch` — a second
|
||
delivery is simply fenced (`leaseEpoch` is assigned *at claim*, so it is **not**
|
||
a valid dedup key for the message itself).
|
||
|
||
> A separate **transactional outbox** is only needed if you ever publish *inline*
|
||
> at submit instead of via the change feed. Cross-container writes are **not
|
||
> atomic** in Cosmos, so an outbox row would have to live in the **same container
|
||
> + same partition** as the job and be written with a **Cosmos transactional
|
||
> batch** — or, simpler, carried as an `outboxState` field on the job doc itself.
|
||
> The change-feed design avoids this entirely.
|
||
|
||
> Net effect: the per-factory busy-poll is replaced by one change-feed-driven
|
||
> dispatcher. Idle cost is event-driven, not a per-3s full-partition scan.
|
||
|
||
### 5.2 Deliver → claim → fence
|
||
|
||
1. Factory **receives** a message (long-poll/`receiveMessages`, no RU).
|
||
2. Factory calls `POST /fleet/claim` (or a lighter `/fleet/accept`) with
|
||
`{ jobId, factoryId }`. Coordinator does the **CAS lease** in Cosmos exactly
|
||
as today (`revUpdateJob` + `leaseEpoch` bump) and returns the new epoch.
|
||
409 ⇒ fenced ⇒ factory abandons the message (it goes back / to DLQ).
|
||
3. The **broker lock** governs redelivery (a dead consumer's message reappears);
|
||
the **Cosmos `leaseEpoch`** governs *correctness* (a zombie writer is rejected
|
||
on PATCH). Two distinct mechanisms — do not collapse them.
|
||
4. **Long-running jobs vs the broker lock.** Service Bus message lock max is
|
||
**5 minutes**; a coding job runs far longer. Two viable patterns:
|
||
- **(recommended) complete-on-claim:** complete the message immediately after
|
||
a successful Cosmos claim. The **Cosmos lease + reaper** then own liveness —
|
||
on crash the reaper sets the job back to `queued`, which is a change-feed
|
||
event that **re-dispatches** (§5.1). This decouples job runtime from the
|
||
5-min lock entirely.
|
||
- **renew-lock:** keep the message locked and call `renewMessageLock` on a
|
||
timer, reusing the existing `AQ_FLEET_LEASE_RENEW_SEC` cadence to renew
|
||
*both* the Cosmos lease and the broker lock. Simpler delivery semantics, but
|
||
couples runtime to the broker and risks redelivery storms on long jobs.
|
||
|
||
### 5.3 Failure / retry / DLQ
|
||
|
||
Assumes the recommended **complete-on-claim** model (§5.2): the broker message is
|
||
completed at claim, so the broker is **not** the redelivery path — re-dispatch is
|
||
driven by Cosmos stage changes through the change feed (§5.1).
|
||
|
||
- **Logical failure** (engine error / verify-fail) ⇒ coordinator transitions
|
||
`failed` and **releases the lease immediately** (new `/fleet/fail`, see §5.5);
|
||
no redelivery (a logical failure is terminal unless a retry policy applies).
|
||
- **Retryable failure** ⇒ coordinator sets the job back to `queued` (attempts++,
|
||
backoff) ⇒ change-feed re-dispatch to the next best factory.
|
||
- **Crash / lease-expiry** ⇒ the **reaper** reclaims the Cosmos lease (bumps
|
||
`leaseEpoch`, fencing the dead holder) and returns the job to `queued` ⇒
|
||
change-feed re-dispatch. (With the alternative *renew-lock* model, broker
|
||
redelivery is the trigger instead — pick one, not both.)
|
||
- **Exhausted retries** ⇒ Cosmos `retries_exhausted`; mirror to the broker DLQ
|
||
for visibility.
|
||
|
||
### 5.4 Routing model (the §1.1 fix)
|
||
|
||
- Job carries `repo` + required `capabilities` (real tokens: `os:*`, `engine:*`,
|
||
`has:git`, plus a new `repo:<name>` token).
|
||
- The **scheduler** does the matching: it picks among factories that advertise
|
||
those caps **and** have the repo locally (or can clone it), then targets the
|
||
winner (§4.2 primary model: message stamped `targetFactoryId`, delivered via
|
||
that factory's correlation-filtered subscription).
|
||
- **Product is a property/tag** used for billing/visibility and budget checks —
|
||
**not** the routing key. (In the self-select fallback, product/caps/repo become
|
||
subscription SQL filters instead.)
|
||
- Fix the `tracker-web` form in lockstep: derive factories/repos from live data,
|
||
drop the bogus default `capabilities = "build"`, and stop hardcoding
|
||
`mac-1/mac-2`.
|
||
|
||
### 5.5 Error handling & cleanup (worktrees, branches, leases)
|
||
|
||
**Today (single-host, agent-queue.sh).** The worker already handles errors well:
|
||
the stage machine routes `timeout`/`budget_exceeded`/`crash`/`verify_failed`/
|
||
`capability_mismatch`/`no_engine` through `_finish_failure` (→ `failed/`, with a
|
||
retry policy that requeues to `inbox/` with backoff); a `trap` writes a WIP
|
||
checkpoint to `aq/wip/<job>` on **every** exit path; `recover_orphans` requeues
|
||
dead-worker `building/` jobs; and a **FENCED** report (stale `leaseEpoch`)
|
||
triggers `fleet_quarantine` → `failed/` that **never ships or merges**
|
||
(split-brain guard). PR/merge cleanup: `.aq_pr.md` is removed before commit; the
|
||
PR branch `aq/job/<jid>` is deleted on auto-merge (`--delete-branch`); the repo
|
||
worktree is force-recreated at the next job for that repo.
|
||
|
||
**Gaps this redesign must close.** These are real loose ends in the current code:
|
||
|
||
1. **No client-side lease release on failure.** `_finish_failure` is
|
||
fleet-agnostic, so a failed fleet job's lease only frees on **expiry** via the
|
||
reaper — slow recovery. Target: a `POST /fleet/fail` (stage=`failed`/`queued`
|
||
+ release lease) so failure is reflected and the lease freed **immediately**.
|
||
2. **Unbounded git artifacts.** `aq/wip/<job>` branches are never GC'd; worktrees
|
||
are cleaned only on reuse; unmerged `aq/job/<jid>` branches accumulate on
|
||
origin when auto-merge is off or blocked by branch protection. Target: a
|
||
periodic **GC** sweep — delete merged `aq/job/*`, prune stale worktrees, and
|
||
sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
|
||
3. **Same-repo concurrency can clobber a worktree.** The per-repo worktree is
|
||
force-recreated, so two same-repo jobs on one host collide. Target: **Service
|
||
Bus sessions keyed by `repo`** (serialize same-repo work) plus a per-`(host,
|
||
repo)` lock as a local backstop.
|
||
|
||
**Target invariants.**
|
||
- Terminal failure ⇒ Cosmos `failed` + lease released now (no expiry wait); DLQ
|
||
mirrors `retries_exhausted` for visibility.
|
||
- Crash / fence ⇒ reaper bumps `leaseEpoch` (fences zombie) ⇒ `queued` ⇒
|
||
change-feed re-dispatch (§5.3).
|
||
- Cleanup is **explicit and idempotent** — safe to re-run, never deletes a branch
|
||
with unmerged work or a worktree with an in-flight job. (Checklist in §12.)
|
||
|
||
---
|
||
|
||
## 6. Per-product tenancy without product-partitioned queues
|
||
|
||
- **Budgets:** checked by the coordinator at **assign time** (it already reads
|
||
`fleet_budgets /productId` in `claimNextJob`); unchanged, just moved to the
|
||
dispatcher.
|
||
- **Tokens (§12):** the factory token still scopes `productId + capabilities +
|
||
factoryId`. In the primary (coordinator-targeted) model the dispatcher only
|
||
ever targets a factory the scheduler deemed eligible, and the coordinator
|
||
**re-checks the token on `/fleet/claim`** — so least-privilege holds without
|
||
relying on the subscription topology. (In the self-select fallback, scope it
|
||
with per-product/per-token subscription filters instead.)
|
||
- **Visibility:** `tracker-web` keeps querying per product (state is still
|
||
product-partitioned), so the UX is unchanged.
|
||
|
||
---
|
||
|
||
## 7. Alternatives considered
|
||
|
||
| Option | Verdict | Reason |
|
||
| --- | --- | --- |
|
||
| **A3 shared queue in Cosmos** | ✗ | hot partition; cross-partition claim = more RU; loses tenancy isolation |
|
||
| **A1 validate ownership only** | partial | fixes "wrong factory" but not the RU/poll model or process-per-product |
|
||
| **Storage Queue / SQS broker** | ✗ (for now) | no subscription filters ⇒ queue-per-capability sprawl; weaker DLQ/visibility ergonomics |
|
||
| **B2 change feed, no broker** | viable | good for dispatch signal, but still needs a transport to *reach* factories; pairs naturally with B3 |
|
||
| **Plain competing-consumers (drop scheduler)** | ✗ | throws away weighted scoring + preemption + cost/affinity routing |
|
||
| **B3 Service Bus + coordinator hybrid** | ✓ chosen | zero idle RU, keeps scheduler + fencing, filters give capability/repo routing, paves path to B4 |
|
||
|
||
---
|
||
|
||
## 8. Phased migration
|
||
|
||
> Steps are labelled **M0–M3** to avoid collision with the roadmap's Phase 0–5
|
||
> numbering; all of M0–M3 sit *inside* roadmap **Phase 4**. The ticked checklist
|
||
> is in §12.
|
||
|
||
### M0 — RU quick win (no new infra, fully reversible) — *IMPLEMENTED*
|
||
- Per-product `fleet_queue_state` doc holds a monotonic `version`, bumped on job
|
||
create + every stage change (centralized in the repo layer, best-effort).
|
||
- The factory run loop does a **~1-RU point-read** (`GET /fleet/queue-state`) and
|
||
**skips the expensive claim** while the version is unchanged and it is not
|
||
mid-drain — rather than raising `POLL_SECONDS` globally (which would slow local,
|
||
non-fleet job pickup). A periodic safety backstop + fail-open-on-read-error
|
||
guarantee work is never stranded.
|
||
- Gated behind **`AQ_FLEET_GATE=1`** (default OFF ⇒ byte-for-byte prior behavior).
|
||
- Expected: **~10–50× fewer claim queries at idle**, local responsiveness
|
||
unchanged.
|
||
- Code: common_plat `services/platform-service/src/modules/fleet/{types,repository,routes}.ts`
|
||
+ `lib/cosmos-init.ts`; `agent-queue/lib/fleet-client.sh` (`fleet_gate_*`) + the
|
||
run-loop hook in `agent-queue.sh`. Tests: fleet vitest (repo bump + endpoint) +
|
||
selftest `39b` (gate decisions).
|
||
|
||
### M1 — Stand up the broker in **shadow**
|
||
- Provision Service Bus (`fleet-dispatch` topic + subscriptions) with
|
||
**managed-identity** auth (no connection-string keys in env/`.env`). Coordinator
|
||
publishes messages **in parallel** with the existing claim path but factories
|
||
still source work from Cosmos. Use the existing `AQ_FLEET_SHADOW` discipline:
|
||
record divergence (did the broker route match the scorer's pick?) without
|
||
acting on it.
|
||
|
||
### M2 — Cutover delivery to the broker
|
||
- Flip a flag so factories source work from Service Bus + `/fleet/claim` for
|
||
fencing; Cosmos poll path becomes the fallback only. Keep the reaper + lease
|
||
fencing untouched. Validate exactly-once + crash recovery on multi-host.
|
||
|
||
### M3 — On-demand factories (B4)
|
||
- KEDA / Container Apps scale-to-zero on subscription depth: spin a factory only
|
||
when depth > 0; idle ⇒ **zero** running workers and zero RU. Warm-pool a single
|
||
small instance if cold-start latency matters.
|
||
|
||
---
|
||
|
||
## 9. Risks & mitigations
|
||
|
||
| Risk | Mitigation |
|
||
| --- | --- |
|
||
| Dual source-of-truth (broker + Cosmos) drift | change-feed *is* the log (no separate outbox); SB duplicate-detection on `MessageId=jobId`; claim is a Cosmos CAS on `leaseEpoch` |
|
||
| Broker lock vs `leaseEpoch` confusion | explicit rule: broker lock = *delivery*, `leaseEpoch` = *correctness*; never merge (§5.2) |
|
||
| Long job > 5-min broker lock | **complete-on-claim** (reaper + change feed re-dispatch) or `renewMessageLock` on the lease cadence (§5.2) |
|
||
| Message > 256 KB | message carries `jobId` + routing props only; consumer reads body from Cosmos (§4.2) |
|
||
| Same-repo worktree contention across hosts | Service Bus **sessions** keyed by `repo` to serialize same-repo jobs |
|
||
| Lost scheduler features under FIFO | coordinator keeps assignment; broker only transports targeted messages |
|
||
| Token scope leak in shared subscriptions | per-factory subscription + correlation filter; coordinator re-checks the §12 token on claim |
|
||
| Secrets in env (`.env` keys) | **managed identity** for Service Bus + Cosmos; no connection-string keys committed |
|
||
| Blind operation | emit metrics: subscription depth, dispatch lag, claim-conflict (409) rate, DLQ count, change-feed lag — wire to existing monitoring |
|
||
| Migration regressions | M1 shadow measures divergence before any cutover; all flag-gated |
|
||
|
||
---
|
||
|
||
## 10. Open questions
|
||
|
||
1. **Per-factory subscription scale.** The chosen coordinator-targeted model uses
|
||
one subscription per factory (correlation filter on `targetFactoryId`). Service
|
||
Bus allows up to **2,000 subscriptions/topic**, so this scales for realistic
|
||
fleets. If factory churn is high, fall back to a single subscription with a
|
||
per-consumer `targetFactoryId` SQL filter.
|
||
2. **Where does the dispatcher run?** A new lightweight loop in platform-service
|
||
vs a separate worker. A change-feed lease container is required either way; a
|
||
single active dispatcher (leader-elected) avoids double-publish.
|
||
3. **Cost envelope:** Service Bus tier (Standard vs Premium). Standard likely
|
||
sufficient; Premium only if sessions/large messages/VNet are needed. Confirm
|
||
against expected message volume.
|
||
4. **Do we keep the Cosmos poll path permanently** as an offline/degraded
|
||
fallback (like today's `AQ_FLEET_ROUTE=0`)? Recommend yes.
|
||
5. **Repo advertisement.** How does a factory tell the coordinator which repos it
|
||
has locally (for the `repo:<name>` capability)? Extend the heartbeat payload
|
||
with a `repos[]` list, or derive from `AQ_FLEET_REPO_BASE`.
|
||
|
||
---
|
||
|
||
## 11. Appendix — idle RU cost sketch (today vs M0 vs target)
|
||
|
||
| Model | Claim/work-find ops at idle (4 factories) | Notes |
|
||
| --- | --- | --- |
|
||
| **Today** (poll 3s) | ~115k/day full-partition `listJobs` | scales with partition size; ~`4 × 28.8k` |
|
||
| **M0** (poll 15–30s + gate) | ~12–23k/day **1-RU point-reads** + ~0 full scans | full scan only when the gate doc changes |
|
||
| **Target (B3)** | ~0 | long-poll receive, no RU; full scan never on the hot path |
|
||
|
||
> Figures are order-of-magnitude to frame the decision, not a billing estimate.
|
||
> A full-partition `listJobs` costs many RU and grows with partition size; a
|
||
> point-read is ~1 RU and flat. The point: idle cost goes from "linear in
|
||
> partition size, forever" to "≈ zero".
|
||
|
||
---
|
||
|
||
## 12. Roadmap & checklist (roadmap Phase 4)
|
||
|
||
Acceptance gate for the whole effort: **idle work-find RU ≈ 0**, the
|
||
"wrong-factory / ineligible-capability" stranding is gone, exactly-once
|
||
assignment + crash recovery still hold on multi-host, and every step is
|
||
flag-gated + reversible.
|
||
|
||
### Coverage matrix (design → plan)
|
||
|
||
Every design element maps to a checklist block below — no design decision is left
|
||
without an implementation step.
|
||
|
||
| Design element | §ref | Plan block |
|
||
| --- | --- | --- |
|
||
| Idle-poll RU bleed | §1.2 | M0 |
|
||
| Product-as-queue / wrong-factory | §1.1, §5.4 | Routing-model fix |
|
||
| Open questions / decisions | §10 | M-prep |
|
||
| Schema, containers, RBAC | §4, §5, §6 | M-prep |
|
||
| Service Bus topic + subscriptions + filters | §4.2 | M-prep, M1 |
|
||
| Change-feed dispatcher + scheduler + targeting | §4.3, §5.1 | M1 |
|
||
| Budget enforcement at assign | §6 | M1 |
|
||
| Claim/fence + complete-on-claim | §5.2 | M2 |
|
||
| Small messages (body from Cosmos) | §4.2 | M2 |
|
||
| Token re-check at claim | §6 | M2 |
|
||
| Metrics + alerting | §9 | M2 |
|
||
| Failure→lease release, GC, same-repo clobber | §5.5 | Error handling & cleanup |
|
||
| Scale-to-zero on-demand | §3, §5.1 | M3 |
|
||
| Tests (dispatcher, CAS/fencing, GC, shadow) | §9 | Testing |
|
||
| Rollback / flags per phase | §3 | Rollback & flags |
|
||
| Doc updates | — | Docs to update |
|
||
|
||
### M-prep — Decisions & schema (closes §10; before M1)
|
||
- [ ] Lock dispatcher placement (platform-service loop vs separate worker) + **leader election** so a single active dispatcher avoids double-publish (§10 Q2).
|
||
- [ ] Lock Service Bus tier (Standard default; Premium only for sessions / large messages / VNet) (§10 Q3).
|
||
- [ ] Lock subscription model (per-factory correlation filter default; single-subscription SQL filter if factory churn is high) (§10 Q1).
|
||
- [ ] Confirm the Cosmos poll path stays as a **permanent** flag-gated fallback (`AQ_FLEET_ROUTE=0`) (§10 Q4).
|
||
- [ ] Confirm repo-advertisement source: `repos[]` in the heartbeat, derived from `AQ_FLEET_REPO_BASE` (§10 Q5).
|
||
- [ ] Schema: add `targetFactoryId` to `FleetJobDoc`, `repos[]` to `FleetFactoryDoc`; register a new `fleet_queue_state` doc (`/productId`) for the M0 gate; provision the change-feed **lease container**; update the container registry / `COSMOS_AUTO_INIT`.
|
||
- [ ] RBAC via managed identity: dispatcher = Service Bus **Sender**, factories = **Listener** on their own subscription; no shared keys committed.
|
||
|
||
### M0 — RU quick win (no new infra) — ✅ DONE
|
||
- [x] Add per-product `fleet_queue_state` doc; bump on create + every stage change (repo layer).
|
||
- [x] Factory loop point-reads the gate each tick; run the claim only when it changed / mid-drain / safety interval.
|
||
- [x] Keep `POLL_SECONDS` for local responsiveness; gate the *claim*, with a periodic safety backstop + fail-open (instead of raising the global poll interval).
|
||
- [x] Flag-gate `AQ_FLEET_GATE=1` (default OFF) with a clean off-switch.
|
||
- [x] Tests: fleet vitest (repo bump + `GET /fleet/queue-state`) + selftest `39b` (gate decisions) green; gate logic verified standalone.
|
||
|
||
### Routing-model fix (lands with M0/M1)
|
||
- [ ] Add `repo:<name>` capability token; factories advertise local repos via heartbeat (`repos[]`).
|
||
- [ ] Scheduler matches on caps **+ repo**; product becomes a tag, not the routing key.
|
||
- [ ] Fix `tracker-web` New-Job form: drop default `capabilities="build"`, stop hardcoding `mac-1/mac-2`, derive factories/repos from live data.
|
||
- [ ] Add product→repo ownership validation (reject/route mismatches) — the A1 safety net.
|
||
|
||
### M1 — Broker in shadow
|
||
- [ ] Provision Service Bus `fleet-dispatch` topic + per-factory subscriptions, each with a **correlation filter** `targetFactoryId='<id>'` (managed identity, no keys).
|
||
- [ ] Change-feed dispatcher (leader-elected) tails `fleet_jobs`, runs scheduler, stamps `targetFactoryId` (CAS), publishes targeted messages (`MessageId=jobId`, dup-detection on).
|
||
- [ ] Dispatcher enforces per-product **budget** (paused / ceiling) before publishing (relocates the `claimNextJob` budget check, §6).
|
||
- [ ] Publish in **shadow** alongside the Cosmos claim path; record route divergence (no action taken).
|
||
- [ ] Verify: ≥ N hours shadow with broker-route == scorer-pick within tolerance.
|
||
|
||
### M2 — Cutover delivery
|
||
- [ ] Factories consume from Service Bus; `/fleet/accept` does the Cosmos CAS claim + returns `leaseEpoch`.
|
||
- [ ] Messages carry `{jobId, productId, repo, caps, priority, targetFactoryId}` only; the consumer **reads the full job body from Cosmos** by `jobId` (256 KB limit, §4.2).
|
||
- [ ] `/fleet/accept` (and `/fleet/claim`) **re-checks the §12 factory token** (productId + caps + factoryId) before granting the lease.
|
||
- [ ] Implement **complete-on-claim** (reaper + change-feed re-dispatch owns liveness).
|
||
- [ ] Cosmos poll path retained as flag-gated fallback (`AQ_FLEET_ROUTE=0`).
|
||
- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag — **and wire alerts** (DLQ depth > 0, dispatch lag > threshold) into existing monitoring.
|
||
- [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
|
||
|
||
### Error handling & cleanup (lands with M2) — see §5.5
|
||
- [ ] Add `POST /fleet/fail` so a failed job sets the coordinator stage + **releases the lease immediately** (no expiry wait); wire it into `_finish_failure` / `fleet_quarantine`.
|
||
- [ ] GC sweep (idempotent): delete merged `aq/job/*` branches, prune stale worktrees, sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
|
||
- [ ] Prevent same-repo worktree clobber: Service Bus **sessions keyed by `repo`** + a per-`(host, repo)` local lock.
|
||
- [ ] Verify: failed jobs free their lease promptly; no orphaned worktrees/branches after N jobs; GC never deletes unmerged work or an in-flight worktree.
|
||
|
||
### M3 — On-demand factories (scale-to-zero)
|
||
- [ ] KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers.
|
||
- [ ] Optional warm-pool (1 small instance) if cold-start latency matters.
|
||
- [ ] Verify: zero idle workers + zero idle RU; cold-start latency within target.
|
||
|
||
### Testing (every phase — tests are sacred)
|
||
- [ ] Unit: dispatcher scheduling + publish, claim CAS + `leaseEpoch` fencing, `/fleet/fail`, GC idempotency, the M0 gate read/skip logic.
|
||
- [ ] Integration: shadow-divergence harness (M1), exactly-once + crash recovery (M2), scale-to-zero behavior (M3).
|
||
- [ ] Extend `agent-queue/selftest.sh` + platform-service `vitest`; **CI green is the gate** to advance each phase.
|
||
|
||
### Rollback & flags (per phase)
|
||
- [ ] Each phase ships behind a flag with a documented one-line rollback: M0 `AQ_FLEET_GATE`, M1 shadow (publishes but never acts), M2 `AQ_FLEET_ROUTE` / broker-source toggle, M3 scaler disable.
|
||
- [ ] Verify each rollback returns to the prior working path with **no data loss** and no stranded leases/messages.
|
||
|
||
### Docs to update on completion
|
||
- [ ] `GIGAFACTORY_ROADMAP.md` — tick Phase 4; correct the stale §0 progress table.
|
||
- [ ] `GIGAFACTORY_SYSTEM_OVERVIEW.md` — add the broker/dispatcher to the architecture + code map.
|
||
- [ ] common_plat `docs/GIGAFACTORY/` — mirror the backend/dispatcher changes.
|