docs(gigafactory): add error-handling & cleanup section + v3 review fixes

Adds §5.5 (lease-release-on-failure, branch/worktree GC, same-repo worktree clobber) with target invariants, plus a §12 checklist block. v3 review: unify targetFactoryId, reconcile §5.3 with complete-on-claim, align §6 token scoping with per-factory subscriptions, M0 wording. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 22:45:21 -07:00 · 2026-05-31 22:45:21 -07:00 · 9f24a7fdd0
commit 9f24a7fdd0
parent 8217a864e9
1 changed files with 71 additions and 12 deletions
--- a/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
+++ b/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
@ -14,6 +14,12 @@
 >   idempotency key (`MessageId = jobId`), renamed migration steps `M0–M3` to
 >   avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a
 >   ticked roadmap checklist + auth/observability notes.
 > - v3 (2026-05-31): added **§5.5 Error handling & cleanup** (current behavior +
 >   lease-release-on-failure, branch/worktree GC, same-repo worktree clobber).
 >   Review fixes: unified the field name to `targetFactoryId` (§5.1), reconciled
 >   §5.3 with the complete-on-claim model (broker is not the redelivery path),
 >   aligned §6 token scoping with per-factory subscriptions, and added the GC /
 >   `POST /fleet/fail` checklist block to §12.
 ---
@ -96,7 +102,7 @@ Two structural problems surfaced while running the local fleet against
   broker-owns-delivery hybrid** (B2 ⊕ B3): the coordinator decides *which
   factory* should run a job and pushes a **targeted** message; the broker
   handles transport, visibility timeout, retries, and dead-lettering.
-4. **Ship the cheap RU win first (B1) as Phase 0** — it is reversible, needs no
+4. **Ship the cheap RU win first (B1) as step M0** — it is reversible, needs no
   new infra, and de-risks the broker migration by removing the bleed while the
   bigger change is built and shadowed.
@ -166,7 +172,7 @@ no separate outbox container is needed for the primary design:
   event.
 2. A single **dispatcher** (coordinator process) tails the `fleet_jobs` change
   feed (via a lease container), runs the scheduler for each new/`queued` job,
-   stamps `assignedFactoryId` on the job (CAS), and **publishes** the targeted
+   stamps `targetFactoryId` on the job (CAS), and **publishes** the targeted
   Service Bus message.
 3. **Crash-safe & idempotent:** the change feed redelivers from the last
   checkpoint on dispatcher restart; Service Bus **duplicate detection** keyed on
@ -209,12 +215,21 @@ no separate outbox container is needed for the primary design:
 ### 5.3 Failure / retry / DLQ
- Engine failure / verify-fail ⇒ coordinator transitions `failed`; broker
+Assumes the recommended **complete-on-claim** model (§5.2): the broker message is
-  message completed (don't redeliver a logical failure).
+completed at claim, so the broker is **not** the redelivery path — re-dispatch is
- Crash / lease-expiry ⇒ broker redelivers after visibility timeout **and** the
+driven by Cosmos stage changes through the change feed (§5.1).
-  existing reaper reclaims the Cosmos lease (bumps epoch). The redelivered
+
-  message re-runs `/fleet/claim`, which fences the old holder.
+- **Logical failure** (engine error / verify-fail) ⇒ coordinator transitions
- Exhausted retries ⇒ broker DLQ + Cosmos `retries_exhausted`.
+  `failed` and **releases the lease immediately** (new `/fleet/fail`, see §5.5);
  no redelivery (a logical failure is terminal unless a retry policy applies).
 - **Retryable failure** ⇒ coordinator sets the job back to `queued` (attempts++,
  backoff) ⇒ change-feed re-dispatch to the next best factory.
 - **Crash / lease-expiry** ⇒ the **reaper** reclaims the Cosmos lease (bumps
  `leaseEpoch`, fencing the dead holder) and returns the job to `queued` ⇒
  change-feed re-dispatch. (With the alternative *renew-lock* model, broker
  redelivery is the trigger instead — pick one, not both.)
 - **Exhausted retries** ⇒ Cosmos `retries_exhausted`; mirror to the broker DLQ
  for visibility.
 ### 5.4 Routing model (the §1.1 fix)
@ -231,6 +246,43 @@ no separate outbox container is needed for the primary design:
  drop the bogus default `capabilities = "build"`, and stop hardcoding
  `mac-1/mac-2`.
 ### 5.5 Error handling & cleanup (worktrees, branches, leases)
 **Today (single-host, agent-queue.sh).** The worker already handles errors well:
 the stage machine routes `timeout`/`budget_exceeded`/`crash`/`verify_failed`/
 `capability_mismatch`/`no_engine` through `_finish_failure` (→ `failed/`, with a
 retry policy that requeues to `inbox/` with backoff); a `trap` writes a WIP
 checkpoint to `aq/wip/<job>` on **every** exit path; `recover_orphans` requeues
 dead-worker `building/` jobs; and a **FENCED** report (stale `leaseEpoch`)
 triggers `fleet_quarantine` → `failed/` that **never ships or merges**
 (split-brain guard). PR/merge cleanup: `.aq_pr.md` is removed before commit; the
 PR branch `aq/job/<jid>` is deleted on auto-merge (`--delete-branch`); the repo
 worktree is force-recreated at the next job for that repo.
 **Gaps this redesign must close.** These are real loose ends in the current code:
 1. **No client-side lease release on failure.** `_finish_failure` is
   fleet-agnostic, so a failed fleet job's lease only frees on **expiry** via the
   reaper — slow recovery. Target: a `POST /fleet/fail` (stage=`failed`/`queued`
   + release lease) so failure is reflected and the lease freed **immediately**.
 2. **Unbounded git artifacts.** `aq/wip/<job>` branches are never GC'd; worktrees
   are cleaned only on reuse; unmerged `aq/job/<jid>` branches accumulate on
   origin when auto-merge is off or blocked by branch protection. Target: a
   periodic **GC** sweep — delete merged `aq/job/*`, prune stale worktrees, and
   sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
 3. **Same-repo concurrency can clobber a worktree.** The per-repo worktree is
   force-recreated, so two same-repo jobs on one host collide. Target: **Service
   Bus sessions keyed by `repo`** (serialize same-repo work) plus a per-`(host,
   repo)` lock as a local backstop.
 **Target invariants.**
 - Terminal failure ⇒ Cosmos `failed` + lease released now (no expiry wait); DLQ
  mirrors `retries_exhausted` for visibility.
 - Crash / fence ⇒ reaper bumps `leaseEpoch` (fences zombie) ⇒ `queued` ⇒
  change-feed re-dispatch (§5.3).
 - Cleanup is **explicit and idempotent** — safe to re-run, never deletes a branch
  with unmerged work or a worktree with an in-flight job. (Checklist in §12.)
 ---
 ## 6. Per-product tenancy without product-partitioned queues
@ -239,10 +291,11 @@ no separate outbox container is needed for the primary design:
  `fleet_budgets /productId` in `claimNextJob`); unchanged, just moved to the
  dispatcher.
 - **Tokens (§12):** the factory token still scopes `productId + capabilities +
-  factoryId`; the coordinator enforces it on `/fleet/claim`. A factory may
+  factoryId`. In the primary (coordinator-targeted) model the dispatcher only
-  consume cross-product messages **only** for products its token covers — model
+  ever targets a factory the scheduler deemed eligible, and the coordinator
-  this as per-product (or per-token) subscriptions/filters so least-privilege is
+  **re-checks the token on `/fleet/claim`** — so least-privilege holds without
-  preserved.
+  relying on the subscription topology. (In the self-select fallback, scope it
  with per-product/per-token subscription filters instead.)
 - **Visibility:** `tracker-web` keeps querying per product (state is still
  product-partitioned), so the UX is unchanged.
@ -381,6 +434,12 @@ flag-gated + reversible.
 - [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag.
 - [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
 ### Error handling & cleanup (lands with M2) — see §5.5
 - [ ] Add `POST /fleet/fail` so a failed job sets the coordinator stage + **releases the lease immediately** (no expiry wait); wire it into `_finish_failure` / `fleet_quarantine`.
 - [ ] GC sweep (idempotent): delete merged `aq/job/*` branches, prune stale worktrees, sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
 - [ ] Prevent same-repo worktree clobber: Service Bus **sessions keyed by `repo`** + a per-`(host, repo)` local lock.
 - [ ] Verify: failed jobs free their lease promptly; no orphaned worktrees/branches after N jobs; GC never deletes unmerged work or an in-flight worktree.
 ### M3 — On-demand factories (scale-to-zero)
 - [ ] KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers.
 - [ ] Optional warm-pool (1 small instance) if cold-start latency matters.