From 9f24a7fdd01b48c5bf58fb84d5b1c593e1bb10f2 Mon Sep 17 00:00:00 2001
From: saravanakumardb1 <saravanakumardb1@users.noreply.github.com>
Date: Sun, 31 May 2026 22:45:21 -0700
Subject: [PATCH] docs(gigafactory): add error-handling & cleanup section + v3
 review fixes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds §5.5 (lease-release-on-failure, branch/worktree GC, same-repo worktree
clobber) with target invariants, plus a §12 checklist block. v3 review:
unify targetFactoryId, reconcile §5.3 with complete-on-claim, align §6 token
scoping with per-factory subscriptions, M0 wording.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
---
 .../GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md    | 83 ++++++++++++++++---
 1 file changed, 71 insertions(+), 12 deletions(-)

diff --git a/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md b/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
index f430972..5cf858d 100644
--- a/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
+++ b/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md
@@ -14,6 +14,12 @@
 >   idempotency key (`MessageId = jobId`), renamed migration steps `M0–M3` to
 >   avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a
 >   ticked roadmap checklist + auth/observability notes.
+> - v3 (2026-05-31): added **§5.5 Error handling & cleanup** (current behavior +
+>   lease-release-on-failure, branch/worktree GC, same-repo worktree clobber).
+>   Review fixes: unified the field name to `targetFactoryId` (§5.1), reconciled
+>   §5.3 with the complete-on-claim model (broker is not the redelivery path),
+>   aligned §6 token scoping with per-factory subscriptions, and added the GC /
+>   `POST /fleet/fail` checklist block to §12.
 
 ---
 
@@ -96,7 +102,7 @@ Two structural problems surfaced while running the local fleet against
    broker-owns-delivery hybrid** (B2 ⊕ B3): the coordinator decides *which
    factory* should run a job and pushes a **targeted** message; the broker
    handles transport, visibility timeout, retries, and dead-lettering.
-4. **Ship the cheap RU win first (B1) as Phase 0** — it is reversible, needs no
+4. **Ship the cheap RU win first (B1) as step M0** — it is reversible, needs no
    new infra, and de-risks the broker migration by removing the bleed while the
    bigger change is built and shadowed.
 
@@ -166,7 +172,7 @@ no separate outbox container is needed for the primary design:
    event.
 2. A single **dispatcher** (coordinator process) tails the `fleet_jobs` change
    feed (via a lease container), runs the scheduler for each new/`queued` job,
-   stamps `assignedFactoryId` on the job (CAS), and **publishes** the targeted
+   stamps `targetFactoryId` on the job (CAS), and **publishes** the targeted
    Service Bus message.
 3. **Crash-safe & idempotent:** the change feed redelivers from the last
    checkpoint on dispatcher restart; Service Bus **duplicate detection** keyed on
@@ -209,12 +215,21 @@ no separate outbox container is needed for the primary design:
 
 ### 5.3 Failure / retry / DLQ
 
-- Engine failure / verify-fail ⇒ coordinator transitions `failed`; broker
-  message completed (don't redeliver a logical failure).
-- Crash / lease-expiry ⇒ broker redelivers after visibility timeout **and** the
-  existing reaper reclaims the Cosmos lease (bumps epoch). The redelivered
-  message re-runs `/fleet/claim`, which fences the old holder.
-- Exhausted retries ⇒ broker DLQ + Cosmos `retries_exhausted`.
+Assumes the recommended **complete-on-claim** model (§5.2): the broker message is
+completed at claim, so the broker is **not** the redelivery path — re-dispatch is
+driven by Cosmos stage changes through the change feed (§5.1).
+
+- **Logical failure** (engine error / verify-fail) ⇒ coordinator transitions
+  `failed` and **releases the lease immediately** (new `/fleet/fail`, see §5.5);
+  no redelivery (a logical failure is terminal unless a retry policy applies).
+- **Retryable failure** ⇒ coordinator sets the job back to `queued` (attempts++,
+  backoff) ⇒ change-feed re-dispatch to the next best factory.
+- **Crash / lease-expiry** ⇒ the **reaper** reclaims the Cosmos lease (bumps
+  `leaseEpoch`, fencing the dead holder) and returns the job to `queued` ⇒
+  change-feed re-dispatch. (With the alternative *renew-lock* model, broker
+  redelivery is the trigger instead — pick one, not both.)
+- **Exhausted retries** ⇒ Cosmos `retries_exhausted`; mirror to the broker DLQ
+  for visibility.
 
 ### 5.4 Routing model (the §1.1 fix)
 
@@ -231,6 +246,43 @@ no separate outbox container is needed for the primary design:
   drop the bogus default `capabilities = "build"`, and stop hardcoding
   `mac-1/mac-2`.
 
+### 5.5 Error handling & cleanup (worktrees, branches, leases)
+
+**Today (single-host, agent-queue.sh).** The worker already handles errors well:
+the stage machine routes `timeout`/`budget_exceeded`/`crash`/`verify_failed`/
+`capability_mismatch`/`no_engine` through `_finish_failure` (→ `failed/`, with a
+retry policy that requeues to `inbox/` with backoff); a `trap` writes a WIP
+checkpoint to `aq/wip/<job>` on **every** exit path; `recover_orphans` requeues
+dead-worker `building/` jobs; and a **FENCED** report (stale `leaseEpoch`)
+triggers `fleet_quarantine` → `failed/` that **never ships or merges**
+(split-brain guard). PR/merge cleanup: `.aq_pr.md` is removed before commit; the
+PR branch `aq/job/<jid>` is deleted on auto-merge (`--delete-branch`); the repo
+worktree is force-recreated at the next job for that repo.
+
+**Gaps this redesign must close.** These are real loose ends in the current code:
+
+1. **No client-side lease release on failure.** `_finish_failure` is
+   fleet-agnostic, so a failed fleet job's lease only frees on **expiry** via the
+   reaper — slow recovery. Target: a `POST /fleet/fail` (stage=`failed`/`queued`
+   + release lease) so failure is reflected and the lease freed **immediately**.
+2. **Unbounded git artifacts.** `aq/wip/<job>` branches are never GC'd; worktrees
+   are cleaned only on reuse; unmerged `aq/job/<jid>` branches accumulate on
+   origin when auto-merge is off or blocked by branch protection. Target: a
+   periodic **GC** sweep — delete merged `aq/job/*`, prune stale worktrees, and
+   sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
+3. **Same-repo concurrency can clobber a worktree.** The per-repo worktree is
+   force-recreated, so two same-repo jobs on one host collide. Target: **Service
+   Bus sessions keyed by `repo`** (serialize same-repo work) plus a per-`(host,
+   repo)` lock as a local backstop.
+
+**Target invariants.**
+- Terminal failure ⇒ Cosmos `failed` + lease released now (no expiry wait); DLQ
+  mirrors `retries_exhausted` for visibility.
+- Crash / fence ⇒ reaper bumps `leaseEpoch` (fences zombie) ⇒ `queued` ⇒
+  change-feed re-dispatch (§5.3).
+- Cleanup is **explicit and idempotent** — safe to re-run, never deletes a branch
+  with unmerged work or a worktree with an in-flight job. (Checklist in §12.)
+
 ---
 
 ## 6. Per-product tenancy without product-partitioned queues
@@ -239,10 +291,11 @@ no separate outbox container is needed for the primary design:
   `fleet_budgets /productId` in `claimNextJob`); unchanged, just moved to the
   dispatcher.
 - **Tokens (§12):** the factory token still scopes `productId + capabilities +
-  factoryId`; the coordinator enforces it on `/fleet/claim`. A factory may
-  consume cross-product messages **only** for products its token covers — model
-  this as per-product (or per-token) subscriptions/filters so least-privilege is
-  preserved.
+  factoryId`. In the primary (coordinator-targeted) model the dispatcher only
+  ever targets a factory the scheduler deemed eligible, and the coordinator
+  **re-checks the token on `/fleet/claim`** — so least-privilege holds without
+  relying on the subscription topology. (In the self-select fallback, scope it
+  with per-product/per-token subscription filters instead.)
 - **Visibility:** `tracker-web` keeps querying per product (state is still
   product-partitioned), so the UX is unchanged.
 
@@ -381,6 +434,12 @@ flag-gated + reversible.
 - [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag.
 - [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
 
+### Error handling & cleanup (lands with M2) — see §5.5
+- [ ] Add `POST /fleet/fail` so a failed job sets the coordinator stage + **releases the lease immediately** (no expiry wait); wire it into `_finish_failure` / `fleet_quarantine`.
+- [ ] GC sweep (idempotent): delete merged `aq/job/*` branches, prune stale worktrees, sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
+- [ ] Prevent same-repo worktree clobber: Service Bus **sessions keyed by `repo`** + a per-`(host, repo)` local lock.
+- [ ] Verify: failed jobs free their lease promptly; no orphaned worktrees/branches after N jobs; GC never deletes unmerged work or an in-flight worktree.
+
 ### M3 — On-demand factories (scale-to-zero)
 - [ ] KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers.
 - [ ] Optional warm-pool (1 small instance) if cold-start latency matters.