From 9f24a7fdd01b48c5bf58fb84d5b1c593e1bb10f2 Mon Sep 17 00:00:00 2001 From: saravanakumardb1 Date: Sun, 31 May 2026 22:45:21 -0700 Subject: [PATCH] docs(gigafactory): add error-handling & cleanup section + v3 review fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds §5.5 (lease-release-on-failure, branch/worktree GC, same-repo worktree clobber) with target invariants, plus a §12 checklist block. v3 review: unify targetFactoryId, reconcile §5.3 with complete-on-claim, align §6 token scoping with per-factory subscriptions, M0 wording. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com> --- .../GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md | 83 ++++++++++++++++--- 1 file changed, 71 insertions(+), 12 deletions(-) diff --git a/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md b/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md index f430972..5cf858d 100644 --- a/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md +++ b/agent-queue/docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md @@ -14,6 +14,12 @@ > idempotency key (`MessageId = jobId`), renamed migration steps `M0–M3` to > avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a > ticked roadmap checklist + auth/observability notes. +> - v3 (2026-05-31): added **§5.5 Error handling & cleanup** (current behavior + +> lease-release-on-failure, branch/worktree GC, same-repo worktree clobber). +> Review fixes: unified the field name to `targetFactoryId` (§5.1), reconciled +> §5.3 with the complete-on-claim model (broker is not the redelivery path), +> aligned §6 token scoping with per-factory subscriptions, and added the GC / +> `POST /fleet/fail` checklist block to §12. --- @@ -96,7 +102,7 @@ Two structural problems surfaced while running the local fleet against broker-owns-delivery hybrid** (B2 ⊕ B3): the coordinator decides *which factory* should run a job and pushes a **targeted** message; the broker handles transport, visibility timeout, retries, and dead-lettering. -4. **Ship the cheap RU win first (B1) as Phase 0** — it is reversible, needs no +4. **Ship the cheap RU win first (B1) as step M0** — it is reversible, needs no new infra, and de-risks the broker migration by removing the bleed while the bigger change is built and shadowed. @@ -166,7 +172,7 @@ no separate outbox container is needed for the primary design: event. 2. A single **dispatcher** (coordinator process) tails the `fleet_jobs` change feed (via a lease container), runs the scheduler for each new/`queued` job, - stamps `assignedFactoryId` on the job (CAS), and **publishes** the targeted + stamps `targetFactoryId` on the job (CAS), and **publishes** the targeted Service Bus message. 3. **Crash-safe & idempotent:** the change feed redelivers from the last checkpoint on dispatcher restart; Service Bus **duplicate detection** keyed on @@ -209,12 +215,21 @@ no separate outbox container is needed for the primary design: ### 5.3 Failure / retry / DLQ -- Engine failure / verify-fail ⇒ coordinator transitions `failed`; broker - message completed (don't redeliver a logical failure). -- Crash / lease-expiry ⇒ broker redelivers after visibility timeout **and** the - existing reaper reclaims the Cosmos lease (bumps epoch). The redelivered - message re-runs `/fleet/claim`, which fences the old holder. -- Exhausted retries ⇒ broker DLQ + Cosmos `retries_exhausted`. +Assumes the recommended **complete-on-claim** model (§5.2): the broker message is +completed at claim, so the broker is **not** the redelivery path — re-dispatch is +driven by Cosmos stage changes through the change feed (§5.1). + +- **Logical failure** (engine error / verify-fail) ⇒ coordinator transitions + `failed` and **releases the lease immediately** (new `/fleet/fail`, see §5.5); + no redelivery (a logical failure is terminal unless a retry policy applies). +- **Retryable failure** ⇒ coordinator sets the job back to `queued` (attempts++, + backoff) ⇒ change-feed re-dispatch to the next best factory. +- **Crash / lease-expiry** ⇒ the **reaper** reclaims the Cosmos lease (bumps + `leaseEpoch`, fencing the dead holder) and returns the job to `queued` ⇒ + change-feed re-dispatch. (With the alternative *renew-lock* model, broker + redelivery is the trigger instead — pick one, not both.) +- **Exhausted retries** ⇒ Cosmos `retries_exhausted`; mirror to the broker DLQ + for visibility. ### 5.4 Routing model (the §1.1 fix) @@ -231,6 +246,43 @@ no separate outbox container is needed for the primary design: drop the bogus default `capabilities = "build"`, and stop hardcoding `mac-1/mac-2`. +### 5.5 Error handling & cleanup (worktrees, branches, leases) + +**Today (single-host, agent-queue.sh).** The worker already handles errors well: +the stage machine routes `timeout`/`budget_exceeded`/`crash`/`verify_failed`/ +`capability_mismatch`/`no_engine` through `_finish_failure` (→ `failed/`, with a +retry policy that requeues to `inbox/` with backoff); a `trap` writes a WIP +checkpoint to `aq/wip/` on **every** exit path; `recover_orphans` requeues +dead-worker `building/` jobs; and a **FENCED** report (stale `leaseEpoch`) +triggers `fleet_quarantine` → `failed/` that **never ships or merges** +(split-brain guard). PR/merge cleanup: `.aq_pr.md` is removed before commit; the +PR branch `aq/job/` is deleted on auto-merge (`--delete-branch`); the repo +worktree is force-recreated at the next job for that repo. + +**Gaps this redesign must close.** These are real loose ends in the current code: + +1. **No client-side lease release on failure.** `_finish_failure` is + fleet-agnostic, so a failed fleet job's lease only frees on **expiry** via the + reaper — slow recovery. Target: a `POST /fleet/fail` (stage=`failed`/`queued` + + release lease) so failure is reflected and the lease freed **immediately**. +2. **Unbounded git artifacts.** `aq/wip/` branches are never GC'd; worktrees + are cleaned only on reuse; unmerged `aq/job/` branches accumulate on + origin when auto-merge is off or blocked by branch protection. Target: a + periodic **GC** sweep — delete merged `aq/job/*`, prune stale worktrees, and + sweep `aq/wip/*` after a job reaches a terminal/shipped stage. +3. **Same-repo concurrency can clobber a worktree.** The per-repo worktree is + force-recreated, so two same-repo jobs on one host collide. Target: **Service + Bus sessions keyed by `repo`** (serialize same-repo work) plus a per-`(host, + repo)` lock as a local backstop. + +**Target invariants.** +- Terminal failure ⇒ Cosmos `failed` + lease released now (no expiry wait); DLQ + mirrors `retries_exhausted` for visibility. +- Crash / fence ⇒ reaper bumps `leaseEpoch` (fences zombie) ⇒ `queued` ⇒ + change-feed re-dispatch (§5.3). +- Cleanup is **explicit and idempotent** — safe to re-run, never deletes a branch + with unmerged work or a worktree with an in-flight job. (Checklist in §12.) + --- ## 6. Per-product tenancy without product-partitioned queues @@ -239,10 +291,11 @@ no separate outbox container is needed for the primary design: `fleet_budgets /productId` in `claimNextJob`); unchanged, just moved to the dispatcher. - **Tokens (§12):** the factory token still scopes `productId + capabilities + - factoryId`; the coordinator enforces it on `/fleet/claim`. A factory may - consume cross-product messages **only** for products its token covers — model - this as per-product (or per-token) subscriptions/filters so least-privilege is - preserved. + factoryId`. In the primary (coordinator-targeted) model the dispatcher only + ever targets a factory the scheduler deemed eligible, and the coordinator + **re-checks the token on `/fleet/claim`** — so least-privilege holds without + relying on the subscription topology. (In the self-select fallback, scope it + with per-product/per-token subscription filters instead.) - **Visibility:** `tracker-web` keeps querying per product (state is still product-partitioned), so the UX is unchanged. @@ -381,6 +434,12 @@ flag-gated + reversible. - [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag. - [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct. +### Error handling & cleanup (lands with M2) — see §5.5 +- [ ] Add `POST /fleet/fail` so a failed job sets the coordinator stage + **releases the lease immediately** (no expiry wait); wire it into `_finish_failure` / `fleet_quarantine`. +- [ ] GC sweep (idempotent): delete merged `aq/job/*` branches, prune stale worktrees, sweep `aq/wip/*` after a job reaches a terminal/shipped stage. +- [ ] Prevent same-repo worktree clobber: Service Bus **sessions keyed by `repo`** + a per-`(host, repo)` local lock. +- [ ] Verify: failed jobs free their lease promptly; no orphaned worktrees/branches after N jobs; GC never deletes unmerged work or an in-flight worktree. + ### M3 — On-demand factories (scale-to-zero) - [ ] KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers. - [ ] Optional warm-pool (1 small instance) if cold-start latency matters.