docs(gigafactory): add error-handling & cleanup section + v3 review fixes

Adds §5.5 (lease-release-on-failure, branch/worktree GC, same-repo worktree
clobber) with target invariants, plus a §12 checklist block. v3 review:
unify targetFactoryId, reconcile §5.3 with complete-on-claim, align §6 token
scoping with per-factory subscriptions, M0 wording.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
This commit is contained in:
saravanakumardb1 2026-05-31 22:45:21 -07:00
parent 8217a864e9
commit 9f24a7fdd0

View File

@ -14,6 +14,12 @@
> idempotency key (`MessageId = jobId`), renamed migration steps `M0M3` to > idempotency key (`MessageId = jobId`), renamed migration steps `M0M3` to
> avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a > avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a
> ticked roadmap checklist + auth/observability notes. > ticked roadmap checklist + auth/observability notes.
> - v3 (2026-05-31): added **§5.5 Error handling & cleanup** (current behavior +
> lease-release-on-failure, branch/worktree GC, same-repo worktree clobber).
> Review fixes: unified the field name to `targetFactoryId` (§5.1), reconciled
> §5.3 with the complete-on-claim model (broker is not the redelivery path),
> aligned §6 token scoping with per-factory subscriptions, and added the GC /
> `POST /fleet/fail` checklist block to §12.
--- ---
@ -96,7 +102,7 @@ Two structural problems surfaced while running the local fleet against
broker-owns-delivery hybrid** (B2 ⊕ B3): the coordinator decides *which broker-owns-delivery hybrid** (B2 ⊕ B3): the coordinator decides *which
factory* should run a job and pushes a **targeted** message; the broker factory* should run a job and pushes a **targeted** message; the broker
handles transport, visibility timeout, retries, and dead-lettering. handles transport, visibility timeout, retries, and dead-lettering.
4. **Ship the cheap RU win first (B1) as Phase 0** — it is reversible, needs no 4. **Ship the cheap RU win first (B1) as step M0** — it is reversible, needs no
new infra, and de-risks the broker migration by removing the bleed while the new infra, and de-risks the broker migration by removing the bleed while the
bigger change is built and shadowed. bigger change is built and shadowed.
@ -166,7 +172,7 @@ no separate outbox container is needed for the primary design:
event. event.
2. A single **dispatcher** (coordinator process) tails the `fleet_jobs` change 2. A single **dispatcher** (coordinator process) tails the `fleet_jobs` change
feed (via a lease container), runs the scheduler for each new/`queued` job, feed (via a lease container), runs the scheduler for each new/`queued` job,
stamps `assignedFactoryId` on the job (CAS), and **publishes** the targeted stamps `targetFactoryId` on the job (CAS), and **publishes** the targeted
Service Bus message. Service Bus message.
3. **Crash-safe & idempotent:** the change feed redelivers from the last 3. **Crash-safe & idempotent:** the change feed redelivers from the last
checkpoint on dispatcher restart; Service Bus **duplicate detection** keyed on checkpoint on dispatcher restart; Service Bus **duplicate detection** keyed on
@ -209,12 +215,21 @@ no separate outbox container is needed for the primary design:
### 5.3 Failure / retry / DLQ ### 5.3 Failure / retry / DLQ
- Engine failure / verify-fail ⇒ coordinator transitions `failed`; broker Assumes the recommended **complete-on-claim** model (§5.2): the broker message is
message completed (don't redeliver a logical failure). completed at claim, so the broker is **not** the redelivery path — re-dispatch is
- Crash / lease-expiry ⇒ broker redelivers after visibility timeout **and** the driven by Cosmos stage changes through the change feed (§5.1).
existing reaper reclaims the Cosmos lease (bumps epoch). The redelivered
message re-runs `/fleet/claim`, which fences the old holder. - **Logical failure** (engine error / verify-fail) ⇒ coordinator transitions
- Exhausted retries ⇒ broker DLQ + Cosmos `retries_exhausted`. `failed` and **releases the lease immediately** (new `/fleet/fail`, see §5.5);
no redelivery (a logical failure is terminal unless a retry policy applies).
- **Retryable failure** ⇒ coordinator sets the job back to `queued` (attempts++,
backoff) ⇒ change-feed re-dispatch to the next best factory.
- **Crash / lease-expiry** ⇒ the **reaper** reclaims the Cosmos lease (bumps
`leaseEpoch`, fencing the dead holder) and returns the job to `queued`
change-feed re-dispatch. (With the alternative *renew-lock* model, broker
redelivery is the trigger instead — pick one, not both.)
- **Exhausted retries** ⇒ Cosmos `retries_exhausted`; mirror to the broker DLQ
for visibility.
### 5.4 Routing model (the §1.1 fix) ### 5.4 Routing model (the §1.1 fix)
@ -231,6 +246,43 @@ no separate outbox container is needed for the primary design:
drop the bogus default `capabilities = "build"`, and stop hardcoding drop the bogus default `capabilities = "build"`, and stop hardcoding
`mac-1/mac-2`. `mac-1/mac-2`.
### 5.5 Error handling & cleanup (worktrees, branches, leases)
**Today (single-host, agent-queue.sh).** The worker already handles errors well:
the stage machine routes `timeout`/`budget_exceeded`/`crash`/`verify_failed`/
`capability_mismatch`/`no_engine` through `_finish_failure` (→ `failed/`, with a
retry policy that requeues to `inbox/` with backoff); a `trap` writes a WIP
checkpoint to `aq/wip/<job>` on **every** exit path; `recover_orphans` requeues
dead-worker `building/` jobs; and a **FENCED** report (stale `leaseEpoch`)
triggers `fleet_quarantine``failed/` that **never ships or merges**
(split-brain guard). PR/merge cleanup: `.aq_pr.md` is removed before commit; the
PR branch `aq/job/<jid>` is deleted on auto-merge (`--delete-branch`); the repo
worktree is force-recreated at the next job for that repo.
**Gaps this redesign must close.** These are real loose ends in the current code:
1. **No client-side lease release on failure.** `_finish_failure` is
fleet-agnostic, so a failed fleet job's lease only frees on **expiry** via the
reaper — slow recovery. Target: a `POST /fleet/fail` (stage=`failed`/`queued`
+ release lease) so failure is reflected and the lease freed **immediately**.
2. **Unbounded git artifacts.** `aq/wip/<job>` branches are never GC'd; worktrees
are cleaned only on reuse; unmerged `aq/job/<jid>` branches accumulate on
origin when auto-merge is off or blocked by branch protection. Target: a
periodic **GC** sweep — delete merged `aq/job/*`, prune stale worktrees, and
sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
3. **Same-repo concurrency can clobber a worktree.** The per-repo worktree is
force-recreated, so two same-repo jobs on one host collide. Target: **Service
Bus sessions keyed by `repo`** (serialize same-repo work) plus a per-`(host,
repo)` lock as a local backstop.
**Target invariants.**
- Terminal failure ⇒ Cosmos `failed` + lease released now (no expiry wait); DLQ
mirrors `retries_exhausted` for visibility.
- Crash / fence ⇒ reaper bumps `leaseEpoch` (fences zombie) ⇒ `queued`
change-feed re-dispatch (§5.3).
- Cleanup is **explicit and idempotent** — safe to re-run, never deletes a branch
with unmerged work or a worktree with an in-flight job. (Checklist in §12.)
--- ---
## 6. Per-product tenancy without product-partitioned queues ## 6. Per-product tenancy without product-partitioned queues
@ -239,10 +291,11 @@ no separate outbox container is needed for the primary design:
`fleet_budgets /productId` in `claimNextJob`); unchanged, just moved to the `fleet_budgets /productId` in `claimNextJob`); unchanged, just moved to the
dispatcher. dispatcher.
- **Tokens (§12):** the factory token still scopes `productId + capabilities + - **Tokens (§12):** the factory token still scopes `productId + capabilities +
factoryId`; the coordinator enforces it on `/fleet/claim`. A factory may factoryId`. In the primary (coordinator-targeted) model the dispatcher only
consume cross-product messages **only** for products its token covers — model ever targets a factory the scheduler deemed eligible, and the coordinator
this as per-product (or per-token) subscriptions/filters so least-privilege is **re-checks the token on `/fleet/claim`** — so least-privilege holds without
preserved. relying on the subscription topology. (In the self-select fallback, scope it
with per-product/per-token subscription filters instead.)
- **Visibility:** `tracker-web` keeps querying per product (state is still - **Visibility:** `tracker-web` keeps querying per product (state is still
product-partitioned), so the UX is unchanged. product-partitioned), so the UX is unchanged.
@ -381,6 +434,12 @@ flag-gated + reversible.
- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag. - [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag.
- [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct. - [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
### Error handling & cleanup (lands with M2) — see §5.5
- [ ] Add `POST /fleet/fail` so a failed job sets the coordinator stage + **releases the lease immediately** (no expiry wait); wire it into `_finish_failure` / `fleet_quarantine`.
- [ ] GC sweep (idempotent): delete merged `aq/job/*` branches, prune stale worktrees, sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
- [ ] Prevent same-repo worktree clobber: Service Bus **sessions keyed by `repo`** + a per-`(host, repo)` local lock.
- [ ] Verify: failed jobs free their lease promptly; no orphaned worktrees/branches after N jobs; GC never deletes unmerged work or an in-flight worktree.
### M3 — On-demand factories (scale-to-zero) ### M3 — On-demand factories (scale-to-zero)
- [ ] KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers. - [ ] KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers.
- [ ] Optional warm-pool (1 small instance) if cold-start latency matters. - [ ] Optional warm-pool (1 small instance) if cold-start latency matters.