docs(gigafactory): add error-handling & cleanup section + v3 review fixes
Adds §5.5 (lease-release-on-failure, branch/worktree GC, same-repo worktree clobber) with target invariants, plus a §12 checklist block. v3 review: unify targetFactoryId, reconcile §5.3 with complete-on-claim, align §6 token scoping with per-factory subscriptions, M0 wording. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
This commit is contained in:
parent
8217a864e9
commit
9f24a7fdd0
@ -14,6 +14,12 @@
|
|||||||
> idempotency key (`MessageId = jobId`), renamed migration steps `M0–M3` to
|
> idempotency key (`MessageId = jobId`), renamed migration steps `M0–M3` to
|
||||||
> avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a
|
> avoid collision with roadmap phases, fixed the Phase-0 RU figure, and added a
|
||||||
> ticked roadmap checklist + auth/observability notes.
|
> ticked roadmap checklist + auth/observability notes.
|
||||||
|
> - v3 (2026-05-31): added **§5.5 Error handling & cleanup** (current behavior +
|
||||||
|
> lease-release-on-failure, branch/worktree GC, same-repo worktree clobber).
|
||||||
|
> Review fixes: unified the field name to `targetFactoryId` (§5.1), reconciled
|
||||||
|
> §5.3 with the complete-on-claim model (broker is not the redelivery path),
|
||||||
|
> aligned §6 token scoping with per-factory subscriptions, and added the GC /
|
||||||
|
> `POST /fleet/fail` checklist block to §12.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@ -96,7 +102,7 @@ Two structural problems surfaced while running the local fleet against
|
|||||||
broker-owns-delivery hybrid** (B2 ⊕ B3): the coordinator decides *which
|
broker-owns-delivery hybrid** (B2 ⊕ B3): the coordinator decides *which
|
||||||
factory* should run a job and pushes a **targeted** message; the broker
|
factory* should run a job and pushes a **targeted** message; the broker
|
||||||
handles transport, visibility timeout, retries, and dead-lettering.
|
handles transport, visibility timeout, retries, and dead-lettering.
|
||||||
4. **Ship the cheap RU win first (B1) as Phase 0** — it is reversible, needs no
|
4. **Ship the cheap RU win first (B1) as step M0** — it is reversible, needs no
|
||||||
new infra, and de-risks the broker migration by removing the bleed while the
|
new infra, and de-risks the broker migration by removing the bleed while the
|
||||||
bigger change is built and shadowed.
|
bigger change is built and shadowed.
|
||||||
|
|
||||||
@ -166,7 +172,7 @@ no separate outbox container is needed for the primary design:
|
|||||||
event.
|
event.
|
||||||
2. A single **dispatcher** (coordinator process) tails the `fleet_jobs` change
|
2. A single **dispatcher** (coordinator process) tails the `fleet_jobs` change
|
||||||
feed (via a lease container), runs the scheduler for each new/`queued` job,
|
feed (via a lease container), runs the scheduler for each new/`queued` job,
|
||||||
stamps `assignedFactoryId` on the job (CAS), and **publishes** the targeted
|
stamps `targetFactoryId` on the job (CAS), and **publishes** the targeted
|
||||||
Service Bus message.
|
Service Bus message.
|
||||||
3. **Crash-safe & idempotent:** the change feed redelivers from the last
|
3. **Crash-safe & idempotent:** the change feed redelivers from the last
|
||||||
checkpoint on dispatcher restart; Service Bus **duplicate detection** keyed on
|
checkpoint on dispatcher restart; Service Bus **duplicate detection** keyed on
|
||||||
@ -209,12 +215,21 @@ no separate outbox container is needed for the primary design:
|
|||||||
|
|
||||||
### 5.3 Failure / retry / DLQ
|
### 5.3 Failure / retry / DLQ
|
||||||
|
|
||||||
- Engine failure / verify-fail ⇒ coordinator transitions `failed`; broker
|
Assumes the recommended **complete-on-claim** model (§5.2): the broker message is
|
||||||
message completed (don't redeliver a logical failure).
|
completed at claim, so the broker is **not** the redelivery path — re-dispatch is
|
||||||
- Crash / lease-expiry ⇒ broker redelivers after visibility timeout **and** the
|
driven by Cosmos stage changes through the change feed (§5.1).
|
||||||
existing reaper reclaims the Cosmos lease (bumps epoch). The redelivered
|
|
||||||
message re-runs `/fleet/claim`, which fences the old holder.
|
- **Logical failure** (engine error / verify-fail) ⇒ coordinator transitions
|
||||||
- Exhausted retries ⇒ broker DLQ + Cosmos `retries_exhausted`.
|
`failed` and **releases the lease immediately** (new `/fleet/fail`, see §5.5);
|
||||||
|
no redelivery (a logical failure is terminal unless a retry policy applies).
|
||||||
|
- **Retryable failure** ⇒ coordinator sets the job back to `queued` (attempts++,
|
||||||
|
backoff) ⇒ change-feed re-dispatch to the next best factory.
|
||||||
|
- **Crash / lease-expiry** ⇒ the **reaper** reclaims the Cosmos lease (bumps
|
||||||
|
`leaseEpoch`, fencing the dead holder) and returns the job to `queued` ⇒
|
||||||
|
change-feed re-dispatch. (With the alternative *renew-lock* model, broker
|
||||||
|
redelivery is the trigger instead — pick one, not both.)
|
||||||
|
- **Exhausted retries** ⇒ Cosmos `retries_exhausted`; mirror to the broker DLQ
|
||||||
|
for visibility.
|
||||||
|
|
||||||
### 5.4 Routing model (the §1.1 fix)
|
### 5.4 Routing model (the §1.1 fix)
|
||||||
|
|
||||||
@ -231,6 +246,43 @@ no separate outbox container is needed for the primary design:
|
|||||||
drop the bogus default `capabilities = "build"`, and stop hardcoding
|
drop the bogus default `capabilities = "build"`, and stop hardcoding
|
||||||
`mac-1/mac-2`.
|
`mac-1/mac-2`.
|
||||||
|
|
||||||
|
### 5.5 Error handling & cleanup (worktrees, branches, leases)
|
||||||
|
|
||||||
|
**Today (single-host, agent-queue.sh).** The worker already handles errors well:
|
||||||
|
the stage machine routes `timeout`/`budget_exceeded`/`crash`/`verify_failed`/
|
||||||
|
`capability_mismatch`/`no_engine` through `_finish_failure` (→ `failed/`, with a
|
||||||
|
retry policy that requeues to `inbox/` with backoff); a `trap` writes a WIP
|
||||||
|
checkpoint to `aq/wip/<job>` on **every** exit path; `recover_orphans` requeues
|
||||||
|
dead-worker `building/` jobs; and a **FENCED** report (stale `leaseEpoch`)
|
||||||
|
triggers `fleet_quarantine` → `failed/` that **never ships or merges**
|
||||||
|
(split-brain guard). PR/merge cleanup: `.aq_pr.md` is removed before commit; the
|
||||||
|
PR branch `aq/job/<jid>` is deleted on auto-merge (`--delete-branch`); the repo
|
||||||
|
worktree is force-recreated at the next job for that repo.
|
||||||
|
|
||||||
|
**Gaps this redesign must close.** These are real loose ends in the current code:
|
||||||
|
|
||||||
|
1. **No client-side lease release on failure.** `_finish_failure` is
|
||||||
|
fleet-agnostic, so a failed fleet job's lease only frees on **expiry** via the
|
||||||
|
reaper — slow recovery. Target: a `POST /fleet/fail` (stage=`failed`/`queued`
|
||||||
|
+ release lease) so failure is reflected and the lease freed **immediately**.
|
||||||
|
2. **Unbounded git artifacts.** `aq/wip/<job>` branches are never GC'd; worktrees
|
||||||
|
are cleaned only on reuse; unmerged `aq/job/<jid>` branches accumulate on
|
||||||
|
origin when auto-merge is off or blocked by branch protection. Target: a
|
||||||
|
periodic **GC** sweep — delete merged `aq/job/*`, prune stale worktrees, and
|
||||||
|
sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
|
||||||
|
3. **Same-repo concurrency can clobber a worktree.** The per-repo worktree is
|
||||||
|
force-recreated, so two same-repo jobs on one host collide. Target: **Service
|
||||||
|
Bus sessions keyed by `repo`** (serialize same-repo work) plus a per-`(host,
|
||||||
|
repo)` lock as a local backstop.
|
||||||
|
|
||||||
|
**Target invariants.**
|
||||||
|
- Terminal failure ⇒ Cosmos `failed` + lease released now (no expiry wait); DLQ
|
||||||
|
mirrors `retries_exhausted` for visibility.
|
||||||
|
- Crash / fence ⇒ reaper bumps `leaseEpoch` (fences zombie) ⇒ `queued` ⇒
|
||||||
|
change-feed re-dispatch (§5.3).
|
||||||
|
- Cleanup is **explicit and idempotent** — safe to re-run, never deletes a branch
|
||||||
|
with unmerged work or a worktree with an in-flight job. (Checklist in §12.)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 6. Per-product tenancy without product-partitioned queues
|
## 6. Per-product tenancy without product-partitioned queues
|
||||||
@ -239,10 +291,11 @@ no separate outbox container is needed for the primary design:
|
|||||||
`fleet_budgets /productId` in `claimNextJob`); unchanged, just moved to the
|
`fleet_budgets /productId` in `claimNextJob`); unchanged, just moved to the
|
||||||
dispatcher.
|
dispatcher.
|
||||||
- **Tokens (§12):** the factory token still scopes `productId + capabilities +
|
- **Tokens (§12):** the factory token still scopes `productId + capabilities +
|
||||||
factoryId`; the coordinator enforces it on `/fleet/claim`. A factory may
|
factoryId`. In the primary (coordinator-targeted) model the dispatcher only
|
||||||
consume cross-product messages **only** for products its token covers — model
|
ever targets a factory the scheduler deemed eligible, and the coordinator
|
||||||
this as per-product (or per-token) subscriptions/filters so least-privilege is
|
**re-checks the token on `/fleet/claim`** — so least-privilege holds without
|
||||||
preserved.
|
relying on the subscription topology. (In the self-select fallback, scope it
|
||||||
|
with per-product/per-token subscription filters instead.)
|
||||||
- **Visibility:** `tracker-web` keeps querying per product (state is still
|
- **Visibility:** `tracker-web` keeps querying per product (state is still
|
||||||
product-partitioned), so the UX is unchanged.
|
product-partitioned), so the UX is unchanged.
|
||||||
|
|
||||||
@ -381,6 +434,12 @@ flag-gated + reversible.
|
|||||||
- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag.
|
- [ ] Emit metrics: subscription depth, dispatch lag, 409 claim-conflict rate, DLQ count, change-feed lag.
|
||||||
- [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
|
- [ ] Verify exactly-once + crash recovery on a real multi-host run; DLQ ↔ `failed`/`retries_exhausted` mapping correct.
|
||||||
|
|
||||||
|
### Error handling & cleanup (lands with M2) — see §5.5
|
||||||
|
- [ ] Add `POST /fleet/fail` so a failed job sets the coordinator stage + **releases the lease immediately** (no expiry wait); wire it into `_finish_failure` / `fleet_quarantine`.
|
||||||
|
- [ ] GC sweep (idempotent): delete merged `aq/job/*` branches, prune stale worktrees, sweep `aq/wip/*` after a job reaches a terminal/shipped stage.
|
||||||
|
- [ ] Prevent same-repo worktree clobber: Service Bus **sessions keyed by `repo`** + a per-`(host, repo)` local lock.
|
||||||
|
- [ ] Verify: failed jobs free their lease promptly; no orphaned worktrees/branches after N jobs; GC never deletes unmerged work or an in-flight worktree.
|
||||||
|
|
||||||
### M3 — On-demand factories (scale-to-zero)
|
### M3 — On-demand factories (scale-to-zero)
|
||||||
- [ ] KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers.
|
- [ ] KEDA / Container Apps scaler on subscription depth; idle ⇒ zero running workers.
|
||||||
- [ ] Optional warm-pool (1 small instance) if cold-start latency matters.
|
- [ ] Optional warm-pool (1 small instance) if cold-start latency matters.
|
||||||
|
|||||||
Loading…
Reference in New Issue
Block a user