diff --git a/services/platform-service/src/modules/fleet/CONTENTION_VALIDATION_TODO.md b/services/platform-service/src/modules/fleet/CONTENTION_VALIDATION_TODO.md
new file mode 100644
index 00000000..9a561e92
--- /dev/null
+++ b/services/platform-service/src/modules/fleet/CONTENTION_VALIDATION_TODO.md
@@ -0,0 +1,68 @@
+# TODO — Validate exactly-once claim under true multi-writer contention
+
+> **Status:** open · **Area:** fleet coordinator (`coordinator.ts` / `repository.ts`) · **Priority:** medium
+> **Why this exists:** the exactly-once job-assignment guarantee (CAS on `rev` + lease-epoch fencing) is the correctness backbone of the agent fleet. It is well covered by unit tests, but those run against `MemoryDatastoreProvider`, whose `updateIfMatch` compares `rev` **synchronously** — so the existing "concurrency" tests cannot exercise a _true_ interleaved read→write race. This TODO captures the work to close that gap.
+
+## Background (what's already correct)
+
+- `coordinator.tryClaimJob` flips `queued → assigned` via `repo.revUpdateJob(id, productId, expectedRev, …)` — a compare-and-swap that admits exactly one winner; losers get `conflict` and re-select.
+- Lease acquire/reclaim increments `leaseEpoch`; `patchJobFenced` rejects any write carrying `epoch < job.leaseEpoch` (fences zombie writers). Reaper bumps the epoch on reclaim.
+- The Cosmos provider already conditions the write correctly: `packages/datastore/src/providers/cosmos.ts` sends `accessCondition: { type: 'IfMatch', condition: _etag }`, guarding the read→replace window **server-side**.
+- ⇒ **No production code change is expected.** This is a test/validation task only.
+
+## The gap
+
+`MemoryDatastoreProvider.updateIfMatch` (`packages/datastore/src/providers/memory.ts`) is exact only for **sequential** calls. Two `claimNextJob` invocations in a test cannot truly interleave at the read→write boundary, so we have not _demonstrated_ single-winner behavior under a real contended window or against Cosmos `_etag`/If-Match semantics.
+
+## Plan
+
+### Option B — adversarial-interleaving test (recommended; no infra)
+
+Prove the claim path is single-winner even when the read→write window interleaves, by injecting a yield between the read and the `updateIfMatch` write.
+
+**Files to add:**
+
+- `services/platform-service/src/modules/fleet/coordinator.contention.test.ts` — **NEW**
+  - Wrap `MemoryDatastoreProvider` so `updateIfMatch` `await`s a controllable barrier between reading current state and committing, forcing two concurrent `claimNextJob` calls to interleave at the CAS point.
+  - Assert: exactly **one** claim succeeds; the other returns `conflict`; the winner's job has `state='assigned'` and `leaseEpoch=1`; no lost update.
+  - Add a fencing variant: after a reaper reclaim bumps the epoch, a delayed write from the original holder is rejected (`409`/fenced) and the job is quarantined.
+- `services/platform-service/src/modules/fleet/__testutils__/interleaving-provider.ts` — **NEW (optional)**
+  - A thin `MemoryDatastoreProvider` decorator exposing a `releaseAfterRead()` barrier; or inline the wrapper in the test file.
+
+**No changes to:** `coordinator.ts`, `repository.ts`, datastore providers, `package.json`, CI.
+**Run with:** existing `vitest run --pool forks` (`pnpm --filter @lysnrai/platform-service test`).
+
+### Option A — Cosmos-emulator integration test (gold standard; needs infra)
+
+Exercise the real `_etag`/If-Match path under genuine parallel writes.
+
+**Files to add / edit:**
+
+- `services/platform-service/src/modules/fleet/coordinator.contention.cosmos.test.ts` — **NEW**, env-gated: `describe.skipIf(!process.env.COSMOS_ENDPOINT)`. Fire N parallel `claimNextJob` against the Cosmos emulator; assert single winner + `leaseEpoch=1`.
+- `services/platform-service/package.json` — **EDIT**: add a `test:integration` script (kept out of the default `test` run).
+- `docker-compose.yml` already provides a Cosmos DB emulator for the prototype — reuse it; document the `COSMOS_ENDPOINT`/key env for the integration run rather than adding a parallel stack.
+
+**Requires:** the Cosmos DB emulator running (already part of the prototype stack).
+
+## Acceptance criteria
+
+- [ ] A test forces an interleaved read→write window across two concurrent claims and proves exactly one winner (Option B).
+- [ ] A fencing test proves a stale-epoch write after reclaim is rejected/quarantined.
+- [ ] (Optional) An env-gated Cosmos-emulator test proves the same against real `_etag`/If-Match.
+- [ ] No production logic changed (or, if a real bug is found, fix `coordinator.ts`/`repository.ts` — never weaken the test).
+- [ ] `pnpm --filter @lysnrai/platform-service test` stays green.
+
+## Downstream (interview-pack repo `star_interview_bl`)
+
+Once validated, flip the honest "open item" to "validated under contention" in:
+
+- `gigafactory/star_01_exactly_once_claim_fencing.md`
+- `VERIFIED_METRICS.md` and `INTERVIEW_STORY_INDEX.md` (G01 gap line)
+- honest-gap mentions in `COMPANY_TARGETED_STORY_PACKS/OPENAI_AI_AGENTS_INFRA.md`, `CHEAT_SHEETS/*`, and `INTERVIEW_ANSWER_BANK/FAILURE_CONFLICT_AMBIGUITY_STORIES.md`.
+
+## References
+
+- `coordinator.ts` — `tryClaimJob`, `claimNextJob`, `patchJobFenced`, `reapExpiredLeases`
+- `repository.ts` — `revUpdateJob`, `revUpdateLease`
+- `packages/datastore/src/providers/{memory,cosmos}.ts` — `updateIfMatch`
+- `coordinator.test.ts` — existing single-winner test (sequential, memory provider)
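Note on Option B: the interleaving the TODO asks for can be illustrated with a small self-contained sketch. Everything in it is a hypothetical stand-in, not the repository's code: `TinyStore` only mimics the CAS-on-`rev` behaviour attributed to `MemoryDatastoreProvider.updateIfMatch`, `beginClaim` and `patchFenced` are simplified analogues of the claim path and `patchJobFenced`, and `runContentionScenario` is an invented driver. To keep the ordering deterministic, the barrier is modelled as a two-phase synchronous function (read now, commit later) rather than an awaited promise, and quarantining after a fenced write is not modelled.

```typescript
// Hypothetical stand-ins for illustration only; the real provider, coordinator
// and repository are asynchronous and have richer APIs.
type ClaimResult = "claimed" | "conflict";

interface Job {
  id: string;
  state: "queued" | "assigned";
  rev: number;
  leaseEpoch: number;
  owner?: string;
}

class TinyStore {
  private jobs: Record<string, Job> = {};

  put(job: Job): void {
    this.jobs[job.id] = { ...job };
  }

  // Returns a copy so callers hold a snapshot, not live state.
  get(id: string): Job {
    const job = this.jobs[id];
    if (!job) throw new Error("unknown job " + id);
    return { ...job };
  }

  // Compare-and-swap: commit only if the stored rev still equals expectedRev.
  updateIfMatch(id: string, expectedRev: number, patch: Partial<Job>): boolean {
    const current = this.get(id);
    if (current.rev !== expectedRev) return false;
    this.jobs[id] = { ...current, ...patch, rev: current.rev + 1 };
    return true;
  }
}

// Phase 1 reads the job; the returned function is phase 2, the CAS write.
// The gap between the two calls plays the role of the barrier.
function beginClaim(store: TinyStore, jobId: string, agent: string): () => ClaimResult {
  const seen = store.get(jobId);
  return () => {
    if (seen.state !== "queued") return "conflict";
    const won = store.updateIfMatch(jobId, seen.rev, {
      state: "assigned",
      owner: agent,
      leaseEpoch: seen.leaseEpoch + 1,
    });
    return won ? "claimed" : "conflict";
  };
}

// Fenced write: rejected when the writer's epoch is older than the job's.
function patchFenced(store: TinyStore, jobId: string, epoch: number, patch: Partial<Job>): boolean {
  const current = store.get(jobId);
  if (epoch < current.leaseEpoch) return false;
  return store.updateIfMatch(jobId, current.rev, patch);
}

function runContentionScenario() {
  const store = new TinyStore();
  store.put({ id: "job-1", state: "queued", rev: 1, leaseEpoch: 0 });

  // Both claimants read rev=1 before either one writes.
  const commitA = beginClaim(store, "job-1", "agent-a");
  const commitB = beginClaim(store, "job-1", "agent-b");
  const resultA = commitA();
  const resultB = commitB();
  const afterClaim = store.get("job-1");

  // A simulated reaper reclaim bumps the epoch; the old holder then writes late
  // while still carrying the epoch it was originally granted.
  store.updateIfMatch("job-1", afterClaim.rev, { leaseEpoch: afterClaim.leaseEpoch + 1 });
  const staleWriteAccepted = patchFenced(store, "job-1", afterClaim.leaseEpoch, { owner: "agent-a" });

  return { resultA, resultB, afterClaim, staleWriteAccepted, final: store.get("job-1") };
}

const scenario = runContentionScenario();
console.log(scenario.resultA, scenario.resultB, scenario.afterClaim.leaseEpoch, scenario.staleWriteAccepted);
// prints: claimed conflict 1 false
```

In the real test the same ordering would presumably come from a decorator around `MemoryDatastoreProvider` that awaits a barrier before delegating to `updateIfMatch`. The property being checked is the same: both claimants observe the same `rev`, only the first commit matches it, and the second receives `conflict` instead of overwriting the winner.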