docs(fleet): add TODO to validate exactly-once claim under true contention

Capture the plan to close the known gap: existing CAS/fencing tests run on
MemoryDatastoreProvider (exact only for sequential calls), so single-winner
behavior is not yet demonstrated under a true interleaved read->write window
or against Cosmos _etag/If-Match. No production change expected; Cosmos
provider already conditions writes with IfMatch. Documents Option B
(adversarial interleaving test, no infra) and Option A (emulator-gated
integration test), acceptance criteria, and downstream doc updates.
saravanakumardb1 2026-06-03 09:55:34 -07:00
parent 4ac5a747d1
commit 1b6e644ea6

@@ -0,0 +1,68 @@
# TODO — Validate exactly-once claim under true multi-writer contention
> **Status:** open · **Area:** fleet coordinator (`coordinator.ts` / `repository.ts`) · **Priority:** medium
> **Why this exists:** the exactly-once job-assignment guarantee (CAS on `rev` + lease-epoch fencing) is the correctness backbone of the agent fleet. It is well covered by unit tests, but those run against `MemoryDatastoreProvider`, whose `updateIfMatch` compares `rev` **synchronously** — so the existing "concurrency" tests cannot exercise a _true_ interleaved read→write race. This TODO captures the work to close that gap.
## Background (what's already correct)
- `coordinator.tryClaimJob` flips `queued → assigned` via `repo.revUpdateJob(id, productId, expectedRev, …)` — a compare-and-swap that admits exactly one winner; losers get `conflict` and re-select.
- Lease acquire/reclaim increments `leaseEpoch`; `patchJobFenced` rejects any write carrying `epoch < job.leaseEpoch` (fences zombie writers). Reaper bumps the epoch on reclaim.
- The Cosmos provider already conditions the write correctly: `packages/datastore/src/providers/cosmos.ts` sends `accessCondition: { type: 'IfMatch', condition: _etag }`, guarding the read→replace window **server-side**.
- ⇒ **No production code change is expected.** This is a test/validation task only.
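
The two mechanisms above can be modelled in a few lines. This is a minimal sketch, not the real `coordinator.ts`/`repository.ts` code: `JobStore`, `tryClaim`, `reclaimLease`, the `Job` shape and the result strings are hypothetical stand-ins, and it assumes a queued job starts at `leaseEpoch` 0 (inferred from the `leaseEpoch=1` check in the plan).

```typescript
// Hypothetical, simplified model of CAS-on-rev plus lease-epoch fencing.
interface Job {
  id: string;
  state: "queued" | "assigned";
  rev: number;        // bumped on every successful write
  leaseEpoch: number; // bumped on every lease acquire or reclaim
  owner: string | null;
}
type JobPatch = Partial<Omit<Job, "id" | "rev">>;
type WriteResult = "ok" | "conflict" | "fenced";

class JobStore {
  private jobs = new Map<string, Job>();

  put(job: Job): void {
    this.jobs.set(job.id, { ...job });
  }

  get(id: string): Job {
    const job = this.jobs.get(id);
    if (!job) throw new Error(`unknown job ${id}`);
    return { ...job }; // callers hold a snapshot, not the live record
  }

  // Compare-and-swap: the write lands only if the stored rev still matches.
  revUpdate(id: string, expectedRev: number, patch: JobPatch): WriteResult {
    const current = this.jobs.get(id);
    if (!current || current.rev !== expectedRev) return "conflict";
    this.jobs.set(id, { ...current, ...patch, rev: current.rev + 1 });
    return "ok";
  }

  // Fenced write: refuse any writer whose epoch is older than the job's.
  patchFenced(id: string, epoch: number, patch: JobPatch): WriteResult {
    const current = this.get(id);
    if (epoch < current.leaseEpoch) return "fenced";
    return this.revUpdate(id, current.rev, patch);
  }
}

// Claim from a previously read snapshot: queued -> assigned, epoch + 1.
function tryClaim(store: JobStore, seen: Job, agent: string): WriteResult {
  return store.revUpdate(seen.id, seen.rev, {
    state: "assigned",
    owner: agent,
    leaseEpoch: seen.leaseEpoch + 1,
  });
}

// Reaper stand-in: reclaiming a lease bumps the epoch.
function reclaimLease(store: JobStore, id: string): WriteResult {
  const current = store.get(id);
  return store.revUpdate(id, current.rev, { leaseEpoch: current.leaseEpoch + 1 });
}

const store = new JobStore();
store.put({ id: "job-1", state: "queued", rev: 1, leaseEpoch: 0, owner: null });

const seenByA = store.get("job-1");
const seenByB = store.get("job-1");
const firstClaim = tryClaim(store, seenByA, "agent-a");  // "ok"
const secondClaim = tryClaim(store, seenByB, "agent-b"); // "conflict"

reclaimLease(store, "job-1"); // epoch 1 -> 2
const zombieWrite = store.patchFenced("job-1", 1, { state: "queued" }); // "fenced"
```

Both callers read `rev` 1, so the second CAS fails. After the reclaim bumps the epoch to 2, the write stamped with epoch 1 is refused before it ever reaches the CAS.
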
## The gap
`MemoryDatastoreProvider.updateIfMatch` (`packages/datastore/src/providers/memory.ts`) is exact only for **sequential** calls. Two `claimNextJob` invocations in a test cannot truly interleave at the read→write boundary, so we have not _demonstrated_ single-winner behavior under a real contended window or against Cosmos `_etag`/If-Match semantics.
## Plan
### Option B — adversarial-interleaving test (recommended; no infra)
Prove the claim path is single-winner even when the read→write window interleaves, by injecting a yield between the read and the `updateIfMatch` write.
**Files to add:**
- `services/platform-service/src/modules/fleet/coordinator.contention.test.ts` — **NEW**
  - Wrap `MemoryDatastoreProvider` so `updateIfMatch` `await`s a controllable barrier between reading current state and committing, forcing two concurrent `claimNextJob` calls to interleave at the CAS point.
  - Assert: exactly **one** claim succeeds; the other returns `conflict`; the winner's job has `state='assigned'` and `leaseEpoch=1`; no lost update.
  - Add a fencing variant: after a reaper reclaim bumps the epoch, a delayed write from the original holder is rejected (`409`/fenced) and the job is quarantined.
- `services/platform-service/src/modules/fleet/__testutils__/interleaving-provider.ts` — **NEW (optional)**
  - A thin `MemoryDatastoreProvider` decorator exposing a `releaseAfterRead()` barrier; or inline the wrapper in the test file.
**No changes to:** `coordinator.ts`, `repository.ts`, datastore providers, `package.json`, CI.
**Run with:** existing `vitest run --pool forks` (`pnpm --filter @lysnrai/platform-service test`).
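
One possible shape for the barrier wrapper is sketched below, under stated assumptions. `Provider`, `MemoryProvider`, `claimJob`, `whenParked` and the `JobDoc` shape are hypothetical stand-ins for the real datastore interface and `claimNextJob`. Only the name `releaseAfterRead()` comes from the plan above, and its semantics here (release every parked writer) are a guess. The sketch uses no test framework so that it stays self-contained; in the real file the checks would be vitest assertions.

```typescript
// Hypothetical shapes; the real provider and job document may differ.
interface JobDoc {
  id: string;
  rev: number;
  state: "queued" | "assigned";
  owner: string | null;
  leaseEpoch: number;
}
type JobPatch = Partial<Omit<JobDoc, "id" | "rev">>;
type CasResult = "ok" | "conflict";

interface Provider {
  read(id: string): Promise<JobDoc>;
  updateIfMatch(id: string, expectedRev: number, patch: JobPatch): Promise<CasResult>;
}

class MemoryProvider implements Provider {
  private docs = new Map<string, JobDoc>();

  seed(doc: JobDoc): void {
    this.docs.set(doc.id, { ...doc });
  }

  async read(id: string): Promise<JobDoc> {
    const doc = this.docs.get(id);
    if (!doc) throw new Error(`unknown job ${id}`);
    return { ...doc };
  }

  // Compare and write run in one synchronous step, so they cannot be split.
  async updateIfMatch(id: string, expectedRev: number, patch: JobPatch): Promise<CasResult> {
    const doc = this.docs.get(id);
    if (!doc || doc.rev !== expectedRev) return "conflict";
    this.docs.set(id, { ...doc, ...patch, rev: doc.rev + 1 });
    return "ok";
  }
}

// Decorator: parks every writer just before the CAS until the test releases it.
class InterleavingProvider implements Provider {
  private inner: Provider;
  private parked: Array<() => void> = [];
  private arrivalWaiters: Array<{ count: number; resolve: () => void }> = [];

  constructor(inner: Provider) {
    this.inner = inner;
  }

  read(id: string): Promise<JobDoc> {
    return this.inner.read(id);
  }

  async updateIfMatch(id: string, expectedRev: number, patch: JobPatch): Promise<CasResult> {
    await new Promise<void>((resume) => {
      this.parked.push(resume);
      this.notifyArrivals();
    });
    return this.inner.updateIfMatch(id, expectedRev, patch);
  }

  // Resolves once at least `count` writers are parked at the barrier.
  whenParked(count: number): Promise<void> {
    return new Promise<void>((resolve) => {
      this.arrivalWaiters.push({ count, resolve });
      this.notifyArrivals();
    });
  }

  // Lets every parked writer proceed to the real updateIfMatch.
  releaseAfterRead(): void {
    const toRelease = this.parked;
    this.parked = [];
    for (const resume of toRelease) resume();
  }

  private notifyArrivals(): void {
    this.arrivalWaiters = this.arrivalWaiters.filter((waiter) => {
      if (this.parked.length < waiter.count) return true;
      waiter.resolve();
      return false;
    });
  }
}

// Stand-in for the claim path: read a snapshot, then CAS on its rev.
async function claimJob(provider: Provider, id: string, agent: string): Promise<CasResult> {
  const seen = await provider.read(id);
  if (seen.state !== "queued") return "conflict";
  return provider.updateIfMatch(id, seen.rev, {
    state: "assigned",
    owner: agent,
    leaseEpoch: seen.leaseEpoch + 1,
  });
}

async function runContentionDemo(): Promise<{ outcomes: CasResult[]; finalDoc: JobDoc }> {
  const memory = new MemoryProvider();
  memory.seed({ id: "job-1", rev: 1, state: "queued", owner: null, leaseEpoch: 0 });
  const provider = new InterleavingProvider(memory);

  const claims = Promise.all([
    claimJob(provider, "job-1", "agent-a"),
    claimJob(provider, "job-1", "agent-b"),
  ]);
  await provider.whenParked(2); // both writers have now read rev 1
  provider.releaseAfterRead();

  return { outcomes: await claims, finalDoc: await memory.read("job-1") };
}
```

Because neither writer is released until both are parked, each one carries the same stale `rev` into the CAS, which is the window the existing sequential tests never open. The assertions should count outcomes rather than depend on which agent wins.
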
### Option A — Cosmos-emulator integration test (gold standard; needs infra)
Exercise the real `_etag`/If-Match path under genuine parallel writes.
**Files to add / edit:**
- `services/platform-service/src/modules/fleet/coordinator.contention.cosmos.test.ts` (**NEW**, env-gated): `describe.skipIf(!process.env.COSMOS_ENDPOINT)`. Fire N parallel `claimNextJob` calls against the Cosmos emulator; assert a single winner with `leaseEpoch=1`.
- `services/platform-service/package.json` (**EDIT**): add a `test:integration` script (kept out of the default `test` run).
- `docker-compose.yml` already provides a Cosmos DB emulator for the prototype — reuse it; document the `COSMOS_ENDPOINT`/key env for the integration run rather than adding a parallel stack.
**Requires:** the Cosmos DB emulator running (already part of the prototype stack).
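
The N-writer harness for this option could look roughly like the sketch below. Everything named here is hypothetical: `fireParallelClaims`, `makeClaim` and `FakeEtagContainer` are illustrative, and the fake container only imitates If-Match semantics (replace succeeds when the supplied ETag is current, otherwise status 412) so the harness can run without the emulator. In the real test the claim function would call `claimNextJob` against Cosmos inside the `describe.skipIf(...)` gate described above.

```typescript
// Hypothetical document shape and outcomes, for illustration only.
interface JobDoc {
  state: "queued" | "assigned";
  owner: string | null;
  leaseEpoch: number;
}
type ClaimOutcome = "claimed" | "conflict";
type Claim = (agent: string) => Promise<ClaimOutcome>;

// Fire n claims at once and count the outcomes.
async function fireParallelClaims(
  n: number,
  claim: Claim,
): Promise<{ claimed: number; conflict: number }> {
  const agents = Array.from({ length: n }, (_, i) => `agent-${i}`);
  const outcomes = await Promise.all(agents.map((agent) => claim(agent)));
  return {
    claimed: outcomes.filter((o) => o === "claimed").length,
    conflict: outcomes.filter((o) => o === "conflict").length,
  };
}

// Yield a random number of times so callers interleave in varying orders.
async function jitter(): Promise<void> {
  const ticks = Math.floor(Math.random() * 4);
  for (let i = 0; i < ticks; i++) await Promise.resolve();
}

// Stand-in for a single Cosmos item guarded by an ETag. Not the Cosmos SDK.
class FakeEtagContainer {
  private doc: JobDoc = { state: "queued", owner: null, leaseEpoch: 0 };
  private version = 0;

  private get etag(): string {
    return `etag-${this.version}`;
  }

  snapshot(): JobDoc {
    return { ...this.doc };
  }

  async read(): Promise<{ doc: JobDoc; etag: string }> {
    await jitter();
    return { doc: { ...this.doc }, etag: this.etag };
  }

  // Replace only if the caller's ETag is still current; 412 otherwise.
  async replaceIfMatch(next: JobDoc, ifMatch: string): Promise<200 | 412> {
    await jitter();
    if (ifMatch !== this.etag) return 412;
    this.doc = { ...next };
    this.version += 1;
    return 200;
  }
}

function makeClaim(container: FakeEtagContainer): Claim {
  return async (agent) => {
    const { doc, etag } = await container.read();
    if (doc.state !== "queued") return "conflict";
    const next: JobDoc = { state: "assigned", owner: agent, leaseEpoch: doc.leaseEpoch + 1 };
    const status = await container.replaceIfMatch(next, etag);
    return status === 200 ? "claimed" : "conflict";
  };
}

const container = new FakeEtagContainer();
const summaryPromise = fireParallelClaims(8, makeClaim(container));
```

Keeping the harness generic over a `Claim` function means the same single-winner assertion can be pointed at the fake, the interleaving memory provider, or the emulator without change.
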
## Acceptance criteria
- [ ] A test forces an interleaved read→write window across two concurrent claims and proves exactly one winner (Option B).
- [ ] A fencing test proves a stale-epoch write after reclaim is rejected/quarantined.
- [ ] (Optional) An env-gated Cosmos-emulator test proves the same against real `_etag`/If-Match.
- [ ] No production logic changed (or, if a real bug is found, fix `coordinator.ts`/`repository.ts` — never weaken the test).
- [ ] `pnpm --filter @lysnrai/platform-service test` stays green.
## Downstream (interview-pack repo `star_interview_bl`)
Once validated, flip the honest "open item" to "validated under contention" in:
- `gigafactory/star_01_exactly_once_claim_fencing.md`
- `VERIFIED_METRICS.md` and `INTERVIEW_STORY_INDEX.md` (G01 gap line)
- honest-gap mentions in `COMPANY_TARGETED_STORY_PACKS/OPENAI_AI_AGENTS_INFRA.md`, `CHEAT_SHEETS/*`, and `INTERVIEW_ANSWER_BANK/FAILURE_CONFLICT_AMBIGUITY_STORIES.md`.
## References
- `coordinator.ts`: `tryClaimJob`, `claimNextJob`, `patchJobFenced`, `reapExpiredLeases`
- `repository.ts`: `revUpdateJob`, `revUpdateLease`
- `packages/datastore/src/providers/{memory,cosmos}.ts`: `updateIfMatch`
- `coordinator.test.ts`: existing single-winner test (sequential, memory provider)