bytelyst-devops-tools/agent-queue/docs/jobs/phase2-atomic-claim-hardening.md
saravanakumardb1 2e9bd4dd1e docs(agent-queue): record P2 Foundation merged + track P0 atomic-claim hardening (§4)
- §4: implementation-status note — fleet module merged (PR #28); atomic claim NOT
  yet concurrency-safe (rev-CAS over unconditional write, sequential-only test)
- add phase2-atomic-claim-hardening.md: updateIfMatch in @bytelyst/datastore
  (Cosmos If-Match + process-atomic memory) + concurrent claim tests
2026-05-29 20:43:28 -07:00

5.9 KiB

engine cwd yolo lock timeout
devin /Users/sd9235/code/mygh/learning_ai_common_plat true common-plat 3h

ROLE: Senior distributed-systems engineer. P0 HARDENING: make the fleet coordinator's job claim TRULY atomic. The Phase 2 Foundation (merged) implements the claim as an in-module "rev compare-and-swap" layered over an UNCONDITIONAL datastore read-then-replace. Because there are await points between the read, the rev check, and the write, two CONCURRENT claims can both read the same rev, both pass the check, and both write — a DOUBLE-ASSIGNMENT. The existing race test only drives the claims SEQUENTIALLY, so it does not catch this. Fix the root cause.

CONTEXT TO READ FIRST:

  • services/platform-service/src/modules/fleet/repository.ts — revUpdateJob / revUpdateLease (the non-atomic read-check-write).
  • services/platform-service/src/modules/fleet/coordinator.ts — tryClaimJob.
  • services/platform-service/src/modules/fleet/coordinator.test.ts — the current (sequential) "atomic claim race" test.
  • packages/datastore — the shared datastore abstraction + its Memory and Cosmos providers. Find the update/replace method and how (if at all) it exposes optimistic concurrency.
  • ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md §4 (atomic claim is THE core contract) + §13 (maps to Cosmos _etag / If-Match).

PREREQUISITE / BRANCHING:

  • Branch off CURRENT main (the foundation is already merged). New branch: feat/gigafactory-p2-atomic-claim. Push + open a PR. DO NOT merge.

GOAL: a single-winner claim that holds under TRUE concurrency, backed by a server-side conditional write (Cosmos If-Match/_etag) and a process-atomic memory implementation — not a best-effort in-module check.

DELIVERABLES

  1. ADD an optimistic-concurrency update to the shared datastore (@bytelyst/datastore). This is a legitimate ADDITIVE shared-package feature (NOT a template-managed infra file) — it MUST be backward-compatible and fully tested so existing consumers are unaffected. Suggested API (match the package's existing style/naming): updateIfMatch(id, partitionKey, expected: { etag?: string; rev?: number }, patch) -> { ok: true, doc } | { ok: false, reason: 'conflict' | 'not_found' }

    • COSMOS provider: perform a conditional replace using the document _etag with accessCondition { type: 'IfMatch', condition: etag }; translate Cosmos 412 (precondition failed) → { ok:false, reason:'conflict' }. Surface _etag on reads so callers can pass it back.
    • MEMORY provider: implement the get → compare → set with NO await/yield between the compare and the set (do it in one synchronous block inside the method) so two concurrent callers CANNOT interleave within the single-threaded event loop. This gives true in-process atomicity. Keep a monotonic rev (or reuse the existing one) as the compare token for parity with Cosmos _etag.
    • Do NOT change existing method signatures; only ADD. Update the provider interface
      • both providers + the package's index exports.
  2. REWIRE the fleet repository to use it: revUpdateJob / revUpdateLease must perform the compare-and-write through the new conditional update (no read-check-write with an intervening await). The coordinator's tryClaimJob keeps the same external behavior (returns ok / conflict) but is now genuinely atomic.

  3. UPGRADE the tests to actually prove atomicity (these are the point of the slice):

    • In datastore: unit tests for updateIfMatch on BOTH providers — match → writes + bumps token; stale token → conflict, NO write; missing → not_found.
    • In fleet coordinator: replace/extend the race test to drive TRUE concurrency: (a) await Promise.all([tryClaimJob(jobA), tryClaimJob(jobB)]) on the same freshly-read job → exactly one ok, one conflict; job assigned once; exactly one run; one lease; leaseEpoch == 1. (b) an N-claimer stress test: fire N (>=10) concurrent claims for one job via Promise.all → exactly one ok, N-1 conflicts; no double-assignment. (c) the same for lease renew under contention (optional but preferred).
    • These concurrent tests MUST fail against the OLD read-check-write and pass after the fix (sanity-check that you are testing the right thing; mention it in the report).
  4. Keep ALL existing platform-service tests green (the 50 fleet + the rest). Do not weaken any test.

VERIFY GATE (must pass):

  • pnpm --filter @bytelyst/datastore test (new conditional-update tests)
  • pnpm --filter @bytelyst/datastore build
  • pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet
  • pnpm --filter @lysnrai/platform-service build
  • pnpm build && pnpm test (full workspace — confirm no consumer of @bytelyst/datastore regressed)

CONSTRAINTS: ESM .js imports; no any; no console.log; additive + backward-compatible datastore change with tests; conventional commits (feat(datastore): ... / fix(platform-service): ...); never edit template-managed infra (.npmrc, docker-prep, tsconfig.base, pnpm-workspace). Tests are sacred.

FINAL OUTPUT — print the report in EXACTLY this format:

Implementation Report — Phase 2 Atomic-Claim Hardening

Branch & commits / PR

Files changed

  • :

The fix

  • datastore conditional update: <API, Cosmos If-Match mapping, memory atomicity approach>
  • fleet rewire: <how revUpdate* now writes conditionally>

Tests added (the proof)

  • concurrent claim (Promise.all) + N-claimer stress:
  • did the new concurrent test FAIL on the old code? <yes/no + brief note>
  • datastore conditional-update unit tests:

Verify gate results

  • datastore test/build · fleet test · platform build · full pnpm build && test:

Deviations / assumptions

Suggested next slice

  • Phase 2 Slice 3: factory-agent integration (agent-queue.sh ↔ coordinator) now that the claim is genuinely atomic.