diff --git a/agent-queue/docs/GIGAFACTORY_ROADMAP.md b/agent-queue/docs/GIGAFACTORY_ROADMAP.md index d49dd0a..7a3b2f6 100644 --- a/agent-queue/docs/GIGAFACTORY_ROADMAP.md +++ b/agent-queue/docs/GIGAFACTORY_ROADMAP.md @@ -104,6 +104,9 @@ Today's `agent-queue.sh` + `dashboard.mjs` (single host, zero-dep bash + Node): - [ ] Lifecycle stages are canonical and shared: `queued → assigned → building → review → testing → shipped` (+ `blocked`, `failed`, `dead_letter`). - [ ] The bash runner and the service speak the **same manifest + event vocabulary** (one schema, two transports). +> **Implementation status (2026-05-29) — Phase 2 Foundation merged** (common-plat PR #28, `platform-service/src/modules/fleet/`): all 7 `fleet_*` containers (§13) ✓; repositories + coordinator (claim/lease/fence/heartbeat/reaper) ✓; idempotency + deps + submit-time cycle detection ✓; 50 module tests green. +> **⚠ P0 hardening outstanding — atomic claim is not yet truly concurrency-safe.** The shared `@bytelyst/datastore` exposes no `If-Match`/`_etag` conditional write, so the claim is an in-module rev compare-and-swap over an **unconditional** read-then-replace, with `await` points between read→check→write. Under *true* concurrency (Promise.all / multiple coordinator instances) two claimers can both win → double-assignment. The current race test only drives claims **sequentially**, so it passes without proving the property. **Fix tracked:** `agent-queue/docs/jobs/phase2-atomic-claim-hardening.md` — add a conditional `updateIfMatch` to `@bytelyst/datastore` (Cosmos `_etag`/If-Match + process-atomic memory impl), rewire `revUpdate*`, and add Promise.all + N-claimer concurrent tests. **Must land before P2-S3 (factory integration).** + --- ## 5. The evolved Job manifest (feature) diff --git a/agent-queue/docs/jobs/phase2-atomic-claim-hardening.md b/agent-queue/docs/jobs/phase2-atomic-claim-hardening.md new file mode 100644 index 0000000..66c6697 --- /dev/null +++ b/agent-queue/docs/jobs/phase2-atomic-claim-hardening.md @@ -0,0 +1,108 @@ +--- +engine: devin +cwd: /Users/sd9235/code/mygh/learning_ai_common_plat +yolo: true +lock: common-plat +timeout: 3h +--- + +ROLE: Senior distributed-systems engineer. P0 HARDENING: make the fleet +coordinator's job claim TRULY atomic. The Phase 2 Foundation (merged) implements +the claim as an in-module "rev compare-and-swap" layered over an UNCONDITIONAL +datastore read-then-replace. Because there are `await` points between the read, +the rev check, and the write, two CONCURRENT claims can both read the same rev, +both pass the check, and both write — a DOUBLE-ASSIGNMENT. The existing race test +only drives the claims SEQUENTIALLY, so it does not catch this. Fix the root cause. + +CONTEXT TO READ FIRST: +- services/platform-service/src/modules/fleet/repository.ts — revUpdateJob / + revUpdateLease (the non-atomic read-check-write). +- services/platform-service/src/modules/fleet/coordinator.ts — tryClaimJob. +- services/platform-service/src/modules/fleet/coordinator.test.ts — the current + (sequential) "atomic claim race" test. +- packages/datastore — the shared datastore abstraction + its Memory and Cosmos + providers. Find the update/replace method and how (if at all) it exposes + optimistic concurrency. +- ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md §4 (atomic + claim is THE core contract) + §13 (maps to Cosmos _etag / If-Match). + +PREREQUISITE / BRANCHING: +- Branch off CURRENT `main` (the foundation is already merged). New branch: + feat/gigafactory-p2-atomic-claim. Push + open a PR. DO NOT merge. + +GOAL: a single-winner claim that holds under TRUE concurrency, backed by a +server-side conditional write (Cosmos If-Match/_etag) and a process-atomic memory +implementation — not a best-effort in-module check. + +DELIVERABLES + +1. ADD an optimistic-concurrency update to the shared datastore (@bytelyst/datastore). + This is a legitimate ADDITIVE shared-package feature (NOT a template-managed infra + file) — it MUST be backward-compatible and fully tested so existing consumers are + unaffected. Suggested API (match the package's existing style/naming): + updateIfMatch(id, partitionKey, expected: { etag?: string; rev?: number }, patch) + -> { ok: true, doc } | { ok: false, reason: 'conflict' | 'not_found' } + - COSMOS provider: perform a conditional replace using the document `_etag` with + `accessCondition { type: 'IfMatch', condition: etag }`; translate Cosmos 412 + (precondition failed) → { ok:false, reason:'conflict' }. Surface `_etag` on reads + so callers can pass it back. + - MEMORY provider: implement the get → compare → set with NO `await`/yield between + the compare and the set (do it in one synchronous block inside the method) so two + concurrent callers CANNOT interleave within the single-threaded event loop. This + gives true in-process atomicity. Keep a monotonic rev (or reuse the existing one) + as the compare token for parity with Cosmos `_etag`. + - Do NOT change existing method signatures; only ADD. Update the provider interface + + both providers + the package's index exports. + +2. REWIRE the fleet repository to use it: revUpdateJob / revUpdateLease must perform + the compare-and-write through the new conditional update (no read-check-write with + an intervening await). The coordinator's tryClaimJob keeps the same external + behavior (returns ok / conflict) but is now genuinely atomic. + +3. UPGRADE the tests to actually prove atomicity (these are the point of the slice): + - In datastore: unit tests for updateIfMatch on BOTH providers — match → writes + + bumps token; stale token → conflict, NO write; missing → not_found. + - In fleet coordinator: replace/extend the race test to drive TRUE concurrency: + (a) `await Promise.all([tryClaimJob(jobA), tryClaimJob(jobB)])` on the same + freshly-read job → exactly one ok, one conflict; job assigned once; exactly + one run; one lease; leaseEpoch == 1. + (b) an N-claimer stress test: fire N (>=10) concurrent claims for one job via + Promise.all → exactly one ok, N-1 conflicts; no double-assignment. + (c) the same for lease renew under contention (optional but preferred). + - These concurrent tests MUST fail against the OLD read-check-write and pass after + the fix (sanity-check that you are testing the right thing; mention it in the report). + +4. Keep ALL existing platform-service tests green (the 50 fleet + the rest). Do not + weaken any test. + +VERIFY GATE (must pass): +- pnpm --filter @bytelyst/datastore test (new conditional-update tests) +- pnpm --filter @bytelyst/datastore build +- pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet +- pnpm --filter @lysnrai/platform-service build +- pnpm build && pnpm test (full workspace — confirm no consumer of @bytelyst/datastore regressed) + +CONSTRAINTS: ESM .js imports; no any; no console.log; additive + backward-compatible +datastore change with tests; conventional commits (feat(datastore): ... / +fix(platform-service): ...); never edit template-managed infra (.npmrc, docker-prep, +tsconfig.base, pnpm-workspace). Tests are sacred. + +FINAL OUTPUT — print the report in EXACTLY this format: + +## Implementation Report — Phase 2 Atomic-Claim Hardening +### Branch & commits / PR +### Files changed +- : +### The fix +- datastore conditional update: +- fleet rewire: +### Tests added (the proof) +- concurrent claim (Promise.all) + N-claimer stress: +- did the new concurrent test FAIL on the old code? +- datastore conditional-update unit tests: +### Verify gate results +- datastore test/build · fleet test · platform build · full pnpm build && test: +### Deviations / assumptions +### Suggested next slice +- Phase 2 Slice 3: factory-agent integration (agent-queue.sh ↔ coordinator) now that + the claim is genuinely atomic.