Rename agent-queue/docs/gigafactory/ to docs/GIGAFACTORY/ and update every reference (README, system-overview code-map, and all phase job specs). Add an index README that lists the docs and points to the companion docs in learning_ai_common_plat. Docs-only; no behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
109 lines
5.9 KiB
Markdown
109 lines
5.9 KiB
Markdown
---
|
|
engine: devin
|
|
cwd: /Users/sd9235/code/mygh/learning_ai_common_plat
|
|
yolo: true
|
|
lock: common-plat
|
|
timeout: 3h
|
|
---
|
|
|
|
ROLE: Senior distributed-systems engineer. P0 HARDENING: make the fleet
|
|
coordinator's job claim TRULY atomic. The Phase 2 Foundation (merged) implements
|
|
the claim as an in-module "rev compare-and-swap" layered over an UNCONDITIONAL
|
|
datastore read-then-replace. Because there are `await` points between the read,
|
|
the rev check, and the write, two CONCURRENT claims can both read the same rev,
|
|
both pass the check, and both write — a DOUBLE-ASSIGNMENT. The existing race test
|
|
only drives the claims SEQUENTIALLY, so it does not catch this. Fix the root cause.
|
|
|
|
CONTEXT TO READ FIRST:
|
|
- services/platform-service/src/modules/fleet/repository.ts — revUpdateJob /
|
|
revUpdateLease (the non-atomic read-check-write).
|
|
- services/platform-service/src/modules/fleet/coordinator.ts — tryClaimJob.
|
|
- services/platform-service/src/modules/fleet/coordinator.test.ts — the current
|
|
(sequential) "atomic claim race" test.
|
|
- packages/datastore — the shared datastore abstraction + its Memory and Cosmos
|
|
providers. Find the update/replace method and how (if at all) it exposes
|
|
optimistic concurrency.
|
|
- ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md §4 (atomic
|
|
claim is THE core contract) + §13 (maps to Cosmos _etag / If-Match).
|
|
|
|
PREREQUISITE / BRANCHING:
|
|
- Branch off CURRENT `main` (the foundation is already merged). New branch:
|
|
feat/gigafactory-p2-atomic-claim. Push + open a PR. DO NOT merge.
|
|
|
|
GOAL: a single-winner claim that holds under TRUE concurrency, backed by a
|
|
server-side conditional write (Cosmos If-Match/_etag) and a process-atomic memory
|
|
implementation — not a best-effort in-module check.
|
|
|
|
DELIVERABLES
|
|
|
|
1. ADD an optimistic-concurrency update to the shared datastore (@bytelyst/datastore).
|
|
This is a legitimate ADDITIVE shared-package feature (NOT a template-managed infra
|
|
file) — it MUST be backward-compatible and fully tested so existing consumers are
|
|
unaffected. Suggested API (match the package's existing style/naming):
|
|
updateIfMatch(id, partitionKey, expected: { etag?: string; rev?: number }, patch)
|
|
-> { ok: true, doc } | { ok: false, reason: 'conflict' | 'not_found' }
|
|
- COSMOS provider: perform a conditional replace using the document `_etag` with
|
|
`accessCondition { type: 'IfMatch', condition: etag }`; translate Cosmos 412
|
|
(precondition failed) → { ok:false, reason:'conflict' }. Surface `_etag` on reads
|
|
so callers can pass it back.
|
|
- MEMORY provider: implement the get → compare → set with NO `await`/yield between
|
|
the compare and the set (do it in one synchronous block inside the method) so two
|
|
concurrent callers CANNOT interleave within the single-threaded event loop. This
|
|
gives true in-process atomicity. Keep a monotonic rev (or reuse the existing one)
|
|
as the compare token for parity with Cosmos `_etag`.
|
|
- Do NOT change existing method signatures; only ADD. Update the provider interface
|
|
+ both providers + the package's index exports.
|
|
|
|
2. REWIRE the fleet repository to use it: revUpdateJob / revUpdateLease must perform
|
|
the compare-and-write through the new conditional update (no read-check-write with
|
|
an intervening await). The coordinator's tryClaimJob keeps the same external
|
|
behavior (returns ok / conflict) but is now genuinely atomic.
|
|
|
|
3. UPGRADE the tests to actually prove atomicity (these are the point of the slice):
|
|
- In datastore: unit tests for updateIfMatch on BOTH providers — match → writes +
|
|
bumps token; stale token → conflict, NO write; missing → not_found.
|
|
- In fleet coordinator: replace/extend the race test to drive TRUE concurrency:
|
|
(a) `await Promise.all([tryClaimJob(jobA), tryClaimJob(jobB)])` on the same
|
|
freshly-read job → exactly one ok, one conflict; job assigned once; exactly
|
|
one run; one lease; leaseEpoch == 1.
|
|
(b) an N-claimer stress test: fire N (>=10) concurrent claims for one job via
|
|
Promise.all → exactly one ok, N-1 conflicts; no double-assignment.
|
|
(c) the same for lease renew under contention (optional but preferred).
|
|
- These concurrent tests MUST fail against the OLD read-check-write and pass after
|
|
the fix (sanity-check that you are testing the right thing; mention it in the report).
|
|
|
|
4. Keep ALL existing platform-service tests green (the 50 fleet + the rest). Do not
|
|
weaken any test.
|
|
|
|
VERIFY GATE (must pass):
|
|
- pnpm --filter @bytelyst/datastore test (new conditional-update tests)
|
|
- pnpm --filter @bytelyst/datastore build
|
|
- pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet
|
|
- pnpm --filter @lysnrai/platform-service build
|
|
- pnpm build && pnpm test (full workspace — confirm no consumer of @bytelyst/datastore regressed)
|
|
|
|
CONSTRAINTS: ESM .js imports; no any; no console.log; additive + backward-compatible
|
|
datastore change with tests; conventional commits (feat(datastore): ... /
|
|
fix(platform-service): ...); never edit template-managed infra (.npmrc, docker-prep,
|
|
tsconfig.base, pnpm-workspace). Tests are sacred.
|
|
|
|
FINAL OUTPUT — print the report in EXACTLY this format:
|
|
|
|
## Implementation Report — Phase 2 Atomic-Claim Hardening
|
|
### Branch & commits / PR
|
|
### Files changed
|
|
- <path>: <summary>
|
|
### The fix
|
|
- datastore conditional update: <API, Cosmos If-Match mapping, memory atomicity approach>
|
|
- fleet rewire: <how revUpdate* now writes conditionally>
|
|
### Tests added (the proof)
|
|
- concurrent claim (Promise.all) + N-claimer stress: <results>
|
|
- did the new concurrent test FAIL on the old code? <yes/no + brief note>
|
|
- datastore conditional-update unit tests: <results>
|
|
### Verify gate results
|
|
- datastore test/build · fleet test · platform build · full pnpm build && test: <results>
|
|
### Deviations / assumptions
|
|
### Suggested next slice
|
|
- Phase 2 Slice 3: factory-agent integration (agent-queue.sh ↔ coordinator) now that
|
|
the claim is genuinely atomic.
|