From 0e94705ab7840b7d5e06f7f34221a82cd633b6b8 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Fri, 29 May 2026 19:54:33 -0700
Subject: [PATCH] docs(agent-queue): draft Phase 2 Foundation long-run prompt (fleet module + coordinator: claim/lease/fencing/reaper)
---
 agent-queue/docs/jobs/phase2-foundation.md | 179 +++++++++++++++++++++
 1 file changed, 179 insertions(+)
 create mode 100644 agent-queue/docs/jobs/phase2-foundation.md

diff --git a/agent-queue/docs/jobs/phase2-foundation.md b/agent-queue/docs/jobs/phase2-foundation.md
new file mode 100644
index 0000000..8c908ca
--- /dev/null
+++ b/agent-queue/docs/jobs/phase2-foundation.md
+---
+engine: devin
+cwd: /Users/sd9235/code/mygh/learning_ai_common_plat
+yolo: true
+lock: common-plat
+timeout: 5h
+---
+
+ROLE: Senior backend / distributed-systems engineer. Implement the PHASE 2
+FOUNDATION of the agent gigafactory: a new `fleet` module in platform-service
+covering (S1) the durable data model + repositories AND (S2) the CONCURRENCY CORE
+— atomic claim, leases, fencing, heartbeat, and a reaper. This is one long,
+self-contained backend slice. It supersedes the single-host stand-ins built in the
+agent-queue (devops-tools) repo.
+
+WHY THIS IS A SAFE LONG (UNATTENDED) RUN: everything is in ONE repo
+(learning_ai_common_plat), all logic is TypeScript, and ALL tests run on the
+in-memory datastore provider (DB_PROVIDER=memory) — NO live platform-service, NO
+Cosmos, NO network calls, NO tokens required. There are no external blockers.
+
+READ FIRST (this is NOT the platform-service you may assume — verify conventions):
+- services/platform-service/src/modules/items/{types,repository,routes}.ts — copy
+  this module pattern EXACTLY: types.ts -> repository.ts -> routes.ts, Zod schemas,
+  the cloud-agnostic datastore, productId on every doc, req.log/app.log, ESM with
+  .js import suffixes, no `any`, no console.log.
+- packages/datastore (or the existing datastore abstraction) — how repositories are
+  built, how optimistic concurrency (_etag / If-Match) is exposed, and how the
+  memory vs cosmos provider is selected (DB_PROVIDER).
+- packages/cosmos container registry — how containers are registered.
+- The fleet spec lives in the sibling devops-tools repo (read-only):
+  ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md
+  §4 (core contract: idempotency/atomic-claim/fencing/lease), §7 (scheduler/claim),
+  §8 (factory/lease/heartbeat), §13 (containers + fields), §18 (failure model),
+  §25 (durability/recovery), §26 (insights). Match these field names + semantics.
+
+PREREQUISITE / SETUP / BRANCHING:
+- Branch off CURRENT `main` of learning_ai_common_plat.
+- New branch: feat/gigafactory-p2-foundation. Commit in logical steps (data model,
+  repos, coordinator, routes, docs). Push + open a PR. DO NOT merge (human gate).
+- If node_modules is missing, run `pnpm install` once at the repo root. All tests
+  must pass with DB_PROVIDER=memory (set it in the test setup if not already).
+
+STRICT SCOPE:
+- Add ONE new module: services/platform-service/src/modules/fleet/ (+ tests there).
+- Register the new fleet_* Cosmos containers via the existing registration path.
+- Do NOT modify unrelated modules. Do NOT hand-edit template-managed infra
+  (.npmrc, docker-prep.sh*, tsconfig.base.json, pnpm-workspace.yaml) — they drift.
+- Every Cosmos doc MUST include productId. ESM everywhere. No `any` (Zod inference
+  or explicit types). No console.log (use req.log / app.log). Tests are sacred —
+  never weaken or delete a test to go green; fix the code.
+
+=================================================================
+PART S1 — DATA MODEL + REPOSITORIES
+=================================================================
+1. types.ts — Zod schemas + inferred types, each doc carrying productId:
+   - FleetJobDoc (pk /productId): manifestSnapshot, bodyMd (verbatim instructions),
+     stage (enum matching the agent-queue lifecycle:
+     queued|blocked|assigned|building|review|testing|shipped|failed|dead_letter),
+     idempotencyKey, trackerItemId?, parentId?, kind ('leaf'|'composite' default
+     'leaf'), checkpoint? {wipBranch,wipBase,wipCommit}, priority
+     (critical|high|medium|low), capabilities[], engineClass?, profile?, deps[],
+     depsMode?, budget? {usd?,tokens?,wall?}, retry? {max,backoff,on[]}, timestamps.
+   - FleetRunDoc (pk /jobId): jobId, attempt, factoryId?, engine, profileSnapshot?,
+     startedAt, endedAt?, exit?, verifyResult?, result?, insights {model?,tokensIn?,
+     tokensOut?,tokensCached?,costUsd?,estimated?,turns?,toolCalls?,filesChanged?,
+     linesAdded?,linesDeleted?}.
+   - FleetLeaseDoc (pk /jobId): jobId, holderFactoryId?, expiresAt?, leaseEpoch
+     (number, default 0), renewals (number), status (held|expired|released).
+   - FleetFactoryDoc (pk /productId): factoryId, descriptor, capabilities[],
+     health (ok|degraded|down), load, seatLimit, lastHeartbeatAt.
+   - FleetProfileDoc (pk /productId): name, version, immutable snapshot.
+   - FleetEventDoc (pk /jobId): append-only {type, at, actor?, data}.
+   - FleetArtifactDoc (pk /jobId): pointers to blob-stored artifacts (no inline logs).
+2. repository.ts — one repo per container on the datastore abstraction (memory +
+   cosmos): create, getById, list (by productId; jobs also by stage + by
+   idempotencyKey), update (returning/honoring _etag), delete where sensible,
+   appendEvent(jobId,event). Partition-aware; no cross-partition fan-out in hot paths.
+3. Register all fleet_* containers with correct partition keys.
+
+=================================================================
+PART S2 — CONCURRENCY CORE (claim / lease / fencing / heartbeat / reaper)
+=================================================================
+4. ATOMIC CLAIM (the heart): `claimNextJob(factory)` selects the highest-priority,
+   oldest eligible job whose stage is `queued` AND whose deps are satisfied AND
+   whose capabilities are a subset of the factory's, then atomically transitions it
+   to `assigned` and creates/acquires its lease — guarded by _etag / If-Match so
+   that under contention EXACTLY ONE factory wins; losers get a conflict and retry
+   the selection. No double-assignment, ever.
+5. LEASES + FENCING: acquiring a lease increments `leaseEpoch`. `renewLease`,
+   `releaseLease`. Every state-mutating call from a worker carries its leaseEpoch;
+   a call whose epoch is < the current epoch is REJECTED (fencing) — a stale/zombie
+   worker can never overwrite a reassigned job's state.
+6. HEARTBEAT: `heartbeat(factoryId)` updates lastHeartbeatAt + load/health.
+7. REAPER: `reapExpiredLeases(now)` scans leases with expiresAt < now, marks them
+   expired, bumps leaseEpoch, and returns the job to `queued` (or `blocked` if deps
+   now unmet) for re-claim — resume-from-checkpoint friendly (checkpoint pointer
+   preserved on the job). Reaper is idempotent. (Cosmos TTL does NOT do this — the
+   reaper must; document why.)
+8. IDEMPOTENCY: submit with an existing idempotencyKey + identical content => returns
+   the existing job (no dup); same key + different content while still queued =>
+   supersede; otherwise 409. (Mirror the agent-queue Slice 1 semantics.)
+9. DEPS: a job is `blocked` until each dep reaches shipped (or testing when
+   depsMode:soft); submit-time cycle detection rejects cyclic graphs.
+
+10. routes.ts — guarded REST under the existing auth + productId middleware:
+    POST /fleet/jobs (submit, idempotent), GET /fleet/jobs (list by stage),
+    GET /fleet/jobs/:id, PATCH /fleet/jobs/:id (fenced state transition),
+    POST /fleet/claim (atomic claim for a factory),
+    POST /fleet/jobs/:id/lease/renew, POST /fleet/jobs/:id/lease/release,
+    POST /fleet/factories/heartbeat, GET /fleet/jobs/:id/runs,
+    GET /fleet/jobs/:id/events. Validate every body with the Zod schemas. Register
+    the module in the app exactly as items is registered.
+
+=================================================================
+TESTS (Vitest — write alongside; memory provider; tests are sacred)
+=================================================================
+- schema validation: valid docs pass; missing productId / bad enum fail precisely
+  (>=1 invalid case per container).
+- repo CRUD round-trip per container; list filters by productId, by stage, by
+  idempotencyKey; appendEvent yields an ordered append-only stream.
+- ATOMIC CLAIM RACE: two claims contending for the SAME job version (same _etag) =>
+  exactly one succeeds, the other gets a conflict; assert no double-assignment.
+  (Deterministic: drive via the conditional/If-Match update, not real threads.)
+- priority+age selection: among eligible queued jobs, claim returns the
+  highest-priority then oldest.
+- deps gating: a job with unmet deps is `blocked` and NOT claimable; becomes
+  claimable once deps reach shipped; depsMode:soft satisfied at testing; cycle
+  rejected at submit.
+- FENCING: a state-mutating call with a stale leaseEpoch is rejected; the current
+  epoch succeeds.
+- REAPER: an expired lease => job back to queued, leaseEpoch bumped, checkpoint
+  preserved; running the reaper twice is idempotent.
+- HEARTBEAT updates lastHeartbeatAt/health; a stale factory is detectable.
+- IDEMPOTENT submit: same key+content => 1 job; key+changed content while queued =>
+  superseded; otherwise 409.
+- routes: submit+claim+renew+release+heartbeat+patch via fastify inject (shared
+  testing helpers); auth + productId enforced; invalid body rejected.
+
+VERIFY GATE (must all pass before finishing):
+- pnpm --filter @lysnrai/platform-service typecheck
+- pnpm --filter @lysnrai/platform-service test (all new tests green; none weakened)
+- pnpm --filter @lysnrai/platform-service build
+Run the full repo gate too if quick: `pnpm build && pnpm test && pnpm typecheck`.
+
+DOCS:
+- A module README (or docblock) describing each container, the claim/lease/fence
+  protocol, and the reaper. In your REPORT, list which roadmap §4/§7/§8/§13/§18
+  items are now satisfied (I will tick them in the devops-tools repo — you must NOT
+  edit that repo).
+
+CONSTRAINTS: follow items-module conventions precisely; ESM .js imports; no any; no
+console.log; productId on every doc; conventional commits
+(feat(platform-service): ...); do not touch template-managed infra.
+
+FINAL OUTPUT — print the report in EXACTLY this format:
+
+## Implementation Report — Phase 2 Foundation (fleet module + coordinator)
+### Branch & commits
+- branch / based-on / PR
+- commits: (one per line)
+### Files changed
+- :
+### What was implemented
+- S1 data model:
+- S2 concurrency:
+- idempotency + deps + cycle detection:
+### Tests added
+- : (esp. the atomic-claim race, fencing, reaper tests)
+- pnpm test summary:
+### Verify gate results
+- typecheck / test / build (+ full-repo gate if run):
+### Roadmap items now satisfied
+- §4: <...> §7: <...> §8: <...> §13: <...> §18: <...>
+### Deviations / assumptions
+-
+### Suggested next slice
+- Phase 2 Slice 3: factory-agent integration — agent-queue.sh registers/heartbeats/
+  claims/reports against this coordinator behind a flag, preserving offline mode;
+  plus the tracker echo wired through fleet_events.
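
ILLUSTRATIVE SKETCH (non-normative; NOT part of the report format above): the claim / fencing / reaper rules in PART S2 (items 4, 5, 7) can be pictured with a toy in-memory store. Every name below (`MemoryJobs`, `pickCandidate`, `claim`, `fencedTransition`, `reapExpired`, the numeric `etag`, and the lease fields folded onto the job) is a hypothetical stand-in chosen for brevity. It is not the real datastore abstraction, it skips deps and capability matching, and it does not follow the items-module layout. Treat it as a reading aid under those assumptions and follow the repo's actual conventions when implementing.

```typescript
// Sketch only: the lease lives on the job and a numeric counter stands in for _etag.
type Stage = "queued" | "assigned" | "building";
type Priority = "critical" | "high" | "medium" | "low";

interface Job {
  id: string;
  stage: Stage;
  priority: Priority;
  createdAt: number;
  leaseEpoch: number;
  holderFactoryId: string | null;
  expiresAt: number | null;
  etag: number; // hypothetical version counter, not the provider's real _etag
}

type Result =
  | { status: "ok"; job: Job }
  | { status: "conflict" | "fenced" | "not_found" };

const RANK: Record<Priority, number> = { critical: 0, high: 1, medium: 2, low: 3 };

class MemoryJobs {
  private docs: Record<string, Job> = {};

  put(job: Omit<Job, "etag">): Job {
    const doc: Job = { ...job, etag: 1 };
    this.docs[doc.id] = doc;
    return { ...doc };
  }

  get(id: string): Job | undefined {
    const doc = this.docs[id];
    return doc ? { ...doc } : undefined;
  }

  list(): Job[] {
    const out: Job[] = [];
    for (const id of Object.keys(this.docs)) {
      const doc = this.docs[id];
      if (doc) out.push({ ...doc });
    }
    return out;
  }

  // If-Match style write: applied only when the caller read the current version.
  replace(next: Job, ifMatch: number): Result {
    const cur = this.docs[next.id];
    if (!cur) return { status: "not_found" };
    if (cur.etag !== ifMatch) return { status: "conflict" };
    const saved: Job = { ...next, etag: cur.etag + 1 };
    this.docs[saved.id] = saved;
    return { status: "ok", job: { ...saved } };
  }
}

// Highest priority first, then oldest; only queued jobs are eligible in this sketch.
function pickCandidate(store: MemoryJobs): Job | undefined {
  return store
    .list()
    .filter((j) => j.stage === "queued")
    .sort((a, b) => RANK[a.priority] - RANK[b.priority] || a.createdAt - b.createdAt)[0];
}

// Claim: one conditional write flips the stage and bumps the epoch together.
function claim(store: MemoryJobs, seen: Job, factoryId: string, now: number, ttlMs: number): Result {
  const next: Job = {
    ...seen,
    stage: "assigned",
    holderFactoryId: factoryId,
    leaseEpoch: seen.leaseEpoch + 1,
    expiresAt: now + ttlMs,
  };
  return store.replace(next, seen.etag);
}

// Fencing: a worker presenting an epoch older than the stored one is refused.
function fencedTransition(store: MemoryJobs, id: string, epoch: number, stage: Stage): Result {
  const cur = store.get(id);
  if (!cur) return { status: "not_found" };
  if (epoch < cur.leaseEpoch) return { status: "fenced" };
  return store.replace({ ...cur, stage }, cur.etag);
}

// Reaper: requeue jobs whose lease lapsed and bump the epoch to fence the old holder.
function reapExpired(store: MemoryJobs, now: number): number {
  let reaped = 0;
  for (const j of store.list()) {
    if (j.expiresAt === null || j.expiresAt >= now) continue;
    const next: Job = { ...j, stage: "queued", holderFactoryId: null, expiresAt: null, leaseEpoch: j.leaseEpoch + 1 };
    if (store.replace(next, j.etag).status === "ok") reaped += 1;
  }
  return reaped;
}
```

In this sketch the conditional `replace` is the only serialization point: two factories that read the same job version both attempt the write, and the version check lets exactly one through. A second `reapExpired` pass finds no remaining `expiresAt` and changes nothing, which is the idempotence the REAPER test asks for.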