bytelyst-devops-tools/agent-queue/docs/jobs/phase2-foundation.md


engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_common_plat
yolo: true
lock: common-plat
timeout: 5h

ROLE: Senior backend / distributed-systems engineer. Implement the PHASE 2 FOUNDATION of the agent gigafactory: a new fleet module in platform-service covering (S1) the durable data model + repositories AND (S2) the CONCURRENCY CORE — atomic claim, leases, fencing, heartbeat, and a reaper. This is one long, self-contained backend slice. It supersedes the single-host stand-ins built in the agent-queue (devops-tools) repo.

WHY THIS IS A SAFE LONG (UNATTENDED) RUN: everything is in ONE repo (learning_ai_common_plat), all logic is TypeScript, and ALL tests run on the in-memory datastore provider (DB_PROVIDER=memory) — NO live platform-service, NO Cosmos, NO network calls, NO tokens required. There are no external blockers.

READ FIRST (this is NOT the platform-service you may assume — verify conventions):

  • services/platform-service/src/modules/items/{types,repository,routes}.ts — copy this module pattern EXACTLY: types.ts -> repository.ts -> routes.ts, Zod schemas, the cloud-agnostic datastore, productId on every doc, req.log/app.log, ESM with .js import suffixes, no any, no console.log.
  • packages/datastore (or the existing datastore abstraction) — how repositories are built, how optimistic concurrency (_etag / If-Match) is exposed, and how the memory vs cosmos provider is selected (DB_PROVIDER).
  • packages/cosmos container registry — how containers are registered.
  • The fleet spec lives in the sibling devops-tools repo (read-only): ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md §4 (core contract: idempotency/atomic-claim/fencing/lease), §7 (scheduler/claim), §8 (factory/lease/heartbeat), §13 (containers + fields), §18 (failure model), §25 (durability/recovery), §26 (insights). Match these field names + semantics.

PREREQUISITE / SETUP / BRANCHING:

  • Branch off CURRENT main of learning_ai_common_plat.
  • New branch: feat/gigafactory-p2-foundation. Commit in logical steps (data model, repos, coordinator, routes, docs). Push + open a PR. DO NOT merge (human gate).
  • If node_modules is missing, run pnpm install once at the repo root. All tests must pass with DB_PROVIDER=memory (set it in the test setup if not already).

STRICT SCOPE:

  • Add ONE new module: services/platform-service/src/modules/fleet/ (+ tests there).
  • Register the new fleet_* Cosmos containers via the existing registration path.
  • Do NOT modify unrelated modules. Do NOT hand-edit template-managed infra (.npmrc, docker-prep.sh*, tsconfig.base.json, pnpm-workspace.yaml) — they drift.
  • Every Cosmos doc MUST include productId. ESM everywhere. No any (Zod inference or explicit types). No console.log (use req.log / app.log). Tests are sacred — never weaken or delete a test to go green; fix the code.

================================================================= PART S1 — DATA MODEL + REPOSITORIES

  1. types.ts — Zod schemas + inferred types, each doc carrying productId:
    • FleetJobDoc (pk /productId): manifestSnapshot, bodyMd (verbatim instructions), stage (enum matching the agent-queue lifecycle: queued|blocked|assigned|building|review|testing|shipped|failed|dead_letter), idempotencyKey, trackerItemId?, parentId?, kind ('leaf'|'composite' default 'leaf'), checkpoint? {wipBranch,wipBase,wipCommit}, priority (critical|high|medium|low), capabilities[], engineClass?, profile?, deps[], depsMode?, budget? {usd?,tokens?,wall?}, retry? {max,backoff,on[]}, timestamps.
    • FleetRunDoc (pk /jobId): jobId, attempt, factoryId?, engine, profileSnapshot?, startedAt, endedAt?, exit?, verifyResult?, result?, insights {model?,tokensIn?,tokensOut?,tokensCached?,costUsd?,estimated?,turns?,toolCalls?,filesChanged?,linesAdded?,linesDeleted?}.
    • FleetLeaseDoc (pk /jobId): jobId, holderFactoryId?, expiresAt?, leaseEpoch (number, default 0), renewals (number), status (held|expired|released).
    • FleetFactoryDoc (pk /productId): factoryId, descriptor, capabilities[], health (ok|degraded|down), load, seatLimit, lastHeartbeatAt.
    • FleetProfileDoc (pk /productId): name, version, immutable snapshot.
    • FleetEventDoc (pk /jobId): append-only {type, at, actor?, data}.
    • FleetArtifactDoc (pk /jobId): pointers to blob-stored artifacts (no inline logs).
  2. repository.ts — one repo per container on the datastore abstraction (memory + cosmos): create, getById, list (by productId; jobs also by stage + by idempotencyKey), update (returning/honoring _etag), delete where sensible, appendEvent(jobId,event). Partition-aware; no cross-partition fan-out in hot paths.
  3. Register all fleet_* containers with correct partition keys.
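
The update (returning/honoring _etag) requirement above can be pictured with a minimal in-memory sketch. This is not the real datastore abstraction: MemoryContainer, ConflictError, and the `v<n>` etag format are hypothetical names chosen for illustration, and the actual provider may expose If-Match differently. Verify against packages/datastore before copying anything.

```typescript
// Hypothetical in-memory container illustrating _etag / If-Match semantics.
// None of these names come from the real datastore package.
interface StoredDoc {
  id: string;
  productId: string;
  _etag?: string;
}

class ConflictError extends Error {
  constructor(id: string) {
    super(`etag mismatch or duplicate id for ${id}`);
    this.name = "ConflictError";
  }
}

class MemoryContainer<T extends StoredDoc> {
  private readonly docs = new Map<string, T>();
  private version = 0;

  // Every successful write gets a fresh, never-reused etag.
  private nextEtag(): string {
    this.version += 1;
    return `v${this.version}`;
  }

  create(doc: T): T {
    if (this.docs.has(doc.id)) throw new ConflictError(doc.id);
    const stored = { ...doc, _etag: this.nextEtag() };
    this.docs.set(doc.id, stored);
    return { ...stored };
  }

  // Returns a copy so callers cannot mutate stored state behind the etag.
  getById(id: string): T | undefined {
    const doc = this.docs.get(id);
    return doc ? { ...doc } : undefined;
  }

  // Conditional replace: succeeds only when ifMatch equals the stored _etag.
  update(doc: T, ifMatch: string): T {
    const current = this.docs.get(doc.id);
    if (!current || current._etag !== ifMatch) throw new ConflictError(doc.id);
    const stored = { ...doc, _etag: this.nextEtag() };
    this.docs.set(doc.id, stored);
    return { ...stored };
  }
}
```

Two writers that read the same version hold the same _etag; the first update rotates it, so the second update fails with a conflict instead of silently overwriting. The claim logic in Part S2 relies on exactly this property.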

================================================================= PART S2 — CONCURRENCY CORE (claim / lease / fencing / heartbeat / reaper)

  1. ATOMIC CLAIM (the heart): claimNextJob(factory) selects the highest-priority, oldest eligible job whose stage is queued AND whose deps are satisfied AND whose capabilities are a subset of the factory's, then atomically transitions it to assigned and creates/acquires its lease — guarded by _etag / If-Match so that under contention EXACTLY ONE factory wins; losers get a conflict and retry the selection. No double-assignment, ever.

  2. LEASES + FENCING: acquiring a lease increments leaseEpoch. renewLease, releaseLease. Every state-mutating call from a worker carries its leaseEpoch; a call whose epoch is < the current epoch is REJECTED (fencing) — a stale/zombie worker can never overwrite a reassigned job's state.

  3. HEARTBEAT: heartbeat(factoryId) updates lastHeartbeatAt + load/health.

  4. REAPER: reapExpiredLeases(now) scans leases with expiresAt < now, marks them expired, bumps leaseEpoch, and returns the job to queued (or blocked if deps now unmet) for re-claim — resume-from-checkpoint friendly (checkpoint pointer preserved on the job). Reaper is idempotent. (Cosmos TTL does NOT do this — the reaper must; document why.)

  5. IDEMPOTENCY: submit with an existing idempotencyKey + identical content => returns the existing job (no dup); same key + different content while the job is still queued => supersede; same key + different content once the job has left queued => 409. (Mirror the agent-queue Slice 1 semantics.)

  6. DEPS: a job is blocked until each dep reaches shipped (or testing when depsMode:soft); submit-time cycle detection rejects cyclic graphs.

  7. routes.ts — guarded REST under the existing auth + productId middleware: POST /fleet/jobs (submit, idempotent), GET /fleet/jobs (list by stage), GET /fleet/jobs/:id, PATCH /fleet/jobs/:id (fenced state transition), POST /fleet/claim (atomic claim for a factory), POST /fleet/jobs/:id/lease/renew, POST /fleet/jobs/:id/lease/release, POST /fleet/factories/heartbeat, GET /fleet/jobs/:id/runs, GET /fleet/jobs/:id/events. Validate every body with the Zod schemas. Register the module in the app exactly as items is registered.
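
The claim, fencing, and reaper rules in items 1, 2, and 4 above can be sketched together as follows. This is a single-process illustration under stated assumptions, not the module's implementation: Coordinator, selectCandidate, tryAssign, and assertFence are hypothetical names, the numeric etag stands in for _etag / If-Match, and deps handling, checkpoints, and productId are omitted for brevity.

```typescript
// Hypothetical names throughout; an in-memory stand-in for the real datastore.
type Priority = "critical" | "high" | "medium" | "low";
type Stage = "queued" | "blocked" | "assigned";
interface Job {
  id: string; stage: Stage; priority: Priority;
  createdAt: number; capabilities: string[]; etag: number;
}
interface Lease {
  jobId: string; holderFactoryId: string; expiresAt: number;
  leaseEpoch: number; status: "held" | "expired" | "released";
}

const RANK: Record<Priority, number> = { critical: 0, high: 1, medium: 2, low: 3 };

class Coordinator {
  readonly jobs = new Map<string, Job>();
  readonly leases = new Map<string, Lease>();

  submit(id: string, priority: Priority, createdAt: number, capabilities: string[] = []): void {
    this.jobs.set(id, { id, stage: "queued", priority, createdAt, capabilities, etag: 1 });
  }

  // Highest priority first, then oldest; job capabilities must be a subset of the factory's.
  selectCandidate(factoryCaps: string[]): Job | undefined {
    return Array.from(this.jobs.values())
      .filter((j) => j.stage === "queued" && j.capabilities.every((c) => factoryCaps.includes(c)))
      .sort((a, b) => RANK[a.priority] - RANK[b.priority] || a.createdAt - b.createdAt)[0];
  }

  // Conditional queued -> assigned transition; undefined signals an etag conflict.
  tryAssign(snapshot: Job, factoryId: string, now: number, ttlMs: number): Lease | undefined {
    const current = this.jobs.get(snapshot.id);
    if (!current || current.etag !== snapshot.etag || current.stage !== "queued") return undefined;
    this.jobs.set(current.id, { ...current, stage: "assigned", etag: current.etag + 1 });
    const previousEpoch = this.leases.get(current.id)?.leaseEpoch ?? 0;
    const lease: Lease = {
      jobId: current.id, holderFactoryId: factoryId, expiresAt: now + ttlMs,
      leaseEpoch: previousEpoch + 1, status: "held", // acquiring increments the epoch
    };
    this.leases.set(current.id, lease);
    return lease;
  }

  // A loser of tryAssign re-runs selection until it wins or nothing is eligible.
  claimNextJob(factoryId: string, factoryCaps: string[], now: number, ttlMs: number): Lease | undefined {
    for (;;) {
      const candidate = this.selectCandidate(factoryCaps);
      if (!candidate) return undefined;
      const lease = this.tryAssign(candidate, factoryId, now, ttlMs);
      if (lease) return lease;
    }
  }

  // Fencing: reject any caller whose epoch is older than the current one.
  assertFence(jobId: string, leaseEpoch: number): void {
    const lease = this.leases.get(jobId);
    if (!lease || leaseEpoch < lease.leaseEpoch) throw new Error("stale leaseEpoch");
  }

  // Expire overdue held leases, bump the epoch, requeue the job. Already-expired
  // leases are skipped, which is what makes a second pass a no-op.
  reapExpiredLeases(now: number): string[] {
    const reaped: string[] = [];
    for (const lease of Array.from(this.leases.values())) {
      if (lease.status !== "held" || lease.expiresAt >= now) continue;
      this.leases.set(lease.jobId, { ...lease, status: "expired", leaseEpoch: lease.leaseEpoch + 1 });
      const job = this.jobs.get(lease.jobId);
      if (job) this.jobs.set(job.id, { ...job, stage: "queued", etag: job.etag + 1 });
      reaped.push(lease.jobId);
    }
    return reaped;
  }
}
```

Calling tryAssign twice with the same snapshot reproduces the contention case without threads: the first call rotates the etag, so the second returns undefined. In the real module the job update and the lease write are separate documents, so the ordering and failure handling between them still has to be designed against the actual datastore guarantees.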

================================================================= TESTS (Vitest — write alongside; memory provider; tests are sacred)

  • schema validation: valid docs pass; missing productId / bad enum fail precisely (>=1 invalid case per container).
  • repo CRUD round-trip per container; list filters by productId, by stage, by idempotencyKey; appendEvent yields an ordered append-only stream.
  • ATOMIC CLAIM RACE: two claims contending for the SAME job version (same _etag) => exactly one succeeds, the other gets a conflict; assert no double-assignment. (Deterministic: drive via the conditional/If-Match update, not real threads.)
  • priority+age selection: among eligible queued jobs, claim returns the highest-priority then oldest.
  • deps gating: a job with unmet deps is blocked and NOT claimable; becomes claimable once deps reach shipped; depsMode:soft satisfied at testing; cycle rejected at submit.
  • FENCING: a state-mutating call with a stale leaseEpoch is rejected; the current epoch succeeds.
  • REAPER: an expired lease => job back to queued, leaseEpoch bumped, checkpoint preserved; running the reaper twice is idempotent.
  • HEARTBEAT updates lastHeartbeatAt/health; a stale factory is detectable.
  • IDEMPOTENT submit: same key+content => 1 job; key+changed content while queued => superseded; otherwise 409.
  • routes: submit+claim+renew+release+heartbeat+patch via fastify inject (shared testing helpers); auth + productId enforced; invalid body rejected.
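
For the deps-gating and cycle-rejection cases listed above, a small pure-function sketch may help when structuring the tests. The helper names (wouldCreateCycle, depsSatisfied), the DepGraph shape, and the "hard" mode label are assumptions made for illustration; the source only names depsMode:soft, so check the roadmap for the real field values.

```typescript
// Hypothetical helpers; the graph maps a job id to the ids it depends on.
type DepGraph = Map<string, string[]>;
// "hard" is an assumed label for the default mode; only "soft" appears in the spec.
type DepsMode = "hard" | "soft";

// Returns true when submitting `jobId` with `deps` would close a dependency cycle,
// i.e. when `jobId` is reachable by following dependency edges from its own deps.
function wouldCreateCycle(graph: DepGraph, jobId: string, deps: string[]): boolean {
  const seen = new Set<string>();
  const stack = [...deps];
  while (stack.length > 0) {
    const current = stack.pop() as string;
    if (current === jobId) return true; // reached ourselves: cycle (covers self-dependency)
    if (seen.has(current)) continue;
    seen.add(current);
    stack.push(...(graph.get(current) ?? []));
  }
  return false;
}

// A dep is satisfied at shipped, or already at testing when the mode is soft.
function depsSatisfied(depStages: string[], mode: DepsMode = "hard"): boolean {
  const accepted = mode === "soft" ? ["testing", "shipped"] : ["shipped"];
  return depStages.every((stage) => accepted.includes(stage));
}
```

Because both functions are pure, the tests can cover gating and cycle rejection without touching the datastore; the repository-backed version would load the stages of each dep and then delegate to the same checks.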

VERIFY GATE (must all pass before finishing):

  • pnpm --filter @lysnrai/platform-service typecheck
  • pnpm --filter @lysnrai/platform-service test (all new tests green; none weakened)
  • pnpm --filter @lysnrai/platform-service build

Run the full repo gate too if quick: pnpm build && pnpm test && pnpm typecheck.

DOCS:

  • A module README (or docblock) describing each container, the claim/lease/fence protocol, and the reaper. In your REPORT, list which roadmap §4/§7/§8/§13/§18 items are now satisfied (I will tick them in the devops-tools repo — you must NOT edit that repo).

CONSTRAINTS: follow items-module conventions precisely; ESM .js imports; no any; no console.log; productId on every doc; conventional commits (feat(platform-service): ...); do not touch template-managed infra.

FINAL OUTPUT — print the report in EXACTLY this format:

Implementation Report — Phase 2 Foundation (fleet module + coordinator)

Branch & commits

  • branch / based-on / PR
  • commits: (one per line)

Files changed

  • :

What was implemented

  • S1 data model: <containers, partition keys, etag handling>
  • S2 concurrency: <claim algorithm, lease/fencing via leaseEpoch, reaper, heartbeat>
  • idempotency + deps + cycle detection:

Tests added

  • : (esp. the atomic-claim race, fencing, reaper tests)
  • pnpm test summary:

Verify gate results

  • typecheck / test / build (+ full-repo gate if run):

Roadmap items now satisfied

  • §4: <...> §7: <...> §8: <...> §13: <...> §18: <...>

Deviations / assumptions

  • <datastore concurrency model, how the race test is made deterministic, anything stubbed>

Suggested next slice

  • Phase 2 Slice 3 (factory-agent integration): agent-queue.sh registers/heartbeats/claims/reports against this coordinator behind a flag, preserving offline mode; plus the tracker echo wired through fleet_events.