docs(agent-queue): draft Phase 2 Foundation long-run prompt (fleet module + coordinator: claim/lease/fencing/reaper)
parent d43cab8afe
commit 0e94705ab7
agent-queue/docs/jobs/phase2-foundation.md (normal file, 179 lines)
@@ -0,0 +1,179 @@
---
engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_common_plat
yolo: true
lock: common-plat
timeout: 5h
---

ROLE: Senior backend / distributed-systems engineer. Implement the PHASE 2
FOUNDATION of the agent gigafactory: a new `fleet` module in platform-service
covering (S1) the durable data model + repositories AND (S2) the CONCURRENCY CORE
— atomic claim, leases, fencing, heartbeat, and a reaper. This is one long,
self-contained backend slice. It supersedes the single-host stand-ins built in the
agent-queue (devops-tools) repo.

WHY THIS IS A SAFE LONG (UNATTENDED) RUN: everything is in ONE repo
(learning_ai_common_plat), all logic is TypeScript, and ALL tests run on the
in-memory datastore provider (DB_PROVIDER=memory) — NO live platform-service, NO
Cosmos, NO network calls, NO tokens required. There are no external blockers.

READ FIRST (this is NOT the platform-service you may assume — verify conventions):
- services/platform-service/src/modules/items/{types,repository,routes}.ts — copy
  this module pattern EXACTLY: types.ts -> repository.ts -> routes.ts, Zod schemas,
  the cloud-agnostic datastore, productId on every doc, req.log/app.log, ESM with
  .js import suffixes, no `any`, no console.log.
- packages/datastore (or the existing datastore abstraction) — how repositories are
  built, how optimistic concurrency (_etag / If-Match) is exposed, and how the
  memory vs cosmos provider is selected (DB_PROVIDER).
- packages/cosmos container registry — how containers are registered.
- The fleet spec lives in the sibling devops-tools repo (read-only):
  ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md
  §4 (core contract: idempotency/atomic-claim/fencing/lease), §7 (scheduler/claim),
  §8 (factory/lease/heartbeat), §13 (containers + fields), §18 (failure model),
  §25 (durability/recovery), §26 (insights). Match these field names + semantics.

PREREQUISITE / SETUP / BRANCHING:
- Branch off CURRENT `main` of learning_ai_common_plat.
- New branch: feat/gigafactory-p2-foundation. Commit in logical steps (data model,
  repos, coordinator, routes, docs). Push + open a PR. DO NOT merge (human gate).
- If node_modules is missing, run `pnpm install` once at the repo root. All tests
  must pass with DB_PROVIDER=memory (set it in the test setup if not already).

STRICT SCOPE:
- Add ONE new module: services/platform-service/src/modules/fleet/ (+ tests there).
- Register the new fleet_* Cosmos containers via the existing registration path.
- Do NOT modify unrelated modules. Do NOT hand-edit template-managed infra
  (.npmrc, docker-prep.sh*, tsconfig.base.json, pnpm-workspace.yaml) — they drift.
- Every Cosmos doc MUST include productId. ESM everywhere. No `any` (Zod inference
  or explicit types). No console.log (use req.log / app.log). Tests are sacred —
  never weaken or delete a test to go green; fix the code.

=================================================================
PART S1 — DATA MODEL + REPOSITORIES
=================================================================
1. types.ts — Zod schemas + inferred types, each doc carrying productId:
   - FleetJobDoc (pk /productId): manifestSnapshot, bodyMd (verbatim instructions),
     stage (enum matching the agent-queue lifecycle:
     queued|blocked|assigned|building|review|testing|shipped|failed|dead_letter),
     idempotencyKey, trackerItemId?, parentId?, kind ('leaf'|'composite' default
     'leaf'), checkpoint? {wipBranch,wipBase,wipCommit}, priority
     (critical|high|medium|low), capabilities[], engineClass?, profile?, deps[],
     depsMode?, budget? {usd?,tokens?,wall?}, retry? {max,backoff,on[]}, timestamps.
   - FleetRunDoc (pk /jobId): jobId, attempt, factoryId?, engine, profileSnapshot?,
     startedAt, endedAt?, exit?, verifyResult?, result?, insights {model?,tokensIn?,
     tokensOut?,tokensCached?,costUsd?,estimated?,turns?,toolCalls?,filesChanged?,
     linesAdded?,linesDeleted?}.
   - FleetLeaseDoc (pk /jobId): jobId, holderFactoryId?, expiresAt?, leaseEpoch
     (number, default 0), renewals (number), status (held|expired|released).
   - FleetFactoryDoc (pk /productId): factoryId, descriptor, capabilities[],
     health (ok|degraded|down), load, seatLimit, lastHeartbeatAt.
   - FleetProfileDoc (pk /productId): name, version, immutable snapshot.
   - FleetEventDoc (pk /jobId): append-only {type, at, actor?, data}.
   - FleetArtifactDoc (pk /jobId): pointers to blob-stored artifacts (no inline logs).
2. repository.ts — one repo per container on the datastore abstraction (memory +
   cosmos): create, getById, list (by productId; jobs also by stage + by
   idempotencyKey), update (returning/honoring _etag), delete where sensible,
   appendEvent(jobId,event). Partition-aware; no cross-partition fan-out in hot paths.
3. Register all fleet_* containers with correct partition keys.
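
ILLUSTRATION (non-normative): the `update (returning/honoring _etag)` contract in
step 2 can be pictured with a tiny in-memory stand-in. This is a sketch under
assumptions, not the packages/datastore API: `MemoryRepo`, `EtagConflictError`,
and the counter-based `_etag` values are hypothetical names invented for this
example. Check the real abstraction (and the items module) before copying anything.

```typescript
// Hypothetical sketch: names and etag format are assumptions, not the real datastore API.
interface Doc {
  id: string;
  productId: string;
  _etag?: string;
}

class EtagConflictError extends Error {
  constructor(id: string) {
    super(`etag mismatch for ${id}`);
    this.name = 'EtagConflictError';
  }
}

class MemoryRepo<T extends Doc> {
  private readonly docs = new Map<string, T>();
  private version = 0;

  // Every successful write gets a fresh, never-reused etag.
  private nextEtag(): string {
    this.version += 1;
    return `v${this.version}`;
  }

  create(doc: T): T {
    if (this.docs.has(doc.id)) throw new Error(`duplicate id ${doc.id}`);
    const stored = { ...doc, _etag: this.nextEtag() };
    this.docs.set(doc.id, stored);
    return { ...stored };
  }

  getById(id: string): T | undefined {
    const found = this.docs.get(id);
    return found ? { ...found } : undefined;
  }

  // Conditional write: succeeds only when ifMatch equals the stored etag
  // (If-Match semantics); otherwise the caller must re-read and retry.
  update(doc: T, ifMatch: string): T {
    const current = this.docs.get(doc.id);
    if (!current) throw new Error(`not found: ${doc.id}`);
    if (current._etag !== ifMatch) throw new EtagConflictError(doc.id);
    const stored = { ...doc, _etag: this.nextEtag() };
    this.docs.set(doc.id, stored);
    return { ...stored };
  }
}
```

Returning copies from `create`, `getById`, and `update` keeps callers from mutating
stored state behind the etag check, which is what makes a stale write detectable.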

=================================================================
PART S2 — CONCURRENCY CORE (claim / lease / fencing / heartbeat / reaper)
=================================================================
4. ATOMIC CLAIM (the heart): `claimNextJob(factory)` selects the highest-priority,
   oldest eligible job whose stage is `queued` AND whose deps are satisfied AND
   whose capabilities are a subset of the factory's, then atomically transitions it
   to `assigned` and creates/acquires its lease — guarded by _etag / If-Match so
   that under contention EXACTLY ONE factory wins; losers get a conflict and retry
   the selection. No double-assignment, ever.
5. LEASES + FENCING: acquiring a lease increments `leaseEpoch`. `renewLease`,
   `releaseLease`. Every state-mutating call from a worker carries its leaseEpoch;
   a call whose epoch is < the current epoch is REJECTED (fencing) — a stale/zombie
   worker can never overwrite a reassigned job's state.
6. HEARTBEAT: `heartbeat(factoryId)` updates lastHeartbeatAt + load/health.
7. REAPER: `reapExpiredLeases(now)` scans leases with expiresAt < now, marks them
   expired, bumps leaseEpoch, and returns the job to `queued` (or `blocked` if deps
   now unmet) for re-claim — resume-from-checkpoint friendly (checkpoint pointer
   preserved on the job). Reaper is idempotent. (Cosmos TTL does NOT do this — the
   reaper must; document why.)
8. IDEMPOTENCY: submit with an existing idempotencyKey + identical content => returns
   the existing job (no dup); same key + different content while still queued =>
   supersede; otherwise 409. (Mirror the agent-queue Slice 1 semantics.)
9. DEPS: a job is `blocked` until each dep reaches shipped (or testing when
   depsMode:soft); submit-time cycle detection rejects cyclic graphs.
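
ILLUSTRATION (non-normative): steps 4 and 5 can be sketched as a compare-and-swap
loop plus an epoch check. Everything here is an assumption made for brevity:
`JobStore`, `replaceIfMatch`, the numeric `etag`, and the choice to keep
`leaseEpoch` on the job itself (the spec puts it on FleetLeaseDoc) are
hypothetical, and deps gating is omitted. Treat it as a picture of the protocol,
not the implementation to ship.

```typescript
// Hypothetical sketch: store shape, names, and numeric etags are assumptions.
type Priority = 'critical' | 'high' | 'medium' | 'low';

interface Job {
  id: string;
  stage: 'queued' | 'assigned';
  priority: Priority;
  createdAt: number;
  capabilities: string[];
  leaseEpoch: number; // simplified: the spec keeps this on the lease doc
  holderFactoryId?: string;
  etag: number;
}

interface Factory {
  factoryId: string;
  capabilities: string[];
}

const RANK: Record<Priority, number> = { critical: 0, high: 1, medium: 2, low: 3 };

class JobStore {
  private readonly jobs = new Map<string, Job>();

  put(job: Job): void {
    this.jobs.set(job.id, { ...job });
  }

  get(id: string): Job | undefined {
    const job = this.jobs.get(id);
    return job ? { ...job } : undefined;
  }

  list(): Job[] {
    return Array.from(this.jobs.values()).map((job) => ({ ...job }));
  }

  // Compare-and-swap: write only if the stored etag still equals ifMatch.
  replaceIfMatch(next: Job, ifMatch: number): boolean {
    const current = this.jobs.get(next.id);
    if (!current || current.etag !== ifMatch) return false;
    this.jobs.set(next.id, { ...next, etag: current.etag + 1 });
    return true;
  }
}

// Queued jobs the factory can run, highest priority first, then oldest.
function eligible(store: JobStore, factory: Factory): Job[] {
  return store
    .list()
    .filter((j) => j.stage === 'queued' && j.capabilities.every((c) => factory.capabilities.includes(c)))
    .sort((a, b) => RANK[a.priority] - RANK[b.priority] || a.createdAt - b.createdAt);
}

function claimNextJob(store: JobStore, factory: Factory, maxRetries = 5): Job | undefined {
  for (let attempt = 0; attempt < maxRetries; attempt += 1) {
    const [candidate] = eligible(store, factory);
    if (!candidate) return undefined;
    const claimed: Job = {
      ...candidate,
      stage: 'assigned',
      holderFactoryId: factory.factoryId,
      leaseEpoch: candidate.leaseEpoch + 1, // acquiring the lease bumps the epoch
    };
    if (store.replaceIfMatch(claimed, candidate.etag)) return store.get(candidate.id);
    // Lost the race: another factory changed the job first, so re-run selection.
  }
  return undefined;
}

// Fencing: a caller holding an older epoch than the current one is rejected.
function isFenced(job: Job, callerEpoch: number): boolean {
  return callerEpoch < job.leaseEpoch;
}
```

Because two writers holding the same etag can never both pass `replaceIfMatch`,
the race in the test plan can be driven deterministically by calling the
conditional write twice with one snapshot, with no threads involved.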

10. routes.ts — guarded REST under the existing auth + productId middleware:
    POST /fleet/jobs (submit, idempotent), GET /fleet/jobs (list by stage),
    GET /fleet/jobs/:id, PATCH /fleet/jobs/:id (fenced state transition),
    POST /fleet/claim (atomic claim for a factory),
    POST /fleet/jobs/:id/lease/renew, POST /fleet/jobs/:id/lease/release,
    POST /fleet/factories/heartbeat, GET /fleet/jobs/:id/runs,
    GET /fleet/jobs/:id/events. Validate every body with the Zod schemas. Register
    the module in the app exactly as items is registered.
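
ILLUSTRATION (non-normative): the idempotent-submit rule behind POST /fleet/jobs
(step 8) reduces to a small decision function. The names `decideSubmit`,
`SubmitOutcome`, and `contentHash` are hypothetical; in particular, comparing a
content hash is only one assumed way to decide "identical content", and the
agent-queue Slice 1 semantics remain the source of truth.

```typescript
// Hypothetical sketch: outcome names and the contentHash field are assumptions.
type SubmitOutcome =
  | { kind: 'created' }
  | { kind: 'existing'; jobId: string }
  | { kind: 'superseded'; jobId: string }
  | { kind: 'conflict'; status: 409 };

interface ExistingJob {
  id: string;
  stage: string;
  contentHash: string;
}

// existing = the job already stored under the incoming idempotencyKey, if any.
function decideSubmit(existing: ExistingJob | undefined, incomingHash: string): SubmitOutcome {
  if (!existing) return { kind: 'created' };
  // Same key + identical content: return the existing job, no duplicate.
  if (existing.contentHash === incomingHash) return { kind: 'existing', jobId: existing.id };
  // Same key + different content is only allowed while the job is still queued.
  if (existing.stage === 'queued') return { kind: 'superseded', jobId: existing.id };
  return { kind: 'conflict', status: 409 };
}
```

Keeping the decision pure (no datastore access) lets the route handler stay thin
and makes the three branches easy to cover in unit tests.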

=================================================================
TESTS (Vitest — write alongside; memory provider; tests are sacred)
=================================================================
- schema validation: valid docs pass; missing productId / bad enum fail precisely
  (>=1 invalid case per container).
- repo CRUD round-trip per container; list filters by productId, by stage, by
  idempotencyKey; appendEvent yields an ordered append-only stream.
- ATOMIC CLAIM RACE: two claims contending for the SAME job version (same _etag) =>
  exactly one succeeds, the other gets a conflict; assert no double-assignment.
  (Deterministic: drive via the conditional/If-Match update, not real threads.)
- priority+age selection: among eligible queued jobs, claim returns the
  highest-priority then oldest.
- deps gating: a job with unmet deps is `blocked` and NOT claimable; becomes
  claimable once deps reach shipped; depsMode:soft satisfied at testing; cycle
  rejected at submit.
- FENCING: a state-mutating call with a stale leaseEpoch is rejected; the current
  epoch succeeds.
- REAPER: an expired lease => job back to queued, leaseEpoch bumped, checkpoint
  preserved; running the reaper twice is idempotent.
- HEARTBEAT updates lastHeartbeatAt/health; a stale factory is detectable.
- IDEMPOTENT submit: same key+content => 1 job; key+changed content while queued =>
  superseded; otherwise 409.
- routes: submit+claim+renew+release+heartbeat+patch via fastify inject (shared
  testing helpers); auth + productId enforced; invalid body rejected.
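
ILLUSTRATION (non-normative): the REAPER expectations above (epoch bumped,
checkpoint preserved, second pass a no-op) can be sketched over plain maps. The
`Lease` and `ReapJob` shapes, the `depsMet` flag, and the map-based signature are
assumptions for this example; a real implementation goes through the
repositories and has to cope with the lease and job writes not being one atomic
operation.

```typescript
// Hypothetical sketch: shapes and the map-based signature are assumptions.
interface Lease {
  jobId: string;
  holderFactoryId?: string;
  expiresAt?: number;
  leaseEpoch: number;
  status: 'held' | 'expired' | 'released';
}

interface ReapJob {
  id: string;
  stage: string;
  depsMet: boolean; // stand-in for a real deps check
  checkpoint?: { wipBranch: string; wipBase: string; wipCommit: string };
}

// Returns the ids of jobs whose leases were reaped in this pass.
function reapExpiredLeases(leases: Map<string, Lease>, jobs: Map<string, ReapJob>, now: number): string[] {
  const reaped: string[] = [];
  leases.forEach((lease) => {
    // Only held leases past expiry are touched, so a repeat pass changes nothing.
    if (lease.status !== 'held' || lease.expiresAt === undefined || lease.expiresAt >= now) return;
    lease.status = 'expired';
    lease.leaseEpoch += 1; // fences out the previous holder
    delete lease.holderFactoryId;
    const job = jobs.get(lease.jobId);
    // The job goes back for re-claim; its checkpoint is deliberately left alone.
    if (job) job.stage = job.depsMet ? 'queued' : 'blocked';
    reaped.push(lease.jobId);
  });
  return reaped;
}
```

Idempotency falls out of the status guard: once a lease is marked `expired`, a
later pass with the same `now` skips it, so the epoch is bumped exactly once.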

VERIFY GATE (must all pass before finishing):
- pnpm --filter @lysnrai/platform-service typecheck
- pnpm --filter @lysnrai/platform-service test (all new tests green; none weakened)
- pnpm --filter @lysnrai/platform-service build
Run the full repo gate too if quick: `pnpm build && pnpm test && pnpm typecheck`.

DOCS:
- A module README (or docblock) describing each container, the claim/lease/fence
  protocol, and the reaper. In your REPORT, list which roadmap §4/§7/§8/§13/§18
  items are now satisfied (I will tick them in the devops-tools repo — you must NOT
  edit that repo).

CONSTRAINTS: follow items-module conventions precisely; ESM .js imports; no any; no
console.log; productId on every doc; conventional commits
(feat(platform-service): ...); do not touch template-managed infra.

FINAL OUTPUT — print the report in EXACTLY this format:

## Implementation Report — Phase 2 Foundation (fleet module + coordinator)
### Branch & commits
- branch / based-on / PR
- commits: <sha> <message> (one per line)
### Files changed
- <path>: <one-line summary>
### What was implemented
- S1 data model: <containers, partition keys, etag handling>
- S2 concurrency: <claim algorithm, lease/fencing via leaseEpoch, reaper, heartbeat>
- idempotency + deps + cycle detection: <how>
### Tests added
- <test name>: <assertion> (esp. the atomic-claim race, fencing, reaper tests)
- pnpm test summary: <N passed>
### Verify gate results
- typecheck / test / build (+ full-repo gate if run): <results>
### Roadmap items now satisfied
- §4: <...> §7: <...> §8: <...> §13: <...> §18: <...>
### Deviations / assumptions
- <datastore concurrency model, how the race test is made deterministic, anything stubbed>
### Suggested next slice
- Phase 2 Slice 3: factory-agent integration — agent-queue.sh registers/heartbeats/
  claims/reports against this coordinator behind a flag, preserving offline mode;
  plus the tracker echo wired through fleet_events.