bytelyst-devops-tools/agent-queue/docs/jobs/phase2-slice1.md

---
engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_common_plat
yolo: true
lock: common-plat
timeout: 4h
---
ROLE: Senior backend/distributed-systems engineer. Implement Phase 2 — Slice 1:
the FLEET DATA MODEL + REPOSITORIES as a new platform-service module. This is the
durable backbone (§13) that supersedes the single-host stand-ins. NO atomic
claim/lease/fencing logic yet — that is Phase 2 Slice 2. This slice is schemas,
repositories, container registration, basic guarded CRUD, and tests.
NOTE: This runs in a DIFFERENT repo (learning_ai_common_plat), so it does NOT
conflict with the agent-queue (devops-tools) slices and can run independently.
READ FIRST (this is NOT the platform-service you may assume — verify conventions):
- services/platform-service/src/modules/items/{types,repository,routes}.ts — copy
this module pattern EXACTLY (types.ts -> repository.ts -> routes.ts, Zod schemas,
the cloud-agnostic datastore, productId on every doc, req.log/app.log).
- packages/cosmos (container registry) + how existing modules register containers.
- The fleet container spec in the roadmap: agent-queue/docs/GIGAFACTORY_ROADMAP.md
§13 lives in the devops-tools repo at ../learning_ai_devops_tools — read it for
the field lists (fleet_jobs incl. bodyMd + checkpoint; fleet_runs incl. token/
cost/tool/diff insights; fleet_leases incl. leaseEpoch; fleet_factories;
fleet_profiles; fleet_events; fleet_artifacts) and §25/§26.
PREREQUISITE / BRANCHING:
- Branch off CURRENT `main` of learning_ai_common_plat.
- New branch: feat/gigafactory-p2-slice1. Push + open a PR. DO NOT merge.
STRICT SCOPE:
- Add a NEW module: services/platform-service/src/modules/fleet/ (+ its tests).
- Register the new Cosmos containers via the existing registration path.
- Do NOT modify unrelated modules. Do NOT hand-edit shared infra (.npmrc,
docker-prep.sh, tsconfig.base, pnpm-workspace) — those are template-managed.
- ESM everywhere ("type": "module", .js import suffixes). No `any` (Zod inference
or explicit types). No console.log (use req.log/app.log). Every Cosmos doc has
productId. Tests are sacred.
DELIVERABLES
1. types.ts — Zod schemas + inferred types for each container, each with productId:
- FleetJobDoc (pk /productId): manifestSnapshot, bodyMd (verbatim instructions),
stage, idempotencyKey, trackerItemId?, parentId?, kind ('leaf'|'composite',
default 'leaf'), checkpoint? { wipBranch, wipBase, wipCommit }, priority,
capabilities[], engineClass?, profile?, deps[], depsMode?, timestamps.
- FleetRunDoc (pk /jobId): jobId, attempt, factoryId?, engine, profileSnapshot?,
startedAt, endedAt?, exit?, verifyResult?, result?, and insights: model?,
tokensIn?, tokensOut?, tokensCached?, costUsd?, estimated?, turns?, toolCalls?,
filesChanged?, linesAdded?, linesDeleted?.
- FleetLeaseDoc (pk /jobId): jobId, holderFactoryId?, expiresAt?, leaseEpoch
(number, default 0), renewals, status. (Fields only — reclaim/claim logic is S2.)
- FleetFactoryDoc (pk /productId): factoryId, descriptor, capabilities[], health,
load, lastHeartbeatAt, seatLimit.
- FleetProfileDoc (pk /productId): name, version, immutable snapshot (persona,
defaults).
- FleetEventDoc (pk /jobId): append-only event { type, at, data }.
- FleetArtifactDoc (pk /jobId): pointers to blob-stored artifacts (no inline logs).
- Define enums for stage and result that MATCH the agent-queue lifecycle.
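Illustrative sketch for deliverable 1 (not the required implementation): a partial FleetJobDoc schema plus FleetEventDoc, written with Zod. It covers only a subset of the listed fields. The `id` field, the ISO-string encoding of `at`, and `stage` as a plain string are assumptions; the real `stage` must be an enum taken from the agent-queue lifecycle, and field names should follow whatever the items module already does.

```typescript
import { z } from "zod";

// Optional work-in-progress checkpoint, fields as listed in the spec.
export const FleetCheckpointSchema = z.object({
  wipBranch: z.string().min(1),
  wipBase: z.string().min(1),
  wipCommit: z.string().min(1),
});

// Partial FleetJobDoc: a subset of the spec's fields, enough to show the pattern.
export const FleetJobDocSchema = z.object({
  id: z.string().min(1), // ASSUMPTION: document id field name
  productId: z.string().min(1), // partition key (/productId), required on every doc
  bodyMd: z.string(), // verbatim instructions
  // ASSUMPTION: plain string placeholder. Replace with z.enum([...]) once the
  // agent-queue lifecycle values have been read from the roadmap.
  stage: z.string().min(1),
  idempotencyKey: z.string().min(1),
  trackerItemId: z.string().optional(),
  parentId: z.string().optional(),
  kind: z.enum(["leaf", "composite"]).default("leaf"),
  checkpoint: FleetCheckpointSchema.optional(),
  priority: z.number().int(),
  capabilities: z.array(z.string()),
  deps: z.array(z.string()),
});

// Inferred type, so no hand-written interface and no `any`.
export type FleetJobDoc = z.infer<typeof FleetJobDocSchema>;

// Append-only event. Partitioned by /jobId but still carries productId.
export const FleetEventDocSchema = z.object({
  id: z.string().min(1), // ASSUMPTION: document id field name
  productId: z.string().min(1),
  jobId: z.string().min(1), // partition key (/jobId)
  type: z.string().min(1),
  at: z.string().min(1), // ASSUMPTION: ISO-8601 timestamp string
  data: z.record(z.string(), z.unknown()),
});

export type FleetEventDoc = z.infer<typeof FleetEventDocSchema>;
```

Calling `FleetJobDocSchema.safeParse(body)` returns a result object instead of throwing, which keeps route handlers and the "missing productId / bad enum" tests straightforward: check `success`, then read either `data` or `error.issues`.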
2. repository.ts — one repository per container using the existing datastore
abstraction (so DB_PROVIDER=memory works in tests, cosmos in prod):
- CRUD: create, getById, list (by productId; jobs also by stage), update
(optimistic via _etag where the datastore supports it — expose the etag,
even though the ATOMIC claim flow is S2), delete where sensible.
- appendEvent(jobId, event) for the append-only fleet_events stream.
- All queries partition-aware; no cross-partition fan-out in hot paths.
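Illustrative sketch for deliverable 2, assuming nothing about the real datastore abstraction: a dependency-free in-memory stand-in that shows the two behaviours the tests need, namely an etag that changes on every write (with an optional If-Match style guard) and an ordered append-only event stream per jobId. `MemoryRepo`, `MemoryEventStream`, `EtagConflictError`, and the `ifMatch` parameter are hypothetical names; the actual repositories should wrap the existing datastore and key documents by partition key plus id.

```typescript
// HYPOTHETICAL stand-in for illustration only; not the platform's datastore API.
export interface Stored<T> {
  doc: T;
  etag: string; // surfaced to callers so Slice 2 can build the atomic claim on it
}

export class EtagConflictError extends Error {
  constructor(id: string) {
    super(`etag mismatch for ${id}`);
    this.name = "EtagConflictError";
  }
}

export class MemoryRepo<T extends { id: string; productId: string }> {
  private readonly rows = new Map<string, Stored<T>>();
  private version = 0;

  // Every write gets a fresh etag value.
  private nextEtag(): string {
    this.version += 1;
    return `v${this.version}`;
  }

  create(doc: T): Stored<T> {
    if (this.rows.has(doc.id)) throw new Error(`duplicate id ${doc.id}`);
    const stored: Stored<T> = { doc: { ...doc }, etag: this.nextEtag() };
    this.rows.set(doc.id, stored);
    return stored;
  }

  getById(id: string): Stored<T> | undefined {
    return this.rows.get(id);
  }

  // Filter by the partition key value; no cross-product results.
  listByProduct(productId: string): T[] {
    return Array.from(this.rows.values())
      .filter((row) => row.doc.productId === productId)
      .map((row) => row.doc);
  }

  // When ifMatch is given, reject the write if the stored etag has moved on.
  update(id: string, patch: Partial<T>, ifMatch?: string): Stored<T> {
    const current = this.rows.get(id);
    if (!current) throw new Error(`not found: ${id}`);
    if (ifMatch !== undefined && ifMatch !== current.etag) {
      throw new EtagConflictError(id);
    }
    const stored: Stored<T> = {
      doc: { ...current.doc, ...patch },
      etag: this.nextEtag(),
    };
    this.rows.set(id, stored);
    return stored;
  }

  delete(id: string): boolean {
    return this.rows.delete(id);
  }
}

export interface FleetEvent {
  type: string;
  at: string;
  data: Record<string, unknown>;
}

export class MemoryEventStream {
  private readonly byJob = new Map<string, FleetEvent[]>();

  // Appends a frozen copy and returns its zero-based position in the job's stream.
  appendEvent(jobId: string, event: FleetEvent): number {
    const list = this.byJob.get(jobId) ?? [];
    list.push(Object.freeze({ ...event }));
    this.byJob.set(jobId, list);
    return list.length - 1;
  }

  // Returns a copy in insertion order, so callers cannot reorder the stream.
  listEvents(jobId: string): FleetEvent[] {
    return (this.byJob.get(jobId) ?? []).slice();
  }
}
```

The lost-update test then reads naturally: create a doc, update it once with the returned etag, and assert that a second update using the stale etag throws while the stored document keeps the first update.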
3. container registration — register all fleet_* containers with correct partition
keys via the existing cosmos container registry; the memory provider handles
containers automatically.
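Illustrative sketch for deliverable 3: the container-to-partition-key mapping from deliverable 1 collected in one typed constant, so registration and tests read from a single source. The container names and key paths come from this spec; `ContainerDefinition` and `fleetContainerDefinitions` are hypothetical shapes, and the real call into the packages/cosmos registry must be taken from how existing modules register containers.

```typescript
// Partition key paths as stated in the types.ts deliverable.
export const FLEET_CONTAINERS = {
  fleet_jobs: "/productId",
  fleet_runs: "/jobId",
  fleet_leases: "/jobId",
  fleet_factories: "/productId",
  fleet_profiles: "/productId",
  fleet_events: "/jobId",
  fleet_artifacts: "/jobId",
} as const;

export type FleetContainerName = keyof typeof FLEET_CONTAINERS;

// HYPOTHETICAL definition shape; adapt to whatever the container registry expects.
export interface ContainerDefinition {
  name: FleetContainerName;
  partitionKeyPath: string;
}

// Expands the constant into a list that a registration loop could consume.
export function fleetContainerDefinitions(): ContainerDefinition[] {
  const names = Object.keys(FLEET_CONTAINERS) as FleetContainerName[];
  return names.map((name) => ({
    name,
    partitionKeyPath: FLEET_CONTAINERS[name],
  }));
}
```

Containers partitioned by /jobId still store productId as an ordinary field, so the "productId on every doc" rule holds regardless of the partition key.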
4. routes.ts — minimal guarded REST under the existing auth + productId middleware:
- POST /fleet/jobs (create), GET /fleet/jobs (list by stage/productId),
GET /fleet/jobs/:id, PATCH /fleet/jobs/:id (stage/fields), and read endpoints
for runs (GET /fleet/jobs/:id/runs) + events. Keep it thin — claim/lease
endpoints are S2. Validate all bodies with the Zod schemas.
- Register the route module in the platform-service app the same way items does.
TESTS (Vitest — write alongside; memory provider; tests sacred):
- schema validation: valid docs pass; missing productId / bad enum fail with
precise errors; at least one invalid case per container.
- repository CRUD round-trip per container (create→get→list→update→delete) on the
memory provider; list filters by productId and by stage (jobs).
- appendEvent produces an ordered, append-only stream for a jobId.
- routes: create+get+list+patch a job via fastify inject (use the shared testing
helpers); auth/productId enforced; invalid body rejected.
- _etag surfaced on update (lost-update guard groundwork) — assert the etag flows.
VERIFY GATE (must pass):
- pnpm --filter @lysnrai/platform-service typecheck
- pnpm --filter @lysnrai/platform-service test (new tests green; none weakened)
- pnpm --filter @lysnrai/platform-service build
DOCS:
- Short module README or header docblock describing the containers + that
claim/lease/fencing is Phase 2 Slice 2.
- You may NOT edit the roadmap in ../learning_ai_devops_tools (different repo);
instead, note in your report which §13 items are now satisfied so I can tick them.
CONSTRAINTS: follow the items-module conventions precisely; ESM .js imports; no any;
no console.log; productId everywhere; conventional commits (feat(platform-service):
...); do not touch template-managed infra files.
FINAL OUTPUT — print the report in EXACTLY this format:
## Implementation Report — Phase 2 Slice 1
### Branch & commits
- branch / based-on / PR
- commits: <sha> <message>
### Files changed
- <path>: <one-line summary>
### What was implemented (1-4)
- containers + schemas + repos + routes; partition keys; etag handling
### Tests added
- <test name>: <assertion> (+ pnpm test summary: N passed)
### Verify gate results
- typecheck / test / build: <results>
### §13 items now satisfied
- <list which roadmap §13 boxes are done so the human can tick them>
### Deviations / assumptions
- <datastore/etag/provider choices>
### Suggested next slice
- Phase 2 Slice 2: atomic claim (_etag/If-Match) + lease renew/release + heartbeat
+ reaper + fencing (leaseEpoch) — the concurrency core.