bytelyst-devops-tools/agent-queue/docs/jobs/phase2-slice1.md
Saravanakumar D 257efcb4bc docs(gigafactory): consolidate gigafactory docs into docs/gigafactory/
Move GIGAFACTORY_ROADMAP.md and GIGAFACTORY_SYSTEM_OVERVIEW.md under
agent-queue/docs/gigafactory/ so the scattered top-level docs are easy to
discover. Update the README links, the overview code-map, and all phase
job-spec source-of-truth paths to the new location. Pure docs move; no
behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:01:23 -07:00


engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_common_plat
yolo: true
lock: common-plat
timeout: 4h

ROLE: Senior backend/distributed-systems engineer. Implement Phase 2 — Slice 1: the FLEET DATA MODEL + REPOSITORIES as a new platform-service module. This is the durable backbone (§13) that supersedes the single-host stand-ins. NO atomic claim/lease/fencing logic yet — that is Phase 2 Slice 2. This slice is schemas, repositories, container registration, basic guarded CRUD, and tests.

NOTE: This runs in a DIFFERENT repo (learning_ai_common_plat), so it does NOT conflict with the agent-queue (devops-tools) slices and can run independently.

READ FIRST (this is NOT the platform-service you may assume — verify conventions):

  • services/platform-service/src/modules/items/{types,repository,routes}.ts — copy this module pattern EXACTLY (types.ts -> repository.ts -> routes.ts, Zod schemas, the cloud-agnostic datastore, productId on every doc, req.log/app.log).
  • packages/cosmos (container registry) + how existing modules register containers.
  • The fleet container spec in the roadmap, agent-queue/docs/gigafactory/GIGAFACTORY_ROADMAP.md §13. It lives in the devops-tools repo at ../learning_ai_devops_tools. Read it for the field lists (fleet_jobs incl. bodyMd + checkpoint; fleet_runs incl. token/cost/tool/diff insights; fleet_leases incl. leaseEpoch; fleet_factories; fleet_profiles; fleet_events; fleet_artifacts), and also read §25/§26.

PREREQUISITE / BRANCHING:

  • Branch off CURRENT main of learning_ai_common_plat.
  • New branch: feat/gigafactory-p2-slice1. Push + open a PR. DO NOT merge.

STRICT SCOPE:

  • Add a NEW module: services/platform-service/src/modules/fleet/ (+ its tests).
  • Register the new Cosmos containers via the existing registration path.
  • Do NOT modify unrelated modules. Do NOT hand-edit shared infra (.npmrc, docker-prep.sh, tsconfig.base, pnpm-workspace) — those are template-managed.
  • ESM everywhere ("type": "module", .js import suffixes). No any (Zod inference or explicit types). No console.log (use req.log/app.log). Every Cosmos doc has productId. Tests are sacred.

DELIVERABLES

  1. types.ts — Zod schemas + inferred types for each container, each with productId:

    • FleetJobDoc (pk /productId): manifestSnapshot, bodyMd (verbatim instructions), stage, idempotencyKey, trackerItemId?, parentId?, kind ('leaf'|'composite', default 'leaf'), checkpoint? { wipBranch, wipBase, wipCommit }, priority, capabilities[], engineClass?, profile?, deps[], depsMode?, timestamps.
    • FleetRunDoc (pk /jobId): jobId, attempt, factoryId?, engine, profileSnapshot?, startedAt, endedAt?, exit?, verifyResult?, result?, and insights: model?, tokensIn?, tokensOut?, tokensCached?, costUsd?, estimated?, turns?, toolCalls?, filesChanged?, linesAdded?, linesDeleted?.
    • FleetLeaseDoc (pk /jobId): jobId, holderFactoryId?, expiresAt?, leaseEpoch (number, default 0), renewals, status. (Fields only — reclaim/claim logic is S2.)
    • FleetFactoryDoc (pk /productId): factoryId, descriptor, capabilities[], health, load, lastHeartbeatAt, seatLimit.
    • FleetProfileDoc (pk /productId): name, version, immutable snapshot (persona, defaults).
    • FleetEventDoc (pk /jobId): append-only event { type, at, data }.
    • FleetArtifactDoc (pk /jobId): pointers to blob-stored artifacts (no inline logs).
    • Define enums for stage and result that MATCH the agent-queue lifecycle.
  2. repository.ts — one repository per container using the existing datastore abstraction (so DB_PROVIDER=memory works in tests, cosmos in prod):

    • CRUD: create, getById, list (by productId; jobs also by stage), update (optimistic via _etag where the datastore supports it — expose the etag, even though the ATOMIC claim flow is S2), delete where sensible.
    • appendEvent(jobId, event) for the append-only fleet_events stream.
    • All queries partition-aware; no cross-partition fan-out in hot paths.
  3. container registration — register all fleet_* containers with correct partition keys via the existing cosmos container registry; memory provider auto-handles.

  4. routes.ts — minimal guarded REST under the existing auth + productId middleware:

    • POST /fleet/jobs (create), GET /fleet/jobs (list by stage/productId), GET /fleet/jobs/:id, PATCH /fleet/jobs/:id (stage/fields), and read endpoints for runs (GET /fleet/jobs/:id/runs) + events. Keep it thin — claim/lease endpoints are S2. Validate all bodies with the Zod schemas.
    • Register the route module in the platform-service app the same way items does.
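
The optimistic update described in deliverable 2 can be sketched roughly as below. This is a minimal in-memory illustration, not the platform-service datastore API: MemoryRepo, EtagMismatchError, JobDoc, and the stage strings are hypothetical names chosen for the example, and the code is synchronous for brevity where the real datastore is presumably async.

```typescript
// Sketch only: a hypothetical in-memory stand-in for the datastore abstraction.
interface BaseDoc {
  id: string;
  productId: string;
}

type Stored<T> = T & { _etag: string };

class EtagMismatchError extends Error {
  constructor(id: string) {
    super(`etag mismatch for ${id}`);
    this.name = "EtagMismatchError";
  }
}

class MemoryRepo<T extends BaseDoc> {
  // One inner map per productId, so list() only ever reads a single partition.
  private partitions = new Map<string, Map<string, Stored<T>>>();
  private counter = 0;

  private nextEtag(): string {
    this.counter += 1;
    return `etag-${this.counter}`;
  }

  create(doc: T): Stored<T> {
    const part = this.partitions.get(doc.productId) ?? new Map<string, Stored<T>>();
    if (part.has(doc.id)) throw new Error(`duplicate id ${doc.id}`);
    const stored: Stored<T> = { ...doc, _etag: this.nextEtag() };
    part.set(doc.id, stored);
    this.partitions.set(doc.productId, part);
    return stored;
  }

  getById(productId: string, id: string): Stored<T> | undefined {
    return this.partitions.get(productId)?.get(id);
  }

  list(productId: string, filter?: (doc: Stored<T>) => boolean): Stored<T>[] {
    const part = this.partitions.get(productId);
    const docs = part ? Array.from(part.values()) : [];
    return filter ? docs.filter(filter) : docs;
  }

  // Rejects the write when the caller's etag is stale (lost-update guard).
  update(productId: string, id: string, patch: Partial<T>, expectedEtag: string): Stored<T> {
    const part = this.partitions.get(productId);
    const current = part?.get(id);
    if (!part || !current) throw new Error(`not found: ${id}`);
    if (current._etag !== expectedEtag) throw new EtagMismatchError(id);
    // id and productId are re-applied last so a patch cannot move the doc.
    const next: Stored<T> = { ...current, ...patch, id, productId, _etag: this.nextEtag() };
    part.set(id, next);
    return next;
  }
}

// Usage: the stage values are placeholders, not the agent-queue lifecycle.
interface JobDoc extends BaseDoc {
  stage: string;
}

const jobs = new MemoryRepo<JobDoc>();
const created = jobs.create({ id: "job-1", productId: "prod-a", stage: "queued" });
const updated = jobs.update("prod-a", "job-1", { stage: "running" }, created._etag);

// A second writer still holding the original etag is rejected.
let rejected = false;
try {
  jobs.update("prod-a", "job-1", { stage: "done" }, created._etag);
} catch (e) {
  rejected = e instanceof Error && e.name === "EtagMismatchError";
}
```

Exposing the etag on every read and write keeps Slice 1 free of claim logic while giving Slice 2 the compare-and-swap primitive it needs.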

TESTS (Vitest — write alongside; memory provider; tests sacred):

  • schema validation: valid docs pass; missing productId / bad enum fail with precise errors; at least one invalid case per container.
  • repository CRUD round-trip per container (create→get→list→update→delete) on the memory provider; list filters by productId and by stage (jobs).
  • appendEvent produces an ordered, append-only stream for a jobId.
  • routes: create+get+list+patch a job via fastify inject (use the shared testing helpers); auth/productId enforced; invalid body rejected.
  • _etag surfaced on update (lost-update guard groundwork) — assert the etag flows.
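
The ordered, append-only behaviour expected of appendEvent could be exercised along these lines. It is a dependency-free sketch under stated assumptions: MemoryEventStream, listEvents, the seq field, and the event type strings are hypothetical, and in the real suite the checks would be Vitest expectations against the memory provider.

```typescript
// Hypothetical shapes; the authoritative field lists are in the roadmap (§13).
interface FleetEvent {
  productId: string;
  type: string;
  at: string; // ISO-8601 timestamp
  data: Record<string, unknown>;
}

interface StoredFleetEvent extends FleetEvent {
  jobId: string;
  seq: number; // per-job sequence number: an assumption, not from the spec
}

class MemoryEventStream {
  // One array per jobId mirrors the /jobId partition key.
  private byJob = new Map<string, StoredFleetEvent[]>();

  appendEvent(jobId: string, event: FleetEvent): StoredFleetEvent {
    const stream = this.byJob.get(jobId) ?? [];
    // Shallow freeze: top-level fields of a stored event cannot be reassigned.
    const stored: StoredFleetEvent = Object.freeze({ ...event, jobId, seq: stream.length });
    stream.push(stored);
    this.byJob.set(jobId, stream);
    return stored;
  }

  listEvents(jobId: string): readonly StoredFleetEvent[] {
    // Return a copy so callers cannot reorder or truncate the stored stream.
    return (this.byJob.get(jobId) ?? []).slice();
  }
}

const events = new MemoryEventStream();
events.appendEvent("job-1", {
  productId: "prod-a",
  type: "created",
  at: "2026-01-01T00:00:00Z",
  data: {},
});
events.appendEvent("job-2", {
  productId: "prod-a",
  type: "created",
  at: "2026-01-01T00:00:01Z",
  data: {},
});
events.appendEvent("job-1", {
  productId: "prod-a",
  type: "stage-changed",
  at: "2026-01-01T00:00:05Z",
  data: { to: "running" },
});

const job1Events = events.listEvents("job-1");
const job1Types = job1Events.map((e) => e.type).join(",");
```

Reading back by jobId touches only that job's array, which matches the partition-aware requirement for hot paths.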

VERIFY GATE (must pass):

  • pnpm --filter @lysnrai/platform-service typecheck
  • pnpm --filter @lysnrai/platform-service test (new tests green; none weakened)
  • pnpm --filter @lysnrai/platform-service build

DOCS:

  • Short module README or header docblock describing the containers + that claim/lease/fencing is Phase 2 Slice 2.
  • You may NOT edit the roadmap in ../learning_ai_devops_tools (different repo). Instead, note in your report which §13 items are now satisfied so I can tick them.

CONSTRAINTS: follow the items-module conventions precisely; ESM .js imports; no any; no console.log; productId everywhere; conventional commits (feat(platform-service): ...); do not touch template-managed infra files.
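
As an illustration of the "no any" and "productId everywhere" constraints, the sketch below parses an unknown body into a typed value and reports precise per-field errors. It deliberately swaps Zod for a hand-rolled check so the example has no dependencies; in the module itself this would be a Zod schema. MinimalJob, parseMinimalJob, and the STAGES values are hypothetical placeholders (the real stage enum must match the agent-queue lifecycle), while the kind values and the 'leaf' default come from the FleetJobDoc description.

```typescript
const STAGES = ["queued", "running", "done"] as const; // hypothetical placeholders
const KINDS = ["leaf", "composite"] as const; // from the FleetJobDoc description

type Stage = (typeof STAGES)[number];
type Kind = (typeof KINDS)[number];

interface MinimalJob {
  productId: string;
  stage: Stage;
  kind: Kind;
}

type ParseResult =
  | { ok: true; value: MinimalJob }
  | { ok: false; errors: string[] };

// Type guard: narrows an unknown value to one member of a string-literal list.
function isOneOf<T extends string>(allowed: readonly T[], value: unknown): value is T {
  return typeof value === "string" && (allowed as readonly string[]).indexOf(value) !== -1;
}

function parseMinimalJob(input: unknown): ParseResult {
  if (typeof input !== "object" || input === null) {
    return { ok: false, errors: ["(root): expected an object"] };
  }
  const raw = input as Record<string, unknown>;
  const productId = raw.productId;
  const stage = raw.stage;
  const kind = raw.kind === undefined ? "leaf" : raw.kind; // default 'leaf'

  if (
    typeof productId === "string" &&
    productId.length > 0 &&
    isOneOf(STAGES, stage) &&
    isOneOf(KINDS, kind)
  ) {
    return { ok: true, value: { productId, stage, kind } };
  }

  // Collect one error per failing field so callers see every problem at once.
  const errors: string[] = [];
  if (typeof productId !== "string" || productId.length === 0) {
    errors.push("productId: required non-empty string");
  }
  if (!isOneOf(STAGES, stage)) {
    errors.push(`stage: expected one of ${STAGES.join(" | ")}`);
  }
  if (!isOneOf(KINDS, kind)) {
    errors.push(`kind: expected one of ${KINDS.join(" | ")}`);
  }
  return { ok: false, errors };
}

const good = parseMinimalJob({ productId: "prod-a", stage: "queued" });
const bad = parseMinimalJob({ stage: "paused" });
```

Here `good` carries a fully typed value with `kind` defaulted, and `bad` lists both the missing productId and the unknown stage.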

FINAL OUTPUT — print the report in EXACTLY this format:

Implementation Report — Phase 2 Slice 1

Branch & commits

  • branch / based-on / PR
  • commits:

Files changed

  • :

What was implemented (1-4)

  • containers + schemas + repos + routes; partition keys; etag handling

Tests added

  • : (+ pnpm test summary: N passed)

Verify gate results

  • typecheck / test / build:

§13 items now satisfied

  • <list which roadmap §13 boxes are done so the human can tick them>

Deviations / assumptions

  • <datastore/etag/provider choices>

Suggested next slice

  • Phase 2 Slice 2: atomic claim (_etag/If-Match) + lease renew/release + heartbeat + reaper + fencing (leaseEpoch): the concurrency core.