From 95dd7aa1d03114cbc99ba9e72f0a1c2d4a4e5e44 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Fri, 29 May 2026 20:20:54 -0700
Subject: [PATCH] docs(platform-service): fleet module README (containers, claim/lease/fence protocol, reaper)

---
 .../src/modules/fleet/README.md | 79 +++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 services/platform-service/src/modules/fleet/README.md

diff --git a/services/platform-service/src/modules/fleet/README.md b/services/platform-service/src/modules/fleet/README.md
new file mode 100644
index 00000000..a9c21af0
--- /dev/null
+++ b/services/platform-service/src/modules/fleet/README.md
@@ -0,0 +1,79 @@
+# Fleet module — agent gigafactory coordinator (Phase 2 foundation)
+
+Product-agnostic, cloud-agnostic coordinator for distributed agent jobs. This is
+the durable backend that supersedes the single-host stand-ins built in the
+`agent-queue` (devops-tools) repo. Everything runs on the `@bytelyst/datastore`
+abstraction, so all tests execute on `DB_PROVIDER=memory` (no Cosmos/network).
+
+Spec: `../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md`
+(§4 core contract, §7 scheduler/claim, §8 factory/lease/heartbeat, §13 containers,
+§18 failure model, §25 durability/recovery, §26 insights).
+
+## Containers (partition keys)
+
+| Container         | PK           | Purpose                                                                                                                                 |
+| ----------------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
+| `fleet_jobs`      | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, … |
+| `fleet_runs`      | `/jobId`     | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff)                                                         |
+| `fleet_leases`    | `/jobId`     | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status`                                                             |
+| `fleet_factories` | `/productId` | registered worker host: `capabilities`, `health`, `load`, `seatLimit`, `lastHeartbeatAt`                                                |
+| `fleet_profiles`  | `/productId` | immutable, versioned profile snapshot                                                                                                   |
+| `fleet_events`    | `/jobId`     | append-only audit/event stream (monotonic `seq`)                                                                                        |
+| `fleet_artifacts` | `/jobId`     | pointers to blob-stored artifacts (no inline logs)                                                                                      |
+
+Every document carries `productId`. Containers are registered in
+`lib/cosmos-init.ts`.
+
+## Concurrency protocol
+
+**Optimistic concurrency (`rev`).** Jobs and leases carry a monotonic `rev` token.
+`repository.revUpdate*` is a compare-and-swap: it writes only if the stored `rev`
+still equals the caller's expected `rev`, else it reports `conflict` and writes
+nothing. In production (Cosmos) this maps to `_etag` / `If-Match`; on the memory
+provider it is enforced by re-reading `rev` immediately before the write, which is
+exact for the sequential calls the coordinator and tests make.
+
+**Atomic claim (`claimNextJob`).** Select the highest-priority, oldest job that is
+`queued` (or `blocked` with now-satisfied deps) and whose `capabilities` are a
+subset of the factory's, then `tryClaimJob` does a `rev` CAS to flip it to
+`assigned` and acquire/create its lease. Under contention exactly one factory wins
+the CAS; losers get `conflict` and re-select.
+No double-assignment, ever.
+
+**Leases + fencing.** Acquiring/reclaiming a lease increments `leaseEpoch`. Every
+worker mutation (`patchJobFenced`, `renewLease`, `releaseLease`) carries its
+`leaseEpoch`; a call whose epoch is `< job.leaseEpoch` is rejected (`fenced`) — a
+stale/zombie worker can never overwrite a reassigned job.
+
+**Heartbeat.** `heartbeat(factory)` upserts `lastHeartbeatAt` + health/load;
+`isFactoryStale` detects a missed-heartbeat factory.
+
+**Reaper.** `reapExpiredLeases(now)` scans `held` leases with `expiresAt < now`,
+bumps `leaseEpoch` (fencing the dead holder), returns the job to `queued` (or
+`blocked` if deps are now unmet) **preserving the `checkpoint` pointer** (resume
+from WIP), and marks the lease `expired`. Idempotent — a reaped lease is no longer
+`held`, so a second pass reaps nothing. Cosmos TTL cannot do this (it only deletes
+the lease doc; it cannot requeue the job, bump the epoch, or keep the checkpoint),
+so the reaper — not TTL — owns recovery.
+
+## Submit semantics (idempotency + deps)
+
+- same `idempotencyKey` + identical `bodyMd` → returns the existing job (dedup).
+- same key + different content while still `queued`/`blocked` → supersede in place.
+- same key + different content once past `queued` → `409 Conflict`.
+- a job with unmet `deps` is `blocked` (a dep is met at `shipped`, or `testing`
+  when `depsMode: soft`); submit-time cycle detection rejects cyclic graphs.
+
+## REST (under `/api`, auth + productId)
+
+`POST /fleet/jobs` · `GET /fleet/jobs` · `GET /fleet/jobs/:id` ·
+`PATCH /fleet/jobs/:id` (fenced) · `POST /fleet/claim` ·
+`POST /fleet/jobs/:id/lease/renew` · `POST /fleet/jobs/:id/lease/release` ·
+`POST /fleet/factories/heartbeat` · `GET /fleet/jobs/:id/runs` ·
+`GET /fleet/jobs/:id/events`.
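
The fencing and reaper rules described under "Concurrency protocol" can be illustrated with a small in-memory sketch. It is not the module's code: the `Job` and `Lease` shapes are trimmed-down, hypothetical stand-ins for the schemas in `types.ts`, `depsMet` replaces real dependency resolution, and mirroring the bumped epoch onto the lease document is an assumption. Only the function names `patchJobFenced` and `reapExpiredLeases` come from the README; their signatures here are invented for illustration.

```typescript
// Hypothetical, simplified shapes; the real schemas live in types.ts.
type Stage = "queued" | "blocked" | "assigned";
type LeaseStatus = "held" | "expired" | "released";

interface Job {
  id: string;
  stage: Stage;
  leaseEpoch: number;
  checkpoint?: string; // pointer to work in progress
  depsMet: boolean;    // assumption: stand-in for real dependency resolution
}

interface Lease {
  jobId: string;
  holderFactoryId: string;
  expiresAt: number; // epoch milliseconds
  leaseEpoch: number;
  status: LeaseStatus;
}

type FencedResult = "ok" | "fenced";

// Reject any worker mutation that carries an epoch older than the job's.
function patchJobFenced(
  job: Job,
  callerEpoch: number,
  patch: Partial<Pick<Job, "checkpoint">>,
): FencedResult {
  if (callerEpoch < job.leaseEpoch) return "fenced";
  Object.assign(job, patch);
  return "ok";
}

// Requeue every job whose held lease has expired; returns the reaped job ids.
function reapExpiredLeases(jobs: Map<string, Job>, leases: Lease[], now: number): string[] {
  const reaped: string[] = [];
  for (const lease of leases) {
    if (lease.status !== "held" || lease.expiresAt >= now) continue;
    const job = jobs.get(lease.jobId);
    if (!job) continue;
    job.leaseEpoch += 1;                            // fences the dead holder
    job.stage = job.depsMet ? "queued" : "blocked"; // checkpoint is left untouched
    lease.leaseEpoch = job.leaseEpoch;              // assumption: lease mirrors the job epoch
    lease.status = "expired";                       // makes a second pass a no-op
    reaped.push(job.id);
  }
  return reaped;
}

// Usage: factory-a's lease expired at t=1000; the reaper runs twice at t=2000.
const jobs = new Map<string, Job>([
  ["j1", { id: "j1", stage: "assigned", leaseEpoch: 1, checkpoint: "ckpt-7", depsMet: true }],
]);
const leases: Lease[] = [
  { jobId: "j1", holderFactoryId: "factory-a", expiresAt: 1_000, leaseEpoch: 1, status: "held" },
];
const firstPass = reapExpiredLeases(jobs, leases, 2_000);
const secondPass = reapExpiredLeases(jobs, leases, 2_000);
// The old holder still believes it owns epoch 1 and tries to write.
const zombieWrite = patchJobFenced(jobs.get("j1")!, 1, { checkpoint: "stale" });
```

In this sketch the first pass reaps `j1`, the second pass finds no `held` lease, and the late write from the old holder is rejected as `fenced` while `checkpoint` keeps its pre-reap value.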
+
+## Files
+
+`types.ts` (Zod schemas → inferred types) · `repository.ts` (per-container repos +
+`revUpdate` CAS) · `coordinator.ts` (claim/lease/fence/heartbeat/reaper + submit) ·
+`routes.ts` (REST) · `*.test.ts` (schema, repo, coordinator incl. the atomic-claim
+race / fencing / reaper, and route inject tests).
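
As a rough illustration of the `rev` compare-and-swap and the atomic-claim race that the README describes, the sketch below models a memory-style store in which two factories read the same job snapshot and then race on the CAS. Everything except the names `revUpdate` and `tryClaimJob` is hypothetical: the `JobDoc` shape, the `MemoryJobStore` class, `selectCandidate`, the `assignedTo` field, and the choice that a larger `priority` number wins are assumptions made for the example, and lease creation is omitted.

```typescript
// Hypothetical, simplified job shape; the real schema lives in types.ts.
interface JobDoc {
  id: string;
  stage: "queued" | "assigned";
  priority: number;       // assumption: larger number = higher priority
  createdAt: number;
  capabilities: string[];
  rev: number;
  assignedTo?: string;    // assumption: illustrative field, not from the README
}

type CasResult = { status: "ok"; job: JobDoc } | { status: "conflict" };

class MemoryJobStore {
  private jobs = new Map<string, JobDoc>();

  put(job: JobDoc): void {
    this.jobs.set(job.id, { ...job });
  }

  // Returns copies, so callers hold snapshots rather than live documents.
  list(): JobDoc[] {
    return [...this.jobs.values()].map((j) => ({ ...j }));
  }

  // Compare-and-swap: re-read rev and write only if it equals expectedRev.
  revUpdate(id: string, expectedRev: number, patch: Partial<JobDoc>): CasResult {
    const stored = this.jobs.get(id);
    if (!stored || stored.rev !== expectedRev) return { status: "conflict" };
    const next: JobDoc = { ...stored, ...patch, rev: stored.rev + 1 };
    this.jobs.set(id, next);
    return { status: "ok", job: { ...next } };
  }
}

// Highest priority first, then oldest; job capabilities must be a subset.
function selectCandidate(store: MemoryJobStore, factoryCaps: string[]): JobDoc | undefined {
  return store
    .list()
    .filter((j) => j.stage === "queued" && j.capabilities.every((c) => factoryCaps.includes(c)))
    .sort((a, b) => b.priority - a.priority || a.createdAt - b.createdAt)[0];
}

// Flip the job to assigned only if nobody changed it since the snapshot was read.
function tryClaimJob(store: MemoryJobStore, snapshot: JobDoc, factoryId: string): CasResult {
  return store.revUpdate(snapshot.id, snapshot.rev, { stage: "assigned", assignedTo: factoryId });
}

// Two factories read the same snapshot (rev 0), then race on the CAS.
const store = new MemoryJobStore();
store.put({ id: "j1", stage: "queued", priority: 5, createdAt: 1, capabilities: ["node"], rev: 0 });
const seenByA = selectCandidate(store, ["node", "docker"])!;
const seenByB = selectCandidate(store, ["node"])!;
const claimA = tryClaimJob(store, seenByA, "factory-a");
const claimB = tryClaimJob(store, seenByB, "factory-b");
// The loser re-selects and finds nothing left to claim.
const retryB = selectCandidate(store, ["node"]);
```

Here `factory-a` wins because its write lands first and moves `rev` from 0 to 1. The write from `factory-b` still expects `rev` 0, so it reports `conflict` and changes nothing, and its re-select finds no remaining `queued` job.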