docs(platform-service): fleet module README (containers, claim/lease/fence protocol, reaper)

saravanakumardb1 2026-05-29 20:20:54 -07:00
parent 8eb02c48aa
commit 95dd7aa1d0


@@ -0,0 +1,79 @@
# Fleet module — agent gigafactory coordinator (Phase 2 foundation)
Product-agnostic, cloud-agnostic coordinator for distributed agent jobs. This is
the durable backend that supersedes the single-host stand-ins built in the
`agent-queue` (devops-tools) repo. Everything runs on the `@bytelyst/datastore`
abstraction, so all tests execute on `DB_PROVIDER=memory` (no Cosmos/network).
Spec: `../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md`
(§4 core contract, §7 scheduler/claim, §8 factory/lease/heartbeat, §13 containers,
§18 failure model, §25 durability/recovery, §26 insights).
## Containers (partition keys)
| Container | PK | Purpose |
| ----------------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
| `fleet_jobs` | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, … |
| `fleet_runs` | `/jobId` | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff) |
| `fleet_leases` | `/jobId` | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status` |
| `fleet_factories` | `/productId` | registered worker host: `capabilities`, `health`, `load`, `seatLimit`, `lastHeartbeatAt` |
| `fleet_profiles` | `/productId` | immutable, versioned profile snapshot |
| `fleet_events` | `/jobId` | append-only audit/event stream (monotonic `seq`) |
| `fleet_artifacts` | `/jobId` | pointers to blob-stored artifacts (no inline logs) |

Every document carries `productId`. Containers are registered in
`lib/cosmos-init.ts`.
## Concurrency protocol
**Optimistic concurrency (`rev`).** Jobs and leases carry a monotonic `rev` token.
`repository.revUpdate*` is a compare-and-swap: it writes only if the stored `rev`
still equals the caller's expected `rev`, else it reports `conflict` and writes
nothing. In production (Cosmos) this maps to `_etag` / `If-Match`; on the memory
provider it is enforced by re-reading `rev` immediately before the write, which is
exact for the sequential calls the coordinator and tests make.
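
The compare-and-swap described above can be sketched against a tiny in-memory store. This is a minimal illustration, not the module's `repository.ts`: the `MemoryStore` class, the `CasResult` shape, and the `not_found` status are hypothetical names introduced here, and the sketch assumes a successful write increments `rev` by one.

```typescript
// Hypothetical in-memory stand-in for a rev-guarded repository.
interface Versioned {
  id: string;
  rev: number;
}

type CasResult<T> =
  | { status: "ok"; doc: T }
  | { status: "conflict" }
  | { status: "not_found" };

class MemoryStore<T extends Versioned> {
  private docs = new Map<string, T>();

  insert(doc: T): void {
    this.docs.set(doc.id, { ...doc });
  }

  get(id: string): T | undefined {
    const doc = this.docs.get(id);
    return doc ? { ...doc } : undefined;
  }

  // Write only if the stored rev still equals expectedRev; otherwise write nothing.
  revUpdate(
    id: string,
    expectedRev: number,
    patch: Partial<Omit<T, "id" | "rev">>
  ): CasResult<T> {
    const current = this.docs.get(id); // re-read immediately before the write
    if (!current) return { status: "not_found" };
    if (current.rev !== expectedRev) return { status: "conflict" };
    const next = { ...current, ...patch, rev: current.rev + 1 } as T;
    this.docs.set(id, next);
    return { status: "ok", doc: { ...next } };
  }
}

// Two writers both read rev 1; only the first write lands.
interface DemoJob extends Versioned {
  stage: string;
}
const casStore = new MemoryStore<DemoJob>();
casStore.insert({ id: "job-1", rev: 1, stage: "queued" });
const firstWrite = casStore.revUpdate("job-1", 1, { stage: "assigned" });
const secondWrite = casStore.revUpdate("job-1", 1, { stage: "assigned" });
```

Here `firstWrite.status` is `ok` and `secondWrite.status` is `conflict`, because the stored `rev` moved from 1 to 2 between the two calls.
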

**Atomic claim (`claimNextJob`).** Select the highest-priority, oldest job that is
`queued` (or `blocked` with now-satisfied deps) and whose `capabilities` are a
subset of the factory's, then `tryClaimJob` does a `rev` CAS to flip it to
`assigned` and acquire/create its lease. Under contention exactly one factory wins
the CAS; losers get `conflict` and re-select. No double-assignment, ever.
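
A rough sketch of the select-then-CAS loop follows. It is illustrative only: `ClaimJob`, `selectCandidate`, `tryClaim`, and `claimNext` are hypothetical names, the sketch assumes a larger `priority` number means higher priority and that `createdAt` orders jobs by age, and it omits lease creation and the `blocked`-with-satisfied-deps case for brevity.

```typescript
// Hypothetical job shape; field names mirror the README but the types are assumed.
interface ClaimJob {
  id: string;
  stage: "queued" | "blocked" | "assigned";
  priority: number; // assumption: larger number = higher priority
  createdAt: number; // assumption: epoch milliseconds
  capabilities: string[];
  rev: number;
  leaseEpoch: number;
  holder?: string;
}

// Pick the highest-priority, oldest queued job the factory is able to run.
function selectCandidate(jobs: ClaimJob[], factoryCaps: Set<string>): ClaimJob | undefined {
  return jobs
    .filter((j) => j.stage === "queued" && j.capabilities.every((c) => factoryCaps.has(c)))
    .sort((a, b) => b.priority - a.priority || a.createdAt - b.createdAt)[0];
}

// rev CAS: flip to assigned only if nobody changed the job since it was read.
function tryClaim(
  jobs: Map<string, ClaimJob>,
  id: string,
  expectedRev: number,
  factoryId: string
): "claimed" | "conflict" {
  const stored = jobs.get(id);
  if (!stored || stored.rev !== expectedRev) return "conflict";
  jobs.set(id, {
    ...stored,
    stage: "assigned",
    rev: stored.rev + 1,
    leaseEpoch: stored.leaseEpoch + 1, // acquiring the lease bumps the epoch
    holder: factoryId,
  });
  return "claimed";
}

// Losers of the CAS re-select, up to a bounded number of attempts.
function claimNext(
  jobs: Map<string, ClaimJob>,
  factoryId: string,
  caps: string[],
  maxAttempts = 3
): ClaimJob | undefined {
  const capSet = new Set(caps);
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const candidate = selectCandidate(Array.from(jobs.values()), capSet);
    if (!candidate) return undefined;
    if (tryClaim(jobs, candidate.id, candidate.rev, factoryId) === "claimed") {
      return jobs.get(candidate.id);
    }
  }
  return undefined;
}

const claimJobs = new Map<string, ClaimJob>([
  ["a", { id: "a", stage: "queued", priority: 1, createdAt: 100, capabilities: ["node"], rev: 1, leaseEpoch: 0 }],
  ["b", { id: "b", stage: "queued", priority: 5, createdAt: 200, capabilities: ["node"], rev: 1, leaseEpoch: 0 }],
  ["c", { id: "c", stage: "queued", priority: 9, createdAt: 50, capabilities: ["gpu"], rev: 1, leaseEpoch: 0 }],
]);

// Two factories race for job "b" holding the same rev snapshot.
const raceWinner = tryClaim(claimJobs, "b", 1, "factory-1");
const raceLoser = tryClaim(claimJobs, "b", 1, "factory-2");
// The loser re-selects and lands on the next eligible job.
const nextClaim = claimNext(claimJobs, "factory-2", ["node"]);
```

In this run `factory-1` wins job `b`, `factory-2` gets `conflict` and then claims job `a`. Job `c` is skipped because it requires a `gpu` capability that the factory does not advertise.
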

**Leases + fencing.** Acquiring/reclaiming a lease increments `leaseEpoch`. Every
worker mutation (`patchJobFenced`, `renewLease`, `releaseLease`) carries its
`leaseEpoch`; a call whose epoch is `< job.leaseEpoch` is rejected (`fenced`) — a
stale/zombie worker can never overwrite a reassigned job.
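
The fencing check itself is small. The sketch below shows only the epoch comparison, under the assumption that a patch touches `stage` and `checkpoint`. `FencedJob`, `FenceResult`, and `patchFenced` are hypothetical names, not the signature of `patchJobFenced`.

```typescript
// Hypothetical shapes for illustrating the epoch comparison.
interface FencedJob {
  id: string;
  leaseEpoch: number;
  stage: string;
  checkpoint?: string;
}

type FenceResult =
  | { status: "ok"; job: FencedJob }
  | { status: "fenced"; currentEpoch: number };

// Reject any mutation whose epoch is older than the job's current epoch.
function patchFenced(
  job: FencedJob,
  callerEpoch: number,
  patch: Partial<Pick<FencedJob, "stage" | "checkpoint">>
): FenceResult {
  if (callerEpoch < job.leaseEpoch) {
    return { status: "fenced", currentEpoch: job.leaseEpoch };
  }
  return { status: "ok", job: { ...job, ...patch } };
}

// The job was reassigned, so its epoch is now 2.
const reassigned: FencedJob = { id: "job-1", leaseEpoch: 2, stage: "assigned" };
const zombieWrite = patchFenced(reassigned, 1, { stage: "shipped" }); // stale holder
const liveWrite = patchFenced(reassigned, 2, { checkpoint: "ckpt-7" }); // current holder
```

The worker still holding epoch 1 is fenced, while the current holder at epoch 2 writes normally.
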

**Heartbeat.** `heartbeat(factory)` upserts `lastHeartbeatAt` + health/load;
`isFactoryStale` detects a missed-heartbeat factory.
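
As a rough illustration of the heartbeat bookkeeping, the sketch below assumes `lastHeartbeatAt` is an ISO-8601 string and that staleness means "older than a caller-supplied threshold". The README states neither. `FactoryBeat`, `recordHeartbeat`, `isStale`, and `staleAfterMs` are hypothetical names.

```typescript
// Hypothetical factory document; the field types are assumptions.
interface FactoryBeat {
  factoryId: string;
  lastHeartbeatAt: string; // assumed ISO-8601 timestamp
  health: string;
  load: number;
}

// Upsert: overwrite the factory's record with fresh health/load and a new timestamp.
function recordHeartbeat(
  registry: Map<string, FactoryBeat>,
  beat: Omit<FactoryBeat, "lastHeartbeatAt">,
  now: Date
): FactoryBeat {
  const doc: FactoryBeat = { ...beat, lastHeartbeatAt: now.toISOString() };
  registry.set(doc.factoryId, doc);
  return doc;
}

// A factory is stale once its last heartbeat is older than the threshold.
function isStale(factory: FactoryBeat, now: Date, staleAfterMs: number): boolean {
  return now.getTime() - Date.parse(factory.lastHeartbeatAt) > staleAfterMs;
}

const registry = new Map<string, FactoryBeat>();
const t0 = new Date("2026-01-01T00:00:00.000Z");
const beat = recordHeartbeat(registry, { factoryId: "factory-1", health: "ok", load: 0.2 }, t0);
const freshCheck = isStale(beat, new Date(t0.getTime() + 10_000), 30_000); // false
const staleCheck = isStale(beat, new Date(t0.getTime() + 60_000), 30_000); // true
```
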

**Reaper.** `reapExpiredLeases(now)` scans `held` leases with `expiresAt < now`,
bumps `leaseEpoch` (fencing the dead holder), returns the job to `queued` (or
`blocked` if deps are now unmet) **preserving the `checkpoint` pointer** (resume
from WIP), and marks the lease `expired`. Idempotent — a reaped lease is no longer
`held`, so a second pass reaps nothing. Cosmos TTL cannot do this (it only deletes
the lease doc; it cannot requeue the job, bump the epoch, or keep the checkpoint),
so the reaper — not TTL — owns recovery.
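
The reaper's steps can be sketched over in-memory collections. This is an assumption-laden illustration rather than the coordinator's code: `ReapLease`, `ReapJob`, `reapExpired`, and the `depsUnmet` callback are hypothetical, timestamps are taken to be epoch milliseconds, and the epoch bump is applied to the job document.

```typescript
// Hypothetical shapes; only the fields the reaper touches are modelled.
interface ReapLease {
  jobId: string;
  status: "held" | "expired" | "released";
  expiresAt: number; // assumption: epoch milliseconds
}

interface ReapJob {
  id: string;
  stage: string;
  leaseEpoch: number;
  checkpoint?: string;
}

// Returns the ids of jobs whose leases were reaped in this pass.
function reapExpired(
  leases: ReapLease[],
  jobs: Map<string, ReapJob>,
  now: number,
  depsUnmet: (job: ReapJob) => boolean = () => false
): string[] {
  const reaped: string[] = [];
  for (const lease of leases) {
    // Only held leases that have actually expired are candidates.
    if (lease.status !== "held" || lease.expiresAt >= now) continue;
    const job = jobs.get(lease.jobId);
    if (job) {
      job.leaseEpoch += 1; // fences the dead holder
      job.stage = depsUnmet(job) ? "blocked" : "queued";
      // job.checkpoint is deliberately left untouched so work can resume.
    }
    lease.status = "expired"; // makes a second pass a no-op
    reaped.push(lease.jobId);
  }
  return reaped;
}

const reapLeases: ReapLease[] = [
  { jobId: "j1", status: "held", expiresAt: 1_000 },
  { jobId: "j2", status: "held", expiresAt: 9_000 },
];
const reapJobs = new Map<string, ReapJob>([
  ["j1", { id: "j1", stage: "assigned", leaseEpoch: 2, checkpoint: "ckpt-1" }],
  ["j2", { id: "j2", stage: "assigned", leaseEpoch: 1 }],
]);

const firstPass = reapExpired(reapLeases, reapJobs, 5_000); // ["j1"]
const secondPass = reapExpired(reapLeases, reapJobs, 5_000); // []
```

After the first pass `j1` is back in `queued` at epoch 3 with its checkpoint intact. `j2` is untouched because its lease has not expired, and the second pass finds nothing to reap.
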
## Submit semantics (idempotency + deps)
- same `idempotencyKey` + identical `bodyMd` → returns the existing job (dedup).
- same key + different content while still `queued`/`blocked` → supersede in place.
- same key + different content once past `queued` → `409 Conflict`.
- a job with unmet `deps` is `blocked` (a dep is met at `shipped`, or `testing`
when `depsMode: soft`); submit-time cycle detection rejects cyclic graphs.
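
The submit rules above reduce to a small decision table plus a graph check, sketched below. The names `decideSubmit`, `depMet`, and `hasCycle` are hypothetical. The sketch also assumes that "content" is compared via `bodyMd` alone and that the non-soft `depsMode` is called `hard`. The README confirms neither assumption.

```typescript
type SubmitStage = "queued" | "blocked" | "assigned" | "testing" | "shipped";

interface ExistingJob {
  idempotencyKey: string;
  bodyMd: string;
  stage: SubmitStage;
}

type SubmitOutcome = "create" | "dedup" | "supersede" | "conflict";

// `existing` is the job already stored under the same idempotencyKey, if any.
function decideSubmit(existing: ExistingJob | undefined, bodyMd: string): SubmitOutcome {
  if (!existing) return "create";
  if (existing.bodyMd === bodyMd) return "dedup"; // return the existing job
  if (existing.stage === "queued" || existing.stage === "blocked") return "supersede";
  return "conflict"; // would surface as 409 Conflict
}

// A dep is met at `shipped`, or already at `testing` when deps are soft.
function depMet(depStage: SubmitStage, depsMode: "hard" | "soft"): boolean {
  if (depStage === "shipped") return true;
  return depsMode === "soft" && depStage === "testing";
}

// Depth-first search: reaching a node that is still being visited means a cycle.
function hasCycle(deps: Map<string, string[]>): boolean {
  const state = new Map<string, "visiting" | "done">();
  const visit = (id: string): boolean => {
    const seen = state.get(id);
    if (seen === "visiting") return true;
    if (seen === "done") return false;
    state.set(id, "visiting");
    for (const dep of deps.get(id) ?? []) {
      if (visit(dep)) return true;
    }
    state.set(id, "done");
    return false;
  };
  for (const id of Array.from(deps.keys())) {
    if (visit(id)) return true;
  }
  return false;
}

const queuedJob: ExistingJob = { idempotencyKey: "k1", bodyMd: "v1", stage: "queued" };
const runningJob: ExistingJob = { idempotencyKey: "k1", bodyMd: "v1", stage: "assigned" };

const outcomes = [
  decideSubmit(undefined, "v1"), // "create"
  decideSubmit(queuedJob, "v1"), // "dedup"
  decideSubmit(queuedJob, "v2"), // "supersede"
  decideSubmit(runningJob, "v2"), // "conflict"
];
const acyclic = hasCycle(new Map([["a", ["b"]], ["b", ["c"]], ["c", []]])); // false
const cyclic = hasCycle(new Map([["a", ["b"]], ["b", ["a"]]])); // true
```
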
## REST (under `/api`, auth + productId)
`POST /fleet/jobs` · `GET /fleet/jobs` · `GET /fleet/jobs/:id` ·
`PATCH /fleet/jobs/:id` (fenced) · `POST /fleet/claim` ·
`POST /fleet/jobs/:id/lease/renew` · `POST /fleet/jobs/:id/lease/release` ·
`POST /fleet/factories/heartbeat` · `GET /fleet/jobs/:id/runs` ·
`GET /fleet/jobs/:id/events`.
## Files
`types.ts` (Zod schemas → inferred types) · `repository.ts` (per-container repos +
`revUpdate` CAS) · `coordinator.ts` (claim/lease/fence/heartbeat/reaper + submit) ·
`routes.ts` (REST) · `*.test.ts` (schema, repo, coordinator incl. the atomic-claim
race / fencing / reaper, and route inject tests).