docs(platform-service): fleet module README (containers, claim/lease/fence protocol, reaper)
parent 8eb02c48aa
commit 95dd7aa1d0
services/platform-service/src/modules/fleet/README.md (new file, 79 lines)
@@ -0,0 +1,79 @@

# Fleet module: agent gigafactory coordinator (Phase 2 foundation)

Product-agnostic, cloud-agnostic coordinator for distributed agent jobs. This is
the durable backend that supersedes the single-host stand-ins built in the
`agent-queue` (devops-tools) repo. Everything runs on the `@bytelyst/datastore`
abstraction, so all tests execute on `DB_PROVIDER=memory` (no Cosmos/network).

Spec: `../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md`
(§4 core contract, §7 scheduler/claim, §8 factory/lease/heartbeat, §13 containers,
§18 failure model, §25 durability/recovery, §26 insights).

## Containers (partition keys)

| Container         | PK           | Purpose                                                                                                                                 |
| ----------------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
| `fleet_jobs`      | `/productId` | durable job: `manifestSnapshot`, verbatim `bodyMd`, `stage`, `idempotencyKey`, `deps`, `checkpoint`, `priority`, `rev`, `leaseEpoch`, … |
| `fleet_runs`      | `/jobId`     | one execution attempt: engine, timings, `result`, `insights` (tokens/cost/diff)                                                         |
| `fleet_leases`    | `/jobId`     | single-holder lease: `holderFactoryId`, `expiresAt`, `leaseEpoch`, `status`                                                             |
| `fleet_factories` | `/productId` | registered worker host: `capabilities`, `health`, `load`, `seatLimit`, `lastHeartbeatAt`                                                |
| `fleet_profiles`  | `/productId` | immutable, versioned profile snapshot                                                                                                   |
| `fleet_events`    | `/jobId`     | append-only audit/event stream (monotonic `seq`)                                                                                        |
| `fleet_artifacts` | `/jobId`     | pointers to blob-stored artifacts (no inline logs)                                                                                      |

Every document carries `productId`. Containers are registered in
`lib/cosmos-init.ts`.

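
The table above can be read as a lookup from container name to partition key. The following is a minimal sketch of that lookup, not the contents of `lib/cosmos-init.ts`: the names `FLEET_CONTAINERS`, `FleetDocBase`, and `partitionValue` are hypothetical and exist only for illustration.

```typescript
// Hypothetical typed view of the container table. The real registration
// lives in lib/cosmos-init.ts and may be shaped differently.
type PartitionKey = "/productId" | "/jobId";

const FLEET_CONTAINERS: Record<string, PartitionKey> = {
  fleet_jobs: "/productId",
  fleet_runs: "/jobId",
  fleet_leases: "/jobId",
  fleet_factories: "/productId",
  fleet_profiles: "/productId",
  fleet_events: "/jobId",
  fleet_artifacts: "/jobId",
};

// Every document carries productId; job-scoped documents also carry jobId.
interface FleetDocBase {
  id: string;
  productId: string;
  jobId?: string;
}

// Resolve the partition value a document would be stored under.
function partitionValue(container: string, doc: FleetDocBase): string {
  const pk = FLEET_CONTAINERS[container];
  if (pk === undefined) {
    throw new Error(`unknown container: ${container}`);
  }
  if (pk === "/productId") {
    return doc.productId;
  }
  if (doc.jobId === undefined) {
    throw new Error(`${container} documents need a jobId`);
  }
  return doc.jobId;
}
```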
## Concurrency protocol

**Optimistic concurrency (`rev`).** Jobs and leases carry a monotonic `rev` token.
`repository.revUpdate*` is a compare-and-swap: it writes only if the stored `rev`
still equals the caller's expected `rev`; otherwise it reports `conflict` and writes
nothing. In production (Cosmos) this maps to `_etag` / `If-Match`; on the memory
provider it is enforced by re-reading `rev` immediately before the write, which is
exact only because the coordinator and tests call it sequentially, so nothing can
interleave between the re-read and the write.

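
As an illustration of the compare-and-swap described above, here is a minimal in-memory sketch. It is not the module's `repository.ts`: the class `MemoryJobStore`, the `VersionedJob` shape, and the result type are assumptions made for the example. It mirrors the memory-provider behaviour (re-read `rev`, then write) and is therefore only exact for sequential callers.

```typescript
// Hypothetical document shape; the real job schema has many more fields.
interface VersionedJob {
  id: string;
  rev: number;
  stage: string;
}

type CasResult =
  | { status: "ok"; doc: VersionedJob }
  | { status: "conflict"; currentRev: number | null };

class MemoryJobStore {
  private docs = new Map<string, VersionedJob>();

  put(doc: VersionedJob): void {
    this.docs.set(doc.id, { ...doc });
  }

  get(id: string): VersionedJob | undefined {
    const doc = this.docs.get(id);
    return doc ? { ...doc } : undefined;
  }

  // Compare-and-swap: write only if the stored rev equals expectedRev.
  revUpdate(id: string, expectedRev: number, patch: { stage?: string }): CasResult {
    const current = this.docs.get(id); // re-read immediately before the write
    if (!current || current.rev !== expectedRev) {
      return { status: "conflict", currentRev: current ? current.rev : null };
    }
    const next: VersionedJob = { ...current, ...patch, rev: current.rev + 1 };
    this.docs.set(id, next);
    return { status: "ok", doc: { ...next } };
  }
}
```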
**Atomic claim (`claimNextJob`).** `claimNextJob` selects the highest-priority,
oldest job that is `queued` (or `blocked` with now-satisfied deps) and whose
`capabilities` are a subset of the factory's. `tryClaimJob` then does a `rev` CAS
to flip it to `assigned` and acquire/create its lease. Under contention exactly one
factory wins the CAS; losers get `conflict` and re-select. Because only one CAS can
succeed for a given `rev`, a job is never double-assigned.

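
The select-then-CAS loop above can be sketched as follows. This is a simplified, single-process illustration under several assumptions: a larger `priority` number means higher priority, `createdAt` is what "oldest" is measured by, a dep counts as satisfied at `shipped`, the lease is folded into the job as `holderFactoryId` rather than stored in `fleet_leases`, and the retry cap `maxAttempts` is invented for the example. The function signatures are hypothetical, not the coordinator's.

```typescript
type ClaimStage = "queued" | "blocked" | "assigned" | "shipped";

interface ClaimJob {
  id: string;
  stage: ClaimStage;
  priority: number; // assumption: larger = more urgent
  createdAt: number; // assumption: epoch ms, used for "oldest first"
  capabilities: string[];
  deps: string[];
  rev: number;
  leaseEpoch: number;
  holderFactoryId?: string;
}

function isClaimable(job: ClaimJob, all: Map<string, ClaimJob>, caps: Set<string>): boolean {
  // The job's required capabilities must be a subset of the factory's.
  if (!job.capabilities.every((c) => caps.has(c))) return false;
  if (job.stage === "queued") return true;
  if (job.stage !== "blocked") return false;
  // A blocked job is claimable once every dep has shipped.
  return job.deps.every((d) => all.get(d)?.stage === "shipped");
}

function selectCandidate(all: Map<string, ClaimJob>, caps: Set<string>): ClaimJob | undefined {
  return Array.from(all.values())
    .filter((j) => isClaimable(j, all, caps))
    .sort((a, b) => b.priority - a.priority || a.createdAt - b.createdAt)[0];
}

// rev CAS: flip to assigned and bump the lease epoch only if rev is unchanged.
function tryClaimJob(
  all: Map<string, ClaimJob>,
  id: string,
  expectedRev: number,
  factoryId: string,
): "ok" | "conflict" {
  const current = all.get(id);
  if (!current || current.rev !== expectedRev) return "conflict";
  all.set(id, {
    ...current,
    stage: "assigned",
    rev: current.rev + 1,
    leaseEpoch: current.leaseEpoch + 1,
    holderFactoryId: factoryId,
  });
  return "ok";
}

function claimNextJob(
  all: Map<string, ClaimJob>,
  factoryId: string,
  caps: Set<string>,
  maxAttempts = 5,
): ClaimJob | undefined {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const candidate = selectCandidate(all, caps);
    if (!candidate) return undefined; // nothing claimable
    if (tryClaimJob(all, candidate.id, candidate.rev, factoryId) === "ok") {
      return all.get(candidate.id);
    }
    // Lost the race: re-select and try again.
  }
  return undefined;
}
```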
**Leases + fencing.** Acquiring/reclaiming a lease increments `leaseEpoch`. Every
worker mutation (`patchJobFenced`, `renewLease`, `releaseLease`) carries the
`leaseEpoch` the worker was granted; a call whose epoch is `< job.leaseEpoch` is
rejected (`fenced`), so a stale/zombie worker can never overwrite a reassigned job.

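
A minimal sketch of the fencing check described above, assuming a hypothetical `applyFencedPatch` helper and a pared-down job shape. It shows only the epoch comparison, not the real `patchJobFenced` signature or its persistence.

```typescript
// Hypothetical job shape: only the fields the fence check needs.
interface FencedJob {
  id: string;
  leaseEpoch: number;
  checkpoint?: string;
}

type FenceResult = "ok" | "fenced";

// Apply a worker's patch only if its epoch is not older than the job's.
function applyFencedPatch(
  job: FencedJob,
  callerEpoch: number,
  patch: { checkpoint?: string },
): FenceResult {
  if (callerEpoch < job.leaseEpoch) {
    return "fenced"; // the job was reassigned after this worker's lease
  }
  Object.assign(job, patch);
  return "ok";
}
```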
**Heartbeat.** `heartbeat(factory)` upserts `lastHeartbeatAt` + health/load;
`isFactoryStale` detects a factory that has missed its heartbeats.

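
The heartbeat upsert and staleness check can be sketched as below. The README does not state the staleness threshold, so `staleAfterMs` is a caller-supplied assumption, and `recordHeartbeat`, `isStaleFactory`, and the `FactoryRecord` fields beyond those named in the table are hypothetical.

```typescript
type Health = "ok" | "degraded" | "down";

// Hypothetical subset of a fleet_factories document.
interface FactoryRecord {
  factoryId: string;
  productId: string;
  health: Health;
  load: number;
  lastHeartbeatAt: number; // epoch ms
}

// Upsert: create or overwrite the factory record, stamping the heartbeat time.
function recordHeartbeat(
  registry: Map<string, FactoryRecord>,
  beat: Omit<FactoryRecord, "lastHeartbeatAt">,
  now: number,
): FactoryRecord {
  const record: FactoryRecord = { ...beat, lastHeartbeatAt: now };
  registry.set(beat.factoryId, record);
  return record;
}

// A factory is stale once more than staleAfterMs has passed since its last beat.
function isStaleFactory(factory: FactoryRecord, now: number, staleAfterMs: number): boolean {
  return now - factory.lastHeartbeatAt > staleAfterMs;
}
```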
**Reaper.** `reapExpiredLeases(now)` scans `held` leases with `expiresAt < now`,
bumps `leaseEpoch` (fencing the dead holder), returns the job to `queued` (or
`blocked` if deps are now unmet) **preserving the `checkpoint` pointer** (resume
from WIP), and marks the lease `expired`. The pass is idempotent: a reaped lease is
no longer `held`, so a second pass reaps nothing. Cosmos TTL cannot do this (it only
deletes the lease doc; it cannot requeue the job, bump the epoch, or keep the
checkpoint), so the reaper, not TTL, owns recovery.

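
The reaper pass can be sketched in memory as follows. This is an illustration under assumptions, not `coordinator.ts`: the `depsMet` flag stands in for a real dependency check, the epoch is bumped on the job document only, and `reapExpired` plus both interfaces are hypothetical names.

```typescript
interface ReapLease {
  jobId: string;
  holderFactoryId: string;
  expiresAt: number; // epoch ms
  status: "held" | "released" | "expired";
}

interface ReapJob {
  id: string;
  stage: "queued" | "blocked" | "assigned";
  leaseEpoch: number;
  checkpoint?: string;
  depsMet: boolean; // assumption: precomputed stand-in for the dep check
}

// Returns the ids of jobs whose leases were reaped in this pass.
function reapExpired(leases: ReapLease[], jobs: Map<string, ReapJob>, now: number): string[] {
  const reaped: string[] = [];
  for (const lease of leases) {
    // Only held leases that have already expired are candidates.
    if (lease.status !== "held" || lease.expiresAt >= now) continue;
    const job = jobs.get(lease.jobId);
    if (job) {
      job.leaseEpoch += 1; // fences the dead holder's later writes
      job.stage = job.depsMet ? "queued" : "blocked";
      // job.checkpoint is deliberately left untouched so work can resume.
    }
    lease.status = "expired"; // makes a second pass a no-op
    reaped.push(lease.jobId);
  }
  return reaped;
}
```

Running the function twice with the same `now` returns the job id on the first call and an empty array on the second, which is the idempotency property the paragraph describes.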
## Submit semantics (idempotency + deps)

- same `idempotencyKey` + identical `bodyMd` → returns the existing job (dedup).
- same key + different content while still `queued`/`blocked` → supersede in place.
- same key + different content once the job has progressed past `queued` → `409 Conflict`.
- a job with unmet `deps` is `blocked` (a dep is met at `shipped`, or at `testing`
  when `depsMode: soft`); submit-time cycle detection rejects cyclic graphs.

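
The rules in the list above can be sketched as three small pure functions. This is a hedged illustration, not the coordinator's submit path: content comparison is reduced to `bodyMd` equality, the non-soft mode is called `"hard"` as an assumption, the stage list is abbreviated, and `decideSubmit`, `isDepMet`, and `hasCycle` are hypothetical names.

```typescript
type SubmitStage = "queued" | "blocked" | "assigned" | "testing" | "shipped";
type SubmitDecision = "create" | "dedup" | "supersede" | "conflict";

interface ExistingJob {
  idempotencyKey: string;
  bodyMd: string;
  stage: SubmitStage;
}

// existing is the job already stored under the same idempotencyKey, if any.
function decideSubmit(existing: ExistingJob | undefined, bodyMd: string): SubmitDecision {
  if (!existing) return "create";
  if (existing.bodyMd === bodyMd) return "dedup";
  if (existing.stage === "queued" || existing.stage === "blocked") return "supersede";
  return "conflict"; // surfaced to the client as 409 Conflict
}

// A dep is met at shipped, or already at testing when depsMode is soft.
function isDepMet(depStage: SubmitStage, depsMode: "hard" | "soft"): boolean {
  if (depStage === "shipped") return true;
  return depsMode === "soft" && depStage === "testing";
}

// Depth-first search over jobId -> deps; a back edge means a cycle.
function hasCycle(deps: Map<string, string[]>): boolean {
  const state = new Map<string, "visiting" | "done">();
  const visit = (id: string): boolean => {
    const seen = state.get(id);
    if (seen === "visiting") return true;
    if (seen === "done") return false;
    state.set(id, "visiting");
    for (const dep of deps.get(id) ?? []) {
      if (visit(dep)) return true;
    }
    state.set(id, "done");
    return false;
  };
  for (const id of Array.from(deps.keys())) {
    if (visit(id)) return true;
  }
  return false;
}
```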
## REST (under `/api`, auth + productId)

`POST /fleet/jobs` · `GET /fleet/jobs` · `GET /fleet/jobs/:id` ·
`PATCH /fleet/jobs/:id` (fenced) · `POST /fleet/claim` ·
`POST /fleet/jobs/:id/lease/renew` · `POST /fleet/jobs/:id/lease/release` ·
`POST /fleet/factories/heartbeat` · `GET /fleet/jobs/:id/runs` ·
`GET /fleet/jobs/:id/events`.

## Files

`types.ts` (Zod schemas → inferred types) · `repository.ts` (per-container repos +
`revUpdate` CAS) · `coordinator.ts` (claim/lease/fence/heartbeat/reaper + submit) ·
`routes.ts` (REST) · `*.test.ts` (schema, repo, coordinator incl. the atomic-claim
race / fencing / reaper, and route inject tests).