# Agent Gigafactory — Vision & Implementation Roadmap

> **One-liner:** Evolve today's single-host `agent-queue` bash runner into a distributed **gigafactory** — a fleet of heterogeneous machines (Mac/Ubuntu/Windows), each running different coding-agent CLIs (Devin/Codex/Claude/Copilot/…), where a scheduler **auto-picks jobs from a shared inbox and routes each `.md` to the best factory × tool × profile** — built service-side on `platform-service` + `tracker-web`, with the bash runner surviving as the offline edge agent.

> **How to use this doc:** It is both a PRD and an execution checklist. Every feature is a `- [ ]` checkbox with **acceptance criteria** and a **verify gate**. A phase is "100% done" only when every box is checked, its gate passes, and the phase **Definition of Done** rubric (§16) is green. Update the progress table (§0) as you go.

---

## 0. Progress tracker

| Phase | Theme | Status | % | Gate |
| ----- | ----- | ------ | - | ---- |
| **0** | Baseline (today) | ✅ shipped | 100% | `selftest.sh` green |
| **1** | Manifest + profiles + capabilities + tracker adapter (single host) | ◐ in progress | 95% | adapter e2e + selftest |
| **2** | Coordinator as platform-service module + Cosmos + multi-factory leasing | ◐ in progress | 55% | fleet e2e + module tests |
| **3** | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ☐ not started | 0% | web e2e + router tests |
| **4** | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite |
| **5** | Self-optimizing / learned routing | ☐ not started | 0% | offline eval + A/B |

Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary. **Owners per phase: §23 · rollout/rollback: §21 · capacity & SLOs: §22/§19.**

---

## 1. Vision & metaphor

A **gigafactory** turns raw intent (`.md` task files / tracker items) into shipped software with minimal human touch.
The mental model is a physical factory network:

| Term | Meaning |
| ---- | ------- |
| **Fleet** | The whole network of machines under one control plane. |
| **Factory** | One physical/virtual machine (a Mac, an Ubuntu box, a Windows host). Has an OS, installed tools, creds, capacity. |
| **Station** | A tool/engine slot inside a factory (a Devin seat, a Codex CLI, a Claude Code session, a Copilot agent). |
| **Worker** | A single running agent process executing one job at a station. |
| **Job** | A unit of work: a prompt/`.md` + manifest (profile, scope, gates, budget). |
| **Profile** | The *role* doing the work (developer, backend engineer, UX/UI designer, QA, reviewer) = persona prompt **+** capability requirements. |
| **Capability** | A tag a factory advertises and a job requires (`os:mac`, `has:xcode`, `has:figma`, `gpu`, `engine:devin`). |
| **Lease** | A time-boxed claim of a job by a worker; expires → job is reclaimable (crash recovery). |
| **Gate** | A checkpoint a job must pass: auto-QA `verify`, human review, ship approval. |
| **Artifact** | Any captured output: commits/PRs, logs, screenshots, reports, build outputs. |

**North star:** drop work into one inbox (or file a tracker task), and the fleet figures out *where* (factory), *with what* (tool/engine), *as whom* (profile), runs it in parallel, self-heals on crash, gates quality automatically, and surfaces everything in one live control plane — while a human only approves the final ship.
```
             ┌──────────────────────── CONTROL PLANE (tracker-web) ────────────────────────┐
             │      plan/intake · roadmap · Fleet map · live logs · cost · approvals       │
             └───────────────▲───────────────────────────────────┬─────────────────────────┘
                             │ REST/SSE                          │
┌────────────────────────────┴─────── COORDINATOR (platform-service module) ───────────────┐
│ queue · scheduler/router · leases · profiles · capabilities · events · budgets (Cosmos)  │
└────────▲───────────────────────▲───────────────────────▲───────────────────────▲─────────┘
         │ claim/lease/report    │                       │                       │
┌────────┴─────────┐    ┌────────┴─────────┐    ┌────────┴─────────┐    ┌────────┴─────────┐
│ FACTORY: mac     │    │ FACTORY: ubuntu  │    │ FACTORY: windows │    │ FACTORY: mac-2   │
│ devin, claude    │    │ codex, claude    │    │ copilot, codex   │    │ devin (xcode)    │
│ [agent-queue]    │    │ [agent-queue]    │    │ [agent-queue]    │    │ [agent-queue]    │
└──────────────────┘    └──────────────────┘    └──────────────────┘    └──────────────────┘
```

---

## 2. Current state (Phase 0 baseline — already shipped)

Today's `agent-queue.sh` + `dashboard.mjs` (single host, zero-dep bash + Node):

- **Folder kanban lifecycle:** `inbox → building → review → testing → shipped` (+ `failed`).
- **Auto-QA gate:** agent rc=0 → `review/`; optional `verify:` runs in `cwd` → pass `testing/`, fail `failed/`; no verify → parks in `review/`. Manual `ship` = the human gate.
- **Per-job frontmatter:** `engine` (devin/claude/codex), `cwd`, `yolo` (→ dangerous/auto-approve), `lock` (per-repo serialization), `timeout`, `verify`.
- **Concurrency:** `AGENT_QUEUE_MAX` (default 3), per-`lock` serialization so same-repo jobs never collide.
- **State & logs:** `.state/.meta` heartbeats + `logs/.log`; git-tracked queue (audit-by-commit).
- **Interactive dashboard:** numbered selectable job list, single-key actions (promote/ship/reject/requeue), live log viewer, run/stop, all shelling out to `agent-queue.sh`.
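
The Auto-QA gate routing above can be sketched as one pure function. This is a minimal illustration, not the runner's actual code (which is bash): `routeAfterRun`, `RunOutcome`, and `Folder` are hypothetical names, and sending a non-zero agent exit code to `failed/` is an assumption, since the text only spells out the rc=0 path.

```typescript
// Hypothetical sketch of the Phase-0 gate routing; only the folder names
// and the rc=0 / verify rules come from the description above.
type Folder = "review" | "testing" | "failed";

interface RunOutcome {
  agentExitCode: number;   // exit code of the agent CLI
  verifyExitCode?: number; // undefined when the job has no `verify:` line
}

function routeAfterRun(outcome: RunOutcome): Folder {
  // Assumption: a non-zero agent exit lands in failed/ (not stated in the text).
  if (outcome.agentExitCode !== 0) return "failed";
  // No verify command: the job parks in review/ until a human acts.
  if (outcome.verifyExitCode === undefined) return "review";
  // Verify ran: pass goes to testing/, fail goes to failed/.
  return outcome.verifyExitCode === 0 ? "testing" : "failed";
}
```

A job never reaches `shipped` through this function; that step stays behind the manual `ship` gate.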

**Carries forward:** the `.md`-in-`inbox` UX, frontmatter contract, lifecycle stage names, `verify` gate, lock/affinity concept, the bash runner itself (becomes the factory agent).

**Must change for the fleet:** single-host run loop → distributed leasing; file-only state → service + Cosmos; one engine choice → capability/profile routing; local dashboard → shared control plane.

- [x] Phase 0 complete — baseline shipped and self-tested. *(reference, not a work item)*

---

## 3. Goals & non-goals

**Goals**

- One intake, many machines: parallel execution across heterogeneous OS/tools.
- Automatic routing to the best `factory × tool × profile` with affinity, fairness, budget, and health awareness.
- Self-healing (lease expiry/requeue), quality gates, and full observability.
- Reuse the ByteLyst stack (`platform-service`, Cosmos, `@bytelyst/*`, tracker-web) — no parallel infra.
- Preserve offline/zero-dep edge operation via the bash runner.

**Non-goals**

- Not a CI/CD replacement (it *triggers* CI; CI still gates merges).
- Not a general-purpose workflow engine (scoped to coding-agent execution).
- Not a model/inference host (it orchestrates agent CLIs, doesn't serve models).
- Not abandoning the simple `.md` mental model — humans still drop files / file tasks.

---

## 4. Core concepts contract (must hold across all phases)

- [ ] Every job has a stable **id**, an immutable **manifest**, and an append-only **event log**.
- [ ] Every Cosmos document carries `productId` (ByteLyst rule).
- [x] A job in flight is always covered by exactly one **lease**; no live lease → reclaimable.
- [x] **Atomic claim:** a job is assigned to exactly one worker via optimistic concurrency (Cosmos `_etag`/`If-Match` or a conditional `fleet_leases` insert keyed by `jobId`). Concurrent claimers — exactly one wins; losers retry the next candidate.
- [x] **Fencing token:** every lease carries a monotonic `leaseEpoch`.
Every report/commit/ship carries its epoch; the coordinator **rejects writes from a stale epoch**, so a partitioned or zombie worker cannot corrupt state after its lease was reclaimed.
- [ ] **Coordinator-authoritative time:** all lease/TTL/SLA math uses server timestamps, never factory clocks (clock-skew safety).
- [ ] Lifecycle stages are canonical and shared: `queued → assigned → building → review → testing → shipped` (+ `blocked`, `failed`, `dead_letter`).
- [ ] The bash runner and the service speak the **same manifest + event vocabulary** (one schema, two transports).

> **Implementation status (2026-05-29) — Phase 2 Foundation merged** (common-plat PR #28, `platform-service/src/modules/fleet/`): all 7 `fleet_*` containers (§13) ✓; repositories + coordinator (claim/lease/fence/heartbeat/reaper) ✓; idempotency + deps + submit-time cycle detection ✓; 50 module tests green.

> **✓ P0 hardening landed (2026-05-29, common-plat PR #29) — atomic claim is now truly concurrency-safe.** Added `updateIfMatch` to `@bytelyst/datastore`: Cosmos conditions the replace on `_etag` via `accessCondition {type:'IfMatch'}` (412 → conflict) plus a rev compare for the pre-read window; the Memory provider does `get→compare→set` in one synchronous block (no `await` between), so concurrent callers cannot interleave. `fleet` `revUpdate*` now write conditionally. Proven by `Promise.all` 2-contender + N-claimer stress + concurrent `claimNextJob`/lease-renew tests (these **fail** on the old read-check-write, pass now). datastore 48 + fleet 53 green; full workspace build/test clean; no consumer regressed. **P2-S3 (factory integration) is now unblocked.**

---

## 5. The evolved Job manifest (feature)

Extend today's frontmatter into a richer, **backward-compatible** manifest. Old `.md` files keep working (new fields optional with sane defaults).

```yaml
---
# --- existing (unchanged) ---
engine: devin                      # explicit engine; overrides profile/engine-class
cwd: /abs/path/repo
yolo: true
lock: my-repo
timeout: 45m
verify: pnpm -s test
# --- new ---
profile: backend-engineer          # role: persona + capability requirements
engine-class: agentic-coder        # abstract; scheduler picks a concrete engine if `engine` unset
capabilities: [os:any, node>=20]   # hard requirements a factory MUST satisfy
prefers: [factory:mac-2]           # soft routing hints (affinity)
priority: high                     # critical|high|medium|low → SLA + preemption
budget: { usd: 5, tokens: 2M, wall: 4h }
# wall = HARD ceiling (always enforceable). usd/tokens = best-effort
# caps: enforced only where the engine/provider exposes live metering;
# otherwise estimated from provider usage APIs post-hoc + alerted.
deps: [job-123, job-456]           # DAG: don't start until these reach `shipped`/`testing`
idempotency-key: nomgap-ux-2       # dedupe: a second identical submit is a no-op
retry: { max: 2, backoff: 5m, on: [timeout, verify_failed] }
review-policy: manual              # auto|manual|reviewers:[@alice]
artifacts: [coverage, screenshots] # what to capture beyond commits
tracker-item: ITEM-789             # link back to the originating tracker task
---
```

- [ ] Define the manifest schema (Zod in the service; documented YAML for `.md`).
- [x] Backward-compat: a Phase-0 `.md` (only `engine/cwd/yolo`) parses with all new fields defaulted. *(P1-S1: bash runner; Zod schema still P2. selftest backward-compat case green.)*
- [x] **Capability grammar** defined: tokens are `key` (presence, e.g. `has:xcode`), `key:value` (e.g. `os:mac`, `engine:devin`), or `key<op>version` with `op ∈ {>=,>,=,<=,<}` (e.g. `node>=20`). `os:any` is a wildcard that matches every factory. A job matches a factory iff every required token is satisfied by the factory descriptor.
*(P1-S1: `caps_match`/`detect_capabilities` in `agent-queue.sh`.)*
- [x] **`engine-class` taxonomy** defined as an enum (`agentic-coder`, `chat-coder`, `review-only`) with a documented engine→class map (`devin,claude,codex → agentic-coder`; `copilot → chat-coder`). If `engine` is set it wins; else the scheduler picks any free engine in the class honoring `prefers-engine`. *(P1-S1: `resolve_engine`; `review-only` mapping reserved.)*
- [x] **`idempotency-key` semantics:** `key + content-hash` identical ⇒ no-op (returns existing job). Same `key`, **different** content ⇒ **rejected with 409** unless the prior job is still `queued`/`blocked` (then it is superseded). A re-`run`/`retry` of an existing job is **not** a new submit and never trips dedupe. *(P1-S1: add-time dedupe; bash maps "409" → clear error, `queued` → still in `inbox/` ⇒ superseded.)*
- [x] **`deps` semantics:** a dep is satisfied when it reaches `shipped` (default) or `testing` if `deps-mode: soft`. Submit-time **cycle detection** rejects cyclic graphs; unmet deps put the job in `blocked` (not `queued`). Cross-factory deps require the coordinator (P2); single-host deps work in P1. *(P1-S2: `deps_unmet` skip-with-reason in selection + `status` surfacing; `deps_would_cycle` on `add`. Cross-machine deps remain P2.)*
- **Acceptance:** a manifest fixture suite parses/validates; invalid manifests fail with precise errors; capability-grammar + dep-cycle + idempotency-conflict cases covered.
- **Verify gate:** schema unit tests (≥ 1 per field incl. defaults + 5 invalid cases + grammar/cycle/409 cases).

---

## 6. Profiles — persona + capability (feature)

A **profile** = a versioned file combining a persona (system-prompt overlay), required capabilities, default gates, preferred engine/model, and allowed repo scopes. Stored as `profiles/.md` (Phase 1) → Cosmos `profiles` container (Phase 2).

```yaml
# profiles/backend-engineer.md
---
name: backend-engineer
persona: |
  You are a senior backend engineer. Favor minimal, well-tested changes...
capabilities: [node>=20, has:pnpm]
default-verify: pnpm -s typecheck && pnpm -s test
engine-class: agentic-coder
prefers-engine: [devin, claude]
allowed-scope: ["backend/**", "packages/**"] # blast-radius guardrail
review-policy: manual
---
```

- [x] Author starter catalog: `developer`, `backend-engineer`, `frontend-engineer`, `ux-designer`, `ui-designer`, `qa`, `reviewer`, `docs-writer`. *(P1-S2: `profiles/*.md` + a reserved `planner`.)*
- [x] Persona overlay is **prepended** to the job body before the agent runs; secrets are never written to logs or the event stream (redaction at the source). *(P1-S2: `profile_persona` prepended to the stripped body file.)*
- [x] Profile supplies default `verify`, `capabilities`, `engine-class`, `allowed-scope` when the job omits them. *(P1-S2: `fm_eff` — also `prefers-engine` + `review-policy`; job fields always override.)*
- [ ] Profile versioning: changing a profile doesn't mutate in-flight jobs (snapshot at assign time). *(P2 — needs Cosmos snapshot at assign time.)*
- [x] `allowed-scope` enforced as a guardrail (warn in P1, enforce/deny in P2 via pre-flight diff check). *(P1-S2: `scope_check` post-run WARN-only + `scope_warning=` in meta; `path_in_scope` unit-testable.)*
- **Acceptance:** a job with `profile: backend-engineer` and no `verify` inherits the profile's verify + persona.
- **Verify gate:** profile-resolution unit tests; persona-injection golden test.

---

## 7. The scheduler / router (the heart) (feature)

Given a `queued` job and the current fleet, choose `(factory, station/engine, profile)` and issue a lease.

**Inputs:** job manifest (capabilities, priority, budget, deps, prefers, lock), profile requirements, live factory descriptors (capabilities, load, health, cost class), lock/affinity table, fairness counters.

**Algorithm (deterministic, explainable):**

1. **Filter** factories by **hard capability match** (job ∪ profile capabilities ⊆ factory capabilities) and free station for a compatible engine.
2. **Block** if `deps` unmet or `lock` already held → leave `queued`/`blocked`.
3. **Score** each candidate factory:
   `score = w1·capabilityFit + w2·affinity(prefers, repo-stickiness) + w3·(1/load) + w4·costFit(budget) + w5·health − w6·starvationPenalty`
4. **Tie-break:** highest priority job first; then oldest; then lowest cost class.
5. **Assign atomically** → create the lease under an optimistic-concurrency guard (`_etag`/`If-Match` or conditional insert keyed by `jobId`) **with a fresh `leaseEpoch`**; on conflict another factory won → retry the next candidate. Set job `assigned`, decrement station/seat capacity, bump fairness counter. Use **coordinator-authoritative timestamps** only.
6. **Preemption (P3+):** a `critical` job may pause a `low` job at a needed station (checkpoint + requeue, bumping the preempted job's `leaseEpoch`).

> **Phasing:** Phase 2 ships the deterministic **filter + atomic-assign core** (fixed weights). Phase 3 adds **tunable weights, preemption, and the explainability UI**. Phase 5 learns the weights (§14).

- [ ] Implement pure, unit-testable scoring function (no I/O) with configurable weights.
- [ ] Hard-filter correctness: never assign a job to a factory missing a required capability.
- [ ] Affinity/stickiness: same-repo jobs prefer the factory that has the warm checkout (lock-aware).
- [ ] Fairness: no factory or product starves under sustained load (counter + penalty).
- [ ] Explainability: every assignment records *why* (matched caps, score breakdown) in the event log.
- [ ] Determinism: same inputs → same decision (seeded tie-breaks) for testability.
- [ ] Define **factory health** ∈ [0,1] = f(heartbeat freshness, recent run failure-rate, resource pressure); factories below a health floor are **filtered out**, not merely down-weighted.
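
The filter and score steps of the algorithm above can be sketched as a pure function with no I/O. This is a simplified illustration under stated assumptions, not the module's implementation: all type and function names are hypothetical, the matcher handles only exact tokens plus the `os:any` wildcard (no version operators), `capabilityFit` is assumed to mean "how tightly the factory fits the job", `costFit` and `starvationPenalty` are taken as precomputed inputs, the health floor of 0.5 is invented for the example, and `1/load` is shifted to `1/(1+load)` so an idle factory does not divide by zero.

```typescript
interface Job {
  id: string;
  capabilities: string[]; // hard requirements (job and profile combined)
  prefers: string[];      // soft hints such as "factory:mac-2"
}

interface Factory {
  id: string;
  capabilities: string[];
  freeStations: number;
  load: number;              // running workers, >= 0
  health: number;            // 0..1
  costFit: number;           // 0..1, assumed precomputed against the job budget
  starvationPenalty: number; // >= 0, assumed derived from fairness counters
}

type Weights = [number, number, number, number, number, number]; // w1..w6

const HEALTH_FLOOR = 0.5; // hypothetical threshold

// Step 1 (simplified): every required token must be advertised by the factory.
function meetsHardCaps(job: Job, f: Factory): boolean {
  const have = new Set(f.capabilities);
  return job.capabilities.every((c) => c === "os:any" || have.has(c));
}

// Step 3: weighted sum mirroring the score formula.
function scoreFactory(job: Job, f: Factory, w: Weights): number {
  const needed = job.capabilities.filter((c) => c !== "os:any").length;
  const capabilityFit = needed / Math.max(1, f.capabilities.length);
  const affinity = job.prefers.includes(`factory:${f.id}`) ? 1 : 0;
  const loadTerm = 1 / (1 + f.load);
  return (
    w[0] * capabilityFit + w[1] * affinity + w[2] * loadTerm +
    w[3] * f.costFit + w[4] * f.health - w[5] * f.starvationPenalty
  );
}

// Filter, score, then sort with a deterministic tie-break on factory id.
function pickFactory(
  job: Job, fleet: Factory[], w: Weights
): { factoryId: string; score: number } | null {
  const ranked = fleet
    .filter((f) => f.freeStations > 0 && f.health >= HEALTH_FLOOR && meetsHardCaps(job, f))
    .map((f) => ({ factoryId: f.id, score: scoreFactory(job, f, w) }))
    .sort((a, b) => b.score - a.score || a.factoryId.localeCompare(b.factoryId));
  return ranked[0] ?? null;
}
```

Keeping the function free of I/O means scenario fixtures can call `pickFactory` directly; the atomic lease creation in step 5 would wrap it rather than live inside it.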
- [ ] **Station/seat capacity:** a factory's free stations = `min(host slots, per-engine seat limits)` (e.g. licensed Devin/Claude seats); the scheduler never over-subscribes a seat-limited engine.
- [ ] **Distributed lock:** the Phase-0 local `lock` becomes a **coordinator-held lock** so same-`lock` jobs serialize across the whole fleet (prevents two factories pushing the same repo concurrently).
- **Acceptance:** scenario fixtures (10+) produce expected assignments incl. starvation, capability-miss, seat-exhaustion, unhealthy-factory exclusion, and budget-exceed; a concurrent-claim race test proves exactly one winner.
- **Verify gate:** router unit suite ≥ 95% branch coverage on the scoring/filter core; atomic-claim race test.

---

## 8. Factory model & registration (feature)

Each machine runs a **factory agent** (the evolved `agent-queue` runner) that registers, heartbeats, claims jobs, and reports events.

- [ ] **Capability auto-detection** at boot: OS, installed engines (devin/claude/codex/copilot), tool probes (xcode, figma-cli, docker, gpu), node/pnpm versions, available creds (presence only, never values).
- [ ] **Enrollment / bootstrap trust**: first registration authenticates with a one-time enrollment secret (or an operator-issued platform JWT). The factory then receives a **scoped, rotatable factory token** (`jose` JWT); decommission = revoke. No standing shared secret in the queue.
- [ ] **Registration**: `POST /fleet/factories` with descriptor → receives a factory id + token.
- [ ] **Heartbeat**: periodic `PUT /fleet/factories/:id/heartbeat` (load, free stations, health). A **coordinator lease reaper** (not Cosmos TTL) sweeps `expiresAt < now` and reclaims, **bumping `leaseEpoch`** so the dead/zombie worker is fenced; a factory missing N heartbeats is marked `offline` and all its leases reclaimed.
- [ ] **Claim loop**: `POST /fleet/leases/claim` advertising capabilities/free stations; atomic (exactly one winner, §4); receives a job + lease TTL + `leaseEpoch`. Use **claim backoff / long-poll** to bound Cosmos RU under many idle factories (see §22); broker push replaces polling in P4.
- [ ] **Report**: stream stage/log/event back (`POST /fleet/runs/:id/events`), **echoing `leaseEpoch`** (stale epoch → 409, worker self-aborts); renew lease while alive.
- [ ] **Environment prep**: before `verify`, the factory ensures deps are installed (cold checkout → `pnpm install`); prep time counts against `budget.wall`.
- [ ] **Graceful drain**: factory can stop claiming, finish in-flight, deregister.
- **Acceptance:** a factory enrolls, claims a matching job, heartbeats, completes; a killed factory's job is reclaimed by another within the lease TTL and the killed worker's late report is **rejected by fencing**.
- **Verify gate:** factory-agent integration test against a mock coordinator; crash-recovery + fencing-rejection test.

---

## 9. Coordination architecture (decision + path)

Three transports were evaluated. **Decision: platform-service-native coordinator is the spine; git-queue stays for the offline edge; broker added only at scale.**

| Option | Pros | Cons | Verdict |
| ------ | ---- | ---- | ------- |
| (a) **Git-synced queue** (evolve folders) | zero infra, audit-by-commit, offline | weak/racy leasing, latency, merge churn | **Edge/offline only** |
| (b) **Coordinator service** (platform-service module) | real leases, fairness, observability, reuses auth/Cosmos/productId | a service to run | **Chosen spine (P2)** |
| (c) **Message broker** (NATS/Redis/SQS) | scale, backpressure, push dispatch | most moving parts/ops | **P4 when throughput demands** |

- [ ] Document the decision + rationale in-repo (this section is the canonical record).
- [ ] Define the **claim/lease protocol** once; both git-queue (poll) and service (API) implement it.
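
The claim, reap, and fenced-report rules from §4 and §8 can be sketched with an in-memory table. This is a single-process illustration under stated assumptions, not the coordinator's code: `LeaseTable` and its method names are hypothetical, the caller passes `now` from the coordinator's clock, and the real service relies on Cosmos `_etag`/`If-Match` for atomicity where this sketch relies on synchronous execution.

```typescript
interface Lease {
  jobId: string;
  holder: string;     // factory id
  leaseEpoch: number; // fencing token, strictly increasing per job
  expiresAt: number;  // coordinator time, milliseconds
}

class LeaseTable {
  private leases = new Map<string, Lease>();
  private epochs = new Map<string, number>();

  // Claim succeeds only when no live lease covers the job.
  claim(jobId: string, factoryId: string, now: number, ttlMs: number): Lease | null {
    const current = this.leases.get(jobId);
    if (current !== undefined && current.expiresAt > now) return null;
    const leaseEpoch = (this.epochs.get(jobId) ?? 0) + 1;
    this.epochs.set(jobId, leaseEpoch);
    const lease: Lease = { jobId, holder: factoryId, leaseEpoch, expiresAt: now + ttlMs };
    this.leases.set(jobId, lease);
    return lease;
  }

  // Reaper sweep: reclaim leases with expiresAt < now and bump the epoch,
  // so a late write from the old holder no longer matches.
  reap(now: number): string[] {
    const reclaimed: string[] = [];
    this.leases.forEach((lease, jobId) => {
      if (lease.expiresAt < now) reclaimed.push(jobId);
    });
    for (const jobId of reclaimed) {
      this.leases.delete(jobId);
      this.epochs.set(jobId, (this.epochs.get(jobId) ?? 0) + 1);
    }
    return reclaimed;
  }

  // A report is accepted only while a lease exists and the echoed epoch is
  // current; a false result corresponds to the 409 that makes a worker self-abort.
  acceptReport(jobId: string, leaseEpoch: number): boolean {
    const lease = this.leases.get(jobId);
    return lease !== undefined && lease.leaseEpoch === leaseEpoch;
  }
}
```

Because the epoch only ever grows, a worker that was partitioned away can be identified from the number it echoes, without comparing clocks.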
- [ ] **Split-brain / network-partition safety:** a partitioned factory may keep running and even `git push`. `idempotency-key` dedupes *submits* but cannot undo *side-effects*. Mitigation: **fencing** — the coordinator rejects `ship`/merge reports from a stale `leaseEpoch`, and the distributed `lock` (§7) prevents a reclaimed-job's twin from pushing the same repo. Residual risk (a stale push to a feature branch) is contained by the PR-merge ship gate (§10) and surfaced for human triage.
- [ ] **Offline-degrade**: a factory cut off from the coordinator falls back to its local git-queue and reconciles on reconnect; on reconnect it presents its `leaseEpoch` — if reclaimed, its results are quarantined, not auto-merged.
- [ ] **Poll cost**: bound claim-loop RU via long-poll/backoff (§22); migrate to broker push at P4.
- **Acceptance:** the same job manifest runs identically through the bash/git path and the service path; a simulated partition does not double-merge (fencing test).
- **Verify gate:** contract test asserting protocol parity (git vs service) + partition/fencing test.

---

## 10. tracker-web / platform-service integration (committed path)

**Layering:** tracker = *WHAT/WHY* (plan, intake, prioritize, roadmap, votes) · gigafactory = *HOW* (execute) · platform-service = shared brain · agent-queue runner = offline edge.

Grounded in the real `tracker-service` model (`Item`: `type` bug/feature/**task**, `status` open/in_progress/done/closed/wont_fix, priority, labels, assignee, `source` incl. **auto_detected**, votes, comments, public roadmap) and the `tracker-web` `/api/tracker/[...path]` proxy pattern.

### Phase 1 — Adapter (no new infra)

- [x] **task → job**: a tracker `Item` of `type: task` (e.g. `assignee: @agent` or label `agent:run`) is exported to a job `.md` (manifest mapped: title/description → body, priority → priority, labels → capabilities/profile hints).
*(P1-S4: `aq from-tracker`; labels `engine-class:`/`profile:`/`priority:`/`cap:` → frontmatter.)*
- [x] **job → tracker**: lifecycle events post back as **status updates + comments** — `building` → status `in_progress` + comment "started on factory X"; `shipped` → `done` + comment with commit SHAs / PR link / verify results; `failed` → comment with reason (status stays `in_progress` for human triage). *(P1-S4: `aq to-tracker` PATCHes status + posts a metrics-only comment; one-way echo §24.5; never fatal. The items API has no blocked/failed status, so failures map to `wont_fix` by default — override via `AQ_TRACKER_STATUS_FAILED`.)*
- [x] Idempotency: re-running the adapter for the same item doesn't create duplicate jobs (idempotency-key = item id + content hash). *(P1-S4: derived `idempotency-key: tracker-` reuses Slice 1 dedupe; `to-tracker` is idempotent via `tracker_echoed`.)*
- [x] Adapter is a thin script/CLI (`aq from-tracker ITEM-789`) + optional poller. *(P1-S4: `from-tracker`/`to-tracker` + opt-in `AQ_TRACKER_AUTO` auto-echo; a standalone poller is deferred.)*
- **Acceptance:** filing a tracker task, marking it `agent:run`, results in a queued job; on ship, the item flips to `done` with a SHA comment.
- **Verify gate:** adapter e2e against a tracker-service test instance (or mock); round-trip assertion.
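
The task → job export above can be sketched as a small mapping function. This is an illustration under stated assumptions, not the adapter itself (which is a curl-based bash CLI): `TrackerItem` and `itemToJobMd` are hypothetical names, the item shape is reduced to the fields this section mentions, a `priority:` label is assumed to override the item's own priority, and the derived `idempotency-key` is left out because its exact format is not given here.

```typescript
// Hypothetical, reduced view of a tracker Item.
interface TrackerItem {
  id: string;
  title: string;
  description: string;
  priority?: string;
  labels: string[];
}

// Return the text after `prefix`, or undefined when the label does not match.
function labelValue(label: string, prefix: string): string | undefined {
  return label.startsWith(prefix) ? label.slice(prefix.length) : undefined;
}

function itemToJobMd(item: TrackerItem): string {
  const caps: string[] = [];
  let profile: string | undefined;
  let engineClass: string | undefined;
  let priority = item.priority;

  for (const label of item.labels) {
    const cap = labelValue(label, "cap:");
    if (cap !== undefined) caps.push(cap);
    profile = labelValue(label, "profile:") ?? profile;
    engineClass = labelValue(label, "engine-class:") ?? engineClass;
    // Assumption: an explicit label wins over the item's priority field.
    priority = labelValue(label, "priority:") ?? priority;
  }

  const frontmatter: string[] = [];
  if (profile !== undefined) frontmatter.push(`profile: ${profile}`);
  if (engineClass !== undefined) frontmatter.push(`engine-class: ${engineClass}`);
  if (caps.length > 0) frontmatter.push(`capabilities: [${caps.join(", ")}]`);
  if (priority !== undefined) frontmatter.push(`priority: ${priority}`);
  frontmatter.push(`tracker-item: ${item.id}`);

  // Title and description become the job body, after the frontmatter block.
  return ["---", ...frontmatter, "---", `# ${item.title}`, "", item.description].join("\n");
}
```

Labels that carry no recognized prefix, such as `agent:run`, are ignored by the mapping; in this sketch they only decide whether the item is exported at all, which happens before this function is called.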

**Stage → tracker status mapping** (tracker's enum is coarser than the fleet's; keep fine-grained stage in a label + comment so no detail is lost):

| Fleet stage | Tracker `status` | Extra |
| ----------- | ---------------- | ----- |
| `queued` / `assigned` / `blocked` | `in_progress` | label `fleet:` |
| `building` / `review` / `testing` | `in_progress` | label `fleet:` + progress comment |
| `shipped` | `done` | comment with SHA(s)/PR link/verify result |
| `failed` / `dead_letter` | `in_progress` + label `needs-triage` | never auto-`closed`/`wont_fix` (humans decide) |

**Ship semantics (PR flow):** `shipped` = change **merged to target branch with CI green** (default), OR `pr-opened` when `review-policy` defers merge to humans/CI — configurable per profile. This honors the non-goal that CI still gates merges (§3); the agent never bypasses branch protection.

### Phase 2 — Native spine

- [ ] Stand up a `fleet` (a.k.a. `orchestrator`) module **inside platform-service**, sibling to `tracker-service`: pattern `types.ts → repository.ts → routes.ts`, ESM, Cosmos, `productId`, `req.log`.
- [ ] Endpoints: jobs CRUD, claim/lease, events/report, factories register/heartbeat, profiles, stats.
- [ ] Runners (bash + any) become API clients of this module; tracker adapter calls it directly.
- **Acceptance:** a job submitted via the module is claimed by a real factory and shipped, with all state in Cosmos.
- **Verify gate:** module test suite (repository + routes) using the shared `@bytelyst/testing` inject helpers.

### Phase 3 — Unified control plane

- [ ] Add a **Fleet** surface to `tracker-web` reusing auth/Primitives/DataTable/product switcher: fleet map (factories + load/health), job table, job DAG, **live log streaming**, lease/heartbeat status, cost burndown, approve/ship buttons.
- [ ] **Streaming caveat (correctness):** live logs **must not** use the existing buffering catch-all proxy `/api/tracker/[...path]` — it does `res.text()` and would never stream.
Use a **dedicated Next.js Route Handler returning a `ReadableStream` (SSE)** or a direct SSE/WebSocket to platform-service. Full logs are shipped to blob storage (§17); the endpoint serves stored tail + live append.
- [ ] The Node TUI dashboard becomes a thin client of the same `/fleet` API (parity with web).
- **Acceptance:** an operator can watch all factories + tail any job log + ship from the browser.
- **Verify gate:** web e2e (Playwright) covering fleet map render, live log, and a ship action.

---

## 11. Lifecycle & gates at scale (feature)

- [ ] Canonical stages enforced server-side: `queued → assigned → building → review → testing → shipped` (+ `blocked`, `failed`, `dead_letter`); transitions validated (illegal transition → 409).
- [ ] Per-profile default `verify`; per-job override; verify runs at the factory, result reported as an event.
- [ ] Human gates: `review-policy` routes to reviewers; multi-reviewer support (P3).
- [x] **Dead-letter**: after `retry.max` exhausted, job → `dead_letter` with full diagnostics; never silently dropped. *(P1-S3 single-host stand-in: `failed/` `result=retries_exhausted`, WIP branch + full log preserved.)*
- [ ] **Backpressure**: when no factory can take more, jobs stay `queued` (no thrash); SLA timers visible.
- [ ] **Ship semantics** are profile-configurable (merged+green vs `pr-opened`, §10); `shipped` is terminal-success, `dead_letter` terminal-failure; `blocked` (unmet deps) is distinct from `queued`.
- [x] **Retry vs idempotency**: a retry creates a new `fleet_runs` attempt under the same job/`idempotency-key` (never a duplicate job); backoff honored; `retry.on` filters which failure classes retry.
*(P1-S3 single-host: `attempts` counter survives requeue; `backoff`→`next_eligible` gates selection; `on` filters timeout/verify_failed/crash.)*
- **Acceptance:** a perpetually-failing job lands in `dead_letter` after configured retries; a passing one auto-advances to `testing` then waits for human `ship`; an illegal transition is rejected.
- **Verify gate:** lifecycle state-machine unit tests (all transitions + illegal-transition rejection + retry/dead-letter path).

---

## 12. Security, safety & governance (feature — critical with `yolo`/dangerous)

- [ ] **Secret isolation**: creds live on each factory (env/keychain), **never** in the queue, manifest, logs, or Cosmos. Factory advertises *presence* of a cred capability, not the value.
- [ ] **Scoped git tokens** per factory/repo; least-privilege; rotation documented.
- [ ] **Push policy**: protected branches; agents push to feature branches + open PRs by default; direct-to-main gated by profile/flag.
- [ ] **Blast-radius guardrail**: enforce `allowed-scope` — pre-flight + post-run diff check; out-of-scope changes block the ship gate.
- [ ] **Budget kill-switch**: exceed `budget` (usd/tokens/wall) → pause worker, alert, require human resume.
- [ ] **Supply-chain safety**: edits to shared `@bytelyst/*` packages require `reviewer` profile + human gate (never auto-ship).
- [ ] **Audit trail**: append-only event log per job (who/what/when/where/cost); immutable.
- [ ] **Corp network/proxy**: honor `NETWORK`/proxy + truststore conventions on factories that need them.
- [ ] **Kill switch (global)**: one command/flag halts all claiming fleet-wide (incident response).
- **Acceptance:** a job attempting an out-of-scope edit is blocked at the gate; a budget overrun pauses and alerts; no secret ever appears in any persisted artifact (scanner test).
- **Verify gate:** security test suite incl. a secret-leak scanner over logs/meta + scope-enforcement test.

---

## 13. Data model (Cosmos containers, P2+)

Each container partitioned sensibly; every doc has `productId`.

- [x] `fleet_jobs` (pk `/productId`) — manifest snapshot **+ the full instruction body verbatim as markdown (`bodyMd`)**, current stage, idempotency-key, tracker-item link, `checkpoint` pointer (WIP branch/commit). This is the **durable source of truth for instructions** — a factory holds only a transient materialized copy, so a machine going down loses nothing (§25).
- [x] `fleet_runs` (pk `/jobId`) — one per execution attempt: factory, engine, profile snapshot, timings, exit, verify result, **and execution insights: model, tokensIn/Out (+cached), cost (`estimated` flag), turns, tool-call counts, filesChanged, linesAdded/Deleted, attempt number** (§26).
- [x] `fleet_leases` (pk `/jobId`) — holder factory, `expiresAt`, **`leaseEpoch` (fencing)**, renewals. **Reclaim via a coordinator reaper** that scans `expiresAt < now` — Cosmos TTL only garbage-collects stale rows, it **cannot trigger reclaim logic**. Claim guarded by `_etag`/`If-Match`.
- [x] `fleet_factories` (pk `/productId`) — descriptor, capabilities, health, load, last heartbeat, seat limits.
- [x] `fleet_profiles` (pk `/productId`) — versioned profile snapshots (immutable per version).
- [x] `fleet_events` (pk `/jobId`) — append-only audit/event stream (stage changes, log pointers, cost ticks, scheduler decisions).
- [ ] `fleet_artifacts` (pk `/jobId`) — pointers to **blob-stored** logs + artifacts (coverage, screenshots, build output). Large logs live in `@bytelyst/blob`, **never** inline in Cosmos (doc-size + RU limits).
- [ ] Relate to existing tracker `Item` via `tracker-item` (no duplication of planning data).
- [x] **Optimistic concurrency** (`_etag`) on every job stage transition + lease claim to prevent lost updates / double-assignment.
*(PR #29: `updateIfMatch`.)*
- [ ] **Indexing/RU**: the claim query is hot — index `stage`, `priority`, `capabilities`; avoid cross-partition fan-out; provision RU/s per §22.
- **Acceptance:** repository CRUD + query tests per container; **atomic-claim race test (N concurrent claimers → exactly one wins)**; reaper-reclaim + fencing-rejection test; lease-expiry verified via reaper (not TTL).
- **Verify gate:** repository unit/integration tests (memory + Cosmos provider via `DB_PROVIDER`).

---

## 14. Phased build roadmap (checklists)

Each phase: **Goal → checklist → Exit criteria**. Don't start a phase until the prior phase's Exit criteria are green. Tick boxes here as the canonical progress.

### Phase 1 — Manifest + profiles + capabilities + tracker adapter (single host)

**Goal:** richer single-host runner that understands profiles/capabilities and bridges to tracker — no distributed infra yet.

> **Slice progress — P1-S1:** manifest parsing (all §5 fields, defaulted + backward-compatible), `priority` ordering, capability detection+match gate, `engine-class` resolution, and `idempotency-key` dedupe are **done** on the bash runner.
>
> **Slice progress — P1-S3 (resilience & insights, single host):** crash recovery (`recover_orphans` + `aq recover`), git WIP checkpoint/resume (`aq/wip/`), functional `retry` policy (backoff + `retries_exhausted`), and execution insights (`parse_usage`, per-run metrics in meta, `aq insights`, `status`/`dash` insights) are **done** — see §11/§25/§26.
>
> **Slice progress — P1-S2 (profiles + deps/DAG, single host):** the `profiles/` catalog + resolution (`fm_eff` inheritance with job>profile>default precedence, persona injection), the warn-only `allowed-scope` guardrail (`scope_check`/`path_in_scope`), and single-host `deps` (block-with-reason in selection, `status` surfacing, submit-time cycle detection) are **done** — see §5/§6.
> > **Slice progress — P1-S4 (tracker adapter, single host):** the task ↔ job round-trip is **done** (§10) — `aq from-tracker` materializes a job from a tracker Item (idempotent on `tracker-`, label→manifest mapping), `aq to-tracker` echoes status + a metrics-only comment one-way (idempotent via `tracker_echoed`, never fatal), and opt-in `AQ_TRACKER_AUTO` auto-echoes on transitions. All HTTP is curl-only through one wrapper (test seam `AQ_TRACKER_API_CMD`). **This closes the Phase-1 §14 tracker-adapter item.** Remaining P1 extras: `budget.wall` (P1-S3 left it) and Node-`dash` surfacing of the new fields. - [x] Extend `agent-queue.sh` frontmatter parsing for all new manifest fields (§5), defaulted + backward-compatible. *(P1-S1)* - [x] Add `profiles/` directory + profile resolution (persona injection, default verify/caps/scope) (§6). *(P1-S2)* - [x] Local capability detection + a job/factory capability match check before launch (§8 subset). *(P1-S1: `detect_capabilities` + `caps_match`; mismatch ⇒ `failed/` `result=capability_mismatch`, agent never launched.)* - [x] `priority` ordering in the inbox pick (replace pure FIFO with priority-then-age). *(P1-S1: `inbox_sorted`; per-lock serialization preserved.)* - [x] `deps` (DAG) blocking on a single host; `idempotency-key` dedupe on `add`. *(P1-S1 idempotency dedupe + P1-S2 `deps` blocking/cycle detection.)* - [ ] `retry` with backoff into `failed`/requeue; `budget.wall` enforced (extends `timeout`). *(P1-S3: `retry` with backoff + `retries_exhausted` DONE; `budget.wall` still pending.)* - [x] `allowed-scope` guardrail (warn-only this phase) + post-run diff report. *(P1-S2: `scope_check` WARN-only + `scope_warning=`.)* - [x] **Tracker adapter** `aq from-tracker ` + `aq to-tracker` event poster (§10 P1). *(P1-S4: curl-only `tracker_api`; from-tracker materializes a job (idempotent), to-tracker echoes status+metrics one-way; opt-in `AQ_TRACKER_AUTO`. 
A standalone background poller is deferred to P2.)*
- [ ] Dashboard shows profile + priority + capability tags + tracker-item link. *(P1-S1: `status` shows priority/profile/caps/tracker-item; P1-S4: status/insights also show last echoed tracker status; Node `dash` surfacing pending.)*
- [x] Update `selftest.sh` with: manifest parse fixtures, profile resolution, priority order, dep-block, idempotency, adapter round-trip (mock). *(P1-S1 manifest/priority/idempotency + P1-S2 profile/persona/scope/dep-block/cycle + P1-S3 resilience/insights + P1-S4 tracker from/to round-trip via stub.)*
- [x] Update README + this doc's progress table. *(P1-S1)*
- **Exit criteria:** all boxes ✅; `selftest.sh` green; a tracker task → executed → tracker `done` with SHA comment, fully on one host; no regression to Phase-0 `.md` files.

### Phase 2 — Coordinator as platform-service module + Cosmos + multi-factory leasing

**Goal:** the service spine; ≥2 real factories executing in parallel via leases.

> **Slice progress — P2-S3 (factory-agent integration, single host):** the bash runner
> is now a coordinator **factory** behind `AQ_FLEET` — `lib/fleet-client.sh` (curl-only,
> sourced) registers via heartbeat, claims jobs into inbox (interleaved with local `.md`),
> reports **fenced** stage transitions with WIP checkpoints, renews/releases leases, and on
> a stale `leaseEpoch` (reclaimed) **self-aborts + quarantines** the local result. Coordinator
> 5xx/connection errors **degrade** (finish locally) rather than abandon work. When `AQ_FLEET`
> is off the offline git-queue path is byte-for-byte unchanged. Remaining P2: scheduler/router
> core, direct tracker→module calls, factory enrollment + scoped tokens, `fleet.*` feature
> flags + shadow/dual-run, and the two-factory parallel demo (the Phase-2 exit criteria).

- [x] Scaffold `fleet`/`orchestrator` module in `platform-service` (`types/repository/routes`, Zod, ESM, `productId`).
*(PR #28)*
- [x] Cosmos containers (§13) + repository layer (memory + Cosmos providers). *(PR #28; `fleet_artifacts` blob wiring still pending.)*
- [x] **Atomic claim** (optimistic concurrency / `_etag`) + **lease reaper** + **fencing (`leaseEpoch`)** endpoints (§4/§8/§9) — *not* Cosmos-TTL-driven reclaim. *(common-plat PR #28 + #29; truly atomic via `updateIfMatch`.)*
- [x] Port `agent-queue` runner to a **factory agent** API client (enroll/register/heartbeat/claim/report, fencing-aware) while keeping git-queue fallback. *(P2-S3: `lib/fleet-client.sh` behind `AQ_FLEET`; registers via heartbeat, claims into inbox, reports fenced stage transitions, renews leases, quarantines on stale-epoch; offline git-queue unchanged when the flag is off.)*
- [ ] Scheduler/router core (§7) as a pure module (fixed weights) + wired into atomic assignment.
- [ ] Tracker adapter calls the module directly (not just file export).
- [ ] Auth: factory enrollment + scoped rotatable tokens; secret isolation enforced (§12 subset).
- [ ] **Feature flags** (`fleet.enabled`, `fleet.route_via_service`) + **shadow/dual-run** vs P1 before cutover (§21).
- [x] Module test suite (repository + routes via `@bytelyst/testing`); **atomic-claim race**, crash-recovery, fencing-rejection, reaper-reclaim tests. *(PR #28 + #29: 53 fleet + 48 datastore tests, incl. true-concurrency claim.)*
- [ ] Two-factory demo (e.g. mac + ubuntu) running 3 parallel jobs end-to-end.
- **Exit criteria:** all boxes ✅; `pnpm --filter @lysnrai/platform-service test` green; killing a factory mid-job → another reclaims and completes **and the dead worker's late report is fenced**; concurrent claimers never double-assign; all state in Cosmos with `productId`; **flag-off rollback verified** (§21).

### Phase 3 — Fleet control plane in tracker-web + DAG + budgets + scoring router

**Goal:** one browser control plane; smart routing + budgets live.

- [ ] `fleet` API client in `tracker-web` (reuse `/api/tracker`-style proxy → `/fleet`).
- [ ] Fleet map page (factories, load, health, capabilities) on `@bytelyst/*` components.
- [ ] Job table + job detail + **DAG view**; live log via **SSE**; approve/ship/reject/requeue actions.
- [ ] Cost burndown + budget kill-switch UI; multi-reviewer routing.
- [ ] Scoring router with configurable weights + explainability surfaced in UI.
- [ ] Preemption of low-priority by critical jobs (checkpoint + requeue).
- [ ] TUI dashboard re-pointed at `/fleet` API (parity).
- [ ] Web e2e (Playwright): fleet map, live log, ship, budget-pause.
- **Exit criteria:** all boxes ✅; web `verify` (typecheck+lint+test+e2e) green; an operator runs the whole 3-repo parallel workload from the browser, including a budget pause + resume.

### Phase 4 — Message bus + autoscaling + cross-OS capability marketplace

**Goal:** scale-out and elasticity.

- [ ] Introduce broker (NATS/Redis) for push dispatch + backpressure; coordinator publishes, factories subscribe by capability.
- [ ] Autoscaling hooks (spin ephemeral factories: cloud VM / container) keyed to queue depth + SLA.
- [ ] Capability "marketplace": jobs requiring rare caps (xcode/figma/gpu) routed to the few factories that have them; queueing fairness across products.
- [ ] Load + chaos test suite (factory churn, broker outage, thundering herd).
- **Exit criteria:** all boxes ✅; sustained N×throughput vs Phase 3 under load test; graceful degradation on broker outage (fallback to poll).

### Phase 5 — Self-optimizing / learned routing

**Goal:** the scheduler learns from history to cut time/cost and raise first-pass success.

- [ ] Capture outcome features per run (engine, profile, repo, duration, cost, verify pass, human-edit rate).
- [ ] Offline eval harness comparing learned vs heuristic routing on historical data.
- [ ] Shadow/A-B rollout with guardrails; auto-tune scoring weights.
- [ ] Recommendations surfaced ("route NomGap UX jobs to claude on mac-2: 23% faster, 11% cheaper").
- **Exit criteria:** all boxes ✅; learned router beats heuristic on the eval set without regressing safety gates; A/B shows measurable improvement on a target metric.

---

## 15. Cross-cutting feature catalog (quick index)

| Feature | First phase | Section |
| ------- | ----------- | ------- |
| Evolved job manifest | P1 | §5 |
| Profiles (persona + capability) | P1 | §6 |
| Capability matching | P1→P2 | §6/§8 |
| Priority + SLA | P1 | §5/§7 |
| DAG dependencies | P1→P3 | §5/§11 |
| Idempotency / dedupe | P1 | §5 |
| Retry + dead-letter | P1→P2 | §11 |
| Budgets + kill-switch | P1(wall)→P3 | §5/§12 |
| Scheduler/router scoring | P2→P3 | §7 |
| Factory registration/heartbeat/lease | P2 | §8 |
| Coordinator (platform-service module) | P2 | §9/§10 |
| Cosmos data model | P2 | §13 |
| Tracker bi-directional sync | P1→P2 | §10 |
| Web control plane + SSE logs | P3 | §10/§17 |
| Security/scope/secret isolation | P1→P2 | §12 |
| Broker + autoscaling | P4 | §14 |
| Learned routing | P5 | §14 |
| Atomic claim + fencing + distributed lock | P2 | §4/§7/§9 |
| Rollout / rollback / feature flags | P2→ | §21 |
| Capacity planning & RU/cost | P2→ | §22 |
| Ownership & RACI / on-call | all | §23 |
| Work hierarchy & composite delegation (roadmap/epic) | P3 (manual) → P5 (planner) | §24 |
| Durability, crash recovery & work preservation | P1 (orphan/retry/WIP) → P2 (lease/resume) | §25 |
| Execution insights & token accounting | P1 (capture) → P3 (rollup UI) | §26 |

---

## 16. Definition of Done — the "100% accuracy" rubric

A feature/phase is **not done** until **every** item below is true (this is the bar for "100% end-to-end"):

- [ ] **Functionality**: acceptance criteria met; happy path + documented edge cases handled.
- [ ] **Tests**: unit + integration written *first or alongside*, all green; no weakened/deleted tests; coverage targets met (router ≥95% core).
- [ ] **Verify gate**: the phase's named gate command passes locally (and in CI where applicable).
- [ ] **Idempotency & recovery**: re-runs are safe; crash mid-step recovers (lease/idempotency).
- [ ] **Security review**: secret-leak scan clean; scope guardrail honored; least-privilege tokens.
- [ ] **Observability**: events/logs/metrics emitted; failures are diagnosable from the control plane.
- [ ] **Docs**: this roadmap's checkboxes ticked; README/AGENTS updated; manifest/profile docs current.
- [ ] **Backward-compat**: existing `.md`/Phase-0 behavior unbroken (regression check).
- [ ] **Drift checks**: shared-infra templates (`.npmrc`, `docker-prep`) untouched/synced; conventional commits.
- [ ] **No `console.log`/`print`** in service code; `req.log`/`os.Logger` used; ESM `.js` imports.

---

## 17. Observability & control plane details

- [ ] **Log transport/storage**: factory ships logs to blob (`@bytelyst/blob`); `fleet_events` carries pointers + a recent-tail buffer. The control plane serves stored tail + live append (via the streaming route, **not** the buffering proxy — §10).
- [ ] **Live logs** via SSE (single stream contract) from the streaming endpoint to web/TUI.
- [ ] **Metrics**: queue depth, `blocked` count, assign latency, claim-loop RU/s, run duration, verify pass-rate, cost, factory utilization, fairness, reclaim/fencing-rejection counts.
- [ ] **Alerting**: stall (no log N min), failure spikes, budget breach, factory offline, dead-letter, **claim-race anomalies**, RU throttling (Cosmos 429s).
- [ ] **Tracing**: a job's full timeline (queued→…→shipped) reconstructable from `fleet_events` (immutable, ordered).
- [ ] **Cost burndown** per job/product/day with budget overlays.
- [ ] **SLOs defined + dashboarded** (see §19 targets); error budget tracked per SLO.

---

## 18. Risks & gaps explicitly tracked (expert call-outs)

- [ ] **Duplicate execution** across transports (git fallback + service) — `idempotency-key` (submit) + atomic lease (assign) + **fencing token** (side-effect) + distributed `lock` (push).
- [ ] **Crash recovery** — coordinator **lease reaper + fencing** (not Cosmos TTL); checkpoint long jobs where engines allow.
- [ ] **Split-brain / partition** — fencing rejects stale `leaseEpoch` writes; reclaimed-job results quarantined, not auto-merged (§9).
- [ ] **Shared-package conflicts** — two jobs editing `@bytelyst/*` simultaneously → fleet-wide `lock` + reviewer gate.
- [ ] **Starvation/fairness** — per-product + per-factory counters with penalty.
- [ ] **Cost runaway** — `budget.wall` hard ceiling everywhere; `usd`/`tokens` best-effort (provider metering) + global kill switch.
- [ ] **Cosmos RU throttling (429)** — hot claim path; bound via long-poll/backoff + indexing (§13/§22); broker offload at P4.
- [ ] **Clock skew** — coordinator-authoritative timestamps for all lease/SLA math (§4).
- [ ] **Tool-version drift / reproducibility** — record engine + tool versions per run; pin where possible.
- [ ] **Windows quirks** — path/shell differences in the factory agent; capability-gate Windows-only work.
- [ ] **Human-review bottleneck** — auto-verify as much as possible; batch review UI; reviewer routing.
- [ ] **Result capture beyond commits** — artifacts (coverage, screenshots, build logs) attached to runs.
- [ ] **Secret sprawl** — never in queue/manifest/logs/Cosmos; presence-only capabilities.
- [ ] **Data retention** — event/log retention + archival policy (extend today's `clean`).
- [ ] **Engine API churn** — engines mapped in one place (`build_agent_cmd`); capability matrix versioned.

---

## 19. Success metrics

Each metric has a **provisional SLO target** (tune with real data; tracked with an error budget):

| Dimension | Metric | Provisional SLO target |
| --------- | ------ | ---------------------- |
| Throughput | jobs shipped/day; parallel utilization | utilization ≥ 60% under backlog |
| Quality | % auto-verified; first-pass success; escaped-defect; post-agent human-edit rate | first-pass ≥ 70%; escaped-defect < 2% |
| Speed | assign latency; time queued→shipped (excl. human gate) | assign p95 < 5s; queue-wait p95 < 2m at target load |
| Cost | $/shipped job; budget-breach rate | budget-breach < 1% of jobs |
| Reliability | lease-reclaim success; dead-letter rate; factory uptime; double-execution incidents | reclaim success ≥ 99.9%; **double-merge = 0**; dead-letter < 5% |
| Fairness | max/min product wait-time ratio | ratio < 3× |
| Correctness | atomic-claim violations; fencing rejections functioning | claim violations = 0 |

> Targets are starting points; the §23 owners ratify per-phase SLOs before that phase's exit.

---

## 20. Open questions

- [ ] Copilot headless feasibility as an engine/station (CLI/automation surface?).
- [ ] Who owns merge/push authority — agents open PRs only, or auto-merge on green for low-risk profiles?
- [ ] Multi-user/tenant: per-user queues + RBAC in the control plane?
- [ ] On-call/ownership for the fleet (alerts routing, runbooks)?
- [ ] Cloud factory provisioning (Phase 4) — which provider/runtime, cost guardrails?
- [ ] Profile authorship/governance — who can create/edit profiles, and review of persona prompts?

---

## 21. Rollout, rollback & data migration

Each phase ships behind controls so it can be turned off without losing work.

- [ ] **Feature-flagged rollout**: gate each phase's new path behind a platform feature flag (`fleet.enabled`, `fleet.route_via_service`, `fleet.tracker_sync`); default off; enable per-product first.
- [ ] **Dual-run / shadow**: P2 coordinator runs in shadow (assign decisions logged, not executed) alongside the P0/P1 path before cutover; compare decisions.
- [ ] **Cutover is reversible**: a factory can fall back from service-claim to git-queue via flag; no schema-destructive step on the rollback path.
- [ ] **Data migration**: introducing Cosmos containers (P2) is **additive** — no migration of existing tracker data; backfill is read-only (link `tracker-item`, don't mutate). Container creation is idempotent (registered in `cosmos-init`).
- [ ] **Backward-compat gate**: every phase re-runs Phase-0 `selftest.sh` + a corpus of legacy `.md` files (regression).
- [ ] **Rollback drill**: each phase's exit includes a tested rollback (flag off → prior behavior, in-flight jobs drain or requeue cleanly).
- **Acceptance:** flipping `fleet.*` flags off returns the system to the prior phase's behavior with zero data loss; in-flight jobs either complete or requeue.
- **Verify gate:** rollout/rollback drill documented + a flag-off regression run is green.

---

## 22. Capacity planning & cost

- [ ] **Concurrency model**: fleet throughput = Σ factory free-stations, bounded by per-engine **seat limits** (e.g. N Devin seats) — document seat inventory per engine before P2.
- [ ] **Cosmos RU budgeting**: the claim/heartbeat paths are the hot loops. Estimate RU/s = (factories × claim-poll rate × query RU) + (factories × heartbeat rate × upsert RU); pick **long-poll interval** to keep steady-state RU within a provisioned budget; enable autoscale RU with a ceiling + 429 alerting.
- [ ] **Polling vs push**: at F factories the poll RU grows linearly — define the F threshold that triggers the P4 broker migration.
- [ ] **Blob storage**: logs/artifacts sizing + lifecycle (hot → cool → delete) per retention policy (§18).
- [ ] **Factory sizing**: per-OS resource baseline (CPU/RAM/disk for N concurrent agent sessions + warm checkouts); disk pressure as a health input.
- [ ] **Cost guardrails**: per-product spend caps + alerts; ties to `budget` and the global kill-switch.
- **Acceptance:** a documented capacity sheet (seats, RU/s, blob GB, factory specs) sized for the target steady-state + 2× burst.
- **Verify gate:** load test sustains target throughput within the RU/cost budget (no 429 storms).

---

## 23. Ownership & RACI

Owners are roles, not names — assign before each phase starts (this removes the "undefined owner" gap).

| Area | Responsible (R) | Accountable (A) | Consulted (C) | Informed (I) |
| ---- | --------------- | --------------- | ------------- | ------------ |
| Runner / factory agent (bash) | DevOps eng | Platform lead | — | All |
| Coordinator module (platform-service) | Backend eng | Platform lead | Security | All |
| Scheduler/router | Distributed-systems eng | Platform lead | Backend | All |
| Control plane (tracker-web Fleet) | Frontend eng | Platform lead | UX | All |
| Security/governance | Security eng | Security lead | Platform | All |
| Capacity/cost & SLOs | SRE | Platform lead | Finance | All |
| Profiles & persona governance | Eng leads | Platform lead | — | All |

- [ ] Each phase names its R/A before kickoff; SLOs (§19) ratified by A.
- [ ] On-call + runbooks established before the fleet runs unattended `yolo` workloads (Phase 2+).

---

## 24. Work hierarchy & composite delegation (roadmap / epic)

**Goal:** delegate work at *any* granularity — a single bug/feature/task, **or an entire roadmap** — and let the fleet decompose + orchestrate rather than hand a multi-day roadmap to one agent session (which is long-horizon, low first-pass-success, and high blast-radius under `yolo`).

### 24.1 Two delegation modes

- **Atomic** (today's model): one leaf item (`bug`/`feature`/`task`) → one job → one agent at one station.
- **Composite** (new): a `roadmap`/`epic` → a **planner** profile expands it into child jobs → the scheduler runs them as a **DAG across factories/agents/profiles**, honoring `deps` + phase gates. "Delegate the whole roadmap" = hand it to the **orchestrator**, which fans out — never one agent grinding for hours.

### 24.2 Job `kind` — the one genuinely new concept

A new axis, **orthogonal to tracker `type`**:

- **`kind: leaf`** — runs an engine at a station (everything Phase 1–2 already does).
- **`kind: composite`** — runs the **planner/orchestrator** that emits child `leaf` jobs and a dependency graph; it never itself edits a repo.

The scheduler (§7) routes by `kind`: `leaf` → station/engine; `composite` → planner. This keeps execution and planning cleanly separated.

### 24.3 Hierarchy & relationships

- [ ] `parentId` links a child job/item to its roadmap/epic; `deps` (§5) expresses ordering within it (DAG, submit-time cycle detection).
- [ ] A roadmap is, mechanically, a **named DAG of jobs + a rollup** — it reuses `deps`, profiles (§6), the scheduler (§7), and the lifecycle (§11); the only additions are `kind`, `parentId`, and rollup logic.
- [ ] Add a **`planner`/`architect`/`tech-lead` profile** (§6 catalog) for decomposition + orchestration; leaf work still uses `backend-engineer`, `ux-designer`, etc.

### 24.4 Rollup semantics (composite-level)

- [ ] **Status rollup:** roadmap `status` is derived from children — `in_progress` once any child starts; `shipped`/`done` only when **all** children reach `shipped`; surfaces `blocked`/`failed` children for triage.
- [ ] **Budget rollup:** roadmap `budget` = Σ child budgets with an explicit **ceiling**; breaching the ceiling pauses fan-out (ties to §12 kill-switch).
- [ ] **Verify rollup:** each leaf runs its own `verify`; the roadmap's acceptance gate runs **after** all leaves pass (e.g. an integration/e2e gate).
- [ ] **Phase gates:** the roadmap's own phase Exit-criteria become **runtime gates** — fan-out of phase N+1 is blocked until phase N's children ship; human approval between phases is the default for `yolo` safety.
- [ ] **Idempotent re-run:** re-running a roadmap **skips already-`shipped` children** (content-hash dedupe, §5); only unfinished/changed children re-queue.

### 24.5 Source-of-truth & sync (no drift)

Composite work obeys the same SoT discipline as the core contract (§4 immutable manifest) and the tracker echo (§10): a roadmap/epic is **one record referenced by many**, never duplicated.

- [ ] The **roadmap/epic** is the SoT for *what/why + rollup status*; each **leaf job/run** is the SoT for *its* execution.
- [ ] Children reference the parent by `parentId`; the planner writes the child set **once** at decomposition (immutable manifest snapshot). Re-planning creates a new revision, it does not mutate in-flight children.
- [ ] Status flows **one way, child → parent → tracker** (the §10 echo); humans never hand-edit rollup state.

### 24.6 Decision — **Hybrid** (recorded)

> Model composite delegation in the **fleet layer now**; defer the shared-platform enum change until proven.

- **Now (fleet-owned):** add `kind` (`leaf`/`composite`), `parentId`, and rollup to the `fleet_jobs` schema (§13). The fleet owns this schema outright — no cross-product risk.
- **Tracker stays `bug`/`feature`/`task`** (the shared `ITEM_TYPES` used by all 9 products is unchanged). A roadmap is represented by a **parent item + label `kind:roadmap`** + `parentId` on children — zero platform migration, no sign-off needed.
- **Later (optional, gated on proven value):** promote `kind:roadmap` → a first-class `epic` tracker `type` via an **additive migration** (backfill items where `labels` contains `kind:roadmap` into `type: epic`, keep the label as an alias during transition). Low-risk because the behavior already works fleet-side.
- **Rationale:** avoids a speculative 9-product platform change (UI/filters/stats/tests) before the orchestration model is validated; if the model is wrong, only fleet code is refactored, not a platform enum every product depends on.

### 24.7 Phasing & gates

- **P1–P2:** leaf-only (no composite); `kind` defaults to `leaf`.
- **P3:** composite scheduling + rollup + DAG view in the control plane, with **manual decomposition** (a human/author defines the child set).
- **P3→P5:** the **auto-decomposition planner agent** (itself a `composite` job run by the `planner` profile) — start manual, automate once trustworthy.
- **Acceptance:** a roadmap with N child jobs fans out across ≥2 factories, respects `deps` + phase gates, rolls up status/budget correctly, and a re-run skips shipped children; tracker shows the parent moving `in_progress → done` via the one-way echo.
- **Verify gate:** composite-orchestration tests — DAG expansion, rollup status/budget, phase-gate blocking, idempotent re-run; control-plane e2e for the roadmap DAG view.

---

## 25. Durability, crash recovery & work preservation

**Goal:** a machine power-off, daemon/agent crash, or network partition **never loses the job, its instructions, or in-progress work**, and never corrupts state. Recovery is automatic and idempotent.

### 25.1 Instructions are durable (markdown in Cosmos)

- [ ] The **full job instruction body is persisted verbatim as markdown** in `fleet_jobs.bodyMd` (§13), alongside the structured manifest. The originating tracker `Item.description` also retains the human instruction text; the two are linked by `tracker-item`, never duplicated as competing truth (§24.5).
- [ ] A factory only ever holds a **transient materialized copy** (temp prompt file) fetched from the API — losing the factory loses nothing. On the offline edge, the `.md` file on disk is the durable copy and reconciles on reconnect (§9).
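
As a rough illustration of 25.1, the sketch below shows how a factory agent might materialize the durable `bodyMd` into a throwaway prompt file and delete it afterwards. Only `bodyMd`, `productId`, and the idea of a transient copy come from this document; the `FleetJob` shape and the `materializePrompt` / `discardPrompt` helpers are hypothetical names, and fetching the job from the coordinator API is deliberately left out.

```typescript
import { existsSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical subset of a `fleet_jobs` document (assumption, not the shipped schema).
interface FleetJob {
  id: string;
  productId: string;
  bodyMd: string; // full instruction body, verbatim markdown; the durable copy lives in Cosmos
}

// Write the instruction body into a fresh temp directory and return the file path.
function materializePrompt(job: FleetJob): string {
  const dir = mkdtempSync(join(tmpdir(), "aq-prompt-"));
  const file = join(dir, "prompt.md");
  writeFileSync(file, job.bodyMd, "utf8");
  return file;
}

// Remove the transient copy; `force: true` makes a second call a no-op.
function discardPrompt(file: string): void {
  rmSync(dirname(file), { recursive: true, force: true });
}

// Usage: the round trip preserves the markdown, and nothing remains on disk afterwards.
const demoJob: FleetJob = { id: "job-1", productId: "demo", bodyMd: "# Fix login bug\n\nSteps...\n" };
const promptFile = materializePrompt(demoJob);
console.log(readFileSync(promptFile, "utf8") === demoJob.bodyMd); // true
discardPrompt(promptFile);
console.log(existsSync(promptFile)); // false
```

Because the temp file is derived entirely from `bodyMd`, deleting it (or losing the whole factory) costs nothing; a replacement worker would simply materialize a new copy.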
### 25.2 Work-in-progress is preserved (checkpointing)

- [x] For a git-repo `cwd`, the worker commits **WIP to a dedicated branch `aq/wip/`** at start and on every exit path (success, failure, timeout, signal) — partial work is never lost to a crash. Never commits to `main`/protected branches (§12 push policy). *(P1-S3: `_wip_start`/`_wip_checkpoint` + EXIT/INT/TERM trap; non-git cwd skipped.)*
- [ ] `fleet_jobs.checkpoint` records the WIP branch + last commit so any worker can find it. *(P2 Cosmos; single-host records `wip_branch`/`wip_base`/`wip_commit` in `.meta`.)*
- [x] Long agents checkpoint periodically where the engine supports it; otherwise the start/exit commits bound the loss window. *(P1-S3: start + every-exit-path commits bound the loss window.)*

### 25.3 Recovery is automatic, resumable, and fenced

- [x] **Orphan detection:** on coordinator/runner startup (and continuously), a job in `building/assigned` whose worker is dead (no live lease / dead pid) is an **orphan**; it is recovered, not stranded. *(P1-S3: `recover_orphans` on `run` startup + each loop, and `agent-queue.sh recover`; dead-pid + `pidstart` reuse guard.)*
- [x] **Resume vs restart:** recovery starts a **new `fleet_runs` attempt**; if `aq/wip/` exists, the new worker **resumes from the checkpoint** instead of restarting from zero. *(P1-S3: relaunch checks out `aq/wip/`; `attempts` incremented.)*
- [ ] **Fencing (§4):** the reclaimed run gets a higher `leaseEpoch`; the dead/zombie worker's late commits/ship reports are rejected — no double-execution of *visible* outcomes. *(P2 — distributed leasing; out of single-host scope.)*
- [x] **Retry policy** (`retry.max/backoff/on`): agent `rc≠0` / `timeout` / `verify_failed` requeue with backoff up to `max`; on exhaustion → `dead_letter` (P2) / `failed` (P1 stand-in) with full diagnostics — never silently dropped.
*(P1-S3 single-host.)*
- [x] **State integrity:** all run state is **append-only / optimistic-concurrency guarded** (§13); recovery is idempotent (running it twice yields one recovery). *(P1-S3 single-host: meta is append-only + re-derivable from folder location; `_etag` guard is P2.)*

### 25.4 Crash taxonomy (all handled)

| Failure | Detection | Recovery |
| ------- | --------- | -------- |
| Agent process crash (`rc≠0`) | exit code | retry policy → requeue or `failed`/`dead_letter` |
| Daemon/runner crash | lease not renewed | reaper reclaims → resume from checkpoint |
| Machine power-off / partition | missed heartbeats + lease expiry | reaper + fencing + WIP resume elsewhere |
| Coordinator restart | state in Cosmos | leases survive; in-flight reconciled on boot |

- **Acceptance:** SIGKILL an agent and power-off a factory mid-run → another worker **resumes from the last checkpoint (not from zero)** and ships; instructions intact (read back from Cosmos `bodyMd`); **zero duplicate commits/merges**; a retry-exhausted job lands in `dead_letter`/`failed` with diagnostics.
- **Verify gate:** chaos tests — kill agent, kill runner, simulate partition; assert resume-from-checkpoint, fencing rejection of the stale worker, instruction integrity, and no double-merge.

---

## 26. Execution insights & token accounting

**Goal:** per-job/run visibility into **token usage, cost, model, latency, and tool activity** — to drive budgets (§5/§12), cost burndown (§17), and learned routing (§14 P5).

- [x] **Per-run telemetry record** (in `fleet_runs`, streamed as `fleet_events`): engine, model, **tokensIn/Out (+cached)**, **cost USD** (`estimated:true` when not provider-reported), wall + CPU time, **turn count, tool-call counts**, verify pass/fail, **filesChanged, linesAdded/Deleted**, attempt number, retries.
*(P1-S3 single-host: recorded in `.meta` — `duration_s`, `files_changed`/`lines_added`/`lines_deleted`, tokens/cost/turns/tool_calls, `attempts`; CPU time not captured.)*
- [x] **Token source (honest feasibility):** capture real usage where the engine/provider exposes it (Claude/Codex/OpenAI usage in responses; Devin session metrics); otherwise **estimate** from log heuristics and mark `estimated` — same caveat as `budget.usd/tokens` (§5). A single `parse_usage(engine, log)` adapter centralizes per-engine extraction. *(P1-S3: `parse_usage` adapter; generic `AQ_USAGE` line + Claude/Codex heuristics; Devin/Copilot TODO; `usage_estimated` flag, never fabricated.)*
- [ ] **Aggregation/rollups:** per job, roadmap (§24), product, factory, engine, profile, and day. Powers cost burndown (§17) and the learned-routing eval (§14). *(P1-S3 partial: `aq insights` does per-job + per-engine rollup; product/factory/profile/day are P2/P3.)*
- [ ] **Surfacing:** control-plane panels (tokens, cost, success/first-pass/human-edit rates) + a CLI insights summary at the edge; reuse the platform-service telemetry module where present. *(P1-S3 partial: edge CLI `aq insights` + `status`/`dash` insights line done; web control-plane panels are P3.)*
- [x] **Privacy:** telemetry carries metrics + pointers only — **never prompt content or secrets** (redaction §12). *(P1-S3: insights/meta record only metrics; no prompt body or secrets added.)*
- **Acceptance:** after a run, its `fleet_runs` carries token/cost/duration/tool/diff metrics (real where metered, flagged `estimated` otherwise); dashboards show per-engine and per-profile cost + token totals; a budget breach is detectable from telemetry alone.
- **Verify gate:** telemetry unit tests (capture + rollup); a metered-engine run records real tokens; an unmetered run records estimated + flagged; aggregation totals verified.

---

*This document is the single source of truth for the gigafactory build.
Keep the §0 table and per-phase checkboxes updated; a phase ships only when its Exit criteria and the §16 Definition-of-Done rubric are fully green.*
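
To make the §26 aggregation item concrete, here is a minimal sketch of a per-engine rollup over run telemetry. The field names loosely follow the `fleet_runs` description (`tokensIn`, `tokensOut`, the `estimated` flag), but the `RunInsight` and `EngineRollup` types, the `costUsd` field name, and `rollupByEngine` are hypothetical; this is neither the shipped schema nor the `aq insights` implementation.

```typescript
// Hypothetical per-run telemetry record (assumed field names, see lead-in).
interface RunInsight {
  engine: string;
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
  estimated: boolean; // true when usage was inferred from logs rather than provider-reported
}

// Totals for one engine, keeping estimated runs visible instead of blending them in silently.
interface EngineRollup {
  runs: number;
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
  estimatedRuns: number;
}

// Fold a list of runs into per-engine totals.
function rollupByEngine(runs: RunInsight[]): Map<string, EngineRollup> {
  const totals = new Map<string, EngineRollup>();
  for (const run of runs) {
    const acc = totals.get(run.engine) ??
      { runs: 0, tokensIn: 0, tokensOut: 0, costUsd: 0, estimatedRuns: 0 };
    acc.runs += 1;
    acc.tokensIn += run.tokensIn;
    acc.tokensOut += run.tokensOut;
    acc.costUsd += run.costUsd;
    if (run.estimated) acc.estimatedRuns += 1;
    totals.set(run.engine, acc);
  }
  return totals;
}
```

Counting `estimatedRuns` separately is one way to honor the "never fabricated" rule: a dashboard can show totals while still indicating how much of the figure rests on heuristics.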