bytelyst-devops-tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md
Saravanakumar D f1fe66fd4d docs(roadmap): tick verified-done Phase 3 boxes (395-400,402)
Phase 3 fleet control plane is implemented in learning_ai_common_plat:
fleet API client, fleet map page, job table/detail/DAG/SSE/actions, cost
burndown + multi-reviewer gate, scoring explainability, preemption, and
Playwright fleet e2e. Box 401 (TUI re-point) remains open.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 19:13:25 -07:00


# Agent Gigafactory — Vision & Implementation Roadmap
> **One-liner:** Evolve today's single-host `agent-queue` bash runner into a distributed **gigafactory** — a fleet of heterogeneous machines (Mac/Ubuntu/Windows), each running different coding-agent CLIs (Devin/Codex/Claude/Copilot/…), where a scheduler **auto-picks jobs from a shared inbox and routes each `.md` to the best factory × tool × profile** — built service-side on `platform-service` + `tracker-web`, with the bash runner surviving as the offline edge agent.
>
> **How to use this doc:** It is both a PRD and an execution checklist. Every feature is a `- [ ]` checkbox with **acceptance criteria** and a **verify gate**. A phase is "100% done" only when every box is checked, its gate passes, and the phase **Definition of Done** rubric (§16) is green. Update the progress table (§0) as you go.
---
## 0. Progress tracker
| Phase | Theme | Status | % | Gate |
| ----- | ----- | ------ | - | ---- |
| **0** | Baseline (today) | ✅ shipped | 100% | `selftest.sh` green |
| **1** | Manifest + profiles + capabilities + tracker adapter (single host) | ◐ in progress | 95% | adapter e2e + selftest |
| **2** | Coordinator as platform-service module + Cosmos + multi-factory leasing | ◐ in progress | 80% | fleet e2e + module tests |
| **3** | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ☐ not started | 0% | web e2e + router tests |
| **4** | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite |
| **5** | Self-optimizing / learned routing | ☐ not started | 0% | offline eval + A/B |
Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary. **Owners per phase: §23 · rollout/rollback: §21 · capacity & SLOs: §22/§19.**
---
## 1. Vision & metaphor
A **gigafactory** turns raw intent (`.md` task files / tracker items) into shipped software with minimal human touch. The mental model is a physical factory network:
| Term | Meaning |
| ---- | ------- |
| **Fleet** | The whole network of machines under one control plane. |
| **Factory** | One physical/virtual machine (a Mac, an Ubuntu box, a Windows host). Has an OS, installed tools, creds, capacity. |
| **Station** | A tool/engine slot inside a factory (a Devin seat, a Codex CLI, a Claude Code session, a Copilot agent). |
| **Worker** | A single running agent process executing one job at a station. |
| **Job** | A unit of work: a prompt/`.md` + manifest (profile, scope, gates, budget). |
| **Profile** | The *role* doing the work (developer, backend engineer, UX/UI designer, QA, reviewer) = persona prompt **+** capability requirements. |
| **Capability** | A tag a factory advertises and a job requires (`os:mac`, `has:xcode`, `has:figma`, `gpu`, `engine:devin`). |
| **Lease** | A time-boxed claim of a job by a worker; expires → job is reclaimable (crash recovery). |
| **Gate** | A checkpoint a job must pass: auto-QA `verify`, human review, ship approval. |
| **Artifact** | Any captured output: commits/PRs, logs, screenshots, reports, build outputs. |
**North star:** drop work into one inbox (or file a tracker task), and the fleet figures out *where* (factory), *with what* (tool/engine), *as whom* (profile), runs it in parallel, self-heals on crash, gates quality automatically, and surfaces everything in one live control plane — while a human only approves the final ship.
```
                 ┌──────────────────────── CONTROL PLANE (tracker-web) ────────────────────────┐
                 │  plan/intake · roadmap · Fleet map · live logs · cost · approvals           │
                 └───────────────▲───────────────────────────────────┬─────────────────────────┘
                                 │ REST/SSE                          │
    ┌────────────────────────────┴─────── COORDINATOR (platform-service module) ───────────────┐
    │ queue · scheduler/router · leases · profiles · capabilities · events · budgets (Cosmos)  │
    └───▲───────────────────────▲───────────────────────▲───────────────────────▲──────────────┘
        │ claim/lease/report    │                       │                       │
┌───────┴───────┐      ┌────────┴───────┐      ┌────────┴───────┐       ┌───────┴────────┐
│ FACTORY: mac  │      │ FACTORY: ubuntu│      │FACTORY: windows│       │ FACTORY: mac-2 │
│ devin, claude │      │ codex, claude  │      │ copilot, codex │       │ devin (xcode)  │
│ [agent-queue] │      │ [agent-queue]  │      │ [agent-queue]  │       │ [agent-queue]  │
└───────────────┘      └────────────────┘      └────────────────┘       └────────────────┘
```
---
## 2. Current state (Phase 0 baseline — already shipped)
Today's `agent-queue.sh` + `dashboard.mjs` (single host, zero-dep bash + Node):
- **Folder kanban lifecycle:** `inbox → building → review → testing → shipped` (+ `failed`).
- **Auto-QA gate:** agent rc=0 → `review/`; optional `verify:` runs in `cwd` → pass `testing/`, fail `failed/`; no verify → parks in `review/`. Manual `ship` = the human gate.
- **Per-job frontmatter:** `engine` (devin/claude/codex), `cwd`, `yolo` (→ dangerous/auto-approve), `lock` (per-repo serialization), `timeout`, `verify`.
- **Concurrency:** `AGENT_QUEUE_MAX` (default 3), per-`lock` serialization so same-repo jobs never collide.
- **State & logs:** `.state/<job>.meta` heartbeats + `logs/<job>.log`; git-tracked queue (audit-by-commit).
- **Interactive dashboard:** numbered selectable job list, single-key actions (promote/ship/reject/requeue), live log viewer, run/stop, all shelling out to `agent-queue.sh`.
**Carries forward:** the `.md`-in-`inbox` UX, frontmatter contract, lifecycle stage names, `verify` gate, lock/affinity concept, the bash runner itself (becomes the factory agent).
**Must change for the fleet:** single-host run loop → distributed leasing; file-only state → service + Cosmos; one engine choice → capability/profile routing; local dashboard → shared control plane.
- [x] Phase 0 complete — baseline shipped and self-tested. *(reference, not a work item)*
---
## 3. Goals & non-goals
**Goals**
- One intake, many machines: parallel execution across heterogeneous OS/tools.
- Automatic routing to the best `factory × tool × profile` with affinity, fairness, budget, and health awareness.
- Self-healing (lease expiry/requeue), quality gates, and full observability.
- Reuse the ByteLyst stack (`platform-service`, Cosmos, `@bytelyst/*`, tracker-web) — no parallel infra.
- Preserve offline/zero-dep edge operation via the bash runner.
**Non-goals**
- Not a CI/CD replacement (it *triggers* CI; CI still gates merges).
- Not a general-purpose workflow engine (scoped to coding-agent execution).
- Not a model/inference host (it orchestrates agent CLIs, doesn't serve models).
- Not abandoning the simple `.md` mental model — humans still drop files / file tasks.
---
## 4. Core concepts contract (must hold across all phases)
- [ ] Every job has a stable **id**, an immutable **manifest**, and an append-only **event log**.
- [ ] Every Cosmos document carries `productId` (ByteLyst rule).
- [x] A job in flight is always covered by exactly one **lease**; no live lease → reclaimable.
- [x] **Atomic claim:** a job is assigned to exactly one worker via optimistic concurrency (Cosmos `_etag`/`If-Match` or a conditional `fleet_leases` insert keyed by `jobId`). Concurrent claimers — exactly one wins; losers retry the next candidate.
- [x] **Fencing token:** every lease carries a monotonic `leaseEpoch`. Every report/commit/ship carries its epoch; the coordinator **rejects writes from a stale epoch**, so a partitioned or zombie worker cannot corrupt state after its lease was reclaimed.
- [ ] **Coordinator-authoritative time:** all lease/TTL/SLA math uses server timestamps, never factory clocks (clock-skew safety).
- [ ] Lifecycle stages are canonical and shared: `queued → assigned → building → review → testing → shipped` (+ `blocked`, `failed`, `dead_letter`).
- [ ] The bash runner and the service speak the **same manifest + event vocabulary** (one schema, two transports).
> **Implementation status (2026-05-29) — Phase 2 Foundation merged** (common-plat PR #28, `platform-service/src/modules/fleet/`): all 7 `fleet_*` containers (§13) ✓; repositories + coordinator (claim/lease/fence/heartbeat/reaper) ✓; idempotency + deps + submit-time cycle detection ✓; 50 module tests green.
>
> **✓ P0 hardening landed (2026-05-29, common-plat PR #29) — atomic claim is now truly concurrency-safe.** Added `updateIfMatch` to `@bytelyst/datastore`: Cosmos conditions the replace on `_etag` via `accessCondition {type:'IfMatch'}` (412 → conflict) plus a rev compare for the pre-read window; the Memory provider does `get→compare→set` in one synchronous block (no `await` between), so concurrent callers cannot interleave. `fleet` `revUpdate*` now write conditionally. Proven by `Promise.all` 2-contender + N-claimer stress + concurrent `claimNextJob`/lease-renew tests (these **fail** on the old read-check-write, pass now). datastore 48 + fleet 53 green; full workspace build/test clean; no consumer regressed. **P2-S3 (factory integration) is now unblocked.**
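
The atomic-claim rule above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `LeaseStore` whose `rev` field stands in for Cosmos `_etag`; `LeaseStore`, `writeIfMatch`, and `tryClaim` are illustrative names, not the real `@bytelyst/datastore` or `fleet` module API. Two factories read the same snapshot and race; only the first conditional write succeeds.

```typescript
// Hypothetical in-memory stand-in for the lease container; `rev` plays the role of Cosmos `_etag`.
interface Lease {
  jobId: string;
  holder: string | null; // factory id, or null when the job is claimable
  leaseEpoch: number;    // monotonic fencing token
  rev: number;           // bumped on every successful write
}

class LeaseStore {
  private leases = new Map<string, Lease>();

  read(jobId: string): Lease {
    const found = this.leases.get(jobId);
    return found ? { ...found } : { jobId, holder: null, leaseEpoch: 0, rev: 0 };
  }

  // Conditional write: succeeds only if the caller saw the latest revision.
  // Synchronous on purpose, so no other claimer can interleave between compare and set.
  writeIfMatch(next: Lease, expectedRev: number): boolean {
    if (this.read(next.jobId).rev !== expectedRev) return false; // 412-style conflict
    this.leases.set(next.jobId, { ...next, rev: expectedRev + 1 });
    return true;
  }
}

// Claim from a previously read snapshot; returns the fresh leaseEpoch, or null if the claim lost.
function tryClaim(store: LeaseStore, seen: Lease, factoryId: string): number | null {
  if (seen.holder !== null) return null;
  const next: Lease = { ...seen, holder: factoryId, leaseEpoch: seen.leaseEpoch + 1 };
  return store.writeIfMatch(next, seen.rev) ? next.leaseEpoch : null;
}

// Two factories read the same snapshot, then race to claim it.
const leaseStore = new LeaseStore();
const snapshot = leaseStore.read("job-123");
const epochA = tryClaim(leaseStore, snapshot, "mac-1");
const epochB = tryClaim(leaseStore, snapshot, "ubuntu-1");
console.log(epochA, epochB); // 1 null
```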
---
## 5. The evolved Job manifest (feature)
Extend today's frontmatter into a richer, **backward-compatible** manifest. Old `.md` files keep working (new fields optional with sane defaults).
```yaml
---
# --- existing (unchanged) ---
engine: devin # explicit engine; overrides profile/engine-class
cwd: /abs/path/repo
yolo: true
lock: my-repo
timeout: 45m
verify: pnpm -s test
# --- new ---
profile: backend-engineer # role: persona + capability requirements
engine-class: agentic-coder # abstract; scheduler picks a concrete engine if `engine` unset
capabilities: [os:any, node>=20] # hard requirements a factory MUST satisfy
prefers: [factory:mac-2] # soft routing hints (affinity)
priority: high # critical|high|medium|low → SLA + preemption
budget: { usd: 5, tokens: 2M, wall: 4h } # wall = HARD ceiling (always enforceable). usd/tokens = best-effort
# caps: enforced only where the engine/provider exposes live metering;
# otherwise estimated from provider usage APIs post-hoc + alerted.
deps: [job-123, job-456] # DAG: don't start until these reach `shipped`/`testing`
idempotency-key: nomgap-ux-2 # dedupe: a second identical submit is a no-op
retry: { max: 2, backoff: 5m, on: [timeout, verify_failed] }
review-policy: manual # auto|manual|reviewers:[@alice]
artifacts: [coverage, screenshots] # what to capture beyond commits
tracker-item: ITEM-789 # link back to the originating tracker task
---
```
- [ ] Define the manifest schema (Zod in the service; documented YAML for `.md`).
- [x] Backward-compat: a Phase-0 `.md` (only `engine/cwd/yolo`) parses with all new fields defaulted. *(P1-S1: bash runner; Zod schema still P2. selftest backward-compat case green.)*
- [x] **Capability grammar** defined: tokens are `key` (presence, e.g. `has:xcode`), `key:value` (e.g. `os:mac`, `engine:devin`), or `key<op>version` with `op ∈ {>=,>,=,<=,<}` (e.g. `node>=20`). `os:any` is a wildcard that matches every factory. A job matches a factory iff every required token is satisfied by the factory descriptor. *(P1-S1: `caps_match`/`detect_capabilities` in `agent-queue.sh`.)*
- [x] **`engine-class` taxonomy** defined as an enum (`agentic-coder`, `chat-coder`, `review-only`) with a documented engine→class map (`devin,claude,codex → agentic-coder`; `copilot → chat-coder`). If `engine` is set it wins; else the scheduler picks any free engine in the class honoring `prefers-engine`. *(P1-S1: `resolve_engine`; `review-only` mapping reserved.)*
- [x] **`idempotency-key` semantics:** `key + content-hash` identical ⇒ no-op (returns existing job). Same `key`, **different** content ⇒ **rejected with 409** unless the prior job is still `queued`/`blocked` (then it is superseded). A re-`run`/`retry` of an existing job is **not** a new submit and never trips dedupe. *(P1-S1: add-time dedupe; bash maps "409" → clear error, `queued` → still in `inbox/` ⇒ superseded.)*
- [x] **`deps` semantics:** a dep is satisfied when it reaches `shipped` (default) or `testing` if `deps-mode: soft`. Submit-time **cycle detection** rejects cyclic graphs; unmet deps put the job in `blocked` (not `queued`). Cross-factory deps require the coordinator (P2); single-host deps work in P1. *(P1-S2: `deps_unmet` skip-with-reason in selection + `status` surfacing; `deps_would_cycle` on `add`. Cross-machine deps remain P2.)*
- **Acceptance:** a manifest fixture suite parses/validates; invalid manifests fail with precise errors; capability-grammar + dep-cycle + idempotency-conflict cases covered.
- **Verify gate:** schema unit tests (≥ 1 per field incl. defaults + 5 invalid cases + grammar/cycle/409 cases).
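
As a rough illustration of the capability grammar, the sketch below matches required tokens against a factory descriptor. The descriptor shape (`tags` plus a `versions` map) and the names `FactoryDescriptor`, `tokenSatisfied`, and `capsMatch` are assumptions for this example; the bash runner's `caps_match` may represent things differently. Versions are compared as dotted numbers only.

```typescript
// Assumed descriptor shape: presence/key:value tags plus detected tool versions.
interface FactoryDescriptor {
  tags: string[];                   // e.g. ["os:mac", "has:xcode", "engine:devin"]
  versions: Record<string, string>; // e.g. { node: "20.11.1" }
}

// Compare dotted numeric versions; missing parts count as 0.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff < 0 ? -1 : 1;
  }
  return 0;
}

function tokenSatisfied(token: string, factory: FactoryDescriptor): boolean {
  if (token === "os:any") return true; // wildcard matches every factory
  const versioned = /^([a-z0-9_-]+)(>=|<=|>|<|=)([0-9][0-9.]*)$/i.exec(token);
  if (versioned) {
    const key = versioned[1] ?? "";
    const op = versioned[2] ?? "";
    const wanted = versioned[3] ?? "";
    const have = factory.versions[key];
    if (have === undefined) return false;
    const cmp = compareVersions(have, wanted);
    switch (op) {
      case ">=": return cmp >= 0;
      case ">": return cmp > 0;
      case "=": return cmp === 0;
      case "<=": return cmp <= 0;
      case "<": return cmp < 0;
      default: return false;
    }
  }
  return factory.tags.includes(token); // `key` and `key:value` tokens are exact tag matches
}

// A job matches a factory iff every required token is satisfied.
function capsMatch(required: string[], factory: FactoryDescriptor): boolean {
  return required.every((token) => tokenSatisfied(token, factory));
}

const mac: FactoryDescriptor = {
  tags: ["os:mac", "has:xcode", "engine:devin"],
  versions: { node: "20.11.1" },
};
console.log(capsMatch(["os:any", "node>=20"], mac)); // true
console.log(capsMatch(["os:mac", "has:figma"], mac)); // false
```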
---
## 6. Profiles — persona + capability (feature)
A **profile** = a versioned file combining a persona (system-prompt overlay), required capabilities, default gates, preferred engine/model, and allowed repo scopes. Stored as `profiles/<name>.md` (Phase 1) → Cosmos `profiles` container (Phase 2).
```yaml
# profiles/backend-engineer.md
---
name: backend-engineer
persona: |
  You are a senior backend engineer. Favor minimal, well-tested changes...
capabilities: [node>=20, has:pnpm]
default-verify: pnpm -s typecheck && pnpm -s test
engine-class: agentic-coder
prefers-engine: [devin, claude]
allowed-scope: ["backend/**", "packages/**"] # blast-radius guardrail
review-policy: manual
---
```
- [x] Author starter catalog: `developer`, `backend-engineer`, `frontend-engineer`, `ux-designer`, `ui-designer`, `qa`, `reviewer`, `docs-writer`. *(P1-S2: `profiles/*.md` + a reserved `planner`.)*
- [x] Persona overlay is **prepended** to the job body before the agent runs; secrets are never written to logs or the event stream (redaction at the source). *(P1-S2: `profile_persona` prepended to the stripped body file.)*
- [x] Profile supplies default `verify`, `capabilities`, `engine-class`, `allowed-scope` when the job omits them. *(P1-S2: `fm_eff` — also `prefers-engine` + `review-policy`; job fields always override.)*
- [ ] Profile versioning: changing a profile doesn't mutate in-flight jobs (snapshot at assign time). *(P2 — needs Cosmos snapshot at assign time.)*
- [x] `allowed-scope` enforced as a guardrail (warn in P1, enforce/deny in P2 via pre-flight diff check). *(P1-S2: `scope_check` post-run WARN-only + `scope_warning=` in meta; `path_in_scope` unit-testable.)*
- **Acceptance:** a job with `profile: backend-engineer` and no `verify` inherits the profile's verify + persona.
- **Verify gate:** profile-resolution unit tests; persona-injection golden test.
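
The inheritance rule (job fields override profile fields, which override defaults) and persona prepending can be sketched as follows. This is an illustrative TypeScript rendering, not the bash `fm_eff` implementation; the field names are camelCased and the `DEFAULTS` values are placeholders assumed for the example.

```typescript
interface JobFields {
  verify?: string;
  capabilities?: string[];
  engineClass?: string;
  allowedScope?: string[];
}

interface Profile extends JobFields {
  name: string;
  persona: string;
}

// Placeholder defaults, assumed for illustration only.
const DEFAULTS: Required<JobFields> = {
  verify: "",
  capabilities: ["os:any"],
  engineClass: "agentic-coder",
  allowedScope: ["**"],
};

// Precedence: job > profile > default. The persona is prepended to the job body.
function resolveJob(job: JobFields, body: string, profile?: Profile) {
  return {
    verify: job.verify ?? profile?.verify ?? DEFAULTS.verify,
    capabilities: job.capabilities ?? profile?.capabilities ?? DEFAULTS.capabilities,
    engineClass: job.engineClass ?? profile?.engineClass ?? DEFAULTS.engineClass,
    allowedScope: job.allowedScope ?? profile?.allowedScope ?? DEFAULTS.allowedScope,
    prompt: profile ? `${profile.persona}\n\n${body}` : body,
  };
}

const backendEngineer: Profile = {
  name: "backend-engineer",
  persona: "You are a senior backend engineer.",
  capabilities: ["node>=20", "has:pnpm"],
  verify: "pnpm -s typecheck && pnpm -s test",
  allowedScope: ["backend/**", "packages/**"],
};

// The job sets only capabilities; verify and scope come from the profile.
const resolved = resolveJob({ capabilities: ["os:mac"] }, "Fix the flaky test.", backendEngineer);
console.log(resolved.verify);       // pnpm -s typecheck && pnpm -s test
console.log(resolved.capabilities); // [ 'os:mac' ]
```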
---
## 7. The scheduler / router (the heart) (feature)
Given a `queued` job and the current fleet, choose `(factory, station/engine, profile)` and issue a lease.
**Inputs:** job manifest (capabilities, priority, budget, deps, prefers, lock), profile requirements, live factory descriptors (capabilities, load, health, cost class), lock/affinity table, fairness counters.
**Algorithm (deterministic, explainable):**
1. **Filter** factories by **hard capability match** (the job's effective capabilities, including any inherited from its profile, ⊆ factory capabilities) and a free station for a compatible engine.
2. **Block** if `deps` unmet or `lock` already held → leave `queued`/`blocked`.
3. **Score** each candidate factory:
`score = w1·capabilityFit + w2·affinity(prefers, repo-stickiness) + w3·(1/load) + w4·costFit(budget) + w5·health - w6·starvationPenalty`
4. **Tie-break:** highest priority job first; then oldest; then lowest cost class.
5. **Assign atomically** → create the lease under an optimistic-concurrency guard (`_etag`/`If-Match` or conditional insert keyed by `jobId`) **with a fresh `leaseEpoch`**; on conflict another factory won → retry the next candidate. Set job `assigned`, decrement station/seat capacity, bump fairness counter. Use **coordinator-authoritative timestamps** only.
6. **Preemption (P3+):** a `critical` job may pause a `low` job at a needed station (checkpoint + requeue, bumping the preempted job's `leaseEpoch`).
> **Phasing:** Phase 2 ships the deterministic **filter + atomic-assign core** (fixed weights). Phase 3 adds **tunable weights, preemption, and the explainability UI**. Phase 5 learns the weights (§14).
- [ ] Implement pure, unit-testable scoring function (no I/O) with configurable weights.
- [ ] Hard-filter correctness: never assign a job to a factory missing a required capability.
- [ ] Affinity/stickiness: same-repo jobs prefer the factory that has the warm checkout (lock-aware).
- [ ] Fairness: no factory or product starves under sustained load (counter + penalty).
- [ ] Explainability: every assignment records *why* (matched caps, score breakdown) in the event log.
- [ ] Determinism: same inputs → same decision (seeded tie-breaks) for testability.
- [ ] Define **factory health** ∈ [0,1] = f(heartbeat freshness, recent run failure-rate, resource pressure); factories below a health floor are **filtered out**, not merely down-weighted.
- [ ] **Station/seat capacity:** a factory's free stations = `min(host slots, per-engine seat limits)` (e.g. licensed Devin/Claude seats); the scheduler never over-subscribes a seat-limited engine.
- [ ] **Distributed lock:** the Phase-0 local `lock` becomes a **coordinator-held lock** so same-`lock` jobs serialize across the whole fleet (prevents two factories pushing the same repo concurrently).
- **Acceptance:** scenario fixtures (10+) produce expected assignments incl. starvation, capability-miss, seat-exhaustion, unhealthy-factory exclusion, and budget-exceed; a concurrent-claim race test proves exactly one winner.
- **Verify gate:** router unit suite ≥ 95% branch coverage on the scoring/filter core; atomic-claim race test.
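
A minimal sketch of the pure scoring core is shown below. It makes several assumptions that the roadmap does not fix: the starvation term is subtracted, `load` is at least 1 so that `1/load` stays finite, every input is already normalized by the caller, and ties break on `factoryId` for determinism. `Candidate`, `Weights`, `score`, and `pickFactory` are illustrative names.

```typescript
interface Candidate {
  factoryId: string;
  capabilityFit: number;
  affinity: number;
  load: number;              // assumed >= 1 (e.g. running jobs + 1)
  costFit: number;
  health: number;            // in [0, 1]
  starvationPenalty: number;
}

interface Weights {
  w1: number; w2: number; w3: number; w4: number; w5: number; w6: number;
}

// Pure function: no I/O, same inputs always give the same score.
function score(c: Candidate, w: Weights): number {
  return (
    w.w1 * c.capabilityFit +
    w.w2 * c.affinity +
    w.w3 * (1 / c.load) +
    w.w4 * c.costFit +
    w.w5 * c.health -
    w.w6 * c.starvationPenalty // assumption: the penalty lowers the score
  );
}

// Factories below the health floor are filtered out, not merely down-weighted.
function pickFactory(cands: Candidate[], w: Weights, healthFloor: number): Candidate | null {
  const eligible = cands.filter((c) => c.health >= healthFloor);
  const ranked = [...eligible].sort(
    (a, b) => score(b, w) - score(a, w) || (a.factoryId < b.factoryId ? -1 : 1),
  );
  return ranked[0] ?? null;
}

const weights: Weights = { w1: 1, w2: 1, w3: 1, w4: 1, w5: 1, w6: 1 };
const fleet: Candidate[] = [
  { factoryId: "mac-2", capabilityFit: 1, affinity: 1, load: 2, costFit: 0.5, health: 0.9, starvationPenalty: 0 },
  { factoryId: "ubuntu-1", capabilityFit: 1, affinity: 0, load: 1, costFit: 1, health: 0.8, starvationPenalty: 0.2 },
  { factoryId: "win-1", capabilityFit: 1, affinity: 1, load: 1, costFit: 1, health: 0.2, starvationPenalty: 0 },
];
const winner = pickFactory(fleet, weights, 0.5);
console.log(winner?.factoryId); // mac-2 (win-1 is excluded by the health floor)
```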
---
## 8. Factory model & registration (feature)
Each machine runs a **factory agent** (the evolved `agent-queue` runner) that registers, heartbeats, claims jobs, and reports events.
- [ ] **Capability auto-detection** at boot: OS, installed engines (devin/claude/codex/copilot), tool probes (xcode, figma-cli, docker, gpu), node/pnpm versions, available creds (presence only, never values).
- [ ] **Enrollment / bootstrap trust**: first registration authenticates with a one-time enrollment secret (or an operator-issued platform JWT). The factory then receives a **scoped, rotatable factory token** (`jose` JWT); decommission = revoke. No standing shared secret in the queue.
- [ ] **Registration**: `POST /fleet/factories` with descriptor → receives a factory id + token.
- [ ] **Heartbeat**: periodic `PUT /fleet/factories/:id/heartbeat` (load, free stations, health). A **coordinator lease reaper** (not Cosmos TTL) sweeps `expiresAt < now` and reclaims, **bumping `leaseEpoch`** so the dead/zombie worker is fenced; a factory missing N heartbeats is marked `offline` and all its leases reclaimed.
- [ ] **Claim loop**: `POST /fleet/leases/claim` advertising capabilities/free stations; atomic (exactly one winner, §4); receives a job + lease TTL + `leaseEpoch`. Use **claim backoff / long-poll** to bound Cosmos RU under many idle factories (see §22); broker push replaces polling in P4.
- [ ] **Report**: stream stage/log/event back (`POST /fleet/runs/:id/events`), **echoing `leaseEpoch`** (stale epoch → 409, worker self-aborts); renew lease while alive.
- [ ] **Environment prep**: before `verify`, the factory ensures deps are installed (cold checkout → `pnpm install`); prep time counts against `budget.wall`.
- [ ] **Graceful drain**: factory can stop claiming, finish in-flight, deregister.
- **Acceptance:** a factory enrolls, claims a matching job, heartbeats, completes; a killed factory's job is reclaimed by another within the lease TTL and the killed worker's late report is **rejected by fencing**.
- **Verify gate:** factory-agent integration test against a mock coordinator; crash-recovery + fencing-rejection test.
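
The reaper and fencing behavior described above might look roughly like this. It is a sketch over plain in-memory rows; `LeaseRow`, `reapExpired`, and `handleReport` are hypothetical names, and `now` is assumed to be a coordinator-side timestamp in milliseconds.

```typescript
interface LeaseRow {
  jobId: string;
  holder: string | null; // factory id, or null once reclaimed
  expiresAt: number;     // coordinator time, ms
  leaseEpoch: number;
}

// Sweep expired leases: release the holder and bump the epoch so the old worker is fenced.
function reapExpired(leases: LeaseRow[], now: number): string[] {
  const reclaimed: string[] = [];
  for (const lease of leases) {
    if (lease.holder !== null && lease.expiresAt < now) {
      lease.holder = null;
      lease.leaseEpoch += 1;
      reclaimed.push(lease.jobId);
    }
  }
  return reclaimed;
}

// A report must echo the current epoch; a stale epoch is answered with 409.
function handleReport(lease: LeaseRow, reportEpoch: number): 200 | 409 {
  return reportEpoch === lease.leaseEpoch ? 200 : 409;
}

const rows: LeaseRow[] = [
  { jobId: "job-1", holder: "mac-1", expiresAt: 1_000, leaseEpoch: 3 },
  { jobId: "job-2", holder: "ubuntu-1", expiresAt: 5_000, leaseEpoch: 1 },
];
const reclaimedJobs = reapExpired(rows, 2_000);
const lateReport = handleReport(rows[0]!, 3);  // zombie worker still holds epoch 3
const liveReport = handleReport(rows[1]!, 1);
console.log(reclaimedJobs, lateReport, liveReport); // [ 'job-1' ] 409 200
```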
---
## 9. Coordination architecture (decision + path)
Three transports were evaluated. **Decision: platform-service-native coordinator is the spine; git-queue stays for the offline edge; broker added only at scale.**
| Option | Pros | Cons | Verdict |
| ------ | ---- | ---- | ------- |
| (a) **Git-synced queue** (evolve folders) | zero infra, audit-by-commit, offline | weak/racey leasing, latency, merge churn | **Edge/offline only** |
| (b) **Coordinator service** (platform-service module) | real leases, fairness, observability, reuses auth/Cosmos/productId | a service to run | **Chosen spine (P2)** |
| (c) **Message broker** (NATS/Redis/SQS) | scale, backpressure, push dispatch | most moving parts/ops | **P4 when throughput demands** |
- [ ] Document the decision + rationale in-repo (this section is the canonical record).
- [ ] Define the **claim/lease protocol** once; both git-queue (poll) and service (API) implement it.
- [ ] **Split-brain / network-partition safety:** a partitioned factory may keep running and even `git push`. `idempotency-key` dedupes *submits* but cannot undo *side-effects*. Mitigation: **fencing** — the coordinator rejects `ship`/merge reports from a stale `leaseEpoch`, and the distributed `lock` (§7) prevents a reclaimed-job's twin from pushing the same repo. Residual risk (a stale push to a feature branch) is contained by the PR-merge ship gate (§10) and surfaced for human triage.
- [ ] **Offline-degrade**: a factory cut off from the coordinator falls back to its local git-queue and reconciles on reconnect; on reconnect it presents its `leaseEpoch` — if reclaimed, its results are quarantined, not auto-merged.
- [ ] **Poll cost**: bound claim-loop RU via long-poll/backoff (§22); migrate to broker push at P4.
- **Acceptance:** the same job manifest runs identically through the bash/git path and the service path; a simulated partition does not double-merge (fencing test).
- **Verify gate:** contract test asserting protocol parity (git vs service) + partition/fencing test.
---
## 10. tracker-web / platform-service integration (committed path)
**Layering:** tracker = *WHAT/WHY* (plan, intake, prioritize, roadmap, votes) · gigafactory = *HOW* (execute) · platform-service = shared brain · agent-queue runner = offline edge. Grounded in the real `tracker-service` model (`Item`: `type` bug/feature/**task**, `status` open/in_progress/done/closed/wont_fix, priority, labels, assignee, `source` incl. **auto_detected**, votes, comments, public roadmap) and the `tracker-web` `/api/tracker/[...path]` proxy pattern.
### Phase 1 — Adapter (no new infra)
- [x] **task → job**: a tracker `Item` of `type: task` (e.g. `assignee: @agent` or label `agent:run`) is exported to a job `.md` (manifest mapped: title/description → body, priority → priority, labels → capabilities/profile hints). *(P1-S4: `aq from-tracker`; labels `engine-class:`/`profile:`/`priority:`/`cap:` → frontmatter.)*
- [x] **job → tracker**: lifecycle events post back as **status updates + comments**: `building` → status `in_progress` + comment "started on factory X"; `shipped` → `done` + comment with commit SHAs / PR link / verify results; `failed` → comment with reason (status stays `in_progress` for human triage). *(P1-S4: `aq to-tracker` PATCHes status + posts a metrics-only comment; one-way echo §24.5; never fatal. The items API has no blocked/failed status, so failures map to `wont_fix` by default; override via `AQ_TRACKER_STATUS_FAILED`.)*
- [x] Idempotency: re-running the adapter for the same item doesn't create duplicate jobs (idempotency-key = item id + content hash). *(P1-S4: derived `idempotency-key: tracker-<id>` reuses Slice 1 dedupe; `to-tracker` is idempotent via `tracker_echoed`.)*
- [x] Adapter is a thin script/CLI (`aq from-tracker ITEM-789`) + optional poller. *(P1-S4: `from-tracker`/`to-tracker` + opt-in `AQ_TRACKER_AUTO` auto-echo; a standalone poller is deferred.)*
- **Acceptance:** filing a tracker task, marking it `agent:run`, results in a queued job; on ship, the item flips to `done` with a SHA comment.
- **Verify gate:** adapter e2e against a tracker-service test instance (or mock); round-trip assertion.
**Stage → tracker status mapping** (tracker's enum is coarser than the fleet's; keep fine-grained stage in a label + comment so no detail is lost):
| Fleet stage | Tracker `status` | Extra |
| ----------- | ---------------- | ----- |
| `queued` / `assigned` / `blocked` | `in_progress` | label `fleet:<stage>` |
| `building` / `review` / `testing` | `in_progress` | label `fleet:<stage>` + progress comment |
| `shipped` | `done` | comment with SHA(s)/PR link/verify result |
| `failed` / `dead_letter` | `in_progress` + label `needs-triage` | never auto-`closed`/`wont_fix` (humans decide) |
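
The mapping table can be expressed as a small pure function. This sketch encodes only the status and label columns; `toTrackerUpdate` and `TrackerUpdate` are illustrative names, and comment generation is left out.

```typescript
type FleetStage =
  | "queued" | "assigned" | "blocked"
  | "building" | "review" | "testing"
  | "shipped" | "failed" | "dead_letter";

type TrackerStatus = "open" | "in_progress" | "done" | "closed" | "wont_fix";

interface TrackerUpdate {
  status: TrackerStatus;
  labels: string[];
}

function toTrackerUpdate(stage: FleetStage): TrackerUpdate {
  if (stage === "shipped") return { status: "done", labels: [] };
  if (stage === "failed" || stage === "dead_letter") {
    // Never auto-close: humans decide what happens to failures.
    return { status: "in_progress", labels: ["needs-triage"] };
  }
  // Every other stage stays in_progress, with the fine-grained stage kept in a label.
  return { status: "in_progress", labels: [`fleet:${stage}`] };
}

console.log(toTrackerUpdate("building"));    // { status: 'in_progress', labels: [ 'fleet:building' ] }
console.log(toTrackerUpdate("dead_letter")); // { status: 'in_progress', labels: [ 'needs-triage' ] }
```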
**Ship semantics (PR flow):** `shipped` = change **merged to target branch with CI green** (default), OR `pr-opened` when `review-policy` defers merge to humans/CI — configurable per profile. This honors the non-goal that CI still gates merges (§3); the agent never bypasses branch protection.
### Phase 2 — Native spine
- [ ] Stand up a `fleet` (a.k.a. `orchestrator`) module **inside platform-service**, sibling to `tracker-service`: pattern `types.ts → repository.ts → routes.ts`, ESM, Cosmos, `productId`, `req.log`.
- [ ] Endpoints: jobs CRUD, claim/lease, events/report, factories register/heartbeat, profiles, stats.
- [ ] Runners (bash + any) become API clients of this module; tracker adapter calls it directly.
- **Acceptance:** a job submitted via the module is claimed by a real factory and shipped, with all state in Cosmos.
- **Verify gate:** module test suite (repository + routes) using the shared `@bytelyst/testing` inject helpers.
### Phase 3 — Unified control plane
- [ ] Add a **Fleet** surface to `tracker-web` reusing auth/Primitives/DataTable/product switcher: fleet map (factories + load/health), job table, job DAG, **live log streaming**, lease/heartbeat status, cost burndown, approve/ship buttons.
- [ ] **Streaming caveat (correctness):** live logs **must not** use the existing buffering catch-all proxy `/api/tracker/[...path]` — it does `res.text()` and would never stream. Use a **dedicated Next.js Route Handler returning a `ReadableStream` (SSE)** or a direct SSE/WebSocket to platform-service. Full logs are shipped to blob storage (§17); the endpoint serves stored tail + live append.
- [ ] The Node TUI dashboard becomes a thin client of the same `/fleet` API (parity with web).
- **Acceptance:** an operator can watch all factories + tail any job log + ship from the browser.
- **Verify gate:** web e2e (Playwright) covering fleet map render, live log, and a ship action.
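
To make the streaming caveat concrete, here is a hedged sketch of the "stored tail + live append" idea using only standard Web APIs (`ReadableStream`, `TextEncoder`). The helper names `formatSseEvent` and `logStream` and the `log` event name are assumptions; wiring it into an actual Next.js Route Handler is shown only as a comment.

```typescript
// Format one Server-Sent Events frame; multi-line data becomes several `data:` lines.
function formatSseEvent(event: string, data: string): string {
  const dataLines = data.split("\n").map((line) => `data: ${line}`).join("\n");
  return `event: ${event}\n${dataLines}\n\n`;
}

// Emit the stored tail first, then forward live lines as they arrive.
function logStream(storedTail: string[], live: AsyncIterable<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream<Uint8Array>({
    async start(controller) {
      for (const line of storedTail) {
        controller.enqueue(encoder.encode(formatSseEvent("log", line)));
      }
      for await (const line of live) {
        controller.enqueue(encoder.encode(formatSseEvent("log", line)));
      }
      controller.close();
    },
  });
}

// In a Route Handler the stream would be returned unbuffered, roughly:
//   return new Response(logStream(tail, live), {
//     headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
//   });

const frame = formatSseEvent("log", "verify: pnpm -s test\nexit 0");
console.log(JSON.stringify(frame)); // "event: log\ndata: verify: pnpm -s test\ndata: exit 0\n\n"
```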
---
## 11. Lifecycle & gates at scale (feature)
- [ ] Canonical stages enforced server-side: `queued → assigned → building → review → testing → shipped` (+ `blocked`, `failed`, `dead_letter`); transitions validated (illegal transition → 409).
- [ ] Per-profile default `verify`; per-job override; verify runs at the factory, result reported as an event.
- [ ] Human gates: `review-policy` routes to reviewers; multi-reviewer support (P3).
- [x] **Dead-letter**: after `retry.max` exhausted, job → `dead_letter` with full diagnostics; never silently dropped. *(P1-S3 single-host stand-in: `failed/` `result=retries_exhausted`, WIP branch + full log preserved.)*
- [ ] **Backpressure**: when no factory can take more, jobs stay `queued` (no thrash); SLA timers visible.
- [ ] **Ship semantics** are profile-configurable (merged+green vs `pr-opened`, §10); `shipped` is terminal-success, `dead_letter` terminal-failure; `blocked` (unmet deps) is distinct from `queued`.
- [x] **Retry vs idempotency**: a retry creates a new `fleet_runs` attempt under the same job/`idempotency-key` (never a duplicate job); backoff honored; `retry.on` filters which failure classes retry. *(P1-S3 single-host: `attempts` counter survives requeue; `backoff`→`next_eligible` gates selection; `on` filters timeout/verify_failed/crash.)*
- **Acceptance:** a perpetually-failing job lands in `dead_letter` after configured retries; a passing one auto-advances to `testing` then waits for human `ship`; an illegal transition is rejected.
- **Verify gate:** lifecycle state-machine unit tests (all transitions + illegal-transition rejection + retry/dead-letter path).
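
A server-side transition check could be sketched as a lookup table. The roadmap names the canonical stages but does not enumerate every legal edge, so the `TRANSITIONS` table below is an assumption (including the requeue edges used for lease reclaim and retry), as is the reading of `retry.max` as the number of retries allowed after the first attempt.

```typescript
type Stage =
  | "queued" | "assigned" | "building" | "review" | "testing"
  | "shipped" | "blocked" | "failed" | "dead_letter";

// Assumed transition table; adjust to the real state machine.
const TRANSITIONS: Record<Stage, readonly Stage[]> = {
  blocked: ["queued"],                      // deps became satisfied
  queued: ["assigned", "blocked"],
  assigned: ["building", "queued"],         // queued again if the lease is reclaimed
  building: ["review", "failed", "queued"],
  review: ["testing", "failed"],
  testing: ["shipped", "failed"],
  failed: ["queued", "dead_letter"],        // retry or give up
  shipped: [],                              // terminal success
  dead_letter: [],                          // terminal failure
};

type TransitionResult = { ok: true; stage: Stage } | { ok: false; status: 409 };

function transition(from: Stage, to: Stage): TransitionResult {
  return TRANSITIONS[from].includes(to) ? { ok: true, stage: to } : { ok: false, status: 409 };
}

// After a failure: requeue while retries remain, otherwise dead-letter.
function afterFailure(retriesUsed: number, retryMax: number): Stage {
  return retriesUsed >= retryMax ? "dead_letter" : "queued";
}

console.log(transition("testing", "shipped")); // { ok: true, stage: 'shipped' }
console.log(transition("queued", "shipped"));  // { ok: false, status: 409 }
console.log(afterFailure(2, 2));               // dead_letter
```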
---
## 12. Security, safety & governance (feature — critical with `yolo`/dangerous)
- [ ] **Secret isolation**: creds live on each factory (env/keychain), **never** in the queue, manifest, logs, or Cosmos. Factory advertises *presence* of a cred capability, not the value.
- [ ] **Scoped git tokens** per factory/repo; least-privilege; rotation documented.
- [ ] **Push policy**: protected branches; agents push to feature branches + open PRs by default; direct-to-main gated by profile/flag.
- [ ] **Blast-radius guardrail**: enforce `allowed-scope` — pre-flight + post-run diff check; out-of-scope changes block the ship gate.
- [ ] **Budget kill-switch**: exceed `budget` (usd/tokens/wall) → pause worker, alert, require human resume.
- [ ] **Supply-chain safety**: edits to shared `@bytelyst/*` packages require `reviewer` profile + human gate (never auto-ship).
- [ ] **Audit trail**: append-only event log per job (who/what/when/where/cost); immutable.
- [ ] **Corp network/proxy**: honor `NETWORK`/proxy + truststore conventions on factories that need them.
- [ ] **Kill switch (global)**: one command/flag halts all claiming fleet-wide (incident response).
- **Acceptance:** a job attempting an out-of-scope edit is blocked at the gate; a budget overrun pauses and alerts; no secret ever appears in any persisted artifact (scanner test).
- **Verify gate:** security test suite incl. a secret-leak scanner over logs/meta + scope-enforcement test.
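
The post-run half of the blast-radius check can be approximated with a simplified glob matcher. This is a sketch, not the runner's `path_in_scope`: it supports only `*` (within one path segment) and `**` (across segments), and `globToRegExp`, `pathInScope`, and `outOfScope` are illustrative names.

```typescript
// Convert a simplified glob to an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  let pattern = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*" && glob.charAt(i + 1) === "*") {
      pattern += ".*"; // `**` crosses directory boundaries
      i++;
    } else if (ch === "*") {
      pattern += "[^/]*"; // `*` stays within one path segment
    } else {
      pattern += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
    }
  }
  return new RegExp(`^${pattern}$`);
}

function pathInScope(path: string, scopes: string[]): boolean {
  return scopes.some((scope) => globToRegExp(scope).test(path));
}

// Files from a diff that fall outside `allowed-scope`; a non-empty result would block the ship gate.
function outOfScope(changedFiles: string[], scopes: string[]): string[] {
  return changedFiles.filter((file) => !pathInScope(file, scopes));
}

const allowedScope = ["backend/**", "packages/**"];
const changed = ["backend/src/api.ts", "packages/ui/index.ts", "infra/deploy.sh"];
const violations = outOfScope(changed, allowedScope);
console.log(violations); // [ 'infra/deploy.sh' ]
```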
---
## 13. Data model (Cosmos containers, P2+)
Each container partitioned sensibly; every doc has `productId`.
- [x] `fleet_jobs` (pk `/productId`) — manifest snapshot **+ the full instruction body verbatim as markdown (`bodyMd`)**, current stage, idempotency-key, tracker-item link, `checkpoint` pointer (WIP branch/commit). This is the **durable source of truth for instructions** — a factory holds only a transient materialized copy, so a machine going down loses nothing (§25).
- [x] `fleet_runs` (pk `/jobId`) — one per execution attempt: factory, engine, profile snapshot, timings, exit, verify result, **and execution insights: model, tokensIn/Out (+cached), cost (`estimated` flag), turns, tool-call counts, filesChanged, linesAdded/Deleted, attempt number** (§26).
- [x] `fleet_leases` (pk `/jobId`) — holder factory, `expiresAt`, **`leaseEpoch` (fencing)**, renewals. **Reclaim via a coordinator reaper** that scans `expiresAt < now` — Cosmos TTL only garbage-collects stale rows, it **cannot trigger reclaim logic**. Claim guarded by `_etag`/`If-Match`.
- [x] `fleet_factories` (pk `/productId`) — descriptor, capabilities, health, load, last heartbeat, seat limits.
- [x] `fleet_profiles` (pk `/productId`) — versioned profile snapshots (immutable per version).
- [x] `fleet_events` (pk `/jobId`) — append-only audit/event stream (stage changes, log pointers, cost ticks, scheduler decisions).
- [ ] `fleet_artifacts` (pk `/jobId`) — pointers to **blob-stored** logs + artifacts (coverage, screenshots, build output). Large logs live in `@bytelyst/blob`, **never** inline in Cosmos (doc-size + RU limits).
- [ ] Relate to existing tracker `Item` via `tracker-item` (no duplication of planning data).
- [x] **Optimistic concurrency** (`_etag`) on every job stage transition + lease claim to prevent lost updates / double-assignment. *(PR #29: `updateIfMatch`.)*
- [ ] **Indexing/RU**: the claim query is hot — index `stage`, `priority`, `capabilities`; avoid cross-partition fan-out; provision RU/s per §22.
- **Acceptance:** repository CRUD + query tests per container; **atomic-claim race test (N concurrent claimers → exactly one wins)**; reaper-reclaim + fencing-rejection test; lease-expiry verified via reaper (not TTL).
- **Verify gate:** repository unit/integration tests (memory + Cosmos provider via `DB_PROVIDER`).
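
The atomic-claim race named in the acceptance line (N concurrent claimers, exactly one wins) can be sketched against an in-memory stand-in for a container that honors `If-Match`. This is an illustrative Python sketch, not the platform-service code: `MemoryStore`, `try_claim`, and `race` are hypothetical names, and the stage labels `queued`/`assigned` are assumed from the lifecycle wording elsewhere in this doc.

```python
import threading
import uuid


class EtagMismatch(Exception):
    """Raised when a conditional write loses the race."""


class MemoryStore:
    """Tiny in-memory stand-in for a container with If-Match semantics."""

    def __init__(self):
        self._docs = {}
        self._mutex = threading.Lock()  # models the server-side atomic etag check

    def put(self, doc_id, body):
        with self._mutex:
            self._docs[doc_id] = {**body, "_etag": uuid.uuid4().hex}

    def read(self, doc_id):
        with self._mutex:
            return dict(self._docs[doc_id])

    def update_if_match(self, doc_id, body, etag):
        with self._mutex:
            if self._docs[doc_id]["_etag"] != etag:
                raise EtagMismatch(doc_id)
            self._docs[doc_id] = {**body, "_etag": uuid.uuid4().hex}


def try_claim(store, job_id, factory_id):
    """Return True if this factory won the claim, False otherwise."""
    doc = store.read(job_id)
    if doc["stage"] != "queued":
        return False
    etag = doc.pop("_etag")
    doc.update(stage="assigned", holder=factory_id, leaseEpoch=doc["leaseEpoch"] + 1)
    try:
        store.update_if_match(job_id, doc, etag)
        return True
    except EtagMismatch:
        return False  # someone else wrote first; do not retry the same job


def race(n_claimers):
    """Start n claimers at the same instant and report who won."""
    store = MemoryStore()
    store.put("job-1", {"stage": "queued", "holder": None, "leaseEpoch": 0})
    barrier = threading.Barrier(n_claimers)
    results = [False] * n_claimers

    def worker(i):
        barrier.wait()
        results[i] = try_claim(store, "job-1", f"factory-{i}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_claimers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, store.read("job-1")
```

The loser never overwrites the winner because the write is conditional on the etag it read, which is the property the `_etag`/`If-Match` guard is meant to provide.
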
---
## 14. Phased build roadmap (checklists)
Each phase: **Goal → checklist → Exit criteria**. Don't start a phase until the prior phase's Exit criteria are green. Tick boxes here as the canonical progress.
### Phase 1 — Manifest + profiles + capabilities + tracker adapter (single host)
**Goal:** richer single-host runner that understands profiles/capabilities and bridges to tracker — no distributed infra yet.
> **Slice progress — P1-S1:** manifest parsing (all §5 fields, defaulted + backward-compatible), `priority` ordering, capability detection+match gate, `engine-class` resolution, and `idempotency-key` dedupe are **done** on the bash runner.
>
> **Slice progress — P1-S3 (resilience & insights, single host):** crash recovery (`recover_orphans` + `aq recover`), git WIP checkpoint/resume (`aq/wip/<job>`), functional `retry` policy (backoff + `retries_exhausted`), and execution insights (`parse_usage`, per-run metrics in meta, `aq insights`, `status`/`dash` insights) are **done** — see §11/§25/§26.
>
> **Slice progress — P1-S2 (profiles + deps/DAG, single host):** the `profiles/` catalog + resolution (`fm_eff` inheritance with job>profile>default precedence, persona injection), the warn-only `allowed-scope` guardrail (`scope_check`/`path_in_scope`), and single-host `deps` (block-with-reason in selection, `status` surfacing, submit-time cycle detection) are **done** — see §5/§6.
>
> **Slice progress — P1-S4 (tracker adapter, single host):** the task ↔ job round-trip is **done** (§10) — `aq from-tracker` materializes a job from a tracker Item (idempotent on `tracker-<id>`, label→manifest mapping), `aq to-tracker` echoes status + a metrics-only comment one-way (idempotent via `tracker_echoed`, never fatal), and opt-in `AQ_TRACKER_AUTO` auto-echoes on transitions. All HTTP is curl-only through one wrapper (test seam `AQ_TRACKER_API_CMD`). **This closes the Phase-1 §14 tracker-adapter item.** Remaining P1 extras: `budget.wall` (P1-S3 left it) and Node-`dash` surfacing of the new fields.
- [x] Extend `agent-queue.sh` frontmatter parsing for all new manifest fields (§5), defaulted + backward-compatible. *(P1-S1)*
- [x] Add `profiles/` directory + profile resolution (persona injection, default verify/caps/scope) (§6). *(P1-S2)*
- [x] Local capability detection + a job/factory capability match check before launch (§8 subset). *(P1-S1: `detect_capabilities` + `caps_match`; mismatch ⇒ `failed/` `result=capability_mismatch`, agent never launched.)*
- [x] `priority` ordering in the inbox pick (replace pure FIFO with priority-then-age). *(P1-S1: `inbox_sorted`; per-lock serialization preserved.)*
- [x] `deps` (DAG) blocking on a single host; `idempotency-key` dedupe on `add`. *(P1-S1 idempotency dedupe + P1-S2 `deps` blocking/cycle detection.)*
- [ ] `retry` with backoff into `failed`/requeue; `budget.wall` enforced (extends `timeout`). *(P1-S3: `retry` with backoff + `retries_exhausted` DONE; `budget.wall` still pending.)*
- [x] `allowed-scope` guardrail (warn-only this phase) + post-run diff report. *(P1-S2: `scope_check` WARN-only + `scope_warning=`.)*
- [x] **Tracker adapter** `aq from-tracker <ITEM>` + `aq to-tracker` event poster (§10 P1). *(P1-S4: curl-only `tracker_api`; from-tracker materializes a job (idempotent), to-tracker echoes status+metrics one-way; opt-in `AQ_TRACKER_AUTO`. A standalone background poller is deferred to P2.)*
- [ ] Dashboard shows profile + priority + capability tags + tracker-item link. *(P1-S1: `status` shows priority/profile/caps/tracker-item; P1-S4: status/insights also show last echoed tracker status; Node `dash` surfacing pending.)*
- [x] Update `selftest.sh` with: manifest parse fixtures, profile resolution, priority order, dep-block, idempotency, adapter round-trip (mock). *(P1-S1 manifest/priority/idempotency + P1-S2 profile/persona/scope/dep-block/cycle + P1-S3 resilience/insights + P1-S4 tracker from/to round-trip via stub.)*
- [x] Update README + this doc's progress table. *(P1-S1)*
- **Exit criteria:** all boxes ✅; `selftest.sh` green; a tracker task → executed → tracker `done` with SHA comment, fully on one host; no regression to Phase-0 `.md` files.
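
The priority-then-age pick and single-host `deps` blocking described above can be sketched as follows. This is a Python illustration only; the real logic lives in the bash runner. The priority labels in `PRIORITY_RANK`, the `Job` fields, and the rule that a dependency is satisfied once it is in the `shipped` set are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumed priority labels; the actual manifest vocabulary is defined in §5.
PRIORITY_RANK = {"critical": 0, "high": 1, "normal": 2, "low": 3}


@dataclass
class Job:
    job_id: str
    priority: str = "normal"
    submitted_at: float = 0.0  # epoch seconds; smaller means older
    deps: list = field(default_factory=list)


def inbox_order(jobs):
    """Sort by priority first, then by age (oldest first), then by id for stability."""
    return sorted(jobs, key=lambda j: (PRIORITY_RANK[j.priority], j.submitted_at, j.job_id))


def pick_next(jobs, shipped):
    """Return (next runnable job or None, {blocked job id: reason})."""
    blocked = {}
    for job in inbox_order(jobs):
        missing = [d for d in job.deps if d not in shipped]
        if missing:
            blocked[job.job_id] = "waiting on " + ", ".join(missing)
            continue
        return job, blocked
    return None, blocked
```

A blocked high-priority job is skipped with a recorded reason rather than stalling the queue, which mirrors the block-with-reason behavior noted for P1-S2.
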
### Phase 2 — Coordinator as platform-service module + Cosmos + multi-factory leasing
**Goal:** the service spine; ≥2 real factories executing in parallel via leases.
> **Slice progress — P2-S3 (factory-agent integration, single host):** the bash runner
> is now a coordinator **factory** behind `AQ_FLEET` — `lib/fleet-client.sh` (curl-only,
> sourced) registers via heartbeat, claims jobs into inbox (interleaved with local `.md`),
> reports **fenced** stage transitions with WIP checkpoints, renews/releases leases, and on
> a stale `leaseEpoch` (reclaimed) **self-aborts + quarantines** the local result. Coordinator
> 5xx/connection errors **degrade** (finish locally) rather than abandon work. When `AQ_FLEET`
> is off the offline git-queue path is byte-for-byte unchanged. Remaining P2: scheduler/router
> core, direct tracker→module calls, factory enrollment + scoped tokens, `fleet.*` feature
> flags + shadow/dual-run, and the two-factory parallel demo (the Phase-2 exit criteria).
- [x] Scaffold `fleet`/`orchestrator` module in `platform-service` (`types/repository/routes`, Zod, ESM, `productId`). *(PR #28)*
- [x] Cosmos containers (§13) + repository layer (memory + Cosmos providers). *(PR #28; `fleet_artifacts` blob wiring still pending.)*
- [x] **Atomic claim** (optimistic concurrency / `_etag`) + **lease reaper** + **fencing (`leaseEpoch`)** endpoints (§4/§8/§9) — *not* Cosmos-TTL-driven reclaim. *(common-plat PR #28 + #29; truly atomic via `updateIfMatch`.)*
- [x] Port `agent-queue` runner to a **factory agent** API client (enroll/register/heartbeat/claim/report, fencing-aware) while keeping git-queue fallback. *(P2-S3: `lib/fleet-client.sh` behind `AQ_FLEET`; registers via heartbeat, claims into inbox, reports fenced stage transitions, renews leases, quarantines on stale-epoch; offline git-queue unchanged when the flag is off.)*
- [ ] Scheduler/router core (§7) as a pure module (fixed weights) + wired into atomic assignment.
- [ ] Tracker adapter calls the module directly (not just file export).
- [ ] Auth: factory enrollment + scoped rotatable tokens; secret isolation enforced (§12 subset).
- [x] **Feature flags** (`fleet.enabled`, `fleet.route_via_service`) + **shadow/dual-run** vs P1 before cutover (§21). *(agent-queue runner: `AQ_FLEET` / `AQ_FLEET_ROUTE` / `AQ_FLEET_SHADOW` with documented precedence; shadow claim/compare/report is side-effect-free (isolated `-shadow` factoryId + dryRun, never materializes/ships); `fleet-shadow-report` summarizes AGREE/DIVERGE/COORD_EMPTY/LOCAL_EMPTY + agreement; 60→68 selftest checks.)*
- [x] Module test suite (repository + routes via `@bytelyst/testing`); **atomic-claim race**, crash-recovery, fencing-rejection, reaper-reclaim tests. *(PR #28 + #29: 53 fleet + 48 datastore tests, incl. true-concurrency claim.)*
- [x] Two-factory demo (e.g. mac + ubuntu) running 3 parallel jobs end-to-end. *(`agent-queue/demo/two-factory-demo.sh` + `coordinator-stub.sh`: two real `run` daemons (mac-1 + ubuntu-1, separate queues/cwds) compete through one coordinator; asserts (a) no double-assign, (b) kill-mid-job → reaper reclaim → survivor completes → zombie report fenced (409), (c) concurrent parallelism. Dual-mode: CI-safe stateful stub by default, live platform-service when `AQ_FLEET_API`/`AQ_FLEET_TOKEN` set. Headless checks in `selftest.sh` → 68→71 green.)*
- **Exit criteria:** all boxes ✅; `pnpm --filter @lysnrai/platform-service test` green; killing a factory mid-job → another reclaims and completes **and the dead worker's late report is fenced**; concurrent claimers never double-assign; all state in Cosmos with `productId`; **flag-off rollback verified** (§21). — _Runtime exit guarantees **demonstrated** by the two-factory demo (no double-assign + reclaim/fence + parallelism) and flag-off rollback verified (§21). **Remaining for 100%:** scheduler/router core wired into assignment (common-plat PR #31, open), tracker adapter direct call, and factory enrollment + scoped tokens._
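
The reclaim-and-fence exit guarantee (a dead worker's late report is rejected after another factory reclaims the job) can be sketched in a few lines. This is a simplified, single-process Python model under stated assumptions: the `Coordinator` class, its method names, and the choice to bump `leaseEpoch` both on reap and on claim are hypothetical, and time is passed in explicitly to mimic coordinator-authoritative timestamps.

```python
class StaleEpoch(Exception):
    """The reporter's leaseEpoch is older than the job's current epoch."""


class Coordinator:
    def __init__(self, lease_seconds=60):
        self.lease_seconds = lease_seconds
        self.jobs = {}  # job_id -> {stage, holder, leaseEpoch, expiresAt}

    def submit(self, job_id):
        self.jobs[job_id] = {"stage": "queued", "holder": None,
                             "leaseEpoch": 0, "expiresAt": None}

    def claim(self, job_id, factory_id, now):
        """Grant a lease and return its epoch, or None if the job is not queued."""
        job = self.jobs[job_id]
        if job["stage"] != "queued":
            return None
        job.update(stage="assigned", holder=factory_id,
                   leaseEpoch=job["leaseEpoch"] + 1,
                   expiresAt=now + self.lease_seconds)
        return job["leaseEpoch"]

    def reap(self, now):
        """Requeue jobs whose lease expired; bump the epoch to fence the old holder."""
        reclaimed = []
        for job_id, job in self.jobs.items():
            if job["stage"] == "assigned" and job["expiresAt"] < now:
                job.update(stage="queued", holder=None, expiresAt=None,
                           leaseEpoch=job["leaseEpoch"] + 1)
                reclaimed.append(job_id)
        return reclaimed

    def report(self, job_id, epoch, stage):
        """Accept a stage transition only from the current lease holder's epoch."""
        if epoch != self.jobs[job_id]["leaseEpoch"]:
            raise StaleEpoch(f"{job_id}: epoch {epoch} is stale")
        self.jobs[job_id]["stage"] = stage
```

In the two-factory demo the equivalent rejection surfaces as an HTTP 409 on the zombie's report; here it is modeled as the `StaleEpoch` exception.
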
### Phase 3 — Fleet control plane in tracker-web + DAG + budgets + scoring router
**Goal:** one browser control plane; smart routing + budgets live.
- [x] `fleet` API client in `tracker-web` (reuse `/api/tracker`-style proxy → `/fleet`). *(common-plat `dashboards/tracker-web/src/lib/fleet-client.ts`: typed client over `/api/fleet`.)*
- [x] Fleet map page (factories, load, health, capabilities) on `@bytelyst/*` components. *(common-plat `app/dashboard/fleet/page.tsx`: health badges, load, capabilities, fleet metrics + alerts.)*
- [x] Job table + job detail + **DAG view**; live log via **SSE**; approve/ship/reject/requeue actions. *(common-plat `app/dashboard/fleet/jobs/page.tsx` + `jobs/[id]/page.tsx`: stage-filtered table, DAG via `getJobDag`, SSE event stream, ship/requeue/reject/requestReview.)*
- [x] Cost burndown + budget kill-switch UI; multi-reviewer routing. *(common-plat `app/dashboard/fleet/budget/page.tsx` burndown + pause/resume; `ReviewGateCard` multi-reviewer quorum gate via `requestReview`/`submitReview`.)*
- [x] Scoring router with configurable weights + explainability surfaced in UI. *(common-plat `fleet/scheduler.ts` tunable weights + `GET /fleet/jobs/:id/explain`; `ExplainPanel` breakdown in job detail.)*
- [x] Preemption of low-priority by critical jobs (checkpoint + requeue). *(common-plat `fleet/scheduler.ts` `selectPreemptionVictim` + coordinator eviction under `FLEET_PREEMPTION`; victim requeued with checkpoint + bumped epoch, `preempted` event.)*
- [ ] TUI dashboard re-pointed at `/fleet` API (parity).
- [x] Web e2e (Playwright): fleet map, live log, ship, budget-pause. *(common-plat `dashboards/tracker-web/e2e/fleet.spec.ts`: fleet overview, metrics, job detail, ship, budget-pause, review-gate specs green.)*
- **Exit criteria:** all boxes ✅; web `verify` (typecheck+lint+test+e2e) green; an operator runs the whole 3-repo parallel workload from the browser, including a budget pause + resume.
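
A scoring router with tunable weights and an explain breakdown, as listed in the Phase 3 checklist, might look roughly like the sketch below. The actual factors and weights live in `fleet/scheduler.ts` and are not reproduced here: the three factors (`load`, `health`, `affinity`), the default weights, and the factory/job dictionary shapes are all hypothetical and chosen only to show the pattern of a hard capability filter followed by a weighted sum.

```python
# Hypothetical factors and weights, for illustration only.
DEFAULT_WEIGHTS = {"load": 0.5, "health": 0.3, "affinity": 0.2}


def score_factory(job, factory, weights=DEFAULT_WEIGHTS):
    """Return an explain dict: either a rejection reason or a per-factor breakdown."""
    missing = set(job["capabilities"]) - set(factory["capabilities"])
    if missing:
        # Capability match is a hard gate, not a weighted factor.
        return {"rejected": "missing capabilities: " + ", ".join(sorted(missing))}
    features = {
        "load": 1.0 - factory["busy"] / factory["stations"],  # free share of stations
        "health": 1.0 if factory["healthy"] else 0.0,
        "affinity": 1.0 if job["engine"] in factory["engines"] else 0.0,
    }
    detail = {name: round(weights[name] * value, 4) for name, value in features.items()}
    detail["total"] = round(sum(detail.values()), 4)
    return detail


def route(job, factories, weights=DEFAULT_WEIGHTS):
    """Pick the highest-scoring eligible factory and return (factory id, explain)."""
    explain = {}
    best_id, best_total = None, None
    for fid in sorted(factories):  # sorted ids make ties deterministic
        detail = score_factory(job, factories[fid], weights)
        explain[fid] = detail
        if "rejected" in detail:
            continue
        if best_total is None or detail["total"] > best_total:
            best_id, best_total = fid, detail["total"]
    return best_id, explain
```

Returning the full per-factory breakdown alongside the decision is what makes the choice explainable in a UI panel; changing the weights changes the winner without touching the routing code.
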
### Phase 4 — Message bus + autoscaling + cross-OS capability marketplace
**Goal:** scale-out and elasticity.
- [ ] Introduce broker (NATS/Redis) for push dispatch + backpressure; coordinator publishes, factories subscribe by capability.
- [ ] Autoscaling hooks (spin ephemeral factories: cloud VM / container) keyed to queue depth + SLA.
- [ ] Capability "marketplace": jobs requiring rare caps (xcode/figma/gpu) routed to the few factories that have them; queueing fairness across products.
- [ ] Load + chaos test suite (factory churn, broker outage, thundering herd).
- **Exit criteria:** all boxes ✅; sustained N×throughput vs Phase 3 under load test; graceful degradation on broker outage (fallback to poll).
### Phase 5 — Self-optimizing / learned routing
**Goal:** the scheduler learns from history to cut time/cost and raise first-pass success.
- [ ] Capture outcome features per run (engine, profile, repo, duration, cost, verify pass, human-edit rate).
- [ ] Offline eval harness comparing learned vs heuristic routing on historical data.
- [ ] Shadow/A-B rollout with guardrails; auto-tune scoring weights.
- [ ] Recommendations surfaced ("route NomGap UX jobs to claude on mac-2: 23% faster, 11% cheaper").
- **Exit criteria:** all boxes ✅; learned router beats heuristic on the eval set without regressing safety gates; A/B shows measurable improvement on a target metric.
---
## 15. Cross-cutting feature catalog (quick index)
| Feature | First phase | Section |
| ------- | ----------- | ------- |
| Evolved job manifest | P1 | §5 |
| Profiles (persona + capability) | P1 | §6 |
| Capability matching | P1→P2 | §6/§8 |
| Priority + SLA | P1 | §5/§7 |
| DAG dependencies | P1→P3 | §5/§11 |
| Idempotency / dedupe | P1 | §5 |
| Retry + dead-letter | P1→P2 | §11 |
| Budgets + kill-switch | P1(wall)→P3 | §5/§12 |
| Scheduler/router scoring | P2→P3 | §7 |
| Factory registration/heartbeat/lease | P2 | §8 |
| Coordinator (platform-service module) | P2 | §9/§10 |
| Cosmos data model | P2 | §13 |
| Tracker bi-directional sync | P1→P2 | §10 |
| Web control plane + SSE logs | P3 | §10/§17 |
| Security/scope/secret isolation | P1→P2 | §12 |
| Broker + autoscaling | P4 | §14 |
| Learned routing | P5 | §14 |
| Atomic claim + fencing + distributed lock | P2 | §4/§7/§9 |
| Rollout / rollback / feature flags | P2→ | §21 |
| Capacity planning & RU/cost | P2→ | §22 |
| Ownership & RACI / on-call | all | §23 |
| Work hierarchy & composite delegation (roadmap/epic) | P3 (manual) → P5 (planner) | §24 |
| Durability, crash recovery & work preservation | P1 (orphan/retry/WIP) → P2 (lease/resume) | §25 |
| Execution insights & token accounting | P1 (capture) → P3 (rollup UI) | §26 |
---
## 16. Definition of Done — the "100% accuracy" rubric
A feature/phase is **not done** until **every** item below is true (this is the bar for "100% end-to-end"):
- [ ] **Functionality**: acceptance criteria met; happy path + documented edge cases handled.
- [ ] **Tests**: unit + integration written *first or alongside*, all green; no weakened/deleted tests; coverage targets met (router ≥95% core).
- [ ] **Verify gate**: the phase's named gate command passes locally (and in CI where applicable).
- [ ] **Idempotency & recovery**: re-runs are safe; crash mid-step recovers (lease/idempotency).
- [ ] **Security review**: secret-leak scan clean; scope guardrail honored; least-privilege tokens.
- [ ] **Observability**: events/logs/metrics emitted; failures are diagnosable from the control plane.
- [ ] **Docs**: this roadmap's checkboxes ticked; README/AGENTS updated; manifest/profile docs current.
- [ ] **Backward-compat**: existing `.md`/Phase-0 behavior unbroken (regression check).
- [ ] **Drift checks**: shared-infra templates (`.npmrc`, `docker-prep`) untouched/synced; conventional commits.
- [ ] **No `console.log`/`print`** in service code; `req.log`/`os.Logger` used; ESM `.js` imports.
---
## 17. Observability & control plane details
- [ ] **Log transport/storage**: factory ships logs to blob (`@bytelyst/blob`); `fleet_events` carries pointers + a recent-tail buffer. The control plane serves stored tail + live append (via the streaming route, **not** the buffering proxy — §10).
- [ ] **Live logs** via SSE (single stream contract) from the streaming endpoint to web/TUI.
- [ ] **Metrics**: queue depth, `blocked` count, assign latency, claim-loop RU/s, run duration, verify pass-rate, cost, factory utilization, fairness, reclaim/fencing-rejection counts.
- [ ] **Alerting**: stall (no log N min), failure spikes, budget breach, factory offline, dead-letter, **claim-race anomalies**, RU throttling (Cosmos 429s).
- [ ] **Tracing**: a job's full timeline (queued→…→shipped) reconstructable from `fleet_events` (immutable, ordered).
- [ ] **Cost burndown** per job/product/day with budget overlays.
- [ ] **SLOs defined + dashboarded** (see §19 targets); error budget tracked per SLO.
---
## 18. Risks & gaps explicitly tracked (expert call-outs)
- [ ] **Duplicate execution** across transports (git fallback + service) — `idempotency-key` (submit) + atomic lease (assign) + **fencing token** (side-effect) + distributed `lock` (push).
- [ ] **Crash recovery** — coordinator **lease reaper + fencing** (not Cosmos TTL); checkpoint long jobs where engines allow.
- [ ] **Split-brain / partition** — fencing rejects stale `leaseEpoch` writes; reclaimed-job results quarantined, not auto-merged (§9).
- [ ] **Shared-package conflicts** — two jobs editing `@bytelyst/*` simultaneously → fleet-wide `lock` + reviewer gate.
- [ ] **Starvation/fairness** — per-product + per-factory counters with penalty.
- [ ] **Cost runaway**: `budget.wall` hard ceiling everywhere; `usd`/`tokens` best-effort (provider metering) + global kill switch.
- [ ] **Cosmos RU throttling (429)** — hot claim path; bound via long-poll/backoff + indexing (§13/§22); broker offload at P4.
- [ ] **Clock skew** — coordinator-authoritative timestamps for all lease/SLA math (§4).
- [ ] **Tool-version drift / reproducibility** — record engine + tool versions per run; pin where possible.
- [ ] **Windows quirks** — path/shell differences in the factory agent; capability-gate Windows-only work.
- [ ] **Human-review bottleneck** — auto-verify as much as possible; batch review UI; reviewer routing.
- [ ] **Result capture beyond commits** — artifacts (coverage, screenshots, build logs) attached to runs.
- [ ] **Secret sprawl** — never in queue/manifest/logs/Cosmos; presence-only capabilities.
- [ ] **Data retention** — event/log retention + archival policy (extend today's `clean`).
- [ ] **Engine API churn** — engines mapped in one place (`build_agent_cmd`); capability matrix versioned.
---
## 19. Success metrics
Each metric has a **provisional SLO target** (tune with real data; tracked with an error budget):
| Dimension | Metric | Provisional SLO target |
| --------- | ------ | ---------------------- |
| Throughput | jobs shipped/day; parallel utilization | utilization ≥ 60% under backlog |
| Quality | % auto-verified; first-pass success; escaped-defect; post-agent human-edit rate | first-pass ≥ 70%; escaped-defect < 2% |
| Speed | assign latency; time queued→shipped (excl. human gate) | assign p95 < 5s; queue-wait p95 < 2m at target load |
| Cost | $/shipped job; budget-breach rate | budget-breach < 1% of jobs |
| Reliability | lease-reclaim success; dead-letter rate; factory uptime; double-execution incidents | reclaim success ≥ 99.9%; **double-merge = 0**; dead-letter < 5% |
| Fairness | max/min product wait-time ratio | ratio < 3× |
| Correctness | atomic-claim violations; fencing rejections functioning | claim violations = 0 |
> Targets are starting points; the §23 owners ratify per-phase SLOs before that phase's exit.
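
Two of the targets in the table (assign p95 < 5s and a max/min product wait-time ratio < 3×) reduce to simple arithmetic over collected samples. The Python sketch below shows one way to evaluate them; the nearest-rank percentile method, the function names, and the input shapes are assumptions, not the metrics pipeline this roadmap will use.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fairness_ratio(wait_by_product):
    """Max/min mean wait across products; infinite if some product waits zero."""
    waits = list(wait_by_product.values())
    if min(waits) == 0:
        return math.inf
    return max(waits) / min(waits)


def check_slos(assign_latencies_s, wait_by_product):
    """Compare samples against the provisional targets from the table above."""
    return {
        "assign_p95_ok": percentile(assign_latencies_s, 95) < 5.0,
        "fairness_ok": fairness_ratio(wait_by_product) < 3.0,
    }
```

With 20 latency samples the p95 is the 19th smallest value, so a single slow outlier does not breach the target, while a product waiting four times longer than another does breach fairness.
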
---
## 20. Open questions
- [ ] Copilot headless feasibility as an engine/station (CLI/automation surface?).
- [ ] Who owns merge/push authority: do agents open PRs only, or auto-merge on green for low-risk profiles?
- [ ] Multi-user/tenant: per-user queues + RBAC in the control plane?
- [ ] On-call/ownership for the fleet (alerts routing, runbooks)?
- [ ] Cloud factory provisioning (Phase 4): which provider/runtime, and what cost guardrails?
- [ ] Profile authorship/governance: who can create/edit profiles, and who reviews persona prompts?
---
## 21. Rollout, rollback & data migration
Each phase ships behind controls so it can be turned off without losing work.
- [ ] **Feature-flagged rollout**: gate each phase's new path behind a platform feature flag (`fleet.enabled`, `fleet.route_via_service`, `fleet.tracker_sync`); default off; enable per-product first.
- [x] **Dual-run / shadow**: P2 coordinator runs in shadow (assign decisions logged, not executed) alongside the P0/P1 path before cutover; compare decisions. *(agent-queue `AQ_FLEET_SHADOW=1`: offline path stays authoritative, coordinator queried in parallel, decisions classified AGREE/DIVERGE/COORD_EMPTY/LOCAL_EMPTY into `.state/fleet-shadow.log`; strictly side-effect-free — never ships/quarantines/mutates real job state.)*
- [x] **Cutover is reversible**: a factory can fall back from service-claim to git-queue via flag; no schema-destructive step on the rollback path. *(rollback = `AQ_FLEET_ROUTE=0` and/or `AQ_FLEET=0` at any time → instant return to the local/offline path; no data migration.)*
- [ ] **Data migration**: introducing Cosmos containers (P2) is **additive**: no migration of existing tracker data; backfill is read-only (link `tracker-item`, don't mutate). Container creation is idempotent (registered in `cosmos-init`).
- [ ] **Backward-compat gate**: every phase re-runs Phase-0 `selftest.sh` + a corpus of legacy `.md` files (regression).
- [ ] **Rollback drill**: each phase's exit includes a tested rollback (flag off → prior behavior, in-flight jobs drain or requeue cleanly).
- **Acceptance:** flipping `fleet.*` flags off returns the system to the prior phase's behavior with zero data loss; in-flight jobs either complete or requeue.
- **Verify gate:** rollout/rollback drill documented + a flag-off regression run is green.
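
The shadow comparison above classifies each decision as AGREE, DIVERGE, COORD_EMPTY, or LOCAL_EMPTY and reports an agreement figure. A minimal Python sketch of that classification follows. The exact semantics of `fleet-shadow-report` are not spelled out here, so three things are assumptions: the labels mean what their names suggest, a round where both sides pick nothing is not counted, and agreement is AGREE divided by all counted decisions.

```python
from collections import Counter


def classify(local_pick, coord_pick):
    """Label one shadow decision; picks are job ids or None."""
    if local_pick is None and coord_pick is None:
        return None  # assumption: nothing to compare, so nothing is logged
    if coord_pick is None:
        return "COORD_EMPTY"
    if local_pick is None:
        return "LOCAL_EMPTY"
    return "AGREE" if local_pick == coord_pick else "DIVERGE"


def shadow_report(decisions):
    """Summarize (local_pick, coord_pick) pairs into counts and an agreement rate."""
    labels = (classify(local, coord) for local, coord in decisions)
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    agreement = counts["AGREE"] / total if total else 0.0
    return dict(counts), agreement
```

A low agreement rate before cutover is a signal to inspect the DIVERGE entries rather than to flip `fleet.route_via_service` on.
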
---
## 22. Capacity planning & cost
- [ ] **Concurrency model**: fleet throughput = Σ factory free-stations, bounded by per-engine **seat limits** (e.g. N Devin seats); document seat inventory per engine before P2.
- [ ] **Cosmos RU budgeting**: the claim/heartbeat paths are the hot loops. Estimate RU/s = (factories × claim-poll rate × query RU) + (factories × heartbeat rate × upsert RU); pick **long-poll interval** to keep steady-state RU within a provisioned budget; enable autoscale RU with a ceiling + 429 alerting.
- [ ] **Polling vs push**: at F factories the poll RU grows linearly; define the F threshold that triggers the P4 broker migration.
- [ ] **Blob storage**: logs/artifacts sizing + lifecycle (hot → cool → delete) per retention policy (§18).
- [ ] **Factory sizing**: per-OS resource baseline (CPU/RAM/disk for N concurrent agent sessions + warm checkouts); disk pressure as a health input.
- [ ] **Cost guardrails**: per-product spend caps + alerts; ties to `budget` and the global kill-switch.
- **Acceptance:** a documented capacity sheet (seats, RU/s, blob GB, factory specs) sized for the target steady-state + 2× burst.
- **Verify gate:** load test sustains target throughput within the RU/cost budget (no 429 storms).
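
The RU/s estimate in the Cosmos budgeting item is a two-term formula, so it is easy to turn into a small calculator for the capacity sheet. The sketch below is Python with hypothetical function names; the example numbers used with it (4 RU per claim query, 10 RU per heartbeat upsert, and the intervals) are placeholders, not measured costs, and should be replaced with observed request charges.

```python
import math


def estimate_ru_per_second(factories, claim_poll_interval_s, claim_query_ru,
                           heartbeat_interval_s, heartbeat_upsert_ru):
    """RU/s = factories x claim rate x query RU + factories x heartbeat rate x upsert RU."""
    claim = factories * claim_query_ru / claim_poll_interval_s
    heartbeat = factories * heartbeat_upsert_ru / heartbeat_interval_s
    return claim + heartbeat


def max_factories_within(budget_ru_s, claim_poll_interval_s, claim_query_ru,
                         heartbeat_interval_s, heartbeat_upsert_ru):
    """Largest fleet size whose steady-state polling fits the provisioned budget."""
    per_factory = estimate_ru_per_second(1, claim_poll_interval_s, claim_query_ru,
                                         heartbeat_interval_s, heartbeat_upsert_ru)
    return math.floor(budget_ru_s / per_factory)
```

Because both terms scale linearly with the number of factories, doubling the poll interval halves the claim term; `max_factories_within` gives one candidate for the F threshold that would trigger the P4 broker migration.
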
---
## 23. Ownership & RACI
Owners are roles, not names; assign before each phase starts (this removes the "undefined owner" gap).
| Area | Responsible (R) | Accountable (A) | Consulted (C) | Informed (I) |
| ---- | --------------- | --------------- | ------------- | ------------ |
| Runner / factory agent (bash) | DevOps eng | Platform lead | | All |
| Coordinator module (platform-service) | Backend eng | Platform lead | Security | All |
| Scheduler/router | Distributed-systems eng | Platform lead | Backend | All |
| Control plane (tracker-web Fleet) | Frontend eng | Platform lead | UX | All |
| Security/governance | Security eng | Security lead | Platform | All |
| Capacity/cost & SLOs | SRE | Platform lead | Finance | All |
| Profiles & persona governance | Eng leads | Platform lead | | All |
- [ ] Each phase names its R/A before kickoff; SLOs (§19) ratified by A.
- [ ] On-call + runbooks established before the fleet runs unattended `yolo` workloads (Phase 2+).
---
## 24. Work hierarchy & composite delegation (roadmap / epic)
**Goal:** delegate work at *any* granularity (a single bug/feature/task, **or an entire roadmap**) and let the fleet decompose + orchestrate, rather than hand a multi-day roadmap to one agent session (which is long-horizon, low first-pass-success, and high blast-radius under `yolo`).
### 24.1 Two delegation modes
- **Atomic** (today's model): one leaf item (`bug`/`feature`/`task`) → one job → one agent at one station.
- **Composite** (new): a `roadmap`/`epic` → a **planner** profile expands it into child jobs → the scheduler runs them as a **DAG across factories/agents/profiles**, honoring `deps` + phase gates. "Delegate the whole roadmap" = hand it to the **orchestrator**, which fans out; never one agent grinding for hours.
### 24.2 Job `kind` — the one genuinely new concept
A new axis, **orthogonal to tracker `type`**:
- **`kind: leaf`**: runs an engine at a station (everything Phase 1–2 already does).
- **`kind: composite`**: runs the **planner/orchestrator** that emits child `leaf` jobs and a dependency graph; it never itself edits a repo.
The scheduler (§7) routes by `kind`: `leaf` → station/engine; `composite` → planner. This keeps execution and planning cleanly separated.
### 24.3 Hierarchy & relationships
- [ ] `parentId` links a child job/item to its roadmap/epic; `deps` (§5) expresses ordering within it (DAG, submit-time cycle detection).
- [ ] A roadmap is, mechanically, a **named DAG of jobs + a rollup**: it reuses `deps`, profiles (§6), the scheduler (§7), and the lifecycle (§11); the only additions are `kind`, `parentId`, and rollup logic.
- [ ] Add a **`planner`/`architect`/`tech-lead` profile** (§6 catalog) for decomposition + orchestration; leaf work still uses `backend-engineer`, `ux-designer`, etc.
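
Submit-time cycle detection over `deps`, combined with stamping `kind` and `parentId` on the child set, can be sketched as a depth-first search. This Python version is illustrative only: `find_cycle` and `submit_roadmap` are hypothetical names, children are plain dictionaries, and the recursive search assumes roadmaps small enough not to hit the interpreter's recursion limit.

```python
def find_cycle(deps):
    """Return a list of job ids forming a cycle, or None.

    deps maps each job id to the ids it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in deps.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:  # back edge: dep is already on the current path
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


def submit_roadmap(parent_id, children):
    """Reject a cyclic child set; otherwise stamp each child as a leaf of the parent."""
    cycle = find_cycle({c["id"]: c.get("deps", []) for c in children})
    if cycle:
        raise ValueError("dependency cycle: " + " -> ".join(cycle))
    return [{**c, "kind": "leaf", "parentId": parent_id} for c in children]
```

Rejecting the whole child set at submit time keeps a cyclic roadmap from ever reaching the scheduler, where it would otherwise sit blocked forever.
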
### 24.4 Rollup semantics (composite-level)
- [ ] **Status rollup:** roadmap `status` is derived from children: `in_progress` once any child starts; `shipped`/`done` only when **all** children reach `shipped`; surfaces `blocked`/`failed` children for triage.
- [ ] **Budget rollup:** roadmap `budget` = Σ child budgets with an explicit **ceiling**; breaching the ceiling pauses fan-out (ties to §12 kill-switch).
- [ ] **Verify rollup:** each leaf runs its own `verify`; the roadmap's acceptance gate runs **after** all leaves pass (e.g. an integration/e2e gate).
- [ ] **Phase gates:** the roadmap's own phase Exit-criteria become **runtime gates**: fan-out of phase N+1 is blocked until phase N's children ship; human approval between phases is the default for `yolo` safety.
- [ ] **Idempotent re-run:** re-running a roadmap **skips already-`shipped` children** (content-hash dedupe, §5); only unfinished/changed children re-queue.
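
The status and budget rollups above are pure functions of the child set, which makes them straightforward to unit test. The Python sketch below assumes a child is a dictionary with `id`, `stage`, and `budget_usd` fields, and that a child counts as started once its stage is anything other than `queued` or `blocked`; those shapes and that rule are illustrative assumptions, not the `fleet_jobs` schema.

```python
NOT_STARTED = ("queued", "blocked")  # assumption: these stages mean no work has begun


def rollup_status(children):
    """Derive the roadmap status and the list of children needing triage."""
    stages = [c["stage"] for c in children]
    triage = [c["id"] for c in children if c["stage"] in ("blocked", "failed")]
    if stages and all(s == "shipped" for s in stages):
        status = "shipped"
    elif any(s not in NOT_STARTED for s in stages):
        status = "in_progress"
    else:
        status = "queued"
    return status, triage


def rollup_budget(children, ceiling_usd):
    """Sum child budgets and report whether fan-out should pause."""
    total = sum(c.get("budget_usd", 0.0) for c in children)
    return total, total > ceiling_usd
```

Because the parent status is always recomputed from children, nobody needs to hand-edit rollup state, in line with the one-way status flow described in §24.5.
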
### 24.5 Source-of-truth & sync (no drift)
Composite work obeys the same SoT discipline as the core contract (§4, immutable manifest) and the tracker echo (§10): a roadmap/epic is **one record referenced by many**, never duplicated.
- [ ] The **roadmap/epic** is the SoT for *what/why + rollup status*; each **leaf job/run** is the SoT for *its* execution.
- [ ] Children reference the parent by `parentId`; the planner writes the child set **once** at decomposition (immutable manifest snapshot). Re-planning creates a new revision, it does not mutate in-flight children.
- [ ] Status flows **one way, child → parent → tracker** (the §10 echo); humans never hand-edit rollup state.
### 24.6 Decision — **Hybrid** (recorded)
> Model composite delegation in the **fleet layer now**; defer the shared-platform enum change until proven.
- **Now (fleet-owned):** add `kind` (`leaf`/`composite`), `parentId`, and rollup to the `fleet_jobs` schema (§13). The fleet owns this schema outright, so there is no cross-product risk.
- **Tracker stays `bug`/`feature`/`task`** (the shared `ITEM_TYPES` used by all 9 products is unchanged). A roadmap is represented by a **parent item + label `kind:roadmap`** + `parentId` on children: zero platform migration, no sign-off needed.
- **Later (optional, gated on proven value):** promote `kind:roadmap` to a first-class `epic` tracker `type` via an **additive migration** (backfill items where `labels` contains `kind:roadmap` into `type: epic`, keep the label as an alias during transition). Low-risk because the behavior already works fleet-side.
- **Rationale:** avoids a speculative 9-product platform change (UI/filters/stats/tests) before the orchestration model is validated; if the model is wrong, only fleet code is refactored, not a platform enum every product depends on.
### 24.7 Phasing & gates
- **P1–P2:** leaf-only (no composite); `kind` defaults to `leaf`.
- **P3:** composite scheduling + rollup + DAG view in the control plane, with **manual decomposition** (a human/author defines the child set).
- **P3→P5:** the **auto-decomposition planner agent** (itself a `composite` job run by the `planner` profile): start manual, automate once trustworthy.
- **Acceptance:** a roadmap with N child jobs fans out across 2 factories, respects `deps` + phase gates, rolls up status/budget correctly, and a re-run skips shipped children; tracker shows the parent moving `in_progress → done` via the one-way echo.
- **Verify gate:** composite-orchestration tests: DAG expansion, rollup status/budget, phase-gate blocking, idempotent re-run; control-plane e2e for the roadmap DAG view.
---
## 25. Durability, crash recovery & work preservation
**Goal:** a machine power-off, daemon/agent crash, or network partition **never loses the job, its instructions, or in-progress work**, and never corrupts state. Recovery is automatic and idempotent.
### 25.1 Instructions are durable (markdown in Cosmos)
- [ ] The **full job instruction body is persisted verbatim as markdown** in `fleet_jobs.bodyMd` (§13), alongside the structured manifest. The originating tracker `Item.description` also retains the human instruction text; the two are linked by `tracker-item`, never duplicated as competing truth (§24.5).
- [ ] A factory only ever holds a **transient materialized copy** (temp prompt file) fetched from the API; losing the factory loses nothing. On the offline edge, the `.md` file on disk is the durable copy and reconciles on reconnect (§9).
### 25.2 Work-in-progress is preserved (checkpointing)
- [x] For a git-repo `cwd`, the worker commits **WIP to a dedicated branch `aq/wip/<jobId>`** at start and on every exit path (success, failure, timeout, signal), so partial work is never lost to a crash. Never commits to `main`/protected branches (§12 push policy). *(P1-S3: `_wip_start`/`_wip_checkpoint` + EXIT/INT/TERM trap; non-git cwd skipped.)*
- [ ] `fleet_jobs.checkpoint` records the WIP branch + last commit so any worker can find it. *(P2 Cosmos; single-host records `wip_branch`/`wip_base`/`wip_commit` in `<job>.meta`.)*
- [x] Long agents checkpoint periodically where the engine supports it; otherwise the start/exit commits bound the loss window. *(P1-S3: start + every-exit-path commits bound the loss window.)*
### 25.3 Recovery is automatic, resumable, and fenced
- [x] **Orphan detection:** on coordinator/runner startup (and continuously), a job in `building/assigned` whose worker is dead (no live lease / dead pid) is an **orphan**; it is recovered, not stranded. *(P1-S3: `recover_orphans` on `run` startup + each loop, and `agent-queue.sh recover`; dead-pid + `pidstart` reuse guard.)*
- [x] **Resume vs restart:** recovery starts a **new `fleet_runs` attempt**; if `aq/wip/<jobId>` exists, the new worker **resumes from the checkpoint** instead of restarting from zero. *(P1-S3: relaunch checks out `aq/wip/<job>`; `attempts` incremented.)*
- [ ] **Fencing (§4):** the reclaimed run gets a higher `leaseEpoch`; the dead/zombie worker's late commits/ship reports are rejected, so there is no double-execution of *visible* outcomes. *(P2: distributed leasing; out of single-host scope.)*
- [x] **Retry policy** (`retry.max/backoff/on`): agent `rc≠0` / `timeout` / `verify_failed` → requeue with backoff up to `max`; on exhaustion → `dead_letter` (P2) / `failed` (P1 stand-in) with full diagnostics; never silently dropped. *(P1-S3 single-host.)*
- [x] **State integrity:** all run state is **append-only / optimistic-concurrency guarded** (§13); recovery is idempotent (running it twice yields one recovery). *(P1-S3 single-host: meta is append-only + re-derivable from folder location; `_etag` guard is P2.)*
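
The retry policy above (`retry.max/backoff/on`) amounts to a small decision function run after each failed attempt. The Python sketch below is an assumption-laden illustration: it treats `max` as the number of retries allowed after the first attempt, assumes the delay doubles from the base `backoff` on each retry, and uses hypothetical outcome labels such as `rc_nonzero`. The actual backoff curve and labels used by the runner may differ.

```python
def next_action(policy, attempt, outcome):
    """Decide what happens after an attempt finishes.

    policy:  {"max": retries allowed, "backoff": base delay in seconds, "on": [outcomes]}
    attempt: 1-based number of the attempt that just finished
    outcome: "ok" or a failure label
    """
    if outcome == "ok":
        return {"action": "proceed"}
    if outcome not in policy["on"]:
        # Not a retryable failure: fail immediately, keeping the reason.
        return {"action": "failed", "result": outcome}
    if attempt > policy["max"]:
        return {"action": "dead_letter", "result": "retries_exhausted"}
    delay = policy["backoff"] * 2 ** (attempt - 1)  # assumed exponential backoff
    return {"action": "requeue", "delay_s": delay, "next_attempt": attempt + 1}
```

Every branch returns an explicit result, so a job that stops retrying always carries a reason, consistent with the requirement that failures are never silently dropped.
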
### 25.4 Crash taxonomy (all handled)
| Failure | Detection | Recovery |
| ------- | --------- | -------- |
| Agent process crash (`rc≠0`) | exit code | retry policy → requeue or `failed`/`dead_letter` |
| Daemon/runner crash | lease not renewed | reaper reclaims → resume from checkpoint |
| Machine power-off / partition | missed heartbeats + lease expiry | reaper + fencing + WIP resume elsewhere |
| Coordinator restart | state in Cosmos | leases survive; in-flight reconciled on boot |
- **Acceptance:** SIGKILL an agent and power-off a factory mid-run → another worker **resumes from the last checkpoint (not from zero)** and ships; instructions intact (read back from Cosmos `bodyMd`); **zero duplicate commits/merges**; a retry-exhausted job lands in `dead_letter`/`failed` with diagnostics.
- **Verify gate:** chaos tests: kill agent, kill runner, simulate partition; assert resume-from-checkpoint, fencing rejection of the stale worker, instruction integrity, and no double-merge.
---
## 26. Execution insights & token accounting
**Goal:** per-job/run visibility into **token usage, cost, model, latency, and tool activity** to drive budgets (§5, §12), cost burndown (§17), and learned routing (§14 P5).
- [x] **Per-run telemetry record** (in `fleet_runs`, streamed as `fleet_events`): engine, model, **tokensIn/Out (+cached)**, **cost USD** (`estimated:true` when not provider-reported), wall + CPU time, **turn count, tool-call counts**, verify pass/fail, **filesChanged, linesAdded/Deleted**, attempt number, retries. *(P1-S3 single-host: recorded in `<job>.meta` — `duration_s`, `files_changed`/`lines_added`/`lines_deleted`, tokens/cost/turns/tool_calls, `attempts`; CPU time not captured.)*
- [x] **Token source (honest feasibility):** capture real usage where the engine/provider exposes it (Claude/Codex/OpenAI usage in responses; Devin session metrics); otherwise **estimate** from log heuristics and mark `estimated` (same caveat as `budget.usd/tokens`, §5). A single `parse_usage(engine, log)` adapter centralizes per-engine extraction. *(P1-S3: `parse_usage` adapter; generic `AQ_USAGE` line + Claude/Codex heuristics; Devin/Copilot TODO; `usage_estimated` flag, never fabricated.)*
- [ ] **Aggregation/rollups:** per job, roadmap (§24), product, factory, engine, profile, and day. Powers cost burndown (§17) and the learned-routing eval (§14). *(P1-S3 partial: `aq insights` does per-job + per-engine rollup; product/factory/profile/day are P2/P3.)*
- [ ] **Surfacing:** control-plane panels (tokens, cost, success/first-pass/human-edit rates) + a CLI insights summary at the edge; reuse the platform-service telemetry module where present. *(P1-S3 partial: edge CLI `aq insights` + `status`/`dash` insights line done; web control-plane panels are P3.)*
- [x] **Privacy:** telemetry carries metrics + pointers only; **never prompt content or secrets** (redaction §12). *(P1-S3: insights/meta record only metrics; no prompt body or secrets added.)*
- **Acceptance:** after a run, its `fleet_runs` carries token/cost/duration/tool/diff metrics (real where metered, flagged `estimated` otherwise); dashboards show per-engine and per-profile cost + token totals; a budget breach is detectable from telemetry alone.
- **Verify gate:** telemetry unit tests (capture + rollup); a metered-engine run records real tokens; an unmetered run records estimated + flagged; aggregation totals verified.
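
As a rough illustration of the token-source and rollup items above, the sketch below pairs a `parse_usage(engine, log)` adapter with a per-engine rollup. The `AQ_USAGE` line format shown here, the characters-per-token heuristic, and the record keys other than `usage_estimated` are assumptions; the real adapter and its per-engine heuristics may differ.

```python
import re
from collections import defaultdict

# Hypothetical marker format; the real AQ_USAGE line may look different.
_AQ_USAGE = re.compile(
    r"^AQ_USAGE[ \t]+tokens_in=(\d+)[ \t]+tokens_out=(\d+)"
    r"[ \t]+cost_usd=(\d+(?:\.\d+)?)[ \t]*$",
    re.MULTILINE,
)
CHARS_PER_TOKEN = 4  # crude estimate, an assumption for this sketch


def parse_usage(engine: str, log: str) -> dict:
    """Return metered usage if the log reports it, else a flagged estimate."""
    m = _AQ_USAGE.search(log)
    if m:
        return {"engine": engine, "tokens_in": int(m.group(1)),
                "tokens_out": int(m.group(2)), "cost_usd": float(m.group(3)),
                "usage_estimated": False}
    # No reported usage: estimate output tokens from log size, leave cost unknown.
    return {"engine": engine, "tokens_in": None,
            "tokens_out": len(log) // CHARS_PER_TOKEN, "cost_usd": None,
            "usage_estimated": True}


def rollup_by_engine(runs: list) -> dict:
    """Sum runs, output tokens, and known cost per engine."""
    totals = defaultdict(lambda: {"runs": 0, "tokens_out": 0,
                                  "cost_usd": 0.0, "estimated_runs": 0})
    for run in runs:
        t = totals[run["engine"]]
        t["runs"] += 1
        t["tokens_out"] += run["tokens_out"] or 0
        t["cost_usd"] += run["cost_usd"] or 0.0   # unknown cost adds nothing
        t["estimated_runs"] += int(run["usage_estimated"])
    return dict(totals)
```

Keeping `estimated_runs` in the rollup lets a dashboard show how much of a total rests on estimates rather than provider-reported usage.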
---
*This document is the single source of truth for the gigafactory build. Keep the §0 table and per-phase checkboxes updated; a phase ships only when its Exit criteria and the §16 Definition-of-Done rubric are fully green.*