bytelyst-devops-tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md
saravanakumardb1 90366e59bb docs(agent-queue): add gigafactory vision + checklist implementation roadmap
- docs/GIGAFACTORY_ROADMAP.md: distributed multi-machine fleet vision
  (factory x tool x profile routing) as a checklist-driven, phased
  implementation roadmap (Phase 0-5) with acceptance criteria, verify
  gates, and a 100% Definition-of-Done rubric
- committed path: coordinator as a platform-service module + control
  plane on tracker-web, reached via a thin tracker adapter first; bash
  runner survives as the offline edge factory agent
- README: add vision/roadmap pointer
2026-05-29 17:06:32 -07:00


# Agent Gigafactory — Vision & Implementation Roadmap
> **One-liner:** Evolve today's single-host `agent-queue` bash runner into a distributed **gigafactory** — a fleet of heterogeneous machines (Mac/Ubuntu/Windows), each running different coding-agent CLIs (Devin/Codex/Claude/Copilot/…), where a scheduler **auto-picks jobs from a shared inbox and routes each `.md` to the best factory × tool × profile** — built service-side on `platform-service` + `tracker-web`, with the bash runner surviving as the offline edge agent.
>
> **How to use this doc:** It is both a PRD and an execution checklist. Every feature is a `- [ ]` checkbox with **acceptance criteria** and a **verify gate**. A phase is "100% done" only when every box is checked, its gate passes, and the phase **Definition of Done** rubric (§16) is green. Update the progress table (§0) as you go.
---
## 0. Progress tracker
| Phase | Theme | Status | % | Gate |
| ----- | ----- | ------ | - | ---- |
| **0** | Baseline (today) | ✅ shipped | 100% | `selftest.sh` green |
| **1** | Manifest + profiles + capabilities + tracker adapter (single host) | ☐ not started | 0% | adapter e2e + selftest |
| **2** | Coordinator as platform-service module + Cosmos + multi-factory leasing | ☐ not started | 0% | fleet e2e + module tests |
| **3** | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ☐ not started | 0% | web e2e + router tests |
| **4** | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite |
| **5** | Self-optimizing / learned routing | ☐ not started | 0% | offline eval + A/B |

Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary.
---
## 1. Vision & metaphor
A **gigafactory** turns raw intent (`.md` task files / tracker items) into shipped software with minimal human touch. The mental model is a physical factory network:
| Term | Meaning |
| ---- | ------- |
| **Fleet** | The whole network of machines under one control plane. |
| **Factory** | One physical/virtual machine (a Mac, an Ubuntu box, a Windows host). Has an OS, installed tools, creds, capacity. |
| **Station** | A tool/engine slot inside a factory (a Devin seat, a Codex CLI, a Claude Code session, a Copilot agent). |
| **Worker** | A single running agent process executing one job at a station. |
| **Job** | A unit of work: a prompt/`.md` + manifest (profile, scope, gates, budget). |
| **Profile** | The *role* doing the work (developer, backend engineer, UX/UI designer, QA, reviewer) = persona prompt **+** capability requirements. |
| **Capability** | A tag a factory advertises and a job requires (`os:mac`, `has:xcode`, `has:figma`, `gpu`, `engine:devin`). |
| **Lease** | A time-boxed claim of a job by a worker; expires → job is reclaimable (crash recovery). |
| **Gate** | A checkpoint a job must pass: auto-QA `verify`, human review, ship approval. |
| **Artifact** | Any captured output: commits/PRs, logs, screenshots, reports, build outputs. |

**North star:** drop work into one inbox (or file a tracker task), and the fleet figures out *where* (factory), *with what* (tool/engine), *as whom* (profile), runs it in parallel, self-heals on crash, gates quality automatically, and surfaces everything in one live control plane, while a human only approves the final ship.
```
      ┌──────────────────────── CONTROL PLANE (tracker-web) ────────────────────────┐
      │ plan/intake · roadmap · Fleet map · live logs · cost · approvals            │
      └───────────────▲──────────┬──────────────────────────────────────────────────┘
                      │ REST/SSE │
┌─────────────────────┴──────────┴───── COORDINATOR (platform-service module) ─────────────┐
│ queue · scheduler/router · leases · profiles · capabilities · events · budgets (Cosmos)  │
└─────────▲───────────────────────▲───────────────────────▲───────────────────────▲────────┘
          │ claim/lease/report    │                       │                       │
 ┌────────┴───────┐      ┌────────┴───────┐      ┌────────┴───────┐      ┌────────┴───────┐
 │ FACTORY: mac   │      │ FACTORY: ubuntu│      │FACTORY: windows│      │ FACTORY: mac-2 │
 │ devin, claude  │      │ codex, claude  │      │ copilot, codex │      │ devin (xcode)  │
 │ [agent-queue]  │      │ [agent-queue]  │      │ [agent-queue]  │      │ [agent-queue]  │
 └────────────────┘      └────────────────┘      └────────────────┘      └────────────────┘
```
---
## 2. Current state (Phase 0 baseline — already shipped)
Today's `agent-queue.sh` + `dashboard.mjs` (single host, zero-dep bash + Node):
- **Folder kanban lifecycle:** `inbox → building → review → testing → shipped` (+ `failed`).
- **Auto-QA gate:** agent rc=0 → `review/`; optional `verify:` runs in `cwd` → pass `testing/`, fail `failed/`; no verify → parks in `review/`. Manual `ship` = the human gate.
- **Per-job frontmatter:** `engine` (devin/claude/codex), `cwd`, `yolo` (→ dangerous/auto-approve), `lock` (per-repo serialization), `timeout`, `verify`.
- **Concurrency:** `AGENT_QUEUE_MAX` (default 3), per-`lock` serialization so same-repo jobs never collide.
- **State & logs:** `.state/<job>.meta` heartbeats + `logs/<job>.log`; git-tracked queue (audit-by-commit).
- **Interactive dashboard:** numbered selectable job list, single-key actions (promote/ship/reject/requeue), live log viewer, run/stop, all shelling out to `agent-queue.sh`.

**Carries forward:** the `.md`-in-`inbox` UX, frontmatter contract, lifecycle stage names, `verify` gate, lock/affinity concept, the bash runner itself (becomes the factory agent).

**Must change for the fleet:** single-host run loop → distributed leasing; file-only state → service + Cosmos; one engine choice → capability/profile routing; local dashboard → shared control plane.

- [x] Phase 0 complete — baseline shipped and self-tested. *(reference, not a work item)*
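
The Phase 0 auto-QA gate described above can be read as a small pure function. The sketch below is illustrative only: the real logic lives in `agent-queue.sh`, the names `stageAfterRun` and `VerifyOutcome` are hypothetical, and sending a non-zero agent exit code to `failed/` is an assumption this section does not state outright.

```typescript
// Stage folders from the Phase 0 kanban lifecycle.
type Stage = "inbox" | "building" | "review" | "testing" | "shipped" | "failed";

// Result of the optional `verify:` command; "none" means the job declares no verify.
type VerifyOutcome = "pass" | "fail" | "none";

// Decide which folder a job lands in once the agent process exits.
// Assumption: a non-zero agent exit code sends the job to `failed/`.
function stageAfterRun(agentExitCode: number, verify: VerifyOutcome): Stage {
  if (agentExitCode !== 0) return "failed";
  if (verify === "none") return "review"; // parks until a human promotes it
  return verify === "pass" ? "testing" : "failed";
}
```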
---
## 3. Goals & non-goals
**Goals**
- One intake, many machines: parallel execution across heterogeneous OS/tools.
- Automatic routing to the best `factory × tool × profile` with affinity, fairness, budget, and health awareness.
- Self-healing (lease expiry/requeue), quality gates, and full observability.
- Reuse the ByteLyst stack (`platform-service`, Cosmos, `@bytelyst/*`, tracker-web) — no parallel infra.
- Preserve offline/zero-dep edge operation via the bash runner.

**Non-goals**
- Not a CI/CD replacement (it *triggers* CI; CI still gates merges).
- Not a general-purpose workflow engine (scoped to coding-agent execution).
- Not a model/inference host (it orchestrates agent CLIs, doesn't serve models).
- Not abandoning the simple `.md` mental model — humans still drop files / file tasks.
---
## 4. Core concepts contract (must hold across all phases)
- [ ] Every job has a stable **id**, an immutable **manifest**, and an append-only **event log**.
- [ ] Every Cosmos document carries `productId` (ByteLyst rule).
- [ ] A job in flight is always covered by exactly one **lease**; no lease → reclaimable.
- [ ] Lifecycle stages are canonical and shared: `queued → assigned → building → review → testing → shipped` (+ `failed`, `dead_letter`).
- [ ] The bash runner and the service speak the **same manifest + event vocabulary** (one schema, two transports).
---
## 5. The evolved Job manifest (feature)
Extend today's frontmatter into a richer, **backward-compatible** manifest. Old `.md` files keep working (new fields optional with sane defaults).
```yaml
---
# --- existing (unchanged) ---
engine: devin # explicit engine; overrides profile/engine-class
cwd: /abs/path/repo
yolo: true
lock: my-repo
timeout: 45m
verify: pnpm -s test
# --- new ---
profile: backend-engineer # role: persona + capability requirements
engine-class: agentic-coder # abstract; scheduler picks a concrete engine if `engine` unset
capabilities: [os:any, node>=20] # hard requirements a factory MUST satisfy
prefers: [factory:mac-2] # soft routing hints (affinity)
priority: high # critical|high|medium|low → SLA + preemption
budget: { usd: 5, tokens: 2M, wall: 4h } # hard ceilings; exceed → pause/fail
deps: [job-123, job-456] # DAG: don't start until these reach `shipped`/`testing`
idempotency-key: nomgap-ux-2 # dedupe: a second identical submit is a no-op
retry: { max: 2, backoff: 5m, on: [timeout, verify_failed] }
review-policy: manual # auto|manual|reviewers:[@alice]
artifacts: [coverage, screenshots] # what to capture beyond commits
tracker-item: ITEM-789 # link back to the originating tracker task
---
```
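
The `deps` field implies a dependency graph that must stay acyclic. As a rough sketch of what cycle detection and the blocked check might look like (the names `DepGraph`, `findCycle`, and `isBlocked` are hypothetical, and the accepted stages are taken from the manifest comment above):

```typescript
// Map of job id -> ids it depends on (the manifest `deps` field).
type DepGraph = Record<string, string[]>;

// Depth-first search tracking unvisited / visiting / done nodes.
// Returns one cycle as a list of ids (first id repeated at the end), or null.
function findCycle(graph: DepGraph): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const stack: string[] = [];

  const visit = (id: string): string[] | null => {
    if (state.get(id) === "done") return null;
    if (state.get(id) === "visiting") {
      return [...stack.slice(stack.indexOf(id)), id];
    }
    state.set(id, "visiting");
    stack.push(id);
    for (const dep of graph[id] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    stack.pop();
    state.set(id, "done");
    return null;
  };

  for (const id of Object.keys(graph)) {
    const cycle = visit(id);
    if (cycle) return cycle;
  }
  return null;
}

// A job stays blocked until every dep has reached an accepted stage.
function isBlocked(
  deps: string[],
  stageOf: Record<string, string>,
  accepted: string[] = ["shipped", "testing"],
): boolean {
  return deps.some((dep) => !accepted.includes(stageOf[dep] ?? "unknown"));
}
```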
- [ ] Define the manifest schema (Zod in the service; documented YAML for `.md`).
- [ ] Backward-compat: a Phase-0 `.md` (only `engine/cwd/yolo`) parses with all new fields defaulted.
- [ ] `idempotency-key` dedupe semantics specified (same key + same content hash = no-op).
- [ ] `deps` DAG semantics specified (blocked state, cycle detection, fan-in/out).
- **Acceptance:** a manifest fixture suite parses/validates; invalid manifests fail with precise errors.
- **Verify gate:** schema unit tests (≥ 1 per field incl. defaults + 5 invalid cases).
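
To make the backward-compatibility requirement concrete, here is a minimal sketch of parsing a subset of the manifest with defaults. The roadmap calls for Zod in the service; this version uses plain TypeScript only to stay dependency-free. The function name `parseManifest`, the chosen subset of fields, and the default values (`priority: "medium"`, `yolo: false`) are assumptions, since the document only promises "sane defaults".

```typescript
type Priority = "critical" | "high" | "medium" | "low";

interface Manifest {
  engine: string | null;
  cwd: string | null;
  yolo: boolean;
  priority: Priority;
  capabilities: string[];
  deps: string[];
}

type ParseResult = { ok: true; manifest: Manifest } | { ok: false; errors: string[] };

const isStringArray = (v: unknown): v is string[] =>
  Array.isArray(v) && v.every((x) => typeof x === "string");

const isPriority = (v: unknown): v is Priority =>
  v === "critical" || v === "high" || v === "medium" || v === "low";

// Validate raw frontmatter; absent fields fall back to (assumed) defaults,
// present-but-wrong fields produce one precise error each.
function parseManifest(raw: Record<string, unknown>): ParseResult {
  const { engine, cwd, yolo, priority, capabilities, deps } = raw;
  const errors: string[] = [];

  if (engine !== undefined && typeof engine !== "string") errors.push("engine: expected string");
  if (cwd !== undefined && typeof cwd !== "string") errors.push("cwd: expected string");
  if (yolo !== undefined && typeof yolo !== "boolean") errors.push("yolo: expected boolean");
  if (priority !== undefined && !isPriority(priority)) {
    errors.push("priority: expected critical|high|medium|low");
  }
  if (capabilities !== undefined && !isStringArray(capabilities)) {
    errors.push("capabilities: expected string[]");
  }
  if (deps !== undefined && !isStringArray(deps)) errors.push("deps: expected string[]");
  if (errors.length > 0) return { ok: false, errors };

  return {
    ok: true,
    manifest: {
      engine: typeof engine === "string" ? engine : null,
      cwd: typeof cwd === "string" ? cwd : null,
      yolo: yolo === true,
      priority: isPriority(priority) ? priority : "medium",
      capabilities: isStringArray(capabilities) ? capabilities : [],
      deps: isStringArray(deps) ? deps : [],
    },
  };
}
```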
---
## 6. Profiles — persona + capability (feature)
A **profile** = a versioned file combining a persona (system-prompt overlay), required capabilities, default gates, preferred engine/model, and allowed repo scopes. Stored as `profiles/<name>.md` (Phase 1) → Cosmos `profiles` container (Phase 2).
```yaml
# profiles/backend-engineer.md
---
name: backend-engineer
persona: |
  You are a senior backend engineer. Favor minimal, well-tested changes...
capabilities: [node>=20, has:pnpm]
default-verify: pnpm -s typecheck && pnpm -s test
engine-class: agentic-coder
prefers-engine: [devin, claude]
allowed-scope: ["backend/**", "packages/**"] # blast-radius guardrail
review-policy: manual
---
```
- [ ] Author starter catalog: `developer`, `backend-engineer`, `frontend-engineer`, `ux-designer`, `ui-designer`, `qa`, `reviewer`, `docs-writer`.
- [ ] Persona overlay is **prepended** to the job body before the agent runs (and secrets are stripped from the logs).
- [ ] Profile supplies default `verify`, `capabilities`, `engine-class`, `allowed-scope` when the job omits them.
- [ ] Profile versioning: changing a profile doesn't mutate in-flight jobs (snapshot at assign time).
- [ ] `allowed-scope` enforced as a guardrail (warn in P1, enforce/deny in P2 via pre-flight diff check).
- **Acceptance:** a job with `profile: backend-engineer` and no `verify` inherits the profile's verify + persona.
- **Verify gate:** profile-resolution unit tests; persona-injection golden test.
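
The resolution rules above (job value wins, profile fills the gaps, persona is prepended, profile is snapshotted) might be sketched as follows. All type and function names here are hypothetical, and the camelCase field names are an assumed mapping of the YAML keys.

```typescript
interface Profile {
  name: string;
  persona: string;
  capabilities: string[];
  defaultVerify: string;
  engineClass: string;
  allowedScope: string[];
}

interface Job {
  body: string;
  verify?: string;
  capabilities?: string[];
  engineClass?: string;
  allowedScope?: string[];
}

interface ResolvedJob {
  prompt: string;
  verify: string;
  capabilities: string[];
  engineClass: string;
  allowedScope: string[];
  profileName: string;
}

// Merge a job with its profile. Arrays are copied so that later edits to the
// profile do not mutate a job that has already been resolved (the snapshot rule).
function resolveJob(job: Job, profile: Profile): ResolvedJob {
  return {
    prompt: `${profile.persona.trim()}\n\n${job.body}`,
    verify: job.verify ?? profile.defaultVerify,
    capabilities: [...(job.capabilities ?? profile.capabilities)],
    engineClass: job.engineClass ?? profile.engineClass,
    allowedScope: [...(job.allowedScope ?? profile.allowedScope)],
    profileName: profile.name,
  };
}
```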
---
## 7. The scheduler / router (the heart) (feature)
Given a `queued` job and the current fleet, choose `(factory, station/engine, profile)` and issue a lease.

**Inputs:** job manifest (capabilities, priority, budget, deps, prefers, lock), profile requirements, live factory descriptors (capabilities, load, health, cost class), lock/affinity table, fairness counters.

**Algorithm (deterministic, explainable):**
1. **Filter** factories by **hard capability match** (job and profile capabilities ⊆ factory capabilities) and by a free station for a compatible engine.
2. **Block** if `deps` unmet or `lock` already held → leave `queued`/`blocked`.
3. **Score** each candidate factory:
   `score = w1·capabilityFit + w2·affinity(prefers, repo-stickiness) + w3·(1/load) + w4·costFit(budget) + w5·health - w6·starvationPenalty`
4. **Tie-break:** highest priority job first; then oldest; then lowest cost class.
5. **Assign** → write lease (TTL), set job `assigned`, decrement station capacity, bump fairness counter.
6. **Preemption (P3+):** a `critical` job may pause a `low` job at a needed station (checkpoint + requeue).
- [ ] Implement pure, unit-testable scoring function (no I/O) with configurable weights.
- [ ] Hard-filter correctness: never assign a job to a factory missing a required capability.
- [ ] Affinity/stickiness: same-repo jobs prefer the factory that has the warm checkout (lock-aware).
- [ ] Fairness: no factory or product starves under sustained load (counter + penalty).
- [ ] Explainability: every assignment records *why* (matched caps, score breakdown) in the event log.
- [ ] Determinism: same inputs → same decision (seeded tie-breaks) for testability.
- **Acceptance:** scenario fixtures (10+) produce expected assignments incl. starvation + capability-miss + budget-exceed.
- **Verify gate:** router unit suite ≥ 95% branch coverage on the scoring/filter core.
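
The filter and score steps can be sketched as a pure function with no I/O. This is a rough illustration under several assumptions, not the intended implementation: each signal is taken as precomputed and normalized, capabilities are matched as exact tags (no `node>=20` version logic or `os:any` wildcard), the load term uses `1 / (1 + load)` instead of the literal `1/load` to avoid dividing by zero on an idle factory, the starvation penalty is subtracted, and ties fall back to factory id order. All names are hypothetical.

```typescript
interface Factory {
  id: string;
  capabilities: string[];
  freeStations: number;
  capabilityFit: number;     // assumed precomputed, 0..1
  affinity: number;          // assumed precomputed, 0..1
  load: number;              // 0 = idle
  costFit: number;           // assumed precomputed, 0..1
  health: number;            // assumed precomputed, 0..1
  starvationPenalty: number; // subtracted from the score
}

// Weights w1..w6 in the order used by the scoring formula.
type Weights = [number, number, number, number, number, number];

function score(f: Factory, w: Weights): number {
  return (
    w[0] * f.capabilityFit +
    w[1] * f.affinity +
    w[2] * (1 / (1 + f.load)) + // assumption: guards against load = 0
    w[3] * f.costFit +
    w[4] * f.health -
    w[5] * f.starvationPenalty
  );
}

// Hard filter first, then highest score; equal scores resolve by id for determinism.
function pickFactory(required: string[], fleet: Factory[], w: Weights): Factory | null {
  const eligible = fleet.filter(
    (f) => f.freeStations > 0 && required.every((cap) => f.capabilities.includes(cap)),
  );
  const ranked = [...eligible].sort(
    (a, b) => score(b, w) - score(a, w) || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0),
  );
  return ranked[0] ?? null;
}
```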
---
## 8. Factory model & registration (feature)
Each machine runs a **factory agent** (the evolved `agent-queue` runner) that registers, heartbeats, claims jobs, and reports events.
- [ ] **Capability auto-detection** at boot: OS, installed engines (devin/claude/codex/copilot), tool probes (xcode, figma-cli, docker, gpu), node/pnpm versions, available creds (presence only, never values).
- [ ] **Registration**: `POST /fleet/factories` with descriptor → receives a factory id + token.
- [ ] **Heartbeat**: periodic `PUT /fleet/factories/:id/heartbeat` (load, free stations, health); missed N → factory marked `offline`, its leases reclaimed.
- [ ] **Claim loop**: `POST /fleet/leases/claim` advertising capabilities/free stations; receives a job + lease TTL.
- [ ] **Report**: stream stage/log/event back (`POST /fleet/runs/:id/events`); renew lease while alive.
- [ ] **Graceful drain**: factory can stop claiming, finish in-flight, deregister.
- **Acceptance:** a factory registers, claims a matching job, heartbeats, completes, and a killed factory's job is reclaimed by another within the lease TTL.
- **Verify gate:** factory-agent integration test against a mock coordinator; crash-recovery test.
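
The claim, renew, and reclaim behaviour described above can be modelled in memory before any service exists. The sketch below assumes a single-process table with an injected clock (`now` in milliseconds); the class and method names are hypothetical, and a real coordinator would need atomic writes in its datastore rather than a `Map`.

```typescript
interface Lease {
  jobId: string;
  factoryId: string;
  expiresAt: number; // epoch milliseconds
}

class LeaseTable {
  private readonly leases = new Map<string, Lease>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // A claim succeeds only when the job has no live lease.
  claim(jobId: string, factoryId: string, now: number): Lease | null {
    const current = this.leases.get(jobId);
    if (current && current.expiresAt > now) return null;
    const lease: Lease = { jobId, factoryId, expiresAt: now + this.ttlMs };
    this.leases.set(jobId, lease);
    return lease;
  }

  // Only the current holder may renew, and only before the lease expires.
  renew(jobId: string, factoryId: string, now: number): boolean {
    const current = this.leases.get(jobId);
    if (!current || current.factoryId !== factoryId || current.expiresAt <= now) return false;
    current.expiresAt = now + this.ttlMs;
    return true;
  }

  // Jobs whose lease has lapsed and may be handed to another factory.
  reclaimable(now: number): string[] {
    return Array.from(this.leases.values())
      .filter((lease) => lease.expiresAt <= now)
      .map((lease) => lease.jobId);
  }
}
```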
---
## 9. Coordination architecture (decision + path)
Three transports were evaluated. **Decision: platform-service-native coordinator is the spine; git-queue stays for the offline edge; broker added only at scale.**
| Option | Pros | Cons | Verdict |
| ------ | ---- | ---- | ------- |
| (a) **Git-synced queue** (evolve folders) | zero infra, audit-by-commit, offline | weak/racy leasing, latency, merge churn | **Edge/offline only** |
| (b) **Coordinator service** (platform-service module) | real leases, fairness, observability, reuses auth/Cosmos/productId | a service to run | **Chosen spine (P2)** |
| (c) **Message broker** (NATS/Redis/SQS) | scale, backpressure, push dispatch | most moving parts/ops | **P4 when throughput demands** |
- [ ] Document the decision + rationale in-repo (this section is the canonical record).
- [ ] Define the **claim/lease protocol** once; both git-queue (poll) and service (API) implement it.
- [ ] Offline-degrade: a factory cut off from the coordinator falls back to its local git-queue and reconciles on reconnect (idempotency-key prevents double-execution).
- **Acceptance:** the same job manifest runs identically through the bash/git path and the service path.
- **Verify gate:** contract test asserting protocol parity (git vs service).
---
## 10. tracker-web / platform-service integration (committed path)
**Layering:** tracker = *WHAT/WHY* (plan, intake, prioritize, roadmap, votes) · gigafactory = *HOW* (execute) · platform-service = shared brain · agent-queue runner = offline edge. Grounded in the real `tracker-service` model (`Item`: `type` bug/feature/**task**, `status` open/in_progress/done/closed/wont_fix, priority, labels, assignee, `source` incl. **auto_detected**, votes, comments, public roadmap) and the `tracker-web` `/api/tracker/[...path]` proxy pattern.
### Phase 1 — Adapter (no new infra)
- [ ] **task → job**: a tracker `Item` of `type: task` (e.g. `assignee: @agent` or label `agent:run`) is exported to a job `.md` (manifest mapped: title/description → body, priority → priority, labels → capabilities/profile hints).
- [ ] **job → tracker**: lifecycle events post back as **status updates + comments**. Mapping: `building` → status `in_progress` + comment "started on factory X"; `shipped` → status `done` + comment with commit SHAs / PR link / verify results; `failed` → comment with reason (status stays `in_progress` for human triage).
- [ ] Idempotency: re-running the adapter for the same item doesn't create duplicate jobs (idempotency-key = item id + content hash).
- [ ] Adapter is a thin script/CLI (`aq from-tracker ITEM-789`) + optional poller.
- **Acceptance:** filing a tracker task, marking it `agent:run`, results in a queued job; on ship, the item flips to `done` with a SHA comment.
- **Verify gate:** adapter e2e against a tracker-service test instance (or mock); round-trip assertion.
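
The job → tracker mapping in the Phase 1 checklist is essentially a lookup from lifecycle event to a status change plus a comment. A minimal sketch, assuming a hypothetical `JobEvent` shape and using `null` to mean "leave the status unchanged" (the comment wording beyond "started on factory X" is illustrative):

```typescript
type TrackerStatus = "open" | "in_progress" | "done" | "closed" | "wont_fix";

// Hypothetical event shapes; the real event vocabulary is defined elsewhere.
type JobEvent =
  | { stage: "building"; factory: string }
  | { stage: "shipped"; commits: string[]; prUrl?: string; verifyPassed: boolean }
  | { stage: "failed"; reason: string };

interface TrackerUpdate {
  status: TrackerStatus | null; // null = keep the item's current status
  comment: string;
}

function toTrackerUpdate(event: JobEvent): TrackerUpdate {
  switch (event.stage) {
    case "building":
      return { status: "in_progress", comment: `started on factory ${event.factory}` };
    case "shipped": {
      const parts = [`commits: ${event.commits.join(", ")}`];
      if (event.prUrl) parts.push(`PR: ${event.prUrl}`);
      parts.push(`verify: ${event.verifyPassed ? "passed" : "failed"}`);
      return { status: "done", comment: parts.join("; ") };
    }
    case "failed":
      // Status stays as-is so a human can triage.
      return { status: null, comment: `failed: ${event.reason}` };
  }
}
```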
### Phase 2 — Native spine
- [ ] Stand up a `fleet` (a.k.a. `orchestrator`) module **inside platform-service**, sibling to `tracker-service`: pattern `types.ts → repository.ts → routes.ts`, ESM, Cosmos, `productId`, `req.log`.
- [ ] Endpoints: jobs CRUD, claim/lease, events/report, factories register/heartbeat, profiles, stats.
- [ ] Runners (bash + any) become API clients of this module; tracker adapter calls it directly.
- **Acceptance:** a job submitted via the module is claimed by a real factory and shipped, with all state in Cosmos.
- **Verify gate:** module test suite (repository + routes) using the shared `@bytelyst/testing` inject helpers.
### Phase 3 — Unified control plane
- [ ] Add a **Fleet** surface to `tracker-web` reusing auth/Primitives/DataTable/product switcher: fleet map (factories + load/health), job table, job DAG, **live log streaming (SSE)**, lease/heartbeat status, cost burndown, approve/ship buttons.
- [ ] The Node TUI dashboard becomes a thin client of the same `/fleet` API (parity with web).
- **Acceptance:** an operator can watch all factories + tail any job log + ship from the browser.
- **Verify gate:** web e2e (Playwright) covering fleet map render, live log, and a ship action.
---
## 11. Lifecycle & gates at scale (feature)
- [ ] Canonical stages enforced server-side: `queued → assigned → building → review → testing → shipped` (+ `failed`, `dead_letter`).
- [ ] Per-profile default `verify`; per-job override; verify runs at the factory, result reported as an event.
- [ ] Human gates: `review-policy` routes to reviewers; multi-reviewer support (P3).
- [ ] **Dead-letter**: after `retry.max` exhausted, job → `dead_letter` with full diagnostics; never silently dropped.
- [ ] **Backpressure**: when no factory can take more, jobs stay `queued` (no thrash); SLA timers visible.
- **Acceptance:** a perpetually-failing job lands in `dead_letter` after configured retries; a passing one auto-advances to `testing` then waits for human `ship`.
- **Verify gate:** lifecycle state-machine unit tests (all transitions + illegal-transition rejection).
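
One way to enforce the canonical stages is a transition table that rejects anything not listed. The stage names below come from this document; the specific edges (for example, requeue on lease expiry, or `failed` → `queued` on retry) are assumptions, as is treating `retry.max` as the number of retries after the first attempt.

```typescript
type Stage =
  | "queued" | "assigned" | "building" | "review"
  | "testing" | "shipped" | "failed" | "dead_letter";

// Assumed transition table; terminal stages have no outgoing edges.
const ALLOWED: Record<Stage, readonly Stage[]> = {
  queued: ["assigned"],
  assigned: ["building", "queued"],         // back to queued if the lease lapses
  building: ["review", "failed", "queued"],
  review: ["testing", "failed"],
  testing: ["shipped", "failed"],
  shipped: [],
  failed: ["queued", "dead_letter"],
  dead_letter: [],
};

// Return the new stage, or throw on an illegal transition.
function transition(from: Stage, to: Stage): Stage {
  if (!ALLOWED[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}

// After a failed attempt: requeue while retries remain, otherwise dead-letter.
function afterFailure(failedAttempts: number, retryMax: number): Stage {
  return failedAttempts <= retryMax ? "queued" : "dead_letter";
}
```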
---
## 12. Security, safety & governance (feature — critical with `yolo`/dangerous)
- [ ] **Secret isolation**: creds live on each factory (env/keychain), **never** in the queue, manifest, logs, or Cosmos. Factory advertises *presence* of a cred capability, not the value.
- [ ] **Scoped git tokens** per factory/repo; least-privilege; rotation documented.
- [ ] **Push policy**: protected branches; agents push to feature branches + open PRs by default; direct-to-main gated by profile/flag.
- [ ] **Blast-radius guardrail**: enforce `allowed-scope` — pre-flight + post-run diff check; out-of-scope changes block the ship gate.
- [ ] **Budget kill-switch**: exceed `budget` (usd/tokens/wall) → pause worker, alert, require human resume.
- [ ] **Supply-chain safety**: edits to shared `@bytelyst/*` packages require `reviewer` profile + human gate (never auto-ship).
- [ ] **Audit trail**: append-only event log per job (who/what/when/where/cost); immutable.
- [ ] **Corp network/proxy**: honor `NETWORK`/proxy + truststore conventions on factories that need them.
- [ ] **Kill switch (global)**: one command/flag halts all claiming fleet-wide (incident response).
- **Acceptance:** a job attempting an out-of-scope edit is blocked at the gate; a budget overrun pauses and alerts; no secret ever appears in any persisted artifact (scanner test).
- **Verify gate:** security test suite incl. a secret-leak scanner over logs/meta + scope-enforcement test.
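
The blast-radius guardrail and budget kill-switch both reduce to simple checks over data the factory already has. The sketch below is deliberately simplified: the scope matcher handles only `dir/**` prefixes and exact paths (a real implementation would likely use a glob library), the budget is assumed to be already normalized to numbers, and all names are hypothetical.

```typescript
// Simplified matcher: `dir/**` matches anything under `dir/`; other patterns match exactly.
function inScope(path: string, allowedScope: string[]): boolean {
  return allowedScope.some((pattern) =>
    pattern.endsWith("/**") ? path.startsWith(pattern.slice(0, -2)) : path === pattern,
  );
}

// Files from a post-run diff that fall outside `allowed-scope`.
function outOfScope(changedFiles: string[], allowedScope: string[]): string[] {
  return changedFiles.filter((file) => !inScope(file, allowedScope));
}

// Assumed normalized budget: dollars, token count, wall-clock milliseconds.
interface Budget {
  usd: number;
  tokens: number;
  wallMs: number;
}

// Which ceilings have been exceeded; a non-empty result would pause the worker.
function exceeded(spent: Budget, limit: Budget): (keyof Budget)[] {
  const keys: (keyof Budget)[] = ["usd", "tokens", "wallMs"];
  return keys.filter((key) => spent[key] > limit[key]);
}
```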
---
## 13. Data model (Cosmos containers, P2+)
Each container is partitioned by the key shown below (`/productId` or `/jobId`); every document carries `productId`.
- [ ] `fleet_jobs` (pk `/productId`) — manifest snapshot, current stage, idempotency-key, tracker-item link.
- [ ] `fleet_runs` (pk `/jobId`) — one per execution attempt: factory, engine, profile snapshot, timings, cost, exit, verify result.
- [ ] `fleet_leases` (pk `/jobId`) — holder factory, TTL, renewals; TTL index for auto-expiry.
- [ ] `fleet_factories` (pk `/productId`) — descriptor, capabilities, health, load, last heartbeat.
- [ ] `fleet_profiles` (pk `/productId`) — versioned profile snapshots.
- [ ] `fleet_events` (pk `/jobId`) — append-only audit/event stream (stage changes, logs ptr, cost ticks, decisions).
- [ ] Relate to existing tracker `Item` via `tracker-item` (no duplication of planning data).
- **Acceptance:** repository CRUD + query tests per container; lease TTL expiry verified.
- **Verify gate:** repository unit/integration tests (memory + Cosmos provider via `DB_PROVIDER`).
---
## 14. Phased build roadmap (checklists)
Each phase: **Goal → checklist → Exit criteria**. Don't start a phase until the prior phase's Exit criteria are green. Tick boxes here as the canonical progress.
### Phase 1 — Manifest + profiles + capabilities + tracker adapter (single host)
**Goal:** richer single-host runner that understands profiles/capabilities and bridges to tracker — no distributed infra yet.
- [ ] Extend `agent-queue.sh` frontmatter parsing for all new manifest fields (§5), defaulted + backward-compatible.
- [ ] Add `profiles/` directory + profile resolution (persona injection, default verify/caps/scope) (§6).
- [ ] Local capability detection + a job/factory capability match check before launch (§8 subset).
- [ ] `priority` ordering in the inbox pick (replace pure FIFO with priority-then-age).
- [ ] `deps` (DAG) blocking on a single host; `idempotency-key` dedupe on `add`.
- [ ] `retry` with backoff into `failed`/requeue; `budget.wall` enforced (extends `timeout`).
- [ ] `allowed-scope` guardrail (warn-only this phase) + post-run diff report.
- [ ] **Tracker adapter** `aq from-tracker <ITEM>` + `aq to-tracker` event poster (§10 P1).
- [ ] Dashboard shows profile + priority + capability tags + tracker-item link.
- [ ] Update `selftest.sh` with: manifest parse fixtures, profile resolution, priority order, dep-block, idempotency, adapter round-trip (mock).
- [ ] Update README + this doc's progress table.
- **Exit criteria:** all boxes ✅; `selftest.sh` green; a tracker task → executed → tracker `done` with SHA comment, fully on one host; no regression to Phase-0 `.md` files.
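
The priority-then-age pick that replaces pure FIFO amounts to a two-key sort. The Phase 1 runner itself is bash, so the TypeScript below only illustrates the ordering rule; `InboxJob`, `enqueuedAt`, and `pickOrder` are hypothetical names.

```typescript
type Priority = "critical" | "high" | "medium" | "low";

const RANK: Record<Priority, number> = { critical: 0, high: 1, medium: 2, low: 3 };

interface InboxJob {
  id: string;
  priority: Priority;
  enqueuedAt: number; // epoch milliseconds; smaller = older
}

// Highest priority first; within a priority, oldest first.
function pickOrder(jobs: InboxJob[]): InboxJob[] {
  return [...jobs].sort(
    (a, b) => RANK[a.priority] - RANK[b.priority] || a.enqueuedAt - b.enqueuedAt,
  );
}
```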
### Phase 2 — Coordinator as platform-service module + Cosmos + multi-factory leasing
**Goal:** the service spine; ≥2 real factories executing in parallel via leases.
- [ ] Scaffold `fleet`/`orchestrator` module in `platform-service` (`types/repository/routes`, Zod, ESM, `productId`).
- [ ] Cosmos containers (§13) + repository layer (memory + Cosmos providers).
- [ ] Claim/lease protocol endpoints + TTL expiry + reclaim (§8, §9).
- [ ] Port `agent-queue` runner to a **factory agent** API client (register/heartbeat/claim/report) while keeping git-queue fallback.
- [ ] Scheduler/router core (§7) as a pure module + wired into assignment.
- [ ] Tracker adapter calls the module directly (not just file export).
- [ ] Auth: factory tokens; scoped; secret isolation enforced (§12 subset).
- [ ] Module test suite (repository + routes via `@bytelyst/testing`); crash-recovery + lease-expiry tests.
- [ ] Two-factory demo (e.g. mac + ubuntu) running 3 parallel jobs end-to-end.
- **Exit criteria:** all boxes ✅; `pnpm --filter @lysnrai/platform-service test` green; killing a factory mid-job → another reclaims and completes; all state in Cosmos with `productId`.
### Phase 3 — Fleet control plane in tracker-web + DAG + budgets + scoring router
**Goal:** one browser control plane; smart routing + budgets live.
- [ ] `fleet` API client in `tracker-web` (reuse `/api/tracker`-style proxy → `/fleet`).
- [ ] Fleet map page (factories, load, health, capabilities) on `@bytelyst/*` components.
- [ ] Job table + job detail + **DAG view**; live log via **SSE**; approve/ship/reject/requeue actions.
- [ ] Cost burndown + budget kill-switch UI; multi-reviewer routing.
- [ ] Scoring router with configurable weights + explainability surfaced in UI.
- [ ] Preemption of low-priority by critical jobs (checkpoint + requeue).
- [ ] TUI dashboard re-pointed at `/fleet` API (parity).
- [ ] Web e2e (Playwright): fleet map, live log, ship, budget-pause.
- **Exit criteria:** all boxes ✅; web `verify` (typecheck+lint+test+e2e) green; an operator runs the whole 3-repo parallel workload from the browser, including a budget pause + resume.
### Phase 4 — Message bus + autoscaling + cross-OS capability marketplace
**Goal:** scale-out and elasticity.
- [ ] Introduce broker (NATS/Redis) for push dispatch + backpressure; coordinator publishes, factories subscribe by capability.
- [ ] Autoscaling hooks (spin ephemeral factories: cloud VM / container) keyed to queue depth + SLA.
- [ ] Capability "marketplace": jobs requiring rare caps (xcode/figma/gpu) routed to the few factories that have them; queueing fairness across products.
- [ ] Load + chaos test suite (factory churn, broker outage, thundering herd).
- **Exit criteria:** all boxes ✅; sustained N×throughput vs Phase 3 under load test; graceful degradation on broker outage (fallback to poll).
### Phase 5 — Self-optimizing / learned routing
**Goal:** the scheduler learns from history to cut time/cost and raise first-pass success.
- [ ] Capture outcome features per run (engine, profile, repo, duration, cost, verify pass, human-edit rate).
- [ ] Offline eval harness comparing learned vs heuristic routing on historical data.
- [ ] Shadow/A-B rollout with guardrails; auto-tune scoring weights.
- [ ] Recommendations surfaced ("route NomGap UX jobs to claude on mac-2: 23% faster, 11% cheaper").
- **Exit criteria:** all boxes ✅; learned router beats heuristic on the eval set without regressing safety gates; A/B shows measurable improvement on a target metric.
---
## 15. Cross-cutting feature catalog (quick index)
| Feature | First phase | Section |
| ------- | ----------- | ------- |
| Evolved job manifest | P1 | §5 |
| Profiles (persona + capability) | P1 | §6 |
| Capability matching | P1→P2 | §6/§8 |
| Priority + SLA | P1 | §5/§7 |
| DAG dependencies | P1→P3 | §5/§11 |
| Idempotency / dedupe | P1 | §5 |
| Retry + dead-letter | P1→P2 | §11 |
| Budgets + kill-switch | P1(wall)→P3 | §5/§12 |
| Scheduler/router scoring | P2→P3 | §7 |
| Factory registration/heartbeat/lease | P2 | §8 |
| Coordinator (platform-service module) | P2 | §9/§10 |
| Cosmos data model | P2 | §13 |
| Tracker bi-directional sync | P1→P2 | §10 |
| Web control plane + SSE logs | P3 | §10/§17 |
| Security/scope/secret isolation | P1→P2 | §12 |
| Broker + autoscaling | P4 | §14 |
| Learned routing | P5 | §14 |
---
## 16. Definition of Done — the "100% accuracy" rubric
A feature/phase is **not done** until **every** item below is true (this is the bar for "100% end-to-end"):
- [ ] **Functionality**: acceptance criteria met; happy path + documented edge cases handled.
- [ ] **Tests**: unit + integration written *first or alongside*, all green; no weakened/deleted tests; coverage targets met (router ≥95% core).
- [ ] **Verify gate**: the phase's named gate command passes locally (and in CI where applicable).
- [ ] **Idempotency & recovery**: re-runs are safe; crash mid-step recovers (lease/idempotency).
- [ ] **Security review**: secret-leak scan clean; scope guardrail honored; least-privilege tokens.
- [ ] **Observability**: events/logs/metrics emitted; failures are diagnosable from the control plane.
- [ ] **Docs**: this roadmap's checkboxes ticked; README/AGENTS updated; manifest/profile docs current.
- [ ] **Backward-compat**: existing `.md`/Phase-0 behavior unbroken (regression check).
- [ ] **Drift checks**: shared-infra templates (`.npmrc`, `docker-prep`) untouched/synced; conventional commits.
- [ ] **No `console.log`/`print`** in service code; `req.log`/`os.Logger` used; ESM `.js` imports.
---
## 17. Observability & control plane details
- [ ] **Live logs** via SSE from factory → coordinator → web/TUI (single stream contract).
- [ ] **Metrics**: queue depth, assign latency, run duration, verify pass-rate, cost, factory utilization, fairness.
- [ ] **Alerting**: stall (no log N min), failure spikes, budget breach, factory offline, dead-letter.
- [ ] **Tracing**: a job's full timeline (queued→…→shipped) reconstructable from `fleet_events`.
- [ ] **Cost burndown** per job/product/day with budget overlays.
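
For the single SSE stream contract, each log event has to be serialized into the Server-Sent Events wire format: optional `id:` and `event:` fields, one `data:` line per line of payload, and a blank line to end the frame. A minimal sketch, assuming a hypothetical `LogEvent` shape and event type names:

```typescript
interface LogEvent {
  id: number;
  type: string; // e.g. "log" or "stage"; these names are illustrative
  data: string; // may contain multiple lines
}

// Serialize one event as an SSE frame. Multi-line payloads become several
// `data:` lines, and the trailing blank line terminates the frame.
function toSseFrame(event: LogEvent): string {
  const dataLines = event.data.split(/\r?\n/).map((line) => `data: ${line}`);
  return [`id: ${event.id}`, `event: ${event.type}`, ...dataLines, "", ""].join("\n");
}
```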
---
## 18. Risks & gaps explicitly tracked (expert call-outs)
- [ ] **Duplicate execution** across transports (git fallback + service) — mitigated by `idempotency-key` + lease.
- [ ] **Crash recovery** — lease TTL + reclaim; checkpoint long jobs where engines allow.
- [ ] **Shared-package conflicts** — two jobs editing `@bytelyst/*` simultaneously → lock + reviewer gate.
- [ ] **Starvation/fairness** — per-product + per-factory counters with penalty.
- [ ] **Cost runaway** — hard budgets + global kill switch.
- [ ] **Tool-version drift / reproducibility** — record engine + tool versions per run; pin where possible.
- [ ] **Windows quirks** — path/shell differences in the factory agent; capability-gate Windows-only work.
- [ ] **Human-review bottleneck** — auto-verify as much as possible; batch review UI; reviewer routing.
- [ ] **Result capture beyond commits** — artifacts (coverage, screenshots, build logs) attached to runs.
- [ ] **Secret sprawl** — never in queue/manifest/logs/Cosmos; presence-only capabilities.
- [ ] **Data retention** — event/log retention + archival policy (extend today's `clean`).
- [ ] **Engine API churn** — engines mapped in one place (`build_agent_cmd`); capability matrix versioned.
---
## 19. Success metrics
- Throughput: jobs shipped/day; parallel utilization (% of fleet busy).
- Quality: % auto-verified, first-pass success rate, escaped-defect rate, human-edit rate post-agent.
- Speed: mean time queued→shipped; assign latency.
- Cost: $/shipped job; budget-breach rate.
- Reliability: lease-reclaim success, dead-letter rate, factory uptime.
- Fairness: max/min product wait-time ratio.
---
## 20. Open questions
- [ ] Copilot headless feasibility as an engine/station (CLI/automation surface?).
- [ ] Who owns merge/push authority — agents open PRs only, or auto-merge on green for low-risk profiles?
- [ ] Multi-user/tenant: per-user queues + RBAC in the control plane?
- [ ] On-call/ownership for the fleet (alerts routing, runbooks)?
- [ ] Cloud factory provisioning (Phase 4) — which provider/runtime, cost guardrails?
- [ ] Profile authorship/governance — who can create/edit profiles, and review of persona prompts?
---
*This document is the single source of truth for the gigafactory build. Keep the §0 table and per-phase checkboxes updated; a phase ships only when its Exit criteria and the §16 Definition-of-Done rubric are fully green.*