bytelyst-devops-tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md
saravanakumardb1 90366e59bb docs(agent-queue): add gigafactory vision + checklist implementation roadmap
- docs/GIGAFACTORY_ROADMAP.md: distributed multi-machine fleet vision
  (factory x tool x profile routing) as a checklist-driven, phased
  implementation roadmap (Phase 0-5) with acceptance criteria, verify
  gates, and a 100% Definition-of-Done rubric
- committed path: coordinator as a platform-service module + control
  plane on tracker-web, reached via a thin tracker adapter first; bash
  runner survives as the offline edge factory agent
- README: add vision/roadmap pointer
2026-05-29 17:06:32 -07:00


Agent Gigafactory — Vision & Implementation Roadmap

One-liner: Evolve today's single-host agent-queue bash runner into a distributed gigafactory — a fleet of heterogeneous machines (Mac/Ubuntu/Windows), each running different coding-agent CLIs (Devin/Codex/Claude/Copilot/…), where a scheduler auto-picks jobs from a shared inbox and routes each .md to the best factory × tool × profile — built service-side on platform-service + tracker-web, with the bash runner surviving as the offline edge agent.

How to use this doc: It is both a PRD and an execution checklist. Every feature is a - [ ] checkbox with acceptance criteria and a verify gate. A phase is "100% done" only when every box is checked, its gate passes, and the phase Definition of Done rubric (§16) is green. Update the progress table (§0) as you go.


0. Progress tracker

| Phase | Theme | Status | % | Gate |
| --- | --- | --- | --- | --- |
| 0 | Baseline (today) | shipped | 100% | selftest.sh green |
| 1 | Manifest + profiles + capabilities + tracker adapter (single host) | ☐ not started | 0% | adapter e2e + selftest |
| 2 | Coordinator as platform-service module + Cosmos + multi-factory leasing | ☐ not started | 0% | fleet e2e + module tests |
| 3 | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ☐ not started | 0% | web e2e + router tests |
| 4 | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite |
| 5 | Self-optimizing / learned routing | ☐ not started | 0% | offline eval + A/B |

Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary.


1. Vision & metaphor

A gigafactory turns raw intent (.md task files / tracker items) into shipped software with minimal human touch. The mental model is a physical factory network:

| Term | Meaning |
| --- | --- |
| Fleet | The whole network of machines under one control plane. |
| Factory | One physical/virtual machine (a Mac, an Ubuntu box, a Windows host). Has an OS, installed tools, creds, capacity. |
| Station | A tool/engine slot inside a factory (a Devin seat, a Codex CLI, a Claude Code session, a Copilot agent). |
| Worker | A single running agent process executing one job at a station. |
| Job | A unit of work: a prompt/.md + manifest (profile, scope, gates, budget). |
| Profile | The role doing the work (developer, backend engineer, UX/UI designer, QA, reviewer) = persona prompt + capability requirements. |
| Capability | A tag a factory advertises and a job requires (os:mac, has:xcode, has:figma, gpu, engine:devin). |
| Lease | A time-boxed claim of a job by a worker; expires → job is reclaimable (crash recovery). |
| Gate | A checkpoint a job must pass: auto-QA verify, human review, ship approval. |
| Artifact | Any captured output: commits/PRs, logs, screenshots, reports, build outputs. |

North star: drop work into one inbox (or file a tracker task), and the fleet figures out where (factory), with what (tool/engine), as whom (profile), runs it in parallel, self-heals on crash, gates quality automatically, and surfaces everything in one live control plane — while a human only approves the final ship.

                         ┌──────────────────────── CONTROL PLANE (tracker-web) ────────────────────────┐
                         │  plan/intake · roadmap · Fleet map · live logs · cost · approvals           │
                         └───────────────▲───────────────────────────────────┬─────────────────────────┘
                                         │ REST/SSE                           │
            ┌────────────────────────────┴─────── COORDINATOR (platform-service module) ───────────────┐
            │  queue · scheduler/router · leases · profiles · capabilities · events · budgets (Cosmos)  │
            └───▲───────────────────────▲───────────────────────▲───────────────────────▲───────────────┘
                │ claim/lease/report     │                       │                       │
        ┌───────┴───────┐       ┌────────┴───────┐       ┌────────┴───────┐       ┌───────┴────────┐
        │  FACTORY: mac │       │ FACTORY: ubuntu│       │FACTORY: windows│       │ FACTORY: mac-2 │
        │ devin, claude │       │ codex, claude  │       │ copilot, codex │       │ devin (xcode)  │
        │ [agent-queue] │       │ [agent-queue]  │       │ [agent-queue]  │       │ [agent-queue]  │
        └───────────────┘       └────────────────┘       └────────────────┘       └────────────────┘

2. Current state (Phase 0 baseline — already shipped)

Today's agent-queue.sh + dashboard.mjs (single host, zero-dep bash + Node):

  • Folder kanban lifecycle: inbox → building → review → testing → shipped (+ failed).
  • Auto-QA gate: when the agent exits with rc=0 the job moves to review/. If a verify: command is set, it runs in cwd and the job moves to testing/ on pass or failed/ on fail; with no verify: the job parks in review/. Manual ship is the human gate.
  • Per-job frontmatter: engine (devin/claude/codex), cwd, yolo (→ dangerous/auto-approve), lock (per-repo serialization), timeout, verify.
  • Concurrency: AGENT_QUEUE_MAX (default 3), per-lock serialization so same-repo jobs never collide.
  • State & logs: .state/<job>.meta heartbeats + logs/<job>.log; git-tracked queue (audit-by-commit).
  • Interactive dashboard: numbered selectable job list, single-key actions (promote/ship/reject/requeue), live log viewer, run/stop, all shelling out to agent-queue.sh.

Carries forward: the .md-in-inbox UX, frontmatter contract, lifecycle stage names, verify gate, lock/affinity concept, the bash runner itself (becomes the factory agent).

Must change for the fleet: single-host run loop → distributed leasing; file-only state → service + Cosmos; one engine choice → capability/profile routing; local dashboard → shared control plane.

  • Phase 0 complete — baseline shipped and self-tested. (reference, not a work item)

3. Goals & non-goals

Goals

  • One intake, many machines: parallel execution across heterogeneous OS/tools.
  • Automatic routing to the best factory × tool × profile with affinity, fairness, budget, and health awareness.
  • Self-healing (lease expiry/requeue), quality gates, and full observability.
  • Reuse the ByteLyst stack (platform-service, Cosmos, @bytelyst/*, tracker-web) — no parallel infra.
  • Preserve offline/zero-dep edge operation via the bash runner.

Non-goals

  • Not a CI/CD replacement (it triggers CI; CI still gates merges).
  • Not a general-purpose workflow engine (scoped to coding-agent execution).
  • Not a model/inference host (it orchestrates agent CLIs, doesn't serve models).
  • Not abandoning the simple .md mental model — humans still drop files / file tasks.

4. Core concepts contract (must hold across all phases)

  • Every job has a stable id, an immutable manifest, and an append-only event log.
  • Every Cosmos document carries productId (ByteLyst rule).
  • A job in flight is always covered by exactly one lease; no lease → reclaimable.
  • Lifecycle stages are canonical and shared: queued → assigned → building → review → testing → shipped (+ failed, dead_letter).
  • The bash runner and the service speak the same manifest + event vocabulary (one schema, two transports).

5. The evolved Job manifest (feature)

Extend today's frontmatter into a richer, backward-compatible manifest. Old .md files keep working (new fields optional with sane defaults).

---
# --- existing (unchanged) ---
engine: devin            # explicit engine; overrides profile/engine-class
cwd: /abs/path/repo
yolo: true
lock: my-repo
timeout: 45m
verify: pnpm -s test
# --- new ---
profile: backend-engineer        # role: persona + capability requirements
engine-class: agentic-coder      # abstract; scheduler picks a concrete engine if `engine` unset
capabilities: [os:any, node>=20] # hard requirements a factory MUST satisfy
prefers: [factory:mac-2]         # soft routing hints (affinity)
priority: high                   # critical|high|medium|low → SLA + preemption
budget: { usd: 5, tokens: 2M, wall: 4h }   # hard ceilings; exceed → pause/fail
deps: [job-123, job-456]         # DAG: don't start until these reach `shipped`/`testing`
idempotency-key: nomgap-ux-2     # dedupe: a second identical submit is a no-op
retry: { max: 2, backoff: 5m, on: [timeout, verify_failed] }
review-policy: manual            # auto|manual|reviewers:[@alice]
artifacts: [coverage, screenshots]   # what to capture beyond commits
tracker-item: ITEM-789           # link back to the originating tracker task
---
  • Define the manifest schema (Zod in the service; documented YAML for .md).
  • Backward-compat: a Phase-0 .md (only engine/cwd/yolo) parses with all new fields defaulted.
  • idempotency-key dedupe semantics specified (same key + same content hash = no-op).
  • deps DAG semantics specified (blocked state, cycle detection, fan-in/out).
  • Acceptance: a manifest fixture suite parses/validates; invalid manifests fail with precise errors.
  • Verify gate: schema unit tests (≥ 1 per field incl. defaults + 5 invalid cases).
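The backward-compatibility rule above can be illustrated with a dependency-free TypeScript sketch. The roadmap calls for a Zod schema in the service; plain TypeScript is used here only to keep the example self-contained. The function name `normalizeManifest`, the subset of fields shown, and every default value (for example `priority: medium`) are assumptions for illustration, since this document does not fix them.

```typescript
// Sketch only. Field names follow the manifest above; every default value
// below is an assumption, because the roadmap does not fix them yet.
type Priority = "critical" | "high" | "medium" | "low";

interface JobManifest {
  engine: string | undefined; // unset lets the scheduler pick from engine-class
  cwd: string | undefined;
  yolo: boolean;
  priority: Priority;
  capabilities: string[];
  deps: string[];
}

const PRIORITIES: readonly Priority[] = ["critical", "high", "medium", "low"];

function optionalString(value: unknown, field: string): string | undefined {
  if (value === undefined) return undefined;
  if (typeof value !== "string") throw new Error(`manifest.${field}: expected a string`);
  return value;
}

function stringList(value: unknown, field: string): string[] {
  if (value === undefined) return []; // new list fields default to empty
  if (!Array.isArray(value)) throw new Error(`manifest.${field}: expected a list`);
  return value.map((item: unknown) => {
    if (typeof item !== "string") throw new Error(`manifest.${field}: items must be strings`);
    return item;
  });
}

function parsePriority(value: unknown): Priority {
  if (value === undefined) return "medium"; // assumed default
  const match = PRIORITIES.find((p) => p === value);
  if (match === undefined) {
    throw new Error(`manifest.priority: expected ${PRIORITIES.join("|")}, got ${String(value)}`);
  }
  return match;
}

// Takes already-parsed frontmatter (YAML parsing is out of scope here) and
// returns a manifest with all new fields defaulted.
function normalizeManifest(raw: Record<string, unknown>): JobManifest {
  const yolo: unknown = raw["yolo"] === undefined ? false : raw["yolo"];
  if (typeof yolo !== "boolean") throw new Error("manifest.yolo: expected a boolean");
  return {
    engine: optionalString(raw["engine"], "engine"),
    cwd: optionalString(raw["cwd"], "cwd"),
    yolo,
    priority: parsePriority(raw["priority"]),
    capabilities: stringList(raw["capabilities"], "capabilities"),
    deps: stringList(raw["deps"], "deps"),
  };
}
```

A Phase-0 file carrying only `engine`, `cwd`, and `yolo` passes through with the new fields defaulted, while a value such as `priority: urgent` fails with an error that names the offending field.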

6. Profiles — persona + capability (feature)

A profile = a versioned file combining a persona (system-prompt overlay), required capabilities, default gates, preferred engine/model, and allowed repo scopes. Stored as profiles/<name>.md (Phase 1), then in the Cosmos fleet_profiles container (Phase 2, §13).

# profiles/backend-engineer.md
---
name: backend-engineer
persona: |
  You are a senior backend engineer. Favor minimal, well-tested changes...  
capabilities: [node>=20, has:pnpm]
default-verify: pnpm -s typecheck && pnpm -s test
engine-class: agentic-coder
prefers-engine: [devin, claude]
allowed-scope: ["backend/**", "packages/**"]   # blast-radius guardrail
review-policy: manual
---
  • Author starter catalog: developer, backend-engineer, frontend-engineer, ux-designer, ui-designer, qa, reviewer, docs-writer.
  • Persona overlay is prepended to the job body before the agent runs (and logs are stripped of secrets).
  • Profile supplies default verify, capabilities, engine-class, allowed-scope when the job omits them.
  • Profile versioning: changing a profile doesn't mutate in-flight jobs (snapshot at assign time).
  • allowed-scope enforced as a guardrail (warn in P1, enforce/deny in P2 via pre-flight diff check).
  • Acceptance: a job with profile: backend-engineer and no verify inherits the profile's verify + persona.
  • Verify gate: profile-resolution unit tests; persona-injection golden test.
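The resolution rules in this section (persona prepended, profile defaults used only when the job omits a field, snapshot at assign time) might look roughly like the sketch below. The type names, the numeric `version` field, and the camelCase property names are hypothetical; only the behavior is taken from the checklist above.

```typescript
// Sketch only: names and the numeric version field are illustrative.
interface Profile {
  name: string;
  version: number;
  persona: string;
  capabilities: string[];
  defaultVerify: string | undefined;
}

interface Job {
  body: string;
  profile: string | undefined;
  verify: string | undefined;
  capabilities: string[] | undefined;
}

interface ResolvedJob {
  prompt: string;
  verify: string | undefined;
  capabilities: string[];
  profileVersion: number | undefined;
}

function resolveJob(job: Job, catalog: ReadonlyMap<string, Profile>): ResolvedJob {
  if (job.profile === undefined) {
    // No profile: the job runs exactly as written.
    return {
      prompt: job.body,
      verify: job.verify,
      capabilities: job.capabilities ?? [],
      profileVersion: undefined,
    };
  }
  const profile = catalog.get(job.profile);
  if (profile === undefined) throw new Error(`unknown profile: ${job.profile}`);
  return {
    // Persona overlay goes in front of the job body.
    prompt: `${profile.persona.trim()}\n\n${job.body}`,
    // Job-level values win; the profile only fills the gaps.
    verify: job.verify ?? profile.defaultVerify,
    // Copying the list is the snapshot: later profile edits do not leak in.
    capabilities: [...(job.capabilities ?? profile.capabilities)],
    profileVersion: profile.version,
  };
}
```

Recording `profileVersion` on the resolved job is one way to make the snapshot auditable; the roadmap does not prescribe how versions are represented.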

7. The scheduler / router (the heart) (feature)

Given a queued job and the current fleet, choose (factory, station/engine, profile) and issue a lease.

Inputs: job manifest (capabilities, priority, budget, deps, prefers, lock), profile requirements, live factory descriptors (capabilities, load, health, cost class), lock/affinity table, fairness counters.

Algorithm (deterministic, explainable):

  1. Filter factories by hard capability match (job + profile capabilities ⊆ factory capabilities) and by a free station for a compatible engine.
  2. Block if deps unmet or lock already held → leave queued/blocked.
  3. Score each candidate factory: score = w1·capabilityFit + w2·affinity(prefers, repo-stickiness) + w3·(1/load) + w4·costFit(budget) + w5·health − w6·starvationPenalty
  4. Tie-break: highest priority job first; then oldest; then lowest cost class.
  5. Assign → write lease (TTL), set job assigned, decrement station capacity, bump fairness counter.
  6. Preemption (P3+): a critical job may pause a low job at a needed station (checkpoint + requeue).
  • Implement pure, unit-testable scoring function (no I/O) with configurable weights.
  • Hard-filter correctness: never assign a job to a factory missing a required capability.
  • Affinity/stickiness: same-repo jobs prefer the factory that has the warm checkout (lock-aware).
  • Fairness: no factory or product starves under sustained load (counter + penalty).
  • Explainability: every assignment records why (matched caps, score breakdown) in the event log.
  • Determinism: same inputs → same decision (seeded tie-breaks) for testability.
  • Acceptance: scenario fixtures (10+) produce expected assignments incl. starvation + capability-miss + budget-exceed.
  • Verify gate: router unit suite ≥ 95% branch coverage on the scoring/filter core.
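As a rough illustration of the filter-then-score steps, here is a pure TypeScript sketch with no I/O. Several choices are assumptions rather than decisions from this roadmap: capabilities are matched as exact tags (version ranges such as `node>=20` and wildcards such as `os:any` are not handled), `capabilityFit` is taken as the fraction of a factory's tags that the job actually needs, the load term uses 1/(1+load) to avoid dividing by zero, the starvation penalty is subtracted, and ties break on factory id instead of a seeded tie-break.

```typescript
// Sketch only: all names, the capabilityFit definition, and the tie-break are illustrative.
interface Factory {
  id: string;
  capabilities: string[];
  freeStations: number;
  load: number;              // jobs currently running
  health: number;            // 0..1
  costFit: number;           // 0..1, fit of the cost class to the job budget
  starvationPenalty: number; // 0..1, grows as this factory is over-served
}

interface RoutableJob {
  capabilities: string[];
  prefers: string[];
}

interface Weights {
  capabilityFit: number;
  affinity: number;
  load: number;
  costFit: number;
  health: number;
  starvation: number;
}

interface Decision {
  factoryId: string;
  score: number;
  breakdown: Record<string, number>; // recorded for explainability
}

// Hard filter: every required tag must be advertised and a station must be free.
function isEligible(job: RoutableJob, f: Factory): boolean {
  return f.freeStations > 0 && job.capabilities.every((c) => f.capabilities.includes(c));
}

function scoreFactory(job: RoutableJob, f: Factory, w: Weights): Decision {
  const fit = f.capabilities.length === 0 ? 1 : job.capabilities.length / f.capabilities.length;
  const breakdown: Record<string, number> = {
    capabilityFit: w.capabilityFit * fit,
    affinity: w.affinity * (job.prefers.includes(`factory:${f.id}`) ? 1 : 0),
    load: w.load * (1 / (1 + f.load)),
    costFit: w.costFit * f.costFit,
    health: w.health * f.health,
    starvation: -w.starvation * f.starvationPenalty,
  };
  const score = Object.values(breakdown).reduce((sum, term) => sum + term, 0);
  return { factoryId: f.id, score, breakdown };
}

// Returns the best eligible factory, or null when the job must stay queued.
function pickFactory(job: RoutableJob, fleet: Factory[], w: Weights): Decision | null {
  const ranked = fleet
    .filter((f) => isEligible(job, f))
    .map((f) => scoreFactory(job, f, w))
    .sort((a, b) => b.score - a.score || (a.factoryId < b.factoryId ? -1 : 1));
  return ranked[0] ?? null;
}
```

Because the function takes the fleet snapshot and weights as plain arguments, scenario fixtures can call it directly, and the returned `breakdown` is the kind of record the explainability item asks to store in the event log.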

8. Factory model & registration (feature)

Each machine runs a factory agent (the evolved agent-queue runner) that registers, heartbeats, claims jobs, and reports events.

  • Capability auto-detection at boot: OS, installed engines (devin/claude/codex/copilot), tool probes (xcode, figma-cli, docker, gpu), node/pnpm versions, available creds (presence only, never values).
  • Registration: POST /fleet/factories with descriptor → receives a factory id + token.
  • Heartbeat: periodic PUT /fleet/factories/:id/heartbeat (load, free stations, health); missed N → factory marked offline, its leases reclaimed.
  • Claim loop: POST /fleet/leases/claim advertising capabilities/free stations; receives a job + lease TTL.
  • Report: stream stage/log/event back (POST /fleet/runs/:id/events); renew lease while alive.
  • Graceful drain: factory can stop claiming, finish in-flight, deregister.
  • Acceptance: a factory registers, claims a matching job, heartbeats, completes, and a killed factory's job is reclaimed by another within the lease TTL.
  • Verify gate: factory-agent integration test against a mock coordinator; crash-recovery test.
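The claim, renew, and reclaim behavior described above can be modeled in memory as follows. This is a sketch under stated assumptions: `LeaseTable` and its method names are hypothetical, time is passed in explicitly so tests need no real clock, and a real coordinator would persist leases instead of holding them in a Map.

```typescript
// Sketch only: an in-memory stand-in for the coordinator's lease store.
interface Lease {
  jobId: string;
  factoryId: string;
  expiresAt: number; // epoch milliseconds
}

class LeaseTable {
  private readonly leases = new Map<string, Lease>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Grants a lease if the job has none or the previous one has expired.
  claim(jobId: string, factoryId: string, now: number): Lease | null {
    const current = this.leases.get(jobId);
    if (current !== undefined && current.expiresAt > now) return null; // still held
    const lease: Lease = { jobId, factoryId, expiresAt: now + this.ttlMs };
    this.leases.set(jobId, lease);
    return lease;
  }

  // Extends the lease only for its current, still-live holder.
  renew(jobId: string, factoryId: string, now: number): boolean {
    const current = this.leases.get(jobId);
    if (current === undefined || current.factoryId !== factoryId || current.expiresAt <= now) {
      return false;
    }
    current.expiresAt = now + this.ttlMs;
    return true;
  }
}
```

A factory that crashes simply stops renewing; once `expiresAt` passes, the next `claim` from any other factory succeeds, which is the crash-recovery path named in the acceptance criterion.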

9. Coordination architecture (decision + path)

Three transports were evaluated. Decision: platform-service-native coordinator is the spine; git-queue stays for the offline edge; broker added only at scale.

| Option | Pros | Cons | Verdict |
| --- | --- | --- | --- |
| (a) Git-synced queue (evolve folders) | zero infra, audit-by-commit, offline | weak/racy leasing, latency, merge churn | Edge/offline only |
| (b) Coordinator service (platform-service module) | real leases, fairness, observability, reuses auth/Cosmos/productId | a service to run | Chosen spine (P2) |
| (c) Message broker (NATS/Redis/SQS) | scale, backpressure, push dispatch | most moving parts/ops | P4 when throughput demands |
  • Document the decision + rationale in-repo (this section is the canonical record).
  • Define the claim/lease protocol once; both git-queue (poll) and service (API) implement it.
  • Offline-degrade: a factory cut off from the coordinator falls back to its local git-queue and reconciles on reconnect (idempotency-key prevents double-execution).
  • Acceptance: the same job manifest runs identically through the bash/git path and the service path.
  • Verify gate: contract test asserting protocol parity (git vs service).

10. tracker-web / platform-service integration (committed path)

Layering: tracker = WHAT/WHY (plan, intake, prioritize, roadmap, votes) · gigafactory = HOW (execute) · platform-service = shared brain · agent-queue runner = offline edge. Grounded in the real tracker-service model (Item: type bug/feature/task, status open/in_progress/done/closed/wont_fix, priority, labels, assignee, source incl. auto_detected, votes, comments, public roadmap) and the tracker-web /api/tracker/[...path] proxy pattern.

Phase 1 — Adapter (no new infra)

  • task → job: a tracker Item of type: task (e.g. assignee: @agent or label agent:run) is exported to a job .md (manifest mapped: title/description → body, priority → priority, labels → capabilities/profile hints).
  • job → tracker: lifecycle events post back as status updates + comments: building → status in_progress + comment "started on factory X"; shipped → done + comment with commit SHAs / PR link / verify results; failed → comment with reason (status stays in_progress for human triage).
  • Idempotency: re-running the adapter for the same item doesn't create duplicate jobs (idempotency-key = item id + content hash).
  • Adapter is a thin script/CLI (aq from-tracker ITEM-789) + optional poller.
  • Acceptance: filing a tracker task, marking it agent:run, results in a queued job; on ship, the item flips to done with a SHA comment.
  • Verify gate: adapter e2e against a tracker-service test instance (or mock); round-trip assertion.
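The two adapter rules above (stage events mapped to tracker status plus comment, and an idempotency key built from item id + content hash) can be sketched as below. The type and function names are hypothetical. The hash is an FNV-1a-style function over UTF-16 code units, chosen only to keep the sketch dependency-free; a real adapter would more likely use a cryptographic hash.

```typescript
// Sketch only: names are illustrative; the hash is a stand-in, not a recommendation.
type TrackerStatus = "open" | "in_progress" | "done" | "closed" | "wont_fix";

interface StageEvent {
  stage: "building" | "shipped" | "failed";
  factory: string;
  detail: string; // commit SHAs / PR link for shipped, reason for failed
}

interface TrackerUpdate {
  status: TrackerStatus | null; // null leaves the item's status unchanged
  comment: string;
}

function toTrackerUpdate(stageEvent: StageEvent): TrackerUpdate {
  switch (stageEvent.stage) {
    case "building":
      return { status: "in_progress", comment: `started on factory ${stageEvent.factory}` };
    case "shipped":
      return { status: "done", comment: `shipped: ${stageEvent.detail}` };
    case "failed":
      // Status stays as-is so a human can triage.
      return { status: null, comment: `failed: ${stageEvent.detail}` };
  }
}

// 32-bit FNV-1a-style hash, rendered as 8 hex characters.
function contentHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Same item + same content yields the same key, so a re-run is a no-op.
function idempotencyKey(itemId: string, body: string): string {
  return `${itemId}:${contentHash(body)}`;
}
```

Editing the tracker item's description changes the hash and therefore the key, so an edited task is treated as new work while an unchanged re-export is deduplicated.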

Phase 2 — Native spine

  • Stand up a fleet (a.k.a. orchestrator) module inside platform-service, sibling to tracker-service: pattern types.ts → repository.ts → routes.ts, ESM, Cosmos, productId, req.log.
  • Endpoints: jobs CRUD, claim/lease, events/report, factories register/heartbeat, profiles, stats.
  • Runners (bash + any) become API clients of this module; tracker adapter calls it directly.
  • Acceptance: a job submitted via the module is claimed by a real factory and shipped, with all state in Cosmos.
  • Verify gate: module test suite (repository + routes) using the shared @bytelyst/testing inject helpers.

Phase 3 — Unified control plane

  • Add a Fleet surface to tracker-web reusing auth/Primitives/DataTable/product switcher: fleet map (factories + load/health), job table, job DAG, live log streaming (SSE), lease/heartbeat status, cost burndown, approve/ship buttons.
  • The Node TUI dashboard becomes a thin client of the same /fleet API (parity with web).
  • Acceptance: an operator can watch all factories + tail any job log + ship from the browser.
  • Verify gate: web e2e (Playwright) covering fleet map render, live log, and a ship action.

11. Lifecycle & gates at scale (feature)

  • Canonical stages enforced server-side: queued → assigned → building → review → testing → shipped (+ failed, dead_letter).
  • Per-profile default verify; per-job override; verify runs at the factory, result reported as an event.
  • Human gates: review-policy routes to reviewers; multi-reviewer support (P3).
  • Dead-letter: after retry.max exhausted, job → dead_letter with full diagnostics; never silently dropped.
  • Backpressure: when no factory can take more, jobs stay queued (no thrash); SLA timers visible.
  • Acceptance: a perpetually-failing job lands in dead_letter after configured retries; a passing one auto-advances to testing then waits for human ship.
  • Verify gate: lifecycle state-machine unit tests (all transitions + illegal-transition rejection).
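A small transition table is one way to enforce the canonical stages and reject illegal moves. The stage names come from this document; the specific edges (for example `building → queued` on lease reclaim) and the reading of `retry.max` as "number of retries after the first failure" are assumptions made for this sketch.

```typescript
// Sketch only: stage names are canonical; the allowed edges are assumptions.
type Stage =
  | "queued" | "assigned" | "building" | "review"
  | "testing" | "shipped" | "failed" | "dead_letter";

const ALLOWED: Record<Stage, readonly Stage[]> = {
  queued: ["assigned"],
  assigned: ["building", "queued"],         // queued: lease expired before start
  building: ["review", "failed", "queued"], // queued: reclaimed after a crash
  review: ["testing", "failed"],
  testing: ["shipped", "failed"],
  failed: ["queued", "dead_letter"],
  shipped: [],
  dead_letter: [],
};

// Returns the new stage, or throws when the move is not in the table.
function transition(from: Stage, to: Stage): Stage {
  if (!ALLOWED[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}

// Decides where a failed job goes next. `failures` counts failures so far,
// including the current one; `retryMax` is the number of retries allowed.
function afterFailure(failures: number, retryMax: number): Stage {
  return transition("failed", failures > retryMax ? "dead_letter" : "queued");
}
```

With `retry.max: 2` under this reading, the first and second failures requeue the job and the third sends it to `dead_letter`.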

12. Security, safety & governance (feature — critical with yolo/dangerous)

  • Secret isolation: creds live on each factory (env/keychain), never in the queue, manifest, logs, or Cosmos. Factory advertises presence of a cred capability, not the value.
  • Scoped git tokens per factory/repo; least-privilege; rotation documented.
  • Push policy: protected branches; agents push to feature branches + open PRs by default; direct-to-main gated by profile/flag.
  • Blast-radius guardrail: enforce allowed-scope — pre-flight + post-run diff check; out-of-scope changes block the ship gate.
  • Budget kill-switch: exceed budget (usd/tokens/wall) → pause worker, alert, require human resume.
  • Supply-chain safety: edits to shared @bytelyst/* packages require reviewer profile + human gate (never auto-ship).
  • Audit trail: append-only event log per job (who/what/when/where/cost); immutable.
  • Corp network/proxy: honor NETWORK/proxy + truststore conventions on factories that need them.
  • Kill switch (global): one command/flag halts all claiming fleet-wide (incident response).
  • Acceptance: a job attempting an out-of-scope edit is blocked at the gate; a budget overrun pauses and alerts; no secret ever appears in any persisted artifact (scanner test).
  • Verify gate: security test suite incl. a secret-leak scanner over logs/meta + scope-enforcement test.
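The post-run half of the blast-radius guardrail reduces to checking each changed path against the `allowed-scope` globs. The sketch below assumes a minimal glob dialect (`**` matches across directories, `*` matches within one path segment) and hypothetical function names; the roadmap does not specify the glob syntax or where the changed-file list comes from.

```typescript
// Sketch only: a minimal glob dialect, not a full gitignore/minimatch implementation.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters except "*", which carries glob meaning.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*/g, "\u0000")  // park "**" so the next step does not touch it
    .replace(/\*/g, "[^/]*")     // "*" stays inside one path segment
    .replace(/\u0000/g, ".*");   // "**" may cross "/" boundaries
  return new RegExp(`^${pattern}$`);
}

// Returns the changed paths that no allowed-scope glob covers.
function outOfScope(changedPaths: string[], allowedScope: string[]): string[] {
  const matchers = allowedScope.map(globToRegExp);
  return changedPaths.filter((path) => !matchers.some((re) => re.test(path)));
}

// A non-empty result would block the ship gate (warn-only in Phase 1).
function scopeGatePasses(changedPaths: string[], allowedScope: string[]): boolean {
  return outOfScope(changedPaths, allowedScope).length === 0;
}
```

For the `backend-engineer` profile shown in §6, a run that touches `backend/src/a.ts` and `infra/deploy.yml` would report `infra/deploy.yml` as out of scope.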

13. Data model (Cosmos containers, P2+)

Each container has an explicit partition key (listed below); every doc has productId.

  • fleet_jobs (pk /productId) — manifest snapshot, current stage, idempotency-key, tracker-item link.
  • fleet_runs (pk /jobId) — one per execution attempt: factory, engine, profile snapshot, timings, cost, exit, verify result.
  • fleet_leases (pk /jobId) — holder factory, TTL, renewals; TTL index for auto-expiry.
  • fleet_factories (pk /productId) — descriptor, capabilities, health, load, last heartbeat.
  • fleet_profiles (pk /productId) — versioned profile snapshots.
  • fleet_events (pk /jobId) — append-only audit/event stream (stage changes, logs ptr, cost ticks, decisions).
  • Relate to existing tracker Item via tracker-item (no duplication of planning data).
  • Acceptance: repository CRUD + query tests per container; lease TTL expiry verified.
  • Verify gate: repository unit/integration tests (memory + Cosmos provider via DB_PROVIDER).
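To make the lease container concrete, here is a hedged sketch of what a `fleet_leases` document might look like. Azure Cosmos DB expires an item when the container has time-to-live enabled and the item carries a `ttl` property in seconds, counted from the item's last write. Everything else here is an assumption for illustration: the interface and function names, the `lease:<jobId>` id convention, and the `renewals` counter.

```typescript
// Sketch only: document shape is illustrative, not the committed schema.
interface LeaseDoc {
  id: string;        // assumed convention: one lease id per job
  productId: string; // required on every document (ByteLyst rule)
  jobId: string;     // partition key (/jobId)
  factoryId: string; // current holder
  renewals: number;
  ttl: number;       // seconds; Cosmos removes the item after this long without a write
}

function newLeaseDoc(productId: string, jobId: string, factoryId: string, ttlSeconds: number): LeaseDoc {
  if (ttlSeconds <= 0) throw new Error("ttlSeconds must be positive");
  return { id: `lease:${jobId}`, productId, jobId, factoryId, renewals: 0, ttl: ttlSeconds };
}

// Renewal is just a rewrite of the document: the write resets the TTL countdown.
function renewLeaseDoc(doc: LeaseDoc): LeaseDoc {
  return { ...doc, renewals: doc.renewals + 1 };
}
```

Because an item id must be unique within a logical partition, deriving the id from the job id is one possible way to back the "exactly one lease per in-flight job" rule from §4 at the storage layer.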

14. Phased build roadmap (checklists)

Each phase: Goal → checklist → Exit criteria. Don't start a phase until the prior phase's Exit criteria are green. Tick boxes here as the canonical progress.

Phase 1 — Manifest + profiles + capabilities + tracker adapter (single host)

Goal: richer single-host runner that understands profiles/capabilities and bridges to tracker — no distributed infra yet.

  • Extend agent-queue.sh frontmatter parsing for all new manifest fields (§5), defaulted + backward-compatible.
  • Add profiles/ directory + profile resolution (persona injection, default verify/caps/scope) (§6).
  • Local capability detection + a job/factory capability match check before launch (§8 subset).
  • priority ordering in the inbox pick (replace pure FIFO with priority-then-age).
  • deps (DAG) blocking on a single host; idempotency-key dedupe on add.
  • retry with backoff into failed/requeue; budget.wall enforced (extends timeout).
  • allowed-scope guardrail (warn-only this phase) + post-run diff report.
  • Tracker adapter aq from-tracker <ITEM> + aq to-tracker event poster (§10 P1).
  • Dashboard shows profile + priority + capability tags + tracker-item link.
  • Update selftest.sh with: manifest parse fixtures, profile resolution, priority order, dep-block, idempotency, adapter round-trip (mock).
  • Update README + this doc's progress table.
  • Exit criteria: all boxes checked; selftest.sh green; a tracker task → executed → tracker done with SHA comment, fully on one host; no regression to Phase-0 .md files.
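Two of the Phase 1 items, priority-then-age ordering and dependency blocking, combine naturally into one inbox pick function. The sketch below is TypeScript for consistency with the other examples in this document, although the Phase 1 runner itself is bash. The names are hypothetical, and it assumes a dependency counts as met once its id is in a `satisfied` set supplied by the caller.

```typescript
// Sketch only: illustrates the ordering rule, not the runner's actual code.
type Priority = "critical" | "high" | "medium" | "low";

interface QueuedJob {
  id: string;
  priority: Priority;
  enqueuedAt: number; // epoch milliseconds
  deps: string[];
}

const RANK: Record<Priority, number> = { critical: 0, high: 1, medium: 2, low: 3 };

// Picks the next runnable job: highest priority first, oldest first within a
// priority. Jobs with unmet deps are skipped and stay in the inbox.
function nextJob(inbox: QueuedJob[], satisfied: ReadonlySet<string>): QueuedJob | null {
  const ready = inbox.filter((job) => job.deps.every((dep) => satisfied.has(dep)));
  ready.sort((a, b) => RANK[a.priority] - RANK[b.priority] || a.enqueuedAt - b.enqueuedAt);
  return ready[0] ?? null;
}
```

A `critical` job that is still waiting on a dependency does not block the inbox: the pick falls through to the best job that is actually runnable.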

Phase 2 — Coordinator as platform-service module + Cosmos + multi-factory leasing

Goal: the service spine; ≥2 real factories executing in parallel via leases.

  • Scaffold fleet/orchestrator module in platform-service (types/repository/routes, Zod, ESM, productId).
  • Cosmos containers (§13) + repository layer (memory + Cosmos providers).
  • Claim/lease protocol endpoints + TTL expiry + reclaim (§8, §9).
  • Port agent-queue runner to a factory agent API client (register/heartbeat/claim/report) while keeping git-queue fallback.
  • Scheduler/router core (§7) as a pure module + wired into assignment.
  • Tracker adapter calls the module directly (not just file export).
  • Auth: factory tokens; scoped; secret isolation enforced (§12 subset).
  • Module test suite (repository + routes via @bytelyst/testing); crash-recovery + lease-expiry tests.
  • Two-factory demo (e.g. mac + ubuntu) running 3 parallel jobs end-to-end.
  • Exit criteria: all boxes checked; pnpm --filter @lysnrai/platform-service test green; killing a factory mid-job → another reclaims and completes; all state in Cosmos with productId.

Phase 3 — Fleet control plane in tracker-web + DAG + budgets + scoring router

Goal: one browser control plane; smart routing + budgets live.

  • fleet API client in tracker-web (reuse /api/tracker-style proxy → /fleet).
  • Fleet map page (factories, load, health, capabilities) on @bytelyst/* components.
  • Job table + job detail + DAG view; live log via SSE; approve/ship/reject/requeue actions.
  • Cost burndown + budget kill-switch UI; multi-reviewer routing.
  • Scoring router with configurable weights + explainability surfaced in UI.
  • Preemption of low-priority by critical jobs (checkpoint + requeue).
  • TUI dashboard re-pointed at /fleet API (parity).
  • Web e2e (Playwright): fleet map, live log, ship, budget-pause.
  • Exit criteria: all boxes checked; web verify (typecheck+lint+test+e2e) green; an operator runs the whole 3-repo parallel workload from the browser, including a budget pause + resume.

Phase 4 — Message bus + autoscaling + cross-OS capability marketplace

Goal: scale-out and elasticity.

  • Introduce broker (NATS/Redis) for push dispatch + backpressure; coordinator publishes, factories subscribe by capability.
  • Autoscaling hooks (spin ephemeral factories: cloud VM / container) keyed to queue depth + SLA.
  • Capability "marketplace": jobs requiring rare caps (xcode/figma/gpu) routed to the few factories that have them; queueing fairness across products.
  • Load + chaos test suite (factory churn, broker outage, thundering herd).
  • Exit criteria: all boxes checked; sustained N×throughput vs Phase 3 under load test; graceful degradation on broker outage (fallback to poll).

Phase 5 — Self-optimizing / learned routing

Goal: the scheduler learns from history to cut time/cost and raise first-pass success.

  • Capture outcome features per run (engine, profile, repo, duration, cost, verify pass, human-edit rate).
  • Offline eval harness comparing learned vs heuristic routing on historical data.
  • Shadow/A-B rollout with guardrails; auto-tune scoring weights.
  • Recommendations surfaced ("route NomGap UX jobs to claude on mac-2: 23% faster, 11% cheaper").
  • Exit criteria: all boxes checked; learned router beats heuristic on the eval set without regressing safety gates; A/B shows measurable improvement on a target metric.

15. Cross-cutting feature catalog (quick index)

| Feature | First phase | Section |
| --- | --- | --- |
| Evolved job manifest | P1 | §5 |
| Profiles (persona + capability) | P1 | §6 |
| Capability matching | P1→P2 | §6/§8 |
| Priority + SLA | P1 | §5/§7 |
| DAG dependencies | P1→P3 | §5/§11 |
| Idempotency / dedupe | P1 | §5 |
| Retry + dead-letter | P1→P2 | §11 |
| Budgets + kill-switch | P1(wall)→P3 | §5/§12 |
| Scheduler/router scoring | P2→P3 | §7 |
| Factory registration/heartbeat/lease | P2 | §8 |
| Coordinator (platform-service module) | P2 | §9/§10 |
| Cosmos data model | P2 | §13 |
| Tracker bi-directional sync | P1→P2 | §10 |
| Web control plane + SSE logs | P3 | §10/§17 |
| Security/scope/secret isolation | P1→P2 | §12 |
| Broker + autoscaling | P4 | §14 |
| Learned routing | P5 | §14 |

16. Definition of Done — the "100% accuracy" rubric

A feature/phase is not done until every item below is true (this is the bar for "100% end-to-end"):

  • Functionality: acceptance criteria met; happy path + documented edge cases handled.
  • Tests: unit + integration written first or alongside, all green; no weakened/deleted tests; coverage targets met (router ≥95% core).
  • Verify gate: the phase's named gate command passes locally (and in CI where applicable).
  • Idempotency & recovery: re-runs are safe; crash mid-step recovers (lease/idempotency).
  • Security review: secret-leak scan clean; scope guardrail honored; least-privilege tokens.
  • Observability: events/logs/metrics emitted; failures are diagnosable from the control plane.
  • Docs: this roadmap's checkboxes ticked; README/AGENTS updated; manifest/profile docs current.
  • Backward-compat: existing .md/Phase-0 behavior unbroken (regression check).
  • Drift checks: shared-infra templates (.npmrc, docker-prep) untouched/synced; conventional commits.
  • No console.log/print in service code; req.log/os.Logger used; ESM .js imports.

17. Observability & control plane details

  • Live logs via SSE from factory → coordinator → web/TUI (single stream contract).
  • Metrics: queue depth, assign latency, run duration, verify pass-rate, cost, factory utilization, fairness.
  • Alerting: stall (no log N min), failure spikes, budget breach, factory offline, dead-letter.
  • Tracing: a job's full timeline (queued→…→shipped) reconstructable from fleet_events.
  • Cost burndown per job/product/day with budget overlays.

18. Risks & gaps explicitly tracked (expert call-outs)

  • Duplicate execution across transports (git fallback + service) — mitigated by idempotency-key + lease.
  • Crash recovery — lease TTL + reclaim; checkpoint long jobs where engines allow.
  • Shared-package conflicts — two jobs editing @bytelyst/* simultaneously → lock + reviewer gate.
  • Starvation/fairness — per-product + per-factory counters with penalty.
  • Cost runaway — hard budgets + global kill switch.
  • Tool-version drift / reproducibility — record engine + tool versions per run; pin where possible.
  • Windows quirks — path/shell differences in the factory agent; capability-gate Windows-only work.
  • Human-review bottleneck — auto-verify as much as possible; batch review UI; reviewer routing.
  • Result capture beyond commits — artifacts (coverage, screenshots, build logs) attached to runs.
  • Secret sprawl — never in queue/manifest/logs/Cosmos; presence-only capabilities.
  • Data retention — event/log retention + archival policy (extend today's clean).
  • Engine API churn — engines mapped in one place (build_agent_cmd); capability matrix versioned.

19. Success metrics

  • Throughput: jobs shipped/day; parallel utilization (% of fleet busy).
  • Quality: % auto-verified, first-pass success rate, escaped-defect rate, human-edit rate post-agent.
  • Speed: mean time queued→shipped; assign latency.
  • Cost: $/shipped job; budget-breach rate.
  • Reliability: lease-reclaim success, dead-letter rate, factory uptime.
  • Fairness: max/min product wait-time ratio.
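The fairness metric is simple enough to pin down in a few lines. This sketch assumes the input is a mean queue wait per product in milliseconds and makes two edge-case choices that the roadmap does not specify: an empty fleet or all-zero waits report a ratio of 1, and a zero minimum with a non-zero maximum reports Infinity. The function name is hypothetical.

```typescript
// Sketch only: edge-case handling is an assumption.
// Ratio of the slowest product's mean wait to the fastest product's mean wait.
// 1 means perfectly even; larger values mean some product is waiting longer.
function fairnessRatio(meanWaitMsByProduct: ReadonlyMap<string, number>): number {
  const waits = Array.from(meanWaitMsByProduct.values());
  if (waits.length === 0) return 1;
  const min = Math.min(...waits);
  const max = Math.max(...waits);
  if (max === 0) return 1; // nobody waited at all
  return min === 0 ? Infinity : max / min;
}
```

For example, if one product's jobs wait 100 ms on average and another's wait 400 ms, the ratio is 4.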

20. Open questions

  • Copilot headless feasibility as an engine/station (CLI/automation surface?).
  • Who owns merge/push authority — agents open PRs only, or auto-merge on green for low-risk profiles?
  • Multi-user/tenant: per-user queues + RBAC in the control plane?
  • On-call/ownership for the fleet (alerts routing, runbooks)?
  • Cloud factory provisioning (Phase 4) — which provider/runtime, cost guardrails?
  • Profile authorship/governance — who can create/edit profiles, and review of persona prompts?

This document is the single source of truth for the gigafactory build. Keep the §0 table and per-phase checkboxes updated; a phase ships only when its Exit criteria and the §16 Definition-of-Done rubric are fully green.