- docs/GIGAFACTORY_ROADMAP.md: distributed multi-machine fleet vision (factory x tool x profile routing) as a checklist-driven, phased implementation roadmap (Phase 0-5) with acceptance criteria, verify gates, and a 100% Definition-of-Done rubric - committed path: coordinator as a platform-service module + control plane on tracker-web, reached via a thin tracker adapter first; bash runner survives as the offline edge factory agent - README: add vision/roadmap pointer
Agent Gigafactory — Vision & Implementation Roadmap
One-liner: Evolve today's single-host `agent-queue` bash runner into a distributed gigafactory: a fleet of heterogeneous machines (Mac/Ubuntu/Windows), each running different coding-agent CLIs (Devin/Codex/Claude/Copilot/…), where a scheduler auto-picks jobs from a shared inbox and routes each `.md` to the best factory × tool × profile. It is built service-side on `platform-service` + `tracker-web`, with the bash runner surviving as the offline edge agent.
How to use this doc: It is both a PRD and an execution checklist. Every feature is a `- [ ]` checkbox with acceptance criteria and a verify gate. A phase is "100% done" only when every box is checked, its gate passes, and the phase Definition of Done rubric (§16) is green. Update the progress table (§0) as you go.
0. Progress tracker
| Phase | Theme | Status | % | Gate |
|---|---|---|---|---|
| 0 | Baseline (today) | ✅ shipped | 100% | selftest.sh green |
| 1 | Manifest + profiles + capabilities + tracker adapter (single host) | ☐ not started | 0% | adapter e2e + selftest |
| 2 | Coordinator as platform-service module + Cosmos + multi-factory leasing | ☐ not started | 0% | fleet e2e + module tests |
| 3 | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ☐ not started | 0% | web e2e + router tests |
| 4 | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite |
| 5 | Self-optimizing / learned routing | ☐ not started | 0% | offline eval + A/B |
Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary.
1. Vision & metaphor
A gigafactory turns raw intent (.md task files / tracker items) into shipped software with minimal human touch. The mental model is a physical factory network:
| Term | Meaning |
|---|---|
| Fleet | The whole network of machines under one control plane. |
| Factory | One physical/virtual machine (a Mac, an Ubuntu box, a Windows host). Has an OS, installed tools, creds, capacity. |
| Station | A tool/engine slot inside a factory (a Devin seat, a Codex CLI, a Claude Code session, a Copilot agent). |
| Worker | A single running agent process executing one job at a station. |
| Job | A unit of work: a prompt/.md + manifest (profile, scope, gates, budget). |
| Profile | The role doing the work (developer, backend engineer, UX/UI designer, QA, reviewer) = persona prompt + capability requirements. |
| Capability | A tag a factory advertises and a job requires (os:mac, has:xcode, has:figma, gpu, engine:devin). |
| Lease | A time-boxed claim of a job by a worker; expires → job is reclaimable (crash recovery). |
| Gate | A checkpoint a job must pass: auto-QA verify, human review, ship approval. |
| Artifact | Any captured output: commits/PRs, logs, screenshots, reports, build outputs. |
North star: drop work into one inbox (or file a tracker task), and the fleet figures out where (factory), with what (tool/engine), as whom (profile), runs it in parallel, self-heals on crash, gates quality automatically, and surfaces everything in one live control plane — while a human only approves the final ship.
┌──────────────────────── CONTROL PLANE (tracker-web) ────────────────────────┐
│ plan/intake · roadmap · Fleet map · live logs · cost · approvals │
└───────────────▲───────────────────────────────────┬─────────────────────────┘
│ REST/SSE │
┌────────────────────────────┴─────── COORDINATOR (platform-service module) ───────────────┐
│ queue · scheduler/router · leases · profiles · capabilities · events · budgets (Cosmos) │
└───▲───────────────────────▲───────────────────────▲───────────────────────▲───────────────┘
│ claim/lease/report │ │ │
┌───────┴───────┐ ┌────────┴───────┐ ┌────────┴───────┐ ┌───────┴────────┐
│ FACTORY: mac │ │ FACTORY: ubuntu│ │FACTORY: windows│ │ FACTORY: mac-2 │
│ devin, claude │ │ codex, claude │ │ copilot, codex │ │ devin (xcode) │
│ [agent-queue] │ │ [agent-queue] │ │ [agent-queue] │ │ [agent-queue] │
└───────────────┘ └────────────────┘ └────────────────┘ └────────────────┘
2. Current state (Phase 0 baseline — already shipped)
Today's agent-queue.sh + dashboard.mjs (single host, zero-dep bash + Node):
- Folder kanban lifecycle: `inbox → building → review → testing → shipped` (+ `failed`).
- Auto-QA gate: agent rc=0 → `review/`; optional `verify:` runs in `cwd` → pass `testing/`, fail `failed/`; no verify → parks in `review/`. Manual `ship` = the human gate.
- Per-job frontmatter: `engine` (devin/claude/codex), `cwd`, `yolo` (→ dangerous/auto-approve), `lock` (per-repo serialization), `timeout`, `verify`.
- Concurrency: `AGENT_QUEUE_MAX` (default 3), per-`lock` serialization so same-repo jobs never collide.
- State & logs: `.state/<job>.meta` heartbeats + `logs/<job>.log`; git-tracked queue (audit-by-commit).
- Interactive dashboard: numbered selectable job list, single-key actions (promote/ship/reject/requeue), live log viewer, run/stop, all shelling out to `agent-queue.sh`.
Carries forward: the .md-in-inbox UX, frontmatter contract, lifecycle stage names, verify gate, lock/affinity concept, the bash runner itself (becomes the factory agent).
Must change for the fleet: single-host run loop → distributed leasing; file-only state → service + Cosmos; one engine choice → capability/profile routing; local dashboard → shared control plane.
- Phase 0 complete — baseline shipped and self-tested. (reference, not a work item)
3. Goals & non-goals
Goals
- One intake, many machines: parallel execution across heterogeneous OS/tools.
- Automatic routing to the best `factory × tool × profile` with affinity, fairness, budget, and health awareness.
- Self-healing (lease expiry/requeue), quality gates, and full observability.
- Reuse the ByteLyst stack (`platform-service`, Cosmos, `@bytelyst/*`, tracker-web): no parallel infra.
- Preserve offline/zero-dep edge operation via the bash runner.
Non-goals
- Not a CI/CD replacement (it triggers CI; CI still gates merges).
- Not a general-purpose workflow engine (scoped to coding-agent execution).
- Not a model/inference host (it orchestrates agent CLIs, doesn't serve models).
- Not abandoning the simple `.md` mental model: humans still drop files / file tasks.
4. Core concepts contract (must hold across all phases)
- Every job has a stable id, an immutable manifest, and an append-only event log.
- Every Cosmos document carries `productId` (ByteLyst rule).
- A job in flight is always covered by exactly one lease; no lease → reclaimable.
- Lifecycle stages are canonical and shared: `queued → assigned → building → review → testing → shipped` (+ `failed`, `dead_letter`).
- The bash runner and the service speak the same manifest + event vocabulary (one schema, two transports).
5. The evolved Job manifest (feature)
Extend today's frontmatter into a richer, backward-compatible manifest. Old .md files keep working (new fields optional with sane defaults).
---
# --- existing (unchanged) ---
engine: devin # explicit engine; overrides profile/engine-class
cwd: /abs/path/repo
yolo: true
lock: my-repo
timeout: 45m
verify: pnpm -s test
# --- new ---
profile: backend-engineer # role: persona + capability requirements
engine-class: agentic-coder # abstract; scheduler picks a concrete engine if `engine` unset
capabilities: [os:any, node>=20] # hard requirements a factory MUST satisfy
prefers: [factory:mac-2] # soft routing hints (affinity)
priority: high # critical|high|medium|low → SLA + preemption
budget: { usd: 5, tokens: 2M, wall: 4h } # hard ceilings; exceed → pause/fail
deps: [job-123, job-456] # DAG: don't start until these reach `shipped`/`testing`
idempotency-key: nomgap-ux-2 # dedupe: a second identical submit is a no-op
retry: { max: 2, backoff: 5m, on: [timeout, verify_failed] }
review-policy: manual # auto|manual|reviewers:[@alice]
artifacts: [coverage, screenshots] # what to capture beyond commits
tracker-item: ITEM-789 # link back to the originating tracker task
---
- Define the manifest schema (Zod in the service; documented YAML for `.md`).
- Backward-compat: a Phase-0 `.md` (only `engine`/`cwd`/`yolo`) parses with all new fields defaulted.
- `idempotency-key` dedupe semantics specified (same key + same content hash = no-op).
- `deps` DAG semantics specified (blocked state, cycle detection, fan-in/out).
- Acceptance: a manifest fixture suite parses/validates; invalid manifests fail with precise errors.
- Verify gate: schema unit tests (≥ 1 per field incl. defaults + 5 invalid cases).
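The backward-compat rule above can be sketched as a pure normalizer. This is a minimal illustration in plain TypeScript rather than the Zod schema the checklist calls for; the type names, the `normalizeManifest` helper, and every default value (`medium` priority, empty lists, `yolo: false`) are assumptions, not decisions recorded in this doc.

```typescript
type Priority = "critical" | "high" | "medium" | "low";

// Raw frontmatter as it might come out of a YAML parser (every field optional).
interface RawManifest {
  engine?: string;
  cwd?: string;
  yolo?: boolean;
  profile?: string;
  capabilities?: string[];
  priority?: string;
  deps?: string[];
}

// Normalized manifest: every new field has a concrete value.
interface Manifest {
  engine?: string;
  cwd?: string;
  yolo: boolean;
  profile?: string;
  capabilities: string[];
  priority: Priority;
  deps: string[];
}

type ParseResult =
  | { ok: true; manifest: Manifest }
  | { ok: false; errors: string[] };

const PRIORITIES: readonly string[] = ["critical", "high", "medium", "low"];

// Fills assumed defaults for the new fields and collects precise errors
// instead of throwing on the first problem.
function normalizeManifest(raw: RawManifest): ParseResult {
  const errors: string[] = [];
  const priority = raw.priority ?? "medium"; // assumed default
  if (!PRIORITIES.includes(priority)) {
    errors.push(`priority: expected one of ${PRIORITIES.join("|")}, got "${priority}"`);
  }
  const deps = raw.deps ?? [];
  if (new Set(deps).size !== deps.length) {
    errors.push("deps: duplicate job ids");
  }
  if (errors.length > 0) return { ok: false, errors };
  return {
    ok: true,
    manifest: {
      engine: raw.engine,
      cwd: raw.cwd,
      yolo: raw.yolo ?? false,
      profile: raw.profile,
      capabilities: raw.capabilities ?? [],
      priority: priority as Priority,
      deps,
    },
  };
}
```

A Phase-0 file carrying only `engine`/`cwd`/`yolo` passes through with the new fields defaulted, while `priority: urgent` comes back as a single named error.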
6. Profiles — persona + capability (feature)
A profile = a versioned file combining a persona (system-prompt overlay), required capabilities, default gates, preferred engine/model, and allowed repo scopes. Stored as profiles/<name>.md (Phase 1) → Cosmos profiles container (Phase 2).
# profiles/backend-engineer.md
---
name: backend-engineer
persona: |
  You are a senior backend engineer. Favor minimal, well-tested changes...
capabilities: [node>=20, has:pnpm]
default-verify: pnpm -s typecheck && pnpm -s test
engine-class: agentic-coder
prefers-engine: [devin, claude]
allowed-scope: ["backend/**", "packages/**"] # blast-radius guardrail
review-policy: manual
---
- Author starter catalog: `developer`, `backend-engineer`, `frontend-engineer`, `ux-designer`, `ui-designer`, `qa`, `reviewer`, `docs-writer`.
- Persona overlay is prepended to the job body before the agent runs (with any secrets stripped before it is logged).
- Profile supplies default `verify`, `capabilities`, `engine-class`, `allowed-scope` when the job omits them.
- Profile versioning: changing a profile doesn't mutate in-flight jobs (snapshot at assign time).
- `allowed-scope` enforced as a guardrail (warn in P1, enforce/deny in P2 via pre-flight diff check).
- Acceptance: a job with `profile: backend-engineer` and no `verify` inherits the profile's verify + persona.
- Verify gate: profile-resolution unit tests; persona-injection golden test.
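Profile resolution as described in this section could look roughly like the sketch below. The `Profile`, `JobInput`, and `ResolvedJob` shapes and the `resolveJob` helper are hypothetical. Merging job and profile capabilities as a union is an assumption taken from the router's "job ∪ profile" filter in §7.

```typescript
interface Profile {
  name: string;
  persona: string;
  capabilities: string[];
  defaultVerify?: string;
  engineClass?: string;
  allowedScope: string[];
}

interface JobInput {
  body: string;
  profile?: string;
  verify?: string;
  capabilities?: string[];
  engineClass?: string;
}

interface ResolvedJob {
  prompt: string; // what the agent actually receives
  verify?: string;
  capabilities: string[];
  engineClass?: string;
  allowedScope: string[];
}

// Pure resolution: job-level values win, the profile fills the gaps,
// and the persona is prepended to the job body.
function resolveJob(job: JobInput, catalog: Map<string, Profile>): ResolvedJob {
  if (job.profile === undefined) {
    return {
      prompt: job.body,
      verify: job.verify,
      capabilities: job.capabilities ?? [],
      engineClass: job.engineClass,
      allowedScope: [],
    };
  }
  const profile = catalog.get(job.profile);
  if (profile === undefined) {
    throw new Error(`unknown profile: ${job.profile}`);
  }
  // Assumption: hard requirements are the union of job and profile capabilities.
  const capabilities = Array.from(
    new Set([...(job.capabilities ?? []), ...profile.capabilities]),
  );
  return {
    prompt: `${profile.persona.trim()}\n\n${job.body}`,
    verify: job.verify ?? profile.defaultVerify,
    capabilities,
    engineClass: job.engineClass ?? profile.engineClass,
    // Copy so later profile edits cannot mutate an in-flight job (snapshot at assign time).
    allowedScope: [...profile.allowedScope],
  };
}
```

Because the function copies everything it needs out of the profile, storing its result at assign time gives the snapshot behavior the versioning bullet asks for.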
7. The scheduler / router (the heart) (feature)
Given a queued job and the current fleet, choose (factory, station/engine, profile) and issue a lease.
Inputs: job manifest (capabilities, priority, budget, deps, prefers, lock), profile requirements, live factory descriptors (capabilities, load, health, cost class), lock/affinity table, fairness counters.
Algorithm (deterministic, explainable):
- Filter factories by hard capability match (job ∪ profile capabilities ⊆ factory capabilities) and a free station for a compatible engine.
- Block if `deps` unmet or `lock` already held → leave `queued`/`blocked`.
- Score each candidate factory: `score = w1·capabilityFit + w2·affinity(prefers, repo-stickiness) + w3·(1/load) + w4·costFit(budget) + w5·health − w6·starvationPenalty`
- Tie-break: highest priority job first; then oldest; then lowest cost class.
- Assign → write lease (TTL), set job `assigned`, decrement station capacity, bump fairness counter.
- Preemption (P3+): a `critical` job may pause a `low` job at a needed station (checkpoint + requeue).
- Implement pure, unit-testable scoring function (no I/O) with configurable weights.
- Hard-filter correctness: never assign a job to a factory missing a required capability.
- Affinity/stickiness: same-repo jobs prefer the factory that has the warm checkout (lock-aware).
- Fairness: no factory or product starves under sustained load (counter + penalty).
- Explainability: every assignment records why (matched caps, score breakdown) in the event log.
- Determinism: same inputs → same decision (seeded tie-breaks) for testability.
- Acceptance: scenario fixtures (10+) produce expected assignments incl. starvation + capability-miss + budget-exceed.
- Verify gate: router unit suite ≥ 95% branch coverage on the scoring/filter core.
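A minimal sketch of the filter-then-score core follows, with no I/O so it stays unit-testable. Several choices here are assumptions made for illustration: capabilities are matched as opaque tags (no `node>=20` version comparison), `capabilityFit` is taken as the share of a factory's capabilities the job actually needs, the load term uses `1/(1+load)` to avoid dividing by zero on an idle factory, and ties are broken by factory id. All type and function names are hypothetical.

```typescript
interface FactoryState {
  id: string;
  capabilities: string[];
  freeStations: number;
  load: number;              // jobs currently running
  health: number;            // 0..1
  costFit: number;           // 0..1, precomputed against the job budget
  starvationPenalty: number; // taken as an input, as in the formula above
}

interface RoutableJob {
  required: string[]; // job ∪ profile capabilities
  prefers: string[];  // e.g. "factory:mac-2"
}

// w1..w6 in the order used by the score formula.
type Weights = [number, number, number, number, number, number];

// Hard filter: every required capability present and a station free.
function isEligible(job: RoutableJob, f: FactoryState): boolean {
  return f.freeStations > 0 && job.required.every((cap) => f.capabilities.includes(cap));
}

function scoreFactory(job: RoutableJob, f: FactoryState, w: Weights): number {
  // Assumed definition: prefer the tightest fit so rare capabilities are not wasted.
  const capabilityFit =
    f.capabilities.length === 0 ? 1 : job.required.length / f.capabilities.length;
  const affinity = job.prefers.includes(`factory:${f.id}`) ? 1 : 0;
  return (
    w[0] * capabilityFit +
    w[1] * affinity +
    w[2] * (1 / (1 + f.load)) +
    w[3] * f.costFit +
    w[4] * f.health -
    w[5] * f.starvationPenalty
  );
}

// Deterministic: same inputs give the same pick (score, then factory id).
function pickFactory(
  job: RoutableJob,
  factories: FactoryState[],
  w: Weights,
): FactoryState | undefined {
  const ranked = factories
    .filter((f) => isEligible(job, f))
    .map((f) => ({ f, s: scoreFactory(job, f, w) }))
    .sort((a, b) => b.s - a.s || a.f.id.localeCompare(b.f.id));
  return ranked[0]?.f;
}
```

Keeping the hard filter separate from the score means a `prefers` hint can never pull a job onto a factory that lacks a required capability, which is the correctness property the checklist singles out.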
8. Factory model & registration (feature)
Each machine runs a factory agent (the evolved agent-queue runner) that registers, heartbeats, claims jobs, and reports events.
- Capability auto-detection at boot: OS, installed engines (devin/claude/codex/copilot), tool probes (xcode, figma-cli, docker, gpu), node/pnpm versions, available creds (presence only, never values).
- Registration: `POST /fleet/factories` with descriptor → receives a factory id + token.
- Heartbeat: periodic `PUT /fleet/factories/:id/heartbeat` (load, free stations, health); missed N → factory marked `offline`, its leases reclaimed.
- Claim loop: `POST /fleet/leases/claim` advertising capabilities/free stations; receives a job + lease TTL.
- Report: stream stage/log/event back (`POST /fleet/runs/:id/events`); renew lease while alive.
- Graceful drain: factory can stop claiming, finish in-flight, deregister.
- Acceptance: a factory registers, claims a matching job, heartbeats, completes, and a killed factory's job is reclaimed by another within the lease TTL.
- Verify gate: factory-agent integration test against a mock coordinator; crash-recovery test.
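The claim/renew/reclaim behavior in the acceptance criterion can be illustrated with an in-memory lease table. This is only a sketch: `LeaseTable` and its method names are hypothetical, the real store would be Cosmos (§13), and the clock is passed in explicitly so tests stay deterministic.

```typescript
interface Lease {
  jobId: string;
  factoryId: string;
  expiresAt: number; // epoch milliseconds
}

class LeaseTable {
  private readonly leases = new Map<string, Lease>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Grants a lease unless another factory still holds an unexpired one.
  claim(jobId: string, factoryId: string, now: number): Lease | undefined {
    const current = this.leases.get(jobId);
    if (current !== undefined && current.expiresAt > now) {
      return undefined; // still held
    }
    const lease: Lease = { jobId, factoryId, expiresAt: now + this.ttlMs };
    this.leases.set(jobId, lease);
    return lease;
  }

  // Only the current holder may renew, and only before expiry.
  renew(jobId: string, factoryId: string, now: number): boolean {
    const current = this.leases.get(jobId);
    if (
      current === undefined ||
      current.factoryId !== factoryId ||
      current.expiresAt <= now
    ) {
      return false;
    }
    current.expiresAt = now + this.ttlMs;
    return true;
  }
}
```

A factory that dies simply stops renewing; once `expiresAt` passes, the next `claim` from any other factory succeeds, and a late renewal from the original holder is refused.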
9. Coordination architecture (decision + path)
Three transports were evaluated. Decision: platform-service-native coordinator is the spine; git-queue stays for the offline edge; broker added only at scale.
| Option | Pros | Cons | Verdict |
|---|---|---|---|
| (a) Git-synced queue (evolve folders) | zero infra, audit-by-commit, offline | weak/racy leasing, latency, merge churn | Edge/offline only |
| (b) Coordinator service (platform-service module) | real leases, fairness, observability, reuses auth/Cosmos/productId | a service to run | Chosen spine (P2) |
| (c) Message broker (NATS/Redis/SQS) | scale, backpressure, push dispatch | most moving parts/ops | P4 when throughput demands |
- Document the decision + rationale in-repo (this section is the canonical record).
- Define the claim/lease protocol once; both git-queue (poll) and service (API) implement it.
- Offline-degrade: a factory cut off from the coordinator falls back to its local git-queue and reconciles on reconnect (idempotency-key prevents double-execution).
- Acceptance: the same job manifest runs identically through the bash/git path and the service path.
- Verify gate: contract test asserting protocol parity (git vs service).
10. tracker-web / platform-service integration (committed path)
Layering: tracker = WHAT/WHY (plan, intake, prioritize, roadmap, votes) · gigafactory = HOW (execute) · platform-service = shared brain · agent-queue runner = offline edge. Grounded in the real tracker-service model (Item: type bug/feature/task, status open/in_progress/done/closed/wont_fix, priority, labels, assignee, source incl. auto_detected, votes, comments, public roadmap) and the tracker-web /api/tracker/[...path] proxy pattern.
Phase 1 — Adapter (no new infra)
- task → job: a tracker `Item` of `type: task` (e.g. `assignee: @agent` or label `agent:run`) is exported to a job `.md` (manifest mapped: title/description → body, priority → priority, labels → capabilities/profile hints).
- job → tracker: lifecycle events post back as status updates + comments: `building` → status `in_progress` + comment "started on factory X"; `shipped` → `done` + comment with commit SHAs / PR link / verify results; `failed` → comment with reason (status stays `in_progress` for human triage).
- Idempotency: re-running the adapter for the same item doesn't create duplicate jobs (idempotency-key = item id + content hash).
- Adapter is a thin script/CLI (`aq from-tracker ITEM-789`) + optional poller.
- Acceptance: filing a tracker task, marking it `agent:run`, results in a queued job; on ship, the item flips to `done` with a SHA comment.
- Verify gate: adapter e2e against a tracker-service test instance (or mock); round-trip assertion.
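The task → job mapping and the "item id + content hash" key could be sketched as below. Everything here is illustrative: the `TrackerItem` shape only mirrors the fields named in this section, the `cap:` label prefix is an invented convention for carrying capability hints, and the hash is an FNV-1a-style function over UTF-16 code units standing in for whatever content hash the adapter ends up using (it is not cryptographic).

```typescript
interface TrackerItem {
  id: string;
  type: "bug" | "feature" | "task";
  title: string;
  description: string;
  priority: string;
  labels: string[];
  assignee?: string;
}

// FNV-1a-style 32-bit hash; a non-cryptographic stand-in for a content hash.
function contentHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Only tasks explicitly handed to the agent are exported.
function shouldRun(item: TrackerItem): boolean {
  return (
    item.type === "task" &&
    (item.assignee === "@agent" || item.labels.includes("agent:run"))
  );
}

// Renders the job .md: frontmatter mapped from the item, body from title/description.
function toJobMarkdown(item: TrackerItem): string {
  const key = `${item.id}-${contentHash(item.title + "\n" + item.description)}`;
  // Hypothetical convention: labels like "cap:os:mac" become capability tags.
  const caps = item.labels
    .filter((label) => label.startsWith("cap:"))
    .map((label) => label.slice("cap:".length));
  return [
    "---",
    `priority: ${item.priority}`,
    `capabilities: [${caps.join(", ")}]`,
    `idempotency-key: ${key}`,
    `tracker-item: ${item.id}`,
    "---",
    `# ${item.title}`,
    "",
    item.description,
  ].join("\n");
}
```

Re-exporting an unchanged item yields a byte-identical file and therefore the same key, so a second `add` can be treated as a no-op; editing the description changes the key and produces a new job.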
Phase 2 — Native spine
- Stand up a `fleet` (a.k.a. `orchestrator`) module inside platform-service, sibling to `tracker-service`: pattern `types.ts → repository.ts → routes.ts`, ESM, Cosmos, `productId`, `req.log`.
- Endpoints: jobs CRUD, claim/lease, events/report, factories register/heartbeat, profiles, stats.
- Runners (bash + any) become API clients of this module; tracker adapter calls it directly.
- Acceptance: a job submitted via the module is claimed by a real factory and shipped, with all state in Cosmos.
- Verify gate: module test suite (repository + routes) using the shared `@bytelyst/testing` inject helpers.
Phase 3 — Unified control plane
- Add a Fleet surface to `tracker-web` reusing auth/Primitives/DataTable/product switcher: fleet map (factories + load/health), job table, job DAG, live log streaming (SSE), lease/heartbeat status, cost burndown, approve/ship buttons.
- The Node TUI dashboard becomes a thin client of the same `/fleet` API (parity with web).
- Acceptance: an operator can watch all factories + tail any job log + ship from the browser.
- Verify gate: web e2e (Playwright) covering fleet map render, live log, and a ship action.
11. Lifecycle & gates at scale (feature)
- Canonical stages enforced server-side: `queued → assigned → building → review → testing → shipped` (+ `failed`, `dead_letter`).
- Per-profile default `verify`; per-job override; verify runs at the factory, result reported as an event.
- Human gates: `review-policy` routes to reviewers; multi-reviewer support (P3).
- Dead-letter: after `retry.max` exhausted, job → `dead_letter` with full diagnostics; never silently dropped.
- Backpressure: when no factory can take more, jobs stay `queued` (no thrash); SLA timers visible.
- Acceptance: a perpetually-failing job lands in `dead_letter` after configured retries; a passing one auto-advances to `testing` then waits for human `ship`.
- Verify gate: lifecycle state-machine unit tests (all transitions + illegal-transition rejection).
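One way to express the lifecycle as a table-driven state machine is sketched below. The stage names come from this doc; the exact set of allowed edges (for example `building → queued` on lease expiry, or `review → failed` on a human reject) is an assumption that the real state machine would need to pin down, and `afterFailure` treats `retry.max` as the number of retries allowed after the first attempt.

```typescript
type Stage =
  | "queued"
  | "assigned"
  | "building"
  | "review"
  | "testing"
  | "shipped"
  | "failed"
  | "dead_letter";

// Allowed edges. Anything not listed is an illegal transition.
const TRANSITIONS: Record<Stage, readonly Stage[]> = {
  queued: ["assigned"],
  assigned: ["building", "queued"],         // queued: lease expired before start (assumed)
  building: ["review", "failed", "queued"], // queued: reclaimed after a crash (assumed)
  review: ["testing", "failed"],
  testing: ["shipped", "failed"],
  failed: ["queued", "dead_letter"],
  shipped: [],
  dead_letter: [],
};

// Returns the new stage, or throws so the caller can reject the request.
function transition(from: Stage, to: Stage): Stage {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}

// failedAttempts counts failures so far; retryMax is the manifest's retry.max.
function afterFailure(failedAttempts: number, retryMax: number): Stage {
  return failedAttempts > retryMax ? "dead_letter" : "queued";
}
```

With `retry: { max: 2 }`, the first and second failures requeue the job and the third sends it to `dead_letter`; terminal stages have no outgoing edges, so nothing can leave `shipped` or `dead_letter`.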
12. Security, safety & governance (feature — critical with yolo/dangerous)
- Secret isolation: creds live on each factory (env/keychain), never in the queue, manifest, logs, or Cosmos. Factory advertises presence of a cred capability, not the value.
- Scoped git tokens per factory/repo; least-privilege; rotation documented.
- Push policy: protected branches; agents push to feature branches + open PRs by default; direct-to-main gated by profile/flag.
- Blast-radius guardrail: enforce `allowed-scope` with a pre-flight + post-run diff check; out-of-scope changes block the ship gate.
- Budget kill-switch: exceed `budget` (usd/tokens/wall) → pause worker, alert, require human resume.
- Supply-chain safety: edits to shared `@bytelyst/*` packages require `reviewer` profile + human gate (never auto-ship).
- Audit trail: append-only event log per job (who/what/when/where/cost); immutable.
- Corp network/proxy: honor `NETWORK`/proxy + truststore conventions on factories that need them.
- Kill switch (global): one command/flag halts all claiming fleet-wide (incident response).
- Acceptance: a job attempting an out-of-scope edit is blocked at the gate; a budget overrun pauses and alerts; no secret ever appears in any persisted artifact (scanner test).
- Verify gate: security test suite incl. a secret-leak scanner over logs/meta + scope-enforcement test.
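The post-run scope check can be sketched as a pure function over the list of changed paths. This is a simplified illustration: it supports only exact paths and patterns ending in `/**`, not full glob syntax, and the `warn`/`enforce` modes stand in for the P1 and P2 behaviors described in §6. All names are hypothetical.

```typescript
// Simplified matcher: "dir/**" matches anything under dir/, otherwise exact match.
function matchesScope(path: string, pattern: string): boolean {
  if (pattern.endsWith("/**")) {
    const prefix = pattern.slice(0, -2); // "backend/**" -> "backend/"
    return path.startsWith(prefix);
  }
  return path === pattern;
}

// Returns the changed paths that no allowed pattern covers.
function outOfScope(changed: string[], allowed: string[]): string[] {
  return changed.filter(
    (path) => !allowed.some((pattern) => matchesScope(path, pattern)),
  );
}

interface GateResult {
  pass: boolean;
  violations: string[];
}

// "warn" reports violations but lets the job through; "enforce" blocks the ship gate.
function scopeGate(
  changed: string[],
  allowed: string[],
  mode: "warn" | "enforce",
): GateResult {
  const violations = outOfScope(changed, allowed);
  return { pass: mode === "warn" || violations.length === 0, violations };
}
```

Feeding it the paths from a post-run `git diff --name-only` would be one way to wire it in; keeping the trailing slash in the prefix stops `backend/**` from matching a sibling such as `backendx/`.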
13. Data model (Cosmos containers, P2+)
Each container partitioned sensibly; every doc has productId.
- `fleet_jobs` (pk `/productId`): manifest snapshot, current stage, idempotency-key, tracker-item link.
- `fleet_runs` (pk `/jobId`): one per execution attempt: factory, engine, profile snapshot, timings, cost, exit, verify result.
- `fleet_leases` (pk `/jobId`): holder factory, TTL, renewals; TTL index for auto-expiry.
- `fleet_factories` (pk `/productId`): descriptor, capabilities, health, load, last heartbeat.
- `fleet_profiles` (pk `/productId`): versioned profile snapshots.
- `fleet_events` (pk `/jobId`): append-only audit/event stream (stage changes, logs ptr, cost ticks, decisions).
- Relate to existing tracker `Item` via `tracker-item` (no duplication of planning data).
- Acceptance: repository CRUD + query tests per container; lease TTL expiry verified.
- Verify gate: repository unit/integration tests (memory + Cosmos provider via `DB_PROVIDER`).
14. Phased build roadmap (checklists)
Each phase: Goal → checklist → Exit criteria. Don't start a phase until the prior phase's Exit criteria are green. Tick boxes here as the canonical progress.
Phase 1 — Manifest + profiles + capabilities + tracker adapter (single host)
Goal: richer single-host runner that understands profiles/capabilities and bridges to tracker — no distributed infra yet.
- Extend `agent-queue.sh` frontmatter parsing for all new manifest fields (§5), defaulted + backward-compatible.
- Add `profiles/` directory + profile resolution (persona injection, default verify/caps/scope) (§6).
- Local capability detection + a job/factory capability match check before launch (§8 subset).
- `priority` ordering in the inbox pick (replace pure FIFO with priority-then-age).
- `deps` (DAG) blocking on a single host; `idempotency-key` dedupe on `add`.
- `retry` with backoff into `failed`/requeue; `budget.wall` enforced (extends `timeout`).
- `allowed-scope` guardrail (warn-only this phase) + post-run diff report.
- Tracker adapter `aq from-tracker <ITEM>` + `aq to-tracker` event poster (§10 P1).
- Dashboard shows profile + priority + capability tags + tracker-item link.
- Update `selftest.sh` with: manifest parse fixtures, profile resolution, priority order, dep-block, idempotency, adapter round-trip (mock).
- Update README + this doc's progress table.
- Exit criteria: all boxes ✅; `selftest.sh` green; a tracker task → executed → tracker `done` with SHA comment, fully on one host; no regression to Phase-0 `.md` files.
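The priority-then-age pick with dependency blocking from this checklist might be sketched as follows. The Phase 1 runner itself is bash, so this TypeScript version only illustrates the ordering rule; `QueuedJob`, `nextJob`, and the id-based final tie-break are assumptions.

```typescript
type JobPriority = "critical" | "high" | "medium" | "low";

interface QueuedJob {
  id: string;
  priority: JobPriority;
  enqueuedAt: number; // epoch milliseconds
  deps: string[];
}

const PRIORITY_RANK: Record<JobPriority, number> = {
  critical: 0,
  high: 1,
  medium: 2,
  low: 3,
};

// Picks the next runnable job: skip jobs with unmet deps, then order by
// priority, then by age (oldest first), then by id for determinism.
function nextJob(
  inbox: QueuedJob[],
  satisfied: ReadonlySet<string>,
): QueuedJob | undefined {
  const ready = inbox.filter((job) => job.deps.every((dep) => satisfied.has(dep)));
  ready.sort(
    (a, b) =>
      PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority] ||
      a.enqueuedAt - b.enqueuedAt ||
      a.id.localeCompare(b.id),
  );
  return ready[0];
}
```

Here `satisfied` holds the ids of jobs that have reached the stage `deps` waits for; a `critical` job blocked on an unfinished dependency is skipped in favor of a lower-priority job that is ready.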
Phase 2 — Coordinator as platform-service module + Cosmos + multi-factory leasing
Goal: the service spine; ≥2 real factories executing in parallel via leases.
- Scaffold `fleet`/`orchestrator` module in `platform-service` (types/repository/routes, Zod, ESM, `productId`).
- Cosmos containers (§13) + repository layer (memory + Cosmos providers).
- Claim/lease protocol endpoints + TTL expiry + reclaim (§8, §9).
- Port `agent-queue` runner to a factory agent API client (register/heartbeat/claim/report) while keeping git-queue fallback.
- Scheduler/router core (§7) as a pure module + wired into assignment.
- Tracker adapter calls the module directly (not just file export).
- Auth: factory tokens; scoped; secret isolation enforced (§12 subset).
- Module test suite (repository + routes via `@bytelyst/testing`); crash-recovery + lease-expiry tests.
- Two-factory demo (e.g. mac + ubuntu) running 3 parallel jobs end-to-end.
- Exit criteria: all boxes ✅; `pnpm --filter @lysnrai/platform-service test` green; killing a factory mid-job → another reclaims and completes; all state in Cosmos with `productId`.
Phase 3 — Fleet control plane in tracker-web + DAG + budgets + scoring router
Goal: one browser control plane; smart routing + budgets live.
- `fleet` API client in `tracker-web` (reuse `/api/tracker`-style proxy → `/fleet`).
- Fleet map page (factories, load, health, capabilities) on `@bytelyst/*` components.
- Job table + job detail + DAG view; live log via SSE; approve/ship/reject/requeue actions.
- Cost burndown + budget kill-switch UI; multi-reviewer routing.
- Scoring router with configurable weights + explainability surfaced in UI.
- Preemption of low-priority by critical jobs (checkpoint + requeue).
- TUI dashboard re-pointed at `/fleet` API (parity).
- Web e2e (Playwright): fleet map, live log, ship, budget-pause.
- Exit criteria: all boxes ✅; web `verify` (typecheck+lint+test+e2e) green; an operator runs the whole 3-repo parallel workload from the browser, including a budget pause + resume.
Phase 4 — Message bus + autoscaling + cross-OS capability marketplace
Goal: scale-out and elasticity.
- Introduce broker (NATS/Redis) for push dispatch + backpressure; coordinator publishes, factories subscribe by capability.
- Autoscaling hooks (spin ephemeral factories: cloud VM / container) keyed to queue depth + SLA.
- Capability "marketplace": jobs requiring rare caps (xcode/figma/gpu) routed to the few factories that have them; queueing fairness across products.
- Load + chaos test suite (factory churn, broker outage, thundering herd).
- Exit criteria: all boxes ✅; sustained N×throughput vs Phase 3 under load test; graceful degradation on broker outage (fallback to poll).
Phase 5 — Self-optimizing / learned routing
Goal: the scheduler learns from history to cut time/cost and raise first-pass success.
- Capture outcome features per run (engine, profile, repo, duration, cost, verify pass, human-edit rate).
- Offline eval harness comparing learned vs heuristic routing on historical data.
- Shadow/A-B rollout with guardrails; auto-tune scoring weights.
- Recommendations surfaced ("route NomGap UX jobs to claude on mac-2: 23% faster, 11% cheaper").
- Exit criteria: all boxes ✅; learned router beats heuristic on the eval set without regressing safety gates; A/B shows measurable improvement on a target metric.
15. Cross-cutting feature catalog (quick index)
| Feature | First phase | Section |
|---|---|---|
| Evolved job manifest | P1 | §5 |
| Profiles (persona + capability) | P1 | §6 |
| Capability matching | P1→P2 | §6/§8 |
| Priority + SLA | P1 | §5/§7 |
| DAG dependencies | P1→P3 | §5/§11 |
| Idempotency / dedupe | P1 | §5 |
| Retry + dead-letter | P1→P2 | §11 |
| Budgets + kill-switch | P1(wall)→P3 | §5/§12 |
| Scheduler/router scoring | P2→P3 | §7 |
| Factory registration/heartbeat/lease | P2 | §8 |
| Coordinator (platform-service module) | P2 | §9/§10 |
| Cosmos data model | P2 | §13 |
| Tracker bi-directional sync | P1→P2 | §10 |
| Web control plane + SSE logs | P3 | §10/§17 |
| Security/scope/secret isolation | P1→P2 | §12 |
| Broker + autoscaling | P4 | §14 |
| Learned routing | P5 | §14 |
16. Definition of Done — the "100% accuracy" rubric
A feature/phase is not done until every item below is true (this is the bar for "100% end-to-end"):
- Functionality: acceptance criteria met; happy path + documented edge cases handled.
- Tests: unit + integration written first or alongside, all green; no weakened/deleted tests; coverage targets met (router ≥95% core).
- Verify gate: the phase's named gate command passes locally (and in CI where applicable).
- Idempotency & recovery: re-runs are safe; crash mid-step recovers (lease/idempotency).
- Security review: secret-leak scan clean; scope guardrail honored; least-privilege tokens.
- Observability: events/logs/metrics emitted; failures are diagnosable from the control plane.
- Docs: this roadmap's checkboxes ticked; README/AGENTS updated; manifest/profile docs current.
- Backward-compat: existing `.md`/Phase-0 behavior unbroken (regression check).
- Drift checks: shared-infra templates (`.npmrc`, `docker-prep`) untouched/synced; conventional commits.
- No `console.log`/`print` in service code; `req.log`/`os.Logger` used; ESM `.js` imports.
17. Observability & control plane details
- Live logs via SSE from factory → coordinator → web/TUI (single stream contract).
- Metrics: queue depth, assign latency, run duration, verify pass-rate, cost, factory utilization, fairness.
- Alerting: stall (no log N min), failure spikes, budget breach, factory offline, dead-letter.
- Tracing: a job's full timeline (queued→…→shipped) reconstructable from `fleet_events`.
- Cost burndown per job/product/day with budget overlays.
18. Risks & gaps explicitly tracked (expert call-outs)
- Duplicate execution across transports (git fallback + service): mitigated by `idempotency-key` + lease.
- Crash recovery: lease TTL + reclaim; checkpoint long jobs where engines allow.
- Shared-package conflicts: two jobs editing `@bytelyst/*` simultaneously → lock + reviewer gate.
- Starvation/fairness: per-product + per-factory counters with penalty.
- Cost runaway: hard budgets + global kill switch.
- Tool-version drift / reproducibility: record engine + tool versions per run; pin where possible.
- Windows quirks: path/shell differences in the factory agent; capability-gate Windows-only work.
- Human-review bottleneck: auto-verify as much as possible; batch review UI; reviewer routing.
- Result capture beyond commits: artifacts (coverage, screenshots, build logs) attached to runs.
- Secret sprawl: never in queue/manifest/logs/Cosmos; presence-only capabilities.
- Data retention: event/log retention + archival policy (extend today's `clean`).
- Engine API churn: engines mapped in one place (`build_agent_cmd`); capability matrix versioned.
19. Success metrics
- Throughput: jobs shipped/day; parallel utilization (% of fleet busy).
- Quality: % auto-verified, first-pass success rate, escaped-defect rate, human-edit rate post-agent.
- Speed: mean time queued→shipped; assign latency.
- Cost: $/shipped job; budget-breach rate.
- Reliability: lease-reclaim success, dead-letter rate, factory uptime.
- Fairness: max/min product wait-time ratio.
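The fairness metric above is a simple ratio, and one possible reading is sketched here: take the mean queue wait per product, then divide the largest mean by the smallest. Using the mean (rather than, say, p95) and the handling of empty or zero-wait inputs are assumptions; `WaitSample` and `fairnessRatio` are hypothetical names.

```typescript
interface WaitSample {
  productId: string;
  waitMs: number; // time from queued to assigned
}

// Ratio of the slowest product's mean wait to the fastest product's mean wait.
// 1 means perfectly even; larger values mean some product is waiting longer.
function fairnessRatio(samples: WaitSample[]): number {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const sample of samples) {
    const entry = totals.get(sample.productId) ?? { sum: 0, count: 0 };
    entry.sum += sample.waitMs;
    entry.count += 1;
    totals.set(sample.productId, entry);
  }
  const means = Array.from(totals.values(), (entry) => entry.sum / entry.count);
  if (means.length === 0) return 1; // no data: treat as even
  const max = Math.max(...means);
  const min = Math.min(...means);
  if (max === 0) return 1; // nobody waited
  return min === 0 ? Infinity : max / min;
}
```

A product averaging 2000 ms against another averaging 500 ms gives a ratio of 4, which is the kind of number an alert threshold could be set against.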
20. Open questions
- Copilot headless feasibility as an engine/station (CLI/automation surface?).
- Who owns merge/push authority — agents open PRs only, or auto-merge on green for low-risk profiles?
- Multi-user/tenant: per-user queues + RBAC in the control plane?
- On-call/ownership for the fleet (alerts routing, runbooks)?
- Cloud factory provisioning (Phase 4) — which provider/runtime, cost guardrails?
- Profile authorship/governance — who can create/edit profiles, and review of persona prompts?
This document is the single source of truth for the gigafactory build. Keep the §0 table and per-phase checkboxes updated; a phase ships only when its Exit criteria and the §16 Definition-of-Done rubric are fully green.