Saravanakumar D 257efcb4bc docs(gigafactory): consolidate gigafactory docs into docs/gigafactory/

Move GIGAFACTORY_ROADMAP.md and GIGAFACTORY_SYSTEM_OVERVIEW.md under
agent-queue/docs/gigafactory/ so the scattered top-level docs are easy to
discover. Update the README links, the overview code-map, and all phase
job-spec source-of-truth paths to the new location. Pure docs move; no
behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-05-30 21:01:23 -07:00

Agent Gigafactory — Vision & Implementation Roadmap

One-liner: Evolve today's single-host agent-queue bash runner into a distributed gigafactory — a fleet of heterogeneous machines (Mac/Ubuntu/Windows), each running different coding-agent CLIs (Devin/Codex/Claude/Copilot/…), where a scheduler auto-picks jobs from a shared inbox and routes each .md to the best factory × tool × profile — built service-side on platform-service + tracker-web, with the bash runner surviving as the offline edge agent.

How to use this doc: It is both a PRD and an execution checklist. Every feature is a - [ ] checkbox with acceptance criteria and a verify gate. A phase is "100% done" only when every box is checked, its gate passes, and the phase Definition of Done rubric (§16) is green. Update the progress table (§0) as you go.

0. Progress tracker

Phase	Theme	Status	%	Gate
0	Baseline (today)	✅ shipped	100%	`selftest.sh` green
1	Manifest + profiles + capabilities + tracker adapter (single host)	✅ done	~98%	adapter e2e + selftest
2	Coordinator as platform-service module + Cosmos + multi-factory leasing	✅ done	~98%	fleet e2e + module tests
3	Fleet control plane in tracker-web + DAG deps + budgets + scoring router	✅ done	100%	web e2e + router tests
4	Message bus + autoscaling + cross-OS capability marketplace	☐ not started	0%	load/chaos suite
5	Self-optimizing / learned routing	☐ not started	0%	offline eval + A/B

Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary. Owners per phase: §23 · rollout/rollback: §21 · capacity & SLOs: §22/§19. For the full current-state architecture, diagrams, code map, next steps and known gaps see GIGAFACTORY_SYSTEM_OVERVIEW.md (companion doc).

1. Vision & metaphor

A gigafactory turns raw intent (.md task files / tracker items) into shipped software with minimal human touch. The mental model is a physical factory network:

Term	Meaning
Fleet	The whole network of machines under one control plane.
Factory	One physical/virtual machine (a Mac, an Ubuntu box, a Windows host). Has an OS, installed tools, creds, capacity.
Station	A tool/engine slot inside a factory (a Devin seat, a Codex CLI, a Claude Code session, a Copilot agent).
Worker	A single running agent process executing one job at a station.
Job	A unit of work: a prompt/`.md` + manifest (profile, scope, gates, budget).
Profile	The role doing the work (developer, backend engineer, UX/UI designer, QA, reviewer) = persona prompt + capability requirements.
Capability	A tag a factory advertises and a job requires (`os:mac`, `has:xcode`, `has:figma`, `gpu`, `engine:devin`).
Lease	A time-boxed claim of a job by a worker; expires → job is reclaimable (crash recovery).
Gate	A checkpoint a job must pass: auto-QA `verify`, human review, ship approval.
Artifact	Any captured output: commits/PRs, logs, screenshots, reports, build outputs.

North star: drop work into one inbox (or file a tracker task), and the fleet figures out where (factory), with what (tool/engine), as whom (profile), runs it in parallel, self-heals on crash, gates quality automatically, and surfaces everything in one live control plane — while a human only approves the final ship.

                         ┌──────────────────────── CONTROL PLANE (tracker-web) ────────────────────────┐
                         │  plan/intake · roadmap · Fleet map · live logs · cost · approvals           │
                         └───────────────▲───────────────────────────────────┬─────────────────────────┘
                                         │ REST/SSE                           │
            ┌────────────────────────────┴─────── COORDINATOR (platform-service module) ───────────────┐
            │  queue · scheduler/router · leases · profiles · capabilities · events · budgets (Cosmos)  │
            └───▲───────────────────────▲───────────────────────▲───────────────────────▲───────────────┘
                │ claim/lease/report     │                       │                       │
        ┌───────┴───────┐       ┌────────┴───────┐       ┌────────┴───────┐       ┌───────┴────────┐
        │  FACTORY: mac │       │ FACTORY: ubuntu│       │FACTORY: windows│       │ FACTORY: mac-2 │
        │ devin, claude │       │ codex, claude  │       │ copilot, codex │       │ devin (xcode)  │
        │ [agent-queue] │       │ [agent-queue]  │       │ [agent-queue]  │       │ [agent-queue]  │
        └───────────────┘       └────────────────┘       └────────────────┘       └────────────────┘

2. Current state (Phase 0 baseline — already shipped)

Today's agent-queue.sh + dashboard.mjs (single host, zero-dep bash + Node):

Folder kanban lifecycle: inbox → building → review → testing → shipped (+ failed).
Auto-QA gate: when the agent exits with rc=0 the job moves to review/. If verify is set, it runs in cwd: pass → testing/, fail → failed/. With no verify, the job parks in review/. Manual ship is the human gate.
Per-job frontmatter: engine (devin/claude/codex), cwd, yolo (→ dangerous/auto-approve), lock (per-repo serialization), timeout, verify.
Concurrency: AGENT_QUEUE_MAX (default 3), per-lock serialization so same-repo jobs never collide.
State & logs: .state/<job>.meta heartbeats + logs/<job>.log; git-tracked queue (audit-by-commit).
Interactive dashboard: numbered selectable job list, single-key actions (promote/ship/reject/requeue), live log viewer, run/stop, all shelling out to agent-queue.sh.

Carries forward: the .md-in-inbox UX, frontmatter contract, lifecycle stage names, verify gate, lock/affinity concept, the bash runner itself (becomes the factory agent).

Must change for the fleet: single-host run loop → distributed leasing; file-only state → service + Cosmos; one engine choice → capability/profile routing; local dashboard → shared control plane.

Phase 0 complete — baseline shipped and self-tested. (reference, not a work item)

3. Goals & non-goals

Goals

One intake, many machines: parallel execution across heterogeneous OS/tools.
Automatic routing to the best factory × tool × profile with affinity, fairness, budget, and health awareness.
Self-healing (lease expiry/requeue), quality gates, and full observability.
Reuse the ByteLyst stack (platform-service, Cosmos, @bytelyst/*, tracker-web) — no parallel infra.
Preserve offline/zero-dep edge operation via the bash runner.

Non-goals

Not a CI/CD replacement (it triggers CI; CI still gates merges).
Not a general-purpose workflow engine (scoped to coding-agent execution).
Not a model/inference host (it orchestrates agent CLIs, doesn't serve models).
Not abandoning the simple .md mental model — humans still drop files / file tasks.

4. Core concepts contract (must hold across all phases)

Every job has a stable id, an immutable manifest, and an append-only event log.
Every Cosmos document carries productId (ByteLyst rule).
A job in flight is always covered by exactly one lease; no live lease → reclaimable.
Atomic claim: a job is assigned to exactly one worker via optimistic concurrency (Cosmos _etag/If-Match or a conditional fleet_leases insert keyed by jobId). Concurrent claimers — exactly one wins; losers retry the next candidate.
Fencing token: every lease carries a monotonic leaseEpoch. Every report/commit/ship carries its epoch; the coordinator rejects writes from a stale epoch, so a partitioned or zombie worker cannot corrupt state after its lease was reclaimed.
Coordinator-authoritative time: all lease/TTL/SLA math uses server timestamps, never factory clocks (clock-skew safety).
Lifecycle stages are canonical and shared: queued → assigned → building → review → testing → shipped (+ blocked, failed, dead_letter).
The bash runner and the service speak the same manifest + event vocabulary (one schema, two transports).

Implementation status (2026-05-29): Phase 2 Foundation merged (common-plat PR #28, platform-service/src/modules/fleet/): all 7 fleet_* containers (§13) ✓; repositories + coordinator (claim/lease/fence/heartbeat/reaper) ✓; idempotency + deps + submit-time cycle detection ✓; 50 module tests green.

P0 hardening landed ✓ (2026-05-29, common-plat PR #29): atomic claim is now truly concurrency-safe. Added updateIfMatch to @bytelyst/datastore: the Cosmos provider conditions the replace on _etag via accessCondition {type:'IfMatch'} (412 → conflict) plus a rev compare for the pre-read window; the Memory provider does get→compare→set in one synchronous block (no await between), so concurrent callers cannot interleave. fleet revUpdate* now write conditionally. Proven by Promise.all 2-contender + N-claimer stress tests and concurrent claimNextJob/lease-renew tests (these fail on the old read-check-write and pass now). datastore 48 + fleet 53 green; full workspace build/test clean; no consumer regressed. P2-S3 (factory integration) is now unblocked.
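The atomic-claim and fencing rules in this section can be sketched with a small in-memory model. This is an illustration only: `LeaseTable`, `claim`, and `acceptReport` are hypothetical names, not the `@bytelyst/datastore` or fleet module API, and `now` stands in for the coordinator's own clock.

```typescript
// Hypothetical in-memory stand-in for the fleet_leases container.
interface Lease {
  jobId: string;
  holder: string;     // factory id that won the claim
  leaseEpoch: number; // monotonic fencing token
  expiresAt: number;  // coordinator clock, epoch ms
}

class LeaseTable {
  private leases = new Map<string, Lease>();
  private epochs = new Map<string, number>();

  // Conditional insert keyed by jobId: succeeds only when no live lease exists.
  // The check and the write run in one synchronous block, so callers cannot interleave.
  claim(jobId: string, holder: string, now: number, ttlMs: number): Lease | null {
    const current = this.leases.get(jobId);
    if (current !== undefined && current.expiresAt > now) {
      return null; // another claimer won; the caller retries its next candidate
    }
    const leaseEpoch = (this.epochs.get(jobId) ?? 0) + 1;
    this.epochs.set(jobId, leaseEpoch);
    const lease: Lease = { jobId, holder, leaseEpoch, expiresAt: now + ttlMs };
    this.leases.set(jobId, lease);
    return lease;
  }

  // Fencing: a write is accepted only if it echoes the job's current epoch.
  acceptReport(jobId: string, leaseEpoch: number): boolean {
    return this.epochs.get(jobId) === leaseEpoch;
  }
}
```

In this model a second claimer gets `null` while the first lease is live, and once the lease has expired and been reclaimed, a report that still carries the old epoch is refused.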

5. The evolved Job manifest (feature)

Extend today's frontmatter into a richer, backward-compatible manifest. Old .md files keep working (new fields optional with sane defaults).

---
# --- existing (unchanged) ---
engine: devin            # explicit engine; overrides profile/engine-class
cwd: /abs/path/repo
yolo: true
lock: my-repo
timeout: 45m
verify: pnpm -s test
# --- new ---
profile: backend-engineer        # role: persona + capability requirements
engine-class: agentic-coder      # abstract; scheduler picks a concrete engine if `engine` unset
capabilities: [os:any, node>=20] # hard requirements a factory MUST satisfy
prefers: [factory:mac-2]         # soft routing hints (affinity)
priority: high                   # critical|high|medium|low → SLA + preemption
budget: { usd: 5, tokens: 2M, wall: 4h }   # wall = HARD ceiling (always enforceable). usd/tokens = best-effort
                                           # caps: enforced only where the engine/provider exposes live metering;
                                           # otherwise estimated from provider usage APIs post-hoc + alerted.
deps: [job-123, job-456]         # DAG: don't start until these reach `shipped`/`testing`
idempotency-key: nomgap-ux-2     # dedupe: a second identical submit is a no-op
retry: { max: 2, backoff: 5m, on: [timeout, verify_failed] }
review-policy: manual            # auto|manual|reviewers:[@alice]
artifacts: [coverage, screenshots]   # what to capture beyond commits
tracker-item: ITEM-789           # link back to the originating tracker task
---

Define the manifest schema (Zod in the service; documented YAML for .md).
Backward-compat: a Phase-0 .md (only engine/cwd/yolo) parses with all new fields defaulted. (P1-S1: bash runner; Zod schema still P2. selftest backward-compat case green.)
Capability grammar defined: tokens are key (presence, e.g. has:xcode), key:value (e.g. os:mac, engine:devin), or key<op>version with op ∈ {>=,>,=,<=,<} (e.g. node>=20). os:any is a wildcard that matches every factory. A job matches a factory iff every required token is satisfied by the factory descriptor. (P1-S1: caps_match/detect_capabilities in agent-queue.sh.)
engine-class taxonomy defined as an enum (agentic-coder, chat-coder, review-only) with a documented engine→class map (devin,claude,codex → agentic-coder; copilot → chat-coder). If engine is set it wins; else the scheduler picks any free engine in the class honoring prefers-engine. (P1-S1: resolve_engine; review-only mapping reserved.)
idempotency-key semantics: key + content-hash identical ⇒ no-op (returns existing job). Same key, different content ⇒ rejected with 409 unless the prior job is still queued/blocked (then it is superseded). A re-run/retry of an existing job is not a new submit and never trips dedupe. (P1-S1: add-time dedupe; bash maps "409" → clear error, queued → still in inbox/ ⇒ superseded.)
deps semantics: a dep is satisfied when it reaches shipped (default) or testing if deps-mode: soft. Submit-time cycle detection rejects cyclic graphs; unmet deps put the job in blocked (not queued). Cross-factory deps require the coordinator (P2); single-host deps work in P1. (P1-S2: deps_unmet skip-with-reason in selection + status surfacing; deps_would_cycle on add. Cross-machine deps remain P2.)
Acceptance: a manifest fixture suite parses/validates; invalid manifests fail with precise errors; capability-grammar + dep-cycle + idempotency-conflict cases covered.
Verify gate: schema unit tests (≥ 1 per field incl. defaults + 5 invalid cases + grammar/cycle/409 cases).
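A minimal sketch of the capability grammar above, assuming a factory descriptor that lists plain tags plus a map of tool versions. `FactoryDescriptor`, `tokenSatisfied`, and `jobMatchesFactory` are hypothetical names (the bash runner's `caps_match` is not reproduced here), and versions are assumed to be dotted numerics.

```typescript
interface FactoryDescriptor {
  tags: string[];                   // e.g. "os:mac", "has:xcode", "engine:devin"
  versions: Record<string, string>; // e.g. { node: "20.11.0" }
}

// Compare dotted numeric versions; missing parts count as 0.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

function tokenSatisfied(token: string, factory: FactoryDescriptor): boolean {
  if (token === "os:any") return true; // wildcard matches every factory
  const m = /^([^<>=]+)(>=|<=|>|<|=)(.+)$/.exec(token);
  if (m === null) return factory.tags.includes(token); // key or key:value token
  const have = factory.versions[m[1] ?? ""];
  if (have === undefined) return false; // tool not installed
  const cmp = compareVersions(have, m[3] ?? "");
  switch (m[2]) {
    case ">=": return cmp >= 0;
    case ">": return cmp > 0;
    case "=": return cmp === 0;
    case "<=": return cmp <= 0;
    case "<": return cmp < 0;
    default: return false;
  }
}

// A job matches iff every required token is satisfied by the descriptor.
function jobMatchesFactory(required: string[], factory: FactoryDescriptor): boolean {
  return required.every((token) => tokenSatisfied(token, factory));
}
```

With a descriptor advertising `os:mac` and node `20.11.0`, the requirement `[os:any, node>=20]` matches, while `[has:figma]` or `[node>=22]` does not.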

6. Profiles — persona + capability (feature)

A profile = a versioned file combining a persona (system-prompt overlay), required capabilities, default gates, preferred engine/model, and allowed repo scopes. Stored as profiles/<name>.md (Phase 1) → Cosmos profiles container (Phase 2).

# profiles/backend-engineer.md
---
name: backend-engineer
persona: |
  You are a senior backend engineer. Favor minimal, well-tested changes...  
capabilities: [node>=20, has:pnpm]
default-verify: pnpm -s typecheck && pnpm -s test
engine-class: agentic-coder
prefers-engine: [devin, claude]
allowed-scope: ["backend/**", "packages/**"]   # blast-radius guardrail
review-policy: manual
---

Author starter catalog: developer, backend-engineer, frontend-engineer, ux-designer, ui-designer, qa, reviewer, docs-writer. (P1-S2: profiles/*.md + a reserved planner.)
Persona overlay is prepended to the job body before the agent runs; secrets are never written to logs or the event stream (redaction at the source). (P1-S2: profile_persona prepended to the stripped body file.)
Profile supplies default verify, capabilities, engine-class, allowed-scope when the job omits them. (P1-S2: fm_eff — also prefers-engine + review-policy; job fields always override.)
Profile versioning: changing a profile doesn't mutate in-flight jobs (snapshot at assign time). (P2 — needs Cosmos snapshot at assign time.)
allowed-scope enforced as a guardrail (warn in P1, enforce/deny in P2 via pre-flight diff check). (P1-S2: scope_check post-run WARN-only + scope_warning= in meta; path_in_scope unit-testable.)
Acceptance: a job with profile: backend-engineer and no verify inherits the profile's verify + persona.
Verify gate: profile-resolution unit tests; persona-injection golden test.
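The job > profile > default precedence and the persona overlay can be sketched as follows. The field names (`engineClass`, `allowedScope`) and the functions `resolveManifest` and `buildPrompt` are hypothetical; the runner's `fm_eff` and `profile_persona` are not reproduced here.

```typescript
interface ManifestFields {
  verify?: string;
  capabilities?: string[];
  engineClass?: string;
  allowedScope?: string[];
}

interface Profile {
  name: string;
  persona: string;
  fields: ManifestFields; // defaults the profile supplies
}

interface Resolved {
  verify: string | undefined;
  capabilities: string[];
  engineClass: string | undefined;
  allowedScope: string[];
}

// Job fields always win; the profile fills gaps; global defaults come last.
function resolveManifest(job: ManifestFields, profile: Profile, fallback: ManifestFields): Resolved {
  const p = profile.fields;
  return {
    verify: job.verify ?? p.verify ?? fallback.verify,
    capabilities: job.capabilities ?? p.capabilities ?? fallback.capabilities ?? [],
    engineClass: job.engineClass ?? p.engineClass ?? fallback.engineClass,
    allowedScope: job.allowedScope ?? p.allowedScope ?? fallback.allowedScope ?? [],
  };
}

// The persona overlay is prepended to the job body before the agent runs.
function buildPrompt(profile: Profile, body: string): string {
  return `${profile.persona.trim()}\n\n${body}`;
}
```

A job that names `backend-engineer` and omits `verify` resolves to the profile's verify command; a job that sets its own `verify` keeps it.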

7. The scheduler / router (the heart) (feature)

Given a queued job and the current fleet, choose (factory, station/engine, profile) and issue a lease.

Inputs: job manifest (capabilities, priority, budget, deps, prefers, lock), profile requirements, live factory descriptors (capabilities, load, health, cost class), lock/affinity table, fairness counters.

Algorithm (deterministic, explainable):

Filter factories by hard capability match (job ∪ profile capabilities ⊆ factory capabilities) and by a free station for a compatible engine.
Block if deps unmet or lock already held → leave queued/blocked.
Score each candidate factory: score = w1·capabilityFit + w2·affinity(prefers, repo-stickiness) + w3·(1/load) + w4·costFit(budget) + w5·health − w6·starvationPenalty
Tie-break: highest priority job first; then oldest; then lowest cost class.
Assign atomically → create the lease under an optimistic-concurrency guard (_etag/If-Match or conditional insert keyed by jobId) with a fresh leaseEpoch; on conflict another factory won → retry the next candidate. Set job assigned, decrement station/seat capacity, bump fairness counter. Use coordinator-authoritative timestamps only.
Preemption (P3+): a critical job may pause a low-priority job at a needed station (checkpoint + requeue, bumping the preempted job's leaseEpoch).

Phasing: Phase 2 ships the deterministic filter + atomic-assign core (fixed weights). Phase 3 adds tunable weights, preemption, and the explainability UI. Phase 5 learns the weights (§14).
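The scoring step can be sketched as a pure function over pre-computed inputs. Assumptions in this sketch: every term is already normalized by the caller, the load term uses 1/(1 + load) rather than the literal 1/load so an idle factory does not divide by zero, and ties fall back to factory id order. `Candidate`, `Weights`, `scoreCandidate`, and `pickBest` are hypothetical names.

```typescript
interface Weights {
  capabilityFit: number;
  affinity: number;
  load: number;
  costFit: number;
  health: number;
  starvation: number;
}

interface Candidate {
  factoryId: string;
  capabilityFit: number;
  affinity: number;
  load: number; // running jobs; 0 means idle
  costFit: number;
  health: number;
  starvationPenalty: number;
}

// Pure, no I/O: same inputs always produce the same score.
function scoreCandidate(c: Candidate, w: Weights): number {
  return (
    w.capabilityFit * c.capabilityFit +
    w.affinity * c.affinity +
    w.load * (1 / (1 + c.load)) +
    w.costFit * c.costFit +
    w.health * c.health -
    w.starvation * c.starvationPenalty
  );
}

// Highest score wins; equal scores break on factoryId so the result is deterministic.
function pickBest(candidates: Candidate[], w: Weights): Candidate | undefined {
  const ranked = candidates
    .map((c) => ({ c, s: scoreCandidate(c, w) }))
    .sort((x, y) => {
      if (y.s !== x.s) return y.s - x.s;
      return x.c.factoryId < y.c.factoryId ? -1 : x.c.factoryId > y.c.factoryId ? 1 : 0;
    });
  return ranked[0]?.c;
}
```

Keeping the weights in one object makes it straightforward to start with fixed values and later swap in tuned or learned ones without touching the scoring code.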

Implement a pure, unit-testable scoring function (no I/O) with configurable weights.
Hard-filter correctness: never assign a job to a factory missing a required capability.
Affinity/stickiness: same-repo jobs prefer the factory that has the warm checkout (lock-aware).
Fairness: no factory or product starves under sustained load (counter + penalty).
Explainability: every assignment records why (matched caps, score breakdown) in the event log.
Determinism: same inputs → same decision (seeded tie-breaks) for testability.
Define factory health ∈ [0,1] = f(heartbeat freshness, recent run failure-rate, resource pressure); factories below a health floor are filtered out, not merely down-weighted.
Station/seat capacity: a factory's free stations = min(host slots, per-engine seat limits) (e.g. licensed Devin/Claude seats); the scheduler never over-subscribes a seat-limited engine.
Distributed lock: the Phase-0 local lock becomes a coordinator-held lock so same-lock jobs serialize across the whole fleet (prevents two factories pushing the same repo concurrently).
Acceptance: scenario fixtures (10+) produce expected assignments incl. starvation, capability-miss, seat-exhaustion, unhealthy-factory exclusion, and budget-exceed; a concurrent-claim race test proves exactly one winner.
Verify gate: router unit suite ≥ 95% branch coverage on the scoring/filter core; atomic-claim race test.

8. Factory model & registration (feature)

Each machine runs a factory agent (the evolved agent-queue runner) that registers, heartbeats, claims jobs, and reports events.

Capability auto-detection at boot: OS, installed engines (devin/claude/codex/copilot), tool probes (xcode, figma-cli, docker, gpu), node/pnpm versions, available creds (presence only, never values).
Enrollment / bootstrap trust: first registration authenticates with a one-time enrollment secret (or an operator-issued platform JWT). The factory then receives a scoped, rotatable factory token (jose JWT); decommission = revoke. No standing shared secret in the queue.
Registration: POST /fleet/factories with descriptor → receives a factory id + token.
Heartbeat: periodic PUT /fleet/factories/:id/heartbeat (load, free stations, health). A coordinator lease reaper (not Cosmos TTL) sweeps expiresAt < now and reclaims, bumping leaseEpoch so the dead/zombie worker is fenced; a factory missing N heartbeats is marked offline and all its leases reclaimed.
Claim loop: POST /fleet/leases/claim advertising capabilities/free stations; atomic (exactly one winner, §4); receives a job + lease TTL + leaseEpoch. Use claim backoff / long-poll to bound Cosmos RU under many idle factories (see §22); broker push replaces polling in P4.
Report: stream stage/log/event back (POST /fleet/runs/:id/events), echoing leaseEpoch (stale epoch → 409, worker self-aborts); renew lease while alive.
Environment prep: before verify, the factory ensures deps are installed (cold checkout → pnpm install); prep time counts against budget.wall.
Graceful drain: factory can stop claiming, finish in-flight, deregister.
Acceptance: a factory enrolls, claims a matching job, heartbeats, completes; a killed factory's job is reclaimed by another within the lease TTL and the killed worker's late report is rejected by fencing.
Verify gate: factory-agent integration test against a mock coordinator; crash-recovery + fencing-rejection test.
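The reaper and fencing behaviour described above might look like the sketch below. It models state in memory; `FleetState`, `reapExpired`, and `reportStatus` are hypothetical names, and `now` is assumed to be the coordinator's timestamp, never a factory clock.

```typescript
interface LeaseRow {
  jobId: string;
  factoryId: string;
  leaseEpoch: number;
  expiresAt: number; // coordinator clock, epoch ms
}

interface FleetState {
  leases: Map<string, LeaseRow>;     // live leases keyed by jobId
  currentEpoch: Map<string, number>; // latest epoch issued per job
}

// Sweep leases that expired or whose factory was marked offline.
// Reclaiming bumps the epoch so the previous holder is fenced out.
function reapExpired(state: FleetState, now: number, offlineFactories: ReadonlySet<string>): string[] {
  const reclaimed: string[] = [];
  for (const lease of Array.from(state.leases.values())) {
    if (lease.expiresAt < now || offlineFactories.has(lease.factoryId)) {
      state.leases.delete(lease.jobId);
      state.currentEpoch.set(lease.jobId, lease.leaseEpoch + 1);
      reclaimed.push(lease.jobId);
    }
  }
  return reclaimed;
}

// A report echoing a stale epoch gets 409, and the worker should self-abort.
function reportStatus(state: FleetState, jobId: string, leaseEpoch: number): 200 | 409 {
  return state.currentEpoch.get(jobId) === leaseEpoch ? 200 : 409;
}
```

Because reclaim is an explicit sweep rather than a storage TTL, the epoch bump and the requeue happen in the same step, which is what lets a late report from the killed worker be rejected.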

9. Coordination architecture (decision + path)

Three transports were evaluated. Decision: platform-service-native coordinator is the spine; git-queue stays for the offline edge; broker added only at scale.

Option	Pros	Cons	Verdict
(a) Git-synced queue (evolve folders)	zero infra, audit-by-commit, offline	weak/racy leasing, latency, merge churn	Edge/offline only
(b) Coordinator service (platform-service module)	real leases, fairness, observability, reuses auth/Cosmos/productId	a service to run	Chosen spine (P2)
(c) Message broker (NATS/Redis/SQS)	scale, backpressure, push dispatch	most moving parts/ops	P4 when throughput demands

Document the decision + rationale in-repo (this section is the canonical record).
Define the claim/lease protocol once; both git-queue (poll) and service (API) implement it.
Split-brain / network-partition safety: a partitioned factory may keep running and even git push. idempotency-key dedupes submits but cannot undo side-effects. Mitigation: fencing — the coordinator rejects ship/merge reports from a stale leaseEpoch, and the distributed lock (§7) prevents a reclaimed-job's twin from pushing the same repo. Residual risk (a stale push to a feature branch) is contained by the PR-merge ship gate (§10) and surfaced for human triage.
Offline-degrade: a factory cut off from the coordinator falls back to its local git-queue and reconciles on reconnect; on reconnect it presents its leaseEpoch — if reclaimed, its results are quarantined, not auto-merged.
Poll cost: bound claim-loop RU via long-poll/backoff (§22); migrate to broker push at P4.
Acceptance: the same job manifest runs identically through the bash/git path and the service path; a simulated partition does not double-merge (fencing test).
Verify gate: contract test asserting protocol parity (git vs service) + partition/fencing test.

10. tracker-web / platform-service integration (committed path)

Layering: tracker = WHAT/WHY (plan, intake, prioritize, roadmap, votes) · gigafactory = HOW (execute) · platform-service = shared brain · agent-queue runner = offline edge. Grounded in the real tracker-service model (Item: type bug/feature/task, status open/in_progress/done/closed/wont_fix, priority, labels, assignee, source incl. auto_detected, votes, comments, public roadmap) and the tracker-web /api/tracker/[...path] proxy pattern.

Phase 1 — Adapter (no new infra)

task → job: a tracker Item of type: task (e.g. assignee: @agent or label agent:run) is exported to a job .md (manifest mapped: title/description → body, priority → priority, labels → capabilities/profile hints). (P1-S4: aq from-tracker; labels engine-class:/profile:/priority:/cap: → frontmatter.)
job → tracker: lifecycle events post back as status updates + comments: building → status in_progress + comment "started on factory X"; shipped → done + comment with commit SHAs / PR link / verify results; failed → comment with reason (target: status stays in_progress for human triage). (P1-S4: aq to-tracker PATCHes status + posts a metrics-only comment; one-way echo §24.5; never fatal. Known divergence: the items API has no blocked/failed status, so the P1-S4 adapter maps failures to wont_fix by default, which differs from the in_progress + needs-triage target in the mapping table below; override via AQ_TRACKER_STATUS_FAILED.)
Idempotency: re-running the adapter for the same item doesn't create duplicate jobs (idempotency-key = item id + content hash). (P1-S4: derived idempotency-key: tracker-<id> reuses Slice 1 dedupe; to-tracker is idempotent via tracker_echoed.)
Adapter is a thin script/CLI (aq from-tracker ITEM-789) + optional poller. (P1-S4: from-tracker/to-tracker + opt-in AQ_TRACKER_AUTO auto-echo; a standalone poller is deferred.)
Acceptance: filing a tracker task, marking it agent:run, results in a queued job; on ship, the item flips to done with a SHA comment.
Verify gate: adapter e2e against a tracker-service test instance (or mock); round-trip assertion.
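The label → frontmatter mapping in the adapter can be sketched as below. The `TrackerItem` shape and `itemToJobMd` are hypothetical, and the exact label spelling (for example `cap:os:mac`) is an assumption built on the `engine-class:`/`profile:`/`priority:`/`cap:` prefixes named above.

```typescript
interface TrackerItem {
  id: string;
  title: string;
  description: string;
  labels: string[];
}

const SCALAR_PREFIXES = ["engine-class", "profile", "priority"];

// Render a job .md: mapped labels become frontmatter, title/description become the body.
function itemToJobMd(item: TrackerItem): string {
  const frontmatter: string[] = [];
  const caps: string[] = [];
  for (const label of item.labels) {
    if (label.startsWith("cap:")) {
      caps.push(label.slice("cap:".length));
      continue;
    }
    for (const prefix of SCALAR_PREFIXES) {
      if (label.startsWith(`${prefix}:`)) {
        frontmatter.push(`${prefix}: ${label.slice(prefix.length + 1)}`);
      }
    }
    // Any other label (e.g. agent:run) is a trigger or note, not a manifest field.
  }
  if (caps.length > 0) frontmatter.push(`capabilities: [${caps.join(", ")}]`);
  frontmatter.push(`idempotency-key: tracker-${item.id}`); // same item => same key
  frontmatter.push(`tracker-item: ${item.id}`);
  return ["---", ...frontmatter, "---", "", `# ${item.title}`, "", item.description, ""].join("\n");
}
```

Because the idempotency key is derived only from the item id, re-running the export for the same item yields the same key and relies on the add-time dedupe rather than creating a second job.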

Stage → tracker status mapping (tracker's enum is coarser than the fleet's; keep fine-grained stage in a label + comment so no detail is lost):

Fleet stage	Tracker `status`	Extra
`queued` / `assigned` / `blocked`	`in_progress`	label `fleet:<stage>`
`building` / `review` / `testing`	`in_progress`	label `fleet:<stage>` + progress comment
`shipped`	`done`	comment with SHA(s)/PR link/verify result
`failed` / `dead_letter`	`in_progress` + label `needs-triage`	never auto-`closed`/`wont_fix` (humans decide)
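The mapping table above can be expressed as a small pure function. `toTrackerUpdate` and `TrackerUpdate` are hypothetical names; comments (progress, SHAs, failure reason) are left out of this sketch, which covers only status and labels.

```typescript
type FleetStage =
  | "queued" | "assigned" | "blocked"
  | "building" | "review" | "testing"
  | "shipped" | "failed" | "dead_letter";

type TrackerStatus = "open" | "in_progress" | "done" | "closed" | "wont_fix";

interface TrackerUpdate {
  status: TrackerStatus;
  labels: string[];
}

function toTrackerUpdate(stage: FleetStage): TrackerUpdate {
  switch (stage) {
    case "shipped":
      return { status: "done", labels: [] };
    case "failed":
    case "dead_letter":
      // Never auto-closed or wont_fix: a human decides.
      return { status: "in_progress", labels: ["needs-triage"] };
    default:
      // The fine-grained stage survives as a label so no detail is lost.
      return { status: "in_progress", labels: [`fleet:${stage}`] };
  }
}
```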

Ship semantics (PR flow): shipped = change merged to target branch with CI green (default), OR pr-opened when review-policy defers merge to humans/CI — configurable per profile. This honors the non-goal that CI still gates merges (§3); the agent never bypasses branch protection.

Phase 2 — Native spine

Stand up a fleet (a.k.a. orchestrator) module inside platform-service, sibling to tracker-service: pattern types.ts → repository.ts → routes.ts, ESM, Cosmos, productId, req.log.
Endpoints: jobs CRUD, claim/lease, events/report, factories register/heartbeat, profiles, stats.
Runners (bash + any) become API clients of this module; tracker adapter calls it directly.
Acceptance: a job submitted via the module is claimed by a real factory and shipped, with all state in Cosmos.
Verify gate: module test suite (repository + routes) using the shared @bytelyst/testing inject helpers.

Phase 3 — Unified control plane

Add a Fleet surface to tracker-web reusing auth/Primitives/DataTable/product switcher: fleet map (factories + load/health), job table, job DAG, live log streaming, lease/heartbeat status, cost burndown, approve/ship buttons.
Streaming caveat (correctness): live logs must not use the existing buffering catch-all proxy /api/tracker/[...path]. It calls res.text(), which buffers the whole upstream body before responding, so it would never stream. Use a dedicated Next.js Route Handler returning a ReadableStream (SSE) or a direct SSE/WebSocket to platform-service. Full logs are shipped to blob storage (§17); the endpoint serves stored tail + live append.
The Node TUI dashboard becomes a thin client of the same /fleet API (parity with web). (devops-tools agent-queue/dashboard.mjs + lib/fleet-dash.mjs, AQ_FLEET_DASH=1.)
Acceptance: an operator can watch all factories + tail any job log + ship from the browser.
Verify gate: web e2e (Playwright) covering fleet map render, live log, and a ship action.

11. Lifecycle & gates at scale (feature)

Canonical stages enforced server-side: queued → assigned → building → review → testing → shipped (+ blocked, failed, dead_letter); transitions validated (illegal transition → 409).
Per-profile default verify; per-job override; verify runs at the factory, result reported as an event.
Human gates: review-policy routes to reviewers; multi-reviewer support (P3).
Dead-letter: after retry.max exhausted, job → dead_letter with full diagnostics; never silently dropped. (P1-S3 single-host stand-in: failed/ result=retries_exhausted, WIP branch + full log preserved.)
Backpressure: when no factory can take more, jobs stay queued (no thrash); SLA timers visible.
Ship semantics are profile-configurable (merged+green vs pr-opened, §10); shipped is terminal-success, dead_letter terminal-failure; blocked (unmet deps) is distinct from queued.
Retry vs idempotency: a retry creates a new fleet_runs attempt under the same job/idempotency-key (never a duplicate job); backoff honored; retry.on filters which failure classes retry. (P1-S3 single-host: attempts counter survives requeue; backoff→next_eligible gates selection; on filters timeout/verify_failed/crash.)
Acceptance: a perpetually-failing job lands in dead_letter after configured retries; a passing one auto-advances to testing then waits for human ship; an illegal transition is rejected.
Verify gate: lifecycle state-machine unit tests (all transitions + illegal-transition rejection + retry/dead-letter path).
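A state-machine check for the canonical stages could look like this sketch. The stage names come from this document, but the `ALLOWED` transition table is an assumption (the doc does not enumerate every legal edge), as is the reading that `retry.max` counts retries after the first attempt. `transition` and `stageAfterFailure` are hypothetical names.

```typescript
type Stage =
  | "queued" | "blocked" | "assigned" | "building"
  | "review" | "testing" | "shipped" | "failed" | "dead_letter";

// Assumed transition table; shipped and dead_letter are terminal.
const ALLOWED: Record<Stage, readonly Stage[]> = {
  queued: ["assigned", "blocked"],
  blocked: ["queued"],
  assigned: ["building", "queued"],          // queued again on lease reclaim
  building: ["review", "failed", "queued"],
  review: ["testing", "failed"],
  testing: ["shipped", "failed"],
  failed: ["queued", "dead_letter"],         // retry or give up
  shipped: [],
  dead_letter: [],
};

type TransitionResult =
  | { ok: true; stage: Stage }
  | { ok: false; httpStatus: number; reason: string };

function transition(from: Stage, to: Stage): TransitionResult {
  if (ALLOWED[from].includes(to)) return { ok: true, stage: to };
  return { ok: false, httpStatus: 409, reason: `illegal transition ${from} -> ${to}` };
}

// attemptsSoFar counts the first run; retryMax is the number of extra attempts allowed.
function stageAfterFailure(attemptsSoFar: number, retryMax: number): Stage {
  return attemptsSoFar > retryMax ? "dead_letter" : "queued";
}
```

Under these assumptions, a job with `retry.max: 2` is requeued after its first and second failed attempts and moves to `dead_letter` after the third.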

12. Security, safety & governance (feature — critical with `yolo`/dangerous)

Secret isolation: creds live on each factory (env/keychain), never in the queue, manifest, logs, or Cosmos. Factory advertises presence of a cred capability, not the value.
Scoped git tokens per factory/repo; least-privilege; rotation documented.
Push policy: protected branches; agents push to feature branches + open PRs by default; direct-to-main gated by profile/flag.
Blast-radius guardrail: enforce allowed-scope — pre-flight + post-run diff check; out-of-scope changes block the ship gate.
Budget kill-switch: exceed budget (usd/tokens/wall) → pause worker, alert, require human resume.
Supply-chain safety: edits to shared @bytelyst/* packages require reviewer profile + human gate (never auto-ship).
Audit trail: append-only event log per job (who/what/when/where/cost); immutable.
Corp network/proxy: honor NETWORK/proxy + truststore conventions on factories that need them.
Kill switch (global): one command/flag halts all claiming fleet-wide (incident response).
Acceptance: a job attempting an out-of-scope edit is blocked at the gate; a budget overrun pauses and alerts; no secret ever appears in any persisted artifact (scanner test).
Verify gate: security test suite incl. a secret-leak scanner over logs/meta + scope-enforcement test.
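The blast-radius check can be sketched as a diff filter over `allowed-scope` globs. This sketch supports only `*` (within one path segment) and `**` (across segments), which is an assumed subset of glob syntax; `globToRegExp` and `outOfScope` are hypothetical names, not the runner's `scope_check`/`path_in_scope`.

```typescript
// Translate a simple glob into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  let re = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*") {
      if (glob.charAt(i + 1) === "*") {
        re += ".*"; // ** crosses directory boundaries
        i++;
      } else {
        re += "[^/]*"; // * stays inside one segment
      }
    } else {
      re += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
    }
  }
  return new RegExp(`^${re}$`);
}

// Return the changed paths that no allowed-scope pattern covers.
function outOfScope(changedPaths: string[], allowedScope: string[]): string[] {
  const patterns = allowedScope.map(globToRegExp);
  return changedPaths.filter((path) => !patterns.some((p) => p.test(path)));
}
```

A non-empty result is what would block the ship gate (or, in the warn-only phase, produce a warning) for the paths it lists.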

13. Data model (Cosmos containers, P2+)

Partition keys are listed per container below; every doc has productId.

fleet_jobs (pk /productId) — manifest snapshot + the full instruction body verbatim as markdown (bodyMd), current stage, idempotency-key, tracker-item link, checkpoint pointer (WIP branch/commit). This is the durable source of truth for instructions — a factory holds only a transient materialized copy, so a machine going down loses nothing (§25).
fleet_runs (pk /jobId) — one per execution attempt: factory, engine, profile snapshot, timings, exit, verify result, and execution insights: model, tokensIn/Out (+cached), cost (estimated flag), turns, tool-call counts, filesChanged, linesAdded/Deleted, attempt number (§26).
fleet_leases (pk /jobId) — holder factory, expiresAt, leaseEpoch (fencing), renewals. Reclaim via a coordinator reaper that scans expiresAt < now — Cosmos TTL only garbage-collects stale rows, it cannot trigger reclaim logic. Claim guarded by _etag/If-Match.
fleet_factories (pk /productId) — descriptor, capabilities, health, load, last heartbeat, seat limits.
fleet_profiles (pk /productId) — versioned profile snapshots (immutable per version).
fleet_events (pk /jobId) — append-only audit/event stream (stage changes, log pointers, cost ticks, scheduler decisions).
fleet_artifacts (pk /jobId) — pointers to blob-stored logs + artifacts (coverage, screenshots, build output). Large logs live in @bytelyst/blob, never inline in Cosmos (doc-size + RU limits).
Relate to existing tracker Item via tracker-item (no duplication of planning data).
Optimistic concurrency (_etag) on every job stage transition + lease claim to prevent lost updates / double-assignment. (PR #29: updateIfMatch.)
Indexing/RU: the claim query is hot — index stage, priority, capabilities; avoid cross-partition fan-out; provision RU/s per §22.
Acceptance: repository CRUD + query tests per container; atomic-claim race test (N concurrent claimers → exactly one wins); reaper-reclaim + fencing-rejection test; lease-expiry verified via reaper (not TTL).
Verify gate: repository unit/integration tests (memory + Cosmos provider via DB_PROVIDER).

14. Phased build roadmap (checklists)

Each phase: Goal → checklist → Exit criteria. Don't start a phase until the prior phase's Exit criteria are green. Tick boxes here as the canonical progress.

Phase 1 — Manifest + profiles + capabilities + tracker adapter (single host)

Goal: richer single-host runner that understands profiles/capabilities and bridges to tracker — no distributed infra yet.

Slice progress — P1-S1: manifest parsing (all §5 fields, defaulted + backward-compatible), priority ordering, capability detection+match gate, engine-class resolution, and idempotency-key dedupe are done on the bash runner.

Slice progress — P1-S3 (resilience & insights, single host): crash recovery (recover_orphans + aq recover), git WIP checkpoint/resume (aq/wip/<job>), functional retry policy (backoff + retries_exhausted), and execution insights (parse_usage, per-run metrics in meta, aq insights, status/dash insights) are done — see §11/§25/§26.

Slice progress — P1-S2 (profiles + deps/DAG, single host): the profiles/ catalog + resolution (fm_eff inheritance with job>profile>default precedence, persona injection), the warn-only allowed-scope guardrail (scope_check/path_in_scope), and single-host deps (block-with-reason in selection, status surfacing, submit-time cycle detection) are done — see §5/§6.

Slice progress — P1-S4 (tracker adapter, single host): the task ↔ job round-trip is done (§10) — aq from-tracker materializes a job from a tracker Item (idempotent on tracker-<id>, label→manifest mapping), aq to-tracker echoes status + a metrics-only comment one-way (idempotent via tracker_echoed, never fatal), and opt-in AQ_TRACKER_AUTO auto-echoes on transitions. All HTTP is curl-only through one wrapper (test seam AQ_TRACKER_API_CMD). This closes the Phase-1 §14 tracker-adapter item. Remaining P1 extras: Node-dash surfacing of the new fields. (budget.wall now enforced — see §11 retry/budget line below.)

Extend agent-queue.sh frontmatter parsing for all new manifest fields (§5), defaulted + backward-compatible. (P1-S1)
Add profiles/ directory + profile resolution (persona injection, default verify/caps/scope) (§6). (P1-S2)
Local capability detection + a job/factory capability match check before launch (§8 subset). (P1-S1: detect_capabilities + caps_match; mismatch ⇒ failed/ result=capability_mismatch, agent never launched.)
priority ordering in the inbox pick (replace pure FIFO with priority-then-age). (P1-S1: inbox_sorted; per-lock serialization preserved.)
deps (DAG) blocking on a single host; idempotency-key dedupe on add. (P1-S1 idempotency dedupe + P1-S2 deps blocking/cycle detection.)
retry with backoff into failed/requeue; budget.wall enforced as a hard ceiling alongside timeout. (P1-S3: retry with backoff + retries_exhausted DONE. budget.wall DONE: parsed from budget: { wall: <dur> }, armed as a HARD wall-clock ceiling alongside timeout (whichever fires first binds), expiry → failed result=budget_exceeded, non-retryable by default.)
allowed-scope guardrail (warn-only this phase) + post-run diff report. (P1-S2: scope_check WARN-only + scope_warning=.)
Tracker adapter aq from-tracker <ITEM> + aq to-tracker event poster (§10 P1). (P1-S4: curl-only tracker_api; from-tracker materializes a job (idempotent), to-tracker echoes status+metrics one-way; opt-in AQ_TRACKER_AUTO. A standalone background poller is deferred to P2.)
Dashboard shows profile + priority + capability tags + tracker-item link. (P1-S1: status shows priority/profile/caps/tracker-item; P1-S4: status/insights also show last echoed tracker status; Node dash surfacing pending.)
Update selftest.sh with: manifest parse fixtures, profile resolution, priority order, dep-block, idempotency, adapter round-trip (mock). (P1-S1 manifest/priority/idempotency + P1-S2 profile/persona/scope/dep-block/cycle + P1-S3 resilience/insights + P1-S4 tracker from/to round-trip via stub.)
Update README + this doc's progress table. (P1-S1)
Exit criteria: all boxes ✅; selftest.sh green; a tracker task → executed → tracker done with SHA comment, fully on one host; no regression to Phase-0 .md files.
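
The priority-then-age pick and the single-host dep block from the checklist above can be sketched as follows. The real implementation lives in the bash runner (`inbox_sorted`); this TypeScript version is only an illustration, and the priority labels, the `InboxJob` shape, and `pickNext` are assumptions rather than the runner's actual manifest fields.

```typescript
type Priority = "critical" | "high" | "normal" | "low"; // hypothetical labels

interface InboxJob {
  id: string;
  priority: Priority;
  enqueuedAt: number; // epoch seconds; smaller means older
  deps: string[];     // job ids that must be shipped first
}

const PRIORITY_RANK: Record<Priority, number> = { critical: 0, high: 1, normal: 2, low: 3 };

// Priority first, then age (oldest first), then id as a stable tie-breaker.
function inboxSorted(jobs: InboxJob[]): InboxJob[] {
  return jobs.slice().sort((a, b) => {
    const byPriority = PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority];
    if (byPriority !== 0) return byPriority;
    const byAge = a.enqueuedAt - b.enqueuedAt;
    if (byAge !== 0) return byAge;
    return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
  });
}

interface InboxPick {
  next: InboxJob | undefined;
  blocked: Record<string, string>; // job id -> human-readable block reason
}

// Walk the sorted inbox; jobs with unshipped deps are blocked with a reason.
function pickNext(jobs: InboxJob[], shippedIds: string[]): InboxPick {
  const blocked: Record<string, string> = {};
  let next: InboxJob | undefined;
  for (const job of inboxSorted(jobs)) {
    const missing = job.deps.filter((dep) => shippedIds.indexOf(dep) === -1);
    if (missing.length > 0) blocked[job.id] = "waiting on " + missing.join(", ");
    else if (next === undefined) next = job;
  }
  return { next, blocked };
}

const inbox: InboxJob[] = [
  { id: "a", priority: "normal", enqueuedAt: 10, deps: [] },
  { id: "b", priority: "critical", enqueuedAt: 5, deps: ["x"] },
  { id: "c", priority: "critical", enqueuedAt: 20, deps: [] },
  { id: "d", priority: "low", enqueuedAt: 1, deps: [] },
];
const firstPick = pickNext(inbox, []);
const afterXShipped = pickNext(inbox, ["x"]);
```

With nothing shipped, `b` sorts first but is blocked on `x`, so `c` is picked; once `x` ships, `b` becomes the next job.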

Phase 2 — Coordinator as platform-service module + Cosmos + multi-factory leasing

Goal: the service spine; ≥2 real factories executing in parallel via leases.

Slice progress — P2-S3 (factory-agent integration, single host): the bash runner is now a coordinator factory behind AQ_FLEET — lib/fleet-client.sh (curl-only, sourced) registers via heartbeat, claims jobs into inbox (interleaved with local .md), reports fenced stage transitions with WIP checkpoints, renews/releases leases, and on a stale leaseEpoch (reclaimed) self-aborts + quarantines the local result. Coordinator 5xx/connection errors degrade (finish locally) rather than abandon work. When AQ_FLEET is off the offline git-queue path is byte-for-byte unchanged. The remaining P2 items — scheduler/router core, direct tracker→module calls, factory enrollment + scoped tokens, fleet.* feature flags + shadow/dual-run, and the two-factory parallel demo — are now all landed in common-plat (scheduler.ts, tracker-bridge.ts, enrollment.ts).

Scaffold fleet/orchestrator module in platform-service (types/repository/routes, Zod, ESM, productId). (PR #28)
Cosmos containers (§13) + repository layer (memory + Cosmos providers). (PR #28; fleet_artifacts blob wiring still pending.)
Atomic claim (optimistic concurrency / _etag) + lease reaper + fencing (leaseEpoch) endpoints (§4/§8/§9) — not Cosmos-TTL-driven reclaim. (common-plat PR #28 + #29; truly atomic via updateIfMatch.)
Port agent-queue runner to a factory agent API client (enroll/register/heartbeat/claim/report, fencing-aware) while keeping git-queue fallback. (P2-S3: lib/fleet-client.sh behind AQ_FLEET; registers via heartbeat, claims into inbox, reports fenced stage transitions, renews leases, quarantines on stale-epoch; offline git-queue unchanged when the flag is off.)
Scheduler/router core (§7) as a pure module (fixed weights) + wired into atomic assignment. (common-plat fleet/scheduler.ts pure selectJob/scoreCandidate/selectPreemptionVictim; coordinator.ts claimNextJob ranks candidates via selectJob after the capability hard-filter.)
Tracker adapter calls the module directly (not just file export). (common-plat fleet/tracker-bridge.ts + POST /fleet/tracker/ingest / /fleet/tracker/echo: idempotent ingest of a tracker item → job and one-way status echo, in-module.)
Auth: factory enrollment + scoped rotatable tokens; secret isolation enforced (§12 subset). (common-plat fleet/enrollment.ts: enrollFactory/rotateToken/revokeToken issue a plaintext token once, store it hashed, scope it to {productId, factoryId, capabilities}; enforceFactoryToken gates claim/heartbeat in routes.ts.)
Feature flags (fleet.enabled, fleet.route_via_service) + shadow/dual-run vs P1 before cutover (§21). (agent-queue runner: AQ_FLEET / AQ_FLEET_ROUTE / AQ_FLEET_SHADOW with documented precedence; shadow claim/compare/report is side-effect-free (isolated -shadow factoryId + dryRun, never materializes/ships); fleet-shadow-report summarizes AGREE/DIVERGE/COORD_EMPTY/LOCAL_EMPTY + agreement; 60→68 selftest checks.)
Module test suite (repository + routes via @bytelyst/testing); atomic-claim race, crash-recovery, fencing-rejection, reaper-reclaim tests. (PR #28 + #29: 53 fleet + 48 datastore tests, incl. true-concurrency claim.)
Two-factory demo (e.g. mac + ubuntu) running 3 parallel jobs end-to-end. (agent-queue/demo/two-factory-demo.sh + coordinator-stub.sh: two real run daemons (mac-1 + ubuntu-1, separate queues/cwds) compete through one coordinator; asserts (a) no double-assign, (b) kill-mid-job → reaper reclaim → survivor completes → zombie report fenced (409), (c) concurrent parallelism. Dual-mode: CI-safe stateful stub by default, live platform-service when AQ_FLEET_API/AQ_FLEET_TOKEN set. Headless checks in selftest.sh → 68→71 green.)
Exit criteria: all boxes ✅; pnpm --filter @lysnrai/platform-service test green; killing a factory mid-job → another reclaims and completes, and the dead worker's late report is fenced; concurrent claimers never double-assign; all state in Cosmos with productId; flag-off rollback verified (§21). Status: the two-factory demo demonstrates the runtime guarantees (no double-assign, reclaim + fence, parallelism), and the flag-off rollback is verified (§21). Scheduler/router core, direct tracker-module calls, and factory enrollment + scoped tokens are all wired in (see boxes above), so Phase 2 is effectively complete. Remaining for a hard 100%: validate the Cosmos _etag CAS path under true production contention, and wire live blob-backed fleet_artifacts.
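
The reclaim-and-fence guarantee in the exit criteria can be sketched as a tiny state machine. This is an illustration under stated assumptions, not the coordinator's code: `LeasedJob`, `claimLease`, `reapExpired`, and `applyShipReport` are hypothetical, and bumping `leaseEpoch` at reap time is one possible convention. The 409 for a fenced zombie report matches the two-factory demo described above.

```typescript
interface LeasedJob {
  id: string;
  stage: "queued" | "building" | "shipped";
  factoryId: string | null;
  leaseEpoch: number;
  leaseExpiresAt: number; // coordinator clock, milliseconds
}

interface StageReport {
  jobId: string;
  factoryId: string;
  leaseEpoch: number;
}

// Claim a queued job; the returned epoch is the claimer's fencing token.
function claimLease(job: LeasedJob, factoryId: string, now: number, ttlMs: number): number | null {
  if (job.stage !== "queued") return null;
  job.stage = "building";
  job.factoryId = factoryId;
  job.leaseExpiresAt = now + ttlMs;
  return job.leaseEpoch;
}

// Reaper: requeue jobs whose lease expired and bump the epoch so old holders are fenced.
function reapExpired(jobs: LeasedJob[], now: number): string[] {
  const reclaimed: string[] = [];
  for (const job of jobs) {
    if (job.stage === "building" && job.leaseExpiresAt <= now) {
      job.stage = "queued";
      job.factoryId = null;
      job.leaseEpoch += 1;
      reclaimed.push(job.id);
    }
  }
  return reclaimed;
}

// Accept a ship report only from the current lease holder at the current epoch.
function applyShipReport(job: LeasedJob, report: StageReport): 200 | 409 {
  const isCurrentHolder =
    job.stage === "building" &&
    job.factoryId === report.factoryId &&
    job.leaseEpoch === report.leaseEpoch;
  if (!isCurrentHolder) return 409;
  job.stage = "shipped";
  return 200;
}

const demoJob: LeasedJob = { id: "job-7", stage: "queued", factoryId: null, leaseEpoch: 0, leaseExpiresAt: 0 };
const macEpoch = claimLease(demoJob, "mac-1", 0, 30000);          // mac-1 claims, then goes silent
const reclaimedIds = reapExpired([demoJob], 45000);               // lease expired, job requeued
const ubuntuEpoch = claimLease(demoJob, "ubuntu-1", 45000, 30000); // survivor claims at the new epoch
const zombieStatus = applyShipReport(demoJob, { jobId: "job-7", factoryId: "mac-1", leaseEpoch: macEpoch ?? -1 });
const survivorStatus = applyShipReport(demoJob, { jobId: "job-7", factoryId: "ubuntu-1", leaseEpoch: ubuntuEpoch ?? -1 });
```

The zombie's late report carries the old epoch and is rejected, while the survivor's report at the bumped epoch is accepted.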

Phase 3 — Fleet control plane in tracker-web + DAG + budgets + scoring router

Goal: one browser control plane; smart routing + budgets live.

fleet API client in tracker-web (reuse /api/tracker-style proxy → /fleet). (common-plat dashboards/tracker-web/src/lib/fleet-client.ts: typed client over /api/fleet.)
Fleet map page (factories, load, health, capabilities) on @bytelyst/* components. (common-plat app/dashboard/fleet/page.tsx: health badges, load, capabilities, fleet metrics + alerts.)
Job table + job detail + DAG view; live log via SSE; approve/ship/reject/requeue actions. (common-plat app/dashboard/fleet/jobs/page.tsx + jobs/[id]/page.tsx: stage-filtered table, DAG via getJobDag, SSE event stream, ship/requeue/reject/requestReview.)
Cost burndown + budget kill-switch UI; multi-reviewer routing. (common-plat app/dashboard/fleet/budget/page.tsx burndown + pause/resume; ReviewGateCard multi-reviewer quorum gate via requestReview/submitReview.)
Scoring router with configurable weights + explainability surfaced in UI. (common-plat fleet/scheduler.ts tunable weights + GET /fleet/jobs/:id/explain; ExplainPanel breakdown in job detail.)
Preemption of low-priority by critical jobs (checkpoint + requeue). (common-plat fleet/scheduler.ts selectPreemptionVictim + coordinator eviction under FLEET_PREEMPTION; victim requeued with checkpoint + bumped epoch, preempted event.)
TUI dashboard re-pointed at /fleet API (parity). (devops-tools agent-queue/lib/fleet-dash.mjs adapter + dashboard.mjs fleet mode under AQ_FLEET_DASH=1: board/factories/metrics/alerts, job actions ship/requeue/reject via /fleet, per-job events log; opt-in so local mode is byte-for-byte unchanged. Verified by lib/fleet-dash.test.mjs (22 assertions) wired into selftest.sh + live non-TTY render smoke.)
Web e2e (Playwright): fleet map, live log, ship, budget-pause. (common-plat dashboards/tracker-web/e2e/fleet.spec.ts: fleet overview, metrics, job detail, ship, budget-pause, review-gate specs green.)
Exit criteria: all boxes ✅; web verify (typecheck+lint+test+e2e) green; an operator runs the whole 3-repo parallel workload from the browser, including a budget pause + resume.
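
A scoring router with tunable weights and an explainable breakdown might look roughly like the sketch below. The feature set (priority, age, engine affinity), the one-hour age saturation, and the names `explainScore` and `rankCandidates` are assumptions for illustration; the actual scoring terms in fleet/scheduler.ts are not reproduced here.

```typescript
interface RouterWeights {
  priority: number;
  age: number;
  affinity: number;
}

interface RouteCandidate {
  jobId: string;
  priorityScore: number;  // 0..1, higher means more urgent
  ageMinutes: number;     // time spent queued
  engineAffinity: number; // 0..1, how well the factory's engine suits the job
}

interface ScoreExplanation {
  jobId: string;
  total: number;
  terms: { priority: number; age: number; affinity: number };
}

// Each term is weight x normalized feature; the breakdown is what an explain view would show.
function explainScore(c: RouteCandidate, w: RouterWeights): ScoreExplanation {
  const terms = {
    priority: w.priority * c.priorityScore,
    age: w.age * Math.min(c.ageMinutes / 60, 1), // assumption: age saturates after one hour
    affinity: w.affinity * c.engineAffinity,
  };
  return { jobId: c.jobId, total: terms.priority + terms.age + terms.affinity, terms };
}

// Highest total first; job id breaks ties so the ranking is deterministic.
function rankCandidates(cs: RouteCandidate[], w: RouterWeights): ScoreExplanation[] {
  return cs
    .map((c) => explainScore(c, w))
    .sort((a, b) => b.total - a.total || (a.jobId < b.jobId ? -1 : a.jobId > b.jobId ? 1 : 0));
}

const routeCandidates: RouteCandidate[] = [
  { jobId: "A", priorityScore: 1, ageMinutes: 0, engineAffinity: 0 },
  { jobId: "B", priorityScore: 0.5, ageMinutes: 120, engineAffinity: 1 },
];
const balancedRanking = rankCandidates(routeCandidates, { priority: 10, age: 2, affinity: 4 });
const priorityOnlyRanking = rankCandidates(routeCandidates, { priority: 10, age: 0, affinity: 0 });
```

Keeping the function pure makes the weights easy to tune and the result easy to explain: changing only the weights flips the ranking of the same two candidates.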

Phase 4 — Message bus + autoscaling + cross-OS capability marketplace

Goal: scale-out and elasticity.

Introduce broker (NATS/Redis) for push dispatch + backpressure; coordinator publishes, factories subscribe by capability.
Autoscaling hooks (spin ephemeral factories: cloud VM / container) keyed to queue depth + SLA.
Capability "marketplace": jobs requiring rare caps (xcode/figma/gpu) routed to the few factories that have them; queueing fairness across products.
Load + chaos test suite (factory churn, broker outage, thundering herd).
Exit criteria: all boxes ✅; sustained N×throughput vs Phase 3 under load test; graceful degradation on broker outage (fallback to poll).

Phase 5 — Self-optimizing / learned routing

Goal: the scheduler learns from history to cut time/cost and raise first-pass success.

Capture outcome features per run (engine, profile, repo, duration, cost, verify pass, human-edit rate).
Offline eval harness comparing learned vs heuristic routing on historical data.
Shadow/A-B rollout with guardrails; auto-tune scoring weights.
Recommendations surfaced ("route NomGap UX jobs to claude on mac-2: 23% faster, 11% cheaper").
Exit criteria: all boxes ✅; learned router beats heuristic on the eval set without regressing safety gates; A/B shows measurable improvement on a target metric.

15. Cross-cutting feature catalog (quick index)

| Feature | First phase | Section |
| --- | --- | --- |
| Evolved job manifest | P1 | §5 |
| Profiles (persona + capability) | P1 | §6 |
| Capability matching | P1→P2 | §6/§8 |
| Priority + SLA | P1 | §5/§7 |
| DAG dependencies | P1→P3 | §5/§11 |
| Idempotency / dedupe | P1 | §5 |
| Retry + dead-letter | P1→P2 | §11 |
| Budgets + kill-switch | P1(wall)→P3 | §5/§12 |
| Scheduler/router scoring | P2→P3 | §7 |
| Factory registration/heartbeat/lease | P2 | §8 |
| Coordinator (platform-service module) | P2 | §9/§10 |
| Cosmos data model | P2 | §13 |
| Tracker bi-directional sync | P1→P2 | §10 |
| Web control plane + SSE logs | P3 | §10/§17 |
| Security/scope/secret isolation | P1→P2 | §12 |
| Broker + autoscaling | P4 | §14 |
| Learned routing | P5 | §14 |
| Atomic claim + fencing + distributed lock | P2 | §4/§7/§9 |
| Rollout / rollback / feature flags | P2→ | §21 |
| Capacity planning & RU/cost | P2→ | §22 |
| Ownership & RACI / on-call | all | §23 |
| Work hierarchy & composite delegation (roadmap/epic) | P3 (manual) → P5 (planner) | §24 |
| Durability, crash recovery & work preservation | P1 (orphan/retry/WIP) → P2 (lease/resume) | §25 |
| Execution insights & token accounting | P1 (capture) → P3 (rollup UI) | §26 |

16. Definition of Done — the "100% accuracy" rubric

A feature/phase is not done until every item below is true (this is the bar for "100% end-to-end"):

Functionality: acceptance criteria met; happy path + documented edge cases handled.
Tests: unit + integration written first or alongside, all green; no weakened/deleted tests; coverage targets met (router ≥95% core).
Verify gate: the phase's named gate command passes locally (and in CI where applicable).
Idempotency & recovery: re-runs are safe; crash mid-step recovers (lease/idempotency).
Security review: secret-leak scan clean; scope guardrail honored; least-privilege tokens.
Observability: events/logs/metrics emitted; failures are diagnosable from the control plane.
Docs: this roadmap's checkboxes ticked; README/AGENTS updated; manifest/profile docs current.
Backward-compat: existing .md/Phase-0 behavior unbroken (regression check).
Drift checks: shared-infra templates (.npmrc, docker-prep) untouched/synced; conventional commits.
No console.log/print in service code; req.log/os.Logger used; ESM .js imports.

17. Observability & control plane details

Log transport/storage: factory ships logs to blob (@bytelyst/blob); fleet_events carries pointers + a recent-tail buffer. The control plane serves stored tail + live append (via the streaming route, not the buffering proxy — §10).
Live logs via SSE (single stream contract) from the streaming endpoint to web/TUI.
Metrics: queue depth, blocked count, assign latency, claim-loop RU/s, run duration, verify pass-rate, cost, factory utilization, fairness, reclaim/fencing-rejection counts.
Alerting: stall (no log N min), failure spikes, budget breach, factory offline, dead-letter, claim-race anomalies, RU throttling (Cosmos 429s).
Tracing: a job's full timeline (queued→…→shipped) reconstructable from fleet_events (immutable, ordered).
Cost burndown per job/product/day with budget overlays.
SLOs defined + dashboarded (see §19 targets); error budget tracked per SLO.

18. Risks & gaps explicitly tracked (expert call-outs)

Duplicate execution across transports (git fallback + service) — idempotency-key (submit) + atomic lease (assign) + fencing token (side-effect) + distributed lock (push).
Crash recovery — coordinator lease reaper + fencing (not Cosmos TTL); checkpoint long jobs where engines allow.
Split-brain / partition — fencing rejects stale leaseEpoch writes; reclaimed-job results quarantined, not auto-merged (§9).
Shared-package conflicts — two jobs editing @bytelyst/* simultaneously → fleet-wide lock + reviewer gate.
Starvation/fairness — per-product + per-factory counters with penalty.
Cost runaway — budget.wall hard ceiling everywhere; usd/tokens best-effort (provider metering) + global kill switch.
Cosmos RU throttling (429) — hot claim path; bound via long-poll/backoff + indexing (§13/§22); broker offload at P4.
Clock skew — coordinator-authoritative timestamps for all lease/SLA math (§4).
Tool-version drift / reproducibility — record engine + tool versions per run; pin where possible.
Windows quirks — path/shell differences in the factory agent; capability-gate Windows-only work.
Human-review bottleneck — auto-verify as much as possible; batch review UI; reviewer routing.
Result capture beyond commits — artifacts (coverage, screenshots, build logs) attached to runs.
Secret sprawl — never in queue/manifest/logs/Cosmos; presence-only capabilities.
Data retention — event/log retention + archival policy (extend today's clean).
Engine API churn — engines mapped in one place (build_agent_cmd); capability matrix versioned.

19. Success metrics

Each metric has a provisional SLO target (tune with real data; tracked with an error budget):

| Dimension | Metric | Provisional SLO target |
| --- | --- | --- |
| Throughput | jobs shipped/day; parallel utilization | utilization ≥ 60% under backlog |
| Quality | % auto-verified; first-pass success; escaped-defect; post-agent human-edit rate | first-pass ≥ 70%; escaped-defect < 2% |
| Speed | assign latency; time queued→shipped (excl. human gate) | assign p95 < 5s; queue-wait p95 < 2m at target load |
| Cost | $/shipped job; budget-breach rate | budget-breach < 1% of jobs |
| Reliability | lease-reclaim success; dead-letter rate; factory uptime; double-execution incidents | reclaim success ≥ 99.9%; double-merge = 0; dead-letter < 5% |
| Fairness | max/min product wait-time ratio | ratio < 3× |
| Correctness | atomic-claim violations; fencing rejections functioning | claim violations = 0 |

Targets are starting points; the phase owners (§23) ratify per-phase SLOs before that phase's exit.

20. Open questions

Copilot headless feasibility as an engine/station (CLI/automation surface?).
Who owns merge/push authority — agents open PRs only, or auto-merge on green for low-risk profiles?
Multi-user/tenant: per-user queues + RBAC in the control plane?
On-call/ownership for the fleet (alerts routing, runbooks)?
Cloud factory provisioning (Phase 4) — which provider/runtime, cost guardrails?
Profile authorship/governance — who can create/edit profiles, and review of persona prompts?

21. Rollout, rollback & data migration

Each phase ships behind controls so it can be turned off without losing work.

Feature-flagged rollout: gate each phase's new path behind a platform feature flag (fleet.enabled, fleet.route_via_service, fleet.tracker_sync); default off; enable per-product first.
Dual-run / shadow: P2 coordinator runs in shadow (assign decisions logged, not executed) alongside the P0/P1 path before cutover; compare decisions. (agent-queue AQ_FLEET_SHADOW=1: offline path stays authoritative, coordinator queried in parallel, decisions classified AGREE/DIVERGE/COORD_EMPTY/LOCAL_EMPTY into .state/fleet-shadow.log; strictly side-effect-free — never ships/quarantines/mutates real job state.)
Cutover is reversible: a factory can fall back from service-claim to git-queue via flag; no schema-destructive step on the rollback path. (rollback = AQ_FLEET_ROUTE=0 and/or AQ_FLEET=0 at any time → instant return to the local/offline path; no data migration.)
Data migration: introducing Cosmos containers (P2) is additive — no migration of existing tracker data; backfill is read-only (link tracker-item, don't mutate). Container creation is idempotent (registered in cosmos-init).
Backward-compat gate: every phase re-runs Phase-0 selftest.sh + a corpus of legacy .md files (regression).
Rollback drill: each phase's exit includes a tested rollback (flag off → prior behavior, in-flight jobs drain or requeue cleanly).
Acceptance: flipping fleet.* flags off returns the system to the prior phase's behavior with zero data loss; in-flight jobs either complete or requeue.
Verify gate: rollout/rollback drill documented + a flag-off regression run is green.
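
The shadow comparison described above (AGREE/DIVERGE/COORD_EMPTY/LOCAL_EMPTY plus an agreement figure) can be sketched as a pure classifier. The exact rules used by the runner's fleet-shadow-report are not given here, so this is an assumption-laden illustration: in particular, treating "both sides picked nothing" as AGREE and defining agreement as AGREE / total are choices made for this sketch, and `classifyShadow` / `summarizeShadow` are hypothetical names.

```typescript
type ShadowVerdict = "AGREE" | "DIVERGE" | "COORD_EMPTY" | "LOCAL_EMPTY";

// Compare the local (authoritative) pick with the coordinator's shadow pick.
function classifyShadow(localPick: string | null, coordPick: string | null): ShadowVerdict {
  if (localPick === null && coordPick === null) return "AGREE"; // assumption: both idle counts as agreement
  if (coordPick === null) return "COORD_EMPTY";
  if (localPick === null) return "LOCAL_EMPTY";
  return localPick === coordPick ? "AGREE" : "DIVERGE";
}

interface ShadowSummary {
  counts: Record<ShadowVerdict, number>;
  agreement: number; // fraction of decisions classified AGREE
}

// Pure summary over logged decision pairs; nothing here touches real job state.
function summarizeShadow(pairs: Array<[string | null, string | null]>): ShadowSummary {
  const counts: Record<ShadowVerdict, number> = { AGREE: 0, DIVERGE: 0, COORD_EMPTY: 0, LOCAL_EMPTY: 0 };
  for (const [localPick, coordPick] of pairs) {
    counts[classifyShadow(localPick, coordPick)] += 1;
  }
  const agreement = pairs.length === 0 ? 1 : counts.AGREE / pairs.length;
  return { counts, agreement };
}

const shadowSummary = summarizeShadow([
  ["job-a", "job-a"],
  ["job-b", "job-c"],
  ["job-d", null],
  [null, "job-e"],
]);
```

Because the classifier only reads two job ids, it stays side-effect-free, which is the property the shadow mode depends on.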

22. Capacity planning & cost

Concurrency model: fleet throughput = Σ factory free-stations, bounded by per-engine seat limits (e.g. N Devin seats) — document seat inventory per engine before P2.
Cosmos RU budgeting: the claim/heartbeat paths are the hot loops. Estimate RU/s = (factories × claim-poll rate × query RU) + (factories × heartbeat rate × upsert RU); pick long-poll interval to keep steady-state RU within a provisioned budget; enable autoscale RU with a ceiling + 429 alerting.
Polling vs push: at F factories the poll RU grows linearly — define the F threshold that triggers the P4 broker migration.
Blob storage: logs/artifacts sizing + lifecycle (hot → cool → delete) per retention policy (§18).
Factory sizing: per-OS resource baseline (CPU/RAM/disk for N concurrent agent sessions + warm checkouts); disk pressure as a health input.
Cost guardrails: per-product spend caps + alerts; ties to budget and the global kill-switch.
Acceptance: a documented capacity sheet (seats, RU/s, blob GB, factory specs) sized for the target steady-state + 2× burst.
Verify gate: load test sustains target throughput within the RU/cost budget (no 429 storms).
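
The RU/s estimate above, RU/s = (factories × claim-poll rate × query RU) + (factories × heartbeat rate × upsert RU), can be turned into a small calculator for the capacity sheet. The sketch expresses each rate as 1 / interval; the field names, the helper `maxFactoriesWithinBudget`, and all sample numbers are illustrative assumptions, not measured RU charges.

```typescript
interface RuInputs {
  factories: number;
  claimPollIntervalS: number; // seconds between claim polls per factory
  claimQueryRu: number;       // RU charged per claim query (measure this; assumed here)
  heartbeatIntervalS: number; // seconds between heartbeats per factory
  heartbeatUpsertRu: number;  // RU charged per heartbeat upsert (assumed here)
}

// Steady-state RU/s contributed by a single factory (rate = 1 / interval).
function perFactoryRuPerSec(i: RuInputs): number {
  return i.claimQueryRu / i.claimPollIntervalS + i.heartbeatUpsertRu / i.heartbeatIntervalS;
}

// Fleet-wide steady-state RU/s: grows linearly with the number of factories.
function estimateRuPerSec(i: RuInputs): number {
  return i.factories * perFactoryRuPerSec(i);
}

// Largest fleet size that stays inside a provisioned RU/s budget:
// a candidate for the F threshold that triggers the broker migration.
function maxFactoriesWithinBudget(i: RuInputs, budgetRuPerSec: number): number {
  return Math.floor(budgetRuPerSec / perFactoryRuPerSec(i));
}

const sampleInputs: RuInputs = {
  factories: 20,
  claimPollIntervalS: 4,
  claimQueryRu: 4,
  heartbeatIntervalS: 10,
  heartbeatUpsertRu: 10,
};
const steadyStateRu = estimateRuPerSec(sampleInputs);
const factoryThreshold = maxFactoriesWithinBudget(sampleInputs, 400);
```

With these made-up inputs each factory costs 1 RU/s for claim polling plus 1 RU/s for heartbeats, so 20 factories need 40 RU/s and a 400 RU/s budget supports at most 200 factories before polling alone exhausts it.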

23. Ownership & RACI

Owners are roles, not names — assign before each phase starts (this removes the "undefined owner" gap).

| Area | Responsible (R) | Accountable (A) | Consulted (C) | Informed (I) |
| --- | --- | --- | --- | --- |
| Runner / factory agent (bash) | DevOps eng | Platform lead | - | All |
| Coordinator module (platform-service) | Backend eng | Platform lead | Security | All |
| Scheduler/router | Distributed-systems eng | Platform lead | Backend | All |
| Control plane (tracker-web Fleet) | Frontend eng | Platform lead | UX | All |
| Security/governance | Security eng | Security lead | Platform | All |
| Capacity/cost & SLOs | SRE | Platform lead | Finance | All |
| Profiles & persona governance | Eng leads | Platform lead | - | All |

Each phase names its R/A before kickoff; SLOs (§19) ratified by A.
On-call + runbooks established before the fleet runs unattended yolo workloads (Phase 2+).

24. Work hierarchy & composite delegation (roadmap / epic)

Goal: delegate work at any granularity — a single bug/feature/task, or an entire roadmap — and let the fleet decompose + orchestrate rather than hand a multi-day roadmap to one agent session (which is long-horizon, low first-pass-success, and high blast-radius under yolo).

24.1 Two delegation modes

Atomic (today's model): one leaf item (bug/feature/task) → one job → one agent at one station.
Composite (new): a roadmap/epic → a planner profile expands it into child jobs → the scheduler runs them as a DAG across factories/agents/profiles, honoring deps + phase gates. "Delegate the whole roadmap" = hand it to the orchestrator, which fans out — never one agent grinding for hours.

24.2 Job `kind` — the one genuinely new concept

A new axis, orthogonal to tracker type:

kind: leaf — runs an engine at a station (everything Phase 1–2 already does).
kind: composite — runs the planner/orchestrator that emits child leaf jobs and a dependency graph; it never itself edits a repo.

The scheduler (§7) routes by kind: leaf → station/engine; composite → planner. This keeps execution and planning cleanly separated.

24.3 Hierarchy & relationships

parentId links a child job/item to its roadmap/epic; deps (§5) expresses ordering within it (DAG, submit-time cycle detection).
A roadmap is, mechanically, a named DAG of jobs + a rollup — it reuses deps, profiles (§6), the scheduler (§7), and the lifecycle (§11); the only additions are kind, parentId, and rollup logic.
Add a planner/architect/tech-lead profile (§6 catalog) for decomposition + orchestration; leaf work still uses backend-engineer, ux-designer, etc.
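
Submit-time cycle detection over `deps` can be done with a standard depth-first search. The sketch below is a generic DFS, not the runner's implementation; `findDepCycle` and the plain id-to-deps map are assumptions for illustration.

```typescript
// deps maps a job id to the ids it depends on.
// Returns one cycle as a path (first id repeated at the end), or null if the graph is a DAG.
function findDepCycle(deps: Record<string, string[]>): string[] | null {
  const state: Record<string, 1 | 2> = {}; // absent = unvisited, 1 = on the current path, 2 = fully explored
  const path: string[] = [];

  function visit(id: string): string[] | null {
    if (state[id] === 2) return null;
    if (state[id] === 1) return path.slice(path.indexOf(id)).concat(id);
    state[id] = 1;
    path.push(id);
    for (const dep of deps[id] || []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    path.pop();
    state[id] = 2;
    return null;
  }

  for (const id of Object.keys(deps)) {
    const cycle = visit(id);
    if (cycle) return cycle;
  }
  return null;
}

const acyclicDeps = { build: ["design"], test: ["build"], design: [] };
const cyclicDeps = { a: ["b"], b: ["c"], c: ["a"] };
const noCycle = findDepCycle(acyclicDeps);
const foundCycle = findDepCycle(cyclicDeps);
```

Returning the offending path rather than a boolean lets the submit step reject the job with a message that names the jobs forming the loop.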

24.4 Rollup semantics (composite-level)

Status rollup: roadmap status is derived from children — in_progress once any child starts; shipped/done only when all children reach shipped; surfaces blocked/failed children for triage.
Budget rollup: roadmap budget = Σ child budgets with an explicit ceiling; breaching the ceiling pauses fan-out (ties to §12 kill-switch).
Verify rollup: each leaf runs its own verify; the roadmap's acceptance gate runs after all leaves pass (e.g. an integration/e2e gate).
Phase gates: the roadmap's own phase Exit-criteria become runtime gates — fan-out of phase N+1 is blocked until phase N's children ship; human approval between phases is the default for yolo safety.
Idempotent re-run: re-running a roadmap skips already-shipped children (content-hash dedupe, §5); only unfinished/changed children re-queue.
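
The status and budget rollups above can be sketched as one pure function over the child set. The stage names, the `spent` unit, and the reading of the budget rule as "sum of child spend compared against an explicit ceiling" are assumptions for this sketch; `rollupRoadmap` is a hypothetical name.

```typescript
type ChildStage = "queued" | "building" | "blocked" | "failed" | "shipped";

interface ChildJob {
  id: string;
  stage: ChildStage;
  spent: number; // budget consumed so far, in whatever unit the roadmap budgets in
}

interface RoadmapRollup {
  status: "queued" | "in_progress" | "shipped";
  needsTriage: string[]; // blocked or failed children surfaced for a human
  spent: number;
  fanOutPaused: boolean; // true once total spend breaches the ceiling
}

// Parent state is always derived from children, never edited by hand.
function rollupRoadmap(children: ChildJob[], budgetCeiling: number): RoadmapRollup {
  const allShipped = children.length > 0 && children.every((c) => c.stage === "shipped");
  const anyStarted = children.some(
    (c) => c.stage === "building" || c.stage === "failed" || c.stage === "shipped"
  );
  const spent = children.reduce((sum, c) => sum + c.spent, 0);
  return {
    status: allShipped ? "shipped" : anyStarted ? "in_progress" : "queued",
    needsTriage: children
      .filter((c) => c.stage === "blocked" || c.stage === "failed")
      .map((c) => c.id),
    spent,
    fanOutPaused: spent > budgetCeiling,
  };
}

const midFlight = rollupRoadmap(
  [
    { id: "api", stage: "shipped", spent: 3 },
    { id: "ui", stage: "building", spent: 2 },
    { id: "e2e", stage: "blocked", spent: 0 },
  ],
  4
);
```

In this example the roadmap is in progress, `e2e` is surfaced for triage, and total spend (5) has breached the ceiling (4), so further fan-out would pause.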

24.5 Source-of-truth & sync (no drift)

Composite work obeys the same SoT discipline as the core contract (§4 immutable manifest) and the tracker echo (§10): a roadmap/epic is one record referenced by many, never duplicated.

The roadmap/epic is the SoT for what/why + rollup status; each leaf job/run is the SoT for its execution.
Children reference the parent by parentId; the planner writes the child set once at decomposition (immutable manifest snapshot). Re-planning creates a new revision, it does not mutate in-flight children.
Status flows one way, child → parent → tracker (the §10 echo); humans never hand-edit rollup state.

24.6 Decision — Hybrid (recorded)

Model composite delegation in the fleet layer now; defer the shared-platform enum change until proven.

Now (fleet-owned): add kind (leaf/composite), parentId, and rollup to the fleet_jobs schema (§13). The fleet owns this schema outright — no cross-product risk.
Tracker stays bug/feature/task (the shared ITEM_TYPES used by all 9 products is unchanged). A roadmap is represented by a parent item + label kind:roadmap + parentId on children — zero platform migration, no sign-off needed.
Later (optional, gated on proven value): promote kind:roadmap → a first-class epic tracker type via an additive migration (backfill items where labels contains kind:roadmap into type: epic, keep the label as an alias during transition). Low-risk because the behavior already works fleet-side.
Rationale: avoids a speculative 9-product platform change (UI/filters/stats/tests) before the orchestration model is validated; if the model is wrong, only fleet code is refactored, not a platform enum every product depends on.

24.7 Phasing & gates

P1–P2: leaf-only (no composite); kind defaults to leaf.
P3: composite scheduling + rollup + DAG view in the control plane, with manual decomposition (a human/author defines the child set).
P3→P5: the auto-decomposition planner agent (itself a composite job run by the planner profile) — start manual, automate once trustworthy.
Acceptance: a roadmap with N child jobs fans out across ≥2 factories, respects deps + phase gates, rolls up status/budget correctly, and a re-run skips shipped children; tracker shows the parent moving in_progress → done via the one-way echo.
Verify gate: composite-orchestration tests — DAG expansion, rollup status/budget, phase-gate blocking, idempotent re-run; control-plane e2e for the roadmap DAG view.

25. Durability, crash recovery & work preservation

Goal: a machine power-off, daemon/agent crash, or network partition never loses the job, its instructions, or in-progress work, and never corrupts state. Recovery is automatic and idempotent.

25.1 Instructions are durable (markdown in Cosmos)

The full job instruction body is persisted verbatim as markdown in fleet_jobs.bodyMd (§13), alongside the structured manifest. The originating tracker Item.description also retains the human instruction text; the two are linked by tracker-item, never duplicated as competing truth (§24.5).
A factory only ever holds a transient materialized copy (temp prompt file) fetched from the API — losing the factory loses nothing. On the offline edge, the .md file on disk is the durable copy and reconciles on reconnect (§9).

25.2 Work-in-progress is preserved (checkpointing)

For a git-repo cwd, the worker commits WIP to a dedicated branch aq/wip/<jobId> at start and on every exit path (success, failure, timeout, signal) — partial work is never lost to a crash. Never commits to main/protected branches (§12 push policy). (P1-S3: _wip_start/_wip_checkpoint + EXIT/INT/TERM trap; non-git cwd skipped.)
fleet_jobs.checkpoint records the WIP branch + last commit so any worker can find it. (P2 Cosmos; single-host records wip_branch/wip_base/wip_commit in <job>.meta.)
Long agents checkpoint periodically where the engine supports it; otherwise the start/exit commits bound the loss window. (P1-S3: start + every-exit-path commits bound the loss window.)

25.3 Recovery is automatic, resumable, and fenced

Orphan detection: on coordinator/runner startup (and continuously), a job in building/assigned whose worker is dead (no live lease / dead pid) is an orphan; it is recovered, not stranded. (P1-S3: recover_orphans on run startup + each loop, and agent-queue.sh recover; dead-pid + pidstart reuse guard.)
Resume vs restart: recovery starts a new fleet_runs attempt; if aq/wip/<jobId> exists, the new worker resumes from the checkpoint instead of restarting from zero. (P1-S3: relaunch checks out aq/wip/<job>; attempts incremented.)
Fencing (§4): the reclaimed run gets a higher leaseEpoch; the dead/zombie worker's late commits/ship reports are rejected — no double-execution of visible outcomes. (P2 — distributed leasing; out of single-host scope.)
Retry policy (retry.max/backoff/on): agent rc≠0 / timeout / verify_failed requeue with backoff up to max; on exhaustion → dead_letter (P2) / failed (P1 stand-in) with full diagnostics — never silently dropped. (P1-S3 single-host.)
State integrity: all run state is append-only / optimistic-concurrency guarded (§13); recovery is idempotent (running it twice yields one recovery). (P1-S3 single-host: meta is append-only + re-derivable from folder location; _etag guard is P2.)
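
The retry policy (retry.max/backoff/on) can be sketched as a single decision function. The exponential doubling of the delay, the failure-kind labels, and the `decideRetry` name are assumptions for illustration; this document only specifies backoff up to max, dead-lettering on exhaustion (`failed` being the P1 stand-in), and that budget_exceeded is non-retryable by default.

```typescript
type FailureKind = "agent_error" | "timeout" | "verify_failed" | "budget_exceeded";

interface RetryPolicy {
  max: number;       // maximum number of retries
  backoffS: number;  // base delay in seconds
  on: FailureKind[]; // failure kinds that are retryable
}

type RetryDecision =
  | { action: "requeue"; attempt: number; delayS: number }
  | { action: "dead_letter"; reason: string };

// retriesUsed = how many retries this job has already consumed.
function decideRetry(policy: RetryPolicy, retriesUsed: number, failure: FailureKind): RetryDecision {
  // Non-retryable failures go straight to dead-letter with the failure as the reason.
  if (policy.on.indexOf(failure) === -1) return { action: "dead_letter", reason: failure };
  if (retriesUsed >= policy.max) return { action: "dead_letter", reason: "retries_exhausted" };
  // Assumption: exponential backoff, doubling the base delay per retry already used.
  return {
    action: "requeue",
    attempt: retriesUsed + 1,
    delayS: policy.backoffS * Math.pow(2, retriesUsed),
  };
}

const retryPolicy: RetryPolicy = { max: 2, backoffS: 30, on: ["agent_error", "timeout", "verify_failed"] };
const firstFailure = decideRetry(retryPolicy, 0, "timeout");
const secondFailure = decideRetry(retryPolicy, 1, "verify_failed");
const thirdFailure = decideRetry(retryPolicy, 2, "agent_error");
const budgetFailure = decideRetry(retryPolicy, 0, "budget_exceeded");
```

Every path returns an explicit decision with a reason, so a job is either requeued with a delay or dead-lettered with diagnostics, never silently dropped.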

25.4 Crash taxonomy (all handled)

| Failure | Detection | Recovery |
| --- | --- | --- |
| Agent process crash (`rc≠0`) | exit code | retry policy → requeue or `failed`/`dead_letter` |
| Daemon/runner crash | lease not renewed | reaper reclaims → resume from checkpoint |
| Machine power-off / partition | missed heartbeats + lease expiry | reaper + fencing + WIP resume elsewhere |
| Coordinator restart | state in Cosmos | leases survive; in-flight reconciled on boot |

Acceptance: SIGKILL an agent and power-off a factory mid-run → another worker resumes from the last checkpoint (not from zero) and ships; instructions intact (read back from Cosmos bodyMd); zero duplicate commits/merges; a retry-exhausted job lands in dead_letter/failed with diagnostics.
Verify gate: chaos tests — kill agent, kill runner, simulate partition; assert resume-from-checkpoint, fencing rejection of the stale worker, instruction integrity, and no double-merge.

26. Execution insights & token accounting

Goal: per-job/run visibility into token usage, cost, model, latency, and tool activity — to drive budgets (§5/§12), cost burndown (§17), and learned routing (§14 P5).

Per-run telemetry record (in fleet_runs, streamed as fleet_events): engine, model, tokensIn/Out (+cached), cost USD (estimated:true when not provider-reported), wall + CPU time, turn count, tool-call counts, verify pass/fail, filesChanged, linesAdded/Deleted, attempt number, retries. (P1-S3 single-host: recorded in <job>.meta — duration_s, files_changed/lines_added/lines_deleted, tokens/cost/turns/tool_calls, attempts; CPU time not captured.)
Token source (honest feasibility): capture real usage where the engine/provider exposes it (Claude/Codex/OpenAI usage in responses; Devin session metrics); otherwise estimate from log heuristics and mark estimated — same caveat as budget.usd/tokens (§5). A single parse_usage(engine, log) adapter centralizes per-engine extraction. (P1-S3: parse_usage adapter; generic AQ_USAGE line + Claude/Codex heuristics; Devin/Copilot TODO; usage_estimated flag, never fabricated.)
Aggregation/rollups: per job, roadmap (§24), product, factory, engine, profile, and day. Powers cost burndown (§17) and the learned-routing eval (§14). (P1-S3 partial: aq insights does per-job + per-engine rollup; product/factory/profile/day are P2/P3.)
Surfacing: control-plane panels (tokens, cost, success/first-pass/human-edit rates) + a CLI insights summary at the edge; reuse the platform-service telemetry module where present. (P1-S3 partial: edge CLI aq insights + status/dash insights line done; web control-plane panels are P3.)
Privacy: telemetry carries metrics + pointers only — never prompt content or secrets (redaction §12). (P1-S3: insights/meta record only metrics; no prompt body or secrets added.)
Acceptance: after a run, its fleet_runs carries token/cost/duration/tool/diff metrics (real where metered, flagged estimated otherwise); dashboards show per-engine and per-profile cost + token totals; a budget breach is detectable from telemetry alone.
Verify gate: telemetry unit tests (capture + rollup); a metered-engine run records real tokens; an unmetered run records estimated + flagged; aggregation totals verified.
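
The per-engine rollup and the "budget breach detectable from telemetry alone" acceptance point can be sketched over a list of run records. The `RunMetrics` shape loosely mirrors the fields listed above, but the exact names, `rollupByEngine`, and `breachesBudget` are assumptions for illustration, and the sample numbers are made up.

```typescript
interface RunMetrics {
  jobId: string;
  engine: string;
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
  estimated: boolean; // true when usage was inferred from logs, not provider-reported
}

interface EngineRollup {
  runs: number;
  tokens: number;
  costUsd: number;
  estimatedRuns: number; // how many runs contributed estimated (not metered) numbers
}

// Aggregate token and cost totals per engine, keeping the estimated flag visible.
function rollupByEngine(runs: RunMetrics[]): Record<string, EngineRollup> {
  const out: Record<string, EngineRollup> = {};
  for (const run of runs) {
    const agg = out[run.engine] || { runs: 0, tokens: 0, costUsd: 0, estimatedRuns: 0 };
    agg.runs += 1;
    agg.tokens += run.tokensIn + run.tokensOut;
    agg.costUsd += run.costUsd;
    if (run.estimated) agg.estimatedRuns += 1;
    out[run.engine] = agg;
  }
  return out;
}

// A budget breach is decided from telemetry only: sum the job's run costs.
function breachesBudget(runs: RunMetrics[], jobId: string, budgetUsd: number): boolean {
  const spent = runs.filter((r) => r.jobId === jobId).reduce((sum, r) => sum + r.costUsd, 0);
  return spent > budgetUsd;
}

const sampleRuns: RunMetrics[] = [
  { jobId: "job-1", engine: "claude", tokensIn: 1000, tokensOut: 500, costUsd: 0.5, estimated: false },
  { jobId: "job-1", engine: "claude", tokensIn: 200, tokensOut: 300, costUsd: 0.25, estimated: true },
  { jobId: "job-2", engine: "codex", tokensIn: 100, tokensOut: 100, costUsd: 0.125, estimated: true },
];
const byEngine = rollupByEngine(sampleRuns);
```

Carrying `estimatedRuns` through the rollup keeps metered and estimated usage distinguishable in totals, so estimated figures are never presented as provider-reported ones.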

This document is the single source of truth for the gigafactory build. Keep the §0 table and per-phase checkboxes updated; a phase ships only when its Exit criteria and the §16 Definition-of-Done rubric are fully green.
