Agent Gigafactory — Vision & Implementation Roadmap
One-liner: Evolve today's single-host `agent-queue` bash runner into a distributed gigafactory: a fleet of heterogeneous machines (Mac/Ubuntu/Windows), each running different coding-agent CLIs (Devin/Codex/Claude/Copilot/…), where a scheduler auto-picks jobs from a shared inbox and routes each `.md` to the best factory × tool × profile, built service-side on `platform-service` + `tracker-web`, with the bash runner surviving as the offline edge agent.
How to use this doc: It is both a PRD and an execution checklist. Every feature is a `- [ ]` checkbox with acceptance criteria and a verify gate. A phase is "100% done" only when every box is checked, its gate passes, and the phase Definition of Done rubric (§16) is green. Update the progress table (§0) as you go.
0. Progress tracker
| Phase | Theme | Status | % | Gate |
|---|---|---|---|---|
| 0 | Baseline (today) | ✅ shipped | 100% | selftest.sh green |
| 1 | Manifest + profiles + capabilities + tracker adapter (single host) | ✅ done | ~98% | adapter e2e + selftest |
| 2 | Coordinator as platform-service module + Cosmos + multi-factory leasing | ✅ done | ~98% | fleet e2e + module tests |
| 3 | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ✅ done | 100% | web e2e + router tests |
| 4 | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite |
| 5 | Self-optimizing / learned routing | ☐ not started | 0% | offline eval + A/B |
Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary. Owners per phase: §23 · rollout/rollback: §21 · capacity & SLOs: §22/§19. For the full current-state architecture, diagrams, code map, next steps and known gaps see GIGAFACTORY_SYSTEM_OVERVIEW.md (companion doc).
1. Vision & metaphor
A gigafactory turns raw intent (.md task files / tracker items) into shipped software with minimal human touch. The mental model is a physical factory network:
| Term | Meaning |
|---|---|
| Fleet | The whole network of machines under one control plane. |
| Factory | One physical/virtual machine (a Mac, an Ubuntu box, a Windows host). Has an OS, installed tools, creds, capacity. |
| Station | A tool/engine slot inside a factory (a Devin seat, a Codex CLI, a Claude Code session, a Copilot agent). |
| Worker | A single running agent process executing one job at a station. |
| Job | A unit of work: a prompt/.md + manifest (profile, scope, gates, budget). |
| Profile | The role doing the work (developer, backend engineer, UX/UI designer, QA, reviewer) = persona prompt + capability requirements. |
| Capability | A tag a factory advertises and a job requires (os:mac, has:xcode, has:figma, gpu, engine:devin). |
| Lease | A time-boxed claim of a job by a worker; expires → job is reclaimable (crash recovery). |
| Gate | A checkpoint a job must pass: auto-QA verify, human review, ship approval. |
| Artifact | Any captured output: commits/PRs, logs, screenshots, reports, build outputs. |
North star: drop work into one inbox (or file a tracker task), and the fleet figures out where (factory), with what (tool/engine), as whom (profile), runs it in parallel, self-heals on crash, gates quality automatically, and surfaces everything in one live control plane — while a human only approves the final ship.
┌──────────────────────── CONTROL PLANE (tracker-web) ────────────────────────┐
│ plan/intake · roadmap · Fleet map · live logs · cost · approvals │
└───────────────▲───────────────────────────────────┬─────────────────────────┘
│ REST/SSE │
┌────────────────────────────┴─────── COORDINATOR (platform-service module) ───────────────┐
│ queue · scheduler/router · leases · profiles · capabilities · events · budgets (Cosmos) │
└───▲───────────────────────▲───────────────────────▲───────────────────────▲───────────────┘
│ claim/lease/report │ │ │
┌───────┴───────┐ ┌────────┴───────┐ ┌────────┴───────┐ ┌───────┴────────┐
│ FACTORY: mac │ │ FACTORY: ubuntu│ │FACTORY: windows│ │ FACTORY: mac-2 │
│ devin, claude │ │ codex, claude │ │ copilot, codex │ │ devin (xcode) │
│ [agent-queue] │ │ [agent-queue] │ │ [agent-queue] │ │ [agent-queue] │
└───────────────┘ └────────────────┘ └────────────────┘ └────────────────┘
2. Current state (Phase 0 baseline — already shipped)
Today's agent-queue.sh + dashboard.mjs (single host, zero-dep bash + Node):
- Folder kanban lifecycle: `inbox → building → review → testing → shipped` (+ `failed`).
- Auto-QA gate: agent rc=0 → `review/`; optional `verify:` runs in `cwd` → pass `testing/`, fail `failed/`; no verify → parks in `review/`. Manual `ship` = the human gate.
- Per-job frontmatter: `engine` (devin/claude/codex), `cwd`, `yolo` (→ dangerous/auto-approve), `lock` (per-repo serialization), `timeout`, `verify`.
- Concurrency: `AGENT_QUEUE_MAX` (default 3), per-`lock` serialization so same-repo jobs never collide.
- State & logs: `.state/<job>.meta` heartbeats + `logs/<job>.log`; git-tracked queue (audit-by-commit).
- Interactive dashboard: numbered selectable job list, single-key actions (promote/ship/reject/requeue), live log viewer, run/stop, all shelling out to `agent-queue.sh`.
Carries forward: the .md-in-inbox UX, frontmatter contract, lifecycle stage names, verify gate, lock/affinity concept, the bash runner itself (becomes the factory agent).
Must change for the fleet: single-host run loop → distributed leasing; file-only state → service + Cosmos; one engine choice → capability/profile routing; local dashboard → shared control plane.
- Phase 0 complete — baseline shipped and self-tested. (reference, not a work item)
3. Goals & non-goals
Goals
- One intake, many machines: parallel execution across heterogeneous OS/tools.
- Automatic routing to the best `factory × tool × profile` with affinity, fairness, budget, and health awareness.
- Self-healing (lease expiry/requeue), quality gates, and full observability.
- Reuse the ByteLyst stack (`platform-service`, Cosmos, `@bytelyst/*`, tracker-web): no parallel infra.
- Preserve offline/zero-dep edge operation via the bash runner.
Non-goals
- Not a CI/CD replacement (it triggers CI; CI still gates merges).
- Not a general-purpose workflow engine (scoped to coding-agent execution).
- Not a model/inference host (it orchestrates agent CLIs, doesn't serve models).
- Not abandoning the simple `.md` mental model: humans still drop files / file tasks.
4. Core concepts contract (must hold across all phases)
- Every job has a stable id, an immutable manifest, and an append-only event log.
- Every Cosmos document carries `productId` (ByteLyst rule).
- A job in flight is always covered by exactly one lease; no live lease → reclaimable.
- Atomic claim: a job is assigned to exactly one worker via optimistic concurrency (Cosmos `_etag`/`If-Match` or a conditional `fleet_leases` insert keyed by `jobId`). Among concurrent claimers, exactly one wins; losers retry the next candidate.
- Fencing token: every lease carries a monotonic `leaseEpoch`. Every report/commit/ship carries its epoch; the coordinator rejects writes from a stale epoch, so a partitioned or zombie worker cannot corrupt state after its lease was reclaimed.
- Coordinator-authoritative time: all lease/TTL/SLA math uses server timestamps, never factory clocks (clock-skew safety).
- Lifecycle stages are canonical and shared: `queued → assigned → building → review → testing → shipped` (+ `blocked`, `failed`, `dead_letter`).
- The bash runner and the service speak the same manifest + event vocabulary (one schema, two transports).
Implementation status (2026-05-29): Phase 2 Foundation merged (common-plat PR #28, `platform-service/src/modules/fleet/`): all 7 `fleet_*` containers (§13) ✓; repositories + coordinator (claim/lease/fence/heartbeat/reaper) ✓; idempotency + deps + submit-time cycle detection ✓; 50 module tests green.

✓ P0 hardening landed (2026-05-29, common-plat PR #29): atomic claim is now truly concurrency-safe. Added `updateIfMatch` to `@bytelyst/datastore`: Cosmos conditions the replace on `_etag` via `accessCondition {type:'IfMatch'}` (412 → conflict) plus a rev compare for the pre-read window; the Memory provider does `get` → compare → `set` in one synchronous block (no `await` between), so concurrent callers cannot interleave. fleet rev `Update*` now write conditionally. Proven by `Promise.all` 2-contender + N-claimer stress + concurrent `claimNextJob`/lease-renew tests (these fail on the old read-check-write, pass now). datastore 48 + fleet 53 green; full workspace build/test clean; no consumer regressed. P2-S3 (factory integration) is now unblocked.
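The atomic-claim and fencing rules above can be illustrated with a minimal in-memory sketch. It is not the `platform-service` implementation: the `Coordinator` class, its method names, and the numeric clock are hypothetical, and a plain `Map` stands in for the conditional Cosmos write.

```typescript
// Hypothetical in-memory model of "one live lease per job" plus epoch fencing.
interface Lease {
  jobId: string;
  holder: string;     // factory id
  leaseEpoch: number; // monotonic fencing token
  expiresAt: number;  // coordinator clock (ms), never the factory clock
}

class Coordinator {
  private leases = new Map<string, Lease>();
  private epochs = new Map<string, number>();

  // Stand-in for a conditional insert keyed by jobId: a claim succeeds only
  // when no live lease exists, and every successful claim bumps the epoch.
  claim(jobId: string, factoryId: string, now: number, ttlMs: number): Lease | null {
    const current = this.leases.get(jobId);
    if (current && current.expiresAt > now) return null; // another worker won
    const leaseEpoch = (this.epochs.get(jobId) ?? 0) + 1;
    this.epochs.set(jobId, leaseEpoch);
    const lease: Lease = { jobId, holder: factoryId, leaseEpoch, expiresAt: now + ttlMs };
    this.leases.set(jobId, lease);
    return lease;
  }

  // Reports must echo their epoch; anything but the current epoch is stale.
  acceptReport(jobId: string, leaseEpoch: number): boolean {
    return this.epochs.get(jobId) === leaseEpoch;
  }
}

const coordinator = new Coordinator();
const first = coordinator.claim("job-1", "mac", 0, 1000);          // wins, epoch 1
const loser = coordinator.claim("job-1", "ubuntu", 10, 1000);      // lease still live
const reclaimed = coordinator.claim("job-1", "ubuntu", 2000, 1000); // expired, epoch 2
```

After the reclaim, a late report from the first holder carries epoch 1 and is rejected, which is the property the fencing token exists to provide.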
5. The evolved Job manifest (feature)
Extend today's frontmatter into a richer, backward-compatible manifest. Old .md files keep working (new fields optional with sane defaults).
---
# --- existing (unchanged) ---
engine: devin # explicit engine; overrides profile/engine-class
cwd: /abs/path/repo
yolo: true
lock: my-repo
timeout: 45m
verify: pnpm -s test
# --- new ---
profile: backend-engineer # role: persona + capability requirements
engine-class: agentic-coder # abstract; scheduler picks a concrete engine if `engine` unset
capabilities: [os:any, node>=20] # hard requirements a factory MUST satisfy
prefers: [factory:mac-2] # soft routing hints (affinity)
priority: high # critical|high|medium|low → SLA + preemption
budget: { usd: 5, tokens: 2M, wall: 4h } # wall = HARD ceiling (always enforceable). usd/tokens = best-effort
# caps: enforced only where the engine/provider exposes live metering;
# otherwise estimated from provider usage APIs post-hoc + alerted.
deps: [job-123, job-456] # DAG: don't start until these reach `shipped`/`testing`
idempotency-key: nomgap-ux-2 # dedupe: a second identical submit is a no-op
retry: { max: 2, backoff: 5m, on: [timeout, verify_failed] }
review-policy: manual # auto|manual|reviewers:[@alice]
artifacts: [coverage, screenshots] # what to capture beyond commits
tracker-item: ITEM-789 # link back to the originating tracker task
---
- Define the manifest schema (Zod in the service; documented YAML for `.md`).
- Backward-compat: a Phase-0 `.md` (only `engine`/`cwd`/`yolo`) parses with all new fields defaulted. (P1-S1: bash runner; Zod schema still P2. selftest backward-compat case green.)
- Capability grammar defined: tokens are `key` (presence, e.g. `has:xcode`), `key:value` (e.g. `os:mac`, `engine:devin`), or `key<op>version` with `op ∈ {>=,>,=,<=,<}` (e.g. `node>=20`). `os:any` is a wildcard that matches every factory. A job matches a factory iff every required token is satisfied by the factory descriptor. (P1-S1: `caps_match`/`detect_capabilities` in `agent-queue.sh`.)
- `engine-class` taxonomy defined as an enum (`agentic-coder`, `chat-coder`, `review-only`) with a documented engine→class map (`devin`, `claude`, `codex` → `agentic-coder`; `copilot` → `chat-coder`). If `engine` is set it wins; else the scheduler picks any free engine in the class honoring `prefers-engine`. (P1-S1: `resolve_engine`; `review-only` mapping reserved.)
- `idempotency-key` semantics: `key + content-hash` identical ⇒ no-op (returns existing job). Same `key`, different content ⇒ rejected with 409 unless the prior job is still `queued`/`blocked` (then it is superseded). A re-run/`retry` of an existing job is not a new submit and never trips dedupe. (P1-S1: add-time dedupe; bash maps "409" → clear error, `queued` → still in `inbox/` ⇒ superseded.)
- `deps` semantics: a dep is satisfied when it reaches `shipped` (default) or `testing` if `deps-mode: soft`. Submit-time cycle detection rejects cyclic graphs; unmet deps put the job in `blocked` (not `queued`). Cross-factory deps require the coordinator (P2); single-host deps work in P1. (P1-S2: `deps_unmet` skip-with-reason in selection + `status` surfacing; `deps_would_cycle` on `add`. Cross-machine deps remain P2.)
- Acceptance: a manifest fixture suite parses/validates; invalid manifests fail with precise errors; capability-grammar + dep-cycle + idempotency-conflict cases covered.
- Verify gate: schema unit tests (≥ 1 per field incl. defaults + 5 invalid cases + grammar/cycle/409 cases).
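As an illustration of the capability grammar, here is a minimal matcher sketch. The `FactoryDescriptor` shape (a tag set plus a tool-version map) and all function names are assumptions made for the example; the shipped logic lives in the bash runner's `caps_match`, which this does not reproduce.

```typescript
// Assumed descriptor shape: presence and key:value tokens as a set,
// tool versions (for key<op>version tokens) as a separate map.
interface FactoryDescriptor {
  tags: Set<string>;
  versions: Record<string, string>;
}

// Compare dotted numeric versions; missing segments count as 0 ("20" == "20.0").
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const d = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (d !== 0) return d;
  }
  return 0;
}

function tokenSatisfied(token: string, factory: FactoryDescriptor): boolean {
  if (token === "os:any") return true; // wildcard matches every factory
  const m = token.match(/^([^<>=]+)(>=|<=|>|<|=)(.+)$/);
  if (m) {
    const key = m[1] ?? "";
    const op = m[2] ?? "";
    const want = m[3] ?? "";
    const have = factory.versions[key];
    if (have === undefined) return false; // tool not installed
    const c = compareVersions(have, want);
    const ok: Record<string, boolean> = {
      ">=": c >= 0, ">": c > 0, "=": c === 0, "<=": c <= 0, "<": c < 0,
    };
    return ok[op] === true;
  }
  return factory.tags.has(token); // presence or key:value token
}

// A job matches a factory iff every required token is satisfied.
function capsMatch(required: string[], factory: FactoryDescriptor): boolean {
  return required.every((t) => tokenSatisfied(t, factory));
}

const mac: FactoryDescriptor = {
  tags: new Set(["os:mac", "has:xcode", "engine:devin"]),
  versions: { node: "20.11.0" },
};
const matches = capsMatch(["os:any", "node>=20"], mac);
const missesFigma = capsMatch(["has:figma"], mac);
```

Keeping version comparison separate from tag lookup means a factory only needs to report what is installed; the job side carries all the comparison operators.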
6. Profiles — persona + capability (feature)
A profile = a versioned file combining a persona (system-prompt overlay), required capabilities, default gates, preferred engine/model, and allowed repo scopes. Stored as profiles/<name>.md (Phase 1) → Cosmos profiles container (Phase 2).
# profiles/backend-engineer.md
---
name: backend-engineer
persona: |
  You are a senior backend engineer. Favor minimal, well-tested changes...
capabilities: [node>=20, has:pnpm]
default-verify: pnpm -s typecheck && pnpm -s test
engine-class: agentic-coder
prefers-engine: [devin, claude]
allowed-scope: ["backend/**", "packages/**"] # blast-radius guardrail
review-policy: manual
---
- Author starter catalog: `developer`, `backend-engineer`, `frontend-engineer`, `ux-designer`, `ui-designer`, `qa`, `reviewer`, `docs-writer`. (P1-S2: `profiles/*.md` + a reserved `planner`.)
- Persona overlay is prepended to the job body before the agent runs; secrets are never written to logs or the event stream (redaction at the source). (P1-S2: `profile_persona` prepended to the stripped body file.)
- Profile supplies default `verify`, `capabilities`, `engine-class`, `allowed-scope` when the job omits them. (P1-S2: `fm_eff`, which also covers `prefers-engine` + `review-policy`; job fields always override.)
- Profile versioning: changing a profile doesn't mutate in-flight jobs (snapshot at assign time). (P2: needs Cosmos snapshot at assign time.)
- `allowed-scope` enforced as a guardrail (warn in P1, enforce/deny in P2 via pre-flight diff check). (P1-S2: `scope_check` post-run WARN-only + `scope_warning=` in meta; `path_in_scope` unit-testable.)
- Acceptance: a job with `profile: backend-engineer` and no `verify` inherits the profile's verify + persona.
- Verify gate: profile-resolution unit tests; persona-injection golden test.
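The job > profile > default precedence and the persona prepend can be sketched as below. The `Profile` and `Job` types and the `resolveJob` helper are hypothetical simplifications for illustration, not the `fm_eff` code in the bash runner.

```typescript
// Hypothetical, trimmed-down shapes: only the fields needed for the example.
interface Profile {
  persona: string;
  defaultVerify?: string;
  capabilities?: string[];
}

interface Job {
  body: string;
  verify?: string;
  capabilities?: string[];
}

interface ResolvedJob {
  prompt: string;
  verify: string | undefined; // undefined = no verify, job parks in review
  capabilities: string[];
}

// Job fields always override; the profile fills gaps; a default comes last.
function resolveJob(job: Job, profile: Profile): ResolvedJob {
  return {
    prompt: `${profile.persona}\n\n${job.body}`, // persona overlay is prepended
    verify: job.verify ?? profile.defaultVerify,
    capabilities: job.capabilities ?? profile.capabilities ?? [],
  };
}

const backendEngineer: Profile = {
  persona: "You are a senior backend engineer.",
  defaultVerify: "pnpm -s typecheck && pnpm -s test",
  capabilities: ["node>=20", "has:pnpm"],
};

const inherited = resolveJob({ body: "Fix the flaky retry test." }, backendEngineer);
const overridden = resolveJob(
  { body: "Lint only.", verify: "pnpm -s lint" },
  backendEngineer,
);
```

Using `??` rather than `||` keeps an explicitly empty job value (for example an empty capability list) from being silently replaced by the profile default.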
7. The scheduler / router (the heart) (feature)
Given a queued job and the current fleet, choose (factory, station/engine, profile) and issue a lease.
Inputs: job manifest (capabilities, priority, budget, deps, prefers, lock), profile requirements, live factory descriptors (capabilities, load, health, cost class), lock/affinity table, fairness counters.
Algorithm (deterministic, explainable):
- Filter factories by hard capability match (job ∪ profile capabilities ⊆ factory capabilities) and free station for a compatible engine.
- Block if `deps` unmet or `lock` already held → leave `queued`/`blocked`.
- Score each candidate factory: `score = w1·capabilityFit + w2·affinity(prefers, repo-stickiness) + w3·(1/load) + w4·costFit(budget) + w5·health − w6·starvationPenalty`
- Tie-break: highest priority job first; then oldest; then lowest cost class.
- Assign atomically → create the lease under an optimistic-concurrency guard (`_etag`/`If-Match` or conditional insert keyed by `jobId`) with a fresh `leaseEpoch`; on conflict another factory won → retry the next candidate. Set job `assigned`, decrement station/seat capacity, bump fairness counter. Use coordinator-authoritative timestamps only.
- Preemption (P3+): a `critical` job may pause a `low` job at a needed station (checkpoint + requeue, bumping the preempted job's `leaseEpoch`).
Phasing: Phase 2 ships the deterministic filter + atomic-assign core (fixed weights). Phase 3 adds tunable weights, preemption, and the explainability UI. Phase 5 learns the weights (§14).
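A pure scoring core for the algorithm above might look like the following sketch. All type and function names are hypothetical, and two details are assumptions rather than part of the spec: the load term uses `1/(1+load)` so an idle factory (load 0) does not divide by zero, and ties between factories are broken by id to keep the result deterministic.

```typescript
// Hypothetical signal bundle; every value is assumed to be pre-normalised.
interface Signals {
  capabilityFit: number;
  affinity: number;
  load: number;              // running jobs on the factory, >= 0
  costFit: number;
  health: number;            // in [0, 1]
  starvationPenalty: number;
}

interface Weights { w1: number; w2: number; w3: number; w4: number; w5: number; w6: number }

interface Candidate { id: string; signals: Signals }

// Pure function, no I/O: same inputs always give the same score.
function score(s: Signals, w: Weights): number {
  return (
    w.w1 * s.capabilityFit +
    w.w2 * s.affinity +
    w.w3 * (1 / (1 + s.load)) + // assumption: 1/(1+load) instead of 1/load
    w.w4 * s.costFit +
    w.w5 * s.health -
    w.w6 * s.starvationPenalty
  );
}

// Factories below the health floor are filtered out, not merely down-weighted.
function pickFactory(candidates: Candidate[], w: Weights, healthFloor: number): string | null {
  const ranked = candidates
    .filter((c) => c.signals.health >= healthFloor)
    .map((c) => ({ id: c.id, score: score(c.signals, w) }))
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
  return ranked[0]?.id ?? null;
}

const weights: Weights = { w1: 1, w2: 1, w3: 1, w4: 1, w5: 1, w6: 1 };
const fleet: Candidate[] = [
  { id: "mac", signals: { capabilityFit: 1, affinity: 1, load: 0, costFit: 1, health: 0.9, starvationPenalty: 0 } },
  { id: "ubuntu", signals: { capabilityFit: 1, affinity: 0, load: 3, costFit: 1, health: 1, starvationPenalty: 0 } },
  { id: "windows", signals: { capabilityFit: 1, affinity: 1, load: 0, costFit: 1, health: 0.2, starvationPenalty: 0 } },
];
const chosen = pickFactory(fleet, weights, 0.5);
```

Here `windows` would otherwise score well but is excluded by the health floor, so the idle, repo-affine `mac` factory is chosen over the busier `ubuntu` one.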
- Implement pure, unit-testable scoring function (no I/O) with configurable weights.
- Hard-filter correctness: never assign a job to a factory missing a required capability.
- Affinity/stickiness: same-repo jobs prefer the factory that has the warm checkout (lock-aware).
- Fairness: no factory or product starves under sustained load (counter + penalty).
- Explainability: every assignment records why (matched caps, score breakdown) in the event log.
- Determinism: same inputs → same decision (seeded tie-breaks) for testability.
- Define factory health ∈ [0,1] = f(heartbeat freshness, recent run failure-rate, resource pressure); factories below a health floor are filtered out, not merely down-weighted.
- Station/seat capacity: a factory's free stations = `min(host slots, per-engine seat limits)` (e.g. licensed Devin/Claude seats); the scheduler never over-subscribes a seat-limited engine.
- Distributed lock: the Phase-0 local `lock` becomes a coordinator-held lock so same-`lock` jobs serialize across the whole fleet (prevents two factories pushing the same repo concurrently).
- Acceptance: scenario fixtures (10+) produce expected assignments incl. starvation, capability-miss, seat-exhaustion, unhealthy-factory exclusion, and budget-exceed; a concurrent-claim race test proves exactly one winner.
- Verify gate: router unit suite ≥ 95% branch coverage on the scoring/filter core; atomic-claim race test.
8. Factory model & registration (feature)
Each machine runs a factory agent (the evolved agent-queue runner) that registers, heartbeats, claims jobs, and reports events.
- Capability auto-detection at boot: OS, installed engines (devin/claude/codex/copilot), tool probes (xcode, figma-cli, docker, gpu), node/pnpm versions, available creds (presence only, never values).
- Enrollment / bootstrap trust: first registration authenticates with a one-time enrollment secret (or an operator-issued platform JWT). The factory then receives a scoped, rotatable factory token (`jose` JWT); decommission = revoke. No standing shared secret in the queue.
- Registration: `POST /fleet/factories` with descriptor → receives a factory id + token.
- Heartbeat: periodic `PUT /fleet/factories/:id/heartbeat` (load, free stations, health). A coordinator lease reaper (not Cosmos TTL) sweeps `expiresAt < now` and reclaims, bumping `leaseEpoch` so the dead/zombie worker is fenced; a factory missing N heartbeats is marked `offline` and all its leases reclaimed.
- Claim loop: `POST /fleet/leases/claim` advertising capabilities/free stations; atomic (exactly one winner, §4); receives a job + lease TTL + `leaseEpoch`. Use claim backoff / long-poll to bound Cosmos RU under many idle factories (see §22); broker push replaces polling in P4.
- Report: stream stage/log/event back (`POST /fleet/runs/:id/events`), echoing `leaseEpoch` (stale epoch → 409, worker self-aborts); renew lease while alive.
- Environment prep: before `verify`, the factory ensures deps are installed (cold checkout → `pnpm install`); prep time counts against `budget.wall`.
- Graceful drain: factory can stop claiming, finish in-flight, deregister.
- Acceptance: a factory enrolls, claims a matching job, heartbeats, completes; a killed factory's job is reclaimed by another within the lease TTL and the killed worker's late report is rejected by fencing.
- Verify gate: factory-agent integration test against a mock coordinator; crash-recovery + fencing-rejection test.
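The reaper sweep and the missed-heartbeat rule described above reduce to two small pure functions. This is a sketch under assumptions: the `Lease` shape, the function names, and the idea of passing `now` in explicitly (so the coordinator clock is the only clock) are illustrative, not taken from the module.

```typescript
interface Lease {
  jobId: string;
  factoryId: string;
  leaseEpoch: number;
  expiresAt: number; // coordinator timestamp, ms
}

interface Reclaim {
  jobId: string;
  nextEpoch: number; // epoch the next holder will receive; fences the old worker
}

// Sweep: every lease with expiresAt < now is reclaimed and its epoch bumped.
function reapExpired(leases: Lease[], now: number): { live: Lease[]; reclaimed: Reclaim[] } {
  const live: Lease[] = [];
  const reclaimed: Reclaim[] = [];
  for (const lease of leases) {
    if (lease.expiresAt < now) {
      reclaimed.push({ jobId: lease.jobId, nextEpoch: lease.leaseEpoch + 1 });
    } else {
      live.push(lease);
    }
  }
  return { live, reclaimed };
}

// A factory is offline once more than `missedLimit` heartbeat intervals have passed.
function isOffline(lastHeartbeatAt: number, now: number, intervalMs: number, missedLimit: number): boolean {
  return now - lastHeartbeatAt > intervalMs * missedLimit;
}

const sweep = reapExpired(
  [
    { jobId: "job-1", factoryId: "mac", leaseEpoch: 3, expiresAt: 1000 },
    { jobId: "job-2", factoryId: "ubuntu", leaseEpoch: 1, expiresAt: 5000 },
  ],
  2000,
);
```

Because both functions take `now` as an argument, they can be unit-tested without timers, and the caller decides which clock is authoritative.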
9. Coordination architecture (decision + path)
Three transports were evaluated. Decision: platform-service-native coordinator is the spine; git-queue stays for the offline edge; broker added only at scale.
| Option | Pros | Cons | Verdict |
|---|---|---|---|
| (a) Git-synced queue (evolve folders) | zero infra, audit-by-commit, offline | weak/racey leasing, latency, merge churn | Edge/offline only |
| (b) Coordinator service (platform-service module) | real leases, fairness, observability, reuses auth/Cosmos/productId | a service to run | Chosen spine (P2) |
| (c) Message broker (NATS/Redis/SQS) | scale, backpressure, push dispatch | most moving parts/ops | P4 when throughput demands |
- Document the decision + rationale in-repo (this section is the canonical record).
- Define the claim/lease protocol once; both git-queue (poll) and service (API) implement it.
- Split-brain / network-partition safety: a partitioned factory may keep running and even `git push`. `idempotency-key` dedupes submits but cannot undo side-effects. Mitigation: fencing. The coordinator rejects `ship`/merge reports from a stale `leaseEpoch`, and the distributed `lock` (§7) prevents a reclaimed job's twin from pushing the same repo. Residual risk (a stale push to a feature branch) is contained by the PR-merge ship gate (§10) and surfaced for human triage.
- Offline-degrade: a factory cut off from the coordinator falls back to its local git-queue and reconciles on reconnect; on reconnect it presents its `leaseEpoch`, and if the lease was reclaimed, its results are quarantined, not auto-merged.
- Poll cost: bound claim-loop RU via long-poll/backoff (§22); migrate to broker push at P4.
- Acceptance: the same job manifest runs identically through the bash/git path and the service path; a simulated partition does not double-merge (fencing test).
- Verify gate: contract test asserting protocol parity (git vs service) + partition/fencing test.
10. tracker-web / platform-service integration (committed path)
Layering: tracker = WHAT/WHY (plan, intake, prioritize, roadmap, votes) · gigafactory = HOW (execute) · platform-service = shared brain · agent-queue runner = offline edge. Grounded in the real tracker-service model (Item: type bug/feature/task, status open/in_progress/done/closed/wont_fix, priority, labels, assignee, source incl. auto_detected, votes, comments, public roadmap) and the tracker-web /api/tracker/[...path] proxy pattern.
Phase 1 — Adapter (no new infra)
- task → job: a tracker `Item` of `type: task` (e.g. `assignee: @agent` or label `agent:run`) is exported to a job `.md` (manifest mapped: title/description → body, priority → priority, labels → capabilities/profile hints). (P1-S4: `aq from-tracker`; labels `engine-class:`/`profile:`/`priority:`/`cap:` → frontmatter.)
- job → tracker: lifecycle events post back as status updates + comments: `building` → status `in_progress` + comment "started on factory X"; `shipped` → `done` + comment with commit SHAs / PR link / verify results; `failed` → comment with reason (status stays `in_progress` for human triage). (P1-S4: `aq to-tracker` PATCHes status + posts a metrics-only comment; one-way echo §24.5; never fatal. The items API has no blocked/failed status, so failures map to `wont_fix` by default; override via `AQ_TRACKER_STATUS_FAILED`.)
- Idempotency: re-running the adapter for the same item doesn't create duplicate jobs (idempotency-key = item id + content hash). (P1-S4: derived `idempotency-key: tracker-<id>` reuses Slice 1 dedupe; `to-tracker` is idempotent via `tracker_echoed`.)
- Adapter is a thin script/CLI (`aq from-tracker ITEM-789`) + optional poller. (P1-S4: `from-tracker`/`to-tracker` + opt-in `AQ_TRACKER_AUTO` auto-echo; a standalone poller is deferred.)
- Acceptance: filing a tracker task, marking it `agent:run`, results in a queued job; on ship, the item flips to `done` with a SHA comment.
- Verify gate: adapter e2e against a tracker-service test instance (or mock); round-trip assertion.
Stage → tracker status mapping (tracker's enum is coarser than the fleet's; keep fine-grained stage in a label + comment so no detail is lost):
| Fleet stage | Tracker status | Extra |
|---|---|---|
| `queued` / `assigned` / `blocked` | `in_progress` | label `fleet:<stage>` |
| `building` / `review` / `testing` | `in_progress` | label `fleet:<stage>` + progress comment |
| `shipped` | `done` | comment with SHA(s)/PR link/verify result |
| `failed` / `dead_letter` | `in_progress` + label `needs-triage` | never auto-closed/`wont_fix` (humans decide) |
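The mapping in the table can be expressed as one small total function. The type and function names below are hypothetical; the status and label values come from the table and the tracker-service enum described earlier in this section.

```typescript
type FleetStage =
  | "queued" | "assigned" | "blocked"
  | "building" | "review" | "testing"
  | "shipped" | "failed" | "dead_letter";

type TrackerStatus = "open" | "in_progress" | "done" | "closed" | "wont_fix";

interface TrackerUpdate {
  status: TrackerStatus;
  labels: string[];
}

// Coarse tracker status plus a label that preserves the fine-grained stage.
function toTrackerUpdate(stage: FleetStage): TrackerUpdate {
  switch (stage) {
    case "shipped":
      return { status: "done", labels: [] };
    case "failed":
    case "dead_letter":
      // Never auto-closed: a human decides what happens next.
      return { status: "in_progress", labels: ["needs-triage"] };
    default:
      return { status: "in_progress", labels: [`fleet:${stage}`] };
  }
}

const building = toTrackerUpdate("building");
const deadLetter = toTrackerUpdate("dead_letter");
```

Typing the input as a closed union means a new fleet stage added later will flow through the `default` branch instead of silently producing an invalid tracker status.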
Ship semantics (PR flow): shipped = change merged to target branch with CI green (default), OR pr-opened when review-policy defers merge to humans/CI — configurable per profile. This honors the non-goal that CI still gates merges (§3); the agent never bypasses branch protection.
Phase 2 — Native spine
- Stand up a `fleet` (a.k.a. `orchestrator`) module inside platform-service, sibling to `tracker-service`: pattern `types.ts → repository.ts → routes.ts`, ESM, Cosmos, `productId`, `req.log`.
- Endpoints: jobs CRUD, claim/lease, events/report, factories register/heartbeat, profiles, stats.
- Runners (bash + any) become API clients of this module; tracker adapter calls it directly.
- Acceptance: a job submitted via the module is claimed by a real factory and shipped, with all state in Cosmos.
- Verify gate: module test suite (repository + routes) using the shared `@bytelyst/testing` inject helpers.
Phase 3 — Unified control plane
- Add a Fleet surface to `tracker-web` reusing auth/Primitives/DataTable/product switcher: fleet map (factories + load/health), job table, job DAG, live log streaming, lease/heartbeat status, cost burndown, approve/ship buttons.
- Streaming caveat (correctness): live logs must not use the existing buffering catch-all proxy `/api/tracker/[...path]`: it does `res.text()` and would never stream. Use a dedicated Next.js Route Handler returning a `ReadableStream` (SSE) or a direct SSE/WebSocket to platform-service. Full logs are shipped to blob storage (§17); the endpoint serves stored tail + live append.
- The Node TUI dashboard becomes a thin client of the same `/fleet` API (parity with web). (devops-tools `agent-queue/dashboard.mjs` + `lib/fleet-dash.mjs`, `AQ_FLEET_DASH=1`.)
- Acceptance: an operator can watch all factories + tail any job log + ship from the browser.
- Verify gate: web e2e (Playwright) covering fleet map render, live log, and a ship action.
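To make the streaming caveat concrete, here is a minimal sketch of an SSE response built on the standard `ReadableStream`, `TextEncoder`, and `Response` globals (available in Node 18+ and in Next.js Route Handlers). The `sseFrame` and `streamLog` names and the `AsyncIterable<string>` log source are assumptions; in a real Route Handler a `GET` export would call something like `streamLog` with the job's log tail.

```typescript
// Format one log line as a Server-Sent Events frame.
// Multi-line payloads need one "data:" field per line, then a blank line.
function sseFrame(line: string): string {
  return line.split("\n").map((l) => `data: ${l}`).join("\n") + "\n\n";
}

// Hypothetical helper: wraps an async source of log lines in a streaming Response.
// Each line is flushed as soon as it arrives instead of being buffered.
function streamLog(lines: AsyncIterable<string>): Response {
  const encoder = new TextEncoder();
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const line of lines) {
        controller.enqueue(encoder.encode(sseFrame(line)));
      }
      controller.close();
    },
  });
  return new Response(body, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
}

// Example source: a finite async generator standing in for a live log tail.
async function* demoLog(): AsyncGenerator<string> {
  yield "building: pnpm install";
  yield "testing: pnpm -s test";
}

const response = streamLog(demoLog());
```

The key difference from the buffering proxy is that nothing here awaits the whole body: the browser's `EventSource` receives each frame as the factory reports it.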
11. Lifecycle & gates at scale (feature)
- Canonical stages enforced server-side: `queued → assigned → building → review → testing → shipped` (+ `blocked`, `failed`, `dead_letter`); transitions validated (illegal transition → 409).
- Per-profile default `verify`; per-job override; verify runs at the factory, result reported as an event.
- Human gates: `review-policy` routes to reviewers; multi-reviewer support (P3).
- Dead-letter: after `retry.max` exhausted, job → `dead_letter` with full diagnostics; never silently dropped. (P1-S3 single-host stand-in: `failed`/`result=retries_exhausted`, WIP branch + full log preserved.)
- Backpressure: when no factory can take more, jobs stay `queued` (no thrash); SLA timers visible.
- Ship semantics are profile-configurable (merged+green vs `pr-opened`, §10); `shipped` is terminal-success, `dead_letter` terminal-failure; `blocked` (unmet deps) is distinct from `queued`.
- Retry vs idempotency: a retry creates a new `fleet_runs` attempt under the same job/`idempotency-key` (never a duplicate job); backoff honored; `retry.on` filters which failure classes retry. (P1-S3 single-host: `attempts` counter survives requeue; `backoff` → `next_eligible` gates selection; `on` filters timeout/verify_failed/crash.)
- Acceptance: a perpetually-failing job lands in `dead_letter` after configured retries; a passing one auto-advances to `testing` then waits for human `ship`; an illegal transition is rejected.
- Verify gate: lifecycle state-machine unit tests (all transitions + illegal-transition rejection + retry/dead-letter path).
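A server-side transition check can be sketched as a lookup table plus one function. The happy-path edges follow the canonical stage order; the side edges (requeue after reclaim, `failed → queued` on retry, `failed → dead_letter`, `blocked ↔ queued`) are assumptions made for this example and would need to be confirmed against the real state machine.

```typescript
type Stage =
  | "queued" | "assigned" | "building" | "review" | "testing"
  | "shipped" | "blocked" | "failed" | "dead_letter";

// Allowed next stages. Side edges beyond the canonical chain are assumptions.
const LEGAL: Record<Stage, Stage[]> = {
  queued: ["assigned", "blocked"],
  blocked: ["queued"],                      // deps became satisfied
  assigned: ["building", "queued"],         // queued again if the lease is reclaimed
  building: ["review", "failed", "queued"],
  review: ["testing", "failed"],
  testing: ["shipped", "failed"],
  failed: ["queued", "dead_letter"],        // retry, or retries exhausted
  shipped: [],                              // terminal success
  dead_letter: [],                          // terminal failure
};

type TransitionResult =
  | { ok: true; stage: Stage }
  | { ok: false; httpStatus: 409; reason: string };

function transition(from: Stage, to: Stage): TransitionResult {
  if (LEGAL[from].includes(to)) return { ok: true, stage: to };
  return { ok: false, httpStatus: 409, reason: `illegal transition ${from} -> ${to}` };
}

const legal = transition("queued", "assigned");
const illegal = transition("queued", "shipped");
```

Keeping the table as data makes the "all transitions + illegal-transition rejection" unit tests a simple loop over every stage pair.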
12. Security, safety & governance (feature — critical with yolo/dangerous)
- Secret isolation: creds live on each factory (env/keychain), never in the queue, manifest, logs, or Cosmos. Factory advertises presence of a cred capability, not the value.
- Scoped git tokens per factory/repo; least-privilege; rotation documented.
- Push policy: protected branches; agents push to feature branches + open PRs by default; direct-to-main gated by profile/flag.
- Blast-radius guardrail: enforce `allowed-scope` with a pre-flight + post-run diff check; out-of-scope changes block the ship gate.
- Budget kill-switch: exceed `budget` (usd/tokens/wall) → pause worker, alert, require human resume.
- Supply-chain safety: edits to shared `@bytelyst/*` packages require `reviewer` profile + human gate (never auto-ship).
- Audit trail: append-only event log per job (who/what/when/where/cost); immutable.
- Corp network/proxy: honor `NETWORK`/proxy + truststore conventions on factories that need them.
- Kill switch (global): one command/flag halts all claiming fleet-wide (incident response).
- Acceptance: a job attempting an out-of-scope edit is blocked at the gate; a budget overrun pauses and alerts; no secret ever appears in any persisted artifact (scanner test).
- Verify gate: security test suite incl. a secret-leak scanner over logs/meta + scope-enforcement test.
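The post-run half of the `allowed-scope` guardrail can be sketched as a glob check over the list of changed files. The glob dialect here is an assumption (only `*` within a path segment and `**` across segments), and the function names are hypothetical analogs of the runner's `path_in_scope`, not a port of it.

```typescript
// Convert a simple glob to a RegExp. Assumed dialect:
//   **  matches any characters, including "/"
//   *   matches any characters except "/"
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&"); // escape regex metachars, keep "*"
  const pattern = escaped
    .replace(/\*\*/g, "\u0000")  // protect ** before handling single *
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp(`^${pattern}$`);
}

function pathInScope(path: string, allowedScope: string[]): boolean {
  return allowedScope.some((glob) => globToRegExp(glob).test(path));
}

// Returns the changed files that fall outside the allowed scope.
// An empty result means the ship gate may proceed.
function outOfScope(changedFiles: string[], allowedScope: string[]): string[] {
  return changedFiles.filter((f) => !pathInScope(f, allowedScope));
}

const scope = ["backend/**", "packages/**"];
const violations = outOfScope(
  ["backend/src/routes.ts", "packages/ui/index.ts", "infra/deploy.yml"],
  scope,
);
```

Returning the offending paths rather than a boolean lets the same function back both the P1 warning and the P2 hard block, since the caller decides what to do with a non-empty list.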
13. Data model (Cosmos containers, P2+)
Each container partitioned sensibly; every doc has productId.
- `fleet_jobs` (pk `/productId`): manifest snapshot + the full instruction body verbatim as markdown (`bodyMd`), current stage, idempotency-key, tracker-item link, `checkpoint` pointer (WIP branch/commit). This is the durable source of truth for instructions: a factory holds only a transient materialized copy, so a machine going down loses nothing (§25).
- `fleet_runs` (pk `/jobId`): one per execution attempt: factory, engine, profile snapshot, timings, exit, verify result, and execution insights: model, tokensIn/Out (+cached), cost (`estimated` flag), turns, tool-call counts, filesChanged, linesAdded/Deleted, attempt number (§26).
- `fleet_leases` (pk `/jobId`): holder factory, `expiresAt`, `leaseEpoch` (fencing), renewals. Reclaim via a coordinator reaper that scans `expiresAt < now`; Cosmos TTL only garbage-collects stale rows, it cannot trigger reclaim logic. Claim guarded by `_etag`/`If-Match`.
- `fleet_factories` (pk `/productId`): descriptor, capabilities, health, load, last heartbeat, seat limits.
- `fleet_profiles` (pk `/productId`): versioned profile snapshots (immutable per version).
- `fleet_events` (pk `/jobId`): append-only audit/event stream (stage changes, log pointers, cost ticks, scheduler decisions).
- `fleet_artifacts` (pk `/jobId`): pointers to blob-stored logs + artifacts (coverage, screenshots, build output). Large logs live in `@bytelyst/blob`, never inline in Cosmos (doc-size + RU limits).
- Relate to existing tracker `Item` via `tracker-item` (no duplication of planning data).
- Optimistic concurrency (`_etag`) on every job stage transition + lease claim to prevent lost updates / double-assignment. (PR #29: `updateIfMatch`.)
- Indexing/RU: the claim query is hot, so index `stage`, `priority`, `capabilities`; avoid cross-partition fan-out; provision RU/s per §22.
- Acceptance: repository CRUD + query tests per container; atomic-claim race test (N concurrent claimers → exactly one wins); reaper-reclaim + fencing-rejection test; lease-expiry verified via reaper (not TTL).
- Verify gate: repository unit/integration tests (memory + Cosmos provider via `DB_PROVIDER`).
14. Phased build roadmap (checklists)
Each phase: Goal → checklist → Exit criteria. Don't start a phase until the prior phase's Exit criteria are green. Tick boxes here as the canonical progress.
Phase 1 — Manifest + profiles + capabilities + tracker adapter (single host)
Goal: richer single-host runner that understands profiles/capabilities and bridges to tracker — no distributed infra yet.
Slice progress, P1-S1: manifest parsing (all §5 fields, defaulted + backward-compatible), `priority` ordering, capability detection+match gate, `engine-class` resolution, and `idempotency-key` dedupe are done on the bash runner.

Slice progress, P1-S3 (resilience & insights, single host): crash recovery (`recover_orphans` + `aq recover`), git WIP checkpoint/resume (`aq/wip/<job>`), functional `retry` policy (backoff + `retries_exhausted`), and execution insights (`parse_usage`, per-run metrics in meta, `aq insights`, `status`/`dash` insights) are done; see §11/§25/§26.

Slice progress, P1-S2 (profiles + deps/DAG, single host): the `profiles/` catalog + resolution (`fm_eff` inheritance with job>profile>default precedence, persona injection), the warn-only `allowed-scope` guardrail (`scope_check`/`path_in_scope`), and single-host `deps` (block-with-reason in selection, `status` surfacing, submit-time cycle detection) are done; see §5/§6.

Slice progress, P1-S4 (tracker adapter, single host): the task ↔ job round-trip is done (§10). `aq from-tracker` materializes a job from a tracker Item (idempotent on `tracker-<id>`, label→manifest mapping), `aq to-tracker` echoes status + a metrics-only comment one-way (idempotent via `tracker_echoed`, never fatal), and opt-in `AQ_TRACKER_AUTO` auto-echoes on transitions. All HTTP is curl-only through one wrapper (test seam `AQ_TRACKER_API_CMD`). This closes the Phase-1 §14 tracker-adapter item. Remaining P1 extras: Node `dash` surfacing of the new fields. (`budget.wall` now enforced; see §11 retry/budget line below.)
- Extend `agent-queue.sh` frontmatter parsing for all new manifest fields (§5), defaulted + backward-compatible. (P1-S1)
- Add `profiles/` directory + profile resolution (persona injection, default verify/caps/scope) (§6). (P1-S2)
- Local capability detection + a job/factory capability match check before launch (§8 subset). (P1-S1: `detect_capabilities` + `caps_match`; mismatch ⇒ `failed`/`result=capability_mismatch`, agent never launched.)
- `priority` ordering in the inbox pick (replace pure FIFO with priority-then-age). (P1-S1: `inbox_sorted`; per-lock serialization preserved.)
- `deps` (DAG) blocking on a single host; `idempotency-key` dedupe on `add`. (P1-S1 idempotency dedupe + P1-S2 `deps` blocking/cycle detection.)
- `retry` with backoff into `failed`/requeue; `budget.wall` enforced (extends `timeout`). (P1-S3: `retry` with backoff + `retries_exhausted` DONE. `budget.wall` DONE: parsed from `budget: { wall: <dur> }`, armed as a HARD wall-clock ceiling alongside `timeout` (whichever fires first binds), expiry → `failed` `result=budget_exceeded`, non-retryable by default.)
- `allowed-scope` guardrail (warn-only this phase) + post-run diff report. (P1-S2: `scope_check` WARN-only + `scope_warning=`.)
- Tracker adapter `aq from-tracker <ITEM>` + `aq to-tracker` event poster (§10 P1). (P1-S4: curl-only `tracker_api`; from-tracker materializes a job (idempotent), to-tracker echoes status+metrics one-way; opt-in `AQ_TRACKER_AUTO`. A standalone background poller is deferred to P2.)
- Dashboard shows profile + priority + capability tags + tracker-item link. (P1-S1: `status` shows priority/profile/caps/tracker-item; P1-S4: status/insights also show last echoed tracker status; Node `dash` surfacing pending.)
- Update `selftest.sh` with: manifest parse fixtures, profile resolution, priority order, dep-block, idempotency, adapter round-trip (mock). (P1-S1 manifest/priority/idempotency + P1-S2 profile/persona/scope/dep-block/cycle + P1-S3 resilience/insights + P1-S4 tracker from/to round-trip via stub.)
- Update README + this doc's progress table. (P1-S1)
- Exit criteria: all boxes ✅; `selftest.sh` green; a tracker task → executed → tracker `done` with SHA comment, fully on one host; no regression to Phase-0 `.md` files.
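The priority-then-age pick and the capability gate from the Phase 1 checklist can be sketched as follows. This is an illustrative Python sketch, not the runner's bash code: `inbox_sorted` and `caps_match` only mirror the roadmap's function names, while the `Job` fields, the `PRIORITY_RANK` levels, and `pick_next` are assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed priority levels; the roadmap does not enumerate them.
PRIORITY_RANK = {"critical": 0, "high": 1, "normal": 2, "low": 3}


@dataclass
class Job:
    name: str
    priority: str = "normal"
    queued_at: float = 0.0          # epoch seconds; smaller means older
    caps: frozenset = frozenset()   # capabilities the job requires


def inbox_sorted(jobs):
    """Order by priority first, then by age (oldest first), instead of pure FIFO."""
    return sorted(jobs, key=lambda j: (PRIORITY_RANK.get(j.priority, 2), j.queued_at))


def caps_match(job, factory_caps):
    """A job may launch only if the factory offers every required capability."""
    return job.caps <= frozenset(factory_caps)


def pick_next(jobs, factory_caps):
    """Return (job, outcome) for the head of the sorted inbox.

    On a mismatch the job is reported as capability_mismatch and no agent
    would be launched for it.
    """
    ordered = inbox_sorted(jobs)
    if not ordered:
        return None, "empty"
    head = ordered[0]
    outcome = "launch" if caps_match(head, factory_caps) else "capability_mismatch"
    return head, outcome
```

Sorting on the tuple `(rank, queued_at)` keeps the tie-break explicit: two `critical` jobs are served oldest first, and a `low` job never jumps ahead of them regardless of age.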
Phase 2 — Coordinator as platform-service module + Cosmos + multi-factory leasing
Goal: the service spine; ≥2 real factories executing in parallel via leases.
Slice progress (P2-S3, factory-agent integration, single host): the bash runner is now a coordinator factory behind `AQ_FLEET`. `lib/fleet-client.sh` (curl-only, sourced) registers via heartbeat, claims jobs into inbox (interleaved with local `.md`), reports fenced stage transitions with WIP checkpoints, renews/releases leases, and on a stale `leaseEpoch` (reclaimed) self-aborts + quarantines the local result. Coordinator 5xx/connection errors degrade (finish locally) rather than abandon work. When `AQ_FLEET` is off the offline git-queue path is byte-for-byte unchanged. The remaining P2 items (scheduler/router core, direct tracker→module calls, factory enrollment + scoped tokens, `fleet.*` feature flags + shadow/dual-run, and the two-factory parallel demo) are now all landed in common-plat (`scheduler.ts`, `tracker-bridge.ts`, `enrollment.ts`).
- Scaffold `fleet/orchestrator` module in `platform-service` (types/repository/routes, Zod, ESM, `productId`). (PR #28)
- Cosmos containers (§13) + repository layer (memory + Cosmos providers). (PR #28; `fleet_artifacts` blob wiring still pending.)
- Atomic claim (optimistic concurrency / `_etag`) + lease reaper + fencing (`leaseEpoch`) endpoints (§4/§8/§9); not Cosmos-TTL-driven reclaim. (common-plat PR #28 + #29; truly atomic via `updateIfMatch`.)
- Port `agent-queue` runner to a factory agent API client (enroll/register/heartbeat/claim/report, fencing-aware) while keeping git-queue fallback. (P2-S3: `lib/fleet-client.sh` behind `AQ_FLEET`; registers via heartbeat, claims into inbox, reports fenced stage transitions, renews leases, quarantines on stale-epoch; offline git-queue unchanged when the flag is off.)
- Scheduler/router core (§7) as a pure module (fixed weights) + wired into atomic assignment. (common-plat `fleet/scheduler.ts` pure `selectJob`/`scoreCandidate`/`selectPreemptionVictim`; `coordinator.ts` `claimNextJob` ranks candidates via `selectJob` after the capability hard-filter.)
- Tracker adapter calls the module directly (not just file export). (common-plat `fleet/tracker-bridge.ts` + `POST /fleet/tracker/ingest` / `/fleet/tracker/echo`: idempotent ingest of a tracker item → job and one-way status echo, in-module.)
- Auth: factory enrollment + scoped rotatable tokens; secret isolation enforced (§12 subset). (common-plat `fleet/enrollment.ts`: `enrollFactory`/`rotateToken`/`revokeToken` issue a plaintext token once, store it hashed, scope it to `{productId, factoryId, capabilities}`; `enforceFactoryToken` gates `claim`/`heartbeat` in `routes.ts`.)
- Feature flags (`fleet.enabled`, `fleet.route_via_service`) + shadow/dual-run vs P1 before cutover (§21). (agent-queue runner: `AQ_FLEET`/`AQ_FLEET_ROUTE`/`AQ_FLEET_SHADOW` with documented precedence; shadow claim/compare/report is side-effect-free (isolated `-shadow` factoryId + dryRun, never materializes/ships); `fleet-shadow-report` summarizes AGREE/DIVERGE/COORD_EMPTY/LOCAL_EMPTY + agreement; 60→68 selftest checks.)
- Module test suite (repository + routes via `@bytelyst/testing`); atomic-claim race, crash-recovery, fencing-rejection, reaper-reclaim tests. (PR #28 + #29: 53 fleet + 48 datastore tests, incl. true-concurrency claim.)
- Two-factory demo (e.g. mac + ubuntu) running 3 parallel jobs end-to-end. (`agent-queue/demo/two-factory-demo.sh` + `coordinator-stub.sh`: two real `run` daemons (mac-1 + ubuntu-1, separate queues/cwds) compete through one coordinator; asserts (a) no double-assign, (b) kill-mid-job → reaper reclaim → survivor completes → zombie report fenced (409), (c) concurrent parallelism. Dual-mode: CI-safe stateful stub by default, live platform-service when `AQ_FLEET_API`/`AQ_FLEET_TOKEN` set. Headless checks in `selftest.sh` → 68→71 green.)
- Exit criteria: all boxes ✅; `pnpm --filter @lysnrai/platform-service test` green; killing a factory mid-job → another reclaims and completes and the dead worker's late report is fenced; concurrent claimers never double-assign; all state in Cosmos with `productId`; flag-off rollback verified (§21). Runtime exit guarantees are demonstrated by the two-factory demo (no double-assign + reclaim/fence + parallelism) and flag-off rollback is verified (§21). Scheduler/router core, tracker-module direct calls, and factory enrollment + scoped tokens are now all wired in (see boxes above), so Phase 2 is effectively complete. Remaining for a hard 100%: validate the Cosmos `_etag` CAS path under true production contention + live blob-backed `fleet_artifacts`.
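The interplay of atomic claim, reaper reclaim, and fencing named in the exit criteria can be illustrated with an in-memory model. This is a sketch under stated assumptions, not the `platform-service` module: `JobStore`, `claim`, `reclaim`, and `report` are hypothetical names, `update_if_match` only imitates the compare-and-swap behavior the roadmap attributes to `updateIfMatch`, and the integer `etag` stands in for a Cosmos `_etag`.

```python
import threading


class Conflict(Exception):
    """An optimistic write or a fenced report was rejected (comparable to HTTP 409)."""


class JobStore:
    """In-memory stand-in for a document store with compare-and-swap on an etag."""

    def __init__(self):
        self._lock = threading.Lock()
        self._docs = {}

    def put(self, job_id):
        self._docs[job_id] = {"etag": 0, "owner": None, "leaseEpoch": 0, "stage": "queued"}

    def read(self, job_id):
        with self._lock:
            return dict(self._docs[job_id])

    def update_if_match(self, job_id, etag, changes):
        """Apply changes only if nobody has written since `etag` was read."""
        with self._lock:
            doc = self._docs[job_id]
            if doc["etag"] != etag:
                raise Conflict("etag mismatch")
            doc.update(changes)
            doc["etag"] += 1
            return dict(doc)


def claim(store, job_id, factory_id):
    """Try to take the lease; returns the claimed doc, or None if the race was lost."""
    doc = store.read(job_id)
    if doc["owner"] is not None:
        return None
    try:
        return store.update_if_match(job_id, doc["etag"], {
            "owner": factory_id,
            "leaseEpoch": doc["leaseEpoch"] + 1,  # every new lease gets a higher epoch
            "stage": "assigned",
        })
    except Conflict:
        return None


def reclaim(store, job_id):
    """Reaper path: the lease expired, so free the job for another factory."""
    doc = store.read(job_id)
    store.update_if_match(job_id, doc["etag"], {"owner": None, "stage": "queued"})


def report(store, job_id, factory_id, lease_epoch, stage):
    """Accept a stage report only from the current lease holder at the current epoch."""
    doc = store.read(job_id)
    if doc["owner"] != factory_id or doc["leaseEpoch"] != lease_epoch:
        raise Conflict("stale leaseEpoch")
    return store.update_if_match(job_id, doc["etag"], {"stage": stage})
```

In this model a worker that lost its lease still holds epoch 1, so its late `report` raises `Conflict` once another factory has claimed the job at epoch 2; that is the fencing behavior the two-factory demo asserts as a 409.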
Phase 3 — Fleet control plane in tracker-web + DAG + budgets + scoring router
Goal: one browser control plane; smart routing + budgets live.
- `fleet` API client in `tracker-web` (reuse `/api/tracker`-style proxy → `/fleet`). (common-plat `dashboards/tracker-web/src/lib/fleet-client.ts`: typed client over `/api/fleet`.)
- Fleet map page (factories, load, health, capabilities) on `@bytelyst/*` components. (common-plat `app/dashboard/fleet/page.tsx`: health badges, load, capabilities, fleet metrics + alerts.)
- Job table + job detail + DAG view; live log via SSE; approve/ship/reject/requeue actions. (common-plat `app/dashboard/fleet/jobs/page.tsx` + `jobs/[id]/page.tsx`: stage-filtered table, DAG via `getJobDag`, SSE event stream, ship/requeue/reject/requestReview.)
- Cost burndown + budget kill-switch UI; multi-reviewer routing. (common-plat `app/dashboard/fleet/budget/page.tsx` burndown + pause/resume; `ReviewGateCard` multi-reviewer quorum gate via `requestReview`/`submitReview`.)
- Scoring router with configurable weights + explainability surfaced in UI. (common-plat `fleet/scheduler.ts` tunable weights + `GET /fleet/jobs/:id/explain`; `ExplainPanel` breakdown in job detail.)
- Preemption of low-priority by critical jobs (checkpoint + requeue). (common-plat `fleet/scheduler.ts` `selectPreemptionVictim` + coordinator eviction under `FLEET_PREEMPTION`; victim requeued with checkpoint + bumped epoch, `preempted` event.)
- TUI dashboard re-pointed at `/fleet` API (parity). (devops-tools `agent-queue/lib/fleet-dash.mjs` adapter + `dashboard.mjs` fleet mode under `AQ_FLEET_DASH=1`: board/factories/metrics/alerts, job actions ship/requeue/reject via `/fleet`, per-job events log; opt-in so local mode is byte-for-byte unchanged. Verified by `lib/fleet-dash.test.mjs` (22 assertions) wired into `selftest.sh` + live non-TTY render smoke.)
- Web e2e (Playwright): fleet map, live log, ship, budget-pause. (common-plat `dashboards/tracker-web/e2e/fleet.spec.ts`: fleet overview, metrics, job detail, ship, budget-pause, review-gate specs green.)
- Exit criteria: all boxes ✅; web `verify` (typecheck+lint+test+e2e) green; an operator runs the whole 3-repo parallel workload from the browser, including a budget pause + resume.
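A weighted scoring router with an explainable breakdown, as listed in the Phase 3 checklist, could look roughly like the sketch below. The factor names and weight values are assumptions chosen for illustration; the roadmap does not list the real scoring factors, and `score_candidate`/`select_job` only echo the roadmap's `scoreCandidate`/`selectJob` names without reproducing their implementation.

```python
# Hypothetical factors and weights; each feature is assumed to be normalized to [0, 1].
DEFAULT_WEIGHTS = {"priority": 0.5, "age": 0.2, "free_capacity": 0.2, "affinity": 0.1}


def score_candidate(features, weights=DEFAULT_WEIGHTS):
    """Return (total, breakdown), where breakdown holds each factor's contribution."""
    breakdown = {name: weights[name] * features.get(name, 0.0) for name in weights}
    return sum(breakdown.values()), breakdown


def select_job(candidates, weights=DEFAULT_WEIGHTS):
    """Pick the best-scoring candidate.

    `candidates` is a list of (job_id, features) pairs that have already passed
    the capability hard-filter. Returns None for an empty list.
    """
    best = None
    for job_id, features in candidates:
        total, breakdown = score_candidate(features, weights)
        if best is None or total > best["score"]:
            best = {"jobId": job_id, "score": total, "explain": breakdown}
    return best
```

Returning the per-factor contributions alongside the total is what makes the decision explainable: the same dictionary can back an explain endpoint or a UI panel without recomputing anything.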
Phase 4 — Message bus + autoscaling + cross-OS capability marketplace
Goal: scale-out and elasticity.
- Introduce broker (NATS/Redis) for push dispatch + backpressure; coordinator publishes, factories subscribe by capability.
- Autoscaling hooks (spin ephemeral factories: cloud VM / container) keyed to queue depth + SLA.
- Capability "marketplace": jobs requiring rare caps (xcode/figma/gpu) routed to the few factories that have them; queueing fairness across products.
- Load + chaos test suite (factory churn, broker outage, thundering herd).
- Exit criteria: all boxes ✅; sustained N×throughput vs Phase 3 under load test; graceful degradation on broker outage (fallback to poll).
Phase 5 — Self-optimizing / learned routing
Goal: the scheduler learns from history to cut time/cost and raise first-pass success.
- Capture outcome features per run (engine, profile, repo, duration, cost, verify pass, human-edit rate).
- Offline eval harness comparing learned vs heuristic routing on historical data.
- Shadow/A-B rollout with guardrails; auto-tune scoring weights.
- Recommendations surfaced ("route NomGap UX jobs to claude on mac-2: 23% faster, 11% cheaper").
- Exit criteria: all boxes ✅; learned router beats heuristic on the eval set without regressing safety gates; A/B shows measurable improvement on a target metric.
15. Cross-cutting feature catalog (quick index)
| Feature | First phase | Section |
|---|---|---|
| Evolved job manifest | P1 | §5 |
| Profiles (persona + capability) | P1 | §6 |
| Capability matching | P1→P2 | §6/§8 |
| Priority + SLA | P1 | §5/§7 |
| DAG dependencies | P1→P3 | §5/§11 |
| Idempotency / dedupe | P1 | §5 |
| Retry + dead-letter | P1→P2 | §11 |
| Budgets + kill-switch | P1(wall)→P3 | §5/§12 |
| Scheduler/router scoring | P2→P3 | §7 |
| Factory registration/heartbeat/lease | P2 | §8 |
| Coordinator (platform-service module) | P2 | §9/§10 |
| Cosmos data model | P2 | §13 |
| Tracker bi-directional sync | P1→P2 | §10 |
| Web control plane + SSE logs | P3 | §10/§17 |
| Security/scope/secret isolation | P1→P2 | §12 |
| Broker + autoscaling | P4 | §14 |
| Learned routing | P5 | §14 |
| Atomic claim + fencing + distributed lock | P2 | §4/§7/§9 |
| Rollout / rollback / feature flags | P2→ | §21 |
| Capacity planning & RU/cost | P2→ | §22 |
| Ownership & RACI / on-call | all | §23 |
| Work hierarchy & composite delegation (roadmap/epic) | P3 (manual) → P5 (planner) | §24 |
| Durability, crash recovery & work preservation | P1 (orphan/retry/WIP) → P2 (lease/resume) | §25 |
| Execution insights & token accounting | P1 (capture) → P3 (rollup UI) | §26 |
16. Definition of Done — the "100% accuracy" rubric
A feature/phase is not done until every item below is true (this is the bar for "100% end-to-end"):
- Functionality: acceptance criteria met; happy path + documented edge cases handled.
- Tests: unit + integration written first or alongside, all green; no weakened/deleted tests; coverage targets met (router ≥95% core).
- Verify gate: the phase's named gate command passes locally (and in CI where applicable).
- Idempotency & recovery: re-runs are safe; crash mid-step recovers (lease/idempotency).
- Security review: secret-leak scan clean; scope guardrail honored; least-privilege tokens.
- Observability: events/logs/metrics emitted; failures are diagnosable from the control plane.
- Docs: this roadmap's checkboxes ticked; README/AGENTS updated; manifest/profile docs current.
- Backward-compat: existing `.md`/Phase-0 behavior unbroken (regression check).
- Drift checks: shared-infra templates (`.npmrc`, `docker-prep`) untouched/synced; conventional commits.
- No `console.log`/`print` in service code; `req.log`/`os.Logger` used; ESM `.js` imports.
17. Observability & control plane details
- Log transport/storage: factory ships logs to blob (`@bytelyst/blob`); `fleet_events` carries pointers + a recent-tail buffer. The control plane serves stored tail + live append (via the streaming route, not the buffering proxy; see §10).
- Live logs via SSE (single stream contract) from the streaming endpoint to web/TUI.
- Metrics: queue depth, `blocked` count, assign latency, claim-loop RU/s, run duration, verify pass-rate, cost, factory utilization, fairness, reclaim/fencing-rejection counts.
- Alerting: stall (no log N min), failure spikes, budget breach, factory offline, dead-letter, claim-race anomalies, RU throttling (Cosmos 429s).
- Tracing: a job's full timeline (queued→…→shipped) reconstructable from `fleet_events` (immutable, ordered).
- Cost burndown per job/product/day with budget overlays.
- SLOs defined + dashboarded (see §19 targets); error budget tracked per SLO.
18. Risks & gaps explicitly tracked (expert call-outs)
- Duplicate execution across transports (git fallback + service): `idempotency-key` (submit) + atomic lease (assign) + fencing token (side-effect) + distributed `lock` (push).
- Crash recovery: coordinator lease reaper + fencing (not Cosmos TTL); checkpoint long jobs where engines allow.
- Split-brain / partition: fencing rejects stale `leaseEpoch` writes; reclaimed-job results quarantined, not auto-merged (§9).
- Shared-package conflicts: two jobs editing `@bytelyst/*` simultaneously → fleet-wide `lock` + reviewer gate.
- Starvation/fairness: per-product + per-factory counters with penalty.
- Cost runaway: `budget.wall` hard ceiling everywhere; `usd`/`tokens` best-effort (provider metering) + global kill switch.
- Cosmos RU throttling (429): hot claim path; bound via long-poll/backoff + indexing (§13/§22); broker offload at P4.
- Clock skew: coordinator-authoritative timestamps for all lease/SLA math (§4).
- Tool-version drift / reproducibility: record engine + tool versions per run; pin where possible.
- Windows quirks: path/shell differences in the factory agent; capability-gate Windows-only work.
- Human-review bottleneck: auto-verify as much as possible; batch review UI; reviewer routing.
- Result capture beyond commits: artifacts (coverage, screenshots, build logs) attached to runs.
- Secret sprawl: never in queue/manifest/logs/Cosmos; presence-only capabilities.
- Data retention: event/log retention + archival policy (extend today's `clean`).
- Engine API churn: engines mapped in one place (`build_agent_cmd`); capability matrix versioned.
19. Success metrics
Each metric has a provisional SLO target (tune with real data; tracked with an error budget):
| Dimension | Metric | Provisional SLO target |
|---|---|---|
| Throughput | jobs shipped/day; parallel utilization | utilization ≥ 60% under backlog |
| Quality | % auto-verified; first-pass success; escaped-defect; post-agent human-edit rate | first-pass ≥ 70%; escaped-defect < 2% |
| Speed | assign latency; time queued→shipped (excl. human gate) | assign p95 < 5s; queue-wait p95 < 2m at target load |
| Cost | $/shipped job; budget-breach rate | budget-breach < 1% of jobs |
| Reliability | lease-reclaim success; dead-letter rate; factory uptime; double-execution incidents | reclaim success ≥ 99.9%; double-merge = 0; dead-letter < 5% |
| Fairness | max/min product wait-time ratio | ratio < 3× |
| Correctness | atomic-claim violations; fencing rejections functioning | claim violations = 0 |
Targets are starting points; the §0 owners ratify per-phase SLOs before that phase's exit.
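Two of the provisional targets in the table above (assign p95 < 5s and a fairness ratio < 3×) can be checked from raw samples along these lines. The function names and input shapes are assumptions for illustration, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
def p95(samples):
    """Nearest-rank 95th percentile; integer math avoids float rounding on the rank."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]


def fairness_ratio(wait_by_product):
    """Max/min product wait-time ratio; assumes every wait is strictly positive."""
    waits = list(wait_by_product.values())
    return max(waits) / min(waits)


def slo_report(assign_latencies_s, wait_by_product):
    """Evaluate the two provisional targets: assign p95 < 5s and fairness < 3x."""
    return {
        "assign_p95_ok": p95(assign_latencies_s) < 5.0,
        "fairness_ok": fairness_ratio(wait_by_product) < 3.0,
    }
```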
20. Open questions
- Copilot headless feasibility as an engine/station (CLI/automation surface?).
- Who owns merge/push authority — agents open PRs only, or auto-merge on green for low-risk profiles?
- Multi-user/tenant: per-user queues + RBAC in the control plane?
- On-call/ownership for the fleet (alerts routing, runbooks)?
- Cloud factory provisioning (Phase 4) — which provider/runtime, cost guardrails?
- Profile authorship/governance — who can create/edit profiles, and review of persona prompts?
21. Rollout, rollback & data migration
Each phase ships behind controls so it can be turned off without losing work.
- Feature-flagged rollout: gate each phase's new path behind a platform feature flag (`fleet.enabled`, `fleet.route_via_service`, `fleet.tracker_sync`); default off; enable per-product first.
- Dual-run / shadow: P2 coordinator runs in shadow (assign decisions logged, not executed) alongside the P0/P1 path before cutover; compare decisions. (agent-queue `AQ_FLEET_SHADOW=1`: offline path stays authoritative, coordinator queried in parallel, decisions classified AGREE/DIVERGE/COORD_EMPTY/LOCAL_EMPTY into `.state/fleet-shadow.log`; strictly side-effect-free, never ships/quarantines/mutates real job state.)
- Cutover is reversible: a factory can fall back from service-claim to git-queue via flag; no schema-destructive step on the rollback path. (rollback = `AQ_FLEET_ROUTE=0` and/or `AQ_FLEET=0` at any time → instant return to the local/offline path; no data migration.)
- Data migration: introducing Cosmos containers (P2) is additive, with no migration of existing tracker data; backfill is read-only (link `tracker-item`, don't mutate). Container creation is idempotent (registered in `cosmos-init`).
- Backward-compat gate: every phase re-runs Phase-0 `selftest.sh` + a corpus of legacy `.md` files (regression).
- Rollback drill: each phase's exit includes a tested rollback (flag off → prior behavior, in-flight jobs drain or requeue cleanly).
- Acceptance: flipping `fleet.*` flags off returns the system to the prior phase's behavior with zero data loss; in-flight jobs either complete or requeue.
- Verify gate: rollout/rollback drill documented + a flag-off regression run is green.
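The shadow comparison described in the dual-run item can be sketched as a pure classifier over (local pick, coordinator pick) pairs. Only the four class names come from this document; how the both-empty case is treated and how the agreement figure is computed (here: AGREE divided by all classified decisions) are assumptions of this sketch, as are the function names.

```python
from collections import Counter


def classify(local_pick, coord_pick):
    """Classify one shadow decision; a pick is a job id, or None when nothing was picked."""
    if local_pick is None and coord_pick is None:
        return None                       # assumed: nothing to compare, not counted
    if coord_pick is None:
        return "COORD_EMPTY"
    if local_pick is None:
        return "LOCAL_EMPTY"
    return "AGREE" if local_pick == coord_pick else "DIVERGE"


def shadow_report(pairs):
    """Summarize classified decisions and an agreement rate in [0, 1]."""
    labels = (classify(local, coord) for local, coord in pairs)
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    agreement = counts["AGREE"] / total if total else 0.0
    return dict(counts), agreement
```

Because the classifier only reads the two picks and returns a label, it has no side effects of its own, which matches the requirement that shadow mode never mutates real job state.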
22. Capacity planning & cost
- Concurrency model: fleet throughput = Σ factory free-stations, bounded by per-engine seat limits (e.g. N Devin seats) — document seat inventory per engine before P2.
- Cosmos RU budgeting: the claim/heartbeat paths are the hot loops. Estimate RU/s = (factories × claim-poll rate × query RU) + (factories × heartbeat rate × upsert RU); pick long-poll interval to keep steady-state RU within a provisioned budget; enable autoscale RU with a ceiling + 429 alerting.
- Polling vs push: at F factories the poll RU grows linearly — define the F threshold that triggers the P4 broker migration.
- Blob storage: logs/artifacts sizing + lifecycle (hot → cool → delete) per retention policy (§18).
- Factory sizing: per-OS resource baseline (CPU/RAM/disk for N concurrent agent sessions + warm checkouts); disk pressure as a health input.
- Cost guardrails: per-product spend caps + alerts; ties to `budget` and the global kill-switch.
- Acceptance: a documented capacity sheet (seats, RU/s, blob GB, factory specs) sized for the target steady-state + 2× burst.
- Verify gate: load test sustains target throughput within the RU/cost budget (no 429 storms).
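The RU estimate from the Cosmos budgeting bullet can be written down directly. The sketch below expresses the poll and heartbeat rates as intervals in seconds; the function name and every number used with it are hypothetical inputs for illustration, not measured costs.

```python
def steady_state_ru_per_s(factories, claim_interval_s, query_ru,
                          heartbeat_interval_s, upsert_ru):
    """RU/s = factories * (claim polls/s * query RU + heartbeats/s * upsert RU)."""
    claim_ru = query_ru / claim_interval_s           # per-factory claim-loop cost
    heartbeat_ru = upsert_ru / heartbeat_interval_s  # per-factory heartbeat cost
    return factories * (claim_ru + heartbeat_ru)


# Hypothetical inputs: 10 factories, a 3 RU claim query every 5 s,
# and a 10 RU heartbeat upsert every 10 s.
example = steady_state_ru_per_s(10, 5, 3, 10, 10)
```

With these made-up inputs the estimate comes to roughly 16 RU/s, and doubling the factory count doubles it. That linear growth is what the polling-vs-push bullet refers to when it asks for an F threshold for the broker migration.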
23. Ownership & RACI
Owners are roles, not names — assign before each phase starts (this removes the "undefined owner" gap).
| Area | Responsible (R) | Accountable (A) | Consulted (C) | Informed (I) |
|---|---|---|---|---|
| Runner / factory agent (bash) | DevOps eng | Platform lead | — | All |
| Coordinator module (platform-service) | Backend eng | Platform lead | Security | All |
| Scheduler/router | Distributed-systems eng | Platform lead | Backend | All |
| Control plane (tracker-web Fleet) | Frontend eng | Platform lead | UX | All |
| Security/governance | Security eng | Security lead | Platform | All |
| Capacity/cost & SLOs | SRE | Platform lead | Finance | All |
| Profiles & persona governance | Eng leads | Platform lead | — | All |
- Each phase names its R/A before kickoff; SLOs (§19) ratified by A.
- On-call + runbooks established before the fleet runs unattended `yolo` workloads (Phase 2+).
24. Work hierarchy & composite delegation (roadmap / epic)
Goal: delegate work at any granularity — a single bug/feature/task, or an entire roadmap — and let the fleet decompose + orchestrate rather than hand a multi-day roadmap to one agent session (which is long-horizon, low first-pass-success, and high blast-radius under yolo).
24.1 Two delegation modes
- Atomic (today's model): one leaf item (`bug`/`feature`/`task`) → one job → one agent at one station.
- Composite (new): a `roadmap`/`epic` → a planner profile expands it into child jobs → the scheduler runs them as a DAG across factories/agents/profiles, honoring `deps` + phase gates. "Delegate the whole roadmap" = hand it to the orchestrator, which fans out; never one agent grinding for hours.
24.2 Job kind — the one genuinely new concept
A new axis, orthogonal to tracker type:
- `kind: leaf`: runs an engine at a station (everything Phase 1–2 already does).
- `kind: composite`: runs the planner/orchestrator that emits child `leaf` jobs and a dependency graph; it never itself edits a repo.
The scheduler (§7) routes by kind: leaf → station/engine; composite → planner. This keeps execution and planning cleanly separated.
24.3 Hierarchy & relationships
- `parentId` links a child job/item to its roadmap/epic; `deps` (§5) expresses ordering within it (DAG, submit-time cycle detection).
- A roadmap is, mechanically, a named DAG of jobs + a rollup: it reuses `deps`, profiles (§6), the scheduler (§7), and the lifecycle (§11); the only additions are `kind`, `parentId`, and rollup logic.
- Add a `planner`/`architect`/`tech-lead` profile (§6 catalog) for decomposition + orchestration; leaf work still uses `backend-engineer`, `ux-designer`, etc.
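Submit-time cycle detection over `deps` is a standard depth-first search with three node states. The sketch below is one way to do it and is not taken from the runner; the `find_cycle` name and the dictionary input shape are assumptions.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of job ids, or None if the graph is a DAG.

    `deps` maps each job id to the list of job ids it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in deps.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so the path loops back to it
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

Rejecting the submission when `find_cycle` returns a list (and echoing that list back to the author) keeps a roadmap from being accepted with children that could never become runnable.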
24.4 Rollup semantics (composite-level)
- Status rollup: roadmap `status` is derived from children: `in_progress` once any child starts; `shipped`/`done` only when all children reach `shipped`; surfaces `blocked`/`failed` children for triage.
- Budget rollup: roadmap `budget` = Σ child budgets with an explicit ceiling; breaching the ceiling pauses fan-out (ties to §12 kill-switch).
- Verify rollup: each leaf runs its own `verify`; the roadmap's acceptance gate runs after all leaves pass (e.g. an integration/e2e gate).
- Phase gates: the roadmap's own phase Exit-criteria become runtime gates: fan-out of phase N+1 is blocked until phase N's children ship; human approval between phases is the default for `yolo` safety.
- Idempotent re-run: re-running a roadmap skips already-`shipped` children (content-hash dedupe, §5); only unfinished/changed children re-queue.
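The status and budget rollups can be expressed as one pure function over the child set. This is a sketch under assumptions: the child record shape, the `queued` stage name, the choice to treat `queued`/`blocked` as "not started", and the single numeric budget are all illustrative rather than the `fleet_jobs` schema.

```python
NOT_STARTED = {"queued", "blocked"}   # assumed: these stages mean no work has begun
TRIAGE = {"blocked", "failed"}        # children surfaced to a human


def rollup(children, ceiling):
    """Derive roadmap-level status and budget from child jobs.

    Each child is a dict with "id", "stage", and "budget" (one numeric unit).
    """
    stages = [child["stage"] for child in children]
    if stages and all(stage == "shipped" for stage in stages):
        status = "done"
    elif any(stage not in NOT_STARTED for stage in stages):
        status = "in_progress"
    else:
        status = "queued"
    total = sum(child["budget"] for child in children)
    return {
        "status": status,
        "triage": [c["id"] for c in children if c["stage"] in TRIAGE],
        "budget": total,
        "pause_fanout": total > ceiling,   # breaching the ceiling pauses fan-out
    }
```

Keeping the rollup derived rather than stored means it can be recomputed at any time from the children, which fits the rule that humans never hand-edit rollup state.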
24.5 Source-of-truth & sync (no drift)
Composite work obeys the same SoT discipline as the core contract (§4 immutable manifest) and the tracker echo (§10): a roadmap/epic is one record referenced by many, never duplicated.
- The roadmap/epic is the SoT for what/why + rollup status; each leaf job/run is the SoT for its execution.
- Children reference the parent by `parentId`; the planner writes the child set once at decomposition (immutable manifest snapshot). Re-planning creates a new revision; it does not mutate in-flight children.
- Status flows one way, child → parent → tracker (the §10 echo); humans never hand-edit rollup state.
24.6 Decision — Hybrid (recorded)
Model composite delegation in the fleet layer now; defer the shared-platform enum change until proven.
- Now (fleet-owned): add `kind` (`leaf`/`composite`), `parentId`, and rollup to the `fleet_jobs` schema (§13). The fleet owns this schema outright, so there is no cross-product risk.
- Tracker stays `bug`/`feature`/`task` (the shared `ITEM_TYPES` used by all 9 products is unchanged). A roadmap is represented by a parent item + label `kind:roadmap` + `parentId` on children: zero platform migration, no sign-off needed.
- Later (optional, gated on proven value): promote `kind:roadmap` → a first-class `epic` tracker `type` via an additive migration (backfill items where `labels` contains `kind:roadmap` into `type: epic`, keep the label as an alias during transition). Low-risk because the behavior already works fleet-side.
- Rationale: avoids a speculative 9-product platform change (UI/filters/stats/tests) before the orchestration model is validated; if the model is wrong, only fleet code is refactored, not a platform enum every product depends on.
24.7 Phasing & gates
- P1–P2: leaf-only (no composite); `kind` defaults to `leaf`.
- P3: composite scheduling + rollup + DAG view in the control plane, with manual decomposition (a human/author defines the child set).
- P3→P5: the auto-decomposition planner agent (itself a `composite` job run by the `planner` profile); start manual, automate once trustworthy.
- Acceptance: a roadmap with N child jobs fans out across ≥2 factories, respects `deps` + phase gates, rolls up status/budget correctly, and a re-run skips shipped children; tracker shows the parent moving `in_progress → done` via the one-way echo.
- Verify gate: composite-orchestration tests (DAG expansion, rollup status/budget, phase-gate blocking, idempotent re-run); control-plane e2e for the roadmap DAG view.
25. Durability, crash recovery & work preservation
Goal: a machine power-off, daemon/agent crash, or network partition never loses the job, its instructions, or in-progress work, and never corrupts state. Recovery is automatic and idempotent.
25.1 Instructions are durable (markdown in Cosmos)
- The full job instruction body is persisted verbatim as markdown in `fleet_jobs.bodyMd` (§13), alongside the structured manifest. The originating tracker `Item.description` also retains the human instruction text; the two are linked by `tracker-item`, never duplicated as competing truth (§24.5).
- A factory only ever holds a transient materialized copy (temp prompt file) fetched from the API, so losing the factory loses nothing. On the offline edge, the `.md` file on disk is the durable copy and reconciles on reconnect (§9).
25.2 Work-in-progress is preserved (checkpointing)
- For a git-repo `cwd`, the worker commits WIP to a dedicated branch `aq/wip/<jobId>` at start and on every exit path (success, failure, timeout, signal), so partial work is never lost to a crash. Never commits to `main`/protected branches (§12 push policy). (P1-S3: `_wip_start`/`_wip_checkpoint` + EXIT/INT/TERM trap; non-git cwd skipped.)
- `fleet_jobs.checkpoint` records the WIP branch + last commit so any worker can find it. (P2 Cosmos; single-host records `wip_branch`/`wip_base`/`wip_commit` in `<job>.meta`.)
- Long agents checkpoint periodically where the engine supports it; otherwise the start/exit commits bound the loss window. (P1-S3: start + every-exit-path commits bound the loss window.)
25.3 Recovery is automatic, resumable, and fenced
- Orphan detection: on coordinator/runner startup (and continuously), a job in `building`/`assigned` whose worker is dead (no live lease / dead pid) is an orphan; it is recovered, not stranded. (P1-S3: `recover_orphans` on `run` startup + each loop, and `agent-queue.sh recover`; dead-pid + `pidstart` reuse guard.)
- Resume vs restart: recovery starts a new `fleet_runs` attempt; if `aq/wip/<jobId>` exists, the new worker resumes from the checkpoint instead of restarting from zero. (P1-S3: relaunch checks out `aq/wip/<job>`; `attempts` incremented.)
- Fencing (§4): the reclaimed run gets a higher `leaseEpoch`; the dead/zombie worker's late commits/ship reports are rejected, so there is no double-execution of visible outcomes. (P2: distributed leasing; out of single-host scope.)
- Retry policy (`retry.max`/`backoff`/`on`): agent `rc≠0`/`timeout`/`verify_failed` requeue with backoff up to `max`; on exhaustion → `dead_letter` (P2) / `failed` (P1 stand-in) with full diagnostics; never silently dropped. (P1-S3 single-host.)
- State integrity: all run state is append-only / optimistic-concurrency guarded (§13); recovery is idempotent (running it twice yields one recovery). (P1-S3 single-host: meta is append-only + re-derivable from folder location; `_etag` guard is P2.)
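The retry bullet above amounts to a small decision function. In this sketch the exponential backoff shape, the 30-second base delay, the `agent_failed` label standing in for `rc≠0`, and the `next_action` name are all assumptions; the roadmap only states that retryable failures requeue with backoff up to `max` and otherwise end in `dead_letter` (P2) or `failed` (P1).

```python
# "agent_failed" is a placeholder label for a non-zero agent exit code.
RETRYABLE = {"agent_failed", "timeout", "verify_failed"}


def next_action(result, attempts, retry_max, base_delay_s=30, distributed=False):
    """Decide what happens after an attempt finishes.

    `attempts` counts attempts already made (>= 1); `retry_max` is the number of
    retries allowed on top of the first attempt.
    """
    if result == "ok":
        return {"action": "done"}
    if result in RETRYABLE and attempts <= retry_max:
        # Assumed backoff shape: the delay doubles with each retry.
        return {"action": "requeue", "delay_s": base_delay_s * 2 ** (attempts - 1)}
    terminal = "dead_letter" if distributed else "failed"
    reason = "retries_exhausted" if result in RETRYABLE else result
    return {"action": terminal, "result": reason}
```

A non-retryable result such as `budget_exceeded` skips the requeue branch entirely and keeps its own result code, so the diagnostics still say why the job stopped.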
25.4 Crash taxonomy (all handled)
| Failure | Detection | Recovery |
|---|---|---|
| Agent process crash (`rc≠0`) | exit code | retry policy → requeue or `failed`/`dead_letter` |
| Daemon/runner crash | lease not renewed | reaper reclaims → resume from checkpoint |
| Machine power-off / partition | missed heartbeats + lease expiry | reaper + fencing + WIP resume elsewhere |
| Coordinator restart | state in Cosmos | leases survive; in-flight reconciled on boot |
- Acceptance: SIGKILL an agent and power-off a factory mid-run → another worker resumes from the last checkpoint (not from zero) and ships; instructions intact (read back from Cosmos `bodyMd`); zero duplicate commits/merges; a retry-exhausted job lands in `dead_letter`/`failed` with diagnostics.
- Verify gate: chaos tests (kill agent, kill runner, simulate partition); assert resume-from-checkpoint, fencing rejection of the stale worker, instruction integrity, and no double-merge.
26. Execution insights & token accounting
Goal: per-job/run visibility into token usage, cost, model, latency, and tool activity — to drive budgets (§5/§12), cost burndown (§17), and learned routing (§14 P5).
- Per-run telemetry record (in `fleet_runs`, streamed as `fleet_events`): engine, model, tokensIn/Out (+cached), cost USD (`estimated:true` when not provider-reported), wall + CPU time, turn count, tool-call counts, verify pass/fail, filesChanged, linesAdded/Deleted, attempt number, retries. (P1-S3 single-host: recorded in `<job>.meta` as `duration_s`, `files_changed`/`lines_added`/`lines_deleted`, tokens/cost/turns/tool_calls, `attempts`; CPU time not captured.)
- Token source (honest feasibility): capture real usage where the engine/provider exposes it (Claude/Codex/OpenAI usage in responses; Devin session metrics); otherwise estimate from log heuristics and mark `estimated`, with the same caveat as `budget.usd`/`tokens` (§5). A single `parse_usage(engine, log)` adapter centralizes per-engine extraction. (P1-S3: `parse_usage` adapter; generic `AQ_USAGE` line + Claude/Codex heuristics; Devin/Copilot TODO; `usage_estimated` flag, never fabricated.)
- Aggregation/rollups: per job, roadmap (§24), product, factory, engine, profile, and day. Powers cost burndown (§17) and the learned-routing eval (§14). (P1-S3 partial: `aq insights` does per-job + per-engine rollup; product/factory/profile/day are P2/P3.)
- Surfacing: control-plane panels (tokens, cost, success/first-pass/human-edit rates) + a CLI insights summary at the edge; reuse the platform-service telemetry module where present. (P1-S3 partial: edge CLI `aq insights` + `status`/`dash` insights line done; web control-plane panels are P3.)
- Privacy: telemetry carries metrics + pointers only, never prompt content or secrets (redaction §12). (P1-S3: insights/meta record only metrics; no prompt body or secrets added.)
- Acceptance: after a run, its `fleet_runs` carries token/cost/duration/tool/diff metrics (real where metered, flagged `estimated` otherwise); dashboards show per-engine and per-profile cost + token totals; a budget breach is detectable from telemetry alone.
- Verify gate: telemetry unit tests (capture + rollup); a metered-engine run records real tokens; an unmetered run records estimated + flagged; aggregation totals verified.
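The aggregation and the "budget breach detectable from telemetry alone" criterion can be sketched over plain run records. The record fields used here (`job`, `engine`, `profile`, `tokens_in`, `tokens_out`, `cost_usd`, `estimated`) are an assumed shape for illustration, not the `fleet_runs` schema, and both function names are hypothetical.

```python
from collections import defaultdict


def rollup_by(runs, key):
    """Total runs, tokens, and cost per value of `key` (e.g. "engine" or "profile").

    A bucket is flagged estimated if any contributing run was estimated, so
    totals never present estimated numbers as provider-reported ones.
    """
    totals = defaultdict(lambda: {"runs": 0, "tokens": 0, "cost_usd": 0.0, "estimated": False})
    for run in runs:
        bucket = totals[run[key]]
        bucket["runs"] += 1
        bucket["tokens"] += run["tokens_in"] + run["tokens_out"]
        bucket["cost_usd"] += run["cost_usd"]
        bucket["estimated"] = bucket["estimated"] or run.get("estimated", False)
    return dict(totals)


def budget_breached(runs, job_id, budget_usd):
    """Detect a breach for one job from telemetry records alone."""
    spent = sum(run["cost_usd"] for run in runs if run["job"] == job_id)
    return spent > budget_usd
```

Calling `rollup_by(runs, "engine")` and `rollup_by(runs, "profile")` on the same records yields the two totals the acceptance bullet asks dashboards to show.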
This document is the single source of truth for the gigafactory build. Keep the §0 table and per-phase checkboxes updated; a phase ships only when its Exit criteria and the §16 Definition-of-Done rubric are fully green.