saravanakumardb1 67d8aa5766 docs(agent-queue): add work hierarchy & composite delegation (roadmap/epic)

New §24 + feature-catalog row:
- two delegation modes: atomic (leaf bug/feature/task) vs composite (roadmap/epic)
- introduce job kind (leaf|composite); composite routes to a planner/orchestrator
  that fans out child leaf jobs as a DAG across factories/agents/profiles
- parentId hierarchy + rollup semantics (status/budget/verify/phase-gates) +
  idempotent re-run (skip shipped children)
- source-of-truth/sync discipline (one record referenced by many; one-way echo)
- HYBRID decision recorded: model kind/parentId/rollup in the fleet layer now,
  keep shared tracker ITEM_TYPES unchanged (label kind:roadmap), promote to a
  first-class epic type later via additive migration once proven
- phasing: leaf-only P1-P2; manual composite P3; auto-decomposition planner P3->P5

2026-05-29 18:02:10 -07:00


Agent Gigafactory — Vision & Implementation Roadmap

One-liner: Evolve today's single-host agent-queue bash runner into a distributed gigafactory — a fleet of heterogeneous machines (Mac/Ubuntu/Windows), each running different coding-agent CLIs (Devin/Codex/Claude/Copilot/…), where a scheduler auto-picks jobs from a shared inbox and routes each .md to the best factory × tool × profile — built service-side on platform-service + tracker-web, with the bash runner surviving as the offline edge agent.

How to use this doc: It is both a PRD and an execution checklist. Every feature is a - [ ] checkbox with acceptance criteria and a verify gate. A phase is "100% done" only when every box is checked, its gate passes, and the phase Definition of Done rubric (§16) is green. Update the progress table (§0) as you go.

0. Progress tracker

Phase	Theme	Status	%	Gate
0	Baseline (today)	✅ shipped	100%	`selftest.sh` green
1	Manifest + profiles + capabilities + tracker adapter (single host)	☐ not started	0%	adapter e2e + selftest
2	Coordinator as platform-service module + Cosmos + multi-factory leasing	☐ not started	0%	fleet e2e + module tests
3	Fleet control plane in tracker-web + DAG deps + budgets + scoring router	☐ not started	0%	web e2e + router tests
4	Message bus + autoscaling + cross-OS capability marketplace	☐ not started	0%	load/chaos suite
5	Self-optimizing / learned routing	☐ not started	0%	offline eval + A/B

Legend: ☐ not started · ◐ in progress · ✅ done. Keep per-phase checklists below as the source of truth; this table is the summary. Owners per phase: §23 · rollout/rollback: §21 · capacity & SLOs: §22/§19.

1. Vision & metaphor

A gigafactory turns raw intent (.md task files / tracker items) into shipped software with minimal human touch. The mental model is a physical factory network:

Term	Meaning
Fleet	The whole network of machines under one control plane.
Factory	One physical/virtual machine (a Mac, an Ubuntu box, a Windows host). Has an OS, installed tools, creds, capacity.
Station	A tool/engine slot inside a factory (a Devin seat, a Codex CLI, a Claude Code session, a Copilot agent).
Worker	A single running agent process executing one job at a station.
Job	A unit of work: a prompt/`.md` + manifest (profile, scope, gates, budget).
Profile	The role doing the work (developer, backend engineer, UX/UI designer, QA, reviewer) = persona prompt + capability requirements.
Capability	A tag a factory advertises and a job requires (`os:mac`, `has:xcode`, `has:figma`, `gpu`, `engine:devin`).
Lease	A time-boxed claim of a job by a worker; expires → job is reclaimable (crash recovery).
Gate	A checkpoint a job must pass: auto-QA `verify`, human review, ship approval.
Artifact	Any captured output: commits/PRs, logs, screenshots, reports, build outputs.

North star: drop work into one inbox (or file a tracker task), and the fleet figures out where (factory), with what (tool/engine), as whom (profile), runs it in parallel, self-heals on crash, gates quality automatically, and surfaces everything in one live control plane — while a human only approves the final ship.

                         ┌──────────────────────── CONTROL PLANE (tracker-web) ────────────────────────┐
                         │  plan/intake · roadmap · Fleet map · live logs · cost · approvals           │
                         └───────────────▲───────────────────────────────────┬─────────────────────────┘
                                         │ REST/SSE                           │
            ┌────────────────────────────┴─────── COORDINATOR (platform-service module) ───────────────┐
            │  queue · scheduler/router · leases · profiles · capabilities · events · budgets (Cosmos)  │
            └───▲───────────────────────▲───────────────────────▲───────────────────────▲───────────────┘
                │ claim/lease/report     │                       │                       │
        ┌───────┴───────┐       ┌────────┴───────┐       ┌────────┴───────┐       ┌───────┴────────┐
        │  FACTORY: mac │       │ FACTORY: ubuntu│       │FACTORY: windows│       │ FACTORY: mac-2 │
        │ devin, claude │       │ codex, claude  │       │ copilot, codex │       │ devin (xcode)  │
        │ [agent-queue] │       │ [agent-queue]  │       │ [agent-queue]  │       │ [agent-queue]  │
        └───────────────┘       └────────────────┘       └────────────────┘       └────────────────┘

2. Current state (Phase 0 baseline — already shipped)

Today's agent-queue.sh + dashboard.mjs (single host, zero-dep bash + Node):

Folder kanban lifecycle: inbox → building → review → testing → shipped (+ failed).
Auto-QA gate: when the agent exits with rc=0 the job moves to review/. If a verify command is set, it runs in cwd: pass → testing/, fail → failed/. With no verify the job parks in review/. Manual ship is the human gate.
Per-job frontmatter: engine (devin/claude/codex), cwd, yolo (→ dangerous/auto-approve), lock (per-repo serialization), timeout, verify.
Concurrency: AGENT_QUEUE_MAX (default 3), per-lock serialization so same-repo jobs never collide.
State & logs: .state/<job>.meta heartbeats + logs/<job>.log; git-tracked queue (audit-by-commit).
Interactive dashboard: numbered selectable job list, single-key actions (promote/ship/reject/requeue), live log viewer, run/stop, all shelling out to agent-queue.sh.

Carries forward: the .md-in-inbox UX, frontmatter contract, lifecycle stage names, verify gate, lock/affinity concept, the bash runner itself (becomes the factory agent).
Must change for the fleet: single-host run loop → distributed leasing; file-only state → service + Cosmos; one engine choice → capability/profile routing; local dashboard → shared control plane.

Phase 0 complete — baseline shipped and self-tested. (reference, not a work item)

3. Goals & non-goals

Goals

One intake, many machines: parallel execution across heterogeneous OS/tools.
Automatic routing to the best factory × tool × profile with affinity, fairness, budget, and health awareness.
Self-healing (lease expiry/requeue), quality gates, and full observability.
Reuse the ByteLyst stack (platform-service, Cosmos, @bytelyst/*, tracker-web) — no parallel infra.
Preserve offline/zero-dep edge operation via the bash runner.

Non-goals

Not a CI/CD replacement (it triggers CI; CI still gates merges).
Not a general-purpose workflow engine (scoped to coding-agent execution).
Not a model/inference host (it orchestrates agent CLIs, doesn't serve models).
Not abandoning the simple .md mental model — humans still drop files / file tasks.

4. Core concepts contract (must hold across all phases)

Every job has a stable id, an immutable manifest, and an append-only event log.
Every Cosmos document carries productId (ByteLyst rule).
A job in flight is always covered by exactly one lease; no live lease → reclaimable.
Atomic claim: a job is assigned to exactly one worker via optimistic concurrency (Cosmos _etag/If-Match or a conditional fleet_leases insert keyed by jobId). Concurrent claimers — exactly one wins; losers retry the next candidate.
Fencing token: every lease carries a monotonic leaseEpoch. Every report/commit/ship carries its epoch; the coordinator rejects writes from a stale epoch, so a partitioned or zombie worker cannot corrupt state after its lease was reclaimed.
Coordinator-authoritative time: all lease/TTL/SLA math uses server timestamps, never factory clocks (clock-skew safety).
Lifecycle stages are canonical and shared: queued → assigned → building → review → testing → shipped (+ blocked, failed, dead_letter).
The bash runner and the service speak the same manifest + event vocabulary (one schema, two transports).
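The atomic-claim and fencing rules above can be illustrated with a minimal in-memory sketch. It is not the coordinator's implementation: `LeaseTable`, `claim`, and `acceptReport` are hypothetical names, and a plain object stands in for the conditional Cosmos write (`_etag`/`If-Match`) that the real claim would use. Timestamps are passed in explicitly to mirror coordinator-authoritative time.

```typescript
// In-memory stand-in for fleet_leases; a real coordinator would rely on a
// conditional Cosmos write instead of this single-process object.
interface FleetLease {
  jobId: string;
  holder: string;     // factory id that won the claim
  leaseEpoch: number; // monotonic fencing token
  expiresAt: number;  // coordinator clock, ms
}

class LeaseTable {
  private leases: Record<string, FleetLease | undefined> = {};
  private epochs: Record<string, number | undefined> = {};

  // Atomic claim: succeeds only when no live lease covers the job.
  claim(jobId: string, holder: string, ttlMs: number, now: number): FleetLease | null {
    const current = this.leases[jobId];
    if (current !== undefined && current.expiresAt > now) {
      return null; // another worker holds a live lease; caller tries the next candidate
    }
    const nextEpoch = (this.epochs[jobId] || 0) + 1; // never reused, even after reclaim
    this.epochs[jobId] = nextEpoch;
    const lease: FleetLease = { jobId, holder, leaseEpoch: nextEpoch, expiresAt: now + ttlMs };
    this.leases[jobId] = lease;
    return lease;
  }

  // Fencing: a report is accepted only if it carries the current epoch.
  acceptReport(jobId: string, leaseEpoch: number): boolean {
    const current = this.leases[jobId];
    return current !== undefined && current.leaseEpoch === leaseEpoch;
  }
}

const leaseTable = new LeaseTable();
const claimA = leaseTable.claim("job-1", "mac", 1000, 0);       // wins, epoch 1
const claimB = leaseTable.claim("job-1", "ubuntu", 1000, 500);  // loses, lease still live
const claimC = leaseTable.claim("job-1", "ubuntu", 1000, 2000); // reclaims after expiry, epoch 2
```

After the reclaim, a late report from the first holder carrying epoch 1 is rejected, while the new holder's epoch 2 is accepted.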

5. The evolved Job manifest (feature)

Extend today's frontmatter into a richer, backward-compatible manifest. Old .md files keep working (new fields optional with sane defaults).

---
# --- existing (unchanged) ---
engine: devin            # explicit engine; overrides profile/engine-class
cwd: /abs/path/repo
yolo: true
lock: my-repo
timeout: 45m
verify: pnpm -s test
# --- new ---
profile: backend-engineer        # role: persona + capability requirements
engine-class: agentic-coder      # abstract; scheduler picks a concrete engine if `engine` unset
capabilities: [os:any, node>=20] # hard requirements a factory MUST satisfy
prefers: [factory:mac-2]         # soft routing hints (affinity)
priority: high                   # critical|high|medium|low → SLA + preemption
budget: { usd: 5, tokens: 2M, wall: 4h }   # wall = HARD ceiling (always enforceable). usd/tokens = best-effort
                                           # caps: enforced only where the engine/provider exposes live metering;
                                           # otherwise estimated from provider usage APIs post-hoc + alerted.
deps: [job-123, job-456]         # DAG: don't start until these reach `shipped` (or `testing` when deps-mode: soft)
idempotency-key: nomgap-ux-2     # dedupe: a second identical submit is a no-op
retry: { max: 2, backoff: 5m, on: [timeout, verify_failed] }
review-policy: manual            # auto|manual|reviewers:[@alice]
artifacts: [coverage, screenshots]   # what to capture beyond commits
tracker-item: ITEM-789           # link back to the originating tracker task
---

Define the manifest schema (Zod in the service; documented YAML for .md).
Backward-compat: a Phase-0 .md (only engine/cwd/yolo) parses with all new fields defaulted.
Capability grammar defined: tokens are key (presence, e.g. has:xcode), key:value (e.g. os:mac, engine:devin), or key<op>version with op ∈ {>=,>,=,<=,<} (e.g. node>=20). os:any is a wildcard that matches every factory. A job matches a factory iff every required token is satisfied by the factory descriptor.
engine-class taxonomy defined as an enum (agentic-coder, chat-coder, review-only) with a documented engine→class map (devin,claude,codex → agentic-coder; copilot → chat-coder). If engine is set it wins; else the scheduler picks any free engine in the class honoring prefers-engine.
idempotency-key semantics: key + content-hash identical ⇒ no-op (returns existing job). Same key, different content ⇒ rejected with 409 unless the prior job is still queued/blocked (then it is superseded). A re-run/retry of an existing job is not a new submit and never trips dedupe.
deps semantics: a dep is satisfied when it reaches shipped (default) or testing if deps-mode: soft. Submit-time cycle detection rejects cyclic graphs; unmet deps put the job in blocked (not queued). Cross-factory deps require the coordinator (P2); single-host deps work in P1.
Acceptance: a manifest fixture suite parses/validates; invalid manifests fail with precise errors; capability-grammar + dep-cycle + idempotency-conflict cases covered.
Verify gate: schema unit tests (≥ 1 per field incl. defaults + 5 invalid cases + grammar/cycle/409 cases).
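The capability grammar above can be sketched as a small matcher. This is one possible reading, not the service's Zod schema: the `FactoryDescriptor` shape, `tokenSatisfied`, and `jobMatchesFactory` are hypothetical, presence and `key:value` tokens are both treated as exact tag matches, and versions are assumed to be dotted numbers.

```typescript
interface FactoryDescriptor {
  tags: string[];                               // e.g. "os:mac", "has:xcode", "engine:devin"
  versions: Record<string, string | undefined>; // e.g. { node: "20.11.0" }
}

// Compare dotted numeric versions; missing parts count as 0 ("20" equals "20.0").
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return diff < 0 ? -1 : 1;
  }
  return 0;
}

function tokenSatisfied(token: string, factory: FactoryDescriptor): boolean {
  if (token === "os:any") return true; // wildcard matches every factory
  const m = /^([A-Za-z][\w-]*)(>=|<=|>|<|=)(\d+(?:\.\d+)*)$/.exec(token);
  if (m === null) return factory.tags.indexOf(token) !== -1; // presence or key:value tag
  const installed = factory.versions[m[1]];
  if (installed === undefined) return false; // tool not installed at all
  const cmp = compareVersions(installed, m[3]);
  switch (m[2]) {
    case ">=": return cmp >= 0;
    case ">": return cmp > 0;
    case "=": return cmp === 0;
    case "<=": return cmp <= 0;
    default: return cmp < 0; // "<"
  }
}

// A job matches a factory iff every required token is satisfied.
function jobMatchesFactory(required: string[], factory: FactoryDescriptor): boolean {
  return required.every((t) => tokenSatisfied(t, factory));
}

const macFactory: FactoryDescriptor = {
  tags: ["os:mac", "has:xcode", "engine:devin"],
  versions: { node: "20.11.0" },
};
const matchesDefault = jobMatchesFactory(["os:any", "node>=20"], macFactory);
const matchesUbuntuOnly = jobMatchesFactory(["os:ubuntu"], macFactory);
```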

6. Profiles — persona + capability (feature)

A profile = a versioned file combining a persona (system-prompt overlay), required capabilities, default gates, preferred engine/model, and allowed repo scopes. Stored as profiles/<name>.md (Phase 1) → Cosmos profiles container (Phase 2).

# profiles/backend-engineer.md
---
name: backend-engineer
persona: |
  You are a senior backend engineer. Favor minimal, well-tested changes...  
capabilities: [node>=20, has:pnpm]
default-verify: pnpm -s typecheck && pnpm -s test
engine-class: agentic-coder
prefers-engine: [devin, claude]
allowed-scope: ["backend/**", "packages/**"]   # blast-radius guardrail
review-policy: manual
---

Author starter catalog: developer, backend-engineer, frontend-engineer, ux-designer, ui-designer, qa, reviewer, docs-writer.
Persona overlay is prepended to the job body before the agent runs; secrets are never written to logs or the event stream (redaction at the source).
Profile supplies default verify, capabilities, engine-class, allowed-scope when the job omits them.
Profile versioning: changing a profile doesn't mutate in-flight jobs (snapshot at assign time).
allowed-scope enforced as a guardrail (warn in P1, enforce/deny in P2 via pre-flight diff check).
Acceptance: a job with profile: backend-engineer and no verify inherits the profile's verify + persona.
Verify gate: profile-resolution unit tests; persona-injection golden test.
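Profile resolution as described in this section could look roughly like the pure function below. The type and function names (`ProfileDef`, `JobFrontmatter`, `resolveWithProfile`) are illustrative assumptions. Capabilities are merged as a union, following the "job ∪ profile" wording in §7, and arrays are copied so the result behaves as a snapshot.

```typescript
interface ProfileDef {
  name: string;
  persona: string;
  capabilities: string[];
  defaultVerify?: string;
  engineClass?: string;
  allowedScope?: string[];
}

interface JobFrontmatter {
  body: string;
  verify?: string;
  capabilities?: string[];
  engineClass?: string;
  allowedScope?: string[];
}

interface ResolvedJob {
  profileName: string;
  prompt: string;
  verify: string | undefined;
  capabilities: string[];
  engineClass: string | undefined;
  allowedScope: string[] | undefined;
}

function resolveWithProfile(job: JobFrontmatter, profile: ProfileDef): ResolvedJob {
  // Union of job and profile capabilities, de-duplicated, job tokens first.
  const merged = (job.capabilities || []).concat(profile.capabilities);
  const capabilities = merged.filter((c, i) => merged.indexOf(c) === i);
  const scope = job.allowedScope !== undefined ? job.allowedScope : profile.allowedScope;
  return {
    profileName: profile.name,
    prompt: profile.persona.trim() + "\n\n" + job.body, // persona is prepended to the job body
    verify: job.verify !== undefined ? job.verify : profile.defaultVerify,
    capabilities,
    engineClass: job.engineClass !== undefined ? job.engineClass : profile.engineClass,
    allowedScope: scope === undefined ? undefined : scope.slice(), // copy, so later profile edits don't leak in
  };
}

const backendProfile: ProfileDef = {
  name: "backend-engineer",
  persona: "You are a senior backend engineer.",
  capabilities: ["node>=20", "has:pnpm"],
  defaultVerify: "pnpm -s typecheck && pnpm -s test",
  engineClass: "agentic-coder",
  allowedScope: ["backend/**", "packages/**"],
};
const resolvedJob = resolveWithProfile(
  { body: "Fix the login bug.", capabilities: ["os:any", "node>=20"] },
  backendProfile
);
```

A job that sets its own `verify` keeps it; the profile only fills in what the job omits.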

7. The scheduler / router (the heart) (feature)

Given a queued job and the current fleet, choose (factory, station/engine, profile) and issue a lease.

Inputs: job manifest (capabilities, priority, budget, deps, prefers, lock), profile requirements, live factory descriptors (capabilities, load, health, cost class), lock/affinity table, fairness counters.

Algorithm (deterministic, explainable):

Filter factories by hard capability match (job ∪ profile capabilities ⊆ factory capabilities) and free station for a compatible engine.
Block if deps unmet or lock already held → leave queued/blocked.
Score each candidate factory: score = w1·capabilityFit + w2·affinity(prefers, repo-stickiness) + w3·(1/load) + w4·costFit(budget) + w5·health − w6·starvationPenalty
Tie-break: highest priority job first; then oldest; then lowest cost class.
Assign atomically → create the lease under an optimistic-concurrency guard (_etag/If-Match or conditional insert keyed by jobId) with a fresh leaseEpoch; on conflict another factory won → retry the next candidate. Set job assigned, decrement station/seat capacity, bump fairness counter. Use coordinator-authoritative timestamps only.
Preemption (P3+): a critical job may pause a low-priority job at a needed station (checkpoint + requeue, bumping the preempted job's leaseEpoch).

Phasing: Phase 2 ships the deterministic filter + atomic-assign core (fixed weights). Phase 3 adds tunable weights, preemption, and the explainability UI. Phase 5 learns the weights (§14).

Implement pure, unit-testable scoring function (no I/O) with configurable weights.
Hard-filter correctness: never assign a job to a factory missing a required capability.
Affinity/stickiness: same-repo jobs prefer the factory that has the warm checkout (lock-aware).
Fairness: no factory or product starves under sustained load (counter + penalty).
Explainability: every assignment records why (matched caps, score breakdown) in the event log.
Determinism: same inputs → same decision (seeded tie-breaks) for testability.
Define factory health ∈ [0,1] = f(heartbeat freshness, recent run failure-rate, resource pressure); factories below a health floor are filtered out, not merely down-weighted.
Station/seat capacity: a factory's free stations = min(host slots, per-engine seat limits) (e.g. licensed Devin/Claude seats); the scheduler never over-subscribes a seat-limited engine.
Distributed lock: the Phase-0 local lock becomes a coordinator-held lock so same-lock jobs serialize across the whole fleet (prevents two factories pushing the same repo concurrently).
Acceptance: scenario fixtures (10+) produce expected assignments incl. starvation, capability-miss, seat-exhaustion, unhealthy-factory exclusion, and budget-exceed; a concurrent-claim race test proves exactly one winner.
Verify gate: router unit suite ≥ 95% branch coverage on the scoring/filter core; atomic-claim race test.
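A minimal sketch of the filter-then-score core, assuming every signal is already normalized to [0, 1] and supplied by the caller. The names `CandidateFactory`, `RouterWeights`, `scoreFactory`, and `pickFactory` are hypothetical. One deliberate deviation from the formula as written: the load term uses 1/(1+load) so an idle factory (load 0) does not divide by zero. Ties are broken by factory id to keep the decision deterministic.

```typescript
interface RouterWeights {
  capabilityFit: number;
  affinity: number;
  load: number;
  costFit: number;
  health: number;
  starvation: number;
}

interface CandidateFactory {
  id: string;
  meetsHardCaps: boolean;    // outcome of the hard capability filter
  freeStations: number;      // free slots for a compatible engine
  capabilityFit: number;
  affinity: number;
  load: number;              // running workers, >= 0
  costFit: number;
  health: number;
  starvationPenalty: number;
}

// Pure scoring function: no I/O, weights passed in.
function scoreFactory(c: CandidateFactory, w: RouterWeights): number {
  return (
    w.capabilityFit * c.capabilityFit +
    w.affinity * c.affinity +
    w.load * (1 / (1 + c.load)) +
    w.costFit * c.costFit +
    w.health * c.health -
    w.starvation * c.starvationPenalty
  );
}

function pickFactory(
  candidates: CandidateFactory[],
  w: RouterWeights,
  healthFloor: number
): { id: string; score: number } | null {
  // Hard filters first: capability miss, no free station, or unhealthy => excluded.
  const scored = candidates
    .filter((c) => c.meetsHardCaps && c.freeStations > 0 && c.health >= healthFloor)
    .map((c) => ({ id: c.id, score: scoreFactory(c, w) }));
  scored.sort((a, b) => b.score - a.score || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  return scored.length > 0 ? scored[0] : null;
}

const equalWeights: RouterWeights = {
  capabilityFit: 1, affinity: 1, load: 1, costFit: 1, health: 1, starvation: 1,
};
const fleetCandidates: CandidateFactory[] = [
  { id: "mac", meetsHardCaps: false, freeStations: 2, capabilityFit: 1, affinity: 1, load: 0, costFit: 1, health: 1, starvationPenalty: 0 },
  { id: "ubuntu", meetsHardCaps: true, freeStations: 1, capabilityFit: 1, affinity: 0, load: 1, costFit: 0.5, health: 0.9, starvationPenalty: 0 },
  { id: "windows", meetsHardCaps: true, freeStations: 1, capabilityFit: 1, affinity: 1, load: 0, costFit: 1, health: 0.2, starvationPenalty: 0 },
];
const routed = pickFactory(fleetCandidates, equalWeights, 0.5);
```

Here `mac` would score highest but fails the capability filter, and `windows` falls below the health floor, so `ubuntu` is chosen even though its raw score is lower.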

8. Factory model & registration (feature)

Each machine runs a factory agent (the evolved agent-queue runner) that registers, heartbeats, claims jobs, and reports events.

Capability auto-detection at boot: OS, installed engines (devin/claude/codex/copilot), tool probes (xcode, figma-cli, docker, gpu), node/pnpm versions, available creds (presence only, never values).
Enrollment / bootstrap trust: first registration authenticates with a one-time enrollment secret (or an operator-issued platform JWT). The factory then receives a scoped, rotatable factory token (jose JWT); decommission = revoke. No standing shared secret in the queue.
Registration: POST /fleet/factories with descriptor → receives a factory id + token.
Heartbeat: periodic PUT /fleet/factories/:id/heartbeat (load, free stations, health). A coordinator lease reaper (not Cosmos TTL) sweeps expiresAt < now and reclaims, bumping leaseEpoch so the dead/zombie worker is fenced; a factory missing N heartbeats is marked offline and all its leases reclaimed.
Claim loop: POST /fleet/leases/claim advertising capabilities/free stations; atomic (exactly one winner, §4); receives a job + lease TTL + leaseEpoch. Use claim backoff / long-poll to bound Cosmos RU under many idle factories (see §22); broker push replaces polling in P4.
Report: stream stage/log/event back (POST /fleet/runs/:id/events), echoing leaseEpoch (stale epoch → 409, worker self-aborts); renew lease while alive.
Environment prep: before verify, the factory ensures the repo's package dependencies are installed (cold checkout → pnpm install); prep time counts against budget.wall.
Graceful drain: factory can stop claiming, finish in-flight, deregister.
Acceptance: a factory enrolls, claims a matching job, heartbeats, completes; a killed factory's job is reclaimed by another within the lease TTL and the killed worker's late report is rejected by fencing.
Verify gate: factory-agent integration test against a mock coordinator; crash-recovery + fencing-rejection test.
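The reaper behaviour described above (sweep expired leases, reclaim everything held by an offline factory, bump the epoch) can be sketched as two pure functions. The record shapes and the choice to return a reclaimed job to `queued` are assumptions made for illustration; `now` is the coordinator's clock.

```typescript
interface LeasedJob {
  jobId: string;
  stage: string;
  holder: string | null;    // factory id, or null when unleased
  leaseEpoch: number;
  expiresAt: number | null; // coordinator clock, ms
}

interface FactoryBeat {
  id: string;
  lastHeartbeat: number;    // coordinator clock, ms
}

// A factory is offline once it has missed more than `maxMissed` heartbeat intervals.
function offlineFactoryIds(beats: FactoryBeat[], now: number, intervalMs: number, maxMissed: number): string[] {
  return beats.filter((b) => now - b.lastHeartbeat > intervalMs * maxMissed).map((b) => b.id);
}

// Reclaim jobs whose lease expired or whose holder is offline.
// Bumping leaseEpoch fences any late report from the old worker.
function reapLeases(jobs: LeasedJob[], offline: string[], now: number): LeasedJob[] {
  return jobs.map((j) => {
    if (j.holder === null || j.expiresAt === null) return j;
    const expired = j.expiresAt < now;
    const holderOffline = offline.indexOf(j.holder) !== -1;
    if (!expired && !holderOffline) return j;
    return { jobId: j.jobId, stage: "queued", holder: null, leaseEpoch: j.leaseEpoch + 1, expiresAt: null };
  });
}

const sweepNow = 10000;
const offlineIds = offlineFactoryIds(
  [{ id: "mac", lastHeartbeat: 5000 }, { id: "ubuntu", lastHeartbeat: 9500 }],
  sweepNow, 1000, 3
);
const afterSweep = reapLeases(
  [
    { jobId: "a", stage: "building", holder: "ubuntu", leaseEpoch: 1, expiresAt: 9000 },
    { jobId: "b", stage: "building", holder: "ubuntu", leaseEpoch: 4, expiresAt: 12000 },
    { jobId: "c", stage: "building", holder: "mac", leaseEpoch: 7, expiresAt: 12000 },
  ],
  offlineIds, sweepNow
);
```

Job `a` is reclaimed because its lease expired, job `c` because its factory went offline, and job `b` is left untouched.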

9. Coordination architecture (decision + path)

Three transports were evaluated. Decision: platform-service-native coordinator is the spine; git-queue stays for the offline edge; broker added only at scale.

Option	Pros	Cons	Verdict
(a) Git-synced queue (evolve folders)	zero infra, audit-by-commit, offline	weak/racy leasing, latency, merge churn	Edge/offline only
(b) Coordinator service (platform-service module)	real leases, fairness, observability, reuses auth/Cosmos/productId	a service to run	Chosen spine (P2)
(c) Message broker (NATS/Redis/SQS)	scale, backpressure, push dispatch	most moving parts/ops	P4 when throughput demands

Document the decision + rationale in-repo (this section is the canonical record).
Define the claim/lease protocol once; both git-queue (poll) and service (API) implement it.
Split-brain / network-partition safety: a partitioned factory may keep running and even git push. idempotency-key dedupes submits but cannot undo side effects. Mitigation is fencing: the coordinator rejects ship/merge reports from a stale leaseEpoch, and the distributed lock (§7) prevents a reclaimed job's twin from pushing the same repo. Residual risk (a stale push to a feature branch) is contained by the PR-merge ship gate (§10) and surfaced for human triage.
Offline-degrade: a factory cut off from the coordinator falls back to its local git-queue and reconciles on reconnect; on reconnect it presents its leaseEpoch — if reclaimed, its results are quarantined, not auto-merged.
Poll cost: bound claim-loop RU via long-poll/backoff (§22); migrate to broker push at P4.
Acceptance: the same job manifest runs identically through the bash/git path and the service path; a simulated partition does not double-merge (fencing test).
Verify gate: contract test asserting protocol parity (git vs service) + partition/fencing test.

10. tracker-web / platform-service integration (committed path)

Layering: tracker = WHAT/WHY (plan, intake, prioritize, roadmap, votes) · gigafactory = HOW (execute) · platform-service = shared brain · agent-queue runner = offline edge. Grounded in the real tracker-service model (Item: type bug/feature/task, status open/in_progress/done/closed/wont_fix, priority, labels, assignee, source incl. auto_detected, votes, comments, public roadmap) and the tracker-web /api/tracker/[...path] proxy pattern.

Phase 1 — Adapter (no new infra)

task → job: a tracker Item of type: task (e.g. assignee: @agent or label agent:run) is exported to a job .md (manifest mapped: title/description → body, priority → priority, labels → capabilities/profile hints).
job → tracker: lifecycle events post back as status updates + comments — building → status in_progress + comment "started on factory X"; shipped → done + comment with commit SHAs / PR link / verify results; failed → comment with reason (status stays in_progress for human triage).
Idempotency: re-running the adapter for the same item doesn't create duplicate jobs (idempotency-key = item id + content hash).
Adapter is a thin script/CLI (aq from-tracker ITEM-789) + optional poller.
Acceptance: filing a tracker task, marking it agent:run, results in a queued job; on ship, the item flips to done with a SHA comment.
Verify gate: adapter e2e against a tracker-service test instance (or mock); round-trip assertion.
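The dedupe rule the adapter relies on (the idempotency-key semantics from §5) reduces to a small decision function. This sketch assumes the caller has already looked up any prior job with the same key and computed a content hash; `PriorJob`, `SubmitDecision`, and `decideSubmit` are hypothetical names.

```typescript
interface PriorJob {
  jobId: string;
  contentHash: string;
  stage: string;
}

type SubmitDecision =
  | { action: "create" }
  | { action: "noop"; jobId: string }       // identical resubmit returns the existing job
  | { action: "supersede"; jobId: string }  // prior job not started yet, replace it
  | { action: "reject"; httpStatus: number };

function decideSubmit(prior: PriorJob | undefined, contentHash: string): SubmitDecision {
  if (prior === undefined) return { action: "create" };
  if (prior.contentHash === contentHash) return { action: "noop", jobId: prior.jobId };
  if (prior.stage === "queued" || prior.stage === "blocked") {
    return { action: "supersede", jobId: prior.jobId };
  }
  return { action: "reject", httpStatus: 409 }; // same key, different content, already in flight
}

const queuedPrior: PriorJob = { jobId: "job-1", contentHash: "abc", stage: "queued" };
const buildingPrior: PriorJob = { jobId: "job-2", contentHash: "abc", stage: "building" };
const sameContent = decideSubmit(queuedPrior, "abc");
const changedWhileQueued = decideSubmit(queuedPrior, "def");
const changedWhileBuilding = decideSubmit(buildingPrior, "def");
```

Retries are not submits, so they would bypass this function entirely and create a new run under the existing job.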

Stage → tracker status mapping (tracker's enum is coarser than the fleet's; keep fine-grained stage in a label + comment so no detail is lost):

Fleet stage	Tracker `status`	Extra
`queued` / `assigned` / `blocked`	`in_progress`	label `fleet:<stage>`
`building` / `review` / `testing`	`in_progress`	label `fleet:<stage>` + progress comment
`shipped`	`done`	comment with SHA(s)/PR link/verify result
`failed` / `dead_letter`	`in_progress` + label `needs-triage`	never auto-`closed`/`wont_fix` (humans decide)
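The mapping table translates directly into a lookup function. The sketch below is illustrative only: `toTrackerUpdate` and the `TrackerUpdate` shape are assumed names, and the comment flag for `failed` follows the Phase 1 bullet that a failure posts a comment with the reason.

```typescript
type FleetStage =
  | "queued" | "assigned" | "blocked"
  | "building" | "review" | "testing"
  | "shipped" | "failed" | "dead_letter";

interface TrackerUpdate {
  status: "in_progress" | "done"; // the only two statuses the fleet ever sets
  labels: string[];
  postComment: boolean;
}

function toTrackerUpdate(stage: FleetStage): TrackerUpdate {
  switch (stage) {
    case "queued":
    case "assigned":
    case "blocked":
      return { status: "in_progress", labels: ["fleet:" + stage], postComment: false };
    case "building":
    case "review":
    case "testing":
      return { status: "in_progress", labels: ["fleet:" + stage], postComment: true };
    case "shipped":
      return { status: "done", labels: [], postComment: true }; // SHA(s)/PR link/verify result
    default:
      // failed / dead_letter: stay in_progress for human triage, never auto-close.
      return { status: "in_progress", labels: ["needs-triage"], postComment: true };
  }
}

const shippedUpdate = toTrackerUpdate("shipped");
const failedUpdate = toTrackerUpdate("dead_letter");
```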

Ship semantics (PR flow): shipped = change merged to target branch with CI green (default), OR pr-opened when review-policy defers merge to humans/CI — configurable per profile. This honors the non-goal that CI still gates merges (§3); the agent never bypasses branch protection.

Phase 2 — Native spine

Stand up a fleet (a.k.a. orchestrator) module inside platform-service, sibling to tracker-service: pattern types.ts → repository.ts → routes.ts, ESM, Cosmos, productId, req.log.
Endpoints: jobs CRUD, claim/lease, events/report, factories register/heartbeat, profiles, stats.
Runners (bash + any) become API clients of this module; tracker adapter calls it directly.
Acceptance: a job submitted via the module is claimed by a real factory and shipped, with all state in Cosmos.
Verify gate: module test suite (repository + routes) using the shared @bytelyst/testing inject helpers.

Phase 3 — Unified control plane

Add a Fleet surface to tracker-web reusing auth/Primitives/DataTable/product switcher: fleet map (factories + load/health), job table, job DAG, live log streaming, lease/heartbeat status, cost burndown, approve/ship buttons.
Streaming caveat (correctness): live logs must not use the existing buffering catch-all proxy /api/tracker/[...path] — it does res.text() and would never stream. Use a dedicated Next.js Route Handler returning a ReadableStream (SSE) or a direct SSE/WebSocket to platform-service. Full logs are shipped to blob storage (§17); the endpoint serves stored tail + live append.
The Node TUI dashboard becomes a thin client of the same /fleet API (parity with web).
Acceptance: an operator can watch all factories + tail any job log + ship from the browser.
Verify gate: web e2e (Playwright) covering fleet map render, live log, and a ship action.

11. Lifecycle & gates at scale (feature)

Canonical stages enforced server-side: queued → assigned → building → review → testing → shipped (+ blocked, failed, dead_letter); transitions validated (illegal transition → 409).
Per-profile default verify; per-job override; verify runs at the factory, result reported as an event.
Human gates: review-policy routes to reviewers; multi-reviewer support (P3).
Dead-letter: after retry.max exhausted, job → dead_letter with full diagnostics; never silently dropped.
Backpressure: when no factory can take more, jobs stay queued (no thrash); SLA timers visible.
Ship semantics are profile-configurable (merged+green vs pr-opened, §10); shipped is terminal-success, dead_letter terminal-failure; blocked (unmet deps) is distinct from queued.
Retry vs idempotency: a retry creates a new fleet_runs attempt under the same job/idempotency-key (never a duplicate job); backoff honored; retry.on filters which failure classes retry.
Acceptance: a perpetually-failing job lands in dead_letter after configured retries; a passing one auto-advances to testing then waits for human ship; an illegal transition is rejected.
Verify gate: lifecycle state-machine unit tests (all transitions + illegal-transition rejection + retry/dead-letter path).
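Server-side transition validation can be expressed as a table lookup. The document fixes the stage names but not the full edge set, so the `ALLOWED_TRANSITIONS` table below is an assumption (for example, it lets a reclaimed `building` job return to `queued`); only the shape of the check and the 409 on an illegal move come from the text above.

```typescript
type LifecycleStage =
  | "queued" | "assigned" | "blocked"
  | "building" | "review" | "testing"
  | "shipped" | "failed" | "dead_letter";

// Assumed edge set; the real state machine may allow more or fewer moves.
const ALLOWED_TRANSITIONS: Record<LifecycleStage, LifecycleStage[]> = {
  queued: ["assigned", "blocked"],
  blocked: ["queued"],                     // deps satisfied
  assigned: ["building", "queued"],        // queued again if the lease is reclaimed
  building: ["review", "failed", "queued"],
  review: ["testing", "failed"],
  testing: ["shipped", "failed"],
  failed: ["queued", "dead_letter"],       // retry, or give up after retry.max
  shipped: [],                             // terminal success
  dead_letter: [],                         // terminal failure
};

type TransitionResult =
  | { ok: true; stage: LifecycleStage }
  | { ok: false; httpStatus: 409 };

function applyTransition(from: LifecycleStage, to: LifecycleStage): TransitionResult {
  if (ALLOWED_TRANSITIONS[from].indexOf(to) === -1) {
    return { ok: false, httpStatus: 409 }; // illegal transition
  }
  return { ok: true, stage: to };
}

const legalMove = applyTransition("testing", "shipped");
const illegalMove = applyTransition("queued", "shipped");
```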

12. Security, safety & governance (feature — critical with `yolo`/dangerous)

Secret isolation: creds live on each factory (env/keychain), never in the queue, manifest, logs, or Cosmos. Factory advertises presence of a cred capability, not the value.
Scoped git tokens per factory/repo; least-privilege; rotation documented.
Push policy: protected branches; agents push to feature branches + open PRs by default; direct-to-main gated by profile/flag.
Blast-radius guardrail: enforce allowed-scope — pre-flight + post-run diff check; out-of-scope changes block the ship gate.
Budget kill-switch: exceed budget (usd/tokens/wall) → pause worker, alert, require human resume.
Supply-chain safety: edits to shared @bytelyst/* packages require reviewer profile + human gate (never auto-ship).
Audit trail: append-only event log per job (who/what/when/where/cost); immutable.
Corp network/proxy: honor NETWORK/proxy + truststore conventions on factories that need them.
Kill switch (global): one command/flag halts all claiming fleet-wide (incident response).
Acceptance: a job attempting an out-of-scope edit is blocked at the gate; a budget overrun pauses and alerts; no secret ever appears in any persisted artifact (scanner test).
Verify gate: security test suite incl. a secret-leak scanner over logs/meta + scope-enforcement test.
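The allowed-scope diff check can be sketched with a deliberately simplified glob matcher. It assumes only `*` (within one path segment) and `**` (across segments) need support, which covers patterns like `backend/**`; a production check would more likely use an established glob library. `globToRegExp` and `outOfScope` are hypothetical helper names.

```typescript
// Translate a simple glob into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  let out = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*") {
      if (glob.charAt(i + 1) === "*") {
        out += ".*";    // "**" crosses directory boundaries
        i++;
      } else {
        out += "[^/]*"; // "*" stays within one path segment
      }
    } else if ("\\^$+?.()|{}[]".indexOf(ch) !== -1) {
      out += "\\" + ch; // escape regex metacharacters
    } else {
      out += ch;
    }
  }
  return new RegExp("^" + out + "$");
}

// Return the changed paths that fall outside every allowed-scope pattern.
function outOfScope(changedPaths: string[], allowedScope: string[]): string[] {
  const patterns = allowedScope.map(globToRegExp);
  return changedPaths.filter((p) => !patterns.some((re) => re.test(p)));
}

const scopeOffenders = outOfScope(
  ["backend/src/auth.ts", "web/app/page.tsx", "packages/ui/index.ts"],
  ["backend/**", "packages/**"]
);
```

A non-empty result is what would block the ship gate (or, in Phase 1, produce only a warning).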

13. Data model (Cosmos containers, P2+)

Each container partitioned sensibly; every doc has productId.

fleet_jobs (pk /productId) — manifest snapshot, current stage, idempotency-key, tracker-item link.
fleet_runs (pk /jobId) — one per execution attempt: factory, engine, profile snapshot, timings, cost, exit, verify result.
fleet_leases (pk /jobId) — holder factory, expiresAt, leaseEpoch (fencing), renewals. Reclaim via a coordinator reaper that scans expiresAt < now — Cosmos TTL only garbage-collects stale rows, it cannot trigger reclaim logic. Claim guarded by _etag/If-Match.
fleet_factories (pk /productId) — descriptor, capabilities, health, load, last heartbeat, seat limits.
fleet_profiles (pk /productId) — versioned profile snapshots (immutable per version).
fleet_events (pk /jobId) — append-only audit/event stream (stage changes, log pointers, cost ticks, scheduler decisions).
fleet_artifacts (pk /jobId) — pointers to blob-stored logs + artifacts (coverage, screenshots, build output). Large logs live in @bytelyst/blob, never inline in Cosmos (doc-size + RU limits).
Relate to existing tracker Item via tracker-item (no duplication of planning data).
Optimistic concurrency (_etag) on every job stage transition + lease claim to prevent lost updates / double-assignment.
Indexing/RU: the claim query is hot — index stage, priority, capabilities; avoid cross-partition fan-out; provision RU/s per §22.
Acceptance: repository CRUD + query tests per container; atomic-claim race test (N concurrent claimers → exactly one wins); reaper-reclaim + fencing-rejection test; lease-expiry verified via reaper (not TTL).
Verify gate: repository unit/integration tests (memory + Cosmos provider via DB_PROVIDER).

14. Phased build roadmap (checklists)

Each phase: Goal → checklist → Exit criteria. Don't start a phase until the prior phase's Exit criteria are green. Tick boxes here as the canonical progress.

Phase 1 — Manifest + profiles + capabilities + tracker adapter (single host)

Goal: richer single-host runner that understands profiles/capabilities and bridges to tracker — no distributed infra yet.

Extend agent-queue.sh frontmatter parsing for all new manifest fields (§5), defaulted + backward-compatible.
Add profiles/ directory + profile resolution (persona injection, default verify/caps/scope) (§6).
Local capability detection + a job/factory capability match check before launch (§8 subset).
priority ordering in the inbox pick (replace pure FIFO with priority-then-age).
deps (DAG) blocking on a single host; idempotency-key dedupe on add.
retry with backoff into failed/requeue; budget.wall enforced (extends timeout).
allowed-scope guardrail (warn-only this phase) + post-run diff report.
Tracker adapter aq from-tracker <ITEM> + aq to-tracker event poster (§10 P1).
Dashboard shows profile + priority + capability tags + tracker-item link.
Update selftest.sh with: manifest parse fixtures, profile resolution, priority order, dep-block, idempotency, adapter round-trip (mock).
Update README + this doc's progress table.
Exit criteria: all boxes ✅; selftest.sh green; a tracker task → executed → tracker done with SHA comment, fully on one host; no regression to Phase-0 .md files.
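The Phase 1 inbox pick (priority-then-age, with unmet deps held back) could be sketched as below. The production runner is bash; TypeScript is used here only to show the ordering rule. `InboxJob`, `PRIORITY_RANK`, and `pickNextJob` are assumed names, and deps are treated as satisfied only at `shipped`, the default mode.

```typescript
type JobPriority = "critical" | "high" | "medium" | "low";

interface InboxJob {
  id: string;
  priority: JobPriority;
  submittedAt: number; // ms; smaller means older
  deps: string[];
}

const PRIORITY_RANK: Record<JobPriority, number> = { critical: 0, high: 1, medium: 2, low: 3 };

// Pick the highest-priority job whose deps have all shipped; oldest wins ties.
function pickNextJob(inbox: InboxJob[], shippedIds: string[]): InboxJob | null {
  const ready = inbox.filter((j) => j.deps.every((d) => shippedIds.indexOf(d) !== -1));
  const ordered = ready.slice().sort(
    (a, b) => PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority] || a.submittedAt - b.submittedAt
  );
  return ordered.length > 0 ? ordered[0] : null;
}

const inboxJobs: InboxJob[] = [
  { id: "a", priority: "low", submittedAt: 1, deps: [] },
  { id: "b", priority: "high", submittedAt: 5, deps: [] },
  { id: "c", priority: "high", submittedAt: 3, deps: [] },
  { id: "d", priority: "critical", submittedAt: 9, deps: ["x"] },
];
const pickBeforeDep = pickNextJob(inboxJobs, []);   // "d" is blocked, so the older high job wins
const pickAfterDep = pickNextJob(inboxJobs, ["x"]); // dep shipped, the critical job goes first
```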

Phase 2 — Coordinator as platform-service module + Cosmos + multi-factory leasing

Goal: the service spine; ≥2 real factories executing in parallel via leases.

Scaffold fleet/orchestrator module in platform-service (types/repository/routes, Zod, ESM, productId).
Cosmos containers (§13) + repository layer (memory + Cosmos providers).
Atomic claim (optimistic concurrency / _etag) + lease reaper + fencing (leaseEpoch) endpoints (§4/§8/§9) — not Cosmos-TTL-driven reclaim.
Port agent-queue runner to a factory agent API client (enroll/register/heartbeat/claim/report, fencing-aware) while keeping git-queue fallback.
Scheduler/router core (§7) as a pure module (fixed weights) + wired into atomic assignment.
Tracker adapter calls the module directly (not just file export).
Auth: factory enrollment + scoped rotatable tokens; secret isolation enforced (§12 subset).
Feature flags (fleet.enabled, fleet.route_via_service) + shadow/dual-run vs P1 before cutover (§21).
Module test suite (repository + routes via @bytelyst/testing); atomic-claim race, crash-recovery, fencing-rejection, reaper-reclaim tests.
Two-factory demo (e.g. mac + ubuntu) running 3 parallel jobs end-to-end.
Exit criteria: all boxes ✅; pnpm --filter @lysnrai/platform-service test green; killing a factory mid-job → another reclaims and completes and the dead worker's late report is fenced; concurrent claimers never double-assign; all state in Cosmos with productId; flag-off rollback verified (§21).

Phase 3 — Fleet control plane in tracker-web + DAG + budgets + scoring router

Goal: one browser control plane; smart routing + budgets live.

fleet API client in tracker-web (reuse /api/tracker-style proxy → /fleet).
Fleet map page (factories, load, health, capabilities) on @bytelyst/* components.
Job table + job detail + DAG view; live log via SSE; approve/ship/reject/requeue actions.
Cost burndown + budget kill-switch UI; multi-reviewer routing.
Scoring router with configurable weights + explainability surfaced in UI.
Preemption of low-priority by critical jobs (checkpoint + requeue).
TUI dashboard re-pointed at /fleet API (parity).
Web e2e (Playwright): fleet map, live log, ship, budget-pause.
Exit criteria: all boxes ✅; web verify (typecheck+lint+test+e2e) green; an operator runs the whole 3-repo parallel workload from the browser, including a budget pause + resume.

Phase 4 — Message bus + autoscaling + cross-OS capability marketplace

Goal: scale-out and elasticity.

Introduce broker (NATS/Redis) for push dispatch + backpressure; coordinator publishes, factories subscribe by capability.
Autoscaling hooks (spin ephemeral factories: cloud VM / container) keyed to queue depth + SLA.
Capability "marketplace": jobs requiring rare caps (xcode/figma/gpu) routed to the few factories that have them; queueing fairness across products.
Load + chaos test suite (factory churn, broker outage, thundering herd).
Exit criteria: all boxes ✅; sustained N×throughput vs Phase 3 under load test; graceful degradation on broker outage (fallback to poll).

Phase 5 — Self-optimizing / learned routing

Goal: the scheduler learns from history to cut time/cost and raise first-pass success.

Capture outcome features per run (engine, profile, repo, duration, cost, verify pass, human-edit rate).
Offline eval harness comparing learned vs heuristic routing on historical data.
Shadow/A-B rollout with guardrails; auto-tune scoring weights.
Recommendations surfaced ("route NomGap UX jobs to claude on mac-2: 23% faster, 11% cheaper").
Exit criteria: all boxes ✅; learned router beats heuristic on the eval set without regressing safety gates; A/B shows measurable improvement on a target metric.

15. Cross-cutting feature catalog (quick index)

| Feature | First phase | Section |
| --- | --- | --- |
| Evolved job manifest | P1 | §5 |
| Profiles (persona + capability) | P1 | §6 |
| Capability matching | P1→P2 | §6/§8 |
| Priority + SLA | P1 | §5/§7 |
| DAG dependencies | P1→P3 | §5/§11 |
| Idempotency / dedupe | P1 | §5 |
| Retry + dead-letter | P1→P2 | §11 |
| Budgets + kill-switch | P1(wall)→P3 | §5/§12 |
| Scheduler/router scoring | P2→P3 | §7 |
| Factory registration/heartbeat/lease | P2 | §8 |
| Coordinator (platform-service module) | P2 | §9/§10 |
| Cosmos data model | P2 | §13 |
| Tracker bi-directional sync | P1→P2 | §10 |
| Web control plane + SSE logs | P3 | §10/§17 |
| Security/scope/secret isolation | P1→P2 | §12 |
| Broker + autoscaling | P4 | §14 |
| Learned routing | P5 | §14 |
| Atomic claim + fencing + distributed lock | P2 | §4/§7/§9 |
| Rollout / rollback / feature flags | P2→ | §21 |
| Capacity planning & RU/cost | P2→ | §22 |
| Ownership & RACI / on-call | all | §23 |
| Work hierarchy & composite delegation (roadmap/epic) | P3 (manual) → P5 (planner) | §24 |

16. Definition of Done — the "100% accuracy" rubric

A feature/phase is not done until every item below is true (this is the bar for "100% end-to-end"):

Functionality: acceptance criteria met; happy path + documented edge cases handled.
Tests: unit + integration written first or alongside, all green; no weakened/deleted tests; coverage targets met (router ≥95% core).
Verify gate: the phase's named gate command passes locally (and in CI where applicable).
Idempotency & recovery: re-runs are safe; crash mid-step recovers (lease/idempotency).
Security review: secret-leak scan clean; scope guardrail honored; least-privilege tokens.
Observability: events/logs/metrics emitted; failures are diagnosable from the control plane.
Docs: this roadmap's checkboxes ticked; README/AGENTS updated; manifest/profile docs current.
Backward-compat: existing .md/Phase-0 behavior unbroken (regression check).
Drift checks: shared-infra templates (.npmrc, docker-prep) untouched/synced; conventional commits.
Code hygiene: no console.log/print in service code; use req.log/os.Logger; ESM imports carry the .js extension.

17. Observability & control plane details

Log transport/storage: factory ships logs to blob (@bytelyst/blob); fleet_events carries pointers + a recent-tail buffer. The control plane serves stored tail + live append (via the streaming route, not the buffering proxy — §10).
Live logs via SSE (single stream contract) from the streaming endpoint to web/TUI.
Metrics: queue depth, blocked count, assign latency, claim-loop RU/s, run duration, verify pass-rate, cost, factory utilization, fairness, reclaim/fencing-rejection counts.
Alerting: stall (no log N min), failure spikes, budget breach, factory offline, dead-letter, claim-race anomalies, RU throttling (Cosmos 429s).
Tracing: a job's full timeline (queued→…→shipped) reconstructable from fleet_events (immutable, ordered).
Cost burndown per job/product/day with budget overlays.
SLOs defined + dashboarded (see §19 targets); error budget tracked per SLO.

18. Risks & gaps explicitly tracked (expert call-outs)

Duplicate execution across transports (git fallback + service) — idempotency-key (submit) + atomic lease (assign) + fencing token (side-effect) + distributed lock (push).
Crash recovery — coordinator lease reaper + fencing (not Cosmos TTL); checkpoint long jobs where engines allow.
Split-brain / partition — fencing rejects stale leaseEpoch writes; reclaimed-job results quarantined, not auto-merged (§9).
Shared-package conflicts — two jobs editing @bytelyst/* simultaneously → fleet-wide lock + reviewer gate.
Starvation/fairness — per-product + per-factory counters with penalty.
Cost runaway — budget.wall hard ceiling everywhere; usd/tokens best-effort (provider metering) + global kill switch.
Cosmos RU throttling (429) — hot claim path; bound via long-poll/backoff + indexing (§13/§22); broker offload at P4.
Clock skew — coordinator-authoritative timestamps for all lease/SLA math (§4).
Tool-version drift / reproducibility — record engine + tool versions per run; pin where possible.
Windows quirks — path/shell differences in the factory agent; capability-gate Windows-only work.
Human-review bottleneck — auto-verify as much as possible; batch review UI; reviewer routing.
Result capture beyond commits — artifacts (coverage, screenshots, build logs) attached to runs.
Secret sprawl — never in queue/manifest/logs/Cosmos; presence-only capabilities.
Data retention — event/log retention + archival policy (extend today's clean).
Engine API churn — engines mapped in one place (build_agent_cmd); capability matrix versioned.

19. Success metrics

Each metric has a provisional SLO target (tune with real data; tracked with an error budget):

| Dimension | Metric | Provisional SLO target |
| --- | --- | --- |
| Throughput | jobs shipped/day; parallel utilization | utilization ≥ 60% under backlog |
| Quality | % auto-verified; first-pass success; escaped-defect; post-agent human-edit rate | first-pass ≥ 70%; escaped-defect < 2% |
| Speed | assign latency; time queued→shipped (excl. human gate) | assign p95 < 5s; queue-wait p95 < 2m at target load |
| Cost | $/shipped job; budget-breach rate | budget-breach < 1% of jobs |
| Reliability | lease-reclaim success; dead-letter rate; factory uptime; double-execution incidents | reclaim success ≥ 99.9%; double-merge = 0; dead-letter < 5% |
| Fairness | max/min product wait-time ratio | ratio < 3× |
| Correctness | atomic-claim violations; fencing rejections functioning | claim violations = 0 |

Targets are starting points; the Accountable owner for each area (§23) ratifies per-phase SLOs before that phase's exit.

20. Open questions

Copilot headless feasibility as an engine/station (CLI/automation surface?).
Who owns merge/push authority — agents open PRs only, or auto-merge on green for low-risk profiles?
Multi-user/tenant: per-user queues + RBAC in the control plane?
On-call/ownership for the fleet (alerts routing, runbooks)?
Cloud factory provisioning (Phase 4) — which provider/runtime, cost guardrails?
Profile authorship/governance — who can create/edit profiles, and review of persona prompts?

21. Rollout, rollback & data migration

Each phase ships behind controls so it can be turned off without losing work.

Feature-flagged rollout: gate each phase's new path behind a platform feature flag (fleet.enabled, fleet.route_via_service, fleet.tracker_sync); default off; enable per-product first.
Dual-run / shadow: P2 coordinator runs in shadow (assign decisions logged, not executed) alongside the P0/P1 path before cutover; compare decisions.
Cutover is reversible: a factory can fall back from service-claim to git-queue via flag; no schema-destructive step on the rollback path.
Data migration: introducing Cosmos containers (P2) is additive — no migration of existing tracker data; backfill is read-only (link tracker-item, don't mutate). Container creation is idempotent (registered in cosmos-init).
Backward-compat gate: every phase re-runs Phase-0 selftest.sh + a corpus of legacy .md files (regression).
Rollback drill: each phase's exit includes a tested rollback (flag off → prior behavior, in-flight jobs drain or requeue cleanly).
Acceptance: flipping fleet.* flags off returns the system to the prior phase's behavior with zero data loss; in-flight jobs either complete or requeue.
Verify gate: rollout/rollback drill documented + a flag-off regression run is green.
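
The reversible cutover can be pictured as a pure function from flag state to claim transport. This is a sketch under an assumed reading of the flags (both `fleet.enabled` and `fleet.route_via_service` must be on to use the service path); the `FlagSet` type and the function names are hypothetical.

```typescript
// Flag names are from this section; how they combine is an assumption made for illustration.
type FlagName = "fleet.enabled" | "fleet.route_via_service" | "fleet.tracker_sync";
type FlagSet = Partial<Record<FlagName, boolean>>;
type ClaimPath = "git-queue" | "service";

// Flags default to off: a missing flag is treated the same as false.
function isOn(flags: FlagSet, name: FlagName): boolean {
  return flags[name] === true;
}

// The service path is used only when the fleet is enabled AND routing via the
// service is enabled; turning either flag off falls back to the git queue.
function claimPath(flags: FlagSet): ClaimPath {
  return isOn(flags, "fleet.enabled") && isOn(flags, "fleet.route_via_service")
    ? "service"
    : "git-queue";
}
```

Because the default branch is the prior behavior, an empty or partially populated flag set can only ever select the git queue, which matches the "default off" rule above.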

22. Capacity planning & cost

Concurrency model: fleet throughput = Σ factory free-stations, bounded by per-engine seat limits (e.g. N Devin seats) — document seat inventory per engine before P2.
Cosmos RU budgeting: the claim/heartbeat paths are the hot loops. Estimate RU/s = (factories × claim-poll rate × query RU) + (factories × heartbeat rate × upsert RU); pick long-poll interval to keep steady-state RU within a provisioned budget; enable autoscale RU with a ceiling + 429 alerting.
Polling vs push: at F factories the poll RU grows linearly — define the F threshold that triggers the P4 broker migration.
Blob storage: logs/artifacts sizing + lifecycle (hot → cool → delete) per retention policy (§18).
Factory sizing: per-OS resource baseline (CPU/RAM/disk for N concurrent agent sessions + warm checkouts); disk pressure as a health input.
Cost guardrails: per-product spend caps + alerts; ties to budget and the global kill-switch.
Acceptance: a documented capacity sheet (seats, RU/s, blob GB, factory specs) sized for the target steady-state + 2× burst.
Verify gate: load test sustains target throughput within the RU/cost budget (no 429 storms).
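
The RU estimate in this section is simple enough to encode directly, which also gives a way to compute the factory count F at which polling exceeds a provisioned budget. The sketch below is a planning aid only: the field names are hypothetical and the per-operation RU costs must be measured against the real queries.

```typescript
// Inputs to the RU/s estimate from this section. All values are planning inputs;
// per-operation RU costs should be measured, not guessed.
interface RuInputs {
  factories: number;
  claimPollsPerSec: number;  // claim polls per factory per second
  claimQueryRu: number;      // RU charged per claim query
  heartbeatsPerSec: number;  // heartbeats per factory per second
  heartbeatUpsertRu: number; // RU charged per heartbeat upsert
}

// RU/s = (factories × claim-poll rate × query RU) + (factories × heartbeat rate × upsert RU)
function estimateRuPerSec(i: RuInputs): number {
  return (
    i.factories * i.claimPollsPerSec * i.claimQueryRu +
    i.factories * i.heartbeatsPerSec * i.heartbeatUpsertRu
  );
}

// Largest factory count whose steady-state polling fits in the provisioned RU/s budget.
// Poll cost grows linearly with factories, so this is budget / per-factory cost.
function maxFactoriesWithinBudget(
  perFactory: Omit<RuInputs, "factories">,
  budgetRuPerSec: number
): number {
  const ruPerFactory = estimateRuPerSec({ ...perFactory, factories: 1 });
  return Math.floor(budgetRuPerSec / ruPerFactory);
}
```

With illustrative inputs of 10 factories, one claim poll every 2 s at 4 RU, and one heartbeat every 4 s at 8 RU, the estimate is 10 × 0.5 × 4 + 10 × 0.25 × 8 = 40 RU/s, and a 400 RU/s budget would cover 100 factories before the broker migration becomes necessary.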

23. Ownership & RACI

Owners are roles, not names — assign before each phase starts (this removes the "undefined owner" gap).

| Area | Responsible (R) | Accountable (A) | Consulted (C) | Informed (I) |
| --- | --- | --- | --- | --- |
| Runner / factory agent (bash) | DevOps eng | Platform lead | none | All |
| Coordinator module (platform-service) | Backend eng | Platform lead | Security | All |
| Scheduler/router | Distributed-systems eng | Platform lead | Backend | All |
| Control plane (tracker-web Fleet) | Frontend eng | Platform lead | UX | All |
| Security/governance | Security eng | Security lead | Platform | All |
| Capacity/cost & SLOs | SRE | Platform lead | Finance | All |
| Profiles & persona governance | Eng leads | Platform lead | none | All |

Each phase names its R/A before kickoff; SLOs (§19) ratified by A.
On-call + runbooks established before the fleet runs unattended yolo workloads (Phase 2+).

24. Work hierarchy & composite delegation (roadmap / epic)

Goal: delegate work at any granularity — a single bug/feature/task, or an entire roadmap — and let the fleet decompose + orchestrate rather than hand a multi-day roadmap to one agent session (which is long-horizon, low first-pass-success, and high blast-radius under yolo).

24.1 Two delegation modes

Atomic (today's model): one leaf item (bug/feature/task) → one job → one agent at one station.
Composite (new): a roadmap/epic → a planner profile expands it into child jobs → the scheduler runs them as a DAG across factories/agents/profiles, honoring deps + phase gates. "Delegate the whole roadmap" = hand it to the orchestrator, which fans out — never one agent grinding for hours.

24.2 Job `kind` — the one genuinely new concept

A new axis, orthogonal to tracker type:

kind: leaf — runs an engine at a station (everything Phase 1–2 already does).
kind: composite — runs the planner/orchestrator that emits child leaf jobs and a dependency graph; it never itself edits a repo.

The scheduler (§7) routes by kind: leaf → station/engine; composite → planner. This keeps execution and planning cleanly separated.

24.3 Hierarchy & relationships

parentId links a child job/item to its roadmap/epic; deps (§5) expresses ordering within it (DAG, submit-time cycle detection).
A roadmap is, mechanically, a named DAG of jobs + a rollup — it reuses deps, profiles (§6), the scheduler (§7), and the lifecycle (§11); the only additions are kind, parentId, and rollup logic.
Add a planner/architect/tech-lead profile (§6 catalog) for decomposition + orchestration; leaf work still uses backend-engineer, ux-designer, etc.
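
Submit-time cycle detection over `deps` can be done with a standard topological sort. The sketch below uses Kahn's algorithm; the `JobNode` shape and the `topoOrder` name are hypothetical, and only `deps` and `parentId` are field names from this document.

```typescript
// Hypothetical minimal job shape; only id and deps matter for ordering.
interface JobNode {
  id: string;
  deps: string[];    // ids of jobs that must ship first
  parentId?: string; // the roadmap/epic this child belongs to
}

// Kahn's algorithm: returns a valid execution order, or throws if the deps
// reference an unknown job or contain a cycle.
function topoOrder(jobs: JobNode[]): string[] {
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const j of jobs) {
    indegree.set(j.id, 0);
    dependents.set(j.id, []);
  }
  for (const j of jobs) {
    for (const d of j.deps) {
      const list = dependents.get(d);
      if (list === undefined) throw new Error(`unknown dep ${d} on job ${j.id}`);
      list.push(j.id);
      indegree.set(j.id, (indegree.get(j.id) ?? 0) + 1);
    }
  }
  // Jobs with no deps are immediately runnable.
  const ready = jobs.filter((j) => j.deps.length === 0).map((j) => j.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const remaining = (indegree.get(next) ?? 0) - 1;
      indegree.set(next, remaining);
      if (remaining === 0) ready.push(next);
    }
  }
  // Any job never reached is part of (or downstream of) a cycle.
  if (order.length !== jobs.length) throw new Error("dependency cycle detected");
  return order;
}
```

Rejecting the submission when this throws keeps cyclic roadmaps out of the queue entirely, so the scheduler never has to reason about jobs that can never become runnable.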

24.4 Rollup semantics (composite-level)

Status rollup: roadmap status is derived from children — in_progress once any child starts; shipped/done only when all children reach shipped; surfaces blocked/failed children for triage.
Budget rollup: roadmap budget = Σ child budgets with an explicit ceiling; breaching the ceiling pauses fan-out (ties to §12 kill-switch).
Verify rollup: each leaf runs its own verify; the roadmap's acceptance gate runs after all leaves pass (e.g. an integration/e2e gate).
Phase gates: the roadmap's own phase Exit-criteria become runtime gates — fan-out of phase N+1 is blocked until phase N's children ship; human approval between phases is the default for yolo safety.
Idempotent re-run: re-running a roadmap skips already-shipped children (content-hash dedupe, §5); only unfinished/changed children re-queue.
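
The status and budget rollups above can be expressed as one pure function over the child set. This is a sketch, not the fleet schema: the `Child` and `Rollup` shapes, the `queued` status name, and the treatment of an empty roadmap as `queued` are assumptions, and budget is a unitless number here.

```typescript
// Hypothetical child and rollup shapes; status names other than in_progress,
// blocked, failed, and shipped are assumptions.
type ChildStatus = "queued" | "in_progress" | "blocked" | "failed" | "shipped";
type RollupStatus = "queued" | "in_progress" | "shipped";

interface Child {
  id: string;
  status: ChildStatus;
  budget: number; // unit left open (wall time, usd, or tokens)
}

interface Rollup {
  status: RollupStatus;
  triage: string[];     // ids of blocked/failed children surfaced for triage
  budget: number;       // sum of child budgets
  pauseFanOut: boolean; // true when the summed budget breaches the ceiling
}

function rollup(children: Child[], ceiling: number): Rollup {
  // An empty roadmap is never "shipped": every() is vacuously true on [].
  const allShipped = children.length > 0 && children.every((c) => c.status === "shipped");
  const anyStarted = children.some((c) => c.status !== "queued");
  const status: RollupStatus = allShipped ? "shipped" : anyStarted ? "in_progress" : "queued";
  const triage = children
    .filter((c) => c.status === "blocked" || c.status === "failed")
    .map((c) => c.id);
  const budget = children.reduce((sum, c) => sum + c.budget, 0);
  return { status, triage, budget, pauseFanOut: budget > ceiling };
}
```

Deriving the parent's state on every read, rather than storing it, is one way to honor the rule that rollup state is never hand-edited.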

24.5 Source-of-truth & sync (no drift)

Composite work obeys the same SoT discipline as the core contract (§4 immutable manifest) and the tracker echo (§10): a roadmap/epic is one record referenced by many, never duplicated.

The roadmap/epic is the SoT for what/why + rollup status; each leaf job/run is the SoT for its execution.
Children reference the parent by parentId; the planner writes the child set once at decomposition (immutable manifest snapshot). Re-planning creates a new revision; it does not mutate in-flight children.
Status flows one way, child → parent → tracker (the §10 echo); humans never hand-edit rollup state.

24.6 Decision — Hybrid (recorded)

Model composite delegation in the fleet layer now; defer the shared-platform enum change until proven.

Now (fleet-owned): add kind (leaf/composite), parentId, and rollup to the fleet_jobs schema (§13). The fleet owns this schema outright — no cross-product risk.
Tracker stays bug/feature/task (the shared ITEM_TYPES used by all 9 products is unchanged). A roadmap is represented by a parent item + label kind:roadmap + parentId on children — zero platform migration, no sign-off needed.
Later (optional, gated on proven value): promote kind:roadmap → a first-class epic tracker type via an additive migration (backfill items where labels contains kind:roadmap into type: epic, keep the label as an alias during transition). Low-risk because the behavior already works fleet-side.
Rationale: avoids a speculative 9-product platform change (UI/filters/stats/tests) before the orchestration model is validated; if the model is wrong, only fleet code is refactored, not a platform enum every product depends on.
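
The later, optional promotion step could be an idempotent per-item transform along these lines. The `TrackerItem` shape is a hypothetical simplification of the tracker record, and `epic` is the proposed future type, not one that exists in ITEM_TYPES today.

```typescript
// Hypothetical simplified tracker item; "epic" is the proposed future type.
type ItemType = "bug" | "feature" | "task" | "epic";

interface TrackerItem {
  id: string;
  type: ItemType;
  labels: string[];
}

// Additive backfill: items labelled kind:roadmap become type "epic".
// The label is kept as an alias during the transition, and items that are
// already epics or lack the label are returned unchanged, so re-running is safe.
function promoteToEpic(item: TrackerItem): TrackerItem {
  if (item.type === "epic" || !item.labels.includes("kind:roadmap")) return item;
  return { ...item, type: "epic" };
}
```

Returning a new object instead of mutating the input makes it straightforward to run the transform in a dry-run mode that only reports which items would change.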

24.7 Phasing & gates

P1–P2: leaf-only (no composite); kind defaults to leaf.
P3: composite scheduling + rollup + DAG view in the control plane, with manual decomposition (a human/author defines the child set).
P3→P5: the auto-decomposition planner agent (itself a composite job run by the planner profile) — start manual, automate once trustworthy.
Acceptance: a roadmap with N child jobs fans out across ≥2 factories, respects deps + phase gates, rolls up status/budget correctly, and a re-run skips shipped children; tracker shows the parent moving in_progress → done via the one-way echo.
Verify gate: composite-orchestration tests — DAG expansion, rollup status/budget, phase-gate blocking, idempotent re-run; control-plane e2e for the roadmap DAG view.

This document is the single source of truth for the gigafactory build. Keep the §0 table and per-phase checkboxes updated; a phase ships only when its Exit criteria and the §16 Definition-of-Done rubric are fully green.
