32 Commits

Author SHA1 Message Date
Saravanakumar D
237481247e docs(gigafactory): uppercase GIGAFACTORY folder + add index README
Rename agent-queue/docs/gigafactory/ to docs/GIGAFACTORY/ and update every
reference (README, system-overview code-map, and all phase job specs). Add an
index README that lists the docs and points to the companion docs in
learning_ai_common_plat. Docs-only; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:21:31 -07:00
Saravanakumar D
257efcb4bc docs(gigafactory): consolidate gigafactory docs into docs/gigafactory/
Move GIGAFACTORY_ROADMAP.md and GIGAFACTORY_SYSTEM_OVERVIEW.md under
agent-queue/docs/gigafactory/ so the scattered top-level docs are easy to
discover. Update the README links, the overview code-map, and all phase
job-spec source-of-truth paths to the new location. Pure docs move; no
behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:01:23 -07:00
Saravanakumar D
71e5ad6923 docs(gigafactory): add system overview with architecture diagrams; sync roadmap status
Add GIGAFACTORY_SYSTEM_OVERVIEW.md — a current-state companion to the roadmap
spec covering: what the Agent Gigafactory is, a completion snapshot, three
Mermaid diagrams (component architecture, job-lifecycle state machine, atomic
claim + lease-fencing sequence), the Cosmos data model, the scoring router,
subsystem map, full /fleet REST surface, feature flags, the two control planes,
a cross-repo code map, test coverage, next steps (Phase 4/5), and an honest
bugs/gaps/risks section. All three Mermaid blocks validated with mermaid.parse.

Also correct documentation drift in GIGAFACTORY_ROADMAP.md found during the
review:
- §0 progress table showed Phase 3 as "0% not started" while every Phase-3 box
  is ticked; updated phases 1-3 to done with realistic percentages.
- Phase-2 boxes "scheduler/router wired into assignment", "tracker adapter
  direct call", and "factory enrollment + scoped tokens" are implemented in
  common-plat (coordinator.ts uses selectJob; routes.ts enforces
  enrollment.enforceFactoryToken; tracker-bridge.ts) but were left unticked —
  ticked with evidence and refreshed the stale "remaining for 100%" notes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 20:11:02 -07:00
Saravanakumar D
66c91233da feat(agent-queue): re-point TUI dashboard at /fleet API (parity)
Add an opt-in fleet mode to the dashboard so an operator can drive the
coordinator fleet from the same TUI used for the local folder queue.

- lib/fleet-dash.mjs: dependency-injectable read/act adapter over the
  platform-service /fleet REST surface (jobs, metrics, factories, events,
  ship/requeue/reject). Pure-ish + fully unit-testable without a live service.
- dashboard.mjs: render + act in fleet mode when AQ_FLEET_DASH=1 — board with
  counts, factories (per-factory rows or metrics aggregate), alerts, running
  (by lease/factory), actionable JOBS with manifest tags, recent, and a
  per-job events log. Single-flight async refresh keeps the last good board on
  failure; ship re-GETs a fresh leaseEpoch before PATCH; run/stop/promote are
  disabled (no safe server contract). Local mode is byte-for-byte unchanged.
- lib/fleet-dash.test.mjs: 22 node:assert assertions (config, stage mapping,
  toBoard, fetch headers/timeout/errors, board assembly + graceful degradation,
  events, job actions) wired into selftest.sh.
- docs: tick the Phase 3 "TUI re-pointed at /fleet" roadmap boxes.

Verified: selftest.sh green (incl. new fleet-dash checks); live non-TTY render
smoke against a stub /fleet server (both factories and metrics-aggregate paths);
local mode unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 19:47:56 -07:00
Saravanakumar D
7f77e9abc7 feat(agent-queue): enforce budget.wall as a hard wall-clock ceiling
Parse the wall ceiling from the budget manifest map (budget: { wall: <dur> })
and arm it alongside the per-run timeout. Whichever ceiling fires first binds;
the kill is recorded as result=timeout or result=budget_exceeded accordingly.
budget.wall extends timeout: a job with only a budget.wall (no timeout) is now
hard-killed at the ceiling. budget_exceeded is a terminal, non-retryable class
by default and maps to the failed tracker status.

Adds _budget_wall_secs + _effective_kill helpers (pure, unit-tested) and live
selftest coverage; usd/tokens remain best-effort and are not enforced here.
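
A minimal sketch of the first-ceiling-wins rule described above, assuming both ceilings are already expressed in whole seconds. `effective_kill_sketch` is a hypothetical name and is not the repository's `_effective_kill` helper. Duration parsing is left out, and the tie-break (the timeout is reported when both ceilings are equal) is an assumption.

```shell
# Hypothetical sketch; NOT the repository's _effective_kill helper.
# Args: per-run timeout in seconds, budget.wall in seconds
#       (empty or 0 means "not set").
# Prints "<seconds> <result>" for the ceiling that fires first,
# or "0 none" when neither ceiling is set.
effective_kill_sketch() {
  local timeout_secs="${1:-0}" wall_secs="${2:-0}"
  if [ "$timeout_secs" -gt 0 ] && [ "$wall_secs" -gt 0 ]; then
    # Both armed: the smaller ceiling binds.
    # Assumption: on a tie the per-run timeout is the one reported.
    if [ "$wall_secs" -lt "$timeout_secs" ]; then
      echo "$wall_secs budget_exceeded"
    else
      echo "$timeout_secs timeout"
    fi
  elif [ "$wall_secs" -gt 0 ]; then
    # Only budget.wall set: it still hard-kills the job.
    echo "$wall_secs budget_exceeded"
  elif [ "$timeout_secs" -gt 0 ]; then
    echo "$timeout_secs timeout"
  else
    echo "0 none"
  fi
}
```

For example, `effective_kill_sketch 600 300` selects the 300-second wall ceiling with result `budget_exceeded`, while `effective_kill_sketch "" 120` covers the wall-only case.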

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 19:21:49 -07:00
Saravanakumar D
f1fe66fd4d docs(roadmap): tick verified-done Phase 3 boxes (395-400,402)
Phase 3 fleet control plane is implemented in learning_ai_common_plat:
fleet API client, fleet map page, job table/detail/DAG/SSE/actions, cost
burndown + multi-reviewer gate, scoring explainability, preemption, and
Playwright fleet e2e. Box 401 (TUI re-point) remains open.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 19:13:25 -07:00
saravanakumardb1
a075a6ff30 Merge: Phase 2 two-factory parallel demo — exit criteria (§14) (#demo) 2026-05-30 01:58:55 -07:00
saravanakumardb1
0cde7def6a feat(agent-queue): two-factory parallel demo — Phase 2 exit criteria (§14)
Close the final Phase-2 exit-criteria box: >=2 factories executing jobs in parallel
through one coordinator, proving the concurrency guarantees end-to-end. This is a
DEMO HARNESS over the existing runtime — agent-queue.sh and lib/fleet-client.sh are
unchanged (read + called, not modified).

demo/two-factory-demo.sh: starts two real `agent-queue.sh run` daemons (mac-1 +
ubuntu-1, separate queues/cwds) that compete ONLY through the coordinator, then
asserts: (a) no double-assign — each of 3 jobs executed by exactly one factory;
(b) fencing + reclaim — kill a factory mid-job, the reaper returns its job, the
survivor reclaims + completes it, and the dead worker's late/zombie report (stale
leaseEpoch) is FENCED (HTTP 409, never shipped); (c) parallelism — both factories
hold active jobs concurrently. Dual-mode: CI-safe stateful stub by default; live
platform-service when AQ_FLEET_API/AQ_FLEET_TOKEN set.
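
The fencing rule in (b) can be sketched as a comparison between the job's current leaseEpoch and the epoch carried by an incoming report. `fence_check_sketch`, the 200 success code, and the exact-match rule are assumptions for illustration. Only the 409 on a stale epoch comes from the text above.

```shell
# Hypothetical sketch; not the coordinator's actual implementation.
# Args: current leaseEpoch held by the coordinator,
#       leaseEpoch sent along with the worker's report.
# Prints the HTTP status the coordinator would answer with.
fence_check_sketch() {
  local current_epoch="$1" reported_epoch="$2"
  if [ "$reported_epoch" -eq "$current_epoch" ]; then
    echo 200   # assumption: a matching epoch is accepted
  else
    echo 409   # stale (or otherwise mismatched) epoch is fenced
  fi
}
```

Assuming a reclaim bumps the epoch from 1 to 2, the dead worker's late report is checked with `fence_check_sketch 2 1` and is fenced, while the survivor's report with epoch 2 is accepted.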

demo/coordinator-stub.sh: stateful, mkdir-lock-guarded, file-backed coordinator
implementing claim/lease/fence/renew/release + reaper-reclaim via the existing
AQ_FLEET_API_CMD seam — the selftest stub pattern extended with shared state so
>=2 processes coordinate through one coordinator.

demo/README.md: stub + real invocations, env knobs, what each guarantee proves,
what-to-watch guide.

selftest.sh: +3 headless stub-mode checks (existing 68 unchanged byte-for-byte ->
71 total green).

docs/GIGAFACTORY_ROADMAP.md: tick the §14 two-factory-demo box; annotate Phase-2
exit criteria; bump §0 Phase 2 to 80% (remaining: scheduler-core wiring [common-plat
PR #31], tracker-direct call, factory enrollment).

bash 3.2 + awk/sed/grep/pgrep only; mac+linux safe; no new runtime deps.
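
The mkdir-lock idea mentioned for the coordinator stub can be illustrated with a few lines of portable shell. This is a sketch under stated assumptions, not the contents of demo/coordinator-stub.sh: `claim_job_sketch`, the `<jobId>.lock` directory layout, and the `owner` file are hypothetical. It relies only on the fact that `mkdir` fails when the directory already exists, so exactly one of several competing callers succeeds.

```shell
# Hypothetical sketch; NOT taken from demo/coordinator-stub.sh.
# Args: shared state directory, job id, factory id.
# Returns 0 if this caller won the claim, 1 if another caller holds it.
claim_job_sketch() {
  local state_dir="$1" job_id="$2" factory_id="$3"
  local lock="$state_dir/$job_id.lock"
  # mkdir either creates the directory or fails because it exists,
  # so only one competing process can get past this line.
  if mkdir "$lock" 2>/dev/null; then
    echo "$factory_id" > "$lock/owner"
    return 0
  fi
  return 1
}
```

A second call for the same job id, for instance from `ubuntu-1` after `mac-1` has claimed it, returns 1 and leaves the recorded owner untouched.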

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 01:53:36 -07:00
saravanakumardb1
2d76af916d docs(agent-queue): add Phase 3 overnight (10h) job — tunable scoring+preemption, DAG, budgets, tracker-web control plane 2026-05-30 01:48:39 -07:00
saravanakumardb1
08d8d715a1 docs(agent-queue): add Dependabot dependency-triage prompt for common-plat 2026-05-30 00:56:55 -07:00
saravanakumardb1
24fe1567f6 docs(agent-queue): draft Phase 2 next prompts — direct tracker->module wiring (§10) + two-factory parallel demo (exit criteria) 2026-05-30 00:40:21 -07:00
saravanakumardb1
fbecbe82b6 feat(agent-queue): fleet feature flags + shadow/dual-run (Phase 2)
Add a safe, reversible path to validate the fleet coordinator against the proven
single-host path BEFORE cutover, via three independently-toggleable flags:
  AQ_FLEET=0          pure offline (zero coordinator calls; offline path unchanged)
  AQ_FLEET_ROUTE=1    route_via_service: coordinator authoritative for claim (default = P2-S3)
  AQ_FLEET_ROUTE=0    local inbox authoritative (coordinator not used to source work)
  AQ_FLEET_SHADOW=1   dual-run (needs AQ_FLEET=1 + ROUTE=0): query coordinator in parallel,
                      record divergence, NEVER act on it
Precedence: SHADOW only when ROUTE=0; if ROUTE=1 + SHADOW=1, ROUTE wins (one-shot warning).
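
The precedence above can be sketched as a small resolver. `fleet_mode_sketch` and its mode names are hypothetical and are not part of lib/fleet-client.sh. Flag values are passed as arguments rather than read from the environment, and the defaults for AQ_FLEET (0) and AQ_FLEET_SHADOW (0) are assumptions; only the ROUTE=1 default is stated in the text.

```shell
# Hypothetical sketch; not one of the fleet_* functions in lib/fleet-client.sh.
# Args: AQ_FLEET, AQ_FLEET_ROUTE, AQ_FLEET_SHADOW values (may be empty).
# Prints one of: offline, route, shadow, local.
fleet_mode_sketch() {
  # Assumed defaults: fleet off, route on, shadow off.
  local fleet="${1:-0}" route="${2:-1}" shadow="${3:-0}"
  if [ "$fleet" != "1" ]; then
    echo "offline"   # zero coordinator calls
  elif [ "$route" = "1" ]; then
    echo "route"     # coordinator authoritative; ROUTE wins over SHADOW
  elif [ "$shadow" = "1" ]; then
    echo "shadow"    # local authoritative, coordinator queried for comparison
  else
    echo "local"     # local inbox authoritative, no shadow query
  fi
}
```

With these rules `fleet_mode_sketch 1 1 1` resolves to `route`, and shadow mode is only reached with `fleet_mode_sketch 1 0 1`.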

lib/fleet-client.sh: fleet_route_enabled / fleet_shadow_enabled / fleet_flags_warn_once /
fleet_flags_state; fleet_shadow_claim (read-only — isolated `-shadow` factoryId +
dryRun, releases any real lease, never materializes), fleet_shadow_compare
(AGREE/DIVERGE/COORD_EMPTY/LOCAL_EMPTY → .state/fleet-shadow.log), fleet_shadow_report
(shadow:true, response never acted on), cmd_fleet_shadow_report (counts + agreement rate).

agent-queue.sh: ROUTE-gate claim sourcing (claim only when route_via_service);
shadow hook after the local authoritative decision each iteration (best-effort,
error-swallowed — shadow can never fail a real job); `fleet-shadow-report` subcommand
+ help; resolved flags surfaced in `status`/`fleet-status`. tryClaim/fence/offline
paths unchanged.

Strictly side-effect-free on real job state: shadow never ships, quarantines, or
mutates real jobs. Offline path byte-for-byte unchanged when AQ_FLEET=0.

selftest.sh: +8 checks (shadow AGREE/DIVERGE/COORD_EMPTY, non-fatal 5xx, ROUTE
precedence, ROUTE=0 local-authoritative, fleet-shadow-report summary, shadow_report
unit). 60 prior checks unchanged → 68 total green. README + GIGAFACTORY_ROADMAP
document the flag model + cutover ladder.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 00:22:48 -07:00
saravanakumardb1
5c0ae020c0 docs(agent-queue): draft P2 prompts — factory enrollment+tokens (§12) + feature flags/shadow-dualrun 2026-05-29 23:52:14 -07:00
saravanakumardb1
21ebf8b1b7 docs(agent-queue): fleet integration section + roadmap P2-S3 ticks
README: "Fleet integration (Phase 2)" — AQ_FLEET flag, env table, claim/heartbeat/
report/fence/renew protocol, offline-degrade + quarantine, offline-vs-fleet explainer.
Roadmap: tick the Phase-2 §14 factory-agent item, add a P2-S3 slice note, bump §0
Phase 2 -> 55%.
2026-05-29 22:45:44 -07:00
saravanakumardb1
10395983e7 docs(agent-queue): draft parallel P2 prompts — scheduler/router core (§7) + fleet artifacts blob wiring (§13) 2026-05-29 22:32:41 -07:00
saravanakumardb1
9a073ef225 docs(agent-queue): draft P2-S3 factory-agent integration prompt (claim/heartbeat/report/fence behind AQ_FLEET) 2026-05-29 22:03:12 -07:00
saravanakumardb1
8ae504ca30 docs(agent-queue): tracker integration + close Phase 1 §10/§14 adapter (P1-S4)
README: Tracker integration section (from-tracker/to-tracker, env config,
label->manifest table, one-way-echo rule, AQ_TRACKER_AUTO, real-use note).
Roadmap: tick §10 Phase-1 adapter items + the §14 tracker-adapter item; add P1-S4
slice note; §0 Phase 1 -> 95% (remaining: budget.wall + Node dash surfacing).
2026-05-29 21:35:16 -07:00
saravanakumardb1
d0348f23de docs(agent-queue): P0 atomic-claim resolved (PR #29) — tick §4/§13/§14 fleet items 2026-05-29 21:05:38 -07:00
saravanakumardb1
2e9bd4dd1e docs(agent-queue): record P2 Foundation merged + track P0 atomic-claim hardening (§4)
- §4: implementation-status note — fleet module merged (PR #28); atomic claim NOT
  yet concurrency-safe (rev-CAS over unconditional write, sequential-only test)
- add phase2-atomic-claim-hardening.md: updateIfMatch in @bytelyst/datastore
  (Cosmos If-Match + process-atomic memory) + concurrent claim tests
2026-05-29 20:43:28 -07:00
saravanakumardb1
0e94705ab7 docs(agent-queue): draft Phase 2 Foundation long-run prompt (fleet module + coordinator: claim/lease/fencing/reaper) 2026-05-29 19:54:33 -07:00
saravanakumardb1
e183919c60 docs(agent-queue): profiles + deps docs; tick §5/§6 + bump Phase 1 to 80% (P1-S2)
README: Profiles & deps section (resolution precedence, persona, allowed-scope
warn-only, deps/blocked + cycle detection); manifest table moves
profile/deps/deps-mode to active.
Roadmap: tick §6 catalog/persona/inheritance/allowed-scope and §5 deps + the §14
profile/deps/scope boxes; add P1-S2 slice note; §0 Phase 1 -> 80%.
2026-05-29 19:26:33 -07:00
saravanakumardb1
7c4f5bc9b0 docs(agent-queue): draft Slice 4 (tracker adapter) + Phase 2 Slice 1 (fleet data model) 2026-05-29 19:11:09 -07:00
saravanakumardb1
0443590ce4 docs(agent-queue): update Slice 2 prompt — branch off main (Slice 1+3 merged) 2026-05-29 19:05:34 -07:00
saravanakumardb1
87a4bf591a docs(agent-queue): Resilience + Insights docs; tick §11/§25/§26 single-host (P1-S3)
README: Resilience + Insights sections, retry frontmatter active (manifest table),
retries_exhausted/recovered result values, recover/insights commands, honest token
caveat.
Roadmap: tick fully-completed single-host boxes in §11/§25/§26 with annotations;
bump §0 Phase 1 to 55%.
2026-05-29 18:43:38 -07:00
saravanakumardb1
bc0c0e263c Merge PR #1: Phase 1 Slice 1 — evolved manifest, priority, capabilities, engine-class, idempotency
Reviewed against the diff (capability gate before launch, 3-pass idempotency,
priority+age selection, engine-class resolution, timeout/flock launch). selftest 18/18.
2026-05-29 18:12:41 -07:00
saravanakumardb1
1f18f5d7a3 docs(agent-queue): add Phase 1 Slice 3 prompt (resilience & insights, single host) 2026-05-29 18:10:43 -07:00
saravanakumardb1
beb225162a docs(agent-queue): add durability/crash-recovery (§25) + execution insights/token accounting (§26)
- §13: fleet_jobs stores instruction bodyMd (durable md SoT) + checkpoint;
  fleet_runs carries token/cost/model/tool/diff metrics
- §25: instructions durable in Cosmos md, WIP checkpoint branch aq/wip/<jobId>,
  orphan detection, resume-vs-restart, fencing, retry->dead_letter, crash taxonomy
- §26: per-run token/cost/latency/tool insights, honest metered-vs-estimated
  capture, rollups, control-plane surfacing, secret redaction
- feature-catalog rows for §25 and §26
2026-05-29 18:09:32 -07:00
saravanakumardb1
470b2ce8d0 docs(agent-queue): version Phase 1 slice prompts (slice1, slice2)
Track the delegated agent task prompts under docs/jobs/ so the slice
decomposition of the gigafactory roadmap is reproducible and reviewable.
2026-05-29 18:05:06 -07:00
saravanakumardb1
67d8aa5766 docs(agent-queue): add work hierarchy & composite delegation (roadmap/epic)
New §24 + feature-catalog row:
- two delegation modes: atomic (leaf bug/feature/task) vs composite (roadmap/epic)
- introduce job kind (leaf|composite); composite routes to a planner/orchestrator
  that fans out child leaf jobs as a DAG across factories/agents/profiles
- parentId hierarchy + rollup semantics (status/budget/verify/phase-gates) +
  idempotent re-run (skip shipped children)
- source-of-truth/sync discipline (one record referenced by many; one-way echo)
- HYBRID decision recorded: model kind/parentId/rollup in the fleet layer now,
  keep shared tracker ITEM_TYPES unchanged (label kind:roadmap), promote to a
  first-class epic type later via additive migration once proven
- phasing: leaf-only P1-P2; manual composite P3; auto-decomposition planner P3->P5
2026-05-29 18:02:10 -07:00
saravanakumardb1
a9c69b1dce docs(agent-queue): manifest field table (active vs reserved) + tick Phase 1 Slice 1 (P1-S1)
- README: new "Manifest fields (Gigafactory Phase 1)" table marking ACTIVE vs
  RESERVED, capability-grammar table, idempotency-key semantics, copilot engine
  mapping, COPILOT_BIN, and capability_mismatch/no_engine result values.
- GIGAFACTORY_ROADMAP: tick only the fully-completed P1 boxes (frontmatter
  parsing, capability detect+match, priority, backward-compat, capability
  grammar, engine-class taxonomy, idempotency-key semantics, README/progress),
  annotate partials, and bump §0 Phase 1 to in-progress 35%.
2026-05-29 17:44:37 -07:00
saravanakumardb1
3ad9500623 docs(agent-queue): harden gigafactory roadmap after principal review
Fix correctness/distributed-systems bugs and fill gaps in place:
- atomic claim (optimistic concurrency/_etag), fencing token (leaseEpoch),
  coordinator-authoritative time added to core contract + scheduler + factory
- lease reclaim via coordinator reaper, not Cosmos TTL (TTL only GCs rows)
- split-brain/partition safety: fencing + distributed lock + quarantine
- budget: wall is the only hard ceiling; usd/tokens best-effort (provider metering)
- SSE live logs cannot use the buffering tracker proxy; use a streaming route +
  blob log storage (fleet_artifacts container)
- manifest: capability grammar, engine-class enum, idempotency 409 + deps-satisfied
  semantics, dep cycle detection
- tracker status mapping table + PR-flow ship semantics (merged+green vs pr-opened)
- station/seat capacity, factory health definition, enrollment/bootstrap auth
- Cosmos RU/indexing + claim-loop poll cost; add new sections: rollout/rollback &
  data migration (§21), capacity planning & cost (§22), ownership & RACI (§23)
- success metrics now carry provisional SLO targets; Phase 2 checklist + index synced
2026-05-29 17:15:28 -07:00
saravanakumardb1
90366e59bb docs(agent-queue): add gigafactory vision + checklist implementation roadmap
- docs/GIGAFACTORY_ROADMAP.md: distributed multi-machine fleet vision
  (factory x tool x profile routing) as a checklist-driven, phased
  implementation roadmap (Phase 0-5) with acceptance criteria, verify
  gates, and a 100% Definition-of-Done rubric
- committed path: coordinator as a platform-service module + control
  plane on tracker-web, reached via a thin tracker adapter first; bash
  runner survives as the offline edge factory agent
- README: add vision/roadmap pointer
2026-05-29 17:06:32 -07:00