Commit Graph

238 Commits

Author SHA1 Message Date
Hermes VM
3ee4e7104e fix(dashboard): Phase 5 P0 — correct CI workspace path + real ESLint
- ci.yml: actions/checkout into the runner workspace instead of cd-ing into a
  hard-coded host path and `git reset --hard origin/main` on the live checkout;
  install via `pnpm install:gitea` (self-contained, no sibling common-plat
  checkout); E2E step left as a TODO pointer (ci-e2e-hardening, Phase 5 P2).
- Fix the same stale /opt/bytelyst/bytelyst-devops-tools path in deploy.sh,
  scripts/deploy-hotcopy.sh, DEPLOYMENT.md, DEPLOYMENT_GUIDE.md.
- Replace the no-op `lint` echoes with real ESLint 9 flat configs (js +
  typescript-eslint recommended) for backend and web; add a root `pnpm lint`.
- Fix the 10 errors lint surfaced, incl. require('os') in an ESM backend
  (system/repository.ts -> import * as os), prefer-const x4, and a ternary
  expression-statement in web vm/page.tsx.

Verified locally: secret-scan, lint (0 errors; correctly fails on bad code),
typecheck, unit tests (backend 9 / web 11), and build all green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 06:50:32 +00:00
saravanakumardb1
51e4c5f271 Merge PR #5: Phase 2 Slice 3 — factory-agent integration
Wires agent-queue.sh to the fleet coordinator behind AQ_FLEET=1 (offline path
unchanged when off). Fencing-aware (stale leaseEpoch -> self-abort + quarantine),
offline-degrade, tracker echo via fleet_events. selftest 60/60 (53 prior + 7 new);
token env-sourced; no bodyMd/token leak (asserted).
2026-05-29 22:50:56 -07:00
saravanakumardb1
21ebf8b1b7 docs(agent-queue): fleet integration section + roadmap P2-S3 ticks
README: "Fleet integration (Phase 2)" — AQ_FLEET flag, env table, claim/heartbeat/
report/fence/renew protocol, offline-degrade + quarantine, offline-vs-fleet explainer.
Roadmap: tick the Phase-2 §14 factory-agent item, add a P2-S3 slice note, bump §0
Phase 2 -> 55%.
2026-05-29 22:45:44 -07:00
saravanakumardb1
064dbf3d8f test(agent-queue): fleet integration selftest cases (P2-S3)
Adds 7 stub-driven fleet cases (AQ_FLEET_API_CMD stub, no live coordinator); never
weakens the prior 53 (full suite now 60 green):
- flag OFF (default): zero coordinator calls; offline job completes unchanged
- register(heartbeat)+claim -> coordinator job materialized + executed to review/
- report+checkpoint: PATCH carries stage+leaseEpoch (+ wipBranch on building)
- FENCING: stale-epoch 409 -> self-abort + quarantine (never shipped)
- lease renew (unit): POST .../lease/renew with current leaseEpoch
- offline-degrade: coordinator 5xx -> job completes locally (degraded), not quarantined
- no-leak: bodyMd/token never appear in report payloads
2026-05-29 22:45:44 -07:00
saravanakumardb1
1d84712b47 feat(agent-queue): wire runner to fleet coordinator at minimal hook points (P2-S3)
Sources lib/fleet-client.sh and adds a few fleet_enabled-gated hooks so the offline
git-queue path is byte-for-byte unchanged when AQ_FLEET is unset/0:

- cmd_run: register at loop start; per-iteration heartbeat (cadence) + lease renew
  for in-flight fleet jobs + claim one coordinator job into inbox when capacity.
- meta: persist fleet_job_id + fleet_lease_epoch (from claim frontmatter).
- run_worker: report `building` (with WIP checkpoint) after WIP setup and `review`
  before accepting the agent's output — a FENCED (stale-epoch/409) report self-aborts
  and quarantines (never ships); 5xx/unreachable degrades (finish locally).
- _auto_echo: for fleet jobs route the outcome echo through the coordinator
  (fleet_events) instead of the direct tracker echo; offline jobs unchanged.
- cmd_ship: fence-check before shipping a fleet job; release lease after.
- status: show factory id + per-job fleet=<id>@e<epoch>; insights lists fleet_* fields.
- dispatch + help: `fleet-status` command + a FLEET env section.
2026-05-29 22:45:44 -07:00
saravanakumardb1
a10d4003e6 feat(agent-queue): fleet coordinator client library (lib/fleet-client.sh, P2-S3)
New sourced library implementing the factory side of the Phase-2 `fleet`
coordinator contract — curl-only + POSIX awk, reusing the Slice-4 HTTP/JSON
helper patterns, no new deps. Every function is a no-op unless AQ_FLEET=1.

- fleet_enabled / fleet_api (AQ_FLEET_API_CMD test seam) / _fleet_call
- fleet_detect_caps (reuses detect_capabilities) -> JSON caps array
- fleet_heartbeat (+ _maybe cadence): registration == first heartbeat
- fleet_claim: POST /fleet/claim, parse job id/bodyMd/leaseEpoch, materialize a
  transient local .md (fleet-job-id + fleet-lease-epoch in frontmatter)
- fleet_report: PATCH fenced stage transition {stage, leaseEpoch, checkpoint?};
  returns ok / FENCED(2, stale epoch -> self-abort) / degraded(1, unreachable)
- fleet_lease_renew / fleet_lease_release / fleet_renew_active (fenced)
- fleet_quarantine: park a reclaimed (fenced) job in failed/ for human triage
- cmd_fleet_status: register + print factory identity/caps

Report payloads carry only stage/epoch/checkpoint — never prompt/bodyMd/token.
2026-05-29 22:45:44 -07:00
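The three report outcomes described in the commit above (ok, FENCED on a stale epoch, degraded when the coordinator is unreachable or returns 5xx) can be sketched roughly as follows. This is a Python illustration, not the bash in lib/fleet-client.sh. The names classify_report and report_payload are hypothetical, and the handling of statuses other than 2xx, 409 and 5xx is an assumption, since the commit does not describe it.

```python
# Return codes mirror the "ok / FENCED(2) / degraded(1)" wording in the commit.
OK, DEGRADED, FENCED = 0, 1, 2


def classify_report(http_code):
    """Map a coordinator response to a report outcome.

    http_code is None when the coordinator could not be reached at all.
    """
    if http_code is None or http_code >= 500:
        return DEGRADED  # unreachable / 5xx: finish the job locally
    if http_code == 409:
        return FENCED    # stale leaseEpoch: self-abort and quarantine, never ship
    if 200 <= http_code < 300:
        return OK
    # Assumption: any other status is not covered by the commit text;
    # this sketch treats it as degraded rather than fenced.
    return DEGRADED


def report_payload(stage, lease_epoch, checkpoint=None):
    """Build a report body carrying only stage, epoch and optional checkpoint."""
    payload = {"stage": stage, "leaseEpoch": lease_epoch}
    if checkpoint is not None:
        payload["checkpoint"] = checkpoint
    return payload
```

Keeping the payload builder this narrow is one way to make the "never prompt/bodyMd/token" rule hold by construction: the function has no parameter through which those values could enter.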
saravanakumardb1
10395983e7 docs(agent-queue): draft parallel P2 prompts — scheduler/router core (§7) + fleet artifacts blob wiring (§13)
2026-05-29 22:32:41 -07:00
Hermes VM
a8dd166108 docs: add Hermes dashboard v2 roadmap + CI/E2E delegation brief
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
Hermes VM
13a105ba23 feat(vm): Phase 5 closure — GPU/freshness checks, chaos validation, I/O alert
vm-health-check.sh:
- check_gpu(): nvidia-smi probe; "CPU-only" OK on this VM (no GPU)
- check_image_freshness(): flag containers running images >30d old.
  Skips third-party images (gitea, grafana, prom, mcr.microsoft, axllent,
  caddy, traefik, valkey, cadvisor) — they have their own rebuild cadence.
  Currently flags 19 stale product images (~60d old).

chaos-validation.sh:
- Monthly chaos test: kill PID 1 in chronomind-web, wait up to 35 min
  for docker-health-watchdog to detect + restart. Telegram pass/fail.
- Refuses to run if target not healthy. systemd timer fires 1st of month
  at 10:00 UTC (after 08:00 weekly digest).

vm-io-anomaly-check.sh:
- 6h avg sda write rate; transition alerts at WARN (1 GB/hr) /
  CRIT (2.5 GB/hr). De-dupes via /var/log/vm-io-anomaly-state so the
  alert fires once per transition, not every 6h. Current baseline:
  ~1.94 GB/hr (orphan-container state-file writes; see Phase 0.3).
- Reports recovery to OK when rate drops back.

vm/page.tsx: gpu + image_freshness added to CHECK_META so they render
with proper icon/label and slot into CHECK_ORDER.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
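The transition-only alerting described for vm-io-anomaly-check.sh in the commit above can be sketched as below. This is an illustrative Python version, not the script itself. The thresholds (1 GB/hr WARN, 2.5 GB/hr CRIT) come from the commit message. Treating the thresholds as inclusive, the function names, and the message wording are assumptions. Persisting the previous level (the script uses /var/log/vm-io-anomaly-state) is left to the caller.

```python
WARN_GB_PER_HR = 1.0
CRIT_GB_PER_HR = 2.5


def level_for(rate_gb_per_hr):
    """Classify an average write rate into OK / WARN / CRIT."""
    if rate_gb_per_hr >= CRIT_GB_PER_HR:
        return "CRIT"
    if rate_gb_per_hr >= WARN_GB_PER_HR:
        return "WARN"
    return "OK"


def next_alert(previous_level, rate_gb_per_hr):
    """Return (new_level, message); message is None unless the level changed."""
    level = level_for(rate_gb_per_hr)
    if level == previous_level:
        return level, None  # same level as last run: stay silent
    if level == "OK":
        return level, f"recovered to OK at {rate_gb_per_hr:.2f} GB/hr"
    return level, f"{level}: {rate_gb_per_hr:.2f} GB/hr"
```

With the stated baseline of about 1.94 GB/hr, the first run after an OK state would alert at WARN, and later runs at the same level would stay quiet until the rate crosses a threshold again.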
Hermes VM
76ef17f26b feat(vm): Phase 2.3 closure — OOM watchdog + orphan-container docs
OOM watchdog:
- vm-oom-watchdog.sh — scans journalctl -k since cursor for oom-kill,
  killed-process, and "out of memory ... killed" entries; maps cgroup
  hits back to container names via docker inspect; posts a single
  Telegram alert per scan window (no dedupe needed — cursor advances
  on every run). Cursor at /var/log/vm-oom-cursor, log at
  /var/log/vm-oom-watchdog.log.
- Systemd: OnBootSec=10min, OnUnitActiveSec=1h, Persistent=true.

Orphan containers (no compose file on disk):
- trading-backend → docker update --memory=768m (high-I/O bot)
- gitea-npm-registry → docker update --memory=512m
- orphan-containers.md captures canonical configs for recovery
  (env, mounts, networks, restart policy, memory limits).

Closes Phase 2.3 (post-monitoring) and Phase 3.3 (orphan limits).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
Hermes VM
253e888a24 feat(infra): Phase 2.3 — memory limits across all active Docker stacks
Apply deploy.resources.limits.memory to 45 services across 5 compose files.
Limits take effect on next docker compose up (no running containers affected).

Limits derived from 2-day Prometheus RSS baseline (avg of 2026-05-27-29):

  common_plat ecosystem (37 services):
    cosmos-emulator: 1g   (319 MiB baseline, can spike on writes)
    loki:           384m  (75 MiB)
    prometheus:     384m  (91 MiB, grows with series cardinality)
    node-exporter:  128m  (21 MiB, very stable)
    cadvisor:       256m  (38 MiB)
    valkey:         128m  (tiny)
    caddy:          256m  (35 MiB)
    platform-service: 512m (61 MiB)
    extraction-service: 512m (99 MiB, Python sidecar)
    mcp-server:     384m  (21 MiB)
    product backends: 512m (30-65 MiB each)
    product webs:   512m  (35-93 MiB each)
    llmlab-dashboard: 512m (Ollama proxy, larger cache budget)

  dashboard (2 services): backend 512m, web 512m
  invttrdg (2 services): backend 768m (159 MiB + heavy state writes),
                          web 256m (nginx SPA)
  clock/chronomind (2 services): backend 512m, web 512m
  notes/notelett (2 services): backend 512m, web 512m

Ollama host process has NO limit (model load unpredictable, up to 8 GB).
trading-backend compose file not on disk — limit not applied.
gitea-npm-registry started manually — limit not applied.

Monitor OOMKill for 48h after next stack restart:
  dmesg | grep -i oom

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
Hermes VM
42c3b9cdd5 feat(dashboard/vm): Phase 3.3 — All Containers panel with CPU/RAM, logs, bulk restart
- repository.ts: getAllContainers() — batch docker inspect + docker stats
  --no-stream merged by container name; returns state, health, uptime,
  CPU%, RAM, memLimitMiB (0=no limit), restart count, stack from compose
  label; getContainerLogs() — docker logs --tail --timestamps
- routes.ts: GET /api/vm/containers (all, with stats; ~3s for 38
  containers), GET /api/vm/containers/:name/logs?lines=N
- api.ts: ContainerInfo interface; vmApi.getAllContainers(),
  vmApi.getContainerLogs()
- vm/page.tsx: ContainersPanel — collapsible (lazy-loads on first open);
  filter chips (All/Running/Unhealthy/No Limit) + stack dropdown;
  per-row log viewer (inline pre, dark bg, 50-line tail); per-row
  restart button; bulk "Restart N unhealthy" with confirmation modal;
  Fragment key pattern for row+log-row pairs

I/O anomaly (Phase 0.3) root cause identified: invttrdg-backend and
trading-backend write bot_state.json + .bak on every market tick
(5×/min and 2×/min respectively) into container overlay layer →
~6 GB/day — intentional bot behaviour, no fix needed, trend chart
already in place to monitor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
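The "merged by container name" step that the commit above attributes to getAllContainers() can be pictured as a simple keyed join. The sketch below is Python rather than the TypeScript in repository.ts. Every field name except memLimitMiB (and its "0 = no limit" convention) is hypothetical, and the input rows stand in for already-parsed docker inspect and docker stats output.

```python
def merge_containers(inspect_rows, stats_rows):
    """Join inspect data with stats data on the container name."""
    stats_by_name = {row["name"]: row for row in stats_rows}
    merged = []
    for info in inspect_rows:
        stats = stats_by_name.get(info["name"], {})  # stopped containers have no stats
        merged.append({
            "name": info["name"],
            "state": info.get("state"),
            "memLimitMiB": info.get("memLimitMiB", 0),  # 0 means no limit
            "cpuPercent": stats.get("cpuPercent"),
            "memUsedMiB": stats.get("memUsedMiB"),
        })
    return merged


def without_limit(containers):
    """Containers that a 'No Limit' filter chip would show."""
    return [c for c in containers if c["memLimitMiB"] == 0]
```

Driving the join from the inspect rows keeps containers that have no stats entry in the result, with their usage fields left empty.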
Hermes VM
8d32cb7980 feat(dashboard/vm): Phases 4.1-4.3 — Prometheus trends, sparklines, weekly digest
- prometheus.ts: new Prometheus client with 7d/30d range queries for disk,
  memory, swap, CPU steal, and disk I/O (GB/hr); getWeeklyDigestData()
  aggregates all metrics for digest and API endpoint
- routes.ts: GET /api/vm/metrics/trend?metric=…&range=… and
  GET /api/vm/weekly-digest endpoints
- api.ts: TrendPoint/TrendSeries types; getTrend() and getMemoryTrend()
  added to vmApi
- vm/page.tsx: Sparkline (pure SVG polyline+fill), TrendCard with
  latest/avg/peak and threshold colouring, TrendsPanel with lazy load
  on first open; Promise.allSettled() isolation for all 5 data panels
- vm-weekly-digest.sh: weekly Telegram digest via docker exec into
  devops-backend to reach Prometheus; emoji severity indicators; cron
  summary from /var/log/vm-cleanup.log
- systemd timer: Mon 08:00 UTC, Persistent=true (fires on next boot
  if missed); first trigger 2026-06-02

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
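A "pure SVG polyline" sparkline with a latest/avg/peak summary, as mentioned for vm/page.tsx in the commit above, comes down to scaling each sample into the viewbox. The following is a Python sketch of that arithmetic under assumed dimensions (100 x 20) and hypothetical function names. The real component is TypeScript and may scale differently.

```python
def sparkline_points(values, width=100.0, height=20.0):
    """Return the 'points' attribute string for an SVG <polyline>."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat series
    step = width / (len(values) - 1) if len(values) > 1 else 0.0
    points = []
    for i, value in enumerate(values):
        x = i * step
        y = height - (value - lo) / span * height  # SVG y grows downward
        points.append(f"{x:.1f},{y:.1f}")
    return " ".join(points)


def trend_summary(values):
    """Latest, average and peak of a non-empty series."""
    return {
        "latest": values[-1],
        "avg": sum(values) / len(values),
        "peak": max(values),
    }
```

For example, sparkline_points([0, 5, 10]) yields "0.0,20.0 50.0,10.0 100.0,0.0", so the largest sample sits at the top of the box.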
saravanakumardb1
9a073ef225 docs(agent-queue): draft P2-S3 factory-agent integration prompt (claim/heartbeat/report/fence behind AQ_FLEET)
2026-05-29 22:03:12 -07:00
saravanakumardb1
7d6275f935 Merge origin/main (CI + installers) into Slice 4 merge
2026-05-29 21:59:57 -07:00
saravanakumardb1
2ad5c6dee5 Merge PR #4: Phase 1 Slice 4 — tracker adapter (task <-> job round-trip)
Closes Phase 1. selftest 53/53 (46 regression + 7 new); one-way echo sends
metrics only (no prompt/secrets — asserted); token env-sourced; Items API
contract (GET/PATCH status/POST comments, bearer + X-Product-Id).
2026-05-29 21:58:24 -07:00
saravanakumardb1
8ae504ca30 docs(agent-queue): tracker integration + close Phase 1 §10/§14 adapter (P1-S4)
README: Tracker integration section (from-tracker/to-tracker, env config, label->manifest table, one-way-echo rule, AQ_TRACKER_AUTO, real-use note). Roadmap: tick §10 Phase-1 adapter items + the §14 tracker-adapter item; add P1-S4 slice note; §0 Phase 1 -> 95% (remaining: budget.wall + Node dash surfacing).
2026-05-29 21:35:16 -07:00
saravanakumardb1
1e0a17bbc0 test(agent-queue): tracker adapter selftest cases (P1-S4)
Adds (never weakens) 7 stub-driven cases (AQ_TRACKER_API_CMD stub, no live service): from-tracker create + label mapping + idempotent; to-tracker shipped echo (PATCH done + metrics comment, asserts NO prompt body sent) + idempotent; HTTP 500 non-fatal; AQ_TRACKER_AUTO auto-echo on run. Full suite green (53 checks).
2026-05-29 21:35:16 -07:00
saravanakumardb1
b7a9ea1b7a feat(agent-queue): tracker adapter — task <-> job round-trip (P1-S4)
Implements §10 single-host tracker integration, closing the last Phase-1 §14 item:

- tracker_api: one curl-only HTTP wrapper (base URL + bearer + productId header),
  overridable via AQ_TRACKER_API_CMD so tests need no live service. Emits the
  response body + a trailing HTTP-code line; _api_call splits into API_BODY/API_CODE.
- aq from-tracker <ITEM_ID>: GET the Item, map title/description -> job body,
  labels (engine-class:/profile:/priority:/cap:) + Item priority -> frontmatter,
  and stamp tracker-item + a stable idempotency-key tracker-<id>. Materializes a
  .md into inbox/ via cmd_add; idempotent (Slice 1 dedupe) so a re-pull never dups.
  JSON parsed with POSIX awk (no jq) — mac + linux safe.
- aq to-tracker <job>: one-way echo (child -> tracker, §24.5). PATCHes the Item
  status (building/review/testing->in_progress, shipped->done, failures->wont_fix,
  all overridable) and posts a metrics-only comment (result/attempts/duration/
  tokens/cost/diff — NEVER prompt content or secrets). Idempotent via meta
  tracker_echoed; an echo failure (e.g. HTTP 500) is logged and non-fatal — the
  tracker is downstream, never authoritative for execution.
- Opt-in auto-echo (AQ_TRACKER_AUTO=1, default OFF): the worker echoes on each
  transition (building via cmd_run, review/testing/failed via run_worker, shipped
  via ship/promote); never blocks or fails a job.
- status + insights surface tracker-item and the last echoed status.

curl-only HTTP; no new runtime deps; conventional + backward-compatible.
2026-05-29 21:35:06 -07:00
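Two details from the commit above lend themselves to a short sketch: splitting a response that ends with a trailing HTTP-code line, and the overridable job-state to Item-status mapping. The Python below is illustrative only, since the runner does this in bash with POSIX awk. The function names and the "failed" key (standing in for "failures") are assumptions.

```python
# Defaults taken from the commit message; "failed" is a hypothetical key
# standing in for the failure states it mentions.
DEFAULT_STATUS_MAP = {
    "building": "in_progress",
    "review": "in_progress",
    "testing": "in_progress",
    "shipped": "done",
    "failed": "wont_fix",
}


def split_response(raw):
    """Split 'body + trailing HTTP-code line' into (body, code)."""
    body, _, code = raw.rstrip("\n").rpartition("\n")
    return body, int(code)


def tracker_status(job_state, overrides=None):
    """Resolve the Item status for a job state, letting overrides win."""
    mapping = {**DEFAULT_STATUS_MAP, **(overrides or {})}
    return mapping.get(job_state)
```

Since the echo is described as non-fatal, a caller would log a code such as 500 from split_response and carry on rather than raise.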
Saravanakumar D
5a278ad119 ci: add GitHub Actions CI (shellcheck, syntax, preview)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-29 21:31:00 -07:00
Saravanakumar D
efe0da3169 chore(devops): add cross-platform runners and README; normalize EOLs
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-29 21:26:47 -07:00
Saravanakumar D
b6dc0768e3 chore(devops): finalize CLI install report and helper
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-29 21:20:52 -07:00
Saravanakumar D
9e28d85a64 chore(devops): update CLI install report and add symlink helper
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-29 21:20:52 -07:00
saravanakumardb1
d0348f23de docs(agent-queue): P0 atomic-claim resolved (PR #29) — tick §4/§13/§14 fleet items
2026-05-29 21:05:38 -07:00
saravanakumardb1
2e9bd4dd1e docs(agent-queue): record P2 Foundation merged + track P0 atomic-claim hardening (§4)
- §4: implementation-status note — fleet module merged (PR #28); atomic claim NOT
  yet concurrency-safe (rev-CAS over unconditional write, sequential-only test)
- add phase2-atomic-claim-hardening.md: updateIfMatch in @bytelyst/datastore
  (Cosmos If-Match + process-atomic memory) + concurrent claim tests
2026-05-29 20:43:28 -07:00
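The gap noted in the commit above (an unconditional write where a revision compare-and-swap is needed) can be illustrated with a small in-memory store. This is a Python sketch of the general optimistic-concurrency pattern, not the @bytelyst/datastore API. MemoryStore, update_if_match and claim are hypothetical names, and the job shape (owner, leaseEpoch) is assumed for illustration.

```python
import threading


class MemoryStore:
    """Process-local store whose writes succeed only on a matching revision."""

    def __init__(self):
        self._lock = threading.Lock()
        self._docs = {}  # key -> (rev, value)

    def get(self, key):
        with self._lock:
            return self._docs.get(key, (0, None))

    def update_if_match(self, key, expected_rev, value):
        # Check and write happen under one lock, so two writers holding
        # the same expected_rev cannot both succeed.
        with self._lock:
            rev, _ = self._docs.get(key, (0, None))
            if rev != expected_rev:
                return False
            self._docs[key] = (rev + 1, value)
            return True


def claim(store, job_id, factory_id):
    """Try to claim a job; return the claimed record or None if we lost."""
    rev, job = store.get(job_id)
    if job is None or job.get("owner") is not None:
        return None
    claimed = {**job, "owner": factory_id,
               "leaseEpoch": job.get("leaseEpoch", 0) + 1}
    return claimed if store.update_if_match(job_id, rev, claimed) else None
```

A concurrent test would start several threads calling claim on the same job and assert that exactly one returns a record, which is the kind of case a sequential-only test cannot exercise.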
saravanakumardb1
0e94705ab7 docs(agent-queue): draft Phase 2 Foundation long-run prompt (fleet module + coordinator: claim/lease/fencing/reaper)
2026-05-29 19:54:33 -07:00
saravanakumardb1
d43cab8afe Merge PR: Phase 1 Slice 2 — profiles (persona/caps/scope inheritance) + deps/DAG (blocked, cycle detection)
Reviewed against §5/§6 single-host scope; selftest 46/46 (34 regression + 12 new);
profile resolution precedence, persona injection, warn-only scope, deps soft/cycle verified.
2026-05-29 19:49:28 -07:00
saravanakumardb1
e183919c60 docs(agent-queue): profiles + deps docs; tick §5/§6 + bump Phase 1 to 80% (P1-S2)
README: Profiles & deps section (resolution precedence, persona, allowed-scope warn-only, deps/blocked + cycle detection); manifest table moves profile/deps/deps-mode to active. Roadmap: tick §6 catalog/persona/inheritance/allowed-scope and §5 deps + the §14 profile/deps/scope boxes; add P1-S2 slice note; §0 Phase 1 -> 80%.
2026-05-29 19:26:33 -07:00
saravanakumardb1
71d8a7cd4e test(agent-queue): profiles + deps/DAG selftest cases (P1-S2)
Adds (never weakens) temp-catalog + temp-git cases: profile verify inheritance + job-override precedence, persona-injection golden, profile capability inheritance, allowed-scope warn-only + path_in_scope unit, deps block->run, deps-mode soft (testing/), and submit-time cycle rejection. Full suite green (46 checks).
2026-05-29 19:26:26 -07:00
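The deps cases listed in the commit above (block then run, deps-mode soft, submit-time cycle rejection) rest on two small checks. A Python sketch of them follows. The runner implements these in bash, so the names, the dict-based graph, and the folder strings are illustrative assumptions.

```python
def dep_satisfied(folder, mode="hard"):
    """A dependency counts as met in shipped/, or also testing/ in soft mode."""
    return folder == "shipped" or (mode == "soft" and folder == "testing")


def would_cycle(graph, new_key, new_deps):
    """True if adding new_key with new_deps makes new_key depend on itself.

    graph maps an idempotency key to the keys it depends on.
    """
    stack = list(new_deps)
    seen = set()
    while stack:
        key = stack.pop()
        if key == new_key:
            return True  # walked back to the job being added
        if key in seen:
            continue
        seen.add(key)
        stack.extend(graph.get(key, []))
    return False
```

For instance, if job "b" already depends on "a", submitting "a" with a dependency on "b" would be rejected, while submitting "c" with the same dependency would not.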
saravanakumardb1
f2dabdeb81 feat(agent-queue): starter profile catalog (P1-S2)
profiles/<name>.md presets (name, persona, capabilities, default-verify, engine-class, prefers-engine, allowed-scope, review-policy) for developer, backend-engineer, frontend-engineer, ux-designer, ui-designer, qa, reviewer, docs-writer, and a reserved planner.
2026-05-29 19:26:26 -07:00
saravanakumardb1
3d99f04427 feat(agent-queue): profiles (persona + presets) and single-host deps/DAG (P1-S2)
Implements roadmap §6 (profiles) and §5 deps on the bash runner, backward-compatible
(jobs without profile/deps behave exactly as before).

Profiles (§6):
- profile_get / profile_persona / fm_eff helpers + PROFILES_DIR (AGENT_QUEUE_PROFILES
  override). A job's `profile:` inherits verify (<- default-verify), capabilities,
  engine-class, prefers-engine, allowed-scope, review-policy when the job omits them;
  job fields always override (precedence job > profile > default). Resolution runs via
  fm_eff inside the capability gate and resolve_engine, so inherited caps/engine-class
  take effect before launch.
- persona injection: the profile's persona block is prepended to the stripped body
  fed to the engine (job .md unchanged on disk; nothing secret logged).
- allowed-scope guardrail (WARN-ONLY): scope_check logs a non-blocking WARNING +
  records scope_warning= for changed paths outside the globs; path_in_scope is a
  pure, unit-testable matcher (`dir/**` = subtree).

deps / DAG, single host (§5):
- deps reference other jobs by idempotency-key. dep_satisfied: shipped/ (hard) or
  shipped/+testing/ (deps-mode: soft). deps_unmet drives a block-with-reason skip in
  inbox selection (never launched/failed); cmd_status surfaces "blocked (waiting on
  <keys>)". deps_would_cycle rejects cyclic submits on `add`.
- _drain_pending: `--once` drains past dep-blocked jobs (idle can't satisfy them)
  while still waiting on retry/recovery backoff timers.

Meta now records effective (inherited) capabilities/engine-class/prefers-engine/
review-policy/allowed-scope so `status` reflects resolved config.
2026-05-29 19:26:16 -07:00
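The resolution order stated in the commit above (job > profile > default, with verify inheriting from the profile's default-verify) and the subtree rule for allowed-scope (`dir/**`) can be sketched as below. This is a Python illustration with hypothetical names. It only handles the `dir/**` form and exact paths, which is a narrower matcher than the real one may be.

```python
def effective_field(job, profile, key, default=None):
    """Resolve a field with precedence job > profile > default."""
    if key in job:
        return job[key]
    # The commit says a job's verify inherits from the profile's default-verify.
    profile_key = "default-verify" if key == "verify" else key
    if profile_key in profile:
        return profile[profile_key]
    return default


def path_in_scope(path, globs):
    """Pure matcher: 'dir/**' covers dir and everything beneath it."""
    for pattern in globs:
        if pattern.endswith("/**"):
            prefix = pattern[:-3]
            if path == prefix or path.startswith(prefix + "/"):
                return True
        elif path == pattern:  # assumption: other patterns match exactly
            return True
    return False
```

Because the guardrail is warn-only, a False result here would be logged and recorded rather than used to block the job.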
saravanakumardb1
7c4f5bc9b0 docs(agent-queue): draft Slice 4 (tracker adapter) + Phase 2 Slice 1 (fleet data model)
2026-05-29 19:11:09 -07:00
saravanakumardb1
0443590ce4 docs(agent-queue): update Slice 2 prompt — branch off main (Slice 1+3 merged)
2026-05-29 19:05:34 -07:00
saravanakumardb1
41f91d7ea1 Merge PR: Phase 1 Slice 3 — single-host resilience (crash recovery, WIP checkpoint/resume, retry) + execution insights
Reviewed against §11/§25/§26 single-host scope; selftest 34/34 (regression intact);
WIP protects current branch + PID-reuse guard on orphan recovery verified.
2026-05-29 19:00:13 -07:00
saravanakumardb1
87a4bf591a docs(agent-queue): Resilience + Insights docs; tick §11/§25/§26 single-host (P1-S3)
README: Resilience + Insights sections, retry frontmatter active (manifest table), retries_exhausted/recovered result values, recover/insights commands, honest token caveat. Roadmap: tick fully-completed single-host boxes in §11/§25/§26 with annotations; bump §0 Phase 1 to 55%.
2026-05-29 18:43:38 -07:00
saravanakumardb1
f46dd38adb test(agent-queue): resilience + insights selftest cases (P1-S3)
Adds (never weakens) temp-git-repo + stub cases: orphan recovery (+idempotent), WIP checkpoint/numstat, non-git skip, WIP resume, retry on verify_failed and crash (incl. no-retry when class absent), parse_usage extraction, per-engine aggregate. Inbox-empty-safe counts; avoids the pipefail+grep -q SIGPIPE trap.
2026-05-29 18:43:30 -07:00
saravanakumardb1
679d8b72cd feat(agent-queue): dashboard insights column for finished jobs (P1-S3)
Read-only from meta: tokens or cost + attempts + line deltas + duration; recognizes the new retries_exhausted result. agent-queue.sh stays the source of truth.
2026-05-29 18:43:30 -07:00
saravanakumardb1
1758bc1ab1 feat(agent-queue): single-host crash recovery, WIP checkpoint/resume, retry + insights (P1-S3)
Implements the single-host bash equivalents of roadmap §25 (durability/crash
recovery) and §26 (execution insights), plus §11 retry/dead-letter stand-in.

Resilience (A1-A4):
- recover_orphans + `recover` command: building/ jobs with a dead worker (dead
  pid, pidstart reuse-guard) are moved back to inbox/ with attempts incremented,
  on `run` startup and each loop. Idempotent (folder location is the guard).
- WIP checkpointing: for a git cwd, _wip_start creates/checks out aq/wip/<job>
  and _wip_checkpoint commits changes on every exit path via an EXIT/INT/TERM
  trap; never commits to main/current branch; non-git cwd skipped. RESUME: a
  relaunch whose aq/wip/<job> exists checks it out first (continue from
  checkpoint). wip_base persisted in a write-once sidecar.
- retry policy (now functional): retry { max, backoff, on } requeues failures
  whose class (timeout|verify_failed|crash) is in `on`, honoring backoff via
  next_eligible (selection skips until eligible), up to max attempts; exhaustion
  -> failed/ result=retries_exhausted with the WIP branch + full log preserved.
- state integrity: all meta writes stay append-only; attempts/next_eligible/wip_*
  are re-derivable; recovery is crash-safe.

Insights (B1-B6):
- per-run metrics into meta: duration_s, exit, result, attempts, and (git cwd)
  files_changed/lines_added/lines_deleted from numstat wip_base..HEAD.
- parse_usage(engine, log) adapter: generic AQ_USAGE line + Claude/Codex token
  heuristics; Devin/Copilot TODO; usage_estimated flag; never fabricates numbers.
- status insights sub-line; new `insights [job]` command (per-job metrics or a
  recent table + per-engine token/cost/success/duration rollup).
- privacy: only metrics are recorded, never prompt content or secrets.

Backward-compatible: legacy .md and non-git cwd behave exactly as before.
2026-05-29 18:43:21 -07:00
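The retry policy in the commit above (retry { max, backoff, on }, gated by next_eligible, ending in retries_exhausted) can be read as a small decision function. The Python below is a sketch under stated assumptions: backoff is treated as a fixed number of seconds, attempts counts launches so far, and a failure class outside `on` fails with its own class as the result. None of these details are confirmed by the commit text.

```python
def retry_decision(failure_class, attempts, policy, now):
    """Decide whether a failed job is requeued or parked in failed/."""
    if failure_class not in policy.get("on", []):
        return {"action": "fail", "result": failure_class}
    if attempts >= policy.get("max", 0):
        return {"action": "fail", "result": "retries_exhausted"}
    return {"action": "requeue",
            "next_eligible": now + policy.get("backoff", 0)}


def eligible(meta, now):
    """Selection skips a job until its next_eligible time has passed."""
    return now >= meta.get("next_eligible", 0)
```

A job with policy {"max": 3, "backoff": 60, "on": ["timeout", "crash"]} that crashes on its first attempt at time 1000 would be requeued with next_eligible 1060 and skipped by selection until then.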
saravanakumardb1
bc0c0e263c Merge PR #1: Phase 1 Slice 1 — evolved manifest, priority, capabilities, engine-class, idempotency
Reviewed against the diff (capability gate before launch, 3-pass idempotency,
priority+age selection, engine-class resolution, timeout/flock launch). selftest 18/18.
2026-05-29 18:12:41 -07:00
saravanakumardb1
1f18f5d7a3 docs(agent-queue): add Phase 1 Slice 3 prompt (resilience & insights, single host)
2026-05-29 18:10:43 -07:00
saravanakumardb1
beb225162a docs(agent-queue): add durability/crash-recovery (§25) + execution insights/token accounting (§26)
- §13: fleet_jobs stores instruction bodyMd (durable md SoT) + checkpoint;
  fleet_runs carries token/cost/model/tool/diff metrics
- §25: instructions durable in Cosmos md, WIP checkpoint branch aq/wip/<jobId>,
  orphan detection, resume-vs-restart, fencing, retry->dead_letter, crash taxonomy
- §26: per-run token/cost/latency/tool insights, honest metered-vs-estimated
  capture, rollups, control-plane surfacing, secret redaction
- feature-catalog rows for §25 and §26
2026-05-29 18:09:32 -07:00
saravanakumardb1
470b2ce8d0 docs(agent-queue): version Phase 1 slice prompts (slice1, slice2)
Track the delegated agent task prompts under docs/jobs/ so the slice
decomposition of the gigafactory roadmap is reproducible and reviewable.
2026-05-29 18:05:06 -07:00
saravanakumardb1
67d8aa5766 docs(agent-queue): add work hierarchy & composite delegation (roadmap/epic)
New §24 + feature-catalog row:
- two delegation modes: atomic (leaf bug/feature/task) vs composite (roadmap/epic)
- introduce job kind (leaf|composite); composite routes to a planner/orchestrator
  that fans out child leaf jobs as a DAG across factories/agents/profiles
- parentId hierarchy + rollup semantics (status/budget/verify/phase-gates) +
  idempotent re-run (skip shipped children)
- source-of-truth/sync discipline (one record referenced by many; one-way echo)
- HYBRID decision recorded: model kind/parentId/rollup in the fleet layer now,
  keep shared tracker ITEM_TYPES unchanged (label kind:roadmap), promote to a
  first-class epic type later via additive migration once proven
- phasing: leaf-only P1-P2; manual composite P3; auto-decomposition planner P3->P5
2026-05-29 18:02:10 -07:00
saravanakumardb1
a9c69b1dce docs(agent-queue): manifest field table (active vs reserved) + tick Phase 1 Slice 1 (P1-S1)
- README: new "Manifest fields (Gigafactory Phase 1)" table marking ACTIVE vs
  RESERVED, capability-grammar table, idempotency-key semantics, copilot engine
  mapping, COPILOT_BIN, and capability_mismatch/no_engine result values.
- GIGAFACTORY_ROADMAP: tick only the fully-completed P1 boxes (frontmatter
  parsing, capability detect+match, priority, backward-compat, capability
  grammar, engine-class taxonomy, idempotency-key semantics, README/progress),
  annotate partials, and bump §0 Phase 1 to in-progress 35%.
2026-05-29 17:44:37 -07:00
saravanakumardb1
4600a41e5d test(agent-queue): self-test cases for manifest/priority/capabilities/engine-class/idempotency (P1-S1)
Adds (never weakens existing) cases, each in its own temp AGENT_QUEUE_ROOT using
the no-op engine stub:
- backward-compat: legacy engine/cwd/yolo-only .md still lands in review/.
- priority: with --max 1, a critical job queued after a low job runs first
  (order-recording stub).
- capability mismatch: has:definitely-not-installed -> failed/
  result=capability_mismatch, asserting the agent was never launched.
- engine-class: agentic-coder + no engine, DEVIN_BIN stubbed -> review/.
- idempotency: same key+body twice -> 1 inbox file; same key+changed body in
  inbox -> superseded; same key+different body after drain -> rejected.

Inbox counts use find (not a globbing ls) so set -e/pipefail tolerate an empty inbox.
2026-05-29 17:44:27 -07:00
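The three idempotency outcomes tested in the commit above (same key and body is a no-op, same key with a changed body supersedes an inbox job, same key with a different body after drain is rejected) can be expressed as one decision function. This Python sketch assumes bodies are compared by SHA-256 digest and that the caller supplies key-to-digest maps for the inbox and for already-drained jobs. Both are illustrative choices, not the runner's actual mechanism.

```python
import hashlib


def body_digest(body):
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def add_decision(key, body, inbox, drained):
    """Return 'add', 'noop', 'supersede' or 'reject' for a submitted job.

    inbox and drained map idempotency keys to body digests.
    """
    digest = body_digest(body)
    if inbox.get(key) == digest or drained.get(key) == digest:
        return "noop"        # identical resubmission
    if key in inbox:
        return "supersede"   # prior copy has not started yet
    if key in drained:
        return "reject"      # prior copy already left the inbox
    return "add"
```

Checking for an identical body first means a repeated submit never errors, whichever folder the earlier copy has reached.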
saravanakumardb1
0be5b34123 feat(agent-queue): evolved manifest, priority, capabilities, engine-class, idempotency (P1-S1)
Implements Gigafactory Phase 1 - Slice 1 in the bash runner (backward-compatible;
a legacy engine/cwd/yolo-only .md behaves exactly as before):

- Parse all new §5 manifest keys via fm_get with safe defaults; record them in
  <job>.meta and surface priority/profile/capabilities/tracker-item in `status`.
  Only priority, capabilities, engine-class and idempotency-key are functional
  this slice; the rest (profile, prefers, budget, deps, deps-mode, retry,
  review-policy, artifacts, tracker-item) are stored but inert.
- priority ordering: inbox_sorted picks critical>high>medium>low, ties by oldest;
  per-lock serialization preserved.
- capability grammar + match: detect_capabilities advertises os/engine/node/has
  tokens; caps_match honors key, key:value, key<op>version and os:any. A job whose
  declared capabilities the host cannot satisfy is moved to failed/ with
  result=capability_mismatch and the agent is never launched.
- engine-class resolution: explicit engine wins; else engine-class picks the first
  available engine honoring prefers-engine (agentic-coder->devin,claude,codex;
  chat-coder->copilot). No available engine -> result=no_engine. Adds copilot to
  the engine driver + COPILOT_BIN.
- idempotency-key dedupe on add: same key+body -> no-op; same key+different body
  supersedes an inbox prior, else is rejected with a clear error.

No change to queue/ data or the run/ship lifecycle. macOS + Linux safe.
2026-05-29 17:44:19 -07:00
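The inbox ordering described in the commit above (critical > high > medium > low, ties broken by oldest) amounts to a two-part sort key. A minimal Python sketch follows. The queued_at field and the fallback to medium for a missing or unknown priority are assumptions, and the function is not the bash inbox_sorted itself.

```python
PRIORITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def sort_inbox(jobs):
    """Order jobs by priority first, then by oldest queue time."""
    def key(job):
        # Assumption: missing or unknown priorities sort as medium.
        rank = PRIORITY_RANK.get(job.get("priority"), PRIORITY_RANK["medium"])
        return (rank, job["queued_at"])
    return sorted(jobs, key=key)
```

With one low job queued at time 1 and one critical job queued at time 2, the critical job sorts first, which matches the selftest case where a critical job queued after a low job runs first.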
saravanakumardb1
3ad9500623 docs(agent-queue): harden gigafactory roadmap after principal review
Fix correctness/distributed-systems bugs and fill gaps in place:
- atomic claim (optimistic concurrency/_etag), fencing token (leaseEpoch),
  coordinator-authoritative time added to core contract + scheduler + factory
- lease reclaim via coordinator reaper, not Cosmos TTL (TTL only GCs rows)
- split-brain/partition safety: fencing + distributed lock + quarantine
- budget: wall is the only hard ceiling; usd/tokens best-effort (provider metering)
- SSE live logs cannot use the buffering tracker proxy; use a streaming route +
  blob log storage (fleet_artifacts container)
- manifest: capability grammar, engine-class enum, idempotency 409 + deps-satisfied
  semantics, dep cycle detection
- tracker status mapping table + PR-flow ship semantics (merged+green vs pr-opened)
- station/seat capacity, factory health definition, enrollment/bootstrap auth
- Cosmos RU/indexing + claim-loop poll cost; add new sections: rollout/rollback &
  data migration (§21), capacity planning & cost (§22), ownership & RACI (§23)
- success metrics now carry provisional SLO targets; Phase 2 checklist + index synced
2026-05-29 17:15:28 -07:00
saravanakumardb1
90366e59bb docs(agent-queue): add gigafactory vision + checklist implementation roadmap
- docs/GIGAFACTORY_ROADMAP.md: distributed multi-machine fleet vision
  (factory x tool x profile routing) as a checklist-driven, phased
  implementation roadmap (Phase 0-5) with acceptance criteria, verify
  gates, and a 100% Definition-of-Done rubric
- committed path: coordinator as a platform-service module + control
  plane on tracker-web, reached via a thin tracker adapter first; bash
  runner survives as the offline edge factory agent
- README: add vision/roadmap pointer
2026-05-29 17:06:32 -07:00
saravanakumardb1
7877e64f90 chore(cli): make bytelyst-cli.sh executable
2026-05-29 16:42:39 -07:00
saravanakumardb1
dde677f4b9 feat(agent-queue): interactive dashboard — navigable job list + single-key actions
Turn dash into a menu-driven control panel (single mjs script):
- numbered, arrow/j-k/1-9 selectable JOBS list (review/testing/failed/inbox)
- single-key actions wired to agent-queue.sh (single source of truth):
  p promote, s ship, x reject, u requeue (reject/requeue confirm y/n)
- enter/l opens a live log viewer; r starts a detached run loop, S stops it
- run-loop pid indicator, transient action flashes, ? help overlay
- non-TTY falls back to the read-only live view
- README: dash command + interactive key table
2026-05-29 16:19:23 -07:00