Commit Graph

193 Commits

Author SHA1 Message Date
saravanakumardb1
67d8aa5766 docs(agent-queue): add work hierarchy & composite delegation (roadmap/epic)
New §24 + feature-catalog row:
- two delegation modes: atomic (leaf bug/feature/task) vs composite (roadmap/epic)
- introduce job kind (leaf|composite); composite routes to a planner/orchestrator
  that fans out child leaf jobs as a DAG across factories/agents/profiles
- parentId hierarchy + rollup semantics (status/budget/verify/phase-gates) +
  idempotent re-run (skip shipped children)
- source-of-truth/sync discipline (one record referenced by many; one-way echo)
- HYBRID decision recorded: model kind/parentId/rollup in the fleet layer now,
  keep shared tracker ITEM_TYPES unchanged (label kind:roadmap), promote to a
  first-class epic type later via additive migration once proven
- phasing: leaf-only P1-P2; manual composite P3; auto-decomposition planner P3->P5
2026-05-29 18:02:10 -07:00
saravanakumardb1
3ad9500623 docs(agent-queue): harden gigafactory roadmap after principal review
Fix correctness/distributed-systems bugs and fill gaps in place:
- atomic claim (optimistic concurrency/_etag), fencing token (leaseEpoch),
  coordinator-authoritative time added to core contract + scheduler + factory
- lease reclaim via coordinator reaper, not Cosmos TTL (TTL only GCs rows)
- split-brain/partition safety: fencing + distributed lock + quarantine
- budget: wall is the only hard ceiling; usd/tokens best-effort (provider metering)
- SSE live logs cannot use the buffering tracker proxy; use a streaming route +
  blob log storage (fleet_artifacts container)
- manifest: capability grammar, engine-class enum, idempotency 409 + deps-satisfied
  semantics, dep cycle detection
- tracker status mapping table + PR-flow ship semantics (merged+green vs pr-opened)
- station/seat capacity, factory health definition, enrollment/bootstrap auth
- Cosmos RU/indexing + claim-loop poll cost; add new sections: rollout/rollback &
  data migration (§21), capacity planning & cost (§22), ownership & RACI (§23)
- success metrics now carry provisional SLO targets; Phase 2 checklist + index synced
2026-05-29 17:15:28 -07:00
saravanakumardb1
90366e59bb docs(agent-queue): add gigafactory vision + checklist implementation roadmap
- docs/GIGAFACTORY_ROADMAP.md: distributed multi-machine fleet vision
  (factory x tool x profile routing) as a checklist-driven, phased
  implementation roadmap (Phase 0-5) with acceptance criteria, verify
  gates, and a 100% Definition-of-Done rubric
- committed path: coordinator as a platform-service module + control
  plane on tracker-web, reached via a thin tracker adapter first; bash
  runner survives as the offline edge factory agent
- README: add vision/roadmap pointer
2026-05-29 17:06:32 -07:00
saravanakumardb1
7877e64f90 chore(cli): make bytelyst-cli.sh executable 2026-05-29 16:42:39 -07:00
saravanakumardb1
dde677f4b9 feat(agent-queue): interactive dashboard — navigable job list + single-key actions
Turn dash into a menu-driven control panel (single mjs script):
- numbered, arrow/j-k/1-9 selectable JOBS list (review/testing/failed/inbox)
- single-key actions wired to agent-queue.sh (single source of truth):
  p promote, s ship, x reject, u requeue (reject/requeue confirm y/n)
- enter/l opens a live log viewer; r starts a detached run loop, S stops it
- run-loop pid indicator, transient action flashes, ? help overlay
- non-TTY falls back to the read-only live view
- README: dash command + interactive key table
2026-05-29 16:19:23 -07:00
saravanakumardb1
4ed4d75a67 feat(agent-queue): default max concurrency 2->3 (still env/flag configurable)
- AGENT_QUEUE_MAX default 3 (override via env or run --max N)
- sync README quick-start + env table + bytelyst-cli example to --max 3
2026-05-29 16:09:12 -07:00
Saravanakumar D
58773ac108 feat(devops): add interactive WSL CLI installer script
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-29 16:05:01 -07:00
saravanakumardb1
af1bc6904e feat(agent-queue): build/ship lifecycle with auto-QA verify gate + manual ship
Redesign the kanban runner stages from inbox->doing->done/failed to
inbox->building->review->testing->shipped (+ failed):

- worker: agent rc=0 lands in review/, then runs the configurable
  verify command (frontmatter verify: / AGENT_QUEUE_VERIFY) in cwd;
  pass -> testing/ (QA), fail -> failed/, none -> parks in review/
- new commands: ship (testing->shipped, manual gate), promote
  (advance one stage), reject (review/testing->failed); requeue now
  also pulls from review/testing
- status + dashboard.mjs render all six stages; RECENT panel labels
  shipped/testing/review/verify_failed/timeout/rejected
- README: new lifecycle diagram, verify: frontmatter, result= glossary,
  command table + folder layout
- selftest: assert no-verify->review, verify-pass->testing->ship->shipped,
  verify-fail->failed
- rename queue/doing->building, queue/done->review; add testing/ shipped/
2026-05-29 16:03:01 -07:00
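The stage transitions described in this commit can be sketched as a single decision function. This is only an illustration under stated assumptions: `next_stage` is a hypothetical name, not a function from agent-queue.sh, and the rule that a non-zero agent exit code goes straight to failed is inferred from the lifecycle description rather than stated outright.

```bash
# Hypothetical sketch of the stage decision; not the actual worker code.
# Usage: next_stage AGENT_RC VERIFY_CMD [CWD]
# Prints the folder the job would land in: review, testing, or failed.
next_stage() (
  rc=$1 verify=$2 cwd=${3:-.}
  [ "$rc" -eq 0 ] || { echo failed; exit 0; }      # assumption: agent error -> failed
  [ -n "$verify" ] || { echo review; exit 0; }     # no verify command: park in review
  if ( cd "$cwd" && sh -c "$verify" ) >/dev/null 2>&1; then
    echo testing                                   # verify passed: hand over to QA
  else
    echo failed                                    # verify failed
  fi
)
```

For example, `next_stage 0 "npm test" /path/to/repo` would print `testing` only if the verify command exits 0 in that directory. The final testing to shipped step stays manual (`ship`), so it is deliberately absent here.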
saravanakumardb1
27feba36fa fix(agent-queue): status used undefined live_workers; call active_workers 2026-05-29 15:27:15 -07:00
saravanakumardb1
2f6aea07e0 chore(agent-queue): track queue/ in repo + seed inbox with nomgap/localmemgpt/devintelli jobs 2026-05-29 15:10:00 -07:00
saravanakumardb1
c52c165fd6 docs(agent-queue): document locking, timeout, stall, requeue/clean
Update README command table (requeue/clean, stall marker, single-run note),
frontmatter (lock/timeout), engine mapping (stdin), config (STALL_MIN,
FLOCK_BIN/TIMEOUT_BIN), folder layout (locks/.archive), Safety (automatic
same-repo serialization + portability notes), and mark roadmap items done.
2026-05-28 22:33:20 -07:00
saravanakumardb1
1f15520c4f feat(agent-queue): add requeue and clean commands
- requeue <job>: move a failed job back to inbox/ and drop stale meta/body so
  it re-runs cleanly
- clean [--keep N]: archive finished jobs' logs+meta beyond the newest N
  (default 50) into queue/.archive/<ts>/; running jobs + .md records untouched
- document both in usage + bytelyst-cli subcommand list
2026-05-28 22:31:56 -07:00
saravanakumardb1
76104bda84 fix(cli): harden bytelyst-cli env loading, pagination, and HTTP checks
- .env via 'set -a; . ./.env; set +a' (handles quoted values/spaces safely)
- printf for the GITHUB_TOKEN message so the newline renders
- gh_get_all: paginate all pages (per_page=100) and verify HTTP 200 before jq;
  rewire list-public/list-private/check-collaborators through it
- fix SC2199 whitelist membership (explicit loop, no substring false-matches)
- shell-ci: gate shellcheck on bytelyst-cli + run agent-queue self-test
2026-05-28 22:30:08 -07:00
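The SC2199 fix (explicit loop, no substring false-matches) can be illustrated with an exact-match membership helper. This is a sketch, not the code in bytelyst-cli.sh: `in_list` is a hypothetical name, and the items are passed as arguments so the same function works in bash and plain sh.

```bash
# Hypothetical helper: exact-match membership test.
# Usage: in_list NEEDLE ITEM...
# Returns 0 only when NEEDLE equals one ITEM exactly, so "fo" never
# matches "foo" the way a substring or regex test on a joined string can.
in_list() (
  needle=$1
  shift
  for item in "$@"; do
    [ "$item" = "$needle" ] && exit 0
  done
  exit 1
)
```

In bash, a whitelist array (the name `WHITELIST` is an assumption) would be passed as `in_list "$repo" "${WHITELIST[@]}"`.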
saravanakumardb1
4239648876 fix(agent-queue): verify pid start time to defeat pid reuse
Record pidstart (ps lstart) at launch and verify it in all liveness checks
(_meta_active, status, stop) via _pid_alive, so a recycled pid can never be
mistaken for our worker. Falls back to plain liveness when no start time recorded.
2026-05-28 22:24:50 -07:00
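A minimal sketch of a start-time-aware liveness check along the lines this commit describes. The names `pid_start` and `pid_alive` are hypothetical stand-ins (the commit's own helper is `_pid_alive`, whose body is not shown here); the sketch assumes a `ps` that supports the `lstart` output keyword, as procps and BSD `ps` do.

```bash
# Hypothetical helper: print the start time of a pid (empty if unavailable).
pid_start() { ps -o lstart= -p "$1" 2>/dev/null; }

# Hypothetical sketch: succeed only if PID is alive AND, when a start time
# was recorded at launch, the process still reports that same start time.
# Usage: pid_alive PID [RECORDED_START]
pid_alive() (
  pid=$1 recorded=$2
  kill -0 "$pid" 2>/dev/null || exit 1       # no such process (or not ours)
  [ -n "$recorded" ] || exit 0               # nothing recorded: plain liveness
  [ "$(pid_start "$pid")" = "$recorded" ]    # same pid and same start time
)
```

At launch the caller would store `pid_start "$pid"` next to the pid; a recycled pid then fails the comparison because its start time differs.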
saravanakumardb1
a849a30e11 feat(agent-queue): refuse a second run when a daemon is already active
cmd_run now checks daemon.pid liveness up front: if a run loop is alive it exits
with an error (protecting the single-launcher invariant locking depends on); a
stale daemon.pid (dead pid) is cleared and the run proceeds.
2026-05-28 22:21:31 -07:00
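The up-front check can be sketched as follows. `claim_daemon_pidfile` is a hypothetical name and the logic is an assumption based on the commit text, not the actual `cmd_run` code. The check-then-write sequence is not atomic, which is acceptable only under the single-launcher assumption the commit mentions.

```bash
# Hypothetical sketch of the guard at the top of a run command.
# Returns 1 if a live process owns PIDFILE; otherwise clears a stale
# pidfile, records the current shell's pid, and returns 0.
claim_daemon_pidfile() (
  pidfile=$1
  if [ -f "$pidfile" ]; then
    old=$(cat "$pidfile")
    if [ -n "$old" ] && kill -0 "$old" 2>/dev/null; then
      echo "run loop already active (pid $old)" >&2
      exit 1
    fi
    rm -f "$pidfile"             # stale: the recorded pid is dead
  fi
  echo "$$" > "$pidfile"         # $$ is the invoking shell, even in a subshell
)
```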
saravanakumardb1
11935d0539 fix(agent-queue): reserve concurrency slot before backgrounding worker
Replace live_workers with reservation-aware active_workers + shared _meta_active:
a job counts toward --max the moment its meta is written (before the worker is
backgrounded), so --max can never be exceeded. A <30s guard prevents a meta
orphaned mid-launch from pinning a slot. busy_keys now shares _meta_active.
2026-05-28 22:17:36 -07:00
saravanakumardb1
79331d591f feat(agent-queue): flag stalled workers in status + dash
Mark a running worker '⚠ stalled' when its log has not changed for more than
AGENT_QUEUE_STALL_MIN minutes (default 10), using log mtime as the freshness
signal. Implemented in both the bash status table and the Node dashboard.
2026-05-28 22:15:26 -07:00
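A minimal sketch of the log-mtime freshness test, assuming a `find` that supports `-mmin` (GNU, BSD and BusyBox do; it is not strictly POSIX). `is_stalled` is a hypothetical name and not necessarily how the status table computes the flag.

```bash
# Hypothetical helper: succeed when LOG was last modified more than
# MINUTES minutes ago. MINUTES defaults to AGENT_QUEUE_STALL_MIN, then 10.
# Usage: is_stalled LOG [MINUTES]
is_stalled() (
  log=$1
  minutes=${2:-${AGENT_QUEUE_STALL_MIN:-10}}
  [ -f "$log" ] || exit 1                       # no log yet: not flagged
  [ -n "$(find "$log" -mmin +"$minutes" 2>/dev/null)" ]
)
```

A status loop could then append the stalled marker for any running worker where `is_stalled "$logfile"` succeeds.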
saravanakumardb1
3b71f0117a feat(agent-queue): per-job timeout via frontmatter timeout:
Honor 'timeout: 45m' (90s|45m|2h|1d) by wrapping the agent in timeout/gtimeout
when available (hard process-tree kill), else a portable bash watchdog. On expiry
the job moves doing->failed with result=timeout and a TIMED OUT log line.
2026-05-28 22:13:50 -07:00
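As an illustration of the duration grammar (90s|45m|2h|1d), the following sketch converts a frontmatter value to whole seconds, for example to feed a watchdog. `duration_to_seconds` is a hypothetical name; treating a bare number as seconds is an assumption that mirrors coreutils `timeout`, not something the commit states.

```bash
# Hypothetical helper: convert 90s | 45m | 2h | 1d to whole seconds.
# Prints the result; returns 1 on an empty or malformed value.
duration_to_seconds() (
  value=$1
  num=${value%[smhd]}          # digits before the optional unit suffix
  unit=${value#"$num"}         # s, m, h, d, or empty
  case $num in
    ''|*[!0-9]*) exit 1 ;;     # reject empty or non-numeric amounts
  esac
  while [ "${#num}" -gt 1 ] && [ "${num#0}" != "$num" ]; do
    num=${num#0}               # drop leading zeros so 08 is not read as octal
  done
  case $unit in
    ''|s) echo "$num" ;;       # assumption: bare number means seconds
    m) echo $(( num * 60 )) ;;
    h) echo $(( num * 3600 )) ;;
    d) echo $(( num * 86400 )) ;;
  esac
)
```

`duration_to_seconds 45m` prints `2700`, and `duration_to_seconds 4x` fails with status 1.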
saravanakumardb1
f14e6c2336 feat(agent-queue): per-cwd locking so two agents never share a repo
Serialize jobs by lock key (frontmatter 'lock:' override, default cwd) via the
single run-loop's pre-launch eligibility check; the oldest non-busy job is picked
regardless of --max. Adds a flock-based worker guard where flock exists (Linux);
macOS relies on the single-daemon model. Records lock= in job meta.
2026-05-28 22:10:30 -07:00
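The pre-launch eligibility check (pick the oldest job whose lock key is not busy) might look roughly like this. The `job|lockkey` input format and the name `pick_job` are assumptions made for the sketch; the real run loop derives lock keys from job frontmatter and meta files.

```bash
# Hypothetical sketch: read candidates oldest-first on stdin, one
# "job|lockkey" per line, and print the first job whose lock key is not
# among the busy keys given as arguments. Returns 1 if every key is busy.
# Usage: pick_job BUSY_KEY... < candidates
pick_job() (
  while IFS= read -r line; do
    job=${line%%|*}
    key=${line#*|}
    busy=no
    for k in "$@"; do
      [ "$k" = "$key" ] && busy=yes
    done
    if [ "$busy" = no ]; then
      echo "$job"
      exit 0
    fi
  done
  exit 1
)
```

With `/repo/x` busy, a queue of `a.md|/repo/x` followed by `b.md|/repo/y` yields `b.md`: the older job is skipped rather than blocking the queue.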
saravanakumardb1
9b49c28af5 chore(agent-queue): add self-test harness (shellcheck + no-op run cycle) 2026-05-28 22:07:15 -07:00
saravanakumardb1
0c21a6466a feat(aliases): add aq/aqs/aqd agent-queue aliases; scope shell-ci shellcheck
- aliases/_agent.alias: aq, aqs (status), aqd (dash) — path-relative to repo
- register _agent.alias in _source_all.alias loader + document in README
- shell-ci: gate shellcheck on agent-queue.sh; bytelyst-cli.sh shellcheck is
  non-gating (pre-existing legacy SC2199 in check_collaborators), bash -n
  still gates both
2026-05-28 21:52:36 -07:00
saravanakumardb1
9c16a631e2 ci(agent-queue): add Gitea shell-ci workflow (shellcheck + syntax + smoke)
Lints agent-queue.sh + bytelyst-cli.sh (shellcheck --severity=error),
syntax-checks all scripts (bash -n) and the Node dashboard (node --check),
and runs a no-agent smoke test (init/add/drain -> failed/). Gitea runner
labels + node:20-bookworm container, path-filtered to the touched files.
2026-05-28 21:43:22 -07:00
saravanakumardb1
169e944c3c feat(agent-queue): Node live dashboard + bytelyst-cli integration
- dashboard.mjs: zero-dep Node TUI (running workers w/ engine, elapsed,
  cwd, last log line + recent done/failed); 'dash' subcommand execs it
- bytelyst-cli.sh: 'agent-queue' / 'aq' passthrough handled before the
  GITHUB_TOKEN + jq/curl gates; usage + interactive-menu entry
- README: document dash + bytelyst-cli usage
2026-05-28 21:39:25 -07:00
saravanakumardb1
8f725f8587 docs(repo-map): register agent-queue tool directory 2026-05-28 21:35:59 -07:00
saravanakumardb1
179108504f feat(agent-queue): folder-kanban runner for devin/claude/codex CLIs
Add a zero-dependency, bash 3.2-compatible queue runner that executes
prompt .md files through headless coding-agent CLIs in auto-approve mode,
moving them inbox -> doing -> done/failed with per-job logs and live status.

- pluggable engine drivers (devin --prompt-file, claude/codex via stdin)
- per-task YAML frontmatter: engine, cwd, yolo
- subcommands: init, add, run (--max N), status, watch, stop, logs
- runtime queue/ state gitignored
2026-05-28 21:35:59 -07:00
saravanakumardb1
a049e9c602 docs(roadmap): record post-roadmap follow-ups complete (v15)
- docker-lint CI propagated to all 9 remaining consumer repos
- all 10 remaining repos mirrored to Gitea; 9/9 docker-lint jobs green
- Gitea Actions runner hardened (capacity 1->2, env_file token) + documented
- repair corrupted §10 execution-log region from prior rebase
2026-05-28 18:07:36 -07:00
Hermes VM
0e1905aa33 docs: document local LLM utility workflows
Some checks failed
pre-commit / pre-commit (push) Failing after 33s
2026-05-28 00:21:06 +00:00
Hermes VM
44fd6a462a fix: bind DevOps dashboard ports to loopback
Some checks failed
pre-commit / pre-commit (push) Failing after 27s
2026-05-27 21:55:46 +00:00
Hermes VM
f936c2231c docs: record product port hardening
Some checks failed
pre-commit / pre-commit (push) Failing after 25s
2026-05-27 21:53:08 +00:00
Hermes VM
7047d625ef feat(dashboard/vm): Phases 1.1, 1.3, 3.1, 3.4 — VM page panels
Phase 3.1 — VM Score Card (0–100):
- 6 weighted dimensions: steal time, RAM, disk, service health,
  maintenance hygiene, LLM readiness (matching roadmap scoring)
- Color-coded gauge + per-dimension progress bars with detail text
- Computed from health + cron + unhealthy data; degrades gracefully
  when any source is unavailable

Phase 1.3 — Unhealthy Container Detail Panel:
- Loads independently from GET /api/vm/containers/unhealthy
- Per-container: name, unhealthy since, restart count, last health logs
- Expandable row for health check output
- One-click restart with spinner, feedback toast, auto-refresh after 3s

Phase 1.1 — Cron Status Panel:
- Loads from GET /api/vm/cron-status
- Table: 4 managed jobs × schedule | last run | freed MB | status | next
- Collapsible run history (last 10) with step-by-step log expansion

Phase 3.4 — Ollama/LLM Panel:
- Loads from GET /api/vm/ollama/models
- Currently-loaded section with RAM pressure warning (<4 GB free)
- RAM bar visualisation showing model footprint
- Model list with size + last-used time
- One-click unload button

Other improvements:
- All data fetched in parallel (Promise.allSettled) — any panel failure
  does not block the rest of the page
- Add steal, failed_units, cron_missing_paths to CHECK_META/CHECK_ORDER
- Refresh now updates all 5 data sources atomically
- web/package-lock.json regenerated (was stale, caused build failure)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:49:23 +00:00
Hermes VM
b15c570587 docs: record common-platform port hardening
Some checks failed
pre-commit / pre-commit (push) Failing after 37s
2026-05-27 21:32:31 +00:00
Hermes VM
d9618ba7b0 feat(vm): Phases 1.2, 1.4, 2.1 — steal time, swap pressure, health watchdog
Phase 1.2 — CPU steal time metric in vm-health-check.sh:
- Samples /proc/stat twice 1s apart for accurate current steal %
- Thresholds: >5% WARN, >15% CRIT (currently 0.8% on this host)
- Inserts before memory check so steal is visible alongside load
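
The two-sample calculation can be sketched as below. `steal_pct` is a hypothetical name, not the function in vm-health-check.sh. It takes two aggregate `cpu` lines from /proc/stat and relies on the documented field order (user nice system idle iowait irq softirq steal); guest time is left out of the total because the kernel already counts it in user and nice.

```bash
# Hypothetical helper: steal % between two aggregate "cpu" lines.
# Usage: steal_pct "FIRST_CPU_LINE" "SECOND_CPU_LINE"
steal_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    # $1 is the "cpu" label; $2..$9 are user..steal, $9 is steal itself
    { total[NR] = 0; for (i = 2; i <= 9; i++) total[NR] += $i; steal[NR] = $9 }
    END {
      dt = total[2] - total[1]
      if (dt <= 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * (steal[2] - steal[1]) / dt
    }'
}

# Linux-only usage, sampling 1s apart as the commit describes:
#   a=$(grep "^cpu " /proc/stat); sleep 1; b=$(grep "^cpu " /proc/stat)
#   steal_pct "$a" "$b"
```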

Phase 1.4 — Swap pressure indicator:
- Reads SwapCached from /proc/meminfo as secondary metric
- Raises SWAP_USED_WARN_GB 1→1.5 to reduce noise (current usage 0.6G)
- New WARN path: SwapCached > 200MB signals recent pressure even when
  current swap usage looks ok (catches post-spike state)

Phase 2.1 — Docker health-check watchdog:
- docker-health-watchdog.sh: checks unhealthy containers every 10 min,
  restarts only after 3 consecutive failing health checks (30min grace)
- docker-health-watchdog.service + .timer: enabled, fires every 10 min
- Sends Telegram notification on each auto-restart
- Rollback: systemctl disable docker-health-watchdog.timer
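
One way to implement "restart only after 3 consecutive failing checks" is a per-container counter file, sketched below. This is an assumption about the mechanism, not the contents of docker-health-watchdog.sh; `record_health` and the `.fails` state files are hypothetical.

```bash
# Hypothetical sketch: track consecutive unhealthy observations.
# Usage: record_health STATE_DIR CONTAINER healthy|unhealthy [THRESHOLD]
# Prints "restart" once THRESHOLD consecutive unhealthy checks were seen,
# "wait" while still inside the grace period, and "ok" when healthy.
record_health() (
  dir=$1 name=$2 status=$3 threshold=${4:-3}
  file=$dir/$name.fails
  if [ "$status" != unhealthy ]; then
    rm -f "$file"                      # any healthy check resets the streak
    echo ok
    exit 0
  fi
  fails=0
  [ -f "$file" ] && fails=$(cat "$file")
  fails=$(( fails + 1 ))
  if [ "$fails" -ge "$threshold" ]; then
    rm -f "$file"                      # start a fresh streak after acting
    echo restart
  else
    echo "$fails" > "$file"
    echo wait
  fi
)
```

Run from a 10-minute timer, a caller would invoke `docker restart "$name"` only when the function prints `restart`, which matches the 30-minute grace described above.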

Phase 2.2 already complete: sync_hermes_persistent_backup.py handles
divergence gracefully with rebase/reset-hard fallback; running successfully.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:31:09 +00:00
Hermes VM
d60c81ebda docs: record internal port loopback hardening
Some checks failed
pre-commit / pre-commit (push) Failing after 38s
2026-05-27 21:25:38 +00:00
Hermes VM
2fc23d6baa feat(vm): fix devops-backend VM module — Phase 0.1 complete
- Switch backend runner from node:20-alpine to node:20-slim so GNU df
  flags (--output=pcent/avail) work inside the container
- Add volume mounts to docker-compose.yml: scripts (ro), VM logs (rw),
  docker.sock; set VM_SCRIPTS_PATH + VM_LOG_DIR env vars
- Rebuild repository.ts: env-configurable paths, cron history parser,
  unhealthy-container inspector, Ollama model endpoints
- Add routes: GET /api/vm/cron-status, unhealthy containers, Ollama
  models, container restart, model unload
- vm-cleanup.sh: add step_cosmos_pglog, step_docker_aged_images; fix
  (( count++ )) → count=$(( count + 1 )) for set -e compatibility
- Add docs/VM_OBSERVABILITY_ROADMAP.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:13:45 +00:00
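The `(( count++ ))` change in vm-cleanup.sh matters because a bash arithmetic command returns status 1 when its expression evaluates to 0, and a post-increment evaluates to the old value. With `count=0` under `set -e`, the first increment would therefore abort the script. A minimal illustration (the variable name is taken from the commit; the surrounding lines are a sketch):

```bash
set -e
count=0
# (( count++ ))          # bash-only: evaluates to the old value 0, so the
#                        # command returns status 1 and set -e aborts here
count=$(( count + 1 ))   # plain assignment: status 0 regardless of the value
echo "count=$count"      # prints count=1
set +e
```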
Hermes VM
5a2d92f519 docs: record VM container health fix
Some checks failed
pre-commit / pre-commit (push) Failing after 33s
2026-05-27 21:12:45 +00:00
e2db92f3b1 Add Hermes snapshot diff view 2026-05-27 21:05:57 +00:00
8f522e3505 Add Hermes dashboard improvement backlog 2026-05-27 21:02:23 +00:00
Hermes VM
9210a8890f feat: detect stale VM automation
Some checks failed
pre-commit / pre-commit (push) Failing after 32s
2026-05-27 21:00:43 +00:00
Hermes VM
3d5f369f3d docs: record Gitea runner recovery
Some checks failed
pre-commit / pre-commit (push) Failing after 40s
2026-05-27 20:58:16 +00:00
Hermes VM
1f2eea8268 docs: record VM backup and cron fixes
Some checks failed
pre-commit / pre-commit (push) Has been cancelled
2026-05-27 20:56:11 +00:00
90f6db2014 Complete Hermes ops dashboard and roadmap 2026-05-27 20:53:58 +00:00
Hermes VM
e3d1dddf51 docs: add VM exposure inventory
Some checks are pending
pre-commit / pre-commit (push) Waiting to run
2026-05-27 20:51:27 +00:00
98a7915a38 Reconcile Hermes roadmap and dashboard status 2026-05-27 20:46:16 +00:00
ac79591903 Mark web search tooling complete 2026-05-27 20:46:16 +00:00
Hermes VM
313a775fa0 docs: strengthen VM security roadmap gates
Some checks are pending
pre-commit / pre-commit (push) Waiting to run
2026-05-27 20:34:37 +00:00
Hermes VM
2c125adb05 docs: add VM security blind spots roadmap
Some checks are pending
pre-commit / pre-commit (push) Waiting to run
2026-05-27 20:21:52 +00:00
c89018ae47 Tighten Telegram fallback wording 2026-05-27 20:18:46 +00:00
8145484136 Verify Telegram fallback platform context 2026-05-27 20:16:30 +00:00
8da66497cc Tighten Hermes local fallback chain 2026-05-27 19:58:09 +00:00
3e26f0da31 Close Hermes browser and web backend items 2026-05-27 19:23:55 +00:00