bytelyst-devops-tools

Author	SHA1	Message	Date
saravanakumardb1	67d8aa5766	docs(agent-queue): add work hierarchy & composite delegation (roadmap/epic) New §24 + feature-catalog row: - two delegation modes: atomic (leaf bug/feature/task) vs composite (roadmap/epic) - introduce job kind (leaf|composite); composite routes to a planner/orchestrator that fans out child leaf jobs as a DAG across factories/agents/profiles - parentId hierarchy + rollup semantics (status/budget/verify/phase-gates) + idempotent re-run (skip shipped children) - source-of-truth/sync discipline (one record referenced by many; one-way echo) - HYBRID decision recorded: model kind/parentId/rollup in the fleet layer now, keep shared tracker ITEM_TYPES unchanged (label kind:roadmap), promote to a first-class epic type later via additive migration once proven - phasing: leaf-only P1-P2; manual composite P3; auto-decomposition planner P3->P5	2026-05-29 18:02:10 -07:00
saravanakumardb1	a9c69b1dce	docs(agent-queue): manifest field table (active vs reserved) + tick Phase 1 Slice 1 (P1-S1) - README: new "Manifest fields (Gigafactory Phase 1)" table marking ACTIVE vs RESERVED, capability-grammar table, idempotency-key semantics, copilot engine mapping, COPILOT_BIN, and capability_mismatch/no_engine result values. - GIGAFACTORY_ROADMAP: tick only the fully-completed P1 boxes (frontmatter parsing, capability detect+match, priority, backward-compat, capability grammar, engine-class taxonomy, idempotency-key semantics, README/progress), annotate partials, and bump §0 Phase 1 to in-progress 35%.	2026-05-29 17:44:37 -07:00
saravanakumardb1	4600a41e5d	test(agent-queue): self-test cases for manifest/priority/capabilities/engine-class/idempotency (P1-S1) Adds (never weakens existing) cases, each in its own temp AGENT_QUEUE_ROOT using the no-op engine stub: - backward-compat: legacy engine/cwd/yolo-only .md still lands in review/. - priority: with --max 1, a critical job queued after a low job runs first (order-recording stub). - capability mismatch: has:definitely-not-installed -> failed/ result=capability_mismatch, asserting the agent was never launched. - engine-class: agentic-coder + no engine, DEVIN_BIN stubbed -> review/. - idempotency: same key+body twice -> 1 inbox file; same key+changed body in inbox -> superseded; same key+different body after drain -> rejected. Inbox counts use find (not a globbing ls) so set -e/pipefail tolerate an empty inbox.	2026-05-29 17:44:27 -07:00
saravanakumardb1	0be5b34123	feat(agent-queue): evolved manifest, priority, capabilities, engine-class, idempotency (P1-S1) Implements Gigafactory Phase 1 - Slice 1 in the bash runner (backward-compatible; a legacy engine/cwd/yolo-only .md behaves exactly as before): - Parse all new §5 manifest keys via fm_get with safe defaults; record them in <job>.meta and surface priority/profile/capabilities/tracker-item in `status`. Only priority, capabilities, engine-class and idempotency-key are functional this slice; the rest (profile, prefers, budget, deps, deps-mode, retry, review-policy, artifacts, tracker-item) are stored but inert. - priority ordering: inbox_sorted picks critical>high>medium>low, ties by oldest; per-lock serialization preserved. - capability grammar + match: detect_capabilities advertises os/engine/node/has tokens; caps_match honors key, key:value, key<op>version and os:any. A job whose declared capabilities the host cannot satisfy is moved to failed/ with result=capability_mismatch and the agent is never launched. - engine-class resolution: explicit engine wins; else engine-class picks the first available engine honoring prefers-engine (agentic-coder->devin,claude,codex; chat-coder->copilot). No available engine -> result=no_engine. Adds copilot to the engine driver + COPILOT_BIN. - idempotency-key dedupe on add: same key+body -> no-op; same key+different body supersedes an inbox prior, else is rejected with a clear error. No change to queue/ data or the run/ship lifecycle. macOS + Linux safe.	2026-05-29 17:44:19 -07:00
saravanakumardb1	3ad9500623	docs(agent-queue): harden gigafactory roadmap after principal review Fix correctness/distributed-systems bugs and fill gaps in place: - atomic claim (optimistic concurrency/_etag), fencing token (leaseEpoch), coordinator-authoritative time added to core contract + scheduler + factory - lease reclaim via coordinator reaper, not Cosmos TTL (TTL only GCs rows) - split-brain/partition safety: fencing + distributed lock + quarantine - budget: wall is the only hard ceiling; usd/tokens best-effort (provider metering) - SSE live logs cannot use the buffering tracker proxy; use a streaming route + blob log storage (fleet_artifacts container) - manifest: capability grammar, engine-class enum, idempotency 409 + deps-satisfied semantics, dep cycle detection - tracker status mapping table + PR-flow ship semantics (merged+green vs pr-opened) - station/seat capacity, factory health definition, enrollment/bootstrap auth - Cosmos RU/indexing + claim-loop poll cost; add new sections: rollout/rollback & data migration (§21), capacity planning & cost (§22), ownership & RACI (§23) - success metrics now carry provisional SLO targets; Phase 2 checklist + index synced	2026-05-29 17:15:28 -07:00
saravanakumardb1	90366e59bb	docs(agent-queue): add gigafactory vision + checklist implementation roadmap - docs/GIGAFACTORY_ROADMAP.md: distributed multi-machine fleet vision (factory x tool x profile routing) as a checklist-driven, phased implementation roadmap (Phase 0-5) with acceptance criteria, verify gates, and a 100% Definition-of-Done rubric - committed path: coordinator as a platform-service module + control plane on tracker-web, reached via a thin tracker adapter first; bash runner survives as the offline edge factory agent - README: add vision/roadmap pointer	2026-05-29 17:06:32 -07:00
saravanakumardb1	7877e64f90	chore(cli): make bytelyst-cli.sh executable	2026-05-29 16:42:39 -07:00
saravanakumardb1	dde677f4b9	feat(agent-queue): interactive dashboard — navigable job list + single-key actions Turn dash into a menu-driven control panel (single mjs script): - numbered, arrow/j-k/1-9 selectable JOBS list (review/testing/failed/inbox) - single-key actions wired to agent-queue.sh (single source of truth): p promote, s ship, x reject, u requeue (reject/requeue confirm y/n) - enter/l opens a live log viewer; r starts a detached run loop, S stops it - run-loop pid indicator, transient action flashes, ? help overlay - non-TTY falls back to the read-only live view - README: dash command + interactive key table	2026-05-29 16:19:23 -07:00
saravanakumardb1	4ed4d75a67	feat(agent-queue): default max concurrency 2->3 (still env/flag configurable) - AGENT_QUEUE_MAX default 3 (override via env or run --max N) - sync README quick-start + env table + bytelyst-cli example to --max 3	2026-05-29 16:09:12 -07:00
Saravanakumar D	58773ac108	feat(devops): add interactive WSL CLI installer script Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-29 16:05:01 -07:00
saravanakumardb1	af1bc6904e	feat(agent-queue): build/ship lifecycle with auto-QA verify gate + manual ship Redesign the kanban runner stages from inbox->doing->done/failed to inbox->building->review->testing->shipped (+ failed): - worker: agent rc=0 lands in review/, then runs the configurable verify command (frontmatter verify: / AGENT_QUEUE_VERIFY) in cwd; pass -> testing/ (QA), fail -> failed/, none -> parks in review/ - new commands: ship (testing->shipped, manual gate), promote (advance one stage), reject (review/testing->failed); requeue now also pulls from review/testing - status + dashboard.mjs render all six stages; RECENT panel labels shipped/testing/review/verify_failed/timeout/rejected - README: new lifecycle diagram, verify: frontmatter, result= glossary, command table + folder layout - selftest: assert no-verify->review, verify-pass->testing->ship->shipped, verify-fail->failed - rename queue/doing->building, queue/done->review; add testing/ shipped/	2026-05-29 16:03:01 -07:00
saravanakumardb1	27feba36fa	fix(agent-queue): status used undefined live_workers; call active_workers	2026-05-29 15:27:15 -07:00
saravanakumardb1	2f6aea07e0	chore(agent-queue): track queue/ in repo + seed inbox with nomgap/localmemgpt/devintelli jobs	2026-05-29 15:10:00 -07:00
saravanakumardb1	c52c165fd6	docs(agent-queue): document locking, timeout, stall, requeue/clean Update README command table (requeue/clean, stall marker, single-run note), frontmatter (lock/timeout), engine mapping (stdin), config (STALL_MIN, FLOCK_BIN/TIMEOUT_BIN), folder layout (locks/.archive), Safety (automatic same-repo serialization + portability notes), and mark roadmap items done.	2026-05-28 22:33:20 -07:00
saravanakumardb1	1f15520c4f	feat(agent-queue): add requeue and clean commands - requeue <job>: move a failed job back to inbox/ and drop stale meta/body so it re-runs cleanly - clean [--keep N]: archive finished jobs' logs+meta beyond the newest N (default 50) into queue/.archive/<ts>/; running jobs + .md records untouched - document both in usage + bytelyst-cli subcommand list	2026-05-28 22:31:56 -07:00
saravanakumardb1	76104bda84	fix(cli): harden bytelyst-cli env loading, pagination, and HTTP checks - .env via 'set -a; . ./.env; set +a' (handles quoted values/spaces safely) - printf for the GITHUB_TOKEN message so the newline renders - gh_get_all: paginate all pages (per_page=100) and verify HTTP 200 before jq; rewire list-public/list-private/check-collaborators through it - fix SC2199 whitelist membership (explicit loop, no substring false-matches) - shell-ci: gate shellcheck on bytelyst-cli + run agent-queue self-test	2026-05-28 22:30:08 -07:00
saravanakumardb1	4239648876	fix(agent-queue): verify pid start time to defeat pid reuse Record pidstart (ps lstart) at launch and verify it in all liveness checks (_meta_active, status, stop) via _pid_alive, so a recycled pid can never be mistaken for our worker. Falls back to plain liveness when no start time recorded.	2026-05-28 22:24:50 -07:00
saravanakumardb1	a849a30e11	feat(agent-queue): refuse a second run when a daemon is already active cmd_run now checks daemon.pid liveness up front: if a run loop is alive it exits with an error (protecting the single-launcher invariant locking depends on); a stale daemon.pid (dead pid) is cleared and the run proceeds.	2026-05-28 22:21:31 -07:00
saravanakumardb1	11935d0539	fix(agent-queue): reserve concurrency slot before backgrounding worker Replace live_workers with reservation-aware active_workers + shared _meta_active: a job counts toward --max the moment its meta is written (before the worker is backgrounded), so --max can never be exceeded. A <30s guard prevents a meta orphaned mid-launch from pinning a slot. busy_keys now shares _meta_active.	2026-05-28 22:17:36 -07:00
saravanakumardb1	79331d591f	feat(agent-queue): flag stalled workers in status + dash Mark a running worker '⚠ stalled' when its log has not changed for more than AGENT_QUEUE_STALL_MIN minutes (default 10), using log mtime as the freshness signal. Implemented in both the bash status table and the Node dashboard.	2026-05-28 22:15:26 -07:00
saravanakumardb1	3b71f0117a	feat(agent-queue): per-job timeout via frontmatter timeout: Honor 'timeout: 45m' (90s|45m|2h|1d) by wrapping the agent in timeout/gtimeout when available (hard process-tree kill), else a portable bash watchdog. On expiry the job moves doing->failed with result=timeout and a TIMED OUT log line.	2026-05-28 22:13:50 -07:00
saravanakumardb1	f14e6c2336	feat(agent-queue): per-cwd locking so two agents never share a repo Serialize jobs by lock key (frontmatter 'lock:' override, default cwd) via the single run-loop's pre-launch eligibility check; the oldest non-busy job is picked regardless of --max. Adds a flock-based worker guard where flock exists (Linux); macOS relies on the single-daemon model. Records lock= in job meta.	2026-05-28 22:10:30 -07:00
saravanakumardb1	9b49c28af5	chore(agent-queue): add self-test harness (shellcheck + no-op run cycle)	2026-05-28 22:07:15 -07:00
saravanakumardb1	0c21a6466a	feat(aliases): add aq/aqs/aqd agent-queue aliases; scope shell-ci shellcheck - aliases/_agent.alias: aq, aqs (status), aqd (dash) — path-relative to repo - register _agent.alias in _source_all.alias loader + document in README - shell-ci: gate shellcheck on agent-queue.sh; bytelyst-cli.sh shellcheck is non-gating (pre-existing legacy SC2199 in check_collaborators), bash -n still gates both	2026-05-28 21:52:36 -07:00
saravanakumardb1	9c16a631e2	ci(agent-queue): add Gitea shell-ci workflow (shellcheck + syntax + smoke) Lints agent-queue.sh + bytelyst-cli.sh (shellcheck --severity=error), syntax-checks all scripts (bash -n) and the Node dashboard (node --check), and runs a no-agent smoke test (init/add/drain -> failed/). Gitea runner labels + node:20-bookworm container, path-filtered to the touched files.	2026-05-28 21:43:22 -07:00
saravanakumardb1	169e944c3c	feat(agent-queue): Node live dashboard + bytelyst-cli integration - dashboard.mjs: zero-dep Node TUI (running workers w/ engine, elapsed, cwd, last log line + recent done/failed); 'dash' subcommand execs it - bytelyst-cli.sh: 'agent-queue' / 'aq' passthrough handled before the GITHUB_TOKEN + jq/curl gates; usage + interactive-menu entry - README: document dash + bytelyst-cli usage	2026-05-28 21:39:25 -07:00
saravanakumardb1	8f725f8587	docs(repo-map): register agent-queue tool directory	2026-05-28 21:35:59 -07:00
saravanakumardb1	179108504f	feat(agent-queue): folder-kanban runner for devin/claude/codex CLIs Add a zero-dependency, bash 3.2-compatible queue runner that executes prompt .md files through headless coding-agent CLIs in auto-approve mode, moving them inbox -> doing -> done/failed with per-job logs and live status. - pluggable engine drivers (devin --prompt-file, claude/codex via stdin) - per-task YAML frontmatter: engine, cwd, yolo - subcommands: init, add, run (--max N), status, watch, stop, logs - runtime queue/ state gitignored	2026-05-28 21:35:59 -07:00
saravanakumardb1	a049e9c602	docs(roadmap): record post-roadmap follow-ups complete (v15) - docker-lint CI propagated to all 9 remaining consumer repos - all 10 remaining repos mirrored to Gitea; 9/9 docker-lint jobs green - Gitea Actions runner hardened (capacity 1->2, env_file token) + documented - repair corrupted §10 execution-log region from prior rebase	2026-05-28 18:07:36 -07:00
Hermes VM	0e1905aa33	docs: document local LLM utility workflows	2026-05-28 00:21:06 +00:00
Hermes VM	44fd6a462a	fix: bind DevOps dashboard ports to loopback	2026-05-27 21:55:46 +00:00
Hermes VM	f936c2231c	docs: record product port hardening	2026-05-27 21:53:08 +00:00
Hermes VM	7047d625ef	feat(dashboard/vm): Phases 1.1, 1.3, 3.1, 3.4 — VM page panels Phase 3.1 — VM Score Card (0–100): - 6 weighted dimensions: steal time, RAM, disk, service health, maintenance hygiene, LLM readiness (matching roadmap scoring) - Color-coded gauge + per-dimension progress bars with detail text - Computed from health + cron + unhealthy data; degrades gracefully when any source is unavailable Phase 1.3 — Unhealthy Container Detail Panel: - Loads independently from GET /api/vm/containers/unhealthy - Per-container: name, unhealthy since, restart count, last health logs - Expandable row for health check output - One-click restart with spinner, feedback toast, auto-refresh after 3s Phase 1.1 — Cron Status Panel: - Loads from GET /api/vm/cron-status - Table: 4 managed jobs × schedule | last run | freed MB | status | next - Collapsible run history (last 10) with step-by-step log expansion Phase 3.4 — Ollama/LLM Panel: - Loads from GET /api/vm/ollama/models - Currently-loaded section with RAM pressure warning (<4 GB free) - RAM bar visualisation showing model footprint - Model list with size + last-used time - One-click unload button Other improvements: - All data fetched in parallel (Promise.allSettled) — any panel failure does not block the rest of the page - Add steal, failed_units, cron_missing_paths to CHECK_META/CHECK_ORDER - Refresh now updates all 5 data sources atomically - web/package-lock.json regenerated (was stale, caused build failure) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 21:49:23 +00:00
Hermes VM	b15c570587	docs: record common-platform port hardening	2026-05-27 21:32:31 +00:00
Hermes VM	d9618ba7b0	feat(vm): Phases 1.2, 1.4, 2.1 — steal time, swap pressure, health watchdog Phase 1.2 — CPU steal time metric in vm-health-check.sh: - Samples /proc/stat twice 1s apart for accurate current steal % - Thresholds: >5% WARN, >15% CRIT (currently 0.8% on this host) - Inserts before memory check so steal is visible alongside load Phase 1.4 — Swap pressure indicator: - Reads SwapCached from /proc/meminfo as secondary metric - Raises SWAP_USED_WARN_GB 1→1.5 to reduce noise (current usage 0.6G) - New WARN path: SwapCached > 200MB signals recent pressure even when current swap usage looks ok (catches post-spike state) Phase 2.1 — Docker health-check watchdog: - docker-health-watchdog.sh: checks unhealthy containers every 10 min, restarts only after 3 consecutive failing health checks (30min grace) - docker-health-watchdog.service + .timer: enabled, fires every 10 min - Sends Telegram notification on each auto-restart - Rollback: systemctl disable docker-health-watchdog.timer Phase 2.2 already complete: sync_hermes_persistent_backup.py handles diverge gracefully with rebase/reset-hard fallback; running successfully. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 21:31:09 +00:00
Hermes VM	d60c81ebda	docs: record internal port loopback hardening	2026-05-27 21:25:38 +00:00
Hermes VM	2fc23d6baa	feat(vm): fix devops-backend VM module — Phase 0.1 complete - Switch backend runner from node:20-alpine to node:20-slim so GNU df flags (--output=pcent/avail) work inside the container - Add volume mounts to docker-compose.yml: scripts (ro), VM logs (rw), docker.sock; set VM_SCRIPTS_PATH + VM_LOG_DIR env vars - Rebuild repository.ts: env-configurable paths, cron history parser, unhealthy-container inspector, Ollama model endpoints - Add routes: GET /api/vm/cron-status, unhealthy containers, Ollama models, container restart, model unload - vm-cleanup.sh: add step_cosmos_pglog, step_docker_aged_images; fix (( count++ )) → count=$(( count + 1 )) for set -e compatibility - Add docs/VM_OBSERVABILITY_ROADMAP.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 21:13:45 +00:00
Hermes VM	5a2d92f519	docs: record VM container health fix	2026-05-27 21:12:45 +00:00
Saravana Kumar	e2db92f3b1	Add Hermes snapshot diff view	2026-05-27 21:05:57 +00:00
Saravana Kumar	8f522e3505	Add Hermes dashboard improvement backlog	2026-05-27 21:02:23 +00:00
Hermes VM	9210a8890f	feat: detect stale VM automation	2026-05-27 21:00:43 +00:00
Hermes VM	3d5f369f3d	docs: record Gitea runner recovery	2026-05-27 20:58:16 +00:00
Hermes VM	1f2eea8268	docs: record VM backup and cron fixes	2026-05-27 20:56:11 +00:00
Saravana Kumar	90f6db2014	Complete Hermes ops dashboard and roadmap	2026-05-27 20:53:58 +00:00
Hermes VM	e3d1dddf51	docs: add VM exposure inventory	2026-05-27 20:51:27 +00:00
Saravana Kumar	98a7915a38	Reconcile Hermes roadmap and dashboard status	2026-05-27 20:46:16 +00:00
Saravana Kumar	ac79591903	Mark web search tooling complete	2026-05-27 20:46:16 +00:00
Hermes VM	313a775fa0	docs: strengthen VM security roadmap gates	2026-05-27 20:34:37 +00:00
Hermes VM	2c125adb05	docs: add VM security blind spots roadmap	2026-05-27 20:21:52 +00:00
Saravana Kumar	c89018ae47	Tighten Telegram fallback wording	2026-05-27 20:18:46 +00:00

296 Commits
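
Two behaviours described in the commit messages above lend themselves to a short illustration: the inbox priority ordering from 0be5b34123 (critical>high>medium>low, ties by oldest) and the `timeout:` duration format from 3b71f0117a (90s|45m|2h|1d). The sketch below is in Python for readability and is not the repository's bash implementation. The `Job` class, its `queued_at` field, the `parse_timeout` helper, and the choice to treat an unknown priority as `medium` are assumptions made for this example; only the name `inbox_sorted`, the rank order, and the unit suffixes come from the log.

```python
from dataclasses import dataclass

# Rank order as stated in the commit message: critical > high > medium > low.
PRIORITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

# Unit suffixes listed in the commit message: s, m, h, d.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


@dataclass
class Job:
    """Hypothetical stand-in for one inbox entry."""
    name: str
    priority: str
    queued_at: float  # assumed: time the job entered the inbox (smaller = older)


def inbox_sorted(jobs):
    """Order jobs by priority rank, breaking ties by oldest first."""
    # Assumption: an unrecognised priority is treated as "medium".
    default_rank = PRIORITY_RANK["medium"]
    return sorted(
        jobs,
        key=lambda j: (PRIORITY_RANK.get(j.priority, default_rank), j.queued_at),
    )


def parse_timeout(text):
    """Convert a duration such as '90s', '45m', '2h' or '1d' to seconds."""
    text = text.strip()
    number, unit = text[:-1], text[-1:]
    if unit not in UNIT_SECONDS or not (number.isascii() and number.isdigit()):
        raise ValueError(f"unsupported timeout value: {text!r}")
    return int(number) * UNIT_SECONDS[unit]


if __name__ == "__main__":
    queue = [
        Job("cleanup", "low", 1.0),
        Job("hotfix", "critical", 3.0),
        Job("feature", "high", 2.0),
    ]
    print([job.name for job in inbox_sorted(queue)])
    print(parse_timeout("45m"))
```

With a concurrency limit of 1, this ordering gives the behaviour the self-test commit 4600a41e5d asserts: a critical job queued after a low job is picked first.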