Commit Graph

279 Commits

Author SHA1 Message Date
saravanakumardb1
a849a30e11 feat(agent-queue): refuse a second run when a daemon is already active
cmd_run now checks daemon.pid liveness up front: if a run loop is alive it exits
with an error (protecting the single-launcher invariant locking depends on); a
stale daemon.pid (dead pid) is cleared and the run proceeds.
2026-05-28 22:21:31 -07:00
saravanakumardb1
11935d0539 fix(agent-queue): reserve concurrency slot before backgrounding worker
Replace live_workers with reservation-aware active_workers + shared _meta_active:
a job counts toward --max the moment its meta is written (before the worker is
backgrounded), so --max can never be exceeded. A <30s guard prevents a meta
orphaned mid-launch from pinning a slot. busy_keys now shares _meta_active.
2026-05-28 22:17:36 -07:00
saravanakumardb1
79331d591f feat(agent-queue): flag stalled workers in status + dash
Mark a running worker '⚠ stalled' when its log has not changed for more than
AGENT_QUEUE_STALL_MIN minutes (default 10), using log mtime as the freshness
signal. Implemented in both the bash status table and the Node dashboard.
2026-05-28 22:15:26 -07:00
saravanakumardb1
3b71f0117a feat(agent-queue): per-job timeout via frontmatter timeout:
Honor 'timeout: 45m' (90s|45m|2h|1d) by wrapping the agent in timeout/gtimeout
when available (hard process-tree kill), else a portable bash watchdog. On expiry
the job moves doing->failed with result=timeout and a TIMED OUT log line.
2026-05-28 22:13:50 -07:00
saravanakumardb1
f14e6c2336 feat(agent-queue): per-cwd locking so two agents never share a repo
Serialize jobs by lock key (frontmatter 'lock:' override, default cwd) via the
single run-loop's pre-launch eligibility check; the oldest non-busy job is picked
regardless of --max. Adds a flock-based worker guard where flock exists (Linux);
macOS relies on the single-daemon model. Records lock= in job meta.
2026-05-28 22:10:30 -07:00
saravanakumardb1
9b49c28af5 chore(agent-queue): add self-test harness (shellcheck + no-op run cycle) 2026-05-28 22:07:15 -07:00
saravanakumardb1
0c21a6466a feat(aliases): add aq/aqs/aqd agent-queue aliases; scope shell-ci shellcheck
- aliases/_agent.alias: aq, aqs (status), aqd (dash) — path-relative to repo
- register _agent.alias in _source_all.alias loader + document in README
- shell-ci: gate shellcheck on agent-queue.sh; bytelyst-cli.sh shellcheck is
  non-gating (pre-existing legacy SC2199 in check_collaborators), bash -n
  still gates both
2026-05-28 21:52:36 -07:00
saravanakumardb1
9c16a631e2 ci(agent-queue): add Gitea shell-ci workflow (shellcheck + syntax + smoke)
Lints agent-queue.sh + bytelyst-cli.sh (shellcheck --severity=error),
syntax-checks all scripts (bash -n) and the Node dashboard (node --check),
and runs a no-agent smoke test (init/add/drain -> failed/). Gitea runner
labels + node:20-bookworm container, path-filtered to the touched files.
2026-05-28 21:43:22 -07:00
saravanakumardb1
169e944c3c feat(agent-queue): Node live dashboard + bytelyst-cli integration
- dashboard.mjs: zero-dep Node TUI (running workers w/ engine, elapsed,
  cwd, last log line + recent done/failed); 'dash' subcommand execs it
- bytelyst-cli.sh: 'agent-queue' / 'aq' passthrough handled before the
  GITHUB_TOKEN + jq/curl gates; usage + interactive-menu entry
- README: document dash + bytelyst-cli usage
2026-05-28 21:39:25 -07:00
saravanakumardb1
8f725f8587 docs(repo-map): register agent-queue tool directory 2026-05-28 21:35:59 -07:00
saravanakumardb1
179108504f feat(agent-queue): folder-kanban runner for devin/claude/codex CLIs
Add a zero-dependency, bash 3.2-compatible queue runner that executes
prompt .md files through headless coding-agent CLIs in auto-approve mode,
moving them inbox -> doing -> done/failed with per-job logs and live status.

- pluggable engine drivers (devin --prompt-file, claude/codex via stdin)
- per-task YAML frontmatter: engine, cwd, yolo
- subcommands: init, add, run (--max N), status, watch, stop, logs
- runtime queue/ state gitignored
2026-05-28 21:35:59 -07:00
saravanakumardb1
a049e9c602 docs(roadmap): record post-roadmap follow-ups complete (v15)
- docker-lint CI propagated to all 9 remaining consumer repos
- all 10 remaining repos mirrored to Gitea; 9/9 docker-lint jobs green
- Gitea Actions runner hardened (capacity 1->2, env_file token) + documented
- repair corrupted §10 execution-log region from prior rebase
2026-05-28 18:07:36 -07:00
Hermes VM
0e1905aa33 docs: document local LLM utility workflows
Some checks failed
pre-commit / pre-commit (push) Failing after 33s
2026-05-28 00:21:06 +00:00
Hermes VM
44fd6a462a fix: bind DevOps dashboard ports to loopback
Some checks failed
pre-commit / pre-commit (push) Failing after 27s
2026-05-27 21:55:46 +00:00
Hermes VM
f936c2231c docs: record product port hardening
Some checks failed
pre-commit / pre-commit (push) Failing after 25s
2026-05-27 21:53:08 +00:00
Hermes VM
7047d625ef feat(dashboard/vm): Phases 1.1, 1.3, 3.1, 3.4 — VM page panels
Phase 3.1 — VM Score Card (0–100):
- 6 weighted dimensions: steal time, RAM, disk, service health,
  maintenance hygiene, LLM readiness (matching roadmap scoring)
- Color-coded gauge + per-dimension progress bars with detail text
- Computed from health + cron + unhealthy data; degrades gracefully
  when any source is unavailable

Phase 1.3 — Unhealthy Container Detail Panel:
- Loads independently from GET /api/vm/containers/unhealthy
- Per-container: name, unhealthy since, restart count, last health logs
- Expandable row for health check output
- One-click restart with spinner, feedback toast, auto-refresh after 3s

Phase 1.1 — Cron Status Panel:
- Loads from GET /api/vm/cron-status
- Table: 4 managed jobs × schedule | last run | freed MB | status | next
- Collapsible run history (last 10) with step-by-step log expansion

Phase 3.4 — Ollama/LLM Panel:
- Loads from GET /api/vm/ollama/models
- Currently-loaded section with RAM pressure warning (<4 GB free)
- RAM bar visualisation showing model footprint
- Model list with size + last-used time
- One-click unload button

Other improvements:
- All data fetched in parallel (Promise.allSettled) — any panel failure
  does not block the rest of the page
- Add steal, failed_units, cron_missing_paths to CHECK_META/CHECK_ORDER
- Refresh now updates all 5 data sources atomically
- web/package-lock.json regenerated (was stale, caused build failure)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:49:23 +00:00
Hermes VM
b15c570587 docs: record common-platform port hardening
Some checks failed
pre-commit / pre-commit (push) Failing after 37s
2026-05-27 21:32:31 +00:00
Hermes VM
d9618ba7b0 feat(vm): Phases 1.2, 1.4, 2.1 — steal time, swap pressure, health watchdog
Phase 1.2 — CPU steal time metric in vm-health-check.sh:
- Samples /proc/stat twice 1s apart for accurate current steal %
- Thresholds: >5% WARN, >15% CRIT (currently 0.8% on this host)
- Inserts before memory check so steal is visible alongside load

Phase 1.4 — Swap pressure indicator:
- Reads SwapCached from /proc/meminfo as secondary metric
- Raises SWAP_USED_WARN_GB 1→1.5 to reduce noise (current usage 0.6G)
- New WARN path: SwapCached > 200MB signals recent pressure even when
  current swap usage looks ok (catches post-spike state)

Phase 2.1 — Docker health-check watchdog:
- docker-health-watchdog.sh: checks unhealthy containers every 10 min,
  restarts only after 3 consecutive failing health checks (30min grace)
- docker-health-watchdog.service + .timer: enabled, fires every 10 min
- Sends Telegram notification on each auto-restart
- Rollback: systemctl disable docker-health-watchdog.timer

Phase 2.2 already complete: sync_hermes_persistent_backup.py handles
diverge gracefully with rebase/reset-hard fallback; running successfully.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:31:09 +00:00
Hermes VM
d60c81ebda docs: record internal port loopback hardening
Some checks failed
pre-commit / pre-commit (push) Failing after 38s
2026-05-27 21:25:38 +00:00
Hermes VM
2fc23d6baa feat(vm): fix devops-backend VM module — Phase 0.1 complete
- Switch backend runner from node:20-alpine to node:20-slim so GNU df
  flags (--output=pcent/avail) work inside the container
- Add volume mounts to docker-compose.yml: scripts (ro), VM logs (rw),
  docker.sock; set VM_SCRIPTS_PATH + VM_LOG_DIR env vars
- Rebuild repository.ts: env-configurable paths, cron history parser,
  unhealthy-container inspector, Ollama model endpoints
- Add routes: GET /api/vm/cron-status, unhealthy containers, Ollama
  models, container restart, model unload
- vm-cleanup.sh: add step_cosmos_pglog, step_docker_aged_images; fix
  (( count++ )) → count=$(( count + 1 )) for set -e compatibility
- Add docs/VM_OBSERVABILITY_ROADMAP.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:13:45 +00:00
Hermes VM
5a2d92f519 docs: record VM container health fix
Some checks failed
pre-commit / pre-commit (push) Failing after 33s
2026-05-27 21:12:45 +00:00
e2db92f3b1 Add Hermes snapshot diff view 2026-05-27 21:05:57 +00:00
8f522e3505 Add Hermes dashboard improvement backlog 2026-05-27 21:02:23 +00:00
Hermes VM
9210a8890f feat: detect stale VM automation
Some checks failed
pre-commit / pre-commit (push) Failing after 32s
2026-05-27 21:00:43 +00:00
Hermes VM
3d5f369f3d docs: record Gitea runner recovery
Some checks failed
pre-commit / pre-commit (push) Failing after 40s
2026-05-27 20:58:16 +00:00
Hermes VM
1f2eea8268 docs: record VM backup and cron fixes
Some checks failed
pre-commit / pre-commit (push) Has been cancelled
2026-05-27 20:56:11 +00:00
90f6db2014 Complete Hermes ops dashboard and roadmap 2026-05-27 20:53:58 +00:00
Hermes VM
e3d1dddf51 docs: add VM exposure inventory
Some checks are pending
pre-commit / pre-commit (push) Waiting to run
2026-05-27 20:51:27 +00:00
98a7915a38 Reconcile Hermes roadmap and dashboard status 2026-05-27 20:46:16 +00:00
ac79591903 Mark web search tooling complete 2026-05-27 20:46:16 +00:00
Hermes VM
313a775fa0 docs: strengthen VM security roadmap gates
Some checks are pending
pre-commit / pre-commit (push) Waiting to run
2026-05-27 20:34:37 +00:00
Hermes VM
2c125adb05 docs: add VM security blind spots roadmap
Some checks are pending
pre-commit / pre-commit (push) Waiting to run
2026-05-27 20:21:52 +00:00
c89018ae47 Tighten Telegram fallback wording 2026-05-27 20:18:46 +00:00
8145484136 Verify Telegram fallback platform context 2026-05-27 20:16:30 +00:00
8da66497cc Tighten Hermes local fallback chain 2026-05-27 19:58:09 +00:00
3e26f0da31 Close Hermes browser and web backend items 2026-05-27 19:23:55 +00:00
root
d1f234fc01 Mark Firecrawl as locally configured 2026-05-27 18:57:50 +00:00
Hermes VM
70d96d7684 feat: add gitea backup timer assets 2026-05-27 18:53:20 +00:00
Hermes VM
147db72330 docs: add hostinger maintenance operations entry 2026-05-27 18:53:20 +00:00
Hermes VM
31b414d62b fix: systematic bug fixes — code-quality parser, env key, config warnings, auth cleanup, deployment safety
- code-quality/repository.ts: fix tsErrorMatch[3] → [4] for type field (group 3 is column, 4 is error|warning)
- code-quality/repository.ts: fix ESLint regex to make rule brackets optional (not all formatters include them)
- code-quality/repository.ts: fix Vitest test count — parse 'Tests' line (individual tests) instead of 'Test Files' (file count); improve Jest regex to capture pass/fail independently
- env/repository.ts: replace raw process.env.ENCRYPTION_KEY with config.ENCRYPTION_KEY so the validated default flows through a single source of truth
- config.ts: add startup console.warn when CSRF_SECRET or ENCRYPTION_KEY are using insecure defaults
- deployments/orchestrator.ts: refactor runDeploymentScript to use try/catch/finally — deployment record is now always written in the finally block, preventing zombie 'running' states if updateDeployment itself throws
- auth.tsx: remove dead 'user &&' guard (user is always truthy after the !user check above); remove debug console.log calls, keep console.error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00
Hermes VM
cdc23696b2 fix: resolve all TypeScript errors — green tsc
Primitives.tsx (TS2339):
- asChild branch read children.props.className before the cast applied,
  making props typed as unknown. Extract typedChild first, then read props.

hermes/page.tsx + agents/page.tsx + tasks/page.tsx + tasks/[id]/page.tsx (TS2322):
- Badge.variant accepts 'neutral'|'success'|'warning'|'error'|'info' but
  callers were passing 'danger' (should be 'error') and 'default' (should
  be 'neutral'). MetricCard.tone is a separate type and is correct as-is.

Changes:
- statusTone map in hermes/page.tsx: 'danger' → 'error', 'default' → 'neutral'
- getTaskTone fallback: 'default' → 'neutral'; explicit return type added
- levelTone in tasks/[id]/page.tsx: 'danger' → 'error'; explicit return type added
- Inline Badge variants: all remaining 'danger' → 'error' across 3 files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00
Hermes VM
1099d518ef improve: dashboard security, code quality, and UX fixes
Security (backend):
- env/routes: add requireAdmin to all 6 env endpoints — GET /env was
  fully open, exposing all secret values to unauthenticated requests
- deployments/routes: add requireAdmin to all 4 GET endpoints (deployment
  history and logs were publicly readable)
- health/routes: remove duplicate requireAdmin call from DELETE /health/cache
  handler body (was already enforced via preHandler)

Frontend — auth/api:
- system/page: replace raw fetch + localStorage token with apiRequest
  (mutations now go through CSRF flow)
- vm/page: same — replace raw fetch with vmApi from api.ts
- api.ts: add vmApi (getHealth, getCleanupLog, runCleanup) + shared
  VmHealthResult / VmCheck / VmCheckLevel types

Shared utilities:
- utils.ts: add formatBytes() and getStatusColor() shared helpers
- system/page: remove duplicate formatBytes, import from utils
- health/page: remove duplicate getStatusColor, import from utils
- page.tsx (home): remove duplicate getStatusColor, import from utils

UX improvements:
- page.tsx: remove Seed Services button from normal header (debug tool)
- page.tsx: deploy button now always enabled; shows inline warning banner
  when service is not 'up' instead of silently disabling the button
- metrics: fix bar chart — bars now grow from bottom (flex-col-reverse),
  add empty state, fix date parsing timezone edge case
- sidebar-nav: theme toggle now functional — persists to localStorage and
  toggles document.documentElement class 'dark'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00
Hermes VM
d0b8ce2c74 feat: add VM Health page to devops dashboard
Backend (Fastify):
- New module: modules/vm/ (types, repository, routes)
- GET  /api/vm/health      — runs vm-health-check.sh --json, returns structured result
- GET  /api/vm/cleanup-log — tails /var/log/vm-cleanup.log
- POST /api/vm/cleanup     — triggers vm-cleanup.sh (weekly / monthly / dry-run)
- Registered vmRoutes in server.ts

Frontend (Next.js):
- New page: /vm — VM Health
  - Overall status banner (OK/WARN/CRIT) with issue summary
  - Per-check cards: disk, load, RAM, swap, crash loops, container health,
    build cache, docker images, journal, syslog — color-coded by level
  - Cleanup trigger buttons (dry-run, weekly, monthly) with output viewer
  - Collapsible cleanup log viewer (last 40 lines)
  - Auto-refresh every 60s
- sidebar-nav.tsx: added 'VM Health' entry with Server icon

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00
Hermes VM
678430d77d fix: cron health-check entry should call vm-health-check.sh --notify
The 07:00 daily cron was incorrectly pointing to vm-cleanup.sh instead
of vm-health-check.sh. Health check is read-only; cleanup is not.
Also add --notify so Telegram alerts fire when WARNING/CRITICAL.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00
Hermes VM
0a2d303f93 add HostingerVM health-check and cleanup scripts
- vm-health-check.sh: read-only checks for disk, load, RAM, swap,
  Docker containers (crash-loops + healthchecks), build cache, journal.
  Flags: --quiet, --json, --notify (Telegram). Exit 0/1/2 = OK/WARN/CRIT.

- vm-cleanup.sh: safe periodic cleanup.
  Default (weekly): build cache, journal, apt, npm, .next/cache.
  --full (monthly): adds docker system prune, pnpm store, old logs, HOLD cleanup.
  --dry-run, --install-cron, --uninstall-cron.
  Logs to /var/log/vm-cleanup.log.

Related: docs/hostinger-vm-maintenance.md, scripts/VMs/HostingerVM/CRON_SETUP.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00
root
4249b17afc Document Firecrawl backend selection 2026-05-27 18:52:39 +00:00
root
08f32a79e8 Clarify remaining Hermes fallback verification 2026-05-27 18:46:32 +00:00
root
8fbb535d90 Add shared local Hermes fallback chain 2026-05-27 18:43:30 +00:00
root
9ee060e839 Harden Hermes operations dashboard status 2026-05-27 17:45:41 +00:00
root
0e6528b366 Add live Hermes operations dashboard 2026-05-27 13:04:36 +00:00