From 8de72351dec9f5674470e8d2f329701945c20252 Mon Sep 17 00:00:00 2001
From: root
Date: Wed, 27 May 2026 10:45:20 +0000
Subject: [PATCH] Complete Hermes dashboard and watchdog roadmap audit

---
 docs/hermes-operations.md             | 47 ++++++++++++++
 docs/hermes-setup-upgrade-roadmap.md  | 90 +++++++++++++++++++--------
 scripts/hermes-health-watchdog.py     | 53 +++++++++++++++-
 systemd/hermes-root-dashboard.service | 21 +++++++
 systemd/uma-hermes-dashboard.service  | 21 +++++++
 5 files changed, 203 insertions(+), 29 deletions(-)
 create mode 100644 systemd/hermes-root-dashboard.service
 create mode 100644 systemd/uma-hermes-dashboard.service

diff --git a/docs/hermes-operations.md b/docs/hermes-operations.md
index d9a57c3..501e573 100644
--- a/docs/hermes-operations.md
+++ b/docs/hermes-operations.md
@@ -18,6 +18,9 @@ Observed on 2026-05-27:
 - Watchdog cron: `ByteLyst Hermes gateway/backup/disk watchdog`, every 15 minutes, Telegram delivery on failure only
 - Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
 - Tailscale: installed and `tailscaled` enabled/running; authenticated as tailnet IP `100.87.53.10`
+- Private dashboards:
+  - Root: `http://100.87.53.10:9119/`, `hermes-root-dashboard.service`
+  - Uma: `http://100.87.53.10:9120/`, `uma-hermes-dashboard.service`
 
 ## Safety guardrail: no public Hermes dashboard/API
 
@@ -48,6 +51,25 @@ tailscale ip -4
 # Expected server IPv4: 100.87.53.10
 ```
 
+Private dashboard services:
+
+```bash
+systemctl status hermes-root-dashboard --no-pager
+systemctl status uma-hermes-dashboard --no-pager
+ss -ltnp | grep -E ':(9119|9120)'
+
+# Expected listeners are Tailscale-only:
+# 100.87.53.10:9119
+# 100.87.53.10:9120
+```
+
+Tracked service unit templates:
+
+```bash
+systemd/hermes-root-dashboard.service
+systemd/uma-hermes-dashboard.service
+```
+
 ## Health baseline commands
 
 ```bash
@@ -115,6 +137,7 @@ Behavior:
 - no output on success, so the cron stays silent
 - sends a Telegram message only when it detects an actionable failure
 - checks gateway service state, Hermes cron backup visibility/status, backup repo freshness when discoverable, and root disk usage
+- also checks memory pressure plus critical Caddy/Gitea Docker containers (`caddy`, `gitea-npm-registry`)
 
 Manual smoke test:
 
@@ -202,3 +225,27 @@ Restart/reset requirement:
 - gateway config changes: `/restart` from Telegram or `hermes gateway restart`
 - CLI session tool changes: start a new session or `/reset`
 - provider auth changes: start a new session after switching models/providers
+
+## Telegram topics and session handling
+
+Root and Uma currently use the standard Telegram gateway session handling. Do not enable or change topic/session behavior without a concrete routing need.
+
+Review these before changing Telegram routing:
+
+```bash
+systemctl status hermes-gateway --no-pager
+sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
+grep -RniE 'topic|thread|TELEGRAM_.*THREAD|HOME_CHANNEL' /root/.hermes /home/uma/.hermes 2>/dev/null | head -100
+```
+
+## Multi-agent execution conventions
+
+Use the smallest execution surface that fits the task:
+
+- direct tool call: one-shot local checks, edits, commits, pushes, status reads
+- `delegate_task`: bounded research or code inspection that can return inside the parent session
+- background terminal process: long-running local commands that need monitoring
+- cron job: recurring, deterministic, silent-on-success maintenance
+- Kanban worker: durable multi-agent project coordination after the board is intentionally configured
+
+Telegram progress/completion updates should keep the user's numbered-prefix convention (`1`, `2`, etc. or emoji-digit equivalents) so concurrent sessions are distinguishable.
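Review note on the `docs/hermes-operations.md` changes: the runbook's `ss -ltnp | grep -E ':(9119|9120)'` check relies on a human reading the output. A minimal sketch of the same check as a pure function follows, assuming the usual `ss -ltn` column layout (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port). The function name `unexpected_listeners` and the sample text are illustrative only and are not part of this patch.

```python
from __future__ import annotations


def unexpected_listeners(ss_output: str, ports: set[int], allowed_host: str) -> list[str]:
    """Return local addresses listening on `ports` that are not bound to `allowed_host`."""
    offenders: list[str] = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed `ss -ltn` row layout: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue  # skips the header row and blank lines
        host, _, port = fields[3].rpartition(":")
        if port.isdigit() and int(port) in ports and host != allowed_host:
            offenders.append(fields[3])
    return offenders


# Hypothetical sample: 9119 is bound to the tailnet IP, 9120 is bound to all interfaces.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      128    100.87.53.10:9119  0.0.0.0:*
LISTEN 0      128    0.0.0.0:9120       0.0.0.0:*
LISTEN 0      128    0.0.0.0:22         0.0.0.0:*
"""

if __name__ == "__main__":
    print(unexpected_listeners(SAMPLE, {9119, 9120}, "100.87.53.10"))
```

With the sample above, only the wildcard-bound `0.0.0.0:9120` entry is reported. On the real host, the input would be the captured output of `ss -ltn`. Whether such a check belongs in the watchdog is a separate decision that this patch does not make.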
diff --git a/docs/hermes-setup-upgrade-roadmap.md b/docs/hermes-setup-upgrade-roadmap.md
index ce9cc08..9942b99 100644
--- a/docs/hermes-setup-upgrade-roadmap.md
+++ b/docs/hermes-setup-upgrade-roadmap.md
@@ -8,10 +8,18 @@
 
 ## Completion Status
 
-- **Overall checklist completion:** ~57% (`102/179` checked after auditing root-vs-Uma evidence on 2026-05-27).
-- **Credential-independent setup:** materially further along; remaining blockers are mostly provider/search credentials, tailnet login, GitHub/Gitea tokens, and policy decisions.
+- **Overall checklist completion:** ~68% (`122/179` checked after the 2026-05-27 dashboard/watchdog/runbook audit).
+- **Credential-independent setup:** materially further along; remaining blockers are mostly provider/search credentials, GitHub/Gitea tokens, Uma backup design, and policy decisions.
   - vijay: percentage is based on literal Markdown checklist boxes, including nested sub-items. It intentionally counts credential-dependent future work as incomplete.
 
+## Remaining Unchecked Item Classification
+
+- **Needs credentials/API keys:** fallback provider setup, web search/extract backend, GitHub/Gitea automation token, Browserbase/Browser Use, and provider fallback tests.
+- **Needs explicit policy decision:** Cloudflare Access/basic-auth public fallback, model-routing tiers, local browser automation, vision/image provider choice, `security.redact_secrets`, `privacy.redact_pii`, and credential rotation.
+- **Needs Uma backup design:** Uma/Bheem currently has a clean VM wrapper repo, but not a root-style sanitized Hermes persistent backup/restore workflow.
+- **Needs manual UX validation:** dashboard feature-by-feature checks, Telegram approval prompt flow, and Telegram media/file delivery.
+- **Needs future workflow adoption:** practicing `delegate_task`, spawned/tmux sessions, worktrees, and Kanban on real tasks before checking them as completed.
+
 ## Purpose
 
 Turn the Hermes setup ideas from the referenced video into a practical ByteLyst upgrade checklist for this VM-backed, Telegram-driven Hermes installation.
@@ -243,7 +251,8 @@ A healthy ByteLyst Hermes setup should be:
 ### Phase 6 — Telegram Gateway Workflow
 
 - [x] Keep Telegram as the primary control plane.
-  - vijay: watchdog delivery is configured to the origin Telegram conversation; dashboard remains private-only/pending.
+  - vijay: watchdog delivery is configured to the origin Telegram conversation; root dashboard is private-only over Tailscale.
+  - bheem: Uma gateway remains Telegram-driven; Uma dashboard is private-only over Tailscale.
 - [x] Preserve the user's preferred progress prefix convention: `1️⃣`, `2️⃣`, etc.
   - vijay: retained in roadmap and memory; use for progress/completion updates from Hermes sessions.
 - [x] Ensure home channel and allowed user settings are correct.
@@ -254,22 +263,32 @@ A healthy ByteLyst Hermes setup should be:
   - [x] outbound completion message
   - [ ] approval prompt flow
   - [ ] media/file delivery
-- [ ] Decide whether Telegram topic/session handling should be enabled or documented.
+- [x] Decide whether Telegram topic/session handling should be enabled or documented.
+  - vijay: documented current stance in `docs/hermes-operations.md`: keep default Telegram session handling unless a concrete topic-routing need appears.
+  - bheem: same default-session stance applies to Uma/Bheem.
 - [x] Add a runbook for gateway restart/recovery.
   - vijay: added gateway recovery section to `docs/hermes-operations.md`.
 
 ### Phase 7 — Memory, Skills, And Knowledge Capture
 
-- [ ] Review persistent memory for stale entries and trim anything no longer useful.
-- [ ] Keep memories declarative and durable; avoid storing task-completion artifacts.
+- [x] Review persistent memory for stale entries and trim anything no longer useful.
+  - vijay: reviewed root `MEMORY.md` and `USER.md`; entries are operationally relevant, no safe deletion needed.
+  - bheem: reviewed Uma `MEMORY.md` and `USER.md`; entries are current Bheem context, no safe deletion needed.
+- [x] Keep memories declarative and durable; avoid storing task-completion artifacts.
+  - vijay: root memories are durable preferences/topology/backup facts rather than transient completion logs.
+  - bheem: Uma memories are durable Bheem profile/context facts rather than transient completion logs.
 - [ ] Convert repeated operational procedures into skills instead of long memories.
 - [ ] Pin critical ByteLyst/Hermes skills that should not be archived.
 - [ ] Schedule or manually run curator reviews if enabled.
 - [ ] Add skills for recurring ByteLyst workflows:
-  - [ ] Gitea Actions troubleshooting
-  - [ ] Caddy + Docker routing changes
-  - [ ] Hermes backup/restore drill
-  - [ ] Telegram gateway recovery
+  - [x] Gitea Actions troubleshooting
+    - vijay: root has `devops/self-hosted-gitea-ci`.
+  - [x] Caddy + Docker routing changes
+    - vijay: root has `devops/caddy-subdomain-routing`.
+  - [x] Hermes backup/restore drill
+    - vijay: root has `devops/hermes-persistent-backup-ops`; Uma backup workflow remains separate and not equivalent.
+  - [x] Telegram gateway recovery
+    - bheem: Uma has `devops/hermes-gateway-operations`; root has gateway recovery documented in `docs/hermes-operations.md`.
   - [ ] safe multi-repo commit/push workflow
 
 ### Phase 8 — Cron, Watchdogs, And Autonomous Maintenance
@@ -282,8 +301,10 @@ A healthy ByteLyst Hermes setup should be:
   - [x] cron scheduler stale
   - [x] backup job failed or no fresh commit within threshold
   - [x] disk usage high
-  - [ ] memory pressure high
-  - [ ] Caddy/Gitea critical services down
+  - [x] memory pressure high
+    - vijay: added `/proc/meminfo` memory-pressure threshold check to `scripts/hermes-health-watchdog.py`, deployed to `~/.hermes/scripts/hermes_health_watchdog.py`, and verified silent-on-success.
+  - [x] Caddy/Gitea critical services down
+    - vijay: added critical Docker container checks for `caddy` and `gitea-npm-registry`; deployed watchdog remains silent on a healthy run.
 - [x] Prefer `no_agent=True` script-only watchdogs for fixed health checks.
   - vijay: watchdog cron is no-agent/script-only and silent on success.
 - [x] Keep noisy health checks silent on success.
@@ -298,8 +319,8 @@ A healthy ByteLyst Hermes setup should be:
 - [x] Do not expose Hermes dashboard publicly.
   - vijay: no public dashboard/API route added; private-only policy documented.
 - [x] If a dashboard is useful, make it private-only and operationally scoped.
-  - vijay: selected private-only dashboard direction; Tailscale is connected at `100.87.53.10`. Dashboard itself is not running and no `9119/9120` listener is exposed.
-  - bheem: Uma dashboard access should use the same private-only Tailscale host path; no Uma dashboard listener is exposed.
+  - vijay: root dashboard is running as `hermes-root-dashboard.service` at `http://100.87.53.10:9119/`, bound only to the Tailscale IP.
+  - bheem: Uma dashboard is running as `uma-hermes-dashboard.service` at `http://100.87.53.10:9120/`, bound only to the Tailscale IP.
 - [ ] Dashboard should show:
   - [ ] gateway status
   - [ ] active sessions
@@ -307,9 +328,11 @@ A healthy ByteLyst Hermes setup should be:
   - [ ] backup freshness
   - [ ] recent sanitized alerts
   - [ ] quick links to docs/runbooks
+  - vijay: root dashboard HTTP endpoint returns `200` over Tailscale; feature-by-feature UI validation remains pending.
+  - bheem: Uma dashboard HTTP endpoint returns `200` over Tailscale; feature-by-feature UI validation remains pending.
 - [x] Any dashboard actions must require authentication and ideally remain reachable only over private network/tunnel.
-  - vijay: standing decision is local/Tailscale/SSH-only. Tailnet login is complete; dashboard auth validation remains a future task if the dashboard is started.
-  - bheem: same standing decision for Uma; no public dashboard route should be added.
+  - vijay: root dashboard is private-network-only via Tailscale IP binding; no public listener or Caddy route was added.
+  - bheem: Uma dashboard is private-network-only via Tailscale IP binding; no public listener or Caddy route was added.
 - [x] Add a Caddy review step before adding any new hostname.
   - vijay: added Caddy/port review commands to `docs/hermes-operations.md`.
 
@@ -319,13 +342,16 @@ A healthy ByteLyst Hermes setup should be:
 - [ ] Use spawned Hermes/tmux sessions only for long-running missions that must outlive the parent turn.
 - [ ] Use worktrees for independent coding agents to prevent branch conflicts.
 - [ ] For durable multi-agent coordination, evaluate Hermes Kanban.
-- [ ] Document when to use:
-  - [ ] direct tool call
-  - [ ] delegate_task
-  - [ ] background terminal process
-  - [ ] cron job
-  - [ ] Kanban worker
-- [ ] Add a ByteLyst convention for progress/completion Telegram notifications from concurrent sessions.
+- [x] Document when to use:
+  - [x] direct tool call
+  - [x] delegate_task
+  - [x] background terminal process
+  - [x] cron job
+  - [x] Kanban worker
+  - vijay: added multi-agent execution convention guidance to `docs/hermes-operations.md`.
+- [x] Add a ByteLyst convention for progress/completion Telegram notifications from concurrent sessions.
+  - vijay: documented the numbered/emoji-prefix convention in `docs/hermes-operations.md`.
+  - bheem: Uma/Bheem follows the same convention.
 
 ### Phase 11 — Security And Secret Hygiene
 
@@ -348,8 +374,9 @@ A healthy ByteLyst Hermes setup should be:
   - vijay: created `docs/hermes-operations.md`.
 - [x] Link this roadmap from `docs/repo-map.md`.
   - vijay: roadmap was already listed; added `docs/hermes-operations.md` to repo map.
-- [ ] Create or update runbooks for:
-  - [ ] installing/upgrading Hermes
+- [x] Create or update runbooks for:
+  - [x] installing/upgrading Hermes
+    - vijay: `docs/hermes-operations.md` contains upgrade commands and late-upgrade verification notes.
   - [x] restarting the gateway
   - [x] restoring persistent data from backup
   - [x] configuring providers/models
@@ -384,9 +411,12 @@ A healthy ByteLyst Hermes setup should be:
 
 ### Medium-Term — This Month
 
-- [ ] Evaluate private-only dashboard/mission-control UX.
+- [x] Evaluate private-only dashboard/mission-control UX.
+  - vijay: root dashboard is reachable via Tailscale at `http://100.87.53.10:9119/`.
+  - bheem: Uma dashboard is reachable via Tailscale at `http://100.87.53.10:9120/`.
 - [ ] Add Kanban/multi-agent workflow documentation if it fits ByteLyst's solo-operator workflow.
-- [ ] Add silent-on-success system watchdogs.
+- [x] Add silent-on-success system watchdogs.
+  - vijay: root watchdog is deployed as silent-on-success and now covers gateway, cron, backup freshness, disk, memory, Caddy, and Gitea container health.
 - [ ] Clean up stale memory/skills and pin critical skills.
 - [ ] Schedule quarterly restore drills.
   - vijay: quarterly restore drill reminder cron is configured for root.
@@ -433,6 +463,12 @@ This roadmap is complete when:
 - vijay: confirmed root service is enabled and active.
 - bheem: confirmed Uma service is enabled and active; Docker-based Uma Hermes remains removed.
 - vijay: installed Tailscale `1.98.3`; `tailscaled` is enabled/running and authenticated to tailnet IP `100.87.53.10`.
+- vijay: installed permanent root dashboard service `hermes-root-dashboard.service` at `http://100.87.53.10:9119/`.
+- bheem: installed permanent Uma dashboard service `uma-hermes-dashboard.service` at `http://100.87.53.10:9120/`.
+- vijay: added dashboard service unit templates under `systemd/` for repo tracking.
+- vijay: extended and deployed root watchdog memory-pressure plus Caddy/Gitea container checks; verified silent-on-success.
+- vijay: reviewed root persistent memories and recurring workflow skills.
+- bheem: reviewed Uma persistent memories and recurring workflow skills.
 - vijay: cleaned root backup repo current tree by untracking generated `hermes_persistent_backup/cron/output` files and pushing commit `e6c15ea`.
 - bheem: confirmed Uma wrapper repo is clean at `7ee5720` after Docker deployment removal.
 - vijay: ran root restore rehearsal into `/tmp/hermes-restore-test-root`, verified portable restore content, and scanned restored config/template for common token patterns.
diff --git a/scripts/hermes-health-watchdog.py b/scripts/hermes-health-watchdog.py
index de0ce4b..7d25cf0 100755
--- a/scripts/hermes-health-watchdog.py
+++ b/scripts/hermes-health-watchdog.py
@@ -14,9 +14,15 @@ from datetime import datetime, timezone
 from pathlib import Path
 
 DISK_WARN_PERCENT = int(os.getenv("HERMES_WATCHDOG_DISK_WARN_PERCENT", "85"))
+MEMORY_WARN_PERCENT = int(os.getenv("HERMES_WATCHDOG_MEMORY_WARN_PERCENT", "90"))
 BACKUP_STALE_MINUTES = int(os.getenv("HERMES_WATCHDOG_BACKUP_STALE_MINUTES", "90"))
 BACKUP_JOB_NAME = os.getenv("HERMES_WATCHDOG_BACKUP_JOB_NAME", "Sync Hermes persistent-data backup to GitHub")
 GATEWAY_SERVICE = os.getenv("HERMES_WATCHDOG_GATEWAY_SERVICE", "hermes-gateway.service")
+DOCKER_CONTAINERS = [
+    item.strip()
+    for item in os.getenv("HERMES_WATCHDOG_DOCKER_CONTAINERS", "caddy,gitea-npm-registry").split(",")
+    if item.strip()
+]
 HERMES_HOME = Path(os.getenv("HERMES_HOME", str(Path.home() / ".hermes")))
 
 
@@ -73,9 +79,49 @@ def check_disk(alerts: list[str]) -> None:
         alerts.append(f"root disk usage is high: {pct}% used (threshold {DISK_WARN_PERCENT}%)")
 
 
+def check_memory(alerts: list[str]) -> None:
+    meminfo: dict[str, int] = {}
+    for line in Path("/proc/meminfo").read_text(encoding="utf-8").splitlines():
+        parts = line.split()
+        if len(parts) >= 2:
+            meminfo[parts[0].rstrip(":")] = int(parts[1])
+    total = meminfo.get("MemTotal", 0)
+    available = meminfo.get("MemAvailable", 0)
+    if total <= 0 or available <= 0:
+        alerts.append("could not read memory pressure from /proc/meminfo")
+        return
+    used_pct = int(round(((total - available) / total) * 100))
+    if used_pct >= MEMORY_WARN_PERCENT:
+        alerts.append(f"memory pressure is high: {used_pct}% used (threshold {MEMORY_WARN_PERCENT}%)")
+
+
+def check_docker_containers(alerts: list[str]) -> None:
+    if not DOCKER_CONTAINERS:
+        return
+    docker = shutil.which("docker")
+    if not docker:
+        alerts.append("docker executable not found; cannot verify critical containers")
+        return
+    result = run([docker, "ps", "--format", "{{.Names}}"], timeout=20)
+    if result.returncode != 0:
+        alerts.append(f"`docker ps` failed while checking critical containers: {result.stderr.strip() or result.stdout.strip()}")
+        return
+    running = set(result.stdout.splitlines())
+    missing = [name for name in DOCKER_CONTAINERS if name not in running]
+    if missing:
+        alerts.append(f"critical Docker container(s) not running: {', '.join(missing)}")
+
+
 def main() -> int:
     alerts: list[str] = []
-    for check in (check_gateway, check_backup_cron, check_backup_repo_freshness, check_disk):
+    for check in (
+        check_gateway,
+        check_backup_cron,
+        check_backup_repo_freshness,
+        check_disk,
+        check_memory,
+        check_docker_containers,
+    ):
         try:
             check(alerts)
         except Exception as exc:  # noqa: BLE001 - watchdog should alert, not crash silently
@@ -85,7 +131,10 @@ def main() -> int:
         print("🚨 ByteLyst Hermes watchdog alert")
         for item in alerts:
             print(f"- {item}")
-        print("\nSuggested first checks: `systemctl status hermes-gateway --no-pager`, `hermes cron list`, `df -h /`.")
+        print(
+            "\nSuggested first checks: `systemctl status hermes-gateway --no-pager`, "
+            "`hermes cron list`, `df -h /`, `free -h`, `docker ps`."
+        )
         return 0
     return 0
 
diff --git a/systemd/hermes-root-dashboard.service b/systemd/hermes-root-dashboard.service
new file mode 100644
index 0000000..96e1baf
--- /dev/null
+++ b/systemd/hermes-root-dashboard.service
@@ -0,0 +1,21 @@
+[Unit]
+Description=Root Hermes Dashboard (Tailscale private)
+After=network-online.target tailscaled.service
+Wants=network-online.target
+
+[Service]
+Type=simple
+User=root
+Group=root
+WorkingDirectory=/usr/local/lib/hermes-agent
+Environment="HOME=/root"
+Environment="USER=root"
+Environment="LOGNAME=root"
+Environment="HERMES_HOME=/root/.hermes"
+Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
+ExecStart=/usr/local/lib/hermes-agent/venv/bin/hermes dashboard --host 100.87.53.10 --port 9119 --no-open --insecure --skip-build
+Restart=always
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target
diff --git a/systemd/uma-hermes-dashboard.service b/systemd/uma-hermes-dashboard.service
new file mode 100644
index 0000000..b5db599
--- /dev/null
+++ b/systemd/uma-hermes-dashboard.service
@@ -0,0 +1,21 @@
+[Unit]
+Description=Uma Hermes Dashboard (Tailscale private)
+After=network-online.target tailscaled.service
+Wants=network-online.target
+
+[Service]
+Type=simple
+User=uma
+Group=uma
+WorkingDirectory=/usr/local/lib/hermes-agent
+Environment="HOME=/home/uma"
+Environment="USER=uma"
+Environment="LOGNAME=uma"
+Environment="HERMES_HOME=/home/uma/.hermes"
+Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/home/uma/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
+ExecStart=/usr/local/lib/hermes-agent/venv/bin/hermes dashboard --host 100.87.53.10 --port 9120 --no-open --insecure --skip-build
+Restart=always
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target
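Review note on `scripts/hermes-health-watchdog.py`: the new `check_memory` and `DOCKER_CONTAINERS` logic reads `/proc/meminfo` and the environment directly, which makes the threshold arithmetic awkward to exercise off-host. The sketch below restates the same calculations as pure functions over text input. The helper names `memory_used_percent` and `parse_container_names` are hypothetical and not part of this patch; the `isdigit()` guard is an extra assumption the patched script does not make.

```python
from __future__ import annotations

from typing import Optional


def memory_used_percent(meminfo_text: str) -> Optional[int]:
    """Mirror the patch's formula: round((MemTotal - MemAvailable) / MemTotal * 100)."""
    values: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        parts = line.split()
        # Lines look like "MemTotal:  8000000 kB"; keep only numeric second fields.
        if len(parts) >= 2 and parts[1].isdigit():
            values[parts[0].rstrip(":")] = int(parts[1])
    total = values.get("MemTotal", 0)
    available = values.get("MemAvailable", 0)
    if total <= 0 or available <= 0:
        return None  # the watchdog raises an alert in this case
    return int(round(((total - available) / total) * 100))


def parse_container_names(raw: str) -> list[str]:
    """Split a comma-separated container list, dropping blanks, as the patch does."""
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    sample = "MemTotal:        8000000 kB\nMemAvailable:    2000000 kB\n"
    print(memory_used_percent(sample))
    print(parse_container_names("caddy, gitea-npm-registry,,"))
```

With the sample values, a quarter of memory is available, so the function reports 75, which is below the default `HERMES_WATCHDOG_MEMORY_WARN_PERCENT` of 90 and would keep the watchdog silent.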