Complete Hermes dashboard and watchdog roadmap audit

2026-05-27 10:45:20 +00:00 · 8de72351de
commit 8de72351de
parent 15ac960faf
5 changed files with 203 additions and 29 deletions
--- a/docs/hermes-operations.md
+++ b/docs/hermes-operations.md
@@ -18,6 +18,9 @@ Observed on 2026-05-27:
 - Watchdog cron: `ByteLyst Hermes gateway/backup/disk watchdog`, every 15 minutes, Telegram delivery on failure only
 - Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
 - Tailscale: installed and `tailscaled` enabled/running; authenticated as tailnet IP `100.87.53.10`
 - Private dashboards:
  - Root: `http://100.87.53.10:9119/`, `hermes-root-dashboard.service`
  - Uma: `http://100.87.53.10:9120/`, `uma-hermes-dashboard.service`
 ## Safety guardrail: no public Hermes dashboard/API
@@ -48,6 +51,25 @@ tailscale ip -4
 # Expected server IPv4: 100.87.53.10
 ```
 Private dashboard services:
 ```bash
 systemctl status hermes-root-dashboard --no-pager
 systemctl status uma-hermes-dashboard --no-pager
 ss -ltnp | grep -E ':(9119|9120)'
 # Expected listeners are Tailscale-only:
 # 100.87.53.10:9119
 # 100.87.53.10:9120
 ```
 Tracked service unit templates:
 ```bash
 systemd/hermes-root-dashboard.service
 systemd/uma-hermes-dashboard.service
 ```
 ## Health baseline commands
 ```bash
@@ -115,6 +137,7 @@ Behavior:
 - no output on success, so the cron stays silent
 - sends a Telegram message only when it detects an actionable failure
 - checks gateway service state, Hermes cron backup visibility/status, backup repo freshness when discoverable, and root disk usage
 - also checks memory pressure plus critical Caddy/Gitea Docker containers (`caddy`, `gitea-npm-registry`)
 Manual smoke test:
@@ -202,3 +225,27 @@ Restart/reset requirement:
 - gateway config changes: `/restart` from Telegram or `hermes gateway restart`
 - CLI session tool changes: start a new session or `/reset`
 - provider auth changes: start a new session after switching models/providers
 ## Telegram topics and session handling
 Root and Uma currently use the standard Telegram gateway session handling. Do not enable or change topic/session behavior without a concrete routing need.
 Review these before changing Telegram routing:
 ```bash
 systemctl status hermes-gateway --no-pager
 sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
 grep -RniE 'topic|thread|TELEGRAM_.*THREAD|HOME_CHANNEL' /root/.hermes /home/uma/.hermes 2>/dev/null | head -100
 ```
 ## Multi-agent execution conventions
 Use the smallest execution surface that fits the task:
 - direct tool call: one-shot local checks, edits, commits, pushes, status reads
 - `delegate_task`: bounded research or code inspection that can return inside the parent session
 - background terminal process: long-running local commands that need monitoring
 - cron job: recurring, deterministic, silent-on-success maintenance
 - Kanban worker: durable multi-agent project coordination after the board is intentionally configured
 Telegram progress/completion updates should keep the user's numbered-prefix convention (`1`, `2`, etc. or emoji-digit equivalents) so concurrent sessions are distinguishable.
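The private dashboard listener check in the runbook (`ss -ltnp | grep -E ':(9119|9120)'`) is read by eye. A minimal sketch of the same check as a script follows, assuming `ss -ltn` style rows where the fourth column is the local address and port. The helper names `split_host_port` and `non_private_listeners` are hypothetical and not part of this commit.

```python
# Assumed values, copied from the runbook text in this diff.
TAILSCALE_IP = "100.87.53.10"
DASHBOARD_PORTS = {9119, 9120}


def split_host_port(local_address: str) -> tuple[str, str]:
    """Split an `ss` local address field such as `100.87.53.10:9119` or `[::]:9120`."""
    host, _, port = local_address.rpartition(":")
    return host.strip("[]"), port


def non_private_listeners(ss_output: str) -> list[str]:
    """Return dashboard listeners bound to anything other than the Tailscale IP."""
    offenders = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Data rows look like: LISTEN Recv-Q Send-Q Local-Address:Port Peer-Address:Port
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        host, port = split_host_port(fields[3])
        if not port.isdigit():
            continue  # service names appear when `ss` runs without -n; skip them
        if int(port) in DASHBOARD_PORTS and host != TAILSCALE_IP:
            offenders.append(fields[3])
    return offenders
```

Feeding it the output of `ss -ltn` and alerting on a non-empty result would flag a dashboard that was accidentally bound to `0.0.0.0` or `[::]`.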
--- a/docs/hermes-setup-upgrade-roadmap.md
+++ b/docs/hermes-setup-upgrade-roadmap.md
@@ -8,10 +8,18 @@
 ## Completion Status
- **Overall checklist completion:** ~57% (`102/179` checked after auditing root-vs-Uma evidence on 2026-05-27).
+- **Overall checklist completion:** ~68% (`122/179` checked after the 2026-05-27 dashboard/watchdog/runbook audit).
- **Credential-independent setup:** materially further along; remaining blockers are mostly provider/search credentials, tailnet login, GitHub/Gitea tokens, and policy decisions.
+- **Credential-independent setup:** materially further along; remaining blockers are mostly provider/search credentials, GitHub/Gitea tokens, Uma backup design, and policy decisions.
 - vijay: percentage is based on literal Markdown checklist boxes, including nested sub-items. It intentionally counts credential-dependent future work as incomplete.
 ## Remaining Unchecked Item Classification
 - **Needs credentials/API keys:** fallback provider setup, web search/extract backend, GitHub/Gitea automation token, Browserbase/Browser Use, and provider fallback tests.
 - **Needs explicit policy decision:** Cloudflare Access/basic-auth public fallback, model-routing tiers, local browser automation, vision/image provider choice, `security.redact_secrets`, `privacy.redact_pii`, and credential rotation.
 - **Needs Uma backup design:** Uma/Bheem currently has a clean VM wrapper repo, but not a root-style sanitized Hermes persistent backup/restore workflow.
 - **Needs manual UX validation:** dashboard feature-by-feature checks, Telegram approval prompt flow, and Telegram media/file delivery.
 - **Needs future workflow adoption:** practicing `delegate_task`, spawned/tmux sessions, worktrees, and Kanban on real tasks before checking them as completed.
 ## Purpose
 Turn the Hermes setup ideas from the referenced video into a practical ByteLyst upgrade checklist for this VM-backed, Telegram-driven Hermes installation.
@@ -243,7 +251,8 @@ A healthy ByteLyst Hermes setup should be:
 ### Phase 6 — Telegram Gateway Workflow
 - [x] Keep Telegram as the primary control plane.
-  - vijay: watchdog delivery is configured to the origin Telegram conversation; dashboard remains private-only/pending.
+  - vijay: watchdog delivery is configured to the origin Telegram conversation; root dashboard is private-only over Tailscale.
  - bheem: Uma gateway remains Telegram-driven; Uma dashboard is private-only over Tailscale.
 - [x] Preserve the user's preferred progress prefix convention: `1️⃣`, `2️⃣`, etc.
  - vijay: retained in roadmap and memory; use for progress/completion updates from Hermes sessions.
 - [x] Ensure home channel and allowed user settings are correct.
@@ -254,22 +263,32 @@ A healthy ByteLyst Hermes setup should be:
  - [x] outbound completion message
  - [ ] approval prompt flow
  - [ ] media/file delivery
- [ ] Decide whether Telegram topic/session handling should be enabled or documented.
+- [x] Decide whether Telegram topic/session handling should be enabled or documented.
  - vijay: documented current stance in `docs/hermes-operations.md`: keep default Telegram session handling unless a concrete topic-routing need appears.
  - bheem: same default-session stance applies to Uma/Bheem.
 - [x] Add a runbook for gateway restart/recovery.
  - vijay: added gateway recovery section to `docs/hermes-operations.md`.
 ### Phase 7 — Memory, Skills, And Knowledge Capture
- [ ] Review persistent memory for stale entries and trim anything no longer useful.
+- [x] Review persistent memory for stale entries and trim anything no longer useful.
- [ ] Keep memories declarative and durable; avoid storing task-completion artifacts.
+  - vijay: reviewed root `MEMORY.md` and `USER.md`; entries are operationally relevant, no safe deletion needed.
  - bheem: reviewed Uma `MEMORY.md` and `USER.md`; entries are current Bheem context, no safe deletion needed.
 - [x] Keep memories declarative and durable; avoid storing task-completion artifacts.
  - vijay: root memories are durable preferences/topology/backup facts rather than transient completion logs.
  - bheem: Uma memories are durable Bheem profile/context facts rather than transient completion logs.
 - [ ] Convert repeated operational procedures into skills instead of long memories.
 - [ ] Pin critical ByteLyst/Hermes skills that should not be archived.
 - [ ] Schedule or manually run curator reviews if enabled.
 - [ ] Add skills for recurring ByteLyst workflows:
-  - [ ] Gitea Actions troubleshooting
+  - [x] Gitea Actions troubleshooting
-  - [ ] Caddy + Docker routing changes
+    - vijay: root has `devops/self-hosted-gitea-ci`.
-  - [ ] Hermes backup/restore drill
+  - [x] Caddy + Docker routing changes
-  - [ ] Telegram gateway recovery
+    - vijay: root has `devops/caddy-subdomain-routing`.
  - [x] Hermes backup/restore drill
    - vijay: root has `devops/hermes-persistent-backup-ops`; Uma backup workflow remains separate and not equivalent.
  - [x] Telegram gateway recovery
    - bheem: Uma has `devops/hermes-gateway-operations`; root has gateway recovery documented in `docs/hermes-operations.md`.
  - [ ] safe multi-repo commit/push workflow
 ### Phase 8 — Cron, Watchdogs, And Autonomous Maintenance
@@ -282,8 +301,10 @@ A healthy ByteLyst Hermes setup should be:
  - [x] cron scheduler stale
  - [x] backup job failed or no fresh commit within threshold
  - [x] disk usage high
-  - [ ] memory pressure high
+  - [x] memory pressure high
-  - [ ] Caddy/Gitea critical services down
+    - vijay: added `/proc/meminfo` memory-pressure threshold check to `scripts/hermes-health-watchdog.py`, deployed to `~/.hermes/scripts/hermes_health_watchdog.py`, and verified silent-on-success.
  - [x] Caddy/Gitea critical services down
    - vijay: added critical Docker container checks for `caddy` and `gitea-npm-registry`; deployed watchdog remains silent on a healthy run.
 - [x] Prefer `no_agent=True` script-only watchdogs for fixed health checks.
  - vijay: watchdog cron is no-agent/script-only and silent on success.
 - [x] Keep noisy health checks silent on success.
@@ -298,8 +319,8 @@ A healthy ByteLyst Hermes setup should be:
 - [x] Do not expose Hermes dashboard publicly.
  - vijay: no public dashboard/API route added; private-only policy documented.
 - [x] If a dashboard is useful, make it private-only and operationally scoped.
-  - vijay: selected private-only dashboard direction; Tailscale is connected at `100.87.53.10`. Dashboard itself is not running and no `9119/9120` listener is exposed.
+  - vijay: root dashboard is running as `hermes-root-dashboard.service` at `http://100.87.53.10:9119/`, bound only to the Tailscale IP.
-  - bheem: Uma dashboard access should use the same private-only Tailscale host path; no Uma dashboard listener is exposed.
+  - bheem: Uma dashboard is running as `uma-hermes-dashboard.service` at `http://100.87.53.10:9120/`, bound only to the Tailscale IP.
 - [ ] Dashboard should show:
  - [ ] gateway status
  - [ ] active sessions
@@ -307,9 +328,11 @@ A healthy ByteLyst Hermes setup should be:
  - [ ] backup freshness
  - [ ] recent sanitized alerts
  - [ ] quick links to docs/runbooks
  - vijay: root dashboard HTTP endpoint returns `200` over Tailscale; feature-by-feature UI validation remains pending.
  - bheem: Uma dashboard HTTP endpoint returns `200` over Tailscale; feature-by-feature UI validation remains pending.
 - [x] Any dashboard actions must require authentication and ideally remain reachable only over private network/tunnel.
-  - vijay: standing decision is local/Tailscale/SSH-only. Tailnet login is complete; dashboard auth validation remains a future task if the dashboard is started.
+  - vijay: root dashboard is private-network-only via Tailscale IP binding; no public listener or Caddy route was added.
-  - bheem: same standing decision for Uma; no public dashboard route should be added.
+  - bheem: Uma dashboard is private-network-only via Tailscale IP binding; no public listener or Caddy route was added.
 - [x] Add a Caddy review step before adding any new hostname.
  - vijay: added Caddy/port review commands to `docs/hermes-operations.md`.
@@ -319,13 +342,16 @@ A healthy ByteLyst Hermes setup should be:
 - [ ] Use spawned Hermes/tmux sessions only for long-running missions that must outlive the parent turn.
 - [ ] Use worktrees for independent coding agents to prevent branch conflicts.
 - [ ] For durable multi-agent coordination, evaluate Hermes Kanban.
- [ ] Document when to use:
+- [x] Document when to use:
-  - [ ] direct tool call
+  - [x] direct tool call
-  - [ ] delegate_task
+  - [x] delegate_task
-  - [ ] background terminal process
+  - [x] background terminal process
-  - [ ] cron job
+  - [x] cron job
-  - [ ] Kanban worker
+  - [x] Kanban worker
- [ ] Add a ByteLyst convention for progress/completion Telegram notifications from concurrent sessions.
+  - vijay: added multi-agent execution convention guidance to `docs/hermes-operations.md`.
 - [x] Add a ByteLyst convention for progress/completion Telegram notifications from concurrent sessions.
  - vijay: documented the numbered/emoji-prefix convention in `docs/hermes-operations.md`.
  - bheem: Uma/Bheem follows the same convention.
 ### Phase 11 — Security And Secret Hygiene
@@ -348,8 +374,9 @@ A healthy ByteLyst Hermes setup should be:
  - vijay: created `docs/hermes-operations.md`.
 - [x] Link this roadmap from `docs/repo-map.md`.
  - vijay: roadmap was already listed; added `docs/hermes-operations.md` to repo map.
- [ ] Create or update runbooks for:
+- [x] Create or update runbooks for:
-  - [ ] installing/upgrading Hermes
+  - [x] installing/upgrading Hermes
    - vijay: `docs/hermes-operations.md` contains upgrade commands and late-upgrade verification notes.
  - [x] restarting the gateway
  - [x] restoring persistent data from backup
  - [x] configuring providers/models
@@ -384,9 +411,12 @@ A healthy ByteLyst Hermes setup should be:
 ### Medium-Term — This Month
- [ ] Evaluate private-only dashboard/mission-control UX.
+- [x] Evaluate private-only dashboard/mission-control UX.
  - vijay: root dashboard is reachable via Tailscale at `http://100.87.53.10:9119/`.
  - bheem: Uma dashboard is reachable via Tailscale at `http://100.87.53.10:9120/`.
 - [ ] Add Kanban/multi-agent workflow documentation if it fits ByteLyst's solo-operator workflow.
- [ ] Add silent-on-success system watchdogs.
+- [x] Add silent-on-success system watchdogs.
  - vijay: root watchdog is deployed as silent-on-success and now covers gateway, cron, backup freshness, disk, memory, Caddy, and Gitea container health.
 - [ ] Clean up stale memory/skills and pin critical skills.
 - [ ] Schedule quarterly restore drills.
  - vijay: quarterly restore drill reminder cron is configured for root.
@@ -433,6 +463,12 @@ This roadmap is complete when:
 - vijay: confirmed root service is enabled and active.
 - bheem: confirmed Uma service is enabled and active; Docker-based Uma Hermes remains removed.
 - vijay: installed Tailscale `1.98.3`; `tailscaled` is enabled/running and authenticated to tailnet IP `100.87.53.10`.
 - vijay: installed permanent root dashboard service `hermes-root-dashboard.service` at `http://100.87.53.10:9119/`.
 - bheem: installed permanent Uma dashboard service `uma-hermes-dashboard.service` at `http://100.87.53.10:9120/`.
 - vijay: added dashboard service unit templates under `systemd/` for repo tracking.
 - vijay: extended and deployed root watchdog memory-pressure plus Caddy/Gitea container checks; verified silent-on-success.
 - vijay: reviewed root persistent memories and recurring workflow skills.
 - bheem: reviewed Uma persistent memories and recurring workflow skills.
 - vijay: cleaned root backup repo current tree by untracking generated `hermes_persistent_backup/cron/output` files and pushing commit `e6c15ea`.
 - bheem: confirmed Uma wrapper repo is clean at `7ee5720` after Docker deployment removal.
 - vijay: ran root restore rehearsal into `/tmp/hermes-restore-test-root`, verified portable restore content, and scanned restored config/template for common token patterns.
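The roadmap's Completion Status section says the percentage is derived from literal Markdown checklist boxes, including nested sub-items. A minimal sketch of that count follows, assuming boxes are written as `- [ ]` or `- [x]` at any indentation. The function name `checklist_completion` is hypothetical, and the commit does not say which tool produced the published figures.

```python
import re

# Matches Markdown task-list items at any indentation: "- [ ] ..." or "- [x] ...".
CHECKBOX = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s")


def checklist_completion(markdown: str) -> tuple[int, int, int]:
    """Return (checked, total, rounded percent) for all checklist boxes in the text."""
    checked = 0
    total = 0
    for line in markdown.splitlines():
        match = CHECKBOX.match(line)
        if not match:
            continue  # plain bullets and annotation lines are not counted
        total += 1
        if match.group(1).lower() == "x":
            checked += 1
    percent = round(100 * checked / total) if total else 0
    return checked, total, percent
```

Under this counting rule the figures quoted in the diff are internally consistent: `102/179` rounds to 57% and `122/179` rounds to 68%.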
--- a/scripts/hermes-health-watchdog.py
+++ b/scripts/hermes-health-watchdog.py
@@ -14,9 +14,15 @@ from datetime import datetime, timezone
 from pathlib import Path
 DISK_WARN_PERCENT = int(os.getenv("HERMES_WATCHDOG_DISK_WARN_PERCENT", "85"))
 MEMORY_WARN_PERCENT = int(os.getenv("HERMES_WATCHDOG_MEMORY_WARN_PERCENT", "90"))
 BACKUP_STALE_MINUTES = int(os.getenv("HERMES_WATCHDOG_BACKUP_STALE_MINUTES", "90"))
 BACKUP_JOB_NAME = os.getenv("HERMES_WATCHDOG_BACKUP_JOB_NAME", "Sync Hermes persistent-data backup to GitHub")
 GATEWAY_SERVICE = os.getenv("HERMES_WATCHDOG_GATEWAY_SERVICE", "hermes-gateway.service")
 DOCKER_CONTAINERS = [
    item.strip()
    for item in os.getenv("HERMES_WATCHDOG_DOCKER_CONTAINERS", "caddy,gitea-npm-registry").split(",")
    if item.strip()
 ]
 HERMES_HOME = Path(os.getenv("HERMES_HOME", str(Path.home() / ".hermes")))
@@ -73,9 +79,49 @@ def check_disk(alerts: list[str]) -> None:
        alerts.append(f"root disk usage is high: {pct}% used (threshold {DISK_WARN_PERCENT}%)")
 def check_memory(alerts: list[str]) -> None:
    meminfo: dict[str, int] = {}
    for line in Path("/proc/meminfo").read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if len(parts) >= 2:
            meminfo[parts[0].rstrip(":")] = int(parts[1])
    total = meminfo.get("MemTotal", 0)
    available = meminfo.get("MemAvailable", 0)
    if total <= 0 or available <= 0:
        alerts.append("could not read memory pressure from /proc/meminfo")
        return
    used_pct = int(round(((total - available) / total) * 100))
    if used_pct >= MEMORY_WARN_PERCENT:
        alerts.append(f"memory pressure is high: {used_pct}% used (threshold {MEMORY_WARN_PERCENT}%)")
 def check_docker_containers(alerts: list[str]) -> None:
    if not DOCKER_CONTAINERS:
        return
    docker = shutil.which("docker")
    if not docker:
        alerts.append("docker executable not found; cannot verify critical containers")
        return
    result = run([docker, "ps", "--format", "{{.Names}}"], timeout=20)
    if result.returncode != 0:
        alerts.append(f"`docker ps` failed while checking critical containers: {result.stderr.strip() or result.stdout.strip()}")
        return
    running = set(result.stdout.splitlines())
    missing = [name for name in DOCKER_CONTAINERS if name not in running]
    if missing:
        alerts.append(f"critical Docker container(s) not running: {', '.join(missing)}")
 def main() -> int:
    alerts: list[str] = []
-    for check in (check_gateway, check_backup_cron, check_backup_repo_freshness, check_disk):
+    for check in (
        check_gateway,
        check_backup_cron,
        check_backup_repo_freshness,
        check_disk,
        check_memory,
        check_docker_containers,
    ):
        try:
            check(alerts)
        except Exception as exc:  # noqa: BLE001 - watchdog should alert, not crash silently
@@ -85,7 +131,10 @@ def main() -> int:
        print("🚨 ByteLyst Hermes watchdog alert")
        for item in alerts:
            print(f"- {item}")
-        print("\nSuggested first checks: `systemctl status hermes-gateway --no-pager`, `hermes cron list`, `df -h /`.")
+        print(
            "\nSuggested first checks: `systemctl status hermes-gateway --no-pager`, "
            "`hermes cron list`, `df -h /`, `free -h`, `docker ps`."
        )
        return 0
    return 0
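The two new checks read live system state (`/proc/meminfo` and `docker ps`), which makes them awkward to exercise without a real host. One possible approach, sketched here as an assumption rather than as part of the commit, is to factor the decision logic into pure functions that take text input. The names `memory_used_percent` and `missing_containers` are hypothetical.

```python
from typing import Optional


def memory_used_percent(meminfo_text: str) -> Optional[int]:
    """Mirror check_memory's arithmetic: used% = (MemTotal - MemAvailable) / MemTotal."""
    meminfo: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1].isdigit():
            meminfo[parts[0].rstrip(":")] = int(parts[1])
    total = meminfo.get("MemTotal", 0)
    available = meminfo.get("MemAvailable", 0)
    if total <= 0 or available <= 0:
        return None  # same "could not read" condition the watchdog alerts on
    return int(round((total - available) / total * 100))


def missing_containers(docker_ps_names: str, required: list[str]) -> list[str]:
    """Return required container names absent from `docker ps --format '{{.Names}}'` output."""
    running = {name.strip() for name in docker_ps_names.splitlines() if name.strip()}
    return [name for name in required if name not in running]
```

With the default `HERMES_WATCHDOG_MEMORY_WARN_PERCENT` of `90`, a host reporting `MemTotal: 8000000 kB` and `MemAvailable: 800000 kB` computes exactly 90% used, which meets the `>=` threshold and would raise an alert.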
--- a/systemd/hermes-root-dashboard.service
+++ b/systemd/hermes-root-dashboard.service
@@ -0,0 +1,21 @@
 [Unit]
 Description=Root Hermes Dashboard (Tailscale private)
 After=network-online.target tailscaled.service
 Wants=network-online.target
 [Service]
 Type=simple
 User=root
 Group=root
 WorkingDirectory=/usr/local/lib/hermes-agent
 Environment="HOME=/root"
 Environment="USER=root"
 Environment="LOGNAME=root"
 Environment="HERMES_HOME=/root/.hermes"
 Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
 ExecStart=/usr/local/lib/hermes-agent/venv/bin/hermes dashboard --host 100.87.53.10 --port 9119 --no-open --insecure --skip-build
 Restart=always
 RestartSec=5
 [Install]
 WantedBy=multi-user.target
--- a/systemd/uma-hermes-dashboard.service
+++ b/systemd/uma-hermes-dashboard.service
@@ -0,0 +1,21 @@
 [Unit]
 Description=Uma Hermes Dashboard (Tailscale private)
 After=network-online.target tailscaled.service
 Wants=network-online.target
 [Service]
 Type=simple
 User=uma
 Group=uma
 WorkingDirectory=/usr/local/lib/hermes-agent
 Environment="HOME=/home/uma"
 Environment="USER=uma"
 Environment="LOGNAME=uma"
 Environment="HERMES_HOME=/home/uma/.hermes"
 Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/home/uma/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
 ExecStart=/usr/local/lib/hermes-agent/venv/bin/hermes dashboard --host 100.87.53.10 --port 9120 --no-open --insecure --skip-build
 Restart=always
 RestartSec=5
 [Install]
 WantedBy=multi-user.target
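Both unit templates hardcode `--host 100.87.53.10` in `ExecStart`, so the private-only policy depends on that value never drifting to a wildcard address. A minimal sketch of a lint for the tracked templates follows. The helper names are hypothetical, and the address test assumes Tailscale's usual `100.64.0.0/10` range plus loopback count as private.

```python
import ipaddress
import shlex
from typing import Optional

# Assumption: tailnet IPv4 addresses come from the 100.64.0.0/10 carrier-grade NAT range.
TAILNET_RANGE = ipaddress.ip_network("100.64.0.0/10")


def exec_start_host(unit_text: str) -> Optional[str]:
    """Extract the value passed to --host on the unit's ExecStart line, if any."""
    for line in unit_text.splitlines():
        line = line.strip()
        if not line.startswith("ExecStart="):
            continue
        args = shlex.split(line[len("ExecStart="):])
        for index, arg in enumerate(args):
            if arg == "--host" and index + 1 < len(args):
                return args[index + 1]
            if arg.startswith("--host="):
                return arg.split("=", 1)[1]
    return None


def is_private_bind(host: Optional[str]) -> bool:
    """True only for loopback or tailnet-range addresses; wildcards and hostnames fail."""
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False
    return address.is_loopback or (address.version == 4 and address in TAILNET_RANGE)
```

Running `is_private_bind(exec_start_host(text))` over each file in `systemd/` would catch a template edited to bind `0.0.0.0` before it is deployed.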