From 8de72351dec9f5674470e8d2f329701945c20252 Mon Sep 17 00:00:00 2001
From: root <root@srv1491630.hstgr.cloud>
Date: Wed, 27 May 2026 10:45:20 +0000
Subject: [PATCH] Complete Hermes dashboard and watchdog roadmap audit

---
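Review note (not part of the commit): the memory check added to
scripts/hermes-health-watchdog.py derives a used percentage from the
MemTotal and MemAvailable fields of /proc/meminfo. The sketch below
restates that calculation as a pure function so it can be exercised
against sample text without touching the host. The function name
memory_used_percent and the SAMPLE text are hypothetical and do not
exist in the repo. Treating only a missing MemAvailable field (rather
than a zero value) as unreadable is an assumption of this note.

```python
from typing import Optional


def memory_used_percent(meminfo_text: str) -> Optional[int]:
    """Return used-memory percent from /proc/meminfo-style text, or None if unreadable."""
    fields: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        parts = line.split()
        # Keep only "Key: <integer> [unit]" lines; skip anything else.
        if len(parts) >= 2 and parts[1].isdigit():
            fields[parts[0].rstrip(":")] = int(parts[1])
    total = fields.get("MemTotal", 0)
    # Assumption: a present-but-zero MemAvailable means full pressure, not a read error.
    if total <= 0 or "MemAvailable" not in fields:
        return None
    available = fields["MemAvailable"]
    return int(round((total - available) / total * 100))


# Hypothetical sample values, chosen so that 10% of memory is available.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:          200000 kB
MemAvailable:     800000 kB
"""

print(memory_used_percent(SAMPLE))
```

With the default HERMES_WATCHDOG_MEMORY_WARN_PERCENT of 90, the sample
above sits exactly at the alert threshold, because the watchdog compares
with `>=`.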
 docs/hermes-operations.md             | 47 ++++++++++++++
 docs/hermes-setup-upgrade-roadmap.md  | 90 +++++++++++++++++++--------
 scripts/hermes-health-watchdog.py     | 53 +++++++++++++++-
 systemd/hermes-root-dashboard.service | 21 +++++++
 systemd/uma-hermes-dashboard.service  | 21 +++++++
 5 files changed, 203 insertions(+), 29 deletions(-)
 create mode 100644 systemd/hermes-root-dashboard.service
 create mode 100644 systemd/uma-hermes-dashboard.service

diff --git a/docs/hermes-operations.md b/docs/hermes-operations.md
index d9a57c3..501e573 100644
--- a/docs/hermes-operations.md
+++ b/docs/hermes-operations.md
@@ -18,6 +18,9 @@ Observed on 2026-05-27:
 - Watchdog cron: `ByteLyst Hermes gateway/backup/disk watchdog`, every 15 minutes, Telegram delivery on failure only
 - Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
 - Tailscale: installed and `tailscaled` enabled/running; authenticated as tailnet IP `100.87.53.10`
+- Private dashboards:
+  - Root: `http://100.87.53.10:9119/`, `hermes-root-dashboard.service`
+  - Uma: `http://100.87.53.10:9120/`, `uma-hermes-dashboard.service`
 
 ## Safety guardrail: no public Hermes dashboard/API
 
@@ -48,6 +51,25 @@ tailscale ip -4
 # Expected server IPv4: 100.87.53.10
 ```
 
+Private dashboard services:
+
+```bash
+systemctl status hermes-root-dashboard --no-pager
+systemctl status uma-hermes-dashboard --no-pager
+ss -ltnp | grep -E ':(9119|9120)'
+
+# Expected listeners are Tailscale-only:
+# 100.87.53.10:9119
+# 100.87.53.10:9120
+```
+
+Tracked service unit templates:
+
+```text
+systemd/hermes-root-dashboard.service
+systemd/uma-hermes-dashboard.service
+```
+
 ## Health baseline commands
 
 ```bash
@@ -115,6 +137,7 @@ Behavior:
 - no output on success, so the cron stays silent
 - sends a Telegram message only when it detects an actionable failure
 - checks gateway service state, Hermes cron backup visibility/status, backup repo freshness when discoverable, and root disk usage
+- also checks memory pressure plus critical Caddy/Gitea Docker containers (`caddy`, `gitea-npm-registry`)
 
 Manual smoke test:
 
@@ -202,3 +225,27 @@ Restart/reset requirement:
 - gateway config changes: `/restart` from Telegram or `hermes gateway restart`
 - CLI session tool changes: start a new session or `/reset`
 - provider auth changes: start a new session after switching models/providers
+
+## Telegram topics and session handling
+
+Root and Uma currently use the standard Telegram gateway session handling. Do not enable or change topic/session behavior without a concrete routing need.
+
+Review these before changing Telegram routing:
+
+```bash
+systemctl status hermes-gateway --no-pager
+sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
+grep -RniE 'topic|thread|TELEGRAM_.*THREAD|HOME_CHANNEL' /root/.hermes /home/uma/.hermes 2>/dev/null | head -100
+```
+
+## Multi-agent execution conventions
+
+Use the smallest execution surface that fits the task:
+
+- direct tool call: one-shot local checks, edits, commits, pushes, status reads
+- `delegate_task`: bounded research or code inspection that can return inside the parent session
+- background terminal process: long-running local commands that need monitoring
+- cron job: recurring, deterministic, silent-on-success maintenance
+- Kanban worker: durable multi-agent project coordination after the board is intentionally configured
+
+Telegram progress/completion updates should keep the user's numbered-prefix convention (`1`, `2`, etc. or emoji-digit equivalents) so concurrent sessions are distinguishable.
diff --git a/docs/hermes-setup-upgrade-roadmap.md b/docs/hermes-setup-upgrade-roadmap.md
index ce9cc08..9942b99 100644
--- a/docs/hermes-setup-upgrade-roadmap.md
+++ b/docs/hermes-setup-upgrade-roadmap.md
@@ -8,10 +8,18 @@
 
 ## Completion Status
 
-- **Overall checklist completion:** ~57% (`102/179` checked after auditing root-vs-Uma evidence on 2026-05-27).
-- **Credential-independent setup:** materially further along; remaining blockers are mostly provider/search credentials, tailnet login, GitHub/Gitea tokens, and policy decisions.
+- **Overall checklist completion:** ~68% (`122/179` checked after the 2026-05-27 dashboard/watchdog/runbook audit).
+- **Credential-independent setup:** materially further along; remaining blockers are mostly provider/search credentials, GitHub/Gitea tokens, Uma backup design, and policy decisions.
 - vijay: percentage is based on literal Markdown checklist boxes, including nested sub-items. It intentionally counts credential-dependent future work as incomplete.
 
+## Remaining Unchecked Item Classification
+
+- **Needs credentials/API keys:** fallback provider setup, web search/extract backend, GitHub/Gitea automation token, Browserbase/Browser Use, and provider fallback tests.
+- **Needs explicit policy decision:** Cloudflare Access/basic-auth public fallback, model-routing tiers, local browser automation, vision/image provider choice, `security.redact_secrets`, `privacy.redact_pii`, and credential rotation.
+- **Needs Uma backup design:** Uma/Bheem currently has a clean VM wrapper repo, but not a root-style sanitized Hermes persistent backup/restore workflow.
+- **Needs manual UX validation:** dashboard feature-by-feature checks, Telegram approval prompt flow, and Telegram media/file delivery.
+- **Needs future workflow adoption:** practicing `delegate_task`, spawned/tmux sessions, worktrees, and Kanban on real tasks before checking them as completed.
+
 ## Purpose
 
 Turn the Hermes setup ideas from the referenced video into a practical ByteLyst upgrade checklist for this VM-backed, Telegram-driven Hermes installation.
@@ -243,7 +251,8 @@ A healthy ByteLyst Hermes setup should be:
 ### Phase 6 — Telegram Gateway Workflow
 
 - [x] Keep Telegram as the primary control plane.
-  - vijay: watchdog delivery is configured to the origin Telegram conversation; dashboard remains private-only/pending.
+  - vijay: watchdog delivery is configured to the origin Telegram conversation; root dashboard is private-only over Tailscale.
+  - bheem: Uma gateway remains Telegram-driven; Uma dashboard is private-only over Tailscale.
 - [x] Preserve the user's preferred progress prefix convention: `1️⃣`, `2️⃣`, etc.
   - vijay: retained in roadmap and memory; use for progress/completion updates from Hermes sessions.
 - [x] Ensure home channel and allowed user settings are correct.
@@ -254,22 +263,32 @@ A healthy ByteLyst Hermes setup should be:
   - [x] outbound completion message
   - [ ] approval prompt flow
   - [ ] media/file delivery
-- [ ] Decide whether Telegram topic/session handling should be enabled or documented.
+- [x] Decide whether Telegram topic/session handling should be enabled or documented.
+  - vijay: documented current stance in `docs/hermes-operations.md`: keep default Telegram session handling unless a concrete topic-routing need appears.
+  - bheem: same default-session stance applies to Uma/Bheem.
 - [x] Add a runbook for gateway restart/recovery.
   - vijay: added gateway recovery section to `docs/hermes-operations.md`.
 
 ### Phase 7 — Memory, Skills, And Knowledge Capture
 
-- [ ] Review persistent memory for stale entries and trim anything no longer useful.
-- [ ] Keep memories declarative and durable; avoid storing task-completion artifacts.
+- [x] Review persistent memory for stale entries and trim anything no longer useful.
+  - vijay: reviewed root `MEMORY.md` and `USER.md`; all entries are still operationally relevant, so nothing was trimmed.
+  - bheem: reviewed Uma `MEMORY.md` and `USER.md`; all entries are current Bheem context, so nothing was trimmed.
+- [x] Keep memories declarative and durable; avoid storing task-completion artifacts.
+  - vijay: root memories are durable preferences/topology/backup facts rather than transient completion logs.
+  - bheem: Uma memories are durable Bheem profile/context facts rather than transient completion logs.
 - [ ] Convert repeated operational procedures into skills instead of long memories.
 - [ ] Pin critical ByteLyst/Hermes skills that should not be archived.
 - [ ] Schedule or manually run curator reviews if enabled.
 - [ ] Add skills for recurring ByteLyst workflows:
-  - [ ] Gitea Actions troubleshooting
-  - [ ] Caddy + Docker routing changes
-  - [ ] Hermes backup/restore drill
-  - [ ] Telegram gateway recovery
+  - [x] Gitea Actions troubleshooting
+    - vijay: root has `devops/self-hosted-gitea-ci`.
+  - [x] Caddy + Docker routing changes
+    - vijay: root has `devops/caddy-subdomain-routing`.
+  - [x] Hermes backup/restore drill
+    - vijay: root has `devops/hermes-persistent-backup-ops`; Uma backup workflow remains separate and not equivalent.
+  - [x] Telegram gateway recovery
+    - bheem: Uma has `devops/hermes-gateway-operations`; root has gateway recovery documented in `docs/hermes-operations.md`.
   - [ ] safe multi-repo commit/push workflow
 
 ### Phase 8 — Cron, Watchdogs, And Autonomous Maintenance
@@ -282,8 +301,10 @@ A healthy ByteLyst Hermes setup should be:
   - [x] cron scheduler stale
   - [x] backup job failed or no fresh commit within threshold
   - [x] disk usage high
-  - [ ] memory pressure high
-  - [ ] Caddy/Gitea critical services down
+  - [x] memory pressure high
+    - vijay: added `/proc/meminfo` memory-pressure threshold check to `scripts/hermes-health-watchdog.py`, deployed to `~/.hermes/scripts/hermes_health_watchdog.py`, and verified silent-on-success.
+  - [x] Caddy/Gitea critical services down
+    - vijay: added critical Docker container checks for `caddy` and `gitea-npm-registry`; deployed watchdog remains silent on a healthy run.
 - [x] Prefer `no_agent=True` script-only watchdogs for fixed health checks.
   - vijay: watchdog cron is no-agent/script-only and silent on success.
 - [x] Keep noisy health checks silent on success.
@@ -298,8 +319,8 @@ A healthy ByteLyst Hermes setup should be:
 - [x] Do not expose Hermes dashboard publicly.
   - vijay: no public dashboard/API route added; private-only policy documented.
 - [x] If a dashboard is useful, make it private-only and operationally scoped.
-  - vijay: selected private-only dashboard direction; Tailscale is connected at `100.87.53.10`. Dashboard itself is not running and no `9119/9120` listener is exposed.
-  - bheem: Uma dashboard access should use the same private-only Tailscale host path; no Uma dashboard listener is exposed.
+  - vijay: root dashboard is running as `hermes-root-dashboard.service` at `http://100.87.53.10:9119/`, bound only to the Tailscale IP.
+  - bheem: Uma dashboard is running as `uma-hermes-dashboard.service` at `http://100.87.53.10:9120/`, bound only to the Tailscale IP.
 - [ ] Dashboard should show:
   - [ ] gateway status
   - [ ] active sessions
@@ -307,9 +328,11 @@ A healthy ByteLyst Hermes setup should be:
   - [ ] backup freshness
   - [ ] recent sanitized alerts
   - [ ] quick links to docs/runbooks
+  - vijay: root dashboard HTTP endpoint returns `200` over Tailscale; feature-by-feature UI validation remains pending.
+  - bheem: Uma dashboard HTTP endpoint returns `200` over Tailscale; feature-by-feature UI validation remains pending.
 - [x] Any dashboard actions must require authentication and ideally remain reachable only over private network/tunnel.
-  - vijay: standing decision is local/Tailscale/SSH-only. Tailnet login is complete; dashboard auth validation remains a future task if the dashboard is started.
-  - bheem: same standing decision for Uma; no public dashboard route should be added.
+  - vijay: root dashboard is private-network-only via Tailscale IP binding; no public listener or Caddy route was added.
+  - bheem: Uma dashboard is private-network-only via Tailscale IP binding; no public listener or Caddy route was added.
 - [x] Add a Caddy review step before adding any new hostname.
   - vijay: added Caddy/port review commands to `docs/hermes-operations.md`.
 
@@ -319,13 +342,16 @@ A healthy ByteLyst Hermes setup should be:
 - [ ] Use spawned Hermes/tmux sessions only for long-running missions that must outlive the parent turn.
 - [ ] Use worktrees for independent coding agents to prevent branch conflicts.
 - [ ] For durable multi-agent coordination, evaluate Hermes Kanban.
-- [ ] Document when to use:
-  - [ ] direct tool call
-  - [ ] delegate_task
-  - [ ] background terminal process
-  - [ ] cron job
-  - [ ] Kanban worker
-- [ ] Add a ByteLyst convention for progress/completion Telegram notifications from concurrent sessions.
+- [x] Document when to use:
+  - [x] direct tool call
+  - [x] delegate_task
+  - [x] background terminal process
+  - [x] cron job
+  - [x] Kanban worker
+  - vijay: added multi-agent execution convention guidance to `docs/hermes-operations.md`.
+- [x] Add a ByteLyst convention for progress/completion Telegram notifications from concurrent sessions.
+  - vijay: documented the numbered/emoji-prefix convention in `docs/hermes-operations.md`.
+  - bheem: Uma/Bheem follows the same convention.
 
 ### Phase 11 — Security And Secret Hygiene
 
@@ -348,8 +374,9 @@ A healthy ByteLyst Hermes setup should be:
   - vijay: created `docs/hermes-operations.md`.
 - [x] Link this roadmap from `docs/repo-map.md`.
   - vijay: roadmap was already listed; added `docs/hermes-operations.md` to repo map.
-- [ ] Create or update runbooks for:
-  - [ ] installing/upgrading Hermes
+- [x] Create or update runbooks for:
+  - [x] installing/upgrading Hermes
+    - vijay: `docs/hermes-operations.md` contains upgrade commands and late-upgrade verification notes.
   - [x] restarting the gateway
   - [x] restoring persistent data from backup
   - [x] configuring providers/models
@@ -384,9 +411,12 @@ A healthy ByteLyst Hermes setup should be:
 
 ### Medium-Term — This Month
 
-- [ ] Evaluate private-only dashboard/mission-control UX.
+- [x] Evaluate private-only dashboard/mission-control UX.
+  - vijay: root dashboard is reachable via Tailscale at `http://100.87.53.10:9119/`.
+  - bheem: Uma dashboard is reachable via Tailscale at `http://100.87.53.10:9120/`.
 - [ ] Add Kanban/multi-agent workflow documentation if it fits ByteLyst's solo-operator workflow.
-- [ ] Add silent-on-success system watchdogs.
+- [x] Add silent-on-success system watchdogs.
+  - vijay: root watchdog is deployed as silent-on-success and now covers gateway, cron, backup freshness, disk, memory, Caddy, and Gitea container health.
 - [ ] Clean up stale memory/skills and pin critical skills.
 - [ ] Schedule quarterly restore drills.
   - vijay: quarterly restore drill reminder cron is configured for root.
@@ -433,6 +463,12 @@ This roadmap is complete when:
 - vijay: confirmed root service is enabled and active.
 - bheem: confirmed Uma service is enabled and active; Docker-based Uma Hermes remains removed.
 - vijay: installed Tailscale `1.98.3`; `tailscaled` is enabled/running and authenticated to tailnet IP `100.87.53.10`.
+- vijay: installed permanent root dashboard service `hermes-root-dashboard.service` at `http://100.87.53.10:9119/`.
+- bheem: installed permanent Uma dashboard service `uma-hermes-dashboard.service` at `http://100.87.53.10:9120/`.
+- vijay: added dashboard service unit templates under `systemd/` for repo tracking.
+- vijay: extended the root watchdog with memory-pressure and Caddy/Gitea container checks, deployed it, and verified silent-on-success.
+- vijay: reviewed root persistent memories and recurring workflow skills.
+- bheem: reviewed Uma persistent memories and recurring workflow skills.
 - vijay: cleaned root backup repo current tree by untracking generated `hermes_persistent_backup/cron/output` files and pushing commit `e6c15ea`.
 - bheem: confirmed Uma wrapper repo is clean at `7ee5720` after Docker deployment removal.
 - vijay: ran root restore rehearsal into `/tmp/hermes-restore-test-root`, verified portable restore content, and scanned restored config/template for common token patterns.
diff --git a/scripts/hermes-health-watchdog.py b/scripts/hermes-health-watchdog.py
index de0ce4b..7d25cf0 100755
--- a/scripts/hermes-health-watchdog.py
+++ b/scripts/hermes-health-watchdog.py
@@ -14,9 +14,15 @@ from datetime import datetime, timezone
 from pathlib import Path
 
 DISK_WARN_PERCENT = int(os.getenv("HERMES_WATCHDOG_DISK_WARN_PERCENT", "85"))
+MEMORY_WARN_PERCENT = int(os.getenv("HERMES_WATCHDOG_MEMORY_WARN_PERCENT", "90"))
 BACKUP_STALE_MINUTES = int(os.getenv("HERMES_WATCHDOG_BACKUP_STALE_MINUTES", "90"))
 BACKUP_JOB_NAME = os.getenv("HERMES_WATCHDOG_BACKUP_JOB_NAME", "Sync Hermes persistent-data backup to GitHub")
 GATEWAY_SERVICE = os.getenv("HERMES_WATCHDOG_GATEWAY_SERVICE", "hermes-gateway.service")
+DOCKER_CONTAINERS = [
+    item.strip()
+    for item in os.getenv("HERMES_WATCHDOG_DOCKER_CONTAINERS", "caddy,gitea-npm-registry").split(",")
+    if item.strip()
+]
 HERMES_HOME = Path(os.getenv("HERMES_HOME", str(Path.home() / ".hermes")))
 
 
@@ -73,9 +79,49 @@ def check_disk(alerts: list[str]) -> None:
         alerts.append(f"root disk usage is high: {pct}% used (threshold {DISK_WARN_PERCENT}%)")
 
 
+def check_memory(alerts: list[str]) -> None:
+    meminfo: dict[str, int] = {}
+    for line in Path("/proc/meminfo").read_text(encoding="utf-8").splitlines():
+        parts = line.split()
+        if len(parts) >= 2:
+            meminfo[parts[0].rstrip(":")] = int(parts[1])
+    total = meminfo.get("MemTotal", 0)
+    available = meminfo.get("MemAvailable", 0)
+    if total <= 0 or "MemAvailable" not in meminfo:
+        alerts.append("could not read memory pressure from /proc/meminfo")
+        return
+    used_pct = int(round(((total - available) / total) * 100))
+    if used_pct >= MEMORY_WARN_PERCENT:
+        alerts.append(f"memory pressure is high: {used_pct}% used (threshold {MEMORY_WARN_PERCENT}%)")
+
+
+def check_docker_containers(alerts: list[str]) -> None:
+    if not DOCKER_CONTAINERS:
+        return
+    docker = shutil.which("docker")
+    if not docker:
+        alerts.append("docker executable not found; cannot verify critical containers")
+        return
+    result = run([docker, "ps", "--format", "{{.Names}}"], timeout=20)
+    if result.returncode != 0:
+        alerts.append(f"`docker ps` failed while checking critical containers: {result.stderr.strip() or result.stdout.strip()}")
+        return
+    running = set(result.stdout.splitlines())
+    missing = [name for name in DOCKER_CONTAINERS if name not in running]
+    if missing:
+        alerts.append(f"critical Docker container(s) not running: {', '.join(missing)}")
+
+
 def main() -> int:
     alerts: list[str] = []
-    for check in (check_gateway, check_backup_cron, check_backup_repo_freshness, check_disk):
+    for check in (
+        check_gateway,
+        check_backup_cron,
+        check_backup_repo_freshness,
+        check_disk,
+        check_memory,
+        check_docker_containers,
+    ):
         try:
             check(alerts)
         except Exception as exc:  # noqa: BLE001 - watchdog should alert, not crash silently
@@ -85,7 +131,10 @@ def main() -> int:
         print("🚨 ByteLyst Hermes watchdog alert")
         for item in alerts:
             print(f"- {item}")
-        print("\nSuggested first checks: `systemctl status hermes-gateway --no-pager`, `hermes cron list`, `df -h /`.")
+        print(
+            "\nSuggested first checks: `systemctl status hermes-gateway --no-pager`, "
+            "`hermes cron list`, `df -h /`, `free -h`, `docker ps`."
+        )
         return 0
     return 0
 
diff --git a/systemd/hermes-root-dashboard.service b/systemd/hermes-root-dashboard.service
new file mode 100644
index 0000000..96e1baf
--- /dev/null
+++ b/systemd/hermes-root-dashboard.service
@@ -0,0 +1,21 @@
+[Unit]
+Description=Root Hermes Dashboard (Tailscale private)
+After=network-online.target tailscaled.service
+Wants=network-online.target
+
+[Service]
+Type=simple
+User=root
+Group=root
+WorkingDirectory=/usr/local/lib/hermes-agent
+Environment="HOME=/root"
+Environment="USER=root"
+Environment="LOGNAME=root"
+Environment="HERMES_HOME=/root/.hermes"
+Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
+ExecStart=/usr/local/lib/hermes-agent/venv/bin/hermes dashboard --host 100.87.53.10 --port 9119 --no-open --insecure --skip-build
+Restart=always
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target
diff --git a/systemd/uma-hermes-dashboard.service b/systemd/uma-hermes-dashboard.service
new file mode 100644
index 0000000..b5db599
--- /dev/null
+++ b/systemd/uma-hermes-dashboard.service
@@ -0,0 +1,21 @@
+[Unit]
+Description=Uma Hermes Dashboard (Tailscale private)
+After=network-online.target tailscaled.service
+Wants=network-online.target
+
+[Service]
+Type=simple
+User=uma
+Group=uma
+WorkingDirectory=/usr/local/lib/hermes-agent
+Environment="HOME=/home/uma"
+Environment="USER=uma"
+Environment="LOGNAME=uma"
+Environment="HERMES_HOME=/home/uma/.hermes"
+Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/home/uma/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
+ExecStart=/usr/local/lib/hermes-agent/venv/bin/hermes dashboard --host 100.87.53.10 --port 9120 --no-open --insecure --skip-build
+Restart=always
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target