bytelyst/bytelyst-devops-tools

root 8de72351de Complete Hermes dashboard and watchdog roadmap audit

2026-05-27 10:45:29 +00:00

8.1 KiB

ByteLyst Hermes Operations Runbook

Operational runbook for the private Telegram-driven Hermes Agent setup on the ByteLyst VM.

Current baseline

Observed on 2026-05-27:

Hermes version: v0.14.0 (2026.5.16)
Shared source checkout: /usr/local/lib/hermes-agent at upstream 0b6ace649 after the 2026-05-27 late upgrade pass
Install path: /usr/local/lib/hermes-agent
Active profile: default
Primary provider: OpenAI Codex OAuth
Root Telegram gateway: hermes-gateway.service, system service, enabled and running
Uma Telegram gateway: uma-hermes-gateway.service, user service for uma, enabled and running
Root and Uma default model: gpt-5.5, model.routing.enabled: false
Backup cron: Sync Hermes persistent-data backup to GitHub, every 30 minutes, local delivery
Watchdog cron: ByteLyst Hermes gateway/backup/disk watchdog, every 15 minutes, Telegram delivery on failure only
Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
Tailscale: installed; tailscaled enabled and running; authenticated with tailnet IP 100.87.53.10
Private dashboards:
- Root: http://100.87.53.10:9119/, hermes-root-dashboard.service
- Uma: http://100.87.53.10:9120/, uma-hermes-dashboard.service

Safety guardrail: no public Hermes dashboard/API

Before adding any new Caddy hostname, Docker port, or dashboard/API feature, verify that it does not expose the Hermes dashboard or API publicly.

# Inspect public Caddy routes and obvious Hermes/API/dashboard references.
docker ps --format '{{.Names}} {{.Ports}}' | grep -i caddy || true
grep -RniE 'hermes|dashboard|api-server|API_SERVER|8000|8080|3000|5173' /etc/caddy /root/bytelyst.ai 2>/dev/null | head -100

# Inspect listening ports. Review any 0.0.0.0 listeners before exposing a hostname.
ss -ltnp
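
The listener review can also be scripted. A minimal sketch, assuming the usual `ss -ltn` column layout (state, Recv-Q, Send-Q, local address:port, peer); the function name and the set of wildcard hosts are illustrative and not part of the tracked tooling:

```python
# Hosts that mean "bind on every interface" in ss output (assumed set).
WILDCARD_HOSTS = {"0.0.0.0", "*", "[::]", "::"}


def wildcard_listeners(ss_output: str) -> list[tuple[str, int]]:
    """Return (host, port) pairs from `ss -ltn` text that listen on all interfaces."""
    found = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a LISTEN socket.
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        # Split on the last colon so IPv6 hosts such as [::] stay intact.
        host, _, port = fields[3].rpartition(":")
        if host in WILDCARD_HOSTS and port.isdigit():
            found.append((host, int(port)))
    return found
```

Feed it the text of `ss -ltn`, for example `subprocess.run(["ss", "-ltn"], capture_output=True, text=True).stdout`, and review every returned pair before adding a hostname.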

Allowed private access patterns for any Hermes dashboard:

local-only binding (127.0.0.1)
SSH tunnel
Tailscale/WireGuard private network
Cloudflare Access or equivalent identity gate
basic auth plus IP allowlist only if public routing is unavoidable and explicitly approved

Current private network access:

tailscale status
tailscale ip -4
# Expected server IPv4: 100.87.53.10

Private dashboard services:

systemctl status hermes-root-dashboard --no-pager
systemctl status uma-hermes-dashboard --no-pager
ss -ltnp | grep -E ':(9119|9120)'

# Expected listeners are Tailscale-only:
# 100.87.53.10:9119
# 100.87.53.10:9120
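
The expected-listener check above can be expressed as a small function. This is a sketch under the same `ss -ltn` column-layout assumption; the function name and message wording are hypothetical, while the IP and ports are the ones recorded in this runbook:

```python
EXPECTED_HOST = "100.87.53.10"  # tailnet IP from this runbook
DASHBOARD_PORTS = (9119, 9120)  # root and Uma dashboards


def dashboard_binding_problems(ss_output, host=EXPECTED_HOST, ports=DASHBOARD_PORTS):
    """Return a list of problems; an empty list means both ports are Tailscale-only."""
    seen = {port: set() for port in ports}
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        addr, _, port = fields[3].rpartition(":")
        if port.isdigit() and int(port) in seen:
            seen[int(port)].add(addr)

    problems = []
    for port, addrs in seen.items():
        if not addrs:
            problems.append(f"port {port}: no listener")
        # Any bind address other than the tailnet IP is a finding.
        for addr in sorted(addrs - {host}):
            problems.append(f"port {port}: unexpected bind address {addr}")
    return problems
```

An empty result matches the expected state; any entry should be treated as a possible public exposure until proven otherwise.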

Tracked service unit templates:

systemd/hermes-root-dashboard.service
systemd/uma-hermes-dashboard.service

Health baseline commands

hermes --version
hermes config check
hermes doctor --fix
hermes status --all
hermes cron list
systemctl status hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
df -h /
free -h
ss -ltnp

Notes:

hermes doctor --fix migrated root and Uma configs to version 24 on 2026-05-27.
Optional providers/search backends are mostly not configured yet. Configure through Hermes setup/auth flows only; never commit credentials.

Gateway recovery

systemctl status hermes-gateway --no-pager
journalctl -u hermes-gateway -n 100 --no-pager
hermes gateway restart
# If the CLI restart path is unavailable:
sudo systemctl restart hermes-gateway

# Uma user gateway:
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 journalctl --user -u uma-hermes-gateway -n 100 --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user restart uma-hermes-gateway

After restart, verify from Telegram:

inbound message receives a response
outbound completion messages work
approval prompts still reach the allowed user
media/file delivery works for a known safe file if needed

Cron and watchdogs

List jobs:

hermes cron list

Current watchdog script:

~/.hermes/scripts/hermes_health_watchdog.py

Tracked source copy:

scripts/hermes-health-watchdog.py

Behavior:

no output on success, so the cron stays silent
sends a Telegram message only when it detects an actionable failure
checks gateway service state, Hermes cron backup visibility/status, backup repo freshness when discoverable, and root disk usage
also checks memory pressure plus critical Caddy/Gitea Docker containers (caddy, gitea-npm-registry)
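
The silent-on-success pattern can be sketched as follows. This is not the tracked watchdog script: the 90% disk threshold, the function names, and the assumption that the cron delivery forwards any printed output are illustrative only.

```python
import shutil
import subprocess

DISK_THRESHOLD = 0.90  # hypothetical limit; the real threshold is not documented here


def check_disk(path="/", threshold=DISK_THRESHOLD, usage=shutil.disk_usage):
    """Return a failure message when the filesystem is too full, else None."""
    total, used, _free = usage(path)
    fraction = used / total
    if fraction >= threshold:
        return f"disk {path} at {fraction:.0%} (limit {threshold:.0%})"
    return None


def check_service(unit="hermes-gateway", run=subprocess.run):
    """Return a failure message when the systemd unit is not active, else None."""
    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    result = run(["systemctl", "is-active", "--quiet", unit])
    if result.returncode != 0:
        return f"service {unit} is not active"
    return None


def main(checks):
    """Print failures and return 1, or print nothing and return 0."""
    failures = [msg for msg in (check() for check in checks) if msg]
    if failures:
        print("\n".join(failures))
        return 1
    return 0  # no output at all on success
```

A script entry point would call `sys.exit(main([check_disk, check_service]))`. Keeping each check as a function that returns a message or `None` makes it easy to add further checks without changing the silent-on-success behavior.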

Manual smoke test:

python3 ~/.hermes/scripts/hermes_health_watchdog.py
# Healthy output should be empty.

Backup and restore drill outline

The persistent-data backup repo intentionally excludes raw secrets and state.db.

Quarterly restore drill:

Run the backup sync manually or wait for a successful cron run.
Clone the backup repo into a temporary directory.

Inspect git contents for accidental raw secrets:

git grep -nE '(API_KEY|TOKEN|SECRET|PASSWORD|BEGIN .*PRIVATE KEY)' || true

Restore into a non-production Hermes profile/test directory only.
Verify config, skills, sessions JSON exports, cron definitions, memories, and scripts are present.
Confirm .env, OAuth files, SQLite WAL/SHM files, logs, caches, and raw state.db are absent.
Delete the temporary restore directory when done.
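
The presence and absence checks in the drill can be automated. A minimal sketch: the expected names come from this runbook, while the forbidden filename patterns are assumptions (SQLite sidecar files end in `-wal` and `-shm`; OAuth file names are not documented here and are therefore not covered):

```python
from pathlib import Path

# Portable items that a restore should contain (from this runbook).
EXPECTED = ("config.yaml", "skills", "sessions", "cron", "memories", "scripts")
# Filename patterns that must not appear anywhere in a restore (assumed patterns).
FORBIDDEN_PATTERNS = (".env", "state.db", "*-wal", "*-shm", "*.log")


def audit_restore(root):
    """Return a list of findings for a restored backup directory; empty means clean."""
    root = Path(root)
    problems = [f"missing: {name}" for name in EXPECTED if not (root / name).exists()]
    for pattern in FORBIDDEN_PATTERNS:
        # rglob searches the whole tree, so nested copies are caught too.
        for hit in sorted(root.rglob(pattern)):
            problems.append(f"should be absent: {hit.relative_to(root)}")
    return problems
```

For example, `audit_restore("/tmp/hermes-restore-test-root")` would list anything missing or unexpectedly present in that directory.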

2026-05-27 restore rehearsal:

Restored root backup into /tmp/hermes-restore-test-root.
Verified portable directories/files were present: config.yaml, skills/, sessions/, cron/, memories/, and scripts/.
Verified raw state.db was absent.
Scanned restored .env template and config.yaml for common token patterns; no hits.
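
The token-pattern scan can be repeated on restored files without a git checkout. This sketch reuses the same regular expression as the `git grep` command in the drill; the function names are illustrative. Because the pattern is name-based, it also flags a variable name whose value is empty, so each hit still needs manual review.

```python
import re
from pathlib import Path

# Same pattern as the git grep command used in the restore drill.
SECRET_RE = re.compile(r"(API_KEY|TOKEN|SECRET|PASSWORD|BEGIN .*PRIVATE KEY)")


def scan_text(text):
    """Return 1-based line numbers whose content matches a secret-like pattern."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if SECRET_RE.search(line)
    ]


def scan_files(paths):
    """Map each file path to its matching line numbers, omitting clean files."""
    hits = {}
    for path in map(Path, paths):
        lines = scan_text(path.read_text(errors="replace"))
        if lines:
            hits[str(path)] = lines
    return hits
```

An empty dictionary from `scan_files` corresponds to the "no hits" result recorded above.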

Upgrade checklist

Before upgrade:

hermes --version
hermes status --all
hermes config check
hermes cron list
python3 ~/.hermes/scripts/sync_hermes_persistent_backup.py

Upgrade from an interactive/private shell only:

hermes update

After upgrade:

hermes doctor --fix
hermes gateway restart
hermes --version
hermes status --all
hermes cron list
python3 ~/.hermes/scripts/hermes_health_watchdog.py

Then run Telegram smoke tests and record any manual fixups in this doc or the roadmap.
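
The pre-upgrade sequence can be wrapped so that it stops at the first failing step. A sketch only, assuming each command exits non-zero on failure (not verified here for every Hermes subcommand); the helper name is hypothetical and `hermes update` is deliberately left out so the upgrade itself stays a manual, interactive step:

```python
import os
import subprocess

# Pre-upgrade commands from this checklist, in order.
PRE_UPGRADE = [
    ["hermes", "--version"],
    ["hermes", "status", "--all"],
    ["hermes", "config", "check"],
    ["hermes", "cron", "list"],
    ["python3", os.path.expanduser("~/.hermes/scripts/sync_hermes_persistent_backup.py")],
]


def run_steps(steps, run=subprocess.run):
    """Run steps in order; return the first failing command, or None if all pass."""
    for cmd in steps:
        if run(cmd).returncode != 0:
            return cmd  # stop here so later steps do not mask the failure
    return None
```

If `run_steps(PRE_UPGRADE)` returns a command, resolve that failure before running `hermes update`.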

2026-05-27 late upgrade pass:

Backed up root/Uma configs and service units under /root/hermes-fix-backups/20260527-roadmap-noncreds/.
Fast-forwarded /usr/local/lib/hermes-agent to upstream 0b6ace649.
Restarted both gateways.
Verified provider smoke tests with exact responses root-roadmap-ok and uma-roadmap-ok.

Provider and tool changes

Use Hermes flows rather than editing secrets into git-tracked files:

hermes model
hermes setup model
hermes tools list
hermes tools enable <toolset>
hermes tools disable <toolset>

Restart/reset requirement:

gateway config changes: /restart from Telegram or hermes gateway restart
CLI session tool changes: start a new session or /reset
provider auth changes: start a new session after switching models/providers

Telegram topics and session handling

Root and Uma currently use the standard Telegram gateway session handling. Do not enable or change topic/session behavior without a concrete routing need.

Review these before changing Telegram routing:

systemctl status hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
grep -RniE 'topic|thread|TELEGRAM_.*THREAD|HOME_CHANNEL' /root/.hermes /home/uma/.hermes 2>/dev/null | head -100

Multi-agent execution conventions

Use the smallest execution surface that fits the task:

direct tool call: one-shot local checks, edits, commits, pushes, status reads
delegate_task: bounded research or code inspection that can return inside the parent session
background terminal process: long-running local commands that need monitoring
cron job: recurring, deterministic, silent-on-success maintenance
Kanban worker: durable multi-agent project coordination after the board is intentionally configured

Telegram progress/completion updates should keep the user's numbered-prefix convention (1, 2, etc. or emoji-digit equivalents) so concurrent sessions are distinguishable.
