bytelyst-devops-tools/docs/hermes-operations.md

# ByteLyst Hermes Operations Runbook
Operational runbook for the private Telegram-driven Hermes Agent setup on the ByteLyst VM.
## Current baseline
Observed on 2026-05-27:
- Hermes version: `v0.14.0 (2026.5.16)`
- Shared source checkout: `/usr/local/lib/hermes-agent` at upstream `0b6ace649` after the 2026-05-27 late upgrade pass
- Install path: `/usr/local/lib/hermes-agent`
- Active profile: `default`
- Primary provider: OpenAI Codex OAuth
- Root Telegram gateway: `hermes-gateway.service`, system service, enabled and running
- Uma Telegram gateway: `uma-hermes-gateway.service`, user service for `uma`, enabled and running
- Root and Uma default model: `gpt-5.5`, `model.routing.enabled: false`
- Backup cron: `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, local delivery
- Watchdog cron: `ByteLyst Hermes gateway/backup/disk watchdog`, every 15 minutes, Telegram delivery on failure only
- Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
- Tailscale: installed and `tailscaled` enabled/running; login intentionally deferred until the operator can authenticate the node
## Safety guardrail: no public Hermes dashboard/API
Before adding any new Caddy hostname, Docker port, or dashboard/API feature, verify that the change does not expose the Hermes dashboard or API to the public internet.
```bash
# Inspect public Caddy routes and obvious Hermes/API/dashboard references.
docker ps --format '{{.Names}} {{.Ports}}' | grep -i caddy || true
grep -RniE 'hermes|dashboard|api-server|API_SERVER|8000|8080|3000|5173' /etc/caddy /root/bytelyst.ai 2>/dev/null | head -100
# Inspect listening ports. Review any 0.0.0.0 listeners before exposing a hostname.
ss -ltnp
```
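The listener review can be narrowed to wildcard binds only. A minimal sketch, assuming the usual `ss -ltn` column layout (local address in the fourth field); `public_listeners` is a hypothetical helper name, not part of Hermes:

```bash
# Read `ss -ltn` / `ss -ltnp` output on stdin and print the local address of
# every listener bound to all interfaces (0.0.0.0, [::], or *).
# Loopback binds such as 127.0.0.1:PORT are not printed.
public_listeners() {
  awk '$1 == "LISTEN" && $4 ~ /^(0\.0\.0\.0|\[::\]|\*):/ { print $4 }'
}
```

Usage: `ss -ltn | public_listeners`. Empty output means no wildcard listeners were found; any line printed still needs a manual check against firewall and Caddy rules.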
Allowed private access patterns for a future Hermes dashboard:
1. local-only binding (`127.0.0.1`)
2. SSH tunnel
3. Tailscale/WireGuard private network
4. Cloudflare Access or equivalent identity gate
5. basic auth plus IP allowlist, and only if public routing is unavoidable and explicitly approved
## Health baseline commands
```bash
hermes --version
hermes config check
hermes doctor --fix
hermes status --all
hermes cron list
systemctl status hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
df -h /
free -h
ss -ltnp
```
Notes:
- `hermes doctor --fix` migrated root and Uma configs to version `24` on 2026-05-27.
- Optional providers/search backends are mostly not configured yet. Configure through Hermes setup/auth flows only; never commit credentials.
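The baseline commands can be wrapped so that one failing step is reported without stopping the rest. This is a sketch only: `check` and `failures` are hypothetical names, and the wrapper makes no assumption about what the Hermes commands print.

```bash
# Count of failed checks in this shell session.
failures=0

# check LABEL COMMAND [ARGS...]
# Runs the command quietly, prints ok/FAIL with the label, and counts failures.
check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'ok    %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label"
    failures=$((failures + 1))
  fi
}
```

Usage: `check "config" hermes config check`, then `check "root gateway" systemctl is-active --quiet hermes-gateway`, and finally `[ "$failures" -eq 0 ]` to get a single pass/fail exit status.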
## Gateway recovery
```bash
systemctl status hermes-gateway --no-pager
journalctl -u hermes-gateway -n 100 --no-pager
hermes gateway restart
# If the CLI restart path is unavailable:
sudo systemctl restart hermes-gateway
# Uma user gateway (1002 is uma's UID on this VM; confirm with `id -u uma`):
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 journalctl --user -u uma-hermes-gateway -n 100 --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user restart uma-hermes-gateway
```
After restart, verify from Telegram:
- inbound message receives a response
- outbound completion messages work
- approval prompts still reach the allowed user
- media/file delivery works for a known safe file if needed
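Before running the Telegram checks above, it can help to confirm that the unit has settled after the restart. A minimal retry sketch; `wait_for` is a hypothetical helper, and the attempt count and delay in the usage note are illustrative values, not tested settings:

```bash
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs the command until it succeeds or the attempts run out.
# Returns 0 on the first success, 1 if every attempt failed.
wait_for() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

Usage: `wait_for 10 3 systemctl is-active --quiet hermes-gateway || journalctl -u hermes-gateway -n 50 --no-pager`. An active unit does not prove Telegram delivery works, so the manual checks still apply.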
## Cron and watchdogs
List jobs:
```bash
hermes cron list
```
Current watchdog script:
```bash
~/.hermes/scripts/hermes_health_watchdog.py
```
Tracked source copy:
```bash
scripts/hermes-health-watchdog.py
```
Behavior:
- no output on success, so the cron job stays silent
- sends a Telegram message only when it detects an actionable failure
- checks gateway service state, Hermes cron backup visibility/status, backup repo freshness when discoverable, and root disk usage

Manual smoke test:
```bash
python3 ~/.hermes/scripts/hermes_health_watchdog.py
# Healthy output should be empty.
```
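For a quick manual cross-check of the root disk usage item, the same idea can be approximated in shell. This is not the watchdog's actual logic: `df_use_percent`, `disk_over_threshold`, and the 90% limit in the usage note are assumptions for illustration.

```bash
# Read `df -P <path>` output on stdin and print the Capacity column of the
# first data row as a bare integer (for example 80 for "80%").
df_use_percent() {
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# disk_over_threshold USED_PERCENT LIMIT_PERCENT
# Succeeds (exit 0) when usage is at or above the limit.
disk_over_threshold() {
  local used="$1" limit="$2"
  [ "$used" -ge "$limit" ]
}
```

Usage: `used=$(df -P / | df_use_percent)` followed by `disk_over_threshold "$used" 90 && echo "root disk at ${used}%"`. Like the watchdog, this prints nothing when usage is below the limit.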
## Backup and restore drill outline
The persistent-data backup repo intentionally excludes raw secrets and `state.db`.

Quarterly restore drill:
1. Run the backup sync manually or wait for a successful cron run.
2. Clone the backup repo into a temporary directory.
3. Inspect git contents for accidental raw secrets:
```bash
git grep -nE '(API_KEY|TOKEN|SECRET|PASSWORD|BEGIN .*PRIVATE KEY)' || true
```
4. Restore into a non-production Hermes profile/test directory only.
5. Verify config, skills, sessions JSON exports, cron definitions, memories, and scripts are present.
6. Confirm `.env`, OAuth files, SQLite WAL/SHM files, logs, caches, and raw `state.db` are absent.
7. Delete the temporary restore directory when done.

2026-05-27 restore rehearsal:
- Restored root backup into `/tmp/hermes-restore-test-root`.
- Verified portable directories/files were present: `config.yaml`, `skills/`, `sessions/`, `cron/`, `memories/`, and scripts.
- Verified raw `state.db` was absent.
- Scanned restored `.env` template and `config.yaml` for common token patterns; no hits.
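The absence check in step 6 can be partly scripted. A minimal sketch, assuming the file names below; `forbidden_files` is a hypothetical helper, and it does not cover OAuth files or caches because their names are not listed in this runbook:

```bash
# forbidden_files DIR
# Print any path under DIR that should never appear in a restored backup:
# an exact `.env`, a raw `state.db`, SQLite `-wal`/`-shm` sidecars, and logs.
forbidden_files() {
  find "$1" \( \
    -name '.env' -o \
    -name 'state.db' -o \
    -name '*-wal' -o \
    -name '*-shm' -o \
    -name '*.log' \
  \) -print
}
```

Usage: `forbidden_files /tmp/hermes-restore-test-root`. Empty output is the expected result. Only a file named exactly `.env` is flagged, so a differently named template file is not reported.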
## Upgrade checklist
Before upgrade:
```bash
hermes --version
hermes status --all
hermes config check
hermes cron list
python3 ~/.hermes/scripts/sync_hermes_persistent_backup.py
```
Upgrade from an interactive/private shell only:
```bash
hermes update
```
After upgrade:
```bash
hermes doctor --fix
hermes gateway restart
hermes --version
hermes status --all
hermes cron list
python3 ~/.hermes/scripts/hermes_health_watchdog.py
```
Then run Telegram smoke tests and record any manual fixups in this doc or the roadmap.

2026-05-27 late upgrade pass:
- Backed up root/Uma configs and service units under `/root/hermes-fix-backups/20260527-roadmap-noncreds/`.
- Fast-forwarded `/usr/local/lib/hermes-agent` to upstream `0b6ace649`.
- Restarted both gateways.
- Verified provider smoke tests with exact responses `root-roadmap-ok` and `uma-roadmap-ok`.
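To make the before/after comparison in this checklist easier to review, command output can be captured into two directories and diffed. A sketch under stated assumptions: `snapshot` and the directory names in the usage note are hypothetical, and nothing here depends on the format of Hermes output.

```bash
# snapshot DIR NAME COMMAND [ARGS...]
# Run the command and store its combined stdout/stderr in DIR/NAME.txt.
# A failing command is still recorded; its exit status is ignored on purpose.
snapshot() {
  local dir="$1" name="$2"
  shift 2
  mkdir -p "$dir"
  "$@" >"$dir/$name.txt" 2>&1 || true
}
```

Usage: run `snapshot /root/upgrade-before version hermes --version` and `snapshot /root/upgrade-before cron hermes cron list` before `hermes update`, repeat with `/root/upgrade-after` afterwards, then compare with `diff -ru /root/upgrade-before /root/upgrade-after`. Review the captured files for sensitive values before keeping them.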
## Provider and tool changes
Use Hermes flows rather than editing secrets into git-tracked files:
```bash
hermes model
hermes setup model
hermes tools list
hermes tools enable <toolset>
hermes tools disable <toolset>
```
Restart/reset requirement:
- gateway config changes: `/restart` from Telegram or `hermes gateway restart`
- CLI session tool changes: start a new session or `/reset`
- provider auth changes: start a new session after switching models/providers