diff --git a/docs/hermes-disaster-recovery.md b/docs/hermes-disaster-recovery.md
new file mode 100644
index 0000000..5136a8f
--- /dev/null
+++ b/docs/hermes-disaster-recovery.md
@@ -0,0 +1,239 @@
+# Hermes Disaster Recovery Runbook
+
+Goal: rebuild the ByteLyst root Hermes and Uma/Bheem Hermes setup on a new VM quickly, with durable memory, sessions, cron definitions, skills, scripts, and dashboard/service configuration restored from GitHub-backed artifacts.
+
+Last verified: 2026-05-27.
+
+## Current Recovery Confidence
+
+**High for durable Hermes state.** Both root and Uma now have sanitized `.hermes` persistent backups pushed to GitHub and recurring systemd backup timers.
+
+What is recoverable:
+
+- root Hermes config, memories, skills, sessions JSON exports, cron definitions, scripts, channel directory, gateway state, SOUL, and Kanban DB
+- Uma Hermes config, memories, skills, sessions JSON exports, cron definitions, scripts, channel directory, gateway state, SOUL, and Kanban DB
+- root and Uma gateway systemd unit definitions
+- root and Uma private dashboard systemd unit definitions
+- root and Uma backup timer systemd unit definitions
+- Uma wrapper/memory/docs repo content
+- root operational docs and rebuild knowledge in this repo
+
+What still requires operator-provided credentials or re-authentication:
+
+- GitHub token or credentials for clone/push if the new VM does not already have them
+- OpenAI Codex OAuth/provider login
+- Telegram bot/user credentials if not restored from an external secret manager
+- Tailscale login for the new machine
+- any optional provider/search/browser API keys
+
+What is intentionally not restored from git:
+
+- raw `.env` secret values
+- Hermes `auth.json`
+- raw `state.db`, SQLite WAL/SHM files, logs, cache directories, sandboxes, locks, and PIDs
+- live OS processes or in-flight terminal commands that were running at the exact moment the VM was lost
+
+Expected data-loss window:
+
+- durable backups run every 10 minutes through systemd timers
+- latest in-memory/live process activity since the last backup may need manual reconstruction from Telegram/GitHub context
+
+## Backup Sources
+
+| Instance | GitHub repo | Backup path | Recurring sync |
+| --- | --- | --- | --- |
+| root/vijay | `https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git` | `hermes_persistent_backup/` | `hermes-root-backup.timer` every 10 minutes |
+| Uma/bheem | `https://github.com/umadev0931/uma_hostinger_hermes_vm.git` | `hermes_persistent_backup/` | `uma-hermes-backup.timer` every 10 minutes |
+| ops docs | `https://github.com/saravanakumardb/learning_ai_devops_tools.git` | `docs/`, `systemd/`, `scripts/` | pushed manually after changes |
+
+Latest verified commits on 2026-05-27:
+
+- root persistent backup: `d286a03`
+- Uma persistent backup: `bbad574`
+- ops docs/systemd templates: update after this runbook commit
+
+## Fast Rebuild Order
+
+### 1. Prepare Base VM
+
+Install the minimum system packages:
+
+```bash
+apt-get update
+apt-get install -y git curl rsync python3 python3-venv nodejs npm systemd
+```
+
+Create Uma if missing:
+
+```bash
+id uma || useradd -m -s /bin/bash uma
+loginctl enable-linger uma
+```
+
+### 2. Restore Git Access
+
+Root is the operator for both root and Uma repo pushes.
+
+Restore GitHub credentials for root without printing them:
+
+```bash
+git config --global credential.helper store
+chmod 700 /root
+# Create /root/.git-credentials from the external secret source.
+chmod 600 /root/.git-credentials
+```
+
+Then clone the three recovery repos:
+
+```bash
+mkdir -p /root/repos /home/uma/repos
+git clone https://github.com/saravanakumardb/learning_ai_devops_tools.git /root/repos/learning_ai_devops_tools
+git clone https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git /root/repos/bytelyst_hostinger_hermes_vm
+git clone https://github.com/umadev0931/uma_hostinger_hermes_vm.git /home/uma/repos/uma_hostinger_hermes_vm
+chown -R uma:uma /home/uma/repos
+```
+
+### 3. Install Hermes Source
+
+Use the official Hermes source and the same shared install path:
+
+```bash
+mkdir -p /usr/local/lib
+git clone https://github.com/NousResearch/hermes-agent.git /usr/local/lib/hermes-agent
+cd /usr/local/lib/hermes-agent
+python3 -m venv venv
+./venv/bin/pip install -e .
+```
+
+If the repo provides a setup/update script in the future, prefer the official upstream instructions, then verify:
+
+```bash
+/usr/local/lib/hermes-agent/venv/bin/hermes --version
+```
+
+### 4. Restore Root Hermes Persistent Data
+
+```bash
+HERMES_HOME=/root/.hermes \
+  /root/repos/bytelyst_hostinger_hermes_vm/restore_hermes_persistent_data.sh \
+  /root/repos/bytelyst_hostinger_hermes_vm/hermes_persistent_backup
+```
+
+Re-enter secrets from the external source into `/root/.hermes/.env` or via Hermes auth flows. Do not copy secrets from docs or chat.
+
+Verify:
+
+```bash
+HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes doctor --fix
+HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
+```
+
+### 5. Restore Uma Hermes Persistent Data
+
+```bash
+mkdir -p /home/uma/.hermes
+HERMES_HOME=/home/uma/.hermes \
+  /home/uma/repos/uma_hostinger_hermes_vm/restore_hermes_persistent_data.sh \
+  /home/uma/repos/uma_hostinger_hermes_vm/hermes_persistent_backup
+chown -R uma:uma /home/uma/.hermes
+```
+
+Re-enter Uma secrets from the external source into `/home/uma/.hermes/.env` or via Hermes auth flows.
+
+Verify:
+
+```bash
+sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes doctor --fix
+sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
+```
+
+### 6. Reinstall Systemd Units
+
+```bash
+cp /root/repos/learning_ai_devops_tools/systemd/hermes-gateway.service /etc/systemd/system/hermes-gateway.service
+cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-dashboard.service /etc/systemd/system/hermes-root-dashboard.service
+cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-dashboard.service /etc/systemd/system/uma-hermes-dashboard.service
+cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-backup.service /etc/systemd/system/hermes-root-backup.service
+cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-backup.timer /etc/systemd/system/hermes-root-backup.timer
+cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-backup.service /etc/systemd/system/uma-hermes-backup.service
+cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-backup.timer /etc/systemd/system/uma-hermes-backup.timer
+```
+
+Install Uma user gateway:
+
+```bash
+mkdir -p /home/uma/.config/systemd/user
+cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-gateway.service /home/uma/.config/systemd/user/uma-hermes-gateway.service
+chown -R uma:uma /home/uma/.config
+```
+
+Enable services:
+
+```bash
+systemctl daemon-reload
+systemctl enable --now hermes-gateway.service
+systemctl enable --now hermes-root-backup.timer uma-hermes-backup.timer
+
+sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user daemon-reload
+sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user enable --now uma-hermes-gateway.service
+```
+
+### 7. Reconnect Tailscale And Dashboards
+
+```bash
+curl -fsSL https://tailscale.com/install.sh | sh
+systemctl enable --now tailscaled
+tailscale up
+tailscale ip -4
+```
+
+Update the dashboard service files if the new Tailscale IP differs from the old `100.87.53.10`, then:
+
+```bash
+systemctl daemon-reload
+systemctl enable --now hermes-root-dashboard.service uma-hermes-dashboard.service
+```
+
+### 8. Final Verification
+
+```bash
+systemctl status hermes-gateway.service --no-pager
+sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user status uma-hermes-gateway.service --no-pager
+systemctl status hermes-root-backup.timer uma-hermes-backup.timer --no-pager
+systemctl list-timers --all --no-pager | grep 'hermes.*backup'
+
+HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
+sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
+
+python3 /root/.hermes/scripts/sync_hermes_persistent_backup.py
+HERMES_HOME=/home/uma/.hermes HERMES_BACKUP_REPO=/home/uma/repos/uma_hostinger_hermes_vm HERMES_BACKUP_REMOTE=https://github.com/umadev0931/uma_hostinger_hermes_vm.git python3 /home/uma/.hermes/scripts/sync_uma_hermes_persistent_backup.py
+```
+
+Telegram smoke tests:
+
+- send root Hermes: `Hi`
+- send Uma/Bheem Hermes: `Hi`
+- verify both reply without model-provider errors
+- verify root and Uma dashboards return HTTP 200 on the current Tailscale IP/ports
+
+## Restore Test Evidence
+
+Root restore test on 2026-05-27:
+
+- restored into `/tmp/hermes-restore-test-root-current`
+- `MANIFEST.json` source: `/root/.hermes`
+- restored file count: `751`
+- restored cron job count: `1`
+- confirmed absent: `state.db`, `auth.json`, `logs/`
+
+Uma restore test on 2026-05-27:
+
+- restored into `/tmp/hermes-restore-test-uma`
+- `MANIFEST.json` source: `/home/uma/.hermes`
+- restored file count: `600`
+- restored cron job count: `2`
+- confirmed absent: `state.db`, `auth.json`, `logs/`
+
+## Hard Rule During Recovery
+
+Do not expose Hermes dashboard/API publicly during rebuild. Use only local shell, SSH tunnel, or Tailscale/private network unless S explicitly approves the hostname, authentication gate, and access path.
diff --git a/docs/hermes-operations.md b/docs/hermes-operations.md
index fe86659..090803a 100644
--- a/docs/hermes-operations.md
+++ b/docs/hermes-operations.md
@@ -15,6 +15,7 @@ Observed on 2026-05-27:
 - Uma Telegram gateway: `uma-hermes-gateway.service`, user service for `uma`, enabled and running
 - Root and Uma default model: `gpt-5.5`, `model.routing.enabled: false`
 - Backup cron: `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, local delivery
+- Systemd persistent backup timers: `hermes-root-backup.timer` and `uma-hermes-backup.timer`, every 10 minutes
 - Watchdog cron: `ByteLyst Hermes gateway/backup/disk watchdog`, every 15 minutes, Telegram delivery on failure only
 - Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
 - Tailscale: installed and `tailscaled` enabled/running; authenticated as tailnet IP `100.87.53.10`
@@ -66,8 +67,14 @@ ss -ltnp | grep -E ':(9119|9120)'
 Tracked service unit templates:

 ```bash
+systemd/hermes-gateway.service
+systemd/uma-hermes-gateway.service
 systemd/hermes-root-dashboard.service
 systemd/uma-hermes-dashboard.service
+systemd/hermes-root-backup.service
+systemd/hermes-root-backup.timer
+systemd/uma-hermes-backup.service
+systemd/uma-hermes-backup.timer
 ```

 ## Health baseline commands
@@ -146,10 +153,19 @@ python3 ~/.hermes/scripts/hermes_health_watchdog.py
 # Healthy output should be empty.
 ```

+Persistent backup timers:
+
+```bash
+systemctl status hermes-root-backup.timer uma-hermes-backup.timer --no-pager
+systemctl list-timers --all --no-pager | grep 'hermes.*backup'
+```
+
 ## Backup and restore drill outline

 The persistent-data backup repo intentionally excludes raw secrets and `state.db`.

+For full VM rebuild steps, use `docs/hermes-disaster-recovery.md`.
+
 Quarterly restore drill:

 1. Run the backup sync manually or wait for a successful cron run.
diff --git a/systemd/hermes-gateway.service b/systemd/hermes-gateway.service
new file mode 100644
index 0000000..f9a5257
--- /dev/null
+++ b/systemd/hermes-gateway.service
@@ -0,0 +1,34 @@
+[Unit]
+Description=Hermes Agent Gateway - Messaging Platform Integration
+After=network-online.target
+Wants=network-online.target
+StartLimitIntervalSec=0
+
+[Service]
+Type=simple
+User=root
+Group=root
+ExecStart=/usr/local/lib/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace
+WorkingDirectory=/usr/local/lib/hermes-agent
+Environment="HOME=/root"
+Environment="USER=root"
+Environment="LOGNAME=root"
+Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/usr/bin:/root/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
+Environment="VIRTUAL_ENV=/usr/local/lib/hermes-agent/venv"
+Environment="HERMES_HOME=/root/.hermes"
+Environment="HERMES_MODEL=gpt-5.5"
+Environment="HERMES_INFERENCE_MODEL=gpt-5.5"
+Restart=always
+RestartSec=5
+RestartMaxDelaySec=300
+RestartSteps=5
+RestartForceExitStatus=75
+KillMode=mixed
+KillSignal=SIGTERM
+ExecReload=/bin/kill -USR1 $MAINPID
+TimeoutStopSec=210
+StandardOutput=journal
+StandardError=journal
+
+[Install]
+WantedBy=multi-user.target
diff --git a/systemd/hermes-root-backup.service b/systemd/hermes-root-backup.service
new file mode 100644
index 0000000..4c2b182
--- /dev/null
+++ b/systemd/hermes-root-backup.service
@@ -0,0 +1,13 @@
+[Unit]
+Description=Sync root Hermes persistent backup to GitHub
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=oneshot
+User=root
+Group=root
+Environment="HERMES_HOME=/root/.hermes"
+Environment="HERMES_BACKUP_REPO=/root/repos/bytelyst_hostinger_hermes_vm"
+Environment="HERMES_BACKUP_REMOTE=https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git"
+ExecStart=/usr/bin/python3 /root/.hermes/scripts/sync_hermes_persistent_backup.py
diff --git a/systemd/hermes-root-backup.timer b/systemd/hermes-root-backup.timer
new file mode 100644
index 0000000..6904bc7
--- /dev/null
+++ b/systemd/hermes-root-backup.timer
@@ -0,0 +1,12 @@
+[Unit]
+Description=Run root Hermes persistent backup sync every 10 minutes
+
+[Timer]
+OnBootSec=5min
+OnUnitActiveSec=10min
+AccuracySec=1min
+Persistent=true
+Unit=hermes-root-backup.service
+
+[Install]
+WantedBy=timers.target
diff --git a/systemd/uma-hermes-backup.service b/systemd/uma-hermes-backup.service
new file mode 100644
index 0000000..becbdf9
--- /dev/null
+++ b/systemd/uma-hermes-backup.service
@@ -0,0 +1,13 @@
+[Unit]
+Description=Sync Uma Hermes persistent backup to GitHub
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=oneshot
+User=root
+Group=root
+Environment="HERMES_HOME=/home/uma/.hermes"
+Environment="HERMES_BACKUP_REPO=/home/uma/repos/uma_hostinger_hermes_vm"
+Environment="HERMES_BACKUP_REMOTE=https://github.com/umadev0931/uma_hostinger_hermes_vm.git"
+ExecStart=/usr/bin/python3 /home/uma/.hermes/scripts/sync_uma_hermes_persistent_backup.py
diff --git a/systemd/uma-hermes-backup.timer b/systemd/uma-hermes-backup.timer
new file mode 100644
index 0000000..695268f
--- /dev/null
+++ b/systemd/uma-hermes-backup.timer
@@ -0,0 +1,12 @@
+[Unit]
+Description=Run Uma Hermes persistent backup sync every 10 minutes
+
+[Timer]
+OnBootSec=5min
+OnUnitActiveSec=10min
+AccuracySec=1min
+Persistent=true
+Unit=uma-hermes-backup.service
+
+[Install]
+WantedBy=timers.target
diff --git a/systemd/uma-hermes-gateway.service b/systemd/uma-hermes-gateway.service
new file mode 100644
index 0000000..5a1245e
--- /dev/null
+++ b/systemd/uma-hermes-gateway.service
@@ -0,0 +1,29 @@
+[Unit]
+Description=Uma Hermes Gateway - Telegram Integration
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+WorkingDirectory=/usr/local/lib/hermes-agent
+Environment="HOME=/home/uma"
+Environment="USER=uma"
+Environment="LOGNAME=uma"
+Environment="HERMES_HOME=/home/uma/.hermes"
+Environment="HERMES_MODEL=gpt-5.5"
+Environment="HERMES_INFERENCE_MODEL=gpt-5.5"
+Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/home/uma/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
+Environment="VIRTUAL_ENV=/usr/local/lib/hermes-agent/venv"
+ExecStart=/usr/local/lib/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace
+Restart=always
+RestartSec=5
+RestartMaxDelaySec=300
+RestartSteps=5
+RestartForceExitStatus=75
+KillMode=mixed
+KillSignal=SIGTERM
+ExecReload=/bin/kill -USR1 $MAINPID
+TimeoutStopSec=210
+
+[Install]
+WantedBy=default.target
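
Note on the runbook's restore test evidence: the "confirmed absent: `state.db`, `auth.json`, `logs/`" check could be scripted so a restore drill fails loudly if an excluded artifact reappears. The sketch below is one possible way to do that. The function name `check_restore_excludes` is hypothetical and is not part of this patch or of the Hermes repos, and it inspects only the top level of the restore directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper (assumption: not shipped with Hermes or this patch).
# Returns 0 only if none of the intentionally excluded artifacts
# (state.db, auth.json, logs/) exist at the top level of the restore directory.
check_restore_excludes() {
  local dir="$1" found=0 name
  for name in state.db auth.json logs; do
    if [ -e "$dir/$name" ]; then
      # Report each offending path on stderr and remember the failure.
      echo "unexpected artifact present: $dir/$name" >&2
      found=1
    fi
  done
  return "$found"
}
```

For example, after the root restore test one might run `check_restore_excludes /tmp/hermes-restore-test-root-current && echo "exclusions OK"`. A non-zero exit status means at least one excluded file or directory was found.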