Add Hermes disaster recovery runbook
parent ccd6ee4f7f
commit 19fdba752c
239
docs/hermes-disaster-recovery.md
Normal file
@ -0,0 +1,239 @@
# Hermes Disaster Recovery Runbook

Goal: rebuild the ByteLyst root Hermes and Uma/Bheem Hermes setup on a new VM quickly, with durable memory, sessions, cron definitions, skills, scripts, and dashboard/service configuration restored from GitHub-backed artifacts.

Last verified: 2026-05-27.

## Current Recovery Confidence

**High for durable Hermes state.** Both root and Uma now have sanitized `.hermes` persistent backups pushed to GitHub and recurring systemd backup timers.

What is recoverable:

- root Hermes config, memories, skills, sessions JSON exports, cron definitions, scripts, channel directory, gateway state, SOUL, and Kanban DB
- Uma Hermes config, memories, skills, sessions JSON exports, cron definitions, scripts, channel directory, gateway state, SOUL, and Kanban DB
- root and Uma gateway systemd unit definitions
- root and Uma private dashboard systemd unit definitions
- root and Uma backup timer systemd unit definitions
- Uma wrapper/memory/docs repo content
- root operational docs and rebuild knowledge in this repo

What still requires operator-provided credentials or re-authentication:

- GitHub token or credentials for clone/push if the new VM does not already have them
- OpenAI Codex OAuth/provider login
- Telegram bot/user credentials if not restored from an external secret manager
- Tailscale login for the new machine
- any optional provider/search/browser API keys

What is intentionally not restored from git:

- raw `.env` secret values
- Hermes `auth.json`
- raw `state.db`, SQLite WAL/SHM files, logs, cache directories, sandboxes, locks, and PIDs
- live OS processes or in-flight terminal commands that were running at the exact moment the VM was lost

Expected data-loss window:

- durable backups run every 10 minutes through systemd timers, so in the normal case at most about 10 minutes of durable state is lost
- in-memory or live process activity since the last backup is not captured and may need manual reconstruction from Telegram/GitHub context
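
The data-loss window above can be checked against the newest commit in a backup repo. The following is a minimal sketch, not part of the tracked scripts: the function names and the `MAX_AGE_SECONDS` threshold (two missed 10-minute runs) are assumptions, and it presumes the sync script commits on every run. If the script only commits when something changed, an old commit does not necessarily mean a failed backup.

```bash
# Assumed threshold: 20 minutes, i.e. two missed 10-minute timer runs.
MAX_AGE_SECONDS="${MAX_AGE_SECONDS:-1200}"

# backup_is_fresh LAST_COMMIT_EPOCH NOW_EPOCH
# Exit 0 when the commit is no older than MAX_AGE_SECONDS.
backup_is_fresh() {
  [ $(( $2 - $1 )) -le "$MAX_AGE_SECONDS" ]
}

# last_commit_epoch REPO_DIR
# Print the committer timestamp (Unix epoch) of HEAD in REPO_DIR.
last_commit_epoch() {
  git -C "$1" log -1 --format=%ct
}

# Example call (not executed here):
#   backup_is_fresh "$(last_commit_epoch /root/repos/bytelyst_hostinger_hermes_vm)" "$(date +%s)" \
#     || echo "root backup looks stale"
```

Keeping the age comparison separate from the `git` call makes the threshold logic easy to exercise without a repository.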

## Backup Sources

| Instance | GitHub repo | Backup path | Recurring sync |
| --- | --- | --- | --- |
| root/vijay | `https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git` | `hermes_persistent_backup/` | `hermes-root-backup.timer` every 10 minutes |
| Uma/bheem | `https://github.com/umadev0931/uma_hostinger_hermes_vm.git` | `hermes_persistent_backup/` | `uma-hermes-backup.timer` every 10 minutes |
| ops docs | `https://github.com/saravanakumardb/learning_ai_devops_tools.git` | `docs/`, `systemd/`, `scripts/` | pushed manually after changes |

Latest verified commits on 2026-05-27:

- root persistent backup: `d286a03`
- Uma persistent backup: `bbad574`
- ops docs/systemd templates: not yet recorded; update this entry after this runbook commit

## Fast Rebuild Order

### 1. Prepare Base VM

Install the minimum system packages:

```bash
apt-get update
apt-get install -y git curl rsync python3 python3-venv nodejs npm systemd
```

Create Uma if missing:

```bash
id uma || useradd -m -s /bin/bash uma
loginctl enable-linger uma
```

### 2. Restore Git Access

Root is the operator for both root and Uma repo pushes.

Restore GitHub credentials for root without printing them:

```bash
git config --global credential.helper store
chmod 700 /root
# Create /root/.git-credentials from the external secret source.
chmod 600 /root/.git-credentials
```
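
One way to create the credentials file without echoing the token is sketched below. The helper name `write_git_credentials` is hypothetical and the runbook does not say which secret tool is in use, so the token is simply read from stdin. The line format is the one `git-credential-store` reads (`https://USER:TOKEN@HOST`); the sketch assumes the token contains no characters that need URL encoding.

```bash
# write_git_credentials FILE USER
# Reads the token from stdin so it never appears in argv or shell history.
write_git_credentials() {
  cred_file="$1"
  cred_user="$2"
  IFS= read -r cred_token || true
  if [ -z "$cred_token" ]; then
    echo "empty token on stdin" >&2
    return 1
  fi
  (
    # Restrictive umask so a newly created file is never group/world readable.
    umask 077
    printf 'https://%s:%s@github.com\n' "$cred_user" "$cred_token" > "$cred_file"
  ) || return 1
  # Also tighten the mode in case the file already existed.
  chmod 600 "$cred_file"
  unset cred_token
}

# Example call (not executed here; SECRET_TOOL is a placeholder for the external secret source):
#   SECRET_TOOL get github-token | write_git_credentials /root/.git-credentials "$GITHUB_USER"
```

Reading from stdin rather than an argument keeps the token out of `ps` output while the command runs.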

Then clone the three recovery repos:

```bash
mkdir -p /root/repos /home/uma/repos
git clone https://github.com/saravanakumardb/learning_ai_devops_tools.git /root/repos/learning_ai_devops_tools
git clone https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git /root/repos/bytelyst_hostinger_hermes_vm
git clone https://github.com/umadev0931/uma_hostinger_hermes_vm.git /home/uma/repos/uma_hostinger_hermes_vm
chown -R uma:uma /home/uma/repos
```

### 3. Install Hermes Source

Use the official Hermes source and the same shared install path:

```bash
mkdir -p /usr/local/lib
git clone https://github.com/NousResearch/hermes-agent.git /usr/local/lib/hermes-agent
cd /usr/local/lib/hermes-agent
python3 -m venv venv
./venv/bin/pip install -e .
```

If the upstream repo later provides a setup or update script, prefer the official upstream instructions. Either way, verify the install:

```bash
/usr/local/lib/hermes-agent/venv/bin/hermes --version
```

### 4. Restore Root Hermes Persistent Data

```bash
HERMES_HOME=/root/.hermes \
/root/repos/bytelyst_hostinger_hermes_vm/restore_hermes_persistent_data.sh \
/root/repos/bytelyst_hostinger_hermes_vm/hermes_persistent_backup
```

Re-enter secrets from the external source into `/root/.hermes/.env` or via Hermes auth flows. Do not copy secrets from docs or chat.

Verify:

```bash
HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes doctor --fix
HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
```

### 5. Restore Uma Hermes Persistent Data

```bash
mkdir -p /home/uma/.hermes
# The restore runs as root; ownership is handed to uma afterwards.
HERMES_HOME=/home/uma/.hermes \
/home/uma/repos/uma_hostinger_hermes_vm/restore_hermes_persistent_data.sh \
/home/uma/repos/uma_hostinger_hermes_vm/hermes_persistent_backup
chown -R uma:uma /home/uma/.hermes
```

Re-enter Uma secrets from the external source into `/home/uma/.hermes/.env` or via Hermes auth flows.

Verify:

```bash
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes doctor --fix
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
```

### 6. Reinstall Systemd Units

```bash
cp /root/repos/learning_ai_devops_tools/systemd/hermes-gateway.service /etc/systemd/system/hermes-gateway.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-dashboard.service /etc/systemd/system/hermes-root-dashboard.service
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-dashboard.service /etc/systemd/system/uma-hermes-dashboard.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-backup.service /etc/systemd/system/hermes-root-backup.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-backup.timer /etc/systemd/system/hermes-root-backup.timer
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-backup.service /etc/systemd/system/uma-hermes-backup.service
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-backup.timer /etc/systemd/system/uma-hermes-backup.timer
```
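
The seven copies above could also be done in a loop that refuses to copy anything when a template is missing from the checkout. This is a sketch under that assumption; `install_units` is a hypothetical helper, not a script in the repo.

```bash
# install_units SRC_DIR DEST_DIR UNIT...
# Copy each named unit file from SRC_DIR to DEST_DIR with mode 0644.
install_units() {
  src_dir="$1"
  dest_dir="$2"
  shift 2
  # First pass: stop before copying anything if a template is missing.
  for unit in "$@"; do
    if [ ! -f "$src_dir/$unit" ]; then
      echo "missing template: $src_dir/$unit" >&2
      return 1
    fi
  done
  # Second pass: install the files.
  for unit in "$@"; do
    install -m 0644 "$src_dir/$unit" "$dest_dir/$unit" || return 1
  done
}

# Example call (not executed here):
#   install_units /root/repos/learning_ai_devops_tools/systemd /etc/systemd/system \
#     hermes-gateway.service hermes-root-dashboard.service uma-hermes-dashboard.service \
#     hermes-root-backup.service hermes-root-backup.timer \
#     uma-hermes-backup.service uma-hermes-backup.timer
```

Checking all templates first avoids a half-installed set of units if the ops repo clone is incomplete.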

Install Uma user gateway:

```bash
mkdir -p /home/uma/.config/systemd/user
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-gateway.service /home/uma/.config/systemd/user/uma-hermes-gateway.service
chown -R uma:uma /home/uma/.config
```

Enable services:

```bash
systemctl daemon-reload
systemctl enable --now hermes-gateway.service
systemctl enable --now hermes-root-backup.timer uma-hermes-backup.timer

# Uma's gateway is a user service; XDG_RUNTIME_DIR points systemctl at uma's user manager (needs the linger from step 1).
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user daemon-reload
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user enable --now uma-hermes-gateway.service
```

### 7. Reconnect Tailscale And Dashboards

```bash
curl -fsSL https://tailscale.com/install.sh | sh
systemctl enable --now tailscaled
tailscale up
tailscale ip -4
```

If the new Tailscale IP differs from the old `100.87.53.10`, update it in both dashboard service files, then reload and start the dashboards:

```bash
systemctl daemon-reload
systemctl enable --now hermes-root-dashboard.service uma-hermes-dashboard.service
```
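
The IP update mentioned above can be scripted. The sketch below is an illustration only: `replace_ip` is a hypothetical helper, and because the dashboard unit files are not shown in this commit, it assumes the old address appears in them as a literal string. The match has no boundary check, so an old IP that is a prefix of another address in the file (for example `100.87.53.101`) would also be rewritten; review the result before reloading.

```bash
# replace_ip OLD_IP NEW_IP FILE
# Replace every literal occurrence of OLD_IP in FILE with NEW_IP.
replace_ip() {
  old_ip="$1"
  new_ip="$2"
  target="$3"
  # Escape dots so "." matches a literal dot rather than any character.
  old_re="$(printf '%s' "$old_ip" | sed 's/\./\\./g')"
  tmp_file="$(mktemp)" || return 1
  if ! sed "s/$old_re/$new_ip/g" "$target" > "$tmp_file"; then
    rm -f "$tmp_file"
    return 1
  fi
  # Write back through the existing file so its owner and mode are kept.
  cat "$tmp_file" > "$target"
  rm -f "$tmp_file"
}

# Example call (not executed here):
#   replace_ip 100.87.53.10 "$(tailscale ip -4)" /etc/systemd/system/hermes-root-dashboard.service
#   replace_ip 100.87.53.10 "$(tailscale ip -4)" /etc/systemd/system/uma-hermes-dashboard.service
```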

### 8. Final Verification

```bash
systemctl status hermes-gateway.service --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user status uma-hermes-gateway.service --no-pager
systemctl status hermes-root-backup.timer uma-hermes-backup.timer --no-pager
systemctl list-timers --all --no-pager | grep 'hermes.*backup'

HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list

python3 /root/.hermes/scripts/sync_hermes_persistent_backup.py
HERMES_HOME=/home/uma/.hermes HERMES_BACKUP_REPO=/home/uma/repos/uma_hostinger_hermes_vm HERMES_BACKUP_REMOTE=https://github.com/umadev0931/uma_hostinger_hermes_vm.git python3 /home/uma/.hermes/scripts/sync_uma_hermes_persistent_backup.py
```

Telegram smoke tests:

- send root Hermes: `Hi`
- send Uma/Bheem Hermes: `Hi`
- verify both reply without model-provider errors
- verify root and Uma dashboards return HTTP 200 on the current Tailscale IP/ports
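
The dashboard HTTP 200 check in the last item can be done with `curl`. This is a minimal sketch: `dashboard_ok` is a hypothetical helper, and the URL path `/` is an assumption. Ports `9119` and `9120` appear in the baseline doc's `ss` check, but which port belongs to which instance is not stated here, so confirm that before relying on the example.

```bash
# dashboard_ok URL
# Exit 0 only when the endpoint answers with HTTP 200.
dashboard_ok() {
  # -w '%{http_code}' prints just the status code; the body is discarded.
  status="$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1")" || status="000"
  [ "$status" = "200" ]
}

# Example calls (not executed here):
#   ip="$(tailscale ip -4)"
#   dashboard_ok "http://$ip:9119/" || echo "dashboard on 9119 is not healthy"
#   dashboard_ok "http://$ip:9120/" || echo "dashboard on 9120 is not healthy"
```

Run it from a machine on the tailnet, in keeping with the rule that the dashboards stay private.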

## Restore Test Evidence

Root restore test on 2026-05-27:

- restored into `/tmp/hermes-restore-test-root-current`
- `MANIFEST.json` source: `/root/.hermes`
- restored file count: `751`
- restored cron job count: `1`
- confirmed absent: `state.db`, `auth.json`, `logs/`

Uma restore test on 2026-05-27:

- restored into `/tmp/hermes-restore-test-uma`
- `MANIFEST.json` source: `/home/uma/.hermes`
- restored file count: `600`
- restored cron job count: `2`
- confirmed absent: `state.db`, `auth.json`, `logs/`
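
The checks recorded above can be repeated on any scratch restore with a short script. The sketch below is hypothetical (`check_restore_dir` is not a tracked script) and assumes `MANIFEST.json` sits at the top level of the restored directory. The file count it prints includes `MANIFEST.json` itself, which may or may not match how the counts above were taken.

```bash
# check_restore_dir DIR
# Fail if MANIFEST.json is missing or an excluded artifact is present;
# otherwise print the number of restored regular files.
check_restore_dir() {
  restore_dir="$1"
  if [ ! -f "$restore_dir/MANIFEST.json" ]; then
    echo "missing MANIFEST.json in $restore_dir" >&2
    return 1
  fi
  # These must never come back from the sanitized backup.
  for excluded in state.db auth.json logs; do
    if [ -e "$restore_dir/$excluded" ]; then
      echo "unexpected $excluded in $restore_dir" >&2
      return 1
    fi
  done
  find "$restore_dir" -type f | wc -l | tr -d ' '
}

# Example call (not executed here):
#   check_restore_dir /tmp/hermes-restore-test-root-current
```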

## Hard Rule During Recovery

Do not expose Hermes dashboard/API publicly during rebuild. Use only local shell, SSH tunnel, or Tailscale/private network unless S explicitly approves the hostname, authentication gate, and access path.
@ -15,6 +15,7 @@ Observed on 2026-05-27:
- Uma Telegram gateway: `uma-hermes-gateway.service`, user service for `uma`, enabled and running
- Root and Uma default model: `gpt-5.5`, `model.routing.enabled: false`
- Backup cron: `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, local delivery
- Systemd persistent backup timers: `hermes-root-backup.timer` and `uma-hermes-backup.timer`, every 10 minutes
- Watchdog cron: `ByteLyst Hermes gateway/backup/disk watchdog`, every 15 minutes, Telegram delivery on failure only
- Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
- Tailscale: installed and `tailscaled` enabled/running; authenticated as tailnet IP `100.87.53.10`
@ -66,8 +67,14 @@ ss -ltnp | grep -E ':(9119|9120)'
Tracked service unit templates:

```bash
systemd/hermes-gateway.service
systemd/uma-hermes-gateway.service
systemd/hermes-root-dashboard.service
systemd/uma-hermes-dashboard.service
systemd/hermes-root-backup.service
systemd/hermes-root-backup.timer
systemd/uma-hermes-backup.service
systemd/uma-hermes-backup.timer
```

## Health baseline commands
@ -146,10 +153,19 @@ python3 ~/.hermes/scripts/hermes_health_watchdog.py
# Healthy output should be empty.
```

Persistent backup timers:

```bash
systemctl status hermes-root-backup.timer uma-hermes-backup.timer --no-pager
systemctl list-timers --all --no-pager | grep 'hermes.*backup'
```

## Backup and restore drill outline

The persistent-data backup repo intentionally excludes raw secrets and `state.db`.

For full VM rebuild steps, use `docs/hermes-disaster-recovery.md`.

Quarterly restore drill:

1. Run the backup sync manually or wait for a successful cron run.
34
systemd/hermes-gateway.service
Normal file
@ -0,0 +1,34 @@
[Unit]
Description=Hermes Agent Gateway - Messaging Platform Integration
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=0

[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/local/lib/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace
WorkingDirectory=/usr/local/lib/hermes-agent
Environment="HOME=/root"
Environment="USER=root"
Environment="LOGNAME=root"
Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/usr/bin:/root/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="VIRTUAL_ENV=/usr/local/lib/hermes-agent/venv"
Environment="HERMES_HOME=/root/.hermes"
Environment="HERMES_MODEL=gpt-5.5"
Environment="HERMES_INFERENCE_MODEL=gpt-5.5"
Restart=always
RestartSec=5
RestartMaxDelaySec=300
RestartSteps=5
RestartForceExitStatus=75
KillMode=mixed
KillSignal=SIGTERM
ExecReload=/bin/kill -USR1 $MAINPID
TimeoutStopSec=210
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
13
systemd/hermes-root-backup.service
Normal file
@ -0,0 +1,13 @@
[Unit]
Description=Sync root Hermes persistent backup to GitHub
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
User=root
Group=root
Environment="HERMES_HOME=/root/.hermes"
Environment="HERMES_BACKUP_REPO=/root/repos/bytelyst_hostinger_hermes_vm"
Environment="HERMES_BACKUP_REMOTE=https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git"
ExecStart=/usr/bin/python3 /root/.hermes/scripts/sync_hermes_persistent_backup.py
12
systemd/hermes-root-backup.timer
Normal file
@ -0,0 +1,12 @@
[Unit]
Description=Run root Hermes persistent backup sync every 10 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=10min
AccuracySec=1min
Persistent=true
Unit=hermes-root-backup.service

[Install]
WantedBy=timers.target
13
systemd/uma-hermes-backup.service
Normal file
@ -0,0 +1,13 @@
[Unit]
Description=Sync Uma Hermes persistent backup to GitHub
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
User=root
Group=root
Environment="HERMES_HOME=/home/uma/.hermes"
Environment="HERMES_BACKUP_REPO=/home/uma/repos/uma_hostinger_hermes_vm"
Environment="HERMES_BACKUP_REMOTE=https://github.com/umadev0931/uma_hostinger_hermes_vm.git"
ExecStart=/usr/bin/python3 /home/uma/.hermes/scripts/sync_uma_hermes_persistent_backup.py
12
systemd/uma-hermes-backup.timer
Normal file
@ -0,0 +1,12 @@
[Unit]
Description=Run Uma Hermes persistent backup sync every 10 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=10min
AccuracySec=1min
Persistent=true
Unit=uma-hermes-backup.service

[Install]
WantedBy=timers.target
29
systemd/uma-hermes-gateway.service
Normal file
@ -0,0 +1,29 @@
[Unit]
Description=Uma Hermes Gateway - Telegram Integration
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
WorkingDirectory=/usr/local/lib/hermes-agent
Environment="HOME=/home/uma"
Environment="USER=uma"
Environment="LOGNAME=uma"
Environment="HERMES_HOME=/home/uma/.hermes"
Environment="HERMES_MODEL=gpt-5.5"
Environment="HERMES_INFERENCE_MODEL=gpt-5.5"
Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/home/uma/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="VIRTUAL_ENV=/usr/local/lib/hermes-agent/venv"
ExecStart=/usr/local/lib/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace
Restart=always
RestartSec=5
RestartMaxDelaySec=300
RestartSteps=5
RestartForceExitStatus=75
KillMode=mixed
KillSignal=SIGTERM
ExecReload=/bin/kill -USR1 $MAINPID
TimeoutStopSec=210

[Install]
WantedBy=default.target