Hermes Disaster Recovery Runbook
Goal: rebuild the ByteLyst root Hermes and Uma/Bheem Hermes setup on a new VM quickly, with durable memory, sessions, cron definitions, skills, scripts, and dashboard/service configuration restored from GitHub-backed artifacts.
Last verified: 2026-05-27.
Current Recovery Confidence
High for durable Hermes state. Both root and Uma now have sanitized .hermes persistent backups pushed to GitHub and recurring systemd backup timers.
What is recoverable:
- root Hermes config, memories, skills, sessions JSON exports, cron definitions, scripts, channel directory, gateway state, SOUL, and Kanban DB
- Uma Hermes config, memories, skills, sessions JSON exports, cron definitions, scripts, channel directory, gateway state, SOUL, and Kanban DB
- root and Uma gateway systemd unit definitions
- root and Uma private dashboard systemd unit definitions
- root and Uma backup timer systemd unit definitions
- Uma wrapper/memory/docs repo content
- root operational docs and rebuild knowledge in this repo
What still requires operator-provided credentials or re-authentication:
- GitHub token or credentials for clone/push if the new VM does not already have them
- OpenAI Codex OAuth/provider login, unless restored from an encrypted emergency bundle
- Telegram bot/user credentials, unless restored from an encrypted emergency bundle
- Tailscale login for the new machine, unless restoring Tailscale state is explicitly chosen
- any optional provider/search/browser API keys
What is intentionally not restored from git:
- raw .env secret values
- Hermes auth.json
- raw state.db, SQLite WAL/SHM files, logs, cache directories, sandboxes, locks, and PIDs
- live OS processes or in-flight terminal commands that were running at the exact moment the VM was lost
Expected data-loss window:
- durable backups run every 10 minutes through systemd timers
- latest in-memory/live process activity since the last backup may need manual reconstruction from Telegram/GitHub context
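Before starting a rebuild, the real loss window can be estimated by comparing the newest backup commit time with the current time. The following is a minimal sketch, not part of the repos: `backup_is_fresh` is a hypothetical helper name, and the 900-second threshold in the usage comment is an illustrative allowance above the 10-minute timer interval.

```shell
# Hypothetical helper (not in the repo): succeed when the newest backup
# commit is no older than max_age seconds.
# Args: commit epoch, current epoch, max age in seconds (default 600).
backup_is_fresh() {
  commit_epoch="$1"
  now_epoch="$2"
  max_age="${3:-600}"
  [ $(( now_epoch - commit_epoch )) -le "$max_age" ]
}

# Example use against a cloned backup repo (path taken from this runbook):
#   last=$(git -C /root/repos/bytelyst_hostinger_hermes_vm log -1 --format=%ct)
#   backup_is_fresh "$last" "$(date +%s)" 900 || echo "backup is stale"
```

If the check fails, expect to reconstruct more than one timer interval of activity from Telegram/GitHub context.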
Backup Sources
| Instance | GitHub repo | Backup path | Recurring sync |
|---|---|---|---|
| root/vijay | https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git | hermes_persistent_backup/ | hermes-root-backup.timer every 10 minutes |
| Uma/bheem | https://github.com/umadev0931/uma_hostinger_hermes_vm.git | hermes_persistent_backup/ | uma-hermes-backup.timer every 10 minutes |
| ops docs | https://github.com/saravanakumardb/learning_ai_devops_tools.git | docs/, systemd/, scripts/ | pushed manually after changes |
Encrypted Emergency Bundle
Normal GitHub backups are sanitized and intentionally exclude raw secrets, auth state, and raw state.db. For faster break-glass recovery, create a separate encrypted bundle and store the encrypted .gpg file in Google Drive or another private location.
Create bundle on the old/current VM:
/root/repos/learning_ai_devops_tools/scripts/hermes-emergency-bundle-create.sh
The script creates:
/root/hermes-emergency-bundles/hermes-emergency-bundle-<host>-<timestamp>.tar.zst.gpg
It includes an allow-list only:
- /root/.hermes/.env, auth.json, state.db*
- /home/uma/.hermes/.env, auth.json, state.db*
- /root/.git-credentials
- /root/.gitea_admin_password, /root/.gitea_npm_token, /root/.gitea_npm_token_home
- /var/lib/tailscale/tailscaled.state
It does not include logs, caches, locks, PIDs, or sandboxes.
Decrypt on a new VM into staging only:
/root/repos/learning_ai_devops_tools/scripts/hermes-emergency-bundle-decrypt.sh \
/path/to/hermes-emergency-bundle.tar.zst.gpg
The decrypt script extracts to /root/hermes-emergency-restore-staging/... by default. It does not overwrite live .hermes or credential files. Inspect the staging directory first, then manually copy only the files needed for the recovery.
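The manual copy step can be made a little safer with a small wrapper. This is a sketch under assumptions: `restore_secret_file` is a hypothetical helper, and the sub-path under the staging directory in the usage comment is a guess at the layout, which the decrypt script determines.

```shell
# Hypothetical helper: copy one inspected file out of staging with tight
# permissions, refusing to overwrite an existing destination.
restore_secret_file() {
  src="$1"
  dst="$2"
  if [ ! -f "$src" ]; then
    echo "missing in staging: $src" >&2
    return 1
  fi
  if [ -e "$dst" ]; then
    echo "refusing to overwrite: $dst" >&2
    return 1
  fi
  # install(1) copies the file and sets mode 600 in one step.
  install -m 600 "$src" "$dst"
}

# Example (STAGING is the directory the decrypt script extracted into;
# the sub-path is an assumption about the staging layout):
#   restore_secret_file "$STAGING/root/.hermes/.env" /root/.hermes/.env
```

Refusing to overwrite keeps the helper consistent with the decrypt script's behavior of never touching live files; remove or move the existing destination deliberately if a replacement is intended.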
For unattended operation, both scripts support:
export BUNDLE_PASSPHRASE_FILE=/root/path/to/passphrase-file
Keep the passphrase outside GitHub and outside the encrypted bundle.
Automated Google Drive upload for personal Drive uses OAuth user credentials, not the service account.
Why: service accounts can read metadata for folders shared from personal Drive, but personal Drive uploads fail because service accounts do not have personal Drive storage quota. Use the service account path only for Shared Drives or Workspace delegation.
Personal Drive OAuth setup:
- In Google Cloud Console, create an OAuth client of type Desktop app in the hermes-emergency-backups project.
- Save the downloaded JSON as: /root/.config/hermes-google-drive/oauth-client.json
- Run:
/root/.local/share/hermes-drive-uploader-venv/bin/python \
/root/repos/learning_ai_devops_tools/scripts/hermes-google-drive-oauth-login.py
- Open the printed URL, approve access, and paste the code back in the terminal.
- Confirm /root/.config/hermes-google-drive/user-token.json exists with mode 600.
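The final confirmation step can be scripted. A minimal sketch, assuming GNU coreutils `stat` (as on a typical Debian/Ubuntu VM); `has_mode_600` is a hypothetical helper name.

```shell
# Hypothetical check: succeed only if the path is a regular file whose
# permission bits are exactly 600. Uses GNU stat's -c format option.
has_mode_600() {
  [ -f "$1" ] && [ "$(stat -c '%a' "$1")" = "600" ]
}

# Example:
#   has_mode_600 /root/.config/hermes-google-drive/user-token.json \
#     || echo "token file missing or too permissive"
```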
Automated Google Drive upload is configured to use:
- OAuth client: /root/.config/hermes-google-drive/oauth-client.json
- OAuth token: /root/.config/hermes-google-drive/user-token.json
- passphrase file: /root/.config/hermes-google-drive/bundle-passphrase
- uploader venv: /root/.local/share/hermes-drive-uploader-venv
- uploader script: scripts/hermes-emergency-bundle-upload-drive.sh
- timer: hermes-emergency-drive-upload.timer, daily around 03:17 UTC
Drive targets:
- Vijay folder: 1KIlSJzpf5fuaH5LYvfbLsUbOSYY23YGm
- Bheem folder: 1Ac5cbDC0dSWas8LeeWe_9XFqCquz7kZT
The uploader creates one encrypted bundle and uploads the same encrypted file to both folders. It keeps the latest 12 encrypted bundles per Drive folder.
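The "keep the latest 12" rule can be illustrated with a short filter. This is a sketch of the retention idea only, not the uploader's actual code: `bundles_to_prune` is a hypothetical name, and it assumes the timestamp embedded in each bundle name sorts lexicographically in chronological order.

```shell
# Hypothetical sketch of "keep the newest N" retention.
# Reads bundle names on stdin (one per line) and prints the names that fall
# outside the newest N, i.e. the candidates for deletion.
# Arg: number of bundles to keep (default 12).
bundles_to_prune() {
  keep="${1:-12}"
  # Newest first, then skip the first N lines.
  sort -r | tail -n "+$(( keep + 1 ))"
}

# Example with a local directory listing:
#   ls -1 /root/hermes-emergency-bundles | bundles_to_prune 12
```

With 12 or fewer names on input the function prints nothing, so nothing would be pruned.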
Latest verified commits on 2026-05-27:
- root persistent backup: d286a03
- Uma persistent backup: bbad574
- ops docs/systemd templates: update after this runbook commit
Fast Rebuild Order
1. Prepare Base VM
Install the minimum system packages:
apt-get update
apt-get install -y git curl rsync python3 python3-venv nodejs npm systemd
Create Uma if missing:
id uma || useradd -m -s /bin/bash uma
loginctl enable-linger uma
2. Restore Git Access
Root is the operator for both root and Uma repo pushes.
Restore GitHub credentials for root without printing them:
git config --global credential.helper store
chmod 700 /root
# Create /root/.git-credentials from the external secret source.
chmod 600 /root/.git-credentials
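One way to create the credentials file without echoing the token is to read it from stdin. A minimal sketch, with assumptions labeled: `write_git_credentials` is a hypothetical helper, the token is assumed to need no URL-encoding, and the line it writes follows the `https://user:token@host` form used by git's `store` credential helper.

```shell
# Hypothetical helper: write a credential-store entry without printing the token.
# Args: GitHub username, destination file. The token is read from stdin.
write_git_credentials() {
  user="$1"
  dest="$2"
  # read returns non-zero at EOF without a newline, so check the value instead.
  IFS= read -r token || true
  [ -n "$token" ] || return 1
  # The subshell keeps the restrictive umask from leaking into the caller.
  ( umask 077; printf 'https://%s:%s@github.com\n' "$user" "$token" > "$dest" )
}

# Example (the secret source and username are placeholders):
#   <command that prints the token> | write_git_credentials <github-user> /root/.git-credentials
```

The `umask` only affects a newly created file, so the `chmod 600` above is still worth running if the file already existed.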
Then clone the three recovery repos:
mkdir -p /root/repos /home/uma/repos
git clone https://github.com/saravanakumardb/learning_ai_devops_tools.git /root/repos/learning_ai_devops_tools
git clone https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git /root/repos/bytelyst_hostinger_hermes_vm
git clone https://github.com/umadev0931/uma_hostinger_hermes_vm.git /home/uma/repos/uma_hostinger_hermes_vm
chown -R uma:uma /home/uma/repos
3. Install Hermes Source
Use the official Hermes source and the same shared install path:
mkdir -p /usr/local/lib
git clone https://github.com/NousResearch/hermes-agent.git /usr/local/lib/hermes-agent
cd /usr/local/lib/hermes-agent
python3 -m venv venv
./venv/bin/pip install -e .
If upstream later provides a setup or update script, prefer the official upstream instructions. Either way, verify the install:
/usr/local/lib/hermes-agent/venv/bin/hermes --version
4. Restore Root Hermes Persistent Data
HERMES_HOME=/root/.hermes \
/root/repos/bytelyst_hostinger_hermes_vm/restore_hermes_persistent_data.sh \
/root/repos/bytelyst_hostinger_hermes_vm/hermes_persistent_backup
Re-enter secrets from the external source into /root/.hermes/.env or via Hermes auth flows. Do not copy secrets from docs or chat.
Verify:
HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes doctor --fix
HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
5. Restore Uma Hermes Persistent Data
mkdir -p /home/uma/.hermes
HERMES_HOME=/home/uma/.hermes \
/home/uma/repos/uma_hostinger_hermes_vm/restore_hermes_persistent_data.sh \
/home/uma/repos/uma_hostinger_hermes_vm/hermes_persistent_backup
chown -R uma:uma /home/uma/.hermes
Re-enter Uma secrets from the external source into /home/uma/.hermes/.env or via Hermes auth flows.
Verify:
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes doctor --fix
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
6. Reinstall Systemd Units
cp /root/repos/learning_ai_devops_tools/systemd/hermes-gateway.service /etc/systemd/system/hermes-gateway.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-dashboard.service /etc/systemd/system/hermes-root-dashboard.service
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-dashboard.service /etc/systemd/system/uma-hermes-dashboard.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-backup.service /etc/systemd/system/hermes-root-backup.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-backup.timer /etc/systemd/system/hermes-root-backup.timer
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-backup.service /etc/systemd/system/uma-hermes-backup.service
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-backup.timer /etc/systemd/system/uma-hermes-backup.timer
Install Uma user gateway:
mkdir -p /home/uma/.config/systemd/user
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-gateway.service /home/uma/.config/systemd/user/uma-hermes-gateway.service
chown -R uma:uma /home/uma/.config
Enable services:
systemctl daemon-reload
systemctl enable --now hermes-gateway.service
systemctl enable --now hermes-root-backup.timer uma-hermes-backup.timer
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user daemon-reload
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user enable --now uma-hermes-gateway.service
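After enabling, a short loop can confirm that the system-level units actually came up. A minimal sketch: `check_units_active` is a hypothetical helper built on `systemctl is-active --quiet`; it covers system units only, so the Uma user gateway still needs the `systemctl --user` form shown above.

```shell
# Hypothetical helper: print every unit that is not active and return
# non-zero if at least one was found.
check_units_active() {
  failed=0
  for unit in "$@"; do
    if ! systemctl is-active --quiet "$unit"; then
      echo "not active: $unit"
      failed=1
    fi
  done
  return "$failed"
}

# Example with the units enabled in this step:
#   check_units_active hermes-gateway.service \
#     hermes-root-backup.timer uma-hermes-backup.timer
```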
7. Reconnect Tailscale And Dashboards
curl -fsSL https://tailscale.com/install.sh | sh
systemctl enable --now tailscaled
tailscale up
tailscale ip -4
Update the dashboard service files if the new Tailscale IP differs from the old 100.87.53.10, then:
systemctl daemon-reload
systemctl enable --now hermes-root-dashboard.service uma-hermes-dashboard.service
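The unit-file update mentioned above can be scripted and run before `systemctl daemon-reload`. This is a sketch under the assumption that the old IP appears literally in the dashboard unit files; `replace_tailscale_ip` is a hypothetical helper and relies on GNU `sed -i`.

```shell
# Hypothetical helper: replace the old Tailscale IP with the new one in a file.
# Args: old IP, new IP, file to edit in place.
replace_tailscale_ip() {
  old_ip="$1"
  new_ip="$2"
  file="$3"
  # Escape dots so sed matches the address literally, not "any character".
  old_re=$(printf '%s' "$old_ip" | sed 's/\./\\./g')
  sed -i "s/$old_re/$new_ip/g" "$file"
}

# Example (run once per dashboard unit file):
#   replace_tailscale_ip 100.87.53.10 "$(tailscale ip -4)" \
#     /etc/systemd/system/hermes-root-dashboard.service
```

Review the edited unit files with `grep` before reloading, since the sketch does not guard against the old address being a prefix of a longer one.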
8. Final Verification
systemctl status hermes-gateway.service --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user status uma-hermes-gateway.service --no-pager
systemctl status hermes-root-backup.timer uma-hermes-backup.timer --no-pager
systemctl list-timers --all --no-pager | grep 'hermes.*backup'
HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
python3 /root/.hermes/scripts/sync_hermes_persistent_backup.py
HERMES_HOME=/home/uma/.hermes HERMES_BACKUP_REPO=/home/uma/repos/uma_hostinger_hermes_vm HERMES_BACKUP_REMOTE=https://github.com/umadev0931/uma_hostinger_hermes_vm.git python3 /home/uma/.hermes/scripts/sync_uma_hermes_persistent_backup.py
Telegram smoke tests:
- send root Hermes: Hi
- send Uma/Bheem Hermes: Hi
- verify both reply without model-provider errors
- verify root and Uma dashboards return HTTP 200 on the current Tailscale IP/ports
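The dashboard check in the last item can be sketched with `curl`. The helper name `returns_http_200` is hypothetical, and the port in the usage comment is a placeholder; take the real ports from the dashboard unit files.

```shell
# Hypothetical helper: succeed only when the URL answers with HTTP 200.
returns_http_200() {
  # -w prints just the status code; a failed connection yields 000.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1") || code=000
  [ "$code" = "200" ]
}

# Example (replace <port> with the port from each dashboard unit file):
#   returns_http_200 "http://$(tailscale ip -4):<port>/" || echo "dashboard not healthy"
```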
Restore Test Evidence
Root restore test on 2026-05-27:
- restored into /tmp/hermes-restore-test-root-current
- MANIFEST.json source: /root/.hermes
- restored file count: 751
- restored cron job count: 1
- confirmed absent: state.db, auth.json, logs/
Uma restore test on 2026-05-27:
- restored into /tmp/hermes-restore-test-uma
- MANIFEST.json source: /home/uma/.hermes
- restored file count: 600
- restored cron job count: 2
- confirmed absent: state.db, auth.json, logs/
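The checks recorded above can be repeated on any future restore test with a small script. A minimal sketch: `verify_restore_dir` is a hypothetical helper that only inspects the top level of the restore directory for the excluded items and does not parse MANIFEST.json or count cron jobs.

```shell
# Hypothetical helper: print the restored file count and fail if any
# intentionally excluded artifact exists at the top of the restore directory.
verify_restore_dir() {
  dir="$1"
  rc=0
  for excluded in state.db auth.json logs; do
    if [ -e "$dir/$excluded" ]; then
      echo "unexpected: $excluded"
      rc=1
    fi
  done
  # Count regular files recursively; tr strips padding some wc builds add.
  echo "files: $(find "$dir" -type f | wc -l | tr -d ' ')"
  return "$rc"
}

# Example:
#   verify_restore_dir /tmp/hermes-restore-test-uma
```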
Hard Rule During Recovery
Do not expose Hermes dashboard/API publicly during rebuild. Use only local shell, SSH tunnel, or Tailscale/private network unless S explicitly approves the hostname, authentication gate, and access path.