Hermes Disaster Recovery Runbook

Goal: rebuild the ByteLyst root Hermes and Uma/Bheem Hermes setup on a new VM quickly, with durable memory, sessions, cron definitions, skills, scripts, and dashboard/service configuration restored from GitHub-backed artifacts.

Last verified: 2026-05-27.

Current Recovery Confidence

Confidence is high for durable Hermes state: both root and Uma have sanitized .hermes persistent backups pushed to GitHub, each refreshed by a recurring systemd backup timer.

What is recoverable:

  • root Hermes config, memories, skills, sessions JSON exports, cron definitions, scripts, channel directory, gateway state, SOUL, and Kanban DB
  • Uma Hermes config, memories, skills, sessions JSON exports, cron definitions, scripts, channel directory, gateway state, SOUL, and Kanban DB
  • root and Uma gateway systemd unit definitions
  • root and Uma private dashboard systemd unit definitions
  • root and Uma backup timer systemd unit definitions
  • Uma wrapper/memory/docs repo content
  • root operational docs and rebuild knowledge in this repo

What still requires operator-provided credentials or re-authentication:

  • GitHub token or credentials for clone/push if the new VM does not already have them
  • OpenAI Codex OAuth/provider login
  • Telegram bot/user credentials if not restored from an external secret manager
  • Tailscale login for the new machine
  • any optional provider/search/browser API keys

What is intentionally not restored from git:

  • raw .env secret values
  • Hermes auth.json
  • raw state.db, SQLite WAL/SHM files, logs, cache directories, sandboxes, locks, and PIDs
  • live OS processes or in-flight terminal commands that were running at the exact moment the VM was lost

Expected data-loss window:

  • durable backups run every 10 minutes through systemd timers, so at most about 10 minutes of durable state is at risk
  • in-memory or live-process activity since the last backup may need manual reconstruction from Telegram/GitHub context
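
The 10-minute window can be checked against the age of the newest commit in a backup repo. The following is a minimal sketch, not part of the existing tooling: the function name `backup_age_status` and the 900-second threshold are illustrative, and it assumes the backup job commits whenever something changed (an idle instance may legitimately show an older commit).

```shell
# Hypothetical helper: classify a backup by the age of its newest commit.
# Arguments: last commit time and current time (Unix epoch seconds),
# then the maximum acceptable age in seconds.
backup_age_status() (
  last=$1
  now=$2
  max=$3
  age=$((now - last))
  if [ "$age" -le "$max" ]; then
    echo "fresh ${age}s"
  else
    echo "stale ${age}s"
  fi
)

# Example call against the root backup repo (path taken from this runbook):
#   last=$(git -C /root/repos/bytelyst_hostinger_hermes_vm log -1 --format=%ct)
#   backup_age_status "$last" "$(date +%s)" 900
```

A "stale" result is only a prompt to look at the timer and the backup script output, not proof that data was lost.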

Backup Sources

| Instance | GitHub repo | Backup path | Recurring sync |
| --- | --- | --- | --- |
| root/vijay | https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git | hermes_persistent_backup/ | hermes-root-backup.timer every 10 minutes |
| Uma/bheem | https://github.com/umadev0931/uma_hostinger_hermes_vm.git | hermes_persistent_backup/ | uma-hermes-backup.timer every 10 minutes |
| ops docs | https://github.com/saravanakumardb/learning_ai_devops_tools.git | docs/, systemd/, scripts/ | pushed manually after changes |

Latest verified commits on 2026-05-27:

  • root persistent backup: d286a03
  • Uma persistent backup: bbad574
  • ops docs/systemd templates: to be recorded after this runbook is committed

Fast Rebuild Order

1. Prepare Base VM

Install the minimum system packages:

apt-get update
apt-get install -y git curl rsync python3 python3-venv nodejs npm systemd

Create Uma if missing:

id uma || useradd -m -s /bin/bash uma
loginctl enable-linger uma

2. Restore Git Access

Root is the operator for both root and Uma repo pushes.

Restore GitHub credentials for root without printing them:

git config --global credential.helper store
chmod 700 /root
# Create /root/.git-credentials from the external secret source.
chmod 600 /root/.git-credentials
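
One way to create the credentials file without the token appearing in argv or shell history is to read it from stdin. This is a sketch under assumptions: `write_git_credentials` is a hypothetical helper, the username is whatever account owns the token, and the token is assumed to contain no characters that need URL-encoding. The `https://user:token@host` line format is the one used by git's `store` credential helper.

```shell
# Hypothetical helper: write a one-line git credential store file from a
# token read on stdin.
# Arguments: target file, GitHub username.
write_git_credentials() (
  file=$1
  user=$2
  umask 077                      # a newly created file gets mode 0600
  IFS= read -r token || true     # tolerate input without a trailing newline
  [ -n "$token" ] || exit 1      # refuse to write an empty credential
  printf 'https://%s:%s@github.com\n' "$user" "$token" > "$file"
)
```

In practice the token would be piped in from the external secret source, for example `<secret-source-command> | write_git_credentials /root/.git-credentials <github-user>`, where both placeholders depend on the operator's setup.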

Then clone the three recovery repos:

mkdir -p /root/repos /home/uma/repos
git clone https://github.com/saravanakumardb/learning_ai_devops_tools.git /root/repos/learning_ai_devops_tools
git clone https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git /root/repos/bytelyst_hostinger_hermes_vm
git clone https://github.com/umadev0931/uma_hostinger_hermes_vm.git /home/uma/repos/uma_hostinger_hermes_vm
chown -R uma:uma /home/uma/repos

3. Install Hermes Source

Use the official Hermes source and the same shared install path:

mkdir -p /usr/local/lib
git clone https://github.com/NousResearch/hermes-agent.git /usr/local/lib/hermes-agent
cd /usr/local/lib/hermes-agent
python3 -m venv venv
./venv/bin/pip install -e .

If the upstream repo later provides its own setup or update script, prefer the official upstream instructions over the manual steps above. Either way, verify the install:

/usr/local/lib/hermes-agent/venv/bin/hermes --version

4. Restore Root Hermes Persistent Data

HERMES_HOME=/root/.hermes \
  /root/repos/bytelyst_hostinger_hermes_vm/restore_hermes_persistent_data.sh \
  /root/repos/bytelyst_hostinger_hermes_vm/hermes_persistent_backup

Re-enter secrets from the external source into /root/.hermes/.env or via Hermes auth flows. Do not copy secrets from docs or chat.
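
After re-entering secrets, it can help to confirm that the expected keys are present without ever printing their values. A minimal sketch follows; `missing_env_keys` is a hypothetical helper, and the key names passed to it are up to the operator, since this runbook does not list them. A line such as `KEY=` counts as empty, while a quoted empty string (`KEY=""`) is not detected.

```shell
# Hypothetical helper: print the names of required keys that are absent or
# empty in a dotenv-style file. Values are never printed.
# Arguments: env file, then one or more key names.
missing_env_keys() (
  file=$1
  shift
  for key in "$@"; do
    # A key counts as set only if "KEY=" is followed by at least one character.
    grep -Eq "^(export[[:space:]]+)?${key}=.+" "$file" 2>/dev/null || echo "$key"
  done
)

# Example (key names are placeholders, not confirmed Hermes settings):
#   missing_env_keys /root/.hermes/.env EXAMPLE_KEY_ONE EXAMPLE_KEY_TWO
```

Empty output means every listed key has a value; the same call works for /home/uma/.hermes/.env.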

Verify:

HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes doctor --fix
HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list

5. Restore Uma Hermes Persistent Data

mkdir -p /home/uma/.hermes
HERMES_HOME=/home/uma/.hermes \
  /home/uma/repos/uma_hostinger_hermes_vm/restore_hermes_persistent_data.sh \
  /home/uma/repos/uma_hostinger_hermes_vm/hermes_persistent_backup
chown -R uma:uma /home/uma/.hermes

Re-enter Uma secrets from the external source into /home/uma/.hermes/.env or via Hermes auth flows.

Verify:

sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes doctor --fix
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list

6. Reinstall Systemd Units

cp /root/repos/learning_ai_devops_tools/systemd/hermes-gateway.service /etc/systemd/system/hermes-gateway.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-dashboard.service /etc/systemd/system/hermes-root-dashboard.service
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-dashboard.service /etc/systemd/system/uma-hermes-dashboard.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-backup.service /etc/systemd/system/hermes-root-backup.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-backup.timer /etc/systemd/system/hermes-root-backup.timer
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-backup.service /etc/systemd/system/uma-hermes-backup.service
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-backup.timer /etc/systemd/system/uma-hermes-backup.timer

Install Uma user gateway:

mkdir -p /home/uma/.config/systemd/user
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-gateway.service /home/uma/.config/systemd/user/uma-hermes-gateway.service
chown -R uma:uma /home/uma/.config

Enable services:

systemctl daemon-reload
systemctl enable --now hermes-gateway.service
systemctl enable --now hermes-root-backup.timer uma-hermes-backup.timer

sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user daemon-reload
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user enable --now uma-hermes-gateway.service
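
A quick way to confirm that the system-scope units came up is to loop over `systemctl is-active`. This is a sketch only: `inactive_units` is a hypothetical helper, and it covers system units, not the Uma user-scope gateway, which still needs the `systemctl --user` form shown above.

```shell
# Hypothetical helper: report any unit whose state is not "active".
# Arguments: one or more systemd unit names (system scope).
# Exit status is non-zero if at least one unit is not active.
inactive_units() (
  rc=0
  for unit in "$@"; do
    # is-active exits non-zero for non-active units, so ignore its status
    # and compare the printed state instead.
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    if [ "$state" != "active" ]; then
      echo "$unit: ${state:-unknown}"
      rc=1
    fi
  done
  exit "$rc"
)

# Example:
#   inactive_units hermes-gateway.service hermes-root-backup.timer uma-hermes-backup.timer
```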

7. Reconnect Tailscale And Dashboards

curl -fsSL https://tailscale.com/install.sh | sh
systemctl enable --now tailscaled
tailscale up
tailscale ip -4

Update the dashboard service files if the new Tailscale IP differs from the old 100.87.53.10, then:

systemctl daemon-reload
systemctl enable --now hermes-root-dashboard.service uma-hermes-dashboard.service
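
If the Tailscale IP did change, the unit files need editing before the `daemon-reload` step. The sketch below shows one way to do that, assuming the old address appears literally in the dashboard unit files and that GNU sed is available; `replace_ip` is a hypothetical helper, not an existing script in these repos.

```shell
# Hypothetical helper: replace one literal IPv4 address with another in files.
# Arguments: old IP, new IP, then one or more files. Edits in place (GNU sed).
replace_ip() (
  old=$1
  new=$2
  shift 2
  # Escape the dots so they match only a literal ".".
  old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
  # \b (GNU extension) keeps 100.87.53.10 from matching inside 100.87.53.100.
  sed -i "s/\\b${old_re}\\b/${new}/g" "$@"
)

# Example, run before "systemctl daemon-reload":
#   replace_ip 100.87.53.10 "$(tailscale ip -4)" \
#     /etc/systemd/system/hermes-root-dashboard.service \
#     /etc/systemd/system/uma-hermes-dashboard.service
```

Reviewing the result with `grep -n` on the two unit files before reloading is a cheap safeguard against an unintended substitution.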

8. Final Verification

systemctl status hermes-gateway.service --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user status uma-hermes-gateway.service --no-pager
systemctl status hermes-root-backup.timer uma-hermes-backup.timer --no-pager
systemctl list-timers --all --no-pager | grep 'hermes.*backup'

HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list

python3 /root/.hermes/scripts/sync_hermes_persistent_backup.py
HERMES_HOME=/home/uma/.hermes \
  HERMES_BACKUP_REPO=/home/uma/repos/uma_hostinger_hermes_vm \
  HERMES_BACKUP_REMOTE=https://github.com/umadev0931/uma_hostinger_hermes_vm.git \
  python3 /home/uma/.hermes/scripts/sync_uma_hermes_persistent_backup.py

Telegram smoke tests:

  • send root Hermes: Hi
  • send Uma/Bheem Hermes: Hi
  • verify both reply without model-provider errors
  • verify root and Uma dashboards return HTTP 200 on the current Tailscale IP/ports
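
The HTTP 200 check in the last item can be scripted with curl. This is a sketch under assumptions: `dashboard_ok` is a hypothetical helper, and the dashboard ports are not stated in this runbook, so the URL must be filled in from the unit files.

```shell
# Hypothetical helper: succeed only if the URL answers with HTTP 200.
# Argument: full URL, e.g. http://<tailscale-ip>:<port>/
dashboard_ok() (
  # On connection failure curl exits non-zero and reports code 000.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1") || true
  echo "$1 -> ${code:-000}"
  [ "$code" = "200" ]
)

# Example (placeholders must be replaced with the real IP and port):
#   dashboard_ok "http://$(tailscale ip -4):<root-dashboard-port>/"
```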

Restore Test Evidence

Root restore test on 2026-05-27:

  • restored into /tmp/hermes-restore-test-root-current
  • MANIFEST.json source: /root/.hermes
  • restored file count: 751
  • restored cron job count: 1
  • confirmed absent: state.db, auth.json, logs/

Uma restore test on 2026-05-27:

  • restored into /tmp/hermes-restore-test-uma
  • MANIFEST.json source: /home/uma/.hermes
  • restored file count: 600
  • restored cron job count: 2
  • confirmed absent: state.db, auth.json, logs/
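
A restore test like the two above can be repeated with a small check that counts restored files and flags anything the backup is meant to exclude. The sketch below is illustrative: `check_restore` is a hypothetical helper, and the `*-wal` / `*-shm` patterns assume the usual SQLite sidecar naming.

```shell
# Hypothetical helper: summarize a restored tree and flag files that the
# backup is meant to exclude. Argument: restore directory.
# Exit status is non-zero if any excluded item is found.
check_restore() (
  dir=$1
  count=$(find "$dir" -type f | wc -l | tr -d ' ')
  echo "files: $count"
  leaked=$(find "$dir" \( -name state.db -o -name auth.json \
    -o -name '*-wal' -o -name '*-shm' \
    -o \( -type d -name logs \) \) -print)
  if [ -n "$leaked" ]; then
    echo "unexpected:"
    echo "$leaked"
    exit 1
  fi
)

# Example:
#   check_restore /tmp/hermes-restore-test-root-current
```

The file count will drift as memories and sessions accumulate, so it is best compared against the current MANIFEST.json and not against the figures recorded here.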

Hard Rule During Recovery

Do not expose Hermes dashboard/API publicly during rebuild. Use only local shell, SSH tunnel, or Tailscale/private network unless S explicitly approves the hostname, authentication gate, and access path.
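
One way to spot an accidental public bind during rebuild is to list TCP listeners that are bound to all interfaces. This is a minimal sketch: `public_listeners` is a hypothetical filter over `ss -H -tln` output, and a hit is only a candidate for review, since services such as SSH normally listen on all interfaces and a host firewall may still block access.

```shell
# Hypothetical helper: read `ss -H -tln` output on stdin and print listening
# sockets bound to all interfaces (0.0.0.0, [::], or *).
public_listeners() {
  # Field 4 of each line is the local address:port.
  awk '$4 ~ /^(0\.0\.0\.0|\[::\]|\*):[0-9]+$/ { print $4 }'
}

# Example:
#   ss -H -tln | public_listeners
```

Any dashboard or gateway port that shows up in this list should be rebound to the Tailscale or loopback address before the service is left running.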