# Hermes Disaster Recovery Runbook
Goal: rebuild the ByteLyst root Hermes and Uma/Bheem Hermes setup on a new VM quickly, with durable memory, sessions, cron definitions, skills, scripts, and dashboard/service configuration restored from GitHub-backed artifacts.
Last verified: 2026-05-27.
## Current Recovery Confidence
**High for durable Hermes state.** Both root and Uma now have sanitized `.hermes` persistent backups pushed to GitHub and recurring systemd backup timers.
What is recoverable:
- root Hermes config, memories, skills, sessions JSON exports, cron definitions, scripts, channel directory, gateway state, SOUL, and Kanban DB
- Uma Hermes config, memories, skills, sessions JSON exports, cron definitions, scripts, channel directory, gateway state, SOUL, and Kanban DB
- root and Uma gateway systemd unit definitions
- root and Uma private dashboard systemd unit definitions
- root and Uma backup timer systemd unit definitions
- Uma wrapper/memory/docs repo content
- root operational docs and rebuild knowledge in this repo
What still requires operator-provided credentials or re-authentication:
- GitHub token or credentials for clone/push if the new VM does not already have them
- OpenAI Codex OAuth/provider login, unless restored from an encrypted emergency bundle
- Telegram bot/user credentials, unless restored from an encrypted emergency bundle
- Tailscale login for the new machine, unless restoring Tailscale state is explicitly chosen
- any optional provider/search/browser API keys
What is intentionally not restored from git:
- raw `.env` secret values
- Hermes `auth.json`
- raw `state.db`, SQLite WAL/SHM files, logs, cache directories, sandboxes, locks, and PIDs
- live OS processes or in-flight terminal commands that were running at the exact moment the VM was lost
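One way to confirm that a checked-out backup really omits the sensitive files is to search it by name. The sketch below is an assumption-laden check, not part of the backup scripts: `find_unsanitized` and `backup_is_sanitized` are hypothetical helpers, and the name patterns cover only `auth.json`, `state.db*`, and `logs/` directories.

```bash
# Hypothetical helpers (not part of the backup scripts).
# Print any auth.json, state.db* file, or logs/ directory found under $1,
# skipping git metadata.
find_unsanitized() {
    find "$1" -name .git -prune -o \
        \( -type f \( -name 'auth.json' -o -name 'state.db*' \) \
           -o -type d -name 'logs' \) -print
}

# Succeed only when nothing from the exclusion list was found.
backup_is_sanitized() {
    [ -z "$(find_unsanitized "$1")" ]
}

# Example:
# backup_is_sanitized /root/repos/bytelyst_hostinger_hermes_vm/hermes_persistent_backup \
#     && echo "backup looks sanitized"
```

A name-based search cannot tell whether a file that is present still contains secret values, so `.env` contents need a manual look.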
Expected data-loss window:
- durable backups run every 10 minutes through systemd timers
- latest in-memory/live process activity since the last backup may need manual reconstruction from Telegram/GitHub context
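To estimate the actual loss window during an incident, the age of the newest backup commit can be compared with the 10-minute interval. The following is a minimal sketch; `age_minutes` is a hypothetical helper, and it assumes the backup job commits on every run (if the job skips commits when nothing changed, an older commit is not necessarily a failure).

```bash
# Hypothetical helper: whole minutes between a commit timestamp ($1) and
# "now" ($2, optional), both given as Unix epoch seconds.
age_minutes() {
    commit_epoch=$1
    now_epoch=${2:-$(date +%s)}
    echo $(( (now_epoch - commit_epoch) / 60 ))
}

# Example (run against a clone of the root backup repo):
# last=$(git -C /root/repos/bytelyst_hostinger_hermes_vm log -1 --format=%ct)
# [ "$(age_minutes "$last")" -le 10 ] || echo "newest backup commit is older than 10 minutes"
```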
## Backup Sources
| Instance | GitHub repo | Backup path | Recurring sync |
| --- | --- | --- | --- |
| root/vijay | `https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git` | `hermes_persistent_backup/` | `hermes-root-backup.timer` every 10 minutes |
| Uma/bheem | `https://github.com/umadev0931/uma_hostinger_hermes_vm.git` | `hermes_persistent_backup/` | `uma-hermes-backup.timer` every 10 minutes |
| ops docs | `https://github.com/saravanakumardb/learning_ai_devops_tools.git` | `docs/`, `systemd/`, `scripts/` | pushed manually after changes |
## Encrypted Emergency Bundle
Normal GitHub backups are sanitized and intentionally exclude raw secrets, auth state, and raw `state.db`. For faster break-glass recovery, create a separate encrypted bundle and store the encrypted `.gpg` file in Google Drive or another private location.
Create bundle on the old/current VM:
```bash
/root/repos/learning_ai_devops_tools/scripts/hermes-emergency-bundle-create.sh
```
The script creates:
```text
/root/hermes-emergency-bundles/hermes-emergency-bundle-<host>-<timestamp>.tar.zst.gpg
```
It includes only the files on this allow-list:
- `/root/.hermes/.env`, `auth.json`, `state.db*`
- `/home/uma/.hermes/.env`, `auth.json`, `state.db*`
- `/root/.git-credentials`
- `/root/.gitea_admin_password`, `/root/.gitea_npm_token`, `/root/.gitea_npm_token_home`
- `/var/lib/tailscale/tailscaled.state`
It does not include logs, caches, locks, PIDs, or sandboxes.
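The allow-list approach can be pictured as "existing allow-listed paths in, one encrypted archive out". The sketch below only illustrates that idea and may differ from what `hermes-emergency-bundle-create.sh` really does: `filter_existing` and `make_bundle` are hypothetical helpers, and symmetric GPG encryption with a passphrase file is an assumption based on the `.gpg` suffix and the passphrase-file option described in this section.

```bash
# Hypothetical helpers; the real script may work differently.

# Print only those allow-listed paths that actually exist, one per line.
filter_existing() {
    for path in "$@"; do
        [ -e "$path" ] && printf '%s\n' "$path"
    done
    return 0
}

# Archive the given paths with zstd compression and encrypt the stream
# symmetrically, so no unencrypted archive ever touches the disk.
# $1 = output .tar.zst.gpg file, $2 = passphrase file, rest = paths.
# In real use, also check tar's exit status (for example with pipefail).
make_bundle() {
    out=$1; passfile=$2; shift 2
    tar --zstd -cf - "$@" |
        gpg --batch --pinentry-mode loopback --passphrase-file "$passfile" \
            --symmetric --cipher-algo AES256 -o "$out"
}

# Example (paths without spaces, so word splitting is safe here):
# make_bundle /root/bundle.tar.zst.gpg /root/passfile \
#     $(filter_existing /root/.hermes/.env /root/.hermes/auth.json /root/.git-credentials)
```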
Decrypt on a new VM into staging only:
```bash
/root/repos/learning_ai_devops_tools/scripts/hermes-emergency-bundle-decrypt.sh \
/path/to/hermes-emergency-bundle.tar.zst.gpg
```
The decrypt script extracts to `/root/hermes-emergency-restore-staging/...` by default. It does not overwrite live `.hermes` or credential files. Inspect the staging directory first, then manually copy only the files needed for the recovery.
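The manual copy step can be made a little safer with a helper that never overwrites an existing file and always applies restrictive permissions. This is a sketch under assumptions: `restore_secret` is a hypothetical helper, and the staging sub-path in the example is illustrative because the real layout has to be read from the staging directory itself.

```bash
# Hypothetical helper: copy one file out of the staging tree with mode 600,
# refusing to overwrite anything that already exists at the destination.
restore_secret() {
    src=$1; dest=$2
    [ -f "$src" ] || { echo "missing in staging: $src" >&2; return 1; }
    [ ! -e "$dest" ] || { echo "refusing to overwrite: $dest" >&2; return 1; }
    install -m 600 "$src" "$dest"
}

# Example (STAGING and the sub-path are placeholders):
# restore_secret "$STAGING/root/.hermes/auth.json" /root/.hermes/auth.json
```

Files restored into `/home/uma/.hermes` would additionally need `chown uma:uma` afterwards, since the copy runs as root.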
For unattended operation, both scripts support:
```bash
export BUNDLE_PASSPHRASE_FILE=/root/path/to/passphrase-file
```
Keep the passphrase outside GitHub and outside the encrypted bundle.
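If no passphrase file exists yet, one could be generated roughly as follows. This is a hedged sketch rather than the project's own procedure: `create_passphrase_file` is a hypothetical helper, and the 48 random bytes (64 base64 characters) are an arbitrary choice.

```bash
# Hypothetical helper: write a random passphrase to $1 with mode 600.
# Refuses to replace an existing file, because overwriting the passphrase
# would make earlier bundles undecryptable.
create_passphrase_file() {
    [ ! -e "$1" ] || { echo "already exists: $1" >&2; return 1; }
    (
        # The subshell keeps the restrictive umask from leaking to the caller.
        umask 077
        head -c 48 /dev/urandom | base64 | tr -d '\n' > "$1"
    )
}

# Example:
# create_passphrase_file /root/.config/hermes-google-drive/bundle-passphrase
```

A copy of the passphrase still has to be stored somewhere that survives the loss of the VM, or the bundle cannot be decrypted.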
Automated upload to a personal Google Drive uses OAuth user credentials, not the service account.
Service accounts can read metadata for folders shared from a personal Drive, but uploads to those folders fail because a service account has no personal Drive storage quota. Use the service account path only for Shared Drives or Workspace delegation.
Personal Drive OAuth setup:
1. In Google Cloud Console, create an OAuth client of type **Desktop app** in the `hermes-emergency-backups` project.
2. Save the downloaded JSON as:
```text
/root/.config/hermes-google-drive/oauth-client.json
```
3. Run:
```bash
/root/.local/share/hermes-drive-uploader-venv/bin/python \
/root/repos/learning_ai_devops_tools/scripts/hermes-google-drive-oauth-login.py
```
4. Open the printed URL, approve access, and paste the code back into the terminal.
5. Confirm `/root/.config/hermes-google-drive/user-token.json` exists with mode `600`.
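The check in step 5 could be scripted as below. `has_mode_600` is a hypothetical helper, and it relies on GNU `stat`, which is an assumption that holds on the apt-based hosts this runbook targets.

```bash
# Hypothetical helper: succeed only if $1 is a regular file with mode 600.
has_mode_600() {
    [ -f "$1" ] && [ "$(stat -c '%a' "$1")" = "600" ]
}

# Example:
# has_mode_600 /root/.config/hermes-google-drive/user-token.json \
#     || echo "token file missing or too permissive"
```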
Automated Google Drive upload is configured to use:
- OAuth client: `/root/.config/hermes-google-drive/oauth-client.json`
- OAuth token: `/root/.config/hermes-google-drive/user-token.json`
- passphrase file: `/root/.config/hermes-google-drive/bundle-passphrase`
- uploader venv: `/root/.local/share/hermes-drive-uploader-venv`
- uploader script: `scripts/hermes-emergency-bundle-upload-drive.sh`
- timer: `hermes-emergency-drive-upload.timer`, daily around `03:17 UTC`
Drive targets:
- Vijay folder: `1KIlSJzpf5fuaH5LYvfbLsUbOSYY23YGm`
- Bheem folder: `1Ac5cbDC0dSWas8LeeWe_9XFqCquz7kZT`
The uploader creates one encrypted bundle and uploads the same encrypted file to both folders. It keeps the latest 12 encrypted bundles per Drive folder.
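The same "keep the latest 12" rule can be applied to the local bundle directory. The sketch below is a local analogue, not the uploader's Drive logic: `bundles_to_prune` is a hypothetical helper, and it assumes that all bundles come from one host and that the timestamp in the file name sorts chronologically.

```bash
# Hypothetical helper: list bundle files in directory $1 beyond the newest
# $2 (default 12), newest of the surplus first. Nothing is deleted here.
bundles_to_prune() {
    keep=${2:-12}
    find "$1" -maxdepth 1 -type f -name 'hermes-emergency-bundle-*.tar.zst.gpg' |
        sort -r | tail -n "+$((keep + 1))"
}

# Example: review the list first, then delete deliberately.
# bundles_to_prune /root/hermes-emergency-bundles
```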
Latest verified commits on 2026-05-27:
- root persistent backup: `d286a03`
- Uma persistent backup: `bbad574`
- ops docs/systemd templates: update after this runbook commit
## Fast Rebuild Order
### 1. Prepare Base VM
Install the minimum system packages:
```bash
apt-get update
# zstd and gnupg are assumed to be needed for the .tar.zst.gpg emergency bundle.
apt-get install -y git curl rsync python3 python3-venv nodejs npm systemd zstd gnupg
```
Create the `uma` user if it is missing:
```bash
id uma || useradd -m -s /bin/bash uma
loginctl enable-linger uma
```
### 2. Restore Git Access
Root is the operator for both root and Uma repo pushes.
Restore GitHub credentials for root without printing them:
```bash
git config --global credential.helper store
chmod 700 /root
# Create /root/.git-credentials from the external secret source.
chmod 600 /root/.git-credentials
```
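The comment in the block above leaves the file format open. Git's `store` credential helper reads one URL per line in the form `https://USER:TOKEN@HOST`, so the file could be built as sketched here. `write_git_credentials` is a hypothetical helper, and the token file stands in for whatever the external secret source provides.

```bash
# Hypothetical helper: write a git-credential-store entry for github.com.
# $1 = GitHub user name, $2 = file holding the token, $3 = destination
# (normally /root/.git-credentials). The token is never echoed to the terminal.
write_git_credentials() {
    user=$1; token_file=$2; dest=$3
    (
        umask 077
        printf 'https://%s:%s@github.com\n' "$user" "$(cat "$token_file")" > "$dest"
    )
    chmod 600 "$dest"
}

# Example (USER and the token path are placeholders):
# write_git_credentials USER /path/to/token-file /root/.git-credentials
```

A token containing characters such as `@` or `/` would need URL-encoding before it is written in this format.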
Then clone the three recovery repos:
```bash
mkdir -p /root/repos /home/uma/repos
git clone https://github.com/saravanakumardb/learning_ai_devops_tools.git /root/repos/learning_ai_devops_tools
git clone https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git /root/repos/bytelyst_hostinger_hermes_vm
git clone https://github.com/umadev0931/uma_hostinger_hermes_vm.git /home/uma/repos/uma_hostinger_hermes_vm
chown -R uma:uma /home/uma/repos
```
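After cloning, it is worth confirming that each clone contains the last verified backup commit recorded in this runbook (`d286a03` for root, `bbad574` for Uma). The helper name `clone_contains_commit` is hypothetical; the `git merge-base --is-ancestor` call it wraps is standard git.

```bash
# Hypothetical helper: succeed if commit $2 is HEAD or an ancestor of HEAD
# in the repository at $1.
clone_contains_commit() {
    git -C "$1" merge-base --is-ancestor "$2" HEAD 2>/dev/null
}

# Example:
# clone_contains_commit /root/repos/bytelyst_hostinger_hermes_vm d286a03 \
#     || echo "root backup clone is missing the verified commit"
# clone_contains_commit /home/uma/repos/uma_hostinger_hermes_vm bbad574 \
#     || echo "Uma backup clone is missing the verified commit"
```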
### 3. Install Hermes Source
Use the official Hermes source and the same shared install path:
```bash
mkdir -p /usr/local/lib
git clone https://github.com/NousResearch/hermes-agent.git /usr/local/lib/hermes-agent
cd /usr/local/lib/hermes-agent
python3 -m venv venv
./venv/bin/pip install -e .
```
If the upstream repo later provides a setup or update script, prefer the official upstream instructions. Either way, verify the install:
```bash
/usr/local/lib/hermes-agent/venv/bin/hermes --version
```
### 4. Restore Root Hermes Persistent Data
```bash
HERMES_HOME=/root/.hermes \
/root/repos/bytelyst_hostinger_hermes_vm/restore_hermes_persistent_data.sh \
/root/repos/bytelyst_hostinger_hermes_vm/hermes_persistent_backup
```
Re-enter secrets from the external source into `/root/.hermes/.env` or via Hermes auth flows. Do not copy secrets from docs or chat.
Verify:
```bash
HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes doctor --fix
HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
```
### 5. Restore Uma Hermes Persistent Data
```bash
mkdir -p /home/uma/.hermes
HERMES_HOME=/home/uma/.hermes \
/home/uma/repos/uma_hostinger_hermes_vm/restore_hermes_persistent_data.sh \
/home/uma/repos/uma_hostinger_hermes_vm/hermes_persistent_backup
chown -R uma:uma /home/uma/.hermes
```
Re-enter Uma secrets from the external source into `/home/uma/.hermes/.env` or via Hermes auth flows.
Verify:
```bash
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes doctor --fix
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
```
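Because the Uma restore runs as root, a leftover root-owned file under `/home/uma/.hermes` could stop the `uma` user from writing its own state. A quick ownership check might look like this; `not_owned_by` is a hypothetical helper.

```bash
# Hypothetical helper: print anything under directory $1 that is not owned
# by user $2 (a user name or a numeric UID).
not_owned_by() {
    find "$1" ! -user "$2" -print
}

# Example: no output means ownership is consistent.
# not_owned_by /home/uma/.hermes uma
```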
### 6. Reinstall Systemd Units
```bash
cp /root/repos/learning_ai_devops_tools/systemd/hermes-gateway.service /etc/systemd/system/hermes-gateway.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-dashboard.service /etc/systemd/system/hermes-root-dashboard.service
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-dashboard.service /etc/systemd/system/uma-hermes-dashboard.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-backup.service /etc/systemd/system/hermes-root-backup.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-backup.timer /etc/systemd/system/hermes-root-backup.timer
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-backup.service /etc/systemd/system/uma-hermes-backup.service
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-backup.timer /etc/systemd/system/uma-hermes-backup.timer
```
Install Uma user gateway:
```bash
mkdir -p /home/uma/.config/systemd/user
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-gateway.service /home/uma/.config/systemd/user/uma-hermes-gateway.service
chown -R uma:uma /home/uma/.config
```
Enable services:
```bash
systemctl daemon-reload
systemctl enable --now hermes-gateway.service
systemctl enable --now hermes-root-backup.timer uma-hermes-backup.timer
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user daemon-reload
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user enable --now uma-hermes-gateway.service
```
### 7. Reconnect Tailscale And Dashboards
```bash
curl -fsSL https://tailscale.com/install.sh | sh
systemctl enable --now tailscaled
tailscale up
tailscale ip -4
```
Update the dashboard service files if the new Tailscale IP differs from the old `100.87.53.10`, then:
```bash
systemctl daemon-reload
systemctl enable --now hermes-root-dashboard.service uma-hermes-dashboard.service
```
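If the Tailscale address did change, the unit-file edit described above could be scripted roughly as follows, before the dashboards are enabled (or followed by a `systemctl daemon-reload` and a restart). `replace_ip` is a hypothetical helper, and it assumes the old address appears literally in the unit files and that GNU `sed` is available.

```bash
# Hypothetical helper: replace every literal occurrence of IPv4 address $1
# with $2 in file $3, editing the file in place (GNU sed).
replace_ip() {
    # Escape the dots so they match only literal dots in the regex.
    old_re=$(printf '%s' "$1" | sed 's/\./\\./g')
    # \b keeps 100.87.53.10 from matching inside 100.87.53.100.
    sed -i "s/\\b${old_re}\\b/$2/g" "$3"
}

# Example:
# new_ip=$(tailscale ip -4)
# replace_ip 100.87.53.10 "$new_ip" /etc/systemd/system/hermes-root-dashboard.service
# replace_ip 100.87.53.10 "$new_ip" /etc/systemd/system/uma-hermes-dashboard.service
```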
### 8. Final Verification
```bash
systemctl status hermes-gateway.service --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user status uma-hermes-gateway.service --no-pager
systemctl status hermes-root-backup.timer uma-hermes-backup.timer --no-pager
systemctl list-timers --all --no-pager | grep 'hermes.*backup'
HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
python3 /root/.hermes/scripts/sync_hermes_persistent_backup.py
HERMES_HOME=/home/uma/.hermes \
HERMES_BACKUP_REPO=/home/uma/repos/uma_hostinger_hermes_vm \
HERMES_BACKUP_REMOTE=https://github.com/umadev0931/uma_hostinger_hermes_vm.git \
python3 /home/uma/.hermes/scripts/sync_uma_hermes_persistent_backup.py
```
Telegram smoke tests:
- send root Hermes: `Hi`
- send Uma/Bheem Hermes: `Hi`
- verify both reply without model-provider errors
- verify root and Uma dashboards return HTTP 200 on the current Tailscale IP/ports
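The dashboard check in the last item can be scripted with `curl`, with a few retries because a service may need a moment after start. This is a sketch only: `retry` and `returns_200` are hypothetical helpers, and `PORT` is a placeholder because the dashboard ports are defined in the unit files, not in this runbook.

```bash
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts; succeed as soon as the command succeeds.
retry() {
    attempts=$1; delay=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Hypothetical helper: succeed only when the URL answers with HTTP 200.
returns_200() {
    [ "$(curl -s -o /dev/null -w '%{http_code}' "$1")" = "200" ]
}

# Example (replace PORT with the port from the dashboard unit file):
# retry 5 3 returns_200 "http://$(tailscale ip -4):PORT/"
```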
## Restore Test Evidence
Root restore test on 2026-05-27:
- restored into `/tmp/hermes-restore-test-root-current`
- `MANIFEST.json` source: `/root/.hermes`
- restored file count: `751`
- restored cron job count: `1`
- confirmed absent: `state.db`, `auth.json`, `logs/`
Uma restore test on 2026-05-27:
- restored into `/tmp/hermes-restore-test-uma`
- `MANIFEST.json` source: `/home/uma/.hermes`
- restored file count: `600`
- restored cron job count: `2`
- confirmed absent: `state.db`, `auth.json`, `logs/`
## Hard Rule During Recovery
Do not expose Hermes dashboard/API publicly during rebuild. Use only local shell, SSH tunnel, or Tailscale/private network unless S explicitly approves the hostname, authentication gate, and access path.