# ByteLyst Hermes Operations Runbook

Operational runbook for the private Telegram-driven Hermes Agent setup on the ByteLyst VM.

## Current baseline

Observed on 2026-05-27:

- Hermes version: `v0.14.0 (2026.5.16)`
- Shared source checkout: `/usr/local/lib/hermes-agent` at upstream `0b6ace649` after the 2026-05-27 late upgrade pass
- Install path: `/usr/local/lib/hermes-agent`
- Active profile: `default`
- Primary provider: OpenAI Codex OAuth
- Root Telegram gateway: `hermes-gateway.service`, system service, enabled and running
- Uma Telegram gateway: `uma-hermes-gateway.service`, user service for `uma`, enabled and running
- Root and Uma default model: `gpt-5.5`, `model.routing.enabled: false`
- Shared local fallback chain via Ollama on demand:
  - `qwen2.5-coder:7b`
  - `llama3.1:8b`
  - `llama3.2-vision`
- Web backend target: Firecrawl, configured locally on root and Uma with a private API key
- Browser automation: enabled on both Hermes gateways; root was smoke-tested privately against `https://example.com`
- Backup cron: `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, local delivery
- Systemd persistent backup timers: `hermes-root-backup.timer` and `uma-hermes-backup.timer`, every 10 minutes
- Watchdog cron: `ByteLyst Hermes gateway/backup/disk watchdog`, every 15 minutes, Telegram delivery on failure only
- Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
- Tailscale: installed and `tailscaled` enabled/running; authenticated as tailnet IP `100.87.53.10`
- Private dashboards:
  - Root: `http://100.87.53.10:9119/`, `hermes-root-dashboard.service`
  - Uma: `http://100.87.53.10:9120/`, `uma-hermes-dashboard.service`

## Safety guardrail: no public Hermes dashboard/API

Before adding any new Caddy hostname, Docker port, or dashboard/API feature, verify that the change does not expose the Hermes dashboard or API publicly.

```bash
# Inspect public Caddy routes and obvious Hermes/API/dashboard references.
docker ps --format '{{.Names}} {{.Ports}}' | grep -i caddy || true
grep -RniE 'hermes|dashboard|api-server|API_SERVER|8000|8080|3000|5173' /etc/caddy /root/bytelyst.ai 2>/dev/null | head -100

# Inspect listening ports. Review any 0.0.0.0 listeners before exposing a hostname.
ss -ltnp
```
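
Reviewing `ss -ltnp` by eye is easy to get wrong on a busy host. As a minimal sketch, the filter below prints only listeners bound to every interface. It assumes the usual iproute2 `ss -ltn` column layout, with the local address in the fourth column; `wildcard_listeners` is a hypothetical helper name, not an existing script in this repo.

```bash
# Hypothetical helper: read `ss -ltn` output on stdin and print the local
# address of every listener bound to all interfaces (0.0.0.0, [::], or *).
wildcard_listeners() {
  awk '$1 == "LISTEN" && $4 ~ /^(0\.0\.0\.0|\[::\]|\*):[0-9]+$/ { print $4 }'
}

# Example (run manually on the VM):
#   ss -ltn | wildcard_listeners
```

Empty output means no wildcard listeners were found. Any line that is printed still needs a manual check against the Caddy routes before a hostname is exposed.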

Allowed private access patterns for any Hermes dashboard:

1. local-only binding (`127.0.0.1`)
2. SSH tunnel
3. Tailscale/WireGuard private network
4. Cloudflare Access or equivalent identity gate
5. basic auth plus IP allowlist only if public routing is unavoidable and explicitly approved

Current private network access:

```bash
tailscale status
tailscale ip -4
# Expected server IPv4: 100.87.53.10
```

Private dashboard services:

```bash
systemctl status hermes-root-dashboard --no-pager
systemctl status uma-hermes-dashboard --no-pager
ss -ltnp | grep -E ':(9119|9120)'

# Expected listeners are Tailscale-only:
# 100.87.53.10:9119
# 100.87.53.10:9120
```
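
The expected-listener check can be made pass/fail instead of visual. The sketch below assumes the same `ss -ltn` column layout as above and takes the expected bind IP and ports as arguments; `listeners_only_on` is a hypothetical name used for illustration only.

```bash
# Hypothetical helper: succeed only if every given port is listening and is
# bound to the expected IP. Reads `ss -ltn` output on stdin.
# Usage: ss -ltn | listeners_only_on 100.87.53.10 9119 9120
listeners_only_on() {
  local ip="$1"
  shift
  local ports="$*"
  awk -v ip="$ip" -v ports="$ports" '
    BEGIN { n = split(ports, p, " "); for (i = 1; i <= n; i++) want[p[i]] = 1 }
    $1 == "LISTEN" {
      addr = $4
      port = addr
      sub(/^.*:/, "", port)                      # keep text after the last colon
      host = substr(addr, 1, length(addr) - length(port) - 1)
      if (port in want) {
        seen[port] = 1
        if (host != ip) { print "unexpected bind: " addr; bad = 1 }
      }
    }
    END {
      for (q in want) if (!(q in seen)) { print "not listening: " q; bad = 1 }
      exit bad + 0
    }'
}
```

A non-zero exit status with an `unexpected bind` line indicates a dashboard is reachable outside the tailnet address and should be treated as a guardrail violation.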

Tracked service unit templates:

```bash
systemd/hermes-gateway.service
systemd/uma-hermes-gateway.service
systemd/hermes-root-dashboard.service
systemd/uma-hermes-dashboard.service
systemd/hermes-root-backup.service
systemd/hermes-root-backup.timer
systemd/uma-hermes-backup.service
systemd/uma-hermes-backup.timer
```

## Health baseline commands

```bash
hermes --version
hermes config check
hermes doctor --fix
hermes status --all
hermes cron list
systemctl status hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
df -h /
free -h
ss -ltnp
```

Notes:

- `hermes doctor --fix` migrated root and Uma configs to version `24` on 2026-05-27.
- Optional providers/search backends are mostly not configured yet. Configure through Hermes setup/auth flows only; never commit credentials.
- Local Ollama fallback models are installed on demand, not kept hot permanently. Both Hermes instances can reach the shared host service at `http://127.0.0.1:11434/v1`. `gemma4` was attempted but the installed Ollama runtime rejected it, so the vision fallback is `llama3.2-vision`.

## Gateway recovery

```bash
systemctl status hermes-gateway --no-pager
journalctl -u hermes-gateway -n 100 --no-pager
hermes gateway restart
# If the CLI restart path is unavailable:
sudo systemctl restart hermes-gateway

# Uma user gateway:
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 journalctl --user -u uma-hermes-gateway -n 100 --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user restart uma-hermes-gateway
```

After restart, verify from Telegram:

- inbound message receives a response
- outbound completion messages work
- approval prompts still reach the allowed user
- media/file delivery works for a known safe file if needed

## Cron and watchdogs

List jobs:

```bash
hermes cron list
```

Current watchdog script:

```bash
~/.hermes/scripts/hermes_health_watchdog.py
```

Tracked source copy:

```bash
scripts/hermes-health-watchdog.py
```

Behavior:

- no output on success, so the cron stays silent
- sends a Telegram message only when it detects an actionable failure
- checks gateway service state, Hermes cron backup visibility/status, backup repo freshness when discoverable, and root disk usage
- also checks memory pressure plus critical Caddy/Gitea Docker containers (`caddy`, `gitea-npm-registry`)
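
The silent-on-success behavior can be illustrated with a single check. This is a shell sketch of the pattern only, not the tracked Python watchdog: the function names and the 90% threshold are assumptions, and the real script's thresholds may differ.

```bash
# Hypothetical check: read `df -P /` output on stdin and print a message only
# when root disk usage exceeds the limit (default 90, an assumed value).
disk_alert() {
  local limit="${1:-90}"
  awk -v limit="$limit" 'NR == 2 {
    used = $5
    sub(/%/, "", used)
    if (used + 0 > limit + 0)
      printf "root disk at %d%% (limit %d%%)\n", used, limit
  }'
}

# Emit collected alerts, if any. Empty input means healthy: no output, exit 0.
report_alerts() {
  local alerts="$1"
  [ -z "$alerts" ] && return 0
  printf '%s\n' "$alerts"
  return 1
}

# Example (run manually on the VM):
#   report_alerts "$(df -P / | disk_alert 90)"
```

Because a healthy run prints nothing, the cron delivery stays quiet and only a real failure produces a Telegram message.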

Manual smoke test:

```bash
python3 ~/.hermes/scripts/hermes_health_watchdog.py
# Healthy output should be empty.
```

Persistent backup timers:

```bash
systemctl status hermes-root-backup.timer uma-hermes-backup.timer --no-pager
systemctl list-timers --all --no-pager | grep 'hermes.*backup'
```

## Backup and restore drill outline

The persistent-data backup repo intentionally excludes raw secrets and `state.db`.

For full VM rebuild steps, use `docs/hermes-disaster-recovery.md`.

For break-glass recovery of raw secrets/auth/state that are excluded from GitHub backups, use:

```bash
scripts/hermes-emergency-bundle-create.sh
scripts/hermes-emergency-bundle-decrypt.sh
scripts/hermes-emergency-bundle-upload-drive.sh
```

Store only the encrypted `.gpg` bundle in Google Drive or similar private storage. Never upload the plaintext staging directory.

Automated Drive upload:

```bash
/root/.local/share/hermes-drive-uploader-venv/bin/python scripts/hermes-google-drive-oauth-login.py
systemctl status hermes-emergency-drive-upload.timer --no-pager
systemctl start hermes-emergency-drive-upload.service
journalctl -u hermes-emergency-drive-upload.service -n 80 --no-pager
```

Personal Google Drive requires OAuth user credentials. A service account can see shared personal folders but cannot upload because it has no personal Drive storage quota.

General one-file Drive upload:

```bash
scripts/google-drive-upload-file.sh /path/to/file --target vijay
scripts/google-drive-upload-file.sh /path/to/file --target bheem --encrypt
```

The general uploader refuses sensitive-looking files by default, including `.env`, auth tokens, private keys, SQLite DBs, and Google credential files. Use `--encrypt` for private files. Use `--allow-sensitive` only after explicit approval.
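
A default-deny filename check of this kind might look like the sketch below. The pattern list is an assumption chosen to cover the categories named above; it is not copied from `scripts/google-drive-upload-file.sh`, and `looks_sensitive` is a hypothetical name.

```bash
# Hypothetical classifier: return 0 when the file name looks sensitive.
# The patterns are illustrative assumptions, not the uploader's actual list.
looks_sensitive() {
  local name
  name="$(basename -- "$1")"
  case "$name" in
    .env|.env.*)                                    return 0 ;;  # env files
    *.pem|*.key|id_rsa*|id_ed25519*)                return 0 ;;  # private keys
    *.db|*.sqlite|*.sqlite3|*.db-wal|*.db-shm)      return 0 ;;  # SQLite data
    *token*|*credentials*.json|client_secret*.json) return 0 ;;  # auth material
    *)                                              return 1 ;;
  esac
}

# Example:
#   looks_sensitive /path/to/file && echo "refuse unless --encrypt is given"
```

Matching on the base name keeps the check cheap and avoids reading file contents, at the cost of missing secrets stored under innocuous names.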

Telegram usage pattern:

```text
Upload the file I just sent to Vijay Google Drive. Do not print file contents. Find the local attachment path, then use scripts/google-drive-upload-file.sh with --target vijay.
```

Quarterly restore drill:

1. Run the backup sync manually or wait for a successful cron run.
2. Clone the backup repo into a temporary directory.
3. Inspect git contents for accidental raw secrets:
   ```bash
   git grep -nE '(API_KEY|TOKEN|SECRET|PASSWORD|BEGIN .*PRIVATE KEY)' || true
   ```
4. Restore into a non-production Hermes profile/test directory only.
5. Verify config, skills, sessions JSON exports, cron definitions, memories, and scripts are present.
6. Confirm `.env`, OAuth files, SQLite WAL/SHM files, logs, caches, and raw `state.db` are absent.
7. Delete the temporary restore directory when done.
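
Steps 5 and 6 can be scripted so the drill gives a clear pass or fail. The sketch below assumes the restored layout uses the names recorded in the rehearsal notes (`config.yaml`, `skills/`, `sessions/`, `cron/`, `memories/`). `verify_restore` is a hypothetical helper; OAuth file names are not listed in this runbook, so they are not checked here.

```bash
# Hypothetical helper: check a restored directory for expected and forbidden items.
verify_restore() {
  local dir="$1" status=0 item hits

  # Portable items that should be present after a restore.
  for item in config.yaml skills sessions cron memories; do
    [ -e "$dir/$item" ] || { echo "missing: $item"; status=1; }
  done

  # Files that must never appear in the GitHub backup.
  hits="$(find "$dir" \( -name '.env' -o -name 'state.db' \
            -o -name '*-wal' -o -name '*-shm' -o -name '*.log' \) -print)"
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits" | sed 's/^/forbidden: /'
    status=1
  fi

  return "$status"
}

# Example:
#   verify_restore /tmp/hermes-restore-test-root
```

The secret-pattern `git grep` from step 3 remains a separate check, since it inspects file contents rather than file names.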

2026-05-27 restore rehearsal:

- Restored root backup into `/tmp/hermes-restore-test-root`.
- Verified portable directories/files were present: `config.yaml`, `skills/`, `sessions/`, `cron/`, `memories/`, and scripts.
- Verified raw `state.db` was absent.
- Scanned restored `.env` template and `config.yaml` for common token patterns; no hits.

## Upgrade checklist

Before upgrade:

```bash
hermes --version
hermes status --all
hermes config check
hermes cron list
python3 ~/.hermes/scripts/sync_hermes_persistent_backup.py
```

Upgrade from an interactive/private shell only:

```bash
hermes update
```

After upgrade:

```bash
hermes doctor --fix
hermes gateway restart
hermes --version
hermes status --all
hermes cron list
python3 ~/.hermes/scripts/hermes_health_watchdog.py
```

Then run Telegram smoke tests and record any manual fixups in this doc or the roadmap.

2026-05-27 late upgrade pass:

- Backed up root/Uma configs and service units under `/root/hermes-fix-backups/20260527-roadmap-noncreds/`.
- Fast-forwarded `/usr/local/lib/hermes-agent` to upstream `0b6ace649`.
- Restarted both gateways.
- Verified provider smoke tests with exact responses `root-roadmap-ok` and `uma-roadmap-ok`.

## Provider and tool changes

Use Hermes flows rather than editing secrets into git-tracked files:

```bash
hermes model
hermes setup model
hermes tools list
hermes tools enable <toolset>
hermes tools disable <toolset>
```

Restart/reset requirement:

- gateway config changes: `/restart` from Telegram or `hermes gateway restart`
- CLI session tool changes: start a new session or `/reset`
- provider auth changes: start a new session after switching models/providers

## Safe local Gitea Git token flow

Root Hermes has a least-privilege local Gitea Git path for repository reads:

- token file: `/root/.gitea_npm_token_home`
- askpass helper: `/root/.local/bin/gitea-git-askpass`
- Git wrapper: `/root/.local/bin/gitea-git`
- default username: `learning_ai_user`
- local Gitea URL: `http://localhost:3300`

The token value must never be placed in a remote URL, shell history, Git config, docs, logs, or Hermes chat. The wrapper sets `GIT_TERMINAL_PROMPT=0` and `GIT_ASKPASS=/root/.local/bin/gitea-git-askpass`; the askpass helper reads the token from the root-only token file only when Git prompts for a password.
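
For readers unfamiliar with the askpass mechanism: Git runs the `GIT_ASKPASS` program with the prompt text as its first argument and reads the reply from standard output. The sketch below shows that contract under stated assumptions. It is not the contents of `/root/.local/bin/gitea-git-askpass`; the function name, the `GITEA_TOKEN_FILE` and `GITEA_USER` overrides, and the handling of the username prompt are all hypothetical.

```bash
# Hypothetical askpass logic. Git passes the prompt as $1, for example
# "Password for 'http://localhost:3300': ".
askpass_reply() {
  local prompt="$1"
  local token_file="${GITEA_TOKEN_FILE:-/root/.gitea_npm_token_home}"
  case "$prompt" in
    Username*) printf '%s\n' "${GITEA_USER:-learning_ai_user}" ;;
    Password*) cat -- "$token_file" ;;   # token is read only at this point
    *)         return 1 ;;               # refuse any other prompt
  esac
}

# A wrapper in the same spirit would export the two variables and call git:
#   GIT_TERMINAL_PROMPT=0 GIT_ASKPASS=/root/.local/bin/gitea-git-askpass git "$@"
```

Because the token travels only through the helper's standard output, it never appears in the remote URL, the process arguments, or shell history.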

Safe read-only test:

```bash
/root/.local/bin/gitea-git ls-remote http://localhost:3300/bytelyst/learning_ai_common_plat.git HEAD
```

Hermes-safe prompt pattern:

```text
Use the terminal tool only. Run exactly this read-only command and report only whether it succeeded and the first 12 characters of the HEAD hash: /root/.local/bin/gitea-git ls-remote http://localhost:3300/bytelyst/learning_ai_common_plat.git HEAD. Do not print any token, credential, environment variable, or file contents.
```
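
The same truncated result can be produced locally for comparison with what Hermes reports. This sketch assumes the standard `git ls-remote` output format of `<hash><TAB><ref>`; `short_head` is a hypothetical helper name.

```bash
# Hypothetical helper: read `git ls-remote <url> HEAD` output on stdin and
# print the first 12 characters of the HEAD hash. Fails if no HEAD line exists.
short_head() {
  awk '$2 == "HEAD" { print substr($1, 1, 12); found = 1; exit }
       END { exit !found }'
}

# Example (run manually as root):
#   /root/.local/bin/gitea-git ls-remote \
#     http://localhost:3300/bytelyst/learning_ai_common_plat.git HEAD | short_head
```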

Verification recorded on 2026-05-27:

- local Gitea version endpoint returned `1.22.6`
- token file permissions are root-only
- profile-read API access returned a scope denial, confirming the token is not broad enough for user-profile reads
- direct wrapper test returned HEAD `59c4638f85be...`
- Hermes one-shot test reported success with truncated HEAD `59c4638f85be`

For write operations, create a separate repo-scoped token and store it in a new root-only token file. Do not reuse this read-focused token for broad automation unless the required scope is explicitly reviewed first.

## GitHub credential ownership

Root Git operations already have GitHub push credentials through the root Git credential store. Root is the operator account for both:

- `https://github.com/saravanakumardb/learning_ai_devops_tools.git`
- `https://github.com/umadev0931/uma_hostinger_hermes_vm.git`

Uma does not need a separate `/home/uma/.git-credentials` file for the current workflow because repo maintenance and pushes are performed from root. Do not copy root GitHub credentials into Uma's home directory unless there is a concrete need for Uma-user GitHub pushes.

Remaining audit item: confirm in GitHub that the root token is fine-grained or otherwise limited to the intended repos and permissions. Do not print the token while checking this.

## Telegram topics and session handling

Root and Uma currently use the standard Telegram gateway session handling. Do not enable or change topic/session behavior without a concrete routing need.

Review these before changing Telegram routing:

```bash
systemctl status hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
grep -RniE 'topic|thread|TELEGRAM_.*THREAD|HOME_CHANNEL' /root/.hermes /home/uma/.hermes 2>/dev/null | head -100
```

## Multi-agent execution conventions

Use the smallest execution surface that fits the task:

- direct tool call: one-shot local checks, edits, commits, pushes, status reads
- `delegate_task`: bounded research or code inspection that can return inside the parent session
- background terminal process: long-running local commands that need monitoring
- cron job: recurring, deterministic, silent-on-success maintenance
- Kanban worker: durable multi-agent project coordination after the board is intentionally configured

Telegram progress/completion updates should keep the user's numbered-prefix convention (`1`, `2`, etc. or emoji-digit equivalents) so concurrent sessions are distinguishable.