bytelyst/bytelyst-devops-tools

Saravana Kumar e2db92f3b1 Add Hermes snapshot diff view

2026-05-27 21:05:57 +00:00

ByteLyst Hermes Operations Runbook

Operational runbook for the private Telegram-driven Hermes Agent setup on the ByteLyst VM.

Current baseline

Observed on 2026-05-27:

Hermes version: v0.14.0 (2026.5.16)
Shared source checkout: /usr/local/lib/hermes-agent at upstream 0b6ace649 after the 2026-05-27 late upgrade pass
Install path: /usr/local/lib/hermes-agent
Active profile: default
Primary provider: OpenAI Codex OAuth
Root Telegram gateway: hermes-gateway.service, system service, enabled and running
Uma Telegram gateway: uma-hermes-gateway.service, user service for uma, enabled and running
Root and Uma default model: gpt-5.5, model.routing.enabled: false
Shared local fallback chain via Ollama on demand:
- qwen2.5-coder:1.5b
- llama3.2:1b
- llama3.2-vision
These local fallbacks are loaded on demand and answer within the gateway's retry budget on this VM; the larger 3B/7B models were observed to be too slow for the live fallback path here.
Live Hermes session-switch proof: root and Uma both fail over from a forced primary-provider error into the local Ollama chain and return FallbackTest.
Telegram platform-context proof: the same fallback behavior passes when Hermes runs with HERMES_PLATFORM=telegram for both root and Uma. This is platform-context proof, not a separately replayed inbound Telegram network message.
Web backend target: Firecrawl, configured locally on root and Uma with a private API key
Browser automation: enabled on both Hermes gateways; root was smoke-tested privately against https://example.com
Backup cron: Sync Hermes persistent-data backup to GitHub, every 30 minutes, local delivery
Systemd persistent backup timers: hermes-root-backup.timer and uma-hermes-backup.timer, every 10 minutes
Watchdog cron: ByteLyst Hermes gateway/backup/disk watchdog, every 15 minutes, Telegram delivery on failure only
Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
Tailscale: installed and tailscaled enabled/running; authenticated as tailnet IP 100.87.53.10
Private dashboards:
- Root: http://100.87.53.10:9119/, hermes-root-dashboard.service
- Uma: http://100.87.53.10:9120/, uma-hermes-dashboard.service
- Live ops panel shows gateway state, active sessions, refresh delta, cron state, backup freshness, sanitized alerts, and runbook links for both instances.

Safety guardrail: no public Hermes dashboard/API

Before adding any new Caddy hostname, Docker port, or dashboard/API feature, verify that it does not expose the Hermes dashboard or API publicly.

# Inspect public Caddy routes and obvious Hermes/API/dashboard references.
docker ps --format '{{.Names}} {{.Ports}}' | grep -i caddy || true
grep -RniE 'hermes|dashboard|api-server|API_SERVER|8000|8080|3000|5173' /etc/caddy /root/bytelyst.ai 2>/dev/null | head -100

# Inspect listening ports. Review any 0.0.0.0 listeners before exposing a hostname.
ss -ltnp
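When reviewing the listener output, wildcard binds are the entries most relevant to this guardrail. The following is a minimal sketch of a filter, assuming the usual `ss -ltn` column layout in which the local address is the fourth field. The function name `wildcard_listeners` is hypothetical and not part of the tracked scripts.

```shell
# Print listeners bound to all interfaces (IPv4 or IPv6 wildcard).
# Reads `ss -ltn` output on stdin; field 4 is the local address:port.
wildcard_listeners() {
  awk '$4 ~ /^(0\.0\.0\.0|\*|\[::\]):/ { print $4 }'
}

# Usage: ss -ltn | wildcard_listeners
```

Loopback and Tailscale-bound listeners are not printed, so any output is a candidate for manual review before a hostname is exposed.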

Allowed private access patterns for a future Hermes dashboard:

local-only binding (127.0.0.1)
SSH tunnel
Tailscale/WireGuard private network
Cloudflare Access or equivalent identity gate
basic auth plus IP allowlist only if public routing is unavoidable and explicitly approved

Current private network access:

tailscale status
tailscale ip -4
# Expected server IPv4: 100.87.53.10

Private dashboard services:

systemctl status hermes-root-dashboard --no-pager
systemctl status uma-hermes-dashboard --no-pager
ss -ltnp | grep -E ':(9119|9120)'

# Expected listeners are Tailscale-only:
# 100.87.53.10:9119
# 100.87.53.10:9120
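The expectation above can be checked mechanically. This is a sketch only, assuming `ss -ltn` output with the local address in field 4. The function name `dashboard_bind_violations` is hypothetical.

```shell
# Print any listener on the dashboard ports that is not bound to the expected IP.
# $1 = expected bind address; stdin = `ss -ltn` output (field 4 = local address:port).
dashboard_bind_violations() {
  awk -v ip="$1" '
    {
      n = split($4, parts, ":")
      port = parts[n]
      addr = substr($4, 1, length($4) - length(port) - 1)
      if ((port == "9119" || port == "9120") && addr != ip) print $4
    }'
}

# Usage: ss -ltn | dashboard_bind_violations 100.87.53.10
# Empty output means no dashboard port is bound outside the expected address.
```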

Tracked service unit templates:

systemd/hermes-gateway.service
systemd/uma-hermes-gateway.service
systemd/hermes-root-dashboard.service
systemd/uma-hermes-dashboard.service
systemd/hermes-root-backup.service
systemd/hermes-root-backup.timer
systemd/uma-hermes-backup.service
systemd/uma-hermes-backup.timer

Health baseline commands

hermes --version
hermes config check
hermes doctor --fix
hermes status --all
hermes cron list
systemctl status hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
df -h /
free -h
ss -ltnp

Notes:

hermes doctor --fix migrated root and Uma configs to version 24 on 2026-05-27.
Optional providers/search backends are mostly not configured yet. Configure through Hermes setup/auth flows only; never commit credentials.
Local Ollama fallback models are loaded on demand, not kept hot permanently. Both Hermes instances can reach the shared host service at http://127.0.0.1:11434/v1. The live fallback order is qwen2.5-coder:1.5b -> llama3.2:1b -> llama3.2-vision. gemma4 was attempted but the installed Ollama runtime rejected it, so the vision fallback is llama3.2-vision.
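To confirm that every model in the fallback chain is actually present on the host, the `ollama list` output can be compared against the chain. This is a sketch under stated assumptions: `ollama list` prints a header row with the model name in the first column, and an untagged name such as llama3.2-vision is listed with the `:latest` tag. The function name `missing_fallback_models` is hypothetical.

```shell
# Print each fallback-chain model that does not appear in `ollama list` output.
# stdin = `ollama list` output (header row, then NAME in the first column).
missing_fallback_models() {
  names="$(awk 'NR > 1 { print $1 }')"
  for model in qwen2.5-coder:1.5b llama3.2:1b llama3.2-vision; do
    case "$model" in
      *:*) want="$model" ;;          # already tagged
      *)   want="$model:latest" ;;   # assumed default tag for untagged names
    esac
    printf '%s\n' "$names" | grep -qxF -- "$want" || printf '%s\n' "$model"
  done
}

# Usage: ollama list | missing_fallback_models
```

Empty output means all three fallback models are present. It does not prove that they answer within the gateway's retry budget.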

Gateway recovery

systemctl status hermes-gateway --no-pager
journalctl -u hermes-gateway -n 100 --no-pager
hermes gateway restart
# If the CLI restart path is unavailable:
sudo systemctl restart hermes-gateway

# Uma user gateway:
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 journalctl --user -u uma-hermes-gateway -n 100 --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user restart uma-hermes-gateway
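The Uma commands hard-code `/run/user/1002`, which is only correct while Uma's UID stays 1002. A small helper can derive the path instead. This is a sketch that assumes the standard systemd-logind layout `/run/user/<uid>`. The function name `user_runtime_dir` is hypothetical.

```shell
# Build the per-user runtime directory that `systemctl --user` needs.
# $1 = account name; assumes the /run/user/<uid> convention.
user_runtime_dir() {
  uid="$(id -u "$1")" || return 1
  printf '/run/user/%s\n' "$uid"
}

# Usage (mirrors the Uma commands above):
# sudo -u uma XDG_RUNTIME_DIR="$(user_runtime_dir uma)" systemctl --user status uma-hermes-gateway --no-pager
```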

After restart, verify from Telegram:

inbound message receives a response
outbound completion messages work
approval prompts still reach the allowed user
media/file delivery works for a known safe file if needed

Cron and watchdogs

List jobs:

hermes cron list

Current watchdog script:

~/.hermes/scripts/hermes_health_watchdog.py

Tracked source copy:

scripts/hermes-health-watchdog.py

Behavior:

no output on success, so the cron stays silent
sends a Telegram message only when it detects an actionable failure
checks gateway service state, Hermes cron backup visibility/status, backup repo freshness when discoverable, and root disk usage
also checks memory pressure plus critical Caddy/Gitea Docker containers (caddy, gitea-npm-registry)
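The silent-on-success pattern can be illustrated with the root disk check. This sketch does not reproduce the Python watchdog. The function name `disk_alert` and the 90 percent threshold in the usage line are hypothetical, and the parsing assumes POSIX `df -P` output.

```shell
# Print an alert line only when usage meets or exceeds the limit.
# $1 = threshold in percent; stdin = `df -P <mountpoint>` output.
disk_alert() {
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit) printf "disk %s at %s%%\n", $6, use
    }'
}

# Usage: df -P / | disk_alert 90
# No output means the check passed, so a cron wrapper stays silent.
```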

Manual smoke test:

python3 ~/.hermes/scripts/hermes_health_watchdog.py
# Healthy output should be empty.

Persistent backup timers:

systemctl status hermes-root-backup.timer uma-hermes-backup.timer --no-pager
systemctl list-timers --all --no-pager | grep 'hermes.*backup'
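Timer status shows that a job is scheduled, not that a backup recently succeeded. A freshness check on the latest backup timestamp can complement it. This is a minimal sketch: the function name `backup_is_stale`, the `$repo` path, and the 30 minute limit in the usage line are assumptions, not values taken from the watchdog.

```shell
# Succeed (exit 0) when the last backup is older than the allowed age.
# $1 = last backup time (epoch seconds), $2 = current time (epoch seconds),
# $3 = maximum age in minutes.
backup_is_stale() {
  [ $(( ($2 - $1) / 60 )) -gt "$3" ]
}

# Usage, assuming the backup repo is a git checkout at the hypothetical path "$repo":
# backup_is_stale "$(git -C "$repo" log -1 --format=%ct)" "$(date +%s)" 30 && echo "backup stale"
```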

Backup and restore drill outline

The persistent-data backup repo intentionally excludes raw secrets and state.db.

For full VM rebuild steps, use docs/hermes-disaster-recovery.md.

For break-glass recovery of raw secrets/auth/state that are excluded from GitHub backups, use:

scripts/hermes-emergency-bundle-create.sh
scripts/hermes-emergency-bundle-decrypt.sh
scripts/hermes-emergency-bundle-upload-drive.sh

Store only the encrypted .gpg bundle in Google Drive or similar private storage. Never upload the plaintext staging directory.

Automated Drive upload:

/root/.local/share/hermes-drive-uploader-venv/bin/python scripts/hermes-google-drive-oauth-login.py
systemctl status hermes-emergency-drive-upload.timer --no-pager
systemctl start hermes-emergency-drive-upload.service
journalctl -u hermes-emergency-drive-upload.service -n 80 --no-pager

Personal Google Drive requires OAuth user credentials. A service account can see shared personal folders but cannot upload because it has no personal Drive storage quota.

General one-file Drive upload:

scripts/google-drive-upload-file.sh /path/to/file --target vijay
scripts/google-drive-upload-file.sh /path/to/file --target bheem --encrypt

The general uploader refuses sensitive-looking files by default, including .env, auth tokens, private keys, SQLite DBs, and Google credential files. Use --encrypt for private files. Use --allow-sensitive only after explicit approval.
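The refusal rule can be pictured as a filename check that runs before any upload. The sketch below is illustrative only: the actual patterns inside scripts/google-drive-upload-file.sh are not shown in this runbook, so the function name `looks_sensitive` and every pattern in it are assumptions.

```shell
# Return 0 when a path looks like a secret, key, database, or credential file.
# $1 = file path; only the final path component is inspected.
looks_sensitive() {
  name="${1##*/}"
  case "$name" in
    .env|.env.*|*.pem|*.key|id_rsa*|id_ed25519*|*.sqlite|*.sqlite3|*.db|*token*|credentials*.json|client_secret*.json)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Usage:
# looks_sensitive "$file" && echo "refusing: use --encrypt" >&2
```

Matching on names is a coarse first filter. It cannot detect secrets embedded in otherwise ordinary files.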

Telegram usage pattern:

Upload the file I just sent to Vijay Google Drive. Do not print file contents. Find the local attachment path, then use scripts/google-drive-upload-file.sh with --target vijay.

Quarterly restore drill:

Run the backup sync manually or wait for a successful cron run.
Clone the backup repo into a temporary directory.

Inspect git contents for accidental raw secrets:

git grep -nE '(API_KEY|TOKEN|SECRET|PASSWORD|BEGIN .*PRIVATE KEY)' || true

Restore into a non-production Hermes profile/test directory only.
Verify config, skills, sessions JSON exports, cron definitions, memories, and scripts are present.
Confirm .env, OAuth files, SQLite WAL/SHM files, logs, caches, and raw state.db are absent.
Delete the temporary restore directory when done.
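The absence check in the drill can be scripted against the temporary restore directory. This is a sketch that covers only the file names stated above (.env, state.db, WAL/SHM files, logs). OAuth file names are not listed in this runbook, so they are left out rather than guessed. The function name `forbidden_in_restore` is hypothetical.

```shell
# List files that should never appear in a restored backup.
# $1 = restore directory; prints one offending path per line, sorted.
forbidden_in_restore() {
  find "$1" -type f \( \
    -name '.env' -o -name 'state.db' -o \
    -name '*.db-wal' -o -name '*.db-shm' -o -name '*.log' \
  \) | sort
}

# Usage: forbidden_in_restore /tmp/hermes-restore-test-root
# Empty output means none of the listed file types were found.
```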

2026-05-27 restore rehearsal:

Restored root backup into /tmp/hermes-restore-test-root.
Verified portable directories/files were present: config.yaml, skills/, sessions/, cron/, memories/, and scripts.
Verified raw state.db was absent.
Scanned restored .env template and config.yaml for common token patterns; no hits.

Upgrade checklist

Before upgrade:

hermes --version
hermes status --all
hermes config check
hermes cron list
python3 ~/.hermes/scripts/sync_hermes_persistent_backup.py

Upgrade from an interactive/private shell only:

hermes update

After upgrade:

hermes doctor --fix
hermes gateway restart
hermes --version
hermes status --all
hermes cron list
python3 ~/.hermes/scripts/hermes_health_watchdog.py

Then run Telegram smoke tests and record any manual fixups in this doc or the roadmap.

2026-05-27 late upgrade pass:

Backed up root/Uma configs and service units under /root/hermes-fix-backups/20260527-roadmap-noncreds/.
Fast-forwarded /usr/local/lib/hermes-agent to upstream 0b6ace649.
Restarted both gateways.
Verified provider smoke tests with exact responses root-roadmap-ok and uma-roadmap-ok.

Provider and tool changes

Use Hermes flows rather than editing secrets into git-tracked files:

hermes model
hermes setup model
hermes tools list
hermes tools enable <toolset>
hermes tools disable <toolset>

Restart/reset requirement:

gateway config changes: /restart from Telegram or hermes gateway restart
CLI session tool changes: start a new session or /reset
provider auth changes: start a new session after switching models/providers

Safe local Gitea Git token flow

Root Hermes has a least-privilege local Gitea Git path for repository reads:

token file: /root/.gitea_npm_token_home
askpass helper: /root/.local/bin/gitea-git-askpass
Git wrapper: /root/.local/bin/gitea-git
default username: learning_ai_user
local Gitea URL: http://localhost:3300

The token value must never be placed in a remote URL, shell history, Git config, docs, logs, or Hermes chat. The wrapper sets GIT_TERMINAL_PROMPT=0 and GIT_ASKPASS=/root/.local/bin/gitea-git-askpass; the askpass helper reads the token from the root-only token file only when Git prompts for a password.
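The askpass mechanism described above can be sketched as follows. Git invokes the program named in GIT_ASKPASS with the prompt text as its first argument ("Username for ..." or "Password for ...") and reads the answer from standard output. The real helper's contents are not reproduced in this runbook, so the function name `gitea_askpass` and the override variables `GITEA_USER` and `GITEA_TOKEN_FILE` are hypothetical.

```shell
# Answer Git credential prompts without putting the token in a URL or in history.
# $1 = prompt text supplied by Git.
gitea_askpass() {
  case "$1" in
    Username*) printf '%s\n' "${GITEA_USER:-learning_ai_user}" ;;
    Password*) cat "${GITEA_TOKEN_FILE:-/root/.gitea_npm_token_home}" ;;
    *)         return 1 ;;
  esac
}

# A wrapper would then export, as described above:
# GIT_TERMINAL_PROMPT=0 GIT_ASKPASS=/root/.local/bin/gitea-git-askpass git "$@"
```

The token is read only inside the password branch, so it never appears in the process arguments or in the remote URL.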

Safe read-only test:

/root/.local/bin/gitea-git ls-remote http://localhost:3300/bytelyst/learning_ai_common_plat.git HEAD

Hermes-safe prompt pattern:

Use the terminal tool only. Run exactly this read-only command and report only whether it succeeded and the first 12 characters of the HEAD hash: /root/.local/bin/gitea-git ls-remote http://localhost:3300/bytelyst/learning_ai_common_plat.git HEAD. Do not print any token, credential, environment variable, or file contents.
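The same truncation can be applied in the shell so that only a short hash ever reaches logs or chat. This is a minimal sketch that relies on the `git ls-remote` output format of a hash and a ref name separated by whitespace. The function name `short_head` is hypothetical.

```shell
# Print the first 12 characters of the HEAD hash from `git ls-remote` output.
# stdin = lines of "<hash> <ref>".
short_head() {
  awk '$2 == "HEAD" { print substr($1, 1, 12); exit }'
}

# Usage:
# /root/.local/bin/gitea-git ls-remote http://localhost:3300/bytelyst/learning_ai_common_plat.git HEAD | short_head
```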

Verification recorded on 2026-05-27:

local Gitea version endpoint returned 1.22.6
token file permissions are root-only
profile-read API access returned a scope denial, confirming the token is not broad enough for user-profile reads
direct wrapper test returned HEAD 59c4638f85be...
Hermes one-shot test reported success with truncated HEAD 59c4638f85be

For write operations, create a separate repo-scoped token and store it in a new root-only token file. Do not reuse this read-focused token for broad automation unless the required scope is explicitly reviewed first.

GitHub credential ownership

Root Git operations already have GitHub push credentials through the root Git credential store. Root is the operator account for both:

https://github.com/saravanakumardb/learning_ai_devops_tools.git
https://github.com/umadev0931/uma_hostinger_hermes_vm.git

Uma does not need a separate /home/uma/.git-credentials file for the current workflow because repo maintenance and pushes are performed from root. Do not copy root GitHub credentials into Uma's home directory unless there is a concrete need for Uma-user GitHub pushes.

Remaining audit item: confirm in GitHub that the root token is fine-grained or otherwise limited to the intended repos and permissions. Do not print the token while checking this.

Telegram topics and session handling

Root and Uma currently use the standard Telegram gateway session handling. Do not enable or change topic/session behavior without a concrete routing need.

Review these before changing Telegram routing:

systemctl status hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
grep -RniE 'topic|thread|TELEGRAM_.*THREAD|HOME_CHANNEL' /root/.hermes /home/uma/.hermes 2>/dev/null | head -100

Multi-agent execution conventions

Use the smallest execution surface that fits the task:

direct tool call: one-shot local checks, edits, commits, pushes, status reads
delegate_task: bounded research or code inspection that can return inside the parent session
spawned Hermes/tmux session: long-running mission that must outlive the parent turn
background terminal process: long-running local commands that need monitoring
cron job: recurring, deterministic, silent-on-success maintenance
worktree: independent coding agent branch space when tasks can overlap
Kanban worker: durable multi-agent project coordination after the board is intentionally configured

Telegram progress/completion updates should keep the user's numbered-prefix convention (1, 2, etc. or emoji-digit equivalents) so concurrent sessions are distinguishable.

Workflow skills and memory hygiene

Repeated operational procedures should be turned into skills instead of being kept as long-lived memories.

Pinned skills that should stay available:

devops/self-hosted-gitea-ci
devops/caddy-subdomain-routing
devops/hermes-persistent-backup-ops
devops/hermes-gateway-operations
safe multi-repo commit/push workflow

Memory hygiene policy:

keep memories declarative and durable
trim stale or task-completion artifacts before they accumulate
review persistent memories and recurring workflow skills on a manual maintenance pass
if curator reviews are enabled, run them on a regular cadence rather than letting them drift

Safe multi-repo commit and push

Root is the operator for both the root and Uma tracking repos.

Safe sequence:

Work in the target repo only.
Run the repo's tests or checks before committing.
Commit the smallest coherent change.
Push from root using the already-approved GitHub credential path.
Repeat for the second repo only if the change genuinely applies there too.

Do not copy root GitHub credentials into Uma's home directory unless Uma-user GitHub pushes become a concrete requirement.
