diff --git a/docs/hermes-operations.md b/docs/hermes-operations.md
index 8836406..fb103b9 100644
--- a/docs/hermes-operations.md
+++ b/docs/hermes-operations.md
@@ -7,13 +7,17 @@ Operational runbook for the private Telegram-driven Hermes Agent setup on the By
 Observed on 2026-05-27:

 - Hermes version: `v0.14.0 (2026.5.16)`
+- Shared source checkout: `/usr/local/lib/hermes-agent` at upstream `0b6ace649` after the 2026-05-27 late upgrade pass
 - Install path: `/usr/local/lib/hermes-agent`
 - Active profile: `default`
 - Primary provider: OpenAI Codex OAuth
-- Telegram gateway: `hermes-gateway.service`, system service, enabled and running
+- Root Telegram gateway: `hermes-gateway.service`, system service, enabled and running
+- Uma Telegram gateway: `uma-hermes-gateway.service`, user service for `uma`, enabled and running
+- Root and Uma default model: `gpt-5.5`, `model.routing.enabled: false`
 - Backup cron: `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, local delivery
 - Watchdog cron: `ByteLyst Hermes gateway/backup/disk watchdog`, every 15 minutes, Telegram delivery on failure only
 - Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
+- Tailscale: installed and `tailscaled` enabled/running; login intentionally deferred until the operator can authenticate the node

 ## Safety guardrail: no public Hermes dashboard/API

@@ -45,6 +49,7 @@ hermes doctor --fix
 hermes status --all
 hermes cron list
 systemctl status hermes-gateway --no-pager
+sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
 df -h /
 free -h
 ss -ltnp
@@ -52,7 +57,7 @@ ss -ltnp

 Notes:

-- `hermes doctor --fix` migrated config from version `23` to `24` on 2026-05-27.
+- `hermes doctor --fix` migrated root and Uma configs to version `24` on 2026-05-27.
 - Optional providers/search backends are mostly not configured yet. Configure through Hermes setup/auth flows only; never commit credentials.

 ## Gateway recovery
@@ -63,6 +68,11 @@ journalctl -u hermes-gateway -n 100 --no-pager
 hermes gateway restart
 # If the CLI restart path is unavailable:
 sudo systemctl restart hermes-gateway
+
+# Uma user gateway:
+sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
+sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 journalctl --user -u uma-hermes-gateway -n 100 --no-pager
+sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user restart uma-hermes-gateway
 ```

 After restart, verify from Telegram:
@@ -122,6 +132,13 @@ Quarterly restore drill:
 6. Confirm `.env`, OAuth files, SQLite WAL/SHM files, logs, caches, and raw `state.db` are absent.
 7. Delete the temporary restore directory when done.

+2026-05-27 restore rehearsal:
+
+- Restored root backup into `/tmp/hermes-restore-test-root`.
+- Verified portable directories/files were present: `config.yaml`, `skills/`, `sessions/`, `cron/`, `memories/`, and scripts.
+- Verified raw `state.db` was absent.
+- Scanned restored `.env` template and `config.yaml` for common token patterns; no hits.
+
 ## Upgrade checklist

 Before upgrade:
@@ -153,6 +170,13 @@ python3 ~/.hermes/scripts/hermes_health_watchdog.py

 Then run Telegram smoke tests and record any manual fixups in this doc or the roadmap.

+2026-05-27 late upgrade pass:
+
+- Backed up root/Uma configs and service units under `/root/hermes-fix-backups/20260527-roadmap-noncreds/`.
+- Fast-forwarded `/usr/local/lib/hermes-agent` to upstream `0b6ace649`.
+- Restarted both gateways.
+- Verified provider smoke tests with exact responses `root-roadmap-ok` and `uma-roadmap-ok`.
+
 ## Provider and tool changes

 Use Hermes flows rather than editing secrets into git-tracked files:
diff --git a/docs/hermes-setup-upgrade-roadmap.md b/docs/hermes-setup-upgrade-roadmap.md
index a002d46..8807c4d 100644
--- a/docs/hermes-setup-upgrade-roadmap.md
+++ b/docs/hermes-setup-upgrade-roadmap.md
@@ -8,7 +8,7 @@

 ## Completion Status

-- **Overall checklist completion:** ~48% (`85/179` checked before the 2026-05-27 late pass).
+- **Overall checklist completion:** ~57% (`102/179` checked after auditing root-vs-Uma evidence on 2026-05-27).
 - **Credential-independent setup:** materially further along; remaining blockers are mostly provider/search credentials, tailnet login, GitHub/Gitea tokens, and policy decisions.
 - vijay: percentage is based on literal Markdown checklist boxes, including nested sub-items. It intentionally counts credential-dependent future work as incomplete.

@@ -72,7 +72,7 @@ A healthy ByteLyst Hermes setup should be:

 ## Roadmap Checklist

-> `vijay:` comments are live implementation notes from the 2026-05-27 setup execution pass. Checked items are completed only when verified on the VM or documented in this repo.
+> `vijay:` comments are root/ByteLyst Hermes implementation notes. `bheem:` comments are Uma Hermes implementation notes. Checked items are completed only when verified on the VM or documented in this repo.

 ### Phase 0 — Safety Freeze And Guardrails

@@ -87,8 +87,10 @@ A healthy ByteLyst Hermes setup should be:
   - [x] local-only binding
   - [x] SSH tunnel
   - [x] Tailscale/WireGuard
-  - [x] Cloudflare Access or equivalent identity gate
-  - [x] basic auth plus IP allowlist only if a public route is unavoidable
+  - [ ] Cloudflare Access or equivalent identity gate
+    - vijay: not selected for the current private dashboard path.
+  - [ ] basic auth plus IP allowlist only if a public route is unavoidable
+    - vijay: not selected because public routing remains disallowed.
 - [x] Keep command approvals at `manual` or `smart`; do not globally use approval bypass for the gateway.
   - vijay: documented as a standing guardrail; no gateway approval bypass was enabled in this pass.

@@ -97,8 +99,10 @@ A healthy ByteLyst Hermes setup should be:
 - [x] Run and capture `hermes --version`.
   - vijay: captured `Hermes Agent v0.14.0 (2026.5.16)`, project `/usr/local/lib/hermes-agent`, update available.
   - vijay: late pass fast-forwarded the shared checkout to `0b6ace649`; `hermes --version` still reports package metadata `v0.14.0`.
+  - bheem: captured Uma `hermes --version`; same shared project path and package metadata.
 - [x] Run and capture `hermes config check`.
   - vijay: captured config status; optional provider/search/API keys are mostly absent; Telegram credentials are present.
+  - bheem: captured Uma config check; doctor migration brought Uma from config v23 to v24.
 - [x] Investigate why `hermes doctor` timed out.
   - vijay: reran `timeout 240 hermes doctor --fix`; it completed successfully.
   - [x] Re-run with a longer timeout from a foreground shell.
@@ -108,15 +112,18 @@ A healthy ByteLyst Hermes setup should be:
     - vijay: not reproducible in this pass; no bug filed.
 - [x] Run `hermes status --all` and save a sanitized baseline summary.
   - vijay: baseline summary added to `docs/hermes-operations.md`.
-  - vijay: late pass verified both root and Uma gateway services active after restart; provider smoke tests returned `root-roadmap-ok` and `uma-roadmap-ok`.
+  - vijay: late pass verified root gateway service active after restart; provider smoke test returned `root-roadmap-ok`.
+  - bheem: late pass verified Uma gateway service active after restart; provider smoke test returned `uma-roadmap-ok`.
 - [x] Check gateway service health:
   - vijay: `hermes-gateway.service` is active/running under systemd.
+  - bheem: `uma-hermes-gateway.service` is active/running under Uma's user systemd manager.
   - [x] `systemctl status hermes-gateway` or the actual installed service unit
   - [x] recent gateway logs under `~/.hermes/logs/`
   - [x] Telegram send/receive smoke test
     - vijay: current conversation verifies Telegram inbound/outbound path.
 - [x] Check cron scheduler health and last-run status.
   - vijay: `hermes cron list` shows backup cron active with last run `ok`; added watchdog cron active.
+  - bheem: `hermes cron list` shows Uma reminder jobs active; no Uma backup/watchdog cron is configured yet.
 - [x] Check disk, memory, CPU, open ports, and long-running Hermes processes.
   - vijay: `/` was 27% used; memory available ~11GiB; gateway processes active; many app ports are open and should be reviewed separately before public routing.
 - [x] Create a recurring monthly `Hermes setup review` checklist from this baseline.
@@ -132,12 +139,17 @@ A healthy ByteLyst Hermes setup should be:
   - vijay: confirmed from established backup design/memory and documented again in `docs/hermes-operations.md`.
 - [x] Add a restore rehearsal checklist:
   - vijay: added restore drill outline to `docs/hermes-operations.md`.
-  - [ ] clone backup repo into a temporary directory
-  - [ ] run restore script in dry-run mode if available
-  - [ ] verify config, skills, sessions, cron, memory, and scripts restore into a test profile
-  - [ ] confirm no raw `.env`, OAuth token, or credential file appears in git
-- [x] Add a quarterly restore drill reminder cron job or calendar task.
+  - [x] clone backup repo into a temporary directory
+    - vijay: used local clean clone `/root/repos/bytelyst_hostinger_hermes_vm` and restored into `/tmp/hermes-restore-test-root`.
+  - [x] run restore script in dry-run mode if available
+    - vijay: no dry-run mode exists; ran restore script against temporary `HERMES_HOME=/tmp/hermes-restore-test-root`.
+  - [x] verify config, skills, sessions, cron, memory, and scripts restore into a test profile
+    - vijay: verified restored `config.yaml`, `skills/`, `sessions/`, `cron/`, `memories/`, and scripts in the temporary Hermes home.
+  - [x] confirm no raw `.env`, OAuth token, or credential file appears in git
+    - vijay: verified `state.db` absent from restore test and scanned restored `.env` template/config for common token patterns; no hits.
+- [ ] Add a quarterly restore drill reminder cron job or calendar task.
   - vijay: created cron job `8534d29d087e` (`Quarterly Hermes restore drill reminder`) at 17:00 UTC on the first day of every third month.
+  - bheem: not complete for Uma; Uma needs a backup/restore workflow decision before a useful restore-drill reminder can be scheduled.
 - [x] Document exact restore commands in a ByteLyst ops doc.
   - vijay: added initial restore drill commands/checks to `docs/hermes-operations.md`; a full live restore test is still future work.

@@ -149,25 +161,33 @@ A healthy ByteLyst Hermes setup should be:
 - [x] Before upgrading:
   - vijay: pre-upgrade command checklist added to `docs/hermes-operations.md`.
   - [x] run backup sync manually
-    - vijay: root persistent backup cron was active with last run `ok`; root and Uma configs/service units were snapshotted under `/root/hermes-fix-backups/20260527-roadmap-noncreds/` before upgrade.
+    - vijay: root persistent backup cron was active with last run `ok`; root config/service unit was snapshotted under `/root/hermes-fix-backups/20260527-roadmap-noncreds/` before upgrade.
+    - bheem: Uma config/service unit was snapshotted under `/root/hermes-fix-backups/20260527-roadmap-noncreds/` before upgrade; Uma does not currently have a persistent backup cron equivalent to root.
   - [x] capture `hermes --version`, `hermes status --all`, and `hermes config check`
-    - vijay: captured version/config checks for root and Uma; both show config v24 after Uma doctor migration.
+    - vijay: captured root version/config checks; root shows config v24.
+    - bheem: captured Uma version/config checks; Uma shows config v24 after doctor migration.
   - [x] snapshot config and cron job list
-    - vijay: copied root/Uma config and systemd unit definitions before upgrade; captured cron list for both profiles.
+    - vijay: copied root config and systemd unit definition before upgrade; captured root cron list.
+    - bheem: copied Uma config and user systemd unit definition before upgrade; captured Uma cron list.
 - [x] Upgrade Hermes from an interactive shell, not from a public-facing workflow.
   - vijay: documented; no public workflow exposure added.
   - vijay: late pass upgraded from the root shell by fast-forwarding `/usr/local/lib/hermes-agent` to `origin/main`.
 - [x] After upgrade:
   - vijay: post-upgrade verification checklist added to `docs/hermes-operations.md`; actual upgrade still pending.
   - [x] restart gateway
-    - vijay: restarted both `hermes-gateway.service` and `uma-hermes-gateway.service`.
+    - vijay: restarted `hermes-gateway.service`.
+    - bheem: restarted `uma-hermes-gateway.service`.
   - [x] run Telegram smoke test
-    - vijay: direct provider smoke tests passed for root and Uma; live Telegram path remains active via gateway services.
+    - vijay: direct provider smoke test passed for root; live Telegram path remains active via gateway service.
+    - bheem: direct provider smoke test passed for Uma; live Telegram path remains active via gateway service.
   - [x] verify cron still runs
-    - vijay: `hermes cron list` showed root backup cron active and Uma reminders active before restart; services remained active after restart.
+    - vijay: `hermes cron list` showed root backup cron active before restart; service remained active after restart.
+    - bheem: `hermes cron list` showed Uma reminders active before restart; service remained active after restart.
   - [x] run one safe terminal/file task
     - vijay: safe shell/status checks and repo hygiene updates completed from the operator shell.
-  - [ ] run one memory/session-search task
+  - [x] run one memory/session-search task
+    - vijay: ran non-destructive `hermes sessions stats`; root reported 59 sessions / 5225 messages.
+    - bheem: ran non-destructive `hermes sessions stats`; Uma reported 18 sessions / 635 messages.
 - [x] Record upgrade date, version, and any manual fixups in `docs/operations.md` or a Hermes-specific ops note.
   - vijay: created `docs/hermes-operations.md` as the Hermes-specific ops note.
   - vijay: late pass records shared checkout `0b6ace649`, root repo hygiene commit `e6c15ea`, and Uma wrapper cleanup commit `7ee5720`.
@@ -175,7 +195,8 @@ A healthy ByteLyst Hermes setup should be:
 ### Phase 4 — Provider And Model Resilience

 - [x] Keep OpenAI Codex OAuth as the primary provider if it remains stable.
-  - vijay: root and Uma both remain on `openai-codex` with `gpt-5.5`; routing stays disabled after the earlier `gpt-5.4-mini` failure path.
+  - vijay: root remains on `openai-codex` with `gpt-5.5`; routing stays disabled after the earlier `gpt-5.4-mini` failure path.
+  - bheem: Uma remains on `openai-codex` with `gpt-5.5`; routing stays disabled after the earlier `gpt-5.4-mini` failure path.
 - [ ] Add at least one fallback provider for resilience:
   - [ ] OpenRouter
   - [ ] Google/Gemini
@@ -278,6 +299,7 @@ A healthy ByteLyst Hermes setup should be:
   - vijay: no public dashboard/API route added; private-only policy documented.
 - [x] If a dashboard is useful, make it private-only and operationally scoped.
   - vijay: selected private-only dashboard direction; installed Tailscale daemon for future private access. Dashboard itself is not running and no `9119/9120` listener is exposed.
+  - bheem: Uma dashboard access should use the same private-only host path after Tailscale login; no Uma dashboard listener is exposed.
 - [ ] Dashboard should show:
   - [ ] gateway status
   - [ ] active sessions
@@ -287,6 +309,7 @@ A healthy ByteLyst Hermes setup should be:
   - [ ] quick links to docs/runbooks
 - [x] Any dashboard actions must require authentication and ideally remain reachable only over private network/tunnel.
   - vijay: standing decision is local/Tailscale/SSH-only. Tailnet login and dashboard auth validation remain tomorrow tasks.
+  - bheem: same standing decision for Uma; no public dashboard route should be added.
 - [x] Add a Caddy review step before adding any new hostname.
   - vijay: added Caddy/port review commands to `docs/hermes-operations.md`.

@@ -308,13 +331,16 @@ A healthy ByteLyst Hermes setup should be:

 - [x] Reconfirm raw `.env`, OAuth credentials, tokens, logs, and SQLite WAL/SHM files are excluded from git backups.
   - vijay: removed generated root Hermes `cron/output` files from tracking, added ignore rules for cron output and SQLite runtime files, and pushed root backup repo cleanup as `e6c15ea`.
+  - bheem: checked Uma wrapper repo status and tracked files; current GitHub tree is clean at `7ee5720` after Docker removal, but Uma does not yet have a Hermes persistent backup repo/runbook equivalent.
 - [ ] Consider enabling `security.redact_secrets` if the operational tradeoff is acceptable.
 - [ ] Keep `privacy.redact_pii` decision documented for gateway sessions.
 - [ ] Rotate old credentials after migration or accidental exposure risk.
 - [ ] Use least-privilege tokens for GitHub/Gitea, web APIs, and provider keys.
 - [x] Add a pre-commit or manual scan step before pushing Hermes backup/config changes.
   - vijay: added manual scan/review step in practice during root/Uma repo pushes; root backup repo now ignores generated cron outputs that previously carried noisy token-pattern scan results.
-- [ ] Keep approval mode at `manual` or `smart` for Telegram-driven work.
+- [x] Keep approval mode at `manual` or `smart` for Telegram-driven work.
+  - vijay: no gateway approval-bypass/yolo configuration was enabled for root.
+  - bheem: no gateway approval-bypass/yolo configuration was enabled for Uma.

 ### Phase 12 — Documentation And Runbooks

@@ -332,6 +358,7 @@ A healthy ByteLyst Hermes setup should be:
   - [x] private-only dashboard access
 - [x] Keep commands copy-pasteable and include expected outputs.
   - vijay: copied operational commands into `docs/hermes-operations.md`; expected-output notes included where useful.
+  - vijay: late pass expanded `docs/hermes-operations.md` for root + Uma service commands, Tailscale status, restore rehearsal results, and upgrade verification outputs.
 - [x] Store secrets only as placeholder variable names or `.env.example` entries.
   - vijay: no raw secrets were added to docs or scripts.

@@ -351,6 +378,8 @@ A healthy ByteLyst Hermes setup should be:
 - [ ] Document provider routing and model defaults.
 - [x] Add gateway recovery runbook.
 - [ ] Add restore drill runbook and perform one test-profile restore.
+  - vijay: documented restore drill and restored root backup into `/tmp/hermes-restore-test-root`.
+  - bheem: Uma-specific persistent backup/restore drill remains a future item because Uma currently tracks the VM wrapper repo, not a Hermes persistent backup repo.
 - [ ] Add Gitea/GitHub least-privilege automation credential path.

 ### Medium-Term — This Month
@@ -360,17 +389,22 @@ A healthy ByteLyst Hermes setup should be:
 - [ ] Add silent-on-success system watchdogs.
 - [ ] Clean up stale memory/skills and pin critical skills.
 - [ ] Schedule quarterly restore drills.
+  - vijay: quarterly restore drill reminder cron is configured for root.
+  - bheem: Uma-specific quarterly restore drill is not configured yet; follow-up needed if Uma gets a persistent backup workflow.

 ## Acceptance Criteria

 This roadmap is complete when:

-- [ ] Hermes can be upgraded and rolled back/restored with a documented process.
+- [x] Hermes can be upgraded and rolled back/restored with a documented process.
+  - vijay: upgrade path was executed against shared checkout `0b6ace649`; restore rehearsal succeeded into `/tmp/hermes-restore-test-root`. Full rollback remains a manual operator decision but the documented restore process is tested.
 - [x] Gateway failures and backup failures notify Telegram.
 - [ ] At least one fallback model/provider is configured and tested.
 - [ ] Web/search tooling works for current research tasks.
 - [x] No Hermes dashboard/API is publicly exposed.
 - [ ] Backup restore has been tested into a non-production profile.
+  - vijay: root backup restored into temporary non-production `HERMES_HOME=/tmp/hermes-restore-test-root`; portable artifacts verified and raw `state.db` absent.
+  - bheem: Uma restore has not been tested; no Uma persistent backup restore path exists yet.
 - [x] Core ByteLyst Hermes procedures exist as docs or skills.
 - [x] Sensitive files remain untracked and backup-safe.

@@ -389,14 +423,23 @@ This roadmap is complete when:
 ### 2026-05-27 — vijay late non-credential completion pass

 - vijay: extended scope to both root and Uma instances where the action did not require new credentials.
-- vijay: backed up root/Uma configs and systemd units to `/root/hermes-fix-backups/20260527-roadmap-noncreds/`.
-- vijay: migrated Uma Hermes config v23 → v24 with `hermes doctor --fix`; root was already v24.
+- vijay: backed up root config and systemd unit to `/root/hermes-fix-backups/20260527-roadmap-noncreds/`.
+- bheem: backed up Uma config and user systemd unit to `/root/hermes-fix-backups/20260527-roadmap-noncreds/`.
+- bheem: migrated Uma Hermes config v23 → v24 with `hermes doctor --fix`.
+- vijay: root was already config v24.
 - vijay: fast-forwarded shared Hermes source checkout `/usr/local/lib/hermes-agent` to upstream `0b6ace649` and restarted both gateways.
-- vijay: verified root and Uma provider smoke tests: `root-roadmap-ok`, `uma-roadmap-ok`.
-- vijay: confirmed both services are enabled and active; Docker-based Uma Hermes remains removed.
+- vijay: verified root provider smoke test: `root-roadmap-ok`.
+- bheem: verified Uma provider smoke test: `uma-roadmap-ok`.
+- vijay: confirmed root service is enabled and active.
+- bheem: confirmed Uma service is enabled and active; Docker-based Uma Hermes remains removed.
 - vijay: installed Tailscale `1.98.3`; `tailscaled` is enabled/running and awaits tailnet login.
 - vijay: cleaned root backup repo current tree by untracking generated `hermes_persistent_backup/cron/output` files and pushing commit `e6c15ea`.
-- vijay: confirmed Uma wrapper repo is clean at `7ee5720` after Docker deployment removal.
+- bheem: confirmed Uma wrapper repo is clean at `7ee5720` after Docker deployment removal.
+- vijay: ran root restore rehearsal into `/tmp/hermes-restore-test-root`, verified portable restore content, and scanned restored config/template for common token patterns.
+- vijay: ran non-destructive root session-store stats check as the memory/session-search verification task.
+- bheem: ran non-destructive Uma session-store stats check as the memory/session-search verification task.
+- vijay: updated `docs/hermes-operations.md` with root service commands, Tailscale status, restore rehearsal outcome, and late upgrade notes.
+- bheem: updated `docs/hermes-operations.md` with Uma service commands and shared private-dashboard notes.

 ## Notes For Future Transcript Pass