docs: advance Hermes setup roadmap
Some checks are pending
pre-commit / pre-commit (push) Waiting to run
Some checks are pending
pre-commit / pre-commit (push) Waiting to run
This commit is contained in:
parent
b00af09942
commit
e57038a6a2
172
docs/hermes-operations.md
Normal file
172
docs/hermes-operations.md
Normal file
@ -0,0 +1,172 @@
|
||||
# ByteLyst Hermes Operations Runbook
|
||||
|
||||
Operational runbook for the private Telegram-driven Hermes Agent setup on the ByteLyst VM.
|
||||
|
||||
## Current baseline
|
||||
|
||||
Observed on 2026-05-27:
|
||||
|
||||
- Hermes version: `v0.14.0 (2026.5.16)`
|
||||
- Install path: `/usr/local/lib/hermes-agent`
|
||||
- Active profile: `default`
|
||||
- Primary provider: OpenAI Codex OAuth
|
||||
- Telegram gateway: `hermes-gateway.service`, system service, enabled and running
|
||||
- Backup cron: `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, local delivery
|
||||
- Watchdog cron: `ByteLyst Hermes gateway/backup/disk watchdog`, every 15 minutes, Telegram delivery on failure only
|
||||
- Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
|
||||
|
||||
## Safety guardrail: no public Hermes dashboard/API
|
||||
|
||||
Before adding any new Caddy hostname, Docker port, or dashboard/API feature, verify that it is not a Hermes dashboard/API public exposure.
|
||||
|
||||
```bash
|
||||
# Inspect public Caddy routes and obvious Hermes/API/dashboard references.
|
||||
docker ps --format '{{.Names}} {{.Ports}}' | grep -i caddy || true
|
||||
grep -RniE 'hermes|dashboard|api-server|API_SERVER|8000|8080|3000|5173' /etc/caddy /root/bytelyst.ai 2>/dev/null | head -100
|
||||
|
||||
# Inspect listening ports. Review any 0.0.0.0 listeners before exposing a hostname.
|
||||
ss -ltnp
|
||||
```
|
||||
|
||||
Allowed private access patterns for a future Hermes dashboard:
|
||||
|
||||
1. local-only binding (`127.0.0.1`)
|
||||
2. SSH tunnel
|
||||
3. Tailscale/WireGuard private network
|
||||
4. Cloudflare Access or equivalent identity gate
|
||||
5. basic auth plus IP allowlist only if public routing is unavoidable and explicitly approved
|
||||
|
||||
## Health baseline commands
|
||||
|
||||
```bash
|
||||
hermes --version
|
||||
hermes config check
|
||||
hermes doctor --fix
|
||||
hermes status --all
|
||||
hermes cron list
|
||||
systemctl status hermes-gateway --no-pager
|
||||
df -h /
|
||||
free -h
|
||||
ss -ltnp
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- `hermes doctor --fix` migrated config from version `23` to `24` on 2026-05-27.
|
||||
- Optional providers/search backends are mostly not configured yet. Configure through Hermes setup/auth flows only; never commit credentials.
|
||||
|
||||
## Gateway recovery
|
||||
|
||||
```bash
|
||||
systemctl status hermes-gateway --no-pager
|
||||
journalctl -u hermes-gateway -n 100 --no-pager
|
||||
hermes gateway restart
|
||||
# If the CLI restart path is unavailable:
|
||||
sudo systemctl restart hermes-gateway
|
||||
```
|
||||
|
||||
After restart, verify from Telegram:
|
||||
|
||||
- inbound message receives a response
|
||||
- outbound completion messages work
|
||||
- approval prompts still reach the allowed user
|
||||
- media/file delivery works for a known safe file if needed
|
||||
|
||||
## Cron and watchdogs
|
||||
|
||||
List jobs:
|
||||
|
||||
```bash
|
||||
hermes cron list
|
||||
```
|
||||
|
||||
Current watchdog script:
|
||||
|
||||
```bash
|
||||
~/.hermes/scripts/hermes_health_watchdog.py
|
||||
```
|
||||
|
||||
Tracked source copy:
|
||||
|
||||
```bash
|
||||
scripts/hermes-health-watchdog.py
|
||||
```
|
||||
|
||||
Behavior:
|
||||
|
||||
- no output on success, so the cron stays silent
|
||||
- sends a Telegram message only when it detects an actionable failure
|
||||
- checks gateway service state, Hermes cron backup visibility/status, backup repo freshness when discoverable, and root disk usage
|
||||
|
||||
Manual smoke test:
|
||||
|
||||
```bash
|
||||
python3 ~/.hermes/scripts/hermes_health_watchdog.py
|
||||
# Healthy output should be empty.
|
||||
```
|
||||
|
||||
## Backup and restore drill outline
|
||||
|
||||
The persistent-data backup repo intentionally excludes raw secrets and `state.db`.
|
||||
|
||||
Quarterly restore drill:
|
||||
|
||||
1. Run the backup sync manually or wait for a successful cron run.
|
||||
2. Clone the backup repo into a temporary directory.
|
||||
3. Inspect git contents for accidental raw secrets:
|
||||
```bash
|
||||
git grep -nE '(API_KEY|TOKEN|SECRET|PASSWORD|BEGIN .*PRIVATE KEY)' || true
|
||||
```
|
||||
4. Restore into a non-production Hermes profile/test directory only.
|
||||
5. Verify config, skills, sessions JSON exports, cron definitions, memories, and scripts are present.
|
||||
6. Confirm `.env`, OAuth files, SQLite WAL/SHM files, logs, caches, and raw `state.db` are absent.
|
||||
7. Delete the temporary restore directory when done.
|
||||
|
||||
## Upgrade checklist
|
||||
|
||||
Before upgrade:
|
||||
|
||||
```bash
|
||||
hermes --version
|
||||
hermes status --all
|
||||
hermes config check
|
||||
hermes cron list
|
||||
python3 ~/.hermes/scripts/sync_hermes_persistent_backup.py
|
||||
```
|
||||
|
||||
Upgrade from an interactive/private shell only:
|
||||
|
||||
```bash
|
||||
hermes update
|
||||
```
|
||||
|
||||
After upgrade:
|
||||
|
||||
```bash
|
||||
hermes doctor --fix
|
||||
hermes gateway restart
|
||||
hermes --version
|
||||
hermes status --all
|
||||
hermes cron list
|
||||
python3 ~/.hermes/scripts/hermes_health_watchdog.py
|
||||
```
|
||||
|
||||
Then run Telegram smoke tests and record any manual fixups in this doc or the roadmap.
|
||||
|
||||
## Provider and tool changes
|
||||
|
||||
Use Hermes flows rather than editing secrets into git-tracked files:
|
||||
|
||||
```bash
|
||||
hermes model
|
||||
hermes setup model
|
||||
hermes tools list
|
||||
hermes tools enable <toolset>
|
||||
hermes tools disable <toolset>
|
||||
```
|
||||
|
||||
Restart/reset requirement:
|
||||
|
||||
- gateway config changes: `/restart` from Telegram or `hermes gateway restart`
|
||||
- CLI session tool changes: start a new session or `/reset`
|
||||
- provider auth changes: start a new session after switching models/providers
|
||||
@ -1,6 +1,7 @@
|
||||
# Hermes Setup Upgrade Roadmap
|
||||
|
||||
**Date:** 2026-05-26
|
||||
**Execution update:** 2026-05-27
|
||||
**Owner:** ByteLyst / S
|
||||
**Repo:** `bytelyst-devops-tools`
|
||||
**Video reference:** [Hermes Agent is the greatest AI tool ever made. Here's how to set it up](https://youtu.be/RoBD7Lc-0MI) by Alex Finn
|
||||
@ -36,20 +37,20 @@ If a manual transcript is later pasted or uploaded, re-run this review and appen
|
||||
|
||||
Observed on 2026-05-26:
|
||||
|
||||
- Hermes version: `v0.14.0 (2026.5.16)`
|
||||
- Hermes version: `v0.14.0 (2026.5.16)`; `hermes --version` reports an update available (8 commits behind)
|
||||
- Project path: `/usr/local/lib/hermes-agent`
|
||||
- Active model/provider: `gpt-5.4` via OpenAI Codex OAuth
|
||||
- Active model/provider: `gpt-5.5` via OpenAI Codex OAuth
|
||||
- Telegram gateway: configured and running under systemd
|
||||
- Scheduled jobs: `1 active, 1 total`
|
||||
- Scheduled jobs: `2 active, 2 total`
|
||||
- `Sync Hermes persistent-data backup to GitHub`
|
||||
- schedule: every 30 minutes
|
||||
- delivery: local
|
||||
- script: `sync_hermes_persistent_backup.py`
|
||||
- last status: ok
|
||||
- Config version: `23`
|
||||
- Config version: `24` after `hermes doctor --fix` migration on 2026-05-27
|
||||
- Telegram credentials are present
|
||||
- Most optional provider/API keys are not configured, including OpenRouter, Google/Gemini, Anthropic, Firecrawl/Tavily/Exa, Browserbase/Browser Use, GitHub token, FAL, and ElevenLabs
|
||||
- `hermes doctor` timed out during this review and needs a dedicated diagnostic pass
|
||||
- `hermes doctor --fix` completed on 2026-05-27; it migrated config v23 → v24 and left only manual provider/API-key setup as the main optional follow-up
|
||||
- User preference: do **not** expose the Hermes dashboard publicly
|
||||
|
||||
## Target State
|
||||
@ -65,64 +66,92 @@ A healthy ByteLyst Hermes setup should be:
|
||||
|
||||
## Roadmap Checklist
|
||||
|
||||
> `vijay:` comments are live implementation notes from the 2026-05-27 setup execution pass. Checked items are completed only when verified on the VM or documented in this repo.
|
||||
|
||||
### Phase 0 — Safety Freeze And Guardrails
|
||||
|
||||
- [ ] Confirm no Caddy route exposes a Hermes dashboard or Hermes API server publicly.
|
||||
- [ ] Add a negative-control check to operational docs: `Hermes dashboard/API must not be public without explicit approval`.
|
||||
- [ ] Verify firewall/Caddy routes for any hostnames pointing to Hermes ports.
|
||||
- [x] Confirm no Caddy route exposes a Hermes dashboard or Hermes API server publicly.
|
||||
- vijay: searched Caddy/runtime references for Hermes/dashboard/API exposure on 2026-05-27; no public Hermes dashboard/API route was found.
|
||||
- [x] Add a negative-control check to operational docs: `Hermes dashboard/API must not be public without explicit approval`.
|
||||
- vijay: added the hard rule and copy-paste checks to `docs/hermes-operations.md` and linked it from `docs/operations.md`.
|
||||
- [x] Verify firewall/Caddy routes for any hostnames pointing to Hermes ports.
|
||||
- vijay: reviewed current listeners and Caddy references; no Hermes-specific public hostname was identified. Re-run before adding any new route.
|
||||
- [ ] Decide private access pattern for any future dashboard:
|
||||
- [ ] local-only binding
|
||||
- [ ] SSH tunnel
|
||||
- [ ] Tailscale/WireGuard
|
||||
- [ ] Cloudflare Access or equivalent identity gate
|
||||
- [ ] basic auth plus IP allowlist only if a public route is unavoidable
|
||||
- [ ] Keep command approvals at `manual` or `smart`; do not globally use approval bypass for the gateway.
|
||||
- [x] local-only binding
|
||||
- [x] SSH tunnel
|
||||
- [x] Tailscale/WireGuard
|
||||
- [x] Cloudflare Access or equivalent identity gate
|
||||
- [x] basic auth plus IP allowlist only if a public route is unavoidable
|
||||
- [x] Keep command approvals at `manual` or `smart`; do not globally use approval bypass for the gateway.
|
||||
- vijay: documented as a standing guardrail; no gateway approval bypass was enabled in this pass.
|
||||
|
||||
### Phase 1 — Health Baseline And Diagnostics
|
||||
|
||||
- [ ] Run and capture `hermes --version`.
|
||||
- [ ] Run and capture `hermes config check`.
|
||||
- [ ] Investigate why `hermes doctor` timed out.
|
||||
- [ ] Re-run with a longer timeout from a foreground shell.
|
||||
- [ ] If still hanging, isolate the step by checking logs and dependencies.
|
||||
- [ ] File or fix a Hermes bug if the timeout is reproducible.
|
||||
- [ ] Run `hermes status --all` and save a sanitized baseline summary.
|
||||
- [ ] Check gateway service health:
|
||||
- [ ] `systemctl status hermes-gateway` or the actual installed service unit
|
||||
- [ ] recent gateway logs under `~/.hermes/logs/`
|
||||
- [ ] Telegram send/receive smoke test
|
||||
- [ ] Check cron scheduler health and last-run status.
|
||||
- [ ] Check disk, memory, CPU, open ports, and long-running Hermes processes.
|
||||
- [ ] Create a recurring monthly `Hermes setup review` checklist from this baseline.
|
||||
- [x] Run and capture `hermes --version`.
|
||||
- vijay: captured `Hermes Agent v0.14.0 (2026.5.16)`, project `/usr/local/lib/hermes-agent`, update available.
|
||||
- [x] Run and capture `hermes config check`.
|
||||
- vijay: captured config status; optional provider/search/API keys are mostly absent; Telegram credentials are present.
|
||||
- [x] Investigate why `hermes doctor` timed out.
|
||||
- vijay: reran `timeout 240 hermes doctor --fix`; it completed successfully.
|
||||
- [x] Re-run with a longer timeout from a foreground shell.
|
||||
- [x] If still hanging, isolate the step by checking logs and dependencies.
|
||||
- vijay: not needed after longer foreground run succeeded.
|
||||
- [x] File or fix a Hermes bug if the timeout is reproducible.
|
||||
- vijay: not reproducible in this pass; no bug filed.
|
||||
- [x] Run `hermes status --all` and save a sanitized baseline summary.
|
||||
- vijay: baseline summary added to `docs/hermes-operations.md`.
|
||||
- [x] Check gateway service health:
|
||||
- vijay: `hermes-gateway.service` is active/running under systemd.
|
||||
- [x] `systemctl status hermes-gateway` or the actual installed service unit
|
||||
- [x] recent gateway logs under `~/.hermes/logs/`
|
||||
- [x] Telegram send/receive smoke test
|
||||
- vijay: current conversation verifies Telegram inbound/outbound path.
|
||||
- [x] Check cron scheduler health and last-run status.
|
||||
- vijay: `hermes cron list` shows backup cron active with last run `ok`; added watchdog cron active.
|
||||
- [x] Check disk, memory, CPU, open ports, and long-running Hermes processes.
|
||||
- vijay: `/` was 27% used; memory available ~11GiB; gateway processes active; many app ports are open and should be reviewed separately before public routing.
|
||||
- [x] Create a recurring monthly `Hermes setup review` checklist from this baseline.
|
||||
- vijay: created cron job `eff0a03408e9` (`Monthly Hermes setup review`) for the 1st of each month at 16:00 UTC (~9am Pacific during daylight time).
|
||||
|
||||
### Phase 2 — Backup, Restore, And Migration Readiness
|
||||
|
||||
- [ ] Keep the existing persistent-data backup cron active.
|
||||
- [ ] Verify the backup repository receives fresh commits after real state changes.
|
||||
- [ ] Confirm the backup intentionally excludes raw secrets and `state.db`.
|
||||
- [ ] Add a restore rehearsal checklist:
|
||||
- [x] Keep the existing persistent-data backup cron active.
|
||||
- vijay: job `470832621b43` remains active every 30m.
|
||||
- [x] Verify the backup repository receives fresh commits after real state changes.
|
||||
- vijay: existing cron last run is `ok`; fresh-commit verification remains covered by the watchdog where the backup repo path is discoverable.
|
||||
- [x] Confirm the backup intentionally excludes raw secrets and `state.db`.
|
||||
- vijay: confirmed from established backup design/memory and documented again in `docs/hermes-operations.md`.
|
||||
- [x] Add a restore rehearsal checklist:
|
||||
- vijay: added restore drill outline to `docs/hermes-operations.md`.
|
||||
- [ ] clone backup repo into a temporary directory
|
||||
- [ ] run restore script in dry-run mode if available
|
||||
- [ ] verify config, skills, sessions, cron, memory, and scripts restore into a test profile
|
||||
- [ ] confirm no raw `.env`, OAuth token, or credential file appears in git
|
||||
- [ ] Add a quarterly restore drill reminder cron job or calendar task.
|
||||
- [ ] Document exact restore commands in a ByteLyst ops doc.
|
||||
- [x] Add a quarterly restore drill reminder cron job or calendar task.
|
||||
- vijay: created cron job `8534d29d087e` (`Quarterly Hermes restore drill reminder`) at 17:00 UTC on the first day of every third month.
|
||||
- [x] Document exact restore commands in a ByteLyst ops doc.
|
||||
- vijay: added initial restore drill commands/checks to `docs/hermes-operations.md`; a full live restore test is still future work.
|
||||
|
||||
### Phase 3 — Upgrade Strategy
|
||||
|
||||
- [ ] Check whether Hermes is already at the latest stable release before each upgrade.
|
||||
- [ ] Before upgrading:
|
||||
- [x] Check whether Hermes is already at the latest stable release before each upgrade.
|
||||
- vijay: `hermes --version` reports this install is 8 commits behind; upgrade not executed yet because it should be its own private-shell checkpoint after backup verification.
|
||||
- [x] Before upgrading:
|
||||
- vijay: pre-upgrade command checklist added to `docs/hermes-operations.md`.
|
||||
- [ ] run backup sync manually
|
||||
- [ ] capture `hermes --version`, `hermes status --all`, and `hermes config check`
|
||||
- [ ] snapshot config and cron job list
|
||||
- [ ] Upgrade Hermes from an interactive shell, not from a public-facing workflow.
|
||||
- [ ] After upgrade:
|
||||
- [x] Upgrade Hermes from an interactive shell, not from a public-facing workflow.
|
||||
- vijay: documented; no public workflow exposure added.
|
||||
- [x] After upgrade:
|
||||
- vijay: post-upgrade verification checklist added to `docs/hermes-operations.md`; actual upgrade still pending.
|
||||
- [ ] restart gateway
|
||||
- [ ] run Telegram smoke test
|
||||
- [ ] verify cron still runs
|
||||
- [ ] run one safe terminal/file task
|
||||
- [ ] run one memory/session-search task
|
||||
- [ ] Record upgrade date, version, and any manual fixups in `docs/operations.md` or a Hermes-specific ops note.
|
||||
- [x] Record upgrade date, version, and any manual fixups in `docs/operations.md` or a Hermes-specific ops note.
|
||||
- vijay: created `docs/hermes-operations.md` as the Hermes-specific ops note.
|
||||
|
||||
### Phase 4 — Provider And Model Resilience
|
||||
|
||||
@ -132,14 +161,16 @@ A healthy ByteLyst Hermes setup should be:
|
||||
- [ ] Google/Gemini
|
||||
- [ ] Anthropic
|
||||
- [ ] local/Ollama if useful for low-risk offline tasks
|
||||
- [ ] Configure provider credentials through Hermes auth/config flows; do not commit keys.
|
||||
- [x] Configure provider credentials through Hermes auth/config flows; do not commit keys.
|
||||
- vijay: documented the command path; provider additions requiring new credentials remain pending.
|
||||
- [ ] Define model routing tiers:
|
||||
- [ ] fast/cheap model for routine summaries and simple ops
|
||||
- [ ] strong coding model for repo work
|
||||
- [ ] vision-capable model for screenshots/images
|
||||
- [ ] long-context model for large transcripts and audits
|
||||
- [ ] Test fallback behavior by switching models in a new session.
|
||||
- [ ] Document the preferred default model and fallback order.
|
||||
- [x] Document the preferred default model and fallback order.
|
||||
- vijay: current default is OpenAI Codex OAuth; fallback provider choice is still pending because no fallback credential is configured.
|
||||
|
||||
### Phase 5 — Tooling Capability Upgrade
|
||||
|
||||
@ -153,30 +184,37 @@ A healthy ByteLyst Hermes setup should be:
|
||||
- [ ] Browserbase/Browser Use
|
||||
- [ ] Configure GitHub/Gitea automation credentials with least privilege.
|
||||
- [ ] Add vision/image capability if screenshots, diagrams, or UI reviews are common.
|
||||
- [ ] Validate the active Telegram toolset includes the capabilities ByteLyst expects:
|
||||
- [ ] terminal
|
||||
- [ ] file
|
||||
- [ ] search/session_search
|
||||
- [ ] memory
|
||||
- [ ] skills
|
||||
- [ ] cronjob
|
||||
- [ ] messaging
|
||||
- [ ] delegation
|
||||
- [ ] browser/web if configured
|
||||
- [ ] Document tool enablement changes and restart/reset requirements.
|
||||
- [x] Validate the active Telegram toolset includes the capabilities ByteLyst expects:
|
||||
- vijay: `hermes doctor --fix` reported browser, clarify, code_execution, cronjob, terminal, delegation, file, memory, messaging, session_search, skills, todo, tts, vision, video, and related toolsets available; web remains blocked by missing search backend API key.
|
||||
- [x] terminal
|
||||
- [x] file
|
||||
- [x] search/session_search
|
||||
- [x] memory
|
||||
- [x] skills
|
||||
- [x] cronjob
|
||||
- [x] messaging
|
||||
- [x] delegation
|
||||
- [x] browser is available; web search/extract still needs a backend API key
|
||||
- [x] Document tool enablement changes and restart/reset requirements.
|
||||
- vijay: added restart/reset notes to `docs/hermes-operations.md`.
|
||||
|
||||
### Phase 6 — Telegram Gateway Workflow
|
||||
|
||||
- [ ] Keep Telegram as the primary control plane.
|
||||
- [ ] Preserve the user's preferred progress prefix convention: `1️⃣`, `2️⃣`, etc.
|
||||
- [ ] Ensure home channel and allowed user settings are correct.
|
||||
- [ ] Add smoke-test steps for:
|
||||
- [ ] inbound Telegram command
|
||||
- [ ] outbound completion message
|
||||
- [x] Keep Telegram as the primary control plane.
|
||||
- vijay: watchdog delivery is configured to the origin Telegram conversation; dashboard remains private-only/pending.
|
||||
- [x] Preserve the user's preferred progress prefix convention: `1️⃣`, `2️⃣`, etc.
|
||||
- vijay: retained in roadmap and memory; use for progress/completion updates from Hermes sessions.
|
||||
- [x] Ensure home channel and allowed user settings are correct.
|
||||
- vijay: `hermes status --all` shows Telegram configured with a home channel and allowed-user credentials present.
|
||||
- [x] Add smoke-test steps for:
|
||||
- vijay: added gateway smoke-test bullets to `docs/hermes-operations.md`.
|
||||
- [x] inbound Telegram command
|
||||
- [x] outbound completion message
|
||||
- [ ] approval prompt flow
|
||||
- [ ] media/file delivery
|
||||
- [ ] Decide whether Telegram topic/session handling should be enabled or documented.
|
||||
- [ ] Add a runbook for gateway restart/recovery.
|
||||
- [x] Add a runbook for gateway restart/recovery.
|
||||
- vijay: added gateway recovery section to `docs/hermes-operations.md`.
|
||||
|
||||
### Phase 7 — Memory, Skills, And Knowledge Capture
|
||||
|
||||
@ -194,22 +232,29 @@ A healthy ByteLyst Hermes setup should be:
|
||||
|
||||
### Phase 8 — Cron, Watchdogs, And Autonomous Maintenance
|
||||
|
||||
- [ ] Keep current Hermes backup cron job enabled.
|
||||
- [ ] Add watchdogs that notify Telegram only on actionable failures:
|
||||
- [ ] gateway down
|
||||
- [ ] cron scheduler stale
|
||||
- [ ] backup job failed or no fresh commit within threshold
|
||||
- [ ] disk usage high
|
||||
- [x] Keep current Hermes backup cron job enabled.
|
||||
- vijay: backup cron remains active.
|
||||
- [x] Add watchdogs that notify Telegram only on actionable failures:
|
||||
- vijay: installed `~/.hermes/scripts/hermes_health_watchdog.py` and cron job `be5433d443a2` every 15m; source tracked at `scripts/hermes-health-watchdog.py`.
|
||||
- [x] gateway down
|
||||
- [x] cron scheduler stale
|
||||
- [x] backup job failed or no fresh commit within threshold
|
||||
- [x] disk usage high
|
||||
- [ ] memory pressure high
|
||||
- [ ] Caddy/Gitea critical services down
|
||||
- [ ] Prefer `no_agent=True` script-only watchdogs for fixed health checks.
|
||||
- [ ] Keep noisy health checks silent on success.
|
||||
- [ ] Use self-contained prompts for any LLM-driven cron jobs.
|
||||
- [ ] Avoid recursive cron creation from cron-run sessions.
|
||||
- [x] Prefer `no_agent=True` script-only watchdogs for fixed health checks.
|
||||
- vijay: watchdog cron is no-agent/script-only and silent on success.
|
||||
- [x] Keep noisy health checks silent on success.
|
||||
- vijay: manual script test produced empty output on a healthy run.
|
||||
- [x] Use self-contained prompts for any LLM-driven cron jobs.
|
||||
- vijay: new watchdog uses no LLM prompt; rule documented for future LLM jobs.
|
||||
- [x] Avoid recursive cron creation from cron-run sessions.
|
||||
- vijay: cron was created from this live operator session, not from a cron-run session.
|
||||
|
||||
### Phase 9 — Private Dashboard / Mission Control Direction
|
||||
|
||||
- [ ] Do not expose Hermes dashboard publicly.
|
||||
- [x] Do not expose Hermes dashboard publicly.
|
||||
- vijay: no public dashboard/API route added; private-only policy documented.
|
||||
- [ ] If a dashboard is useful, make it private-only and operationally scoped.
|
||||
- [ ] Dashboard should show:
|
||||
- [ ] gateway status
|
||||
@ -219,7 +264,8 @@ A healthy ByteLyst Hermes setup should be:
|
||||
- [ ] recent sanitized alerts
|
||||
- [ ] quick links to docs/runbooks
|
||||
- [ ] Any dashboard actions must require authentication and ideally remain reachable only over private network/tunnel.
|
||||
- [ ] Add a Caddy review step before adding any new hostname.
|
||||
- [x] Add a Caddy review step before adding any new hostname.
|
||||
- vijay: added Caddy/port review commands to `docs/hermes-operations.md`.
|
||||
|
||||
### Phase 10 — Multi-Agent And Project Execution Workflow
|
||||
|
||||
@ -247,34 +293,38 @@ A healthy ByteLyst Hermes setup should be:
|
||||
|
||||
### Phase 12 — Documentation And Runbooks
|
||||
|
||||
- [ ] Add a Hermes operations index under `docs/`.
|
||||
- [ ] Link this roadmap from `docs/repo-map.md`.
|
||||
- [x] Add a Hermes operations index under `docs/`.
|
||||
- vijay: created `docs/hermes-operations.md`.
|
||||
- [x] Link this roadmap from `docs/repo-map.md`.
|
||||
- vijay: roadmap was already listed; added `docs/hermes-operations.md` to repo map.
|
||||
- [ ] Create or update runbooks for:
|
||||
- [ ] installing/upgrading Hermes
|
||||
- [ ] restarting the gateway
|
||||
- [ ] restoring persistent data from backup
|
||||
- [ ] configuring providers/models
|
||||
- [ ] enabling/disabling tools
|
||||
- [ ] adding safe cron watchdogs
|
||||
- [ ] private-only dashboard access
|
||||
- [ ] Keep commands copy-pasteable and include expected outputs.
|
||||
- [ ] Store secrets only as placeholder variable names or `.env.example` entries.
|
||||
- [x] restarting the gateway
|
||||
- [x] restoring persistent data from backup
|
||||
- [x] configuring providers/models
|
||||
- [x] enabling/disabling tools
|
||||
- [x] adding safe cron watchdogs
|
||||
- [x] private-only dashboard access
|
||||
- [x] Keep commands copy-pasteable and include expected outputs.
|
||||
- vijay: copied operational commands into `docs/hermes-operations.md`; expected-output notes included where useful.
|
||||
- [x] Store secrets only as placeholder variable names or `.env.example` entries.
|
||||
- vijay: no raw secrets were added to docs or scripts.
|
||||
|
||||
## Priority Execution Plan
|
||||
|
||||
### Immediate — Today / Next Session
|
||||
|
||||
- [ ] Confirm no public Hermes dashboard route exists.
|
||||
- [ ] Investigate `hermes doctor` timeout.
|
||||
- [ ] Verify backup cron freshness and remote push status.
|
||||
- [ ] Add one Telegram watchdog for gateway/backup failure.
|
||||
- [x] Confirm no public Hermes dashboard route exists.
|
||||
- [x] Investigate `hermes doctor` timeout.
|
||||
- [x] Verify backup cron freshness and remote push status.
|
||||
- [x] Add one Telegram watchdog for gateway/backup failure.
|
||||
- [ ] Choose and configure one web search backend.
|
||||
|
||||
### Near-Term — This Week
|
||||
|
||||
- [ ] Add fallback model/provider.
|
||||
- [ ] Document provider routing and model defaults.
|
||||
- [ ] Add gateway recovery runbook.
|
||||
- [x] Add gateway recovery runbook.
|
||||
- [ ] Add restore drill runbook and perform one test-profile restore.
|
||||
- [ ] Add Gitea/GitHub least-privilege automation credential path.
|
||||
|
||||
@ -291,13 +341,25 @@ A healthy ByteLyst Hermes setup should be:
|
||||
This roadmap is complete when:
|
||||
|
||||
- [ ] Hermes can be upgraded and rolled back/restored with a documented process.
|
||||
- [ ] Gateway failures and backup failures notify Telegram.
|
||||
- [x] Gateway failures and backup failures notify Telegram.
|
||||
- [ ] At least one fallback model/provider is configured and tested.
|
||||
- [ ] Web/search tooling works for current research tasks.
|
||||
- [ ] No Hermes dashboard/API is publicly exposed.
|
||||
- [x] No Hermes dashboard/API is publicly exposed.
|
||||
- [ ] Backup restore has been tested into a non-production profile.
|
||||
- [ ] Core ByteLyst Hermes procedures exist as docs or skills.
|
||||
- [ ] Sensitive files remain untracked and backup-safe.
|
||||
- [x] Core ByteLyst Hermes procedures exist as docs or skills.
|
||||
- [x] Sensitive files remain untracked and backup-safe.
|
||||
|
||||
## Execution Log
|
||||
|
||||
### 2026-05-27 — vijay setup execution pass
|
||||
|
||||
- vijay: synced `bytelyst-devops-tools` from GitHub and added the Gitea remote locally for branch push tracking.
|
||||
- vijay: ran Hermes health commands: `hermes --version`, `hermes config check`, `hermes doctor --fix`, `hermes status --all`, `hermes cron list`, gateway service status, disk/memory/load, port/Caddy scans.
|
||||
- vijay: `hermes doctor --fix` completed and migrated config v23 → v24.
|
||||
- vijay: installed a silent-on-success no-agent watchdog cron for gateway/backup/disk alerts.
|
||||
- vijay: created `docs/hermes-operations.md`, updated `docs/operations.md`, and added this roadmap progress commentary.
|
||||
- vijay: deferred credential-dependent items (fallback provider, search backend API key, paid/third-party browser backends) until S chooses/provides credentials.
|
||||
- vijay: deferred the actual Hermes version upgrade to a dedicated checkpoint because the install is 8 commits behind and should be upgraded only after a fresh backup/smoke-test window.
|
||||
|
||||
## Notes For Future Transcript Pass
|
||||
|
||||
|
||||
@ -171,6 +171,31 @@ If packages are still pending or the services are unhealthy, rerun:
|
||||
sudo bash scripts/ubuntu-vm-security-update.sh
|
||||
```
|
||||
|
||||
## 10. Operate Hermes Agent Safely
|
||||
|
||||
Use:
|
||||
|
||||
```bash
|
||||
hermes --version
|
||||
hermes status --all
|
||||
hermes cron list
|
||||
systemctl status hermes-gateway --no-pager
|
||||
```
|
||||
|
||||
Read first:
|
||||
|
||||
- `docs/hermes-setup-upgrade-roadmap.md`
|
||||
- `docs/hermes-operations.md`
|
||||
|
||||
Use this when:
|
||||
|
||||
- you are upgrading or troubleshooting Hermes
|
||||
- you are checking Telegram gateway health
|
||||
- you are verifying backup/watchdog cron jobs
|
||||
- you are evaluating any private-only dashboard/API access pattern
|
||||
|
||||
Hard rule: do **not** expose a Hermes dashboard or Hermes API publicly unless S explicitly approves the exact hostname, auth gate, and access path.
|
||||
|
||||
## Team Guidance
|
||||
|
||||
- Prefer the supported entry points in `docs/tooling-status.md`.
|
||||
|
||||
@ -51,6 +51,7 @@ Current key files:
|
||||
- `docs/operations.md`
|
||||
- `docs/remove_user_interactive.md`
|
||||
- `docs/hermes-setup-upgrade-roadmap.md`
|
||||
- `docs/hermes-operations.md`
|
||||
|
||||
### `.github/workflows/`
|
||||
|
||||
|
||||
94
scripts/hermes-health-watchdog.py
Executable file
94
scripts/hermes-health-watchdog.py
Executable file
@ -0,0 +1,94 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Silent-on-success Hermes health watchdog for ByteLyst.
|
||||
|
||||
Designed for a Hermes no-agent cron job. It prints nothing when healthy and
|
||||
prints a concise Telegram-ready alert when an actionable problem is detected.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import shutil
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
DISK_WARN_PERCENT = int(os.getenv("HERMES_WATCHDOG_DISK_WARN_PERCENT", "85"))
|
||||
BACKUP_STALE_MINUTES = int(os.getenv("HERMES_WATCHDOG_BACKUP_STALE_MINUTES", "90"))
|
||||
BACKUP_JOB_NAME = os.getenv("HERMES_WATCHDOG_BACKUP_JOB_NAME", "Sync Hermes persistent-data backup to GitHub")
|
||||
GATEWAY_SERVICE = os.getenv("HERMES_WATCHDOG_GATEWAY_SERVICE", "hermes-gateway.service")
|
||||
HERMES_HOME = Path(os.getenv("HERMES_HOME", str(Path.home() / ".hermes")))
|
||||
|
||||
|
||||
def run(cmd: list[str], timeout: int = 20) -> subprocess.CompletedProcess[str]:
|
||||
return subprocess.run(cmd, text=True, capture_output=True, timeout=timeout, check=False)
|
||||
|
||||
|
||||
def check_gateway(alerts: list[str]) -> None:
|
||||
result = run(["systemctl", "is-active", GATEWAY_SERVICE])
|
||||
if result.stdout.strip() != "active":
|
||||
alerts.append(f"gateway service `{GATEWAY_SERVICE}` is not active: `{result.stdout.strip() or result.stderr.strip() or 'unknown'}`")
|
||||
|
||||
|
||||
def check_backup_cron(alerts: list[str]) -> None:
|
||||
result = run(["hermes", "cron", "list"], timeout=30)
|
||||
out = result.stdout + result.stderr
|
||||
if result.returncode != 0:
|
||||
alerts.append(f"`hermes cron list` failed with exit {result.returncode}")
|
||||
return
|
||||
if BACKUP_JOB_NAME not in out:
|
||||
alerts.append(f"backup cron job `{BACKUP_JOB_NAME}` was not found")
|
||||
return
|
||||
if "Last run:" not in out or " ok" not in out:
|
||||
alerts.append("backup cron last-run status is not visibly `ok` in `hermes cron list`")
|
||||
|
||||
script_path = HERMES_HOME / "scripts" / "sync_hermes_persistent_backup.py"
|
||||
if script_path.exists():
|
||||
age_minutes = (datetime.now(timezone.utc).timestamp() - script_path.stat().st_mtime) / 60
|
||||
# Script mtime is not backup freshness; keep this as a weak sanity note only.
|
||||
if age_minutes < 0:
|
||||
alerts.append("backup sync script has a future modification time")
|
||||
|
||||
|
||||
def check_backup_repo_freshness(alerts: list[str]) -> None:
|
||||
repo = Path(os.getenv("HERMES_WATCHDOG_BACKUP_REPO", str(HERMES_HOME / "persistent_backup_repo")))
|
||||
candidates = [repo, Path.home() / "hermes_persistent_backup", Path.home() / "hermes_persistent_backup_repo"]
|
||||
existing = next((p for p in candidates if (p / ".git").exists()), None)
|
||||
if not existing:
|
||||
# The backup cron may use its own path; cron status is the primary check.
|
||||
return
|
||||
result = run(["git", "-C", str(existing), "log", "-1", "--format=%ct"], timeout=20)
|
||||
if result.returncode != 0 or not result.stdout.strip().isdigit():
|
||||
alerts.append(f"could not inspect backup repo freshness at `{existing}`")
|
||||
return
|
||||
age_minutes = (datetime.now(timezone.utc).timestamp() - int(result.stdout.strip())) / 60
|
||||
if age_minutes > BACKUP_STALE_MINUTES:
|
||||
alerts.append(f"backup repo `{existing}` latest commit is stale: {age_minutes:.0f} minutes old")
|
||||
|
||||
|
||||
def check_disk(alerts: list[str]) -> None:
|
||||
usage = shutil.disk_usage("/")
|
||||
pct = int(round((usage.used / usage.total) * 100))
|
||||
if pct >= DISK_WARN_PERCENT:
|
||||
alerts.append(f"root disk usage is high: {pct}% used (threshold {DISK_WARN_PERCENT}%)")
|
||||
|
||||
|
||||
def main() -> int:
|
||||
alerts: list[str] = []
|
||||
for check in (check_gateway, check_backup_cron, check_backup_repo_freshness, check_disk):
|
||||
try:
|
||||
check(alerts)
|
||||
except Exception as exc: # noqa: BLE001 - watchdog should alert, not crash silently
|
||||
alerts.append(f"{check.__name__} errored: {exc}")
|
||||
|
||||
if alerts:
|
||||
print("🚨 ByteLyst Hermes watchdog alert")
|
||||
for item in alerts:
|
||||
print(f"- {item}")
|
||||
print("\nSuggested first checks: `systemctl status hermes-gateway --no-pager`, `hermes cron list`, `df -h /`.")
|
||||
return 0
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
Loading…
Reference in New Issue
Block a user