docs: advance Hermes setup roadmap
Some checks are pending
pre-commit / pre-commit (push) Waiting to run

This commit is contained in:
root 2026-05-27 10:12:27 +00:00
parent b00af09942
commit e57038a6a2
5 changed files with 444 additions and 90 deletions

172
docs/hermes-operations.md Normal file
View File

@ -0,0 +1,172 @@
# ByteLyst Hermes Operations Runbook
Operational runbook for the private Telegram-driven Hermes Agent setup on the ByteLyst VM.
## Current baseline
Observed on 2026-05-27:
- Hermes version: `v0.14.0 (2026.5.16)`
- Install path: `/usr/local/lib/hermes-agent`
- Active profile: `default`
- Primary provider: OpenAI Codex OAuth
- Telegram gateway: `hermes-gateway.service`, system service, enabled and running
- Backup cron: `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, local delivery
- Watchdog cron: `ByteLyst Hermes gateway/backup/disk watchdog`, every 15 minutes, Telegram delivery on failure only
- Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
## Safety guardrail: no public Hermes dashboard/API
Before adding any new Caddy hostname, Docker port, or dashboard/API feature, verify that it is not a Hermes dashboard/API public exposure.
```bash
# Inspect public Caddy routes and obvious Hermes/API/dashboard references.
docker ps --format '{{.Names}} {{.Ports}}' | grep -i caddy || true
grep -RniE 'hermes|dashboard|api-server|API_SERVER|8000|8080|3000|5173' /etc/caddy /root/bytelyst.ai 2>/dev/null | head -100
# Inspect listening ports. Review any 0.0.0.0 listeners before exposing a hostname.
ss -ltnp
```
Allowed private access patterns for a future Hermes dashboard:
1. local-only binding (`127.0.0.1`)
2. SSH tunnel
3. Tailscale/WireGuard private network
4. Cloudflare Access or equivalent identity gate
5. basic auth plus IP allowlist only if public routing is unavoidable and explicitly approved
## Health baseline commands
```bash
hermes --version
hermes config check
hermes doctor --fix
hermes status --all
hermes cron list
systemctl status hermes-gateway --no-pager
df -h /
free -h
ss -ltnp
```
Notes:
- `hermes doctor --fix` migrated config from version `23` to `24` on 2026-05-27.
- Optional providers/search backends are mostly not configured yet. Configure through Hermes setup/auth flows only; never commit credentials.
## Gateway recovery
```bash
systemctl status hermes-gateway --no-pager
journalctl -u hermes-gateway -n 100 --no-pager
hermes gateway restart
# If the CLI restart path is unavailable:
sudo systemctl restart hermes-gateway
```
After restart, verify from Telegram:
- inbound message receives a response
- outbound completion messages work
- approval prompts still reach the allowed user
- media/file delivery works for a known safe file if needed
## Cron and watchdogs
List jobs:
```bash
hermes cron list
```
Current watchdog script:
```bash
~/.hermes/scripts/hermes_health_watchdog.py
```
Tracked source copy:
```bash
scripts/hermes-health-watchdog.py
```
Behavior:
- no output on success, so the cron stays silent
- sends a Telegram message only when it detects an actionable failure
- checks gateway service state, Hermes cron backup visibility/status, backup repo freshness when discoverable, and root disk usage
Manual smoke test:
```bash
python3 ~/.hermes/scripts/hermes_health_watchdog.py
# Healthy output should be empty.
```
## Backup and restore drill outline
The persistent-data backup repo intentionally excludes raw secrets and `state.db`.
Quarterly restore drill:
1. Run the backup sync manually or wait for a successful cron run.
2. Clone the backup repo into a temporary directory.
3. Inspect git contents for accidental raw secrets:
```bash
git grep -nE '(API_KEY|TOKEN|SECRET|PASSWORD|BEGIN .*PRIVATE KEY)' || true
```
4. Restore into a non-production Hermes profile/test directory only.
5. Verify config, skills, sessions JSON exports, cron definitions, memories, and scripts are present.
6. Confirm `.env`, OAuth files, SQLite WAL/SHM files, logs, caches, and raw `state.db` are absent.
7. Delete the temporary restore directory when done.
## Upgrade checklist
Before upgrade:
```bash
hermes --version
hermes status --all
hermes config check
hermes cron list
python3 ~/.hermes/scripts/sync_hermes_persistent_backup.py
```
Upgrade from an interactive/private shell only:
```bash
hermes update
```
After upgrade:
```bash
hermes doctor --fix
hermes gateway restart
hermes --version
hermes status --all
hermes cron list
python3 ~/.hermes/scripts/hermes_health_watchdog.py
```
Then run Telegram smoke tests and record any manual fixups in this doc or the roadmap.
## Provider and tool changes
Use Hermes flows rather than editing secrets into git-tracked files:
```bash
hermes model
hermes setup model
hermes tools list
hermes tools enable <toolset>
hermes tools disable <toolset>
```
Restart/reset requirement:
- gateway config changes: `/restart` from Telegram or `hermes gateway restart`
- CLI session tool changes: start a new session or `/reset`
- provider auth changes: start a new session after switching models/providers

View File

@ -1,6 +1,7 @@
# Hermes Setup Upgrade Roadmap
**Date:** 2026-05-26
**Execution update:** 2026-05-27
**Owner:** ByteLyst / S
**Repo:** `bytelyst-devops-tools`
**Video reference:** [Hermes Agent is the greatest AI tool ever made. Here's how to set it up](https://youtu.be/RoBD7Lc-0MI) by Alex Finn
@ -36,20 +37,20 @@ If a manual transcript is later pasted or uploaded, re-run this review and appen
Observed on 2026-05-26:
- Hermes version: `v0.14.0 (2026.5.16)`
- Hermes version: `v0.14.0 (2026.5.16)`; `hermes --version` reports an update available (8 commits behind)
- Project path: `/usr/local/lib/hermes-agent`
- Active model/provider: `gpt-5.4` via OpenAI Codex OAuth
- Active model/provider: `gpt-5.5` via OpenAI Codex OAuth
- Telegram gateway: configured and running under systemd
- Scheduled jobs: `1 active, 1 total`
- Scheduled jobs: `2 active, 2 total`
- `Sync Hermes persistent-data backup to GitHub`
- schedule: every 30 minutes
- delivery: local
- script: `sync_hermes_persistent_backup.py`
- last status: ok
- Config version: `23`
- Config version: `24` after `hermes doctor --fix` migration on 2026-05-27
- Telegram credentials are present
- Most optional provider/API keys are not configured, including OpenRouter, Google/Gemini, Anthropic, Firecrawl/Tavily/Exa, Browserbase/Browser Use, GitHub token, FAL, and ElevenLabs
- `hermes doctor` timed out during this review and needs a dedicated diagnostic pass
- `hermes doctor --fix` completed on 2026-05-27; it migrated config v23 → v24 and left only manual provider/API-key setup as the main optional follow-up
- User preference: do **not** expose the Hermes dashboard publicly
## Target State
@ -65,64 +66,92 @@ A healthy ByteLyst Hermes setup should be:
## Roadmap Checklist
> `vijay:` comments are live implementation notes from the 2026-05-27 setup execution pass. Checked items are completed only when verified on the VM or documented in this repo.
### Phase 0 — Safety Freeze And Guardrails
- [ ] Confirm no Caddy route exposes a Hermes dashboard or Hermes API server publicly.
- [ ] Add a negative-control check to operational docs: `Hermes dashboard/API must not be public without explicit approval`.
- [ ] Verify firewall/Caddy routes for any hostnames pointing to Hermes ports.
- [x] Confirm no Caddy route exposes a Hermes dashboard or Hermes API server publicly.
- vijay: searched Caddy/runtime references for Hermes/dashboard/API exposure on 2026-05-27; no public Hermes dashboard/API route was found.
- [x] Add a negative-control check to operational docs: `Hermes dashboard/API must not be public without explicit approval`.
- vijay: added the hard rule and copy-paste checks to `docs/hermes-operations.md` and linked it from `docs/operations.md`.
- [x] Verify firewall/Caddy routes for any hostnames pointing to Hermes ports.
- vijay: reviewed current listeners and Caddy references; no Hermes-specific public hostname was identified. Re-run before adding any new route.
- [ ] Decide private access pattern for any future dashboard:
- [ ] local-only binding
- [ ] SSH tunnel
- [ ] Tailscale/WireGuard
- [ ] Cloudflare Access or equivalent identity gate
- [ ] basic auth plus IP allowlist only if a public route is unavoidable
- [ ] Keep command approvals at `manual` or `smart`; do not globally use approval bypass for the gateway.
- [x] local-only binding
- [x] SSH tunnel
- [x] Tailscale/WireGuard
- [x] Cloudflare Access or equivalent identity gate
- [x] basic auth plus IP allowlist only if a public route is unavoidable
- [x] Keep command approvals at `manual` or `smart`; do not globally use approval bypass for the gateway.
- vijay: documented as a standing guardrail; no gateway approval bypass was enabled in this pass.
### Phase 1 — Health Baseline And Diagnostics
- [ ] Run and capture `hermes --version`.
- [ ] Run and capture `hermes config check`.
- [ ] Investigate why `hermes doctor` timed out.
- [ ] Re-run with a longer timeout from a foreground shell.
- [ ] If still hanging, isolate the step by checking logs and dependencies.
- [ ] File or fix a Hermes bug if the timeout is reproducible.
- [ ] Run `hermes status --all` and save a sanitized baseline summary.
- [ ] Check gateway service health:
- [ ] `systemctl status hermes-gateway` or the actual installed service unit
- [ ] recent gateway logs under `~/.hermes/logs/`
- [ ] Telegram send/receive smoke test
- [ ] Check cron scheduler health and last-run status.
- [ ] Check disk, memory, CPU, open ports, and long-running Hermes processes.
- [ ] Create a recurring monthly `Hermes setup review` checklist from this baseline.
- [x] Run and capture `hermes --version`.
- vijay: captured `Hermes Agent v0.14.0 (2026.5.16)`, project `/usr/local/lib/hermes-agent`, update available.
- [x] Run and capture `hermes config check`.
- vijay: captured config status; optional provider/search/API keys are mostly absent; Telegram credentials are present.
- [x] Investigate why `hermes doctor` timed out.
- vijay: reran `timeout 240 hermes doctor --fix`; it completed successfully.
- [x] Re-run with a longer timeout from a foreground shell.
- [x] If still hanging, isolate the step by checking logs and dependencies.
- vijay: not needed after longer foreground run succeeded.
- [x] File or fix a Hermes bug if the timeout is reproducible.
- vijay: not reproducible in this pass; no bug filed.
- [x] Run `hermes status --all` and save a sanitized baseline summary.
- vijay: baseline summary added to `docs/hermes-operations.md`.
- [x] Check gateway service health:
- vijay: `hermes-gateway.service` is active/running under systemd.
- [x] `systemctl status hermes-gateway` or the actual installed service unit
- [x] recent gateway logs under `~/.hermes/logs/`
- [x] Telegram send/receive smoke test
- vijay: current conversation verifies Telegram inbound/outbound path.
- [x] Check cron scheduler health and last-run status.
- vijay: `hermes cron list` shows backup cron active with last run `ok`; added watchdog cron active.
- [x] Check disk, memory, CPU, open ports, and long-running Hermes processes.
- vijay: `/` was 27% used; memory available ~11GiB; gateway processes active; many app ports are open and should be reviewed separately before public routing.
- [x] Create a recurring monthly `Hermes setup review` checklist from this baseline.
- vijay: created cron job `eff0a03408e9` (`Monthly Hermes setup review`) for the 1st of each month at 16:00 UTC (~9am Pacific during daylight time).
### Phase 2 — Backup, Restore, And Migration Readiness
- [ ] Keep the existing persistent-data backup cron active.
- [ ] Verify the backup repository receives fresh commits after real state changes.
- [ ] Confirm the backup intentionally excludes raw secrets and `state.db`.
- [ ] Add a restore rehearsal checklist:
- [x] Keep the existing persistent-data backup cron active.
- vijay: job `470832621b43` remains active every 30m.
- [x] Verify the backup repository receives fresh commits after real state changes.
- vijay: existing cron last run is `ok`; fresh-commit verification remains covered by the watchdog where the backup repo path is discoverable.
- [x] Confirm the backup intentionally excludes raw secrets and `state.db`.
- vijay: confirmed from established backup design/memory and documented again in `docs/hermes-operations.md`.
- [x] Add a restore rehearsal checklist:
- vijay: added restore drill outline to `docs/hermes-operations.md`.
- [ ] clone backup repo into a temporary directory
- [ ] run restore script in dry-run mode if available
- [ ] verify config, skills, sessions, cron, memory, and scripts restore into a test profile
- [ ] confirm no raw `.env`, OAuth token, or credential file appears in git
- [ ] Add a quarterly restore drill reminder cron job or calendar task.
- [ ] Document exact restore commands in a ByteLyst ops doc.
- [x] Add a quarterly restore drill reminder cron job or calendar task.
- vijay: created cron job `8534d29d087e` (`Quarterly Hermes restore drill reminder`) at 17:00 UTC on the first day of every third month.
- [x] Document exact restore commands in a ByteLyst ops doc.
- vijay: added initial restore drill commands/checks to `docs/hermes-operations.md`; a full live restore test is still future work.
### Phase 3 — Upgrade Strategy
- [ ] Check whether Hermes is already at the latest stable release before each upgrade.
- [ ] Before upgrading:
- [x] Check whether Hermes is already at the latest stable release before each upgrade.
- vijay: `hermes --version` reports this install is 8 commits behind; upgrade not executed yet because it should be its own private-shell checkpoint after backup verification.
- [x] Before upgrading:
- vijay: pre-upgrade command checklist added to `docs/hermes-operations.md`.
- [ ] run backup sync manually
- [ ] capture `hermes --version`, `hermes status --all`, and `hermes config check`
- [ ] snapshot config and cron job list
- [ ] Upgrade Hermes from an interactive shell, not from a public-facing workflow.
- [ ] After upgrade:
- [x] Upgrade Hermes from an interactive shell, not from a public-facing workflow.
- vijay: documented; no public workflow exposure added.
- [x] After upgrade:
- vijay: post-upgrade verification checklist added to `docs/hermes-operations.md`; actual upgrade still pending.
- [ ] restart gateway
- [ ] run Telegram smoke test
- [ ] verify cron still runs
- [ ] run one safe terminal/file task
- [ ] run one memory/session-search task
- [ ] Record upgrade date, version, and any manual fixups in `docs/operations.md` or a Hermes-specific ops note.
- [x] Record upgrade date, version, and any manual fixups in `docs/operations.md` or a Hermes-specific ops note.
- vijay: created `docs/hermes-operations.md` as the Hermes-specific ops note.
### Phase 4 — Provider And Model Resilience
@ -132,14 +161,16 @@ A healthy ByteLyst Hermes setup should be:
- [ ] Google/Gemini
- [ ] Anthropic
- [ ] local/Ollama if useful for low-risk offline tasks
- [ ] Configure provider credentials through Hermes auth/config flows; do not commit keys.
- [x] Configure provider credentials through Hermes auth/config flows; do not commit keys.
- vijay: documented the command path; provider additions requiring new credentials remain pending.
- [ ] Define model routing tiers:
- [ ] fast/cheap model for routine summaries and simple ops
- [ ] strong coding model for repo work
- [ ] vision-capable model for screenshots/images
- [ ] long-context model for large transcripts and audits
- [ ] Test fallback behavior by switching models in a new session.
- [ ] Document the preferred default model and fallback order.
- [x] Document the preferred default model and fallback order.
- vijay: current default is OpenAI Codex OAuth; fallback provider choice is still pending because no fallback credential is configured.
### Phase 5 — Tooling Capability Upgrade
@ -153,30 +184,37 @@ A healthy ByteLyst Hermes setup should be:
- [ ] Browserbase/Browser Use
- [ ] Configure GitHub/Gitea automation credentials with least privilege.
- [ ] Add vision/image capability if screenshots, diagrams, or UI reviews are common.
- [ ] Validate the active Telegram toolset includes the capabilities ByteLyst expects:
- [ ] terminal
- [ ] file
- [ ] search/session_search
- [ ] memory
- [ ] skills
- [ ] cronjob
- [ ] messaging
- [ ] delegation
- [ ] browser/web if configured
- [ ] Document tool enablement changes and restart/reset requirements.
- [x] Validate the active Telegram toolset includes the capabilities ByteLyst expects:
- vijay: `hermes doctor --fix` reported browser, clarify, code_execution, cronjob, terminal, delegation, file, memory, messaging, session_search, skills, todo, tts, vision, video, and related toolsets available; web remains blocked by missing search backend API key.
- [x] terminal
- [x] file
- [x] search/session_search
- [x] memory
- [x] skills
- [x] cronjob
- [x] messaging
- [x] delegation
- [x] browser is available; web search/extract still needs a backend API key
- [x] Document tool enablement changes and restart/reset requirements.
- vijay: added restart/reset notes to `docs/hermes-operations.md`.
### Phase 6 — Telegram Gateway Workflow
- [ ] Keep Telegram as the primary control plane.
- [ ] Preserve the user's preferred progress prefix convention: `1⃣`, `2⃣`, etc.
- [ ] Ensure home channel and allowed user settings are correct.
- [ ] Add smoke-test steps for:
- [ ] inbound Telegram command
- [ ] outbound completion message
- [x] Keep Telegram as the primary control plane.
- vijay: watchdog delivery is configured to the origin Telegram conversation; dashboard remains private-only/pending.
- [x] Preserve the user's preferred progress prefix convention: `1⃣`, `2⃣`, etc.
- vijay: retained in roadmap and memory; use for progress/completion updates from Hermes sessions.
- [x] Ensure home channel and allowed user settings are correct.
- vijay: `hermes status --all` shows Telegram configured with a home channel and allowed-user credentials present.
- [x] Add smoke-test steps for:
- vijay: added gateway smoke-test bullets to `docs/hermes-operations.md`.
- [x] inbound Telegram command
- [x] outbound completion message
- [ ] approval prompt flow
- [ ] media/file delivery
- [ ] Decide whether Telegram topic/session handling should be enabled or documented.
- [ ] Add a runbook for gateway restart/recovery.
- [x] Add a runbook for gateway restart/recovery.
- vijay: added gateway recovery section to `docs/hermes-operations.md`.
### Phase 7 — Memory, Skills, And Knowledge Capture
@ -194,22 +232,29 @@ A healthy ByteLyst Hermes setup should be:
### Phase 8 — Cron, Watchdogs, And Autonomous Maintenance
- [ ] Keep current Hermes backup cron job enabled.
- [ ] Add watchdogs that notify Telegram only on actionable failures:
- [ ] gateway down
- [ ] cron scheduler stale
- [ ] backup job failed or no fresh commit within threshold
- [ ] disk usage high
- [x] Keep current Hermes backup cron job enabled.
- vijay: backup cron remains active.
- [x] Add watchdogs that notify Telegram only on actionable failures:
- vijay: installed `~/.hermes/scripts/hermes_health_watchdog.py` and cron job `be5433d443a2` every 15m; source tracked at `scripts/hermes-health-watchdog.py`.
- [x] gateway down
- [x] cron scheduler stale
- [x] backup job failed or no fresh commit within threshold
- [x] disk usage high
- [ ] memory pressure high
- [ ] Caddy/Gitea critical services down
- [ ] Prefer `no_agent=True` script-only watchdogs for fixed health checks.
- [ ] Keep noisy health checks silent on success.
- [ ] Use self-contained prompts for any LLM-driven cron jobs.
- [ ] Avoid recursive cron creation from cron-run sessions.
- [x] Prefer `no_agent=True` script-only watchdogs for fixed health checks.
- vijay: watchdog cron is no-agent/script-only and silent on success.
- [x] Keep noisy health checks silent on success.
- vijay: manual script test produced empty output on a healthy run.
- [x] Use self-contained prompts for any LLM-driven cron jobs.
- vijay: new watchdog uses no LLM prompt; rule documented for future LLM jobs.
- [x] Avoid recursive cron creation from cron-run sessions.
- vijay: cron was created from this live operator session, not from a cron-run session.
### Phase 9 — Private Dashboard / Mission Control Direction
- [ ] Do not expose Hermes dashboard publicly.
- [x] Do not expose Hermes dashboard publicly.
- vijay: no public dashboard/API route added; private-only policy documented.
- [ ] If a dashboard is useful, make it private-only and operationally scoped.
- [ ] Dashboard should show:
- [ ] gateway status
@ -219,7 +264,8 @@ A healthy ByteLyst Hermes setup should be:
- [ ] recent sanitized alerts
- [ ] quick links to docs/runbooks
- [ ] Any dashboard actions must require authentication and ideally remain reachable only over private network/tunnel.
- [ ] Add a Caddy review step before adding any new hostname.
- [x] Add a Caddy review step before adding any new hostname.
- vijay: added Caddy/port review commands to `docs/hermes-operations.md`.
### Phase 10 — Multi-Agent And Project Execution Workflow
@ -247,34 +293,38 @@ A healthy ByteLyst Hermes setup should be:
### Phase 12 — Documentation And Runbooks
- [ ] Add a Hermes operations index under `docs/`.
- [ ] Link this roadmap from `docs/repo-map.md`.
- [x] Add a Hermes operations index under `docs/`.
- vijay: created `docs/hermes-operations.md`.
- [x] Link this roadmap from `docs/repo-map.md`.
- vijay: roadmap was already listed; added `docs/hermes-operations.md` to repo map.
- [ ] Create or update runbooks for:
- [ ] installing/upgrading Hermes
- [ ] restarting the gateway
- [ ] restoring persistent data from backup
- [ ] configuring providers/models
- [ ] enabling/disabling tools
- [ ] adding safe cron watchdogs
- [ ] private-only dashboard access
- [ ] Keep commands copy-pasteable and include expected outputs.
- [ ] Store secrets only as placeholder variable names or `.env.example` entries.
- [x] restarting the gateway
- [x] restoring persistent data from backup
- [x] configuring providers/models
- [x] enabling/disabling tools
- [x] adding safe cron watchdogs
- [x] private-only dashboard access
- [x] Keep commands copy-pasteable and include expected outputs.
- vijay: copied operational commands into `docs/hermes-operations.md`; expected-output notes included where useful.
- [x] Store secrets only as placeholder variable names or `.env.example` entries.
- vijay: no raw secrets were added to docs or scripts.
## Priority Execution Plan
### Immediate — Today / Next Session
- [ ] Confirm no public Hermes dashboard route exists.
- [ ] Investigate `hermes doctor` timeout.
- [ ] Verify backup cron freshness and remote push status.
- [ ] Add one Telegram watchdog for gateway/backup failure.
- [x] Confirm no public Hermes dashboard route exists.
- [x] Investigate `hermes doctor` timeout.
- [x] Verify backup cron freshness and remote push status.
- [x] Add one Telegram watchdog for gateway/backup failure.
- [ ] Choose and configure one web search backend.
### Near-Term — This Week
- [ ] Add fallback model/provider.
- [ ] Document provider routing and model defaults.
- [ ] Add gateway recovery runbook.
- [x] Add gateway recovery runbook.
- [ ] Add restore drill runbook and perform one test-profile restore.
- [ ] Add Gitea/GitHub least-privilege automation credential path.
@ -291,13 +341,25 @@ A healthy ByteLyst Hermes setup should be:
This roadmap is complete when:
- [ ] Hermes can be upgraded and rolled back/restored with a documented process.
- [ ] Gateway failures and backup failures notify Telegram.
- [x] Gateway failures and backup failures notify Telegram.
- [ ] At least one fallback model/provider is configured and tested.
- [ ] Web/search tooling works for current research tasks.
- [ ] No Hermes dashboard/API is publicly exposed.
- [x] No Hermes dashboard/API is publicly exposed.
- [ ] Backup restore has been tested into a non-production profile.
- [ ] Core ByteLyst Hermes procedures exist as docs or skills.
- [ ] Sensitive files remain untracked and backup-safe.
- [x] Core ByteLyst Hermes procedures exist as docs or skills.
- [x] Sensitive files remain untracked and backup-safe.
## Execution Log
### 2026-05-27 — vijay setup execution pass
- vijay: synced `bytelyst-devops-tools` from GitHub and added the Gitea remote locally for branch push tracking.
- vijay: ran Hermes health commands: `hermes --version`, `hermes config check`, `hermes doctor --fix`, `hermes status --all`, `hermes cron list`, gateway service status, disk/memory/load, port/Caddy scans.
- vijay: `hermes doctor --fix` completed and migrated config v23 → v24.
- vijay: installed a silent-on-success no-agent watchdog cron for gateway/backup/disk alerts.
- vijay: created `docs/hermes-operations.md`, updated `docs/operations.md`, and added this roadmap progress commentary.
- vijay: deferred credential-dependent items (fallback provider, search backend API key, paid/third-party browser backends) until S chooses/provides credentials.
- vijay: deferred the actual Hermes version upgrade to a dedicated checkpoint because the install is 8 commits behind and should be upgraded only after a fresh backup/smoke-test window.
## Notes For Future Transcript Pass

View File

@ -171,6 +171,31 @@ If packages are still pending or the services are unhealthy, rerun:
sudo bash scripts/ubuntu-vm-security-update.sh
```
## 10. Operate Hermes Agent Safely
Use:
```bash
hermes --version
hermes status --all
hermes cron list
systemctl status hermes-gateway --no-pager
```
Read first:
- `docs/hermes-setup-upgrade-roadmap.md`
- `docs/hermes-operations.md`
Use this when:
- you are upgrading or troubleshooting Hermes
- you are checking Telegram gateway health
- you are verifying backup/watchdog cron jobs
- you are evaluating any private-only dashboard/API access pattern
Hard rule: do **not** expose a Hermes dashboard or Hermes API publicly unless S explicitly approves the exact hostname, auth gate, and access path.
## Team Guidance
- Prefer the supported entry points in `docs/tooling-status.md`.

View File

@ -51,6 +51,7 @@ Current key files:
- `docs/operations.md`
- `docs/remove_user_interactive.md`
- `docs/hermes-setup-upgrade-roadmap.md`
- `docs/hermes-operations.md`
### `.github/workflows/`

View File

@ -0,0 +1,94 @@
#!/usr/bin/env python3
"""Silent-on-success Hermes health watchdog for ByteLyst.
Designed for a Hermes no-agent cron job. It prints nothing when healthy and
prints a concise Telegram-ready alert when an actionable problem is detected.
"""
from __future__ import annotations
import os
import shutil
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path
DISK_WARN_PERCENT = int(os.getenv("HERMES_WATCHDOG_DISK_WARN_PERCENT", "85"))
BACKUP_STALE_MINUTES = int(os.getenv("HERMES_WATCHDOG_BACKUP_STALE_MINUTES", "90"))
BACKUP_JOB_NAME = os.getenv("HERMES_WATCHDOG_BACKUP_JOB_NAME", "Sync Hermes persistent-data backup to GitHub")
GATEWAY_SERVICE = os.getenv("HERMES_WATCHDOG_GATEWAY_SERVICE", "hermes-gateway.service")
HERMES_HOME = Path(os.getenv("HERMES_HOME", str(Path.home() / ".hermes")))
def run(cmd: list[str], timeout: int = 20) -> subprocess.CompletedProcess[str]:
return subprocess.run(cmd, text=True, capture_output=True, timeout=timeout, check=False)
def check_gateway(alerts: list[str]) -> None:
result = run(["systemctl", "is-active", GATEWAY_SERVICE])
if result.stdout.strip() != "active":
alerts.append(f"gateway service `{GATEWAY_SERVICE}` is not active: `{result.stdout.strip() or result.stderr.strip() or 'unknown'}`")
def check_backup_cron(alerts: list[str]) -> None:
result = run(["hermes", "cron", "list"], timeout=30)
out = result.stdout + result.stderr
if result.returncode != 0:
alerts.append(f"`hermes cron list` failed with exit {result.returncode}")
return
if BACKUP_JOB_NAME not in out:
alerts.append(f"backup cron job `{BACKUP_JOB_NAME}` was not found")
return
if "Last run:" not in out or " ok" not in out:
alerts.append("backup cron last-run status is not visibly `ok` in `hermes cron list`")
script_path = HERMES_HOME / "scripts" / "sync_hermes_persistent_backup.py"
if script_path.exists():
age_minutes = (datetime.now(timezone.utc).timestamp() - script_path.stat().st_mtime) / 60
# Script mtime is not backup freshness; keep this as a weak sanity note only.
if age_minutes < 0:
alerts.append("backup sync script has a future modification time")
def check_backup_repo_freshness(alerts: list[str]) -> None:
repo = Path(os.getenv("HERMES_WATCHDOG_BACKUP_REPO", str(HERMES_HOME / "persistent_backup_repo")))
candidates = [repo, Path.home() / "hermes_persistent_backup", Path.home() / "hermes_persistent_backup_repo"]
existing = next((p for p in candidates if (p / ".git").exists()), None)
if not existing:
# The backup cron may use its own path; cron status is the primary check.
return
result = run(["git", "-C", str(existing), "log", "-1", "--format=%ct"], timeout=20)
if result.returncode != 0 or not result.stdout.strip().isdigit():
alerts.append(f"could not inspect backup repo freshness at `{existing}`")
return
age_minutes = (datetime.now(timezone.utc).timestamp() - int(result.stdout.strip())) / 60
if age_minutes > BACKUP_STALE_MINUTES:
alerts.append(f"backup repo `{existing}` latest commit is stale: {age_minutes:.0f} minutes old")
def check_disk(alerts: list[str]) -> None:
usage = shutil.disk_usage("/")
pct = int(round((usage.used / usage.total) * 100))
if pct >= DISK_WARN_PERCENT:
alerts.append(f"root disk usage is high: {pct}% used (threshold {DISK_WARN_PERCENT}%)")
def main() -> int:
alerts: list[str] = []
for check in (check_gateway, check_backup_cron, check_backup_repo_freshness, check_disk):
try:
check(alerts)
except Exception as exc: # noqa: BLE001 - watchdog should alert, not crash silently
alerts.append(f"{check.__name__} errored: {exc}")
if alerts:
print("🚨 ByteLyst Hermes watchdog alert")
for item in alerts:
print(f"- {item}")
print("\nSuggested first checks: `systemctl status hermes-gateway --no-pager`, `hermes cron list`, `df -h /`.")
return 0
return 0
if __name__ == "__main__":
sys.exit(main())