Complete Hermes dashboard and watchdog roadmap audit
parent 15ac960faf
commit 8de72351de
@@ -18,6 +18,9 @@ Observed on 2026-05-27:
 - Watchdog cron: `ByteLyst Hermes gateway/backup/disk watchdog`, every 15 minutes, Telegram delivery on failure only
 - Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
 - Tailscale: installed and `tailscaled` enabled/running; authenticated as tailnet IP `100.87.53.10`
+- Private dashboards:
+  - Root: `http://100.87.53.10:9119/`, `hermes-root-dashboard.service`
+  - Uma: `http://100.87.53.10:9120/`, `uma-hermes-dashboard.service`
 
 ## Safety guardrail: no public Hermes dashboard/API
 
@@ -48,6 +51,25 @@ tailscale ip -4
 # Expected server IPv4: 100.87.53.10
 ```
 
+Private dashboard services:
+
+```bash
+systemctl status hermes-root-dashboard --no-pager
+systemctl status uma-hermes-dashboard --no-pager
+ss -ltnp | grep -E ':(9119|9120)'
+
+# Expected listeners are Tailscale-only:
+# 100.87.53.10:9119
+# 100.87.53.10:9120
+```
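Review note: the listener check above could also be automated. The sketch below parses `ss -ltn`-style output and reports any dashboard port bound to an address other than the Tailscale IP. The function name `non_tailscale_listeners` and the sample output are illustrative assumptions, not part of this commit.

```python
TAILSCALE_IP = "100.87.53.10"   # tailnet IP documented in the runbook
DASHBOARD_PORTS = {9119, 9120}  # root and Uma dashboard ports


def non_tailscale_listeners(ss_output: str) -> list[str]:
    """Return local addresses on dashboard ports that are not bound to the Tailscale IP.

    Expects text shaped like `ss -ltn` output, where the fourth column
    is "Local Address:Port".
    """
    offenders: list[str] = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue  # skip the header row and anything that is not a listener
        local = fields[3]
        host, _, port = local.rpartition(":")
        if not port.isdigit() or int(port) not in DASHBOARD_PORTS:
            continue
        if host != TAILSCALE_IP:
            offenders.append(local)
    return offenders


# Hypothetical sample output, for illustration only.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      128    100.87.53.10:9119  0.0.0.0:*
LISTEN 0      128    0.0.0.0:9120       0.0.0.0:*
"""

print(non_tailscale_listeners(SAMPLE))
```

An empty list means both dashboards are bound only to the tailnet address; any entry is a listener that would need review under the private-only policy.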
+
+Tracked service unit templates:
+
+```bash
+systemd/hermes-root-dashboard.service
+systemd/uma-hermes-dashboard.service
+```
+
 ## Health baseline commands
 
 ```bash
@@ -115,6 +137,7 @@ Behavior:
 - no output on success, so the cron stays silent
 - sends a Telegram message only when it detects an actionable failure
 - checks gateway service state, Hermes cron backup visibility/status, backup repo freshness when discoverable, and root disk usage
+- also checks memory pressure plus critical Caddy/Gitea Docker containers (`caddy`, `gitea-npm-registry`)
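Review note: as a rough illustration of the memory-pressure check, the sketch below computes used memory as `(MemTotal - MemAvailable) / MemTotal` from `/proc/meminfo`-style text and compares it to a 90% threshold. The function name `memory_used_percent` and the sample values are assumptions for illustration; it is a simplified stand-in, not the watchdog's own code.

```python
def memory_used_percent(meminfo_text: str) -> int:
    """Return used memory as a whole percentage, from MemTotal and MemAvailable (kB)."""
    values: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1].isdigit():
            values[parts[0].rstrip(":")] = int(parts[1])
    total = values.get("MemTotal", 0)
    available = values.get("MemAvailable", 0)
    if total <= 0:
        raise ValueError("MemTotal missing from meminfo text")
    return int(round((total - available) / total * 100))


# Hypothetical sample values in kB, for illustration only.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:          500000 kB
MemAvailable:     640000 kB
"""

used = memory_used_percent(SAMPLE)
print(f"{used}% used; alert at 90%: {used >= 90}")
```

Using `MemAvailable` rather than `MemFree` avoids alerting on memory the kernel is only using for reclaimable cache.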
 
 Manual smoke test:
 
@@ -202,3 +225,27 @@ Restart/reset requirement:
 - gateway config changes: `/restart` from Telegram or `hermes gateway restart`
 - CLI session tool changes: start a new session or `/reset`
 - provider auth changes: start a new session after switching models/providers
+
+## Telegram topics and session handling
+
+Root and Uma currently use the standard Telegram gateway session handling. Do not enable or change topic/session behavior without a concrete routing need.
+
+Review these before changing Telegram routing:
+
+```bash
+systemctl status hermes-gateway --no-pager
+sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
+grep -RniE 'topic|thread|TELEGRAM_.*THREAD|HOME_CHANNEL' /root/.hermes /home/uma/.hermes 2>/dev/null | head -100
+```
+
+## Multi-agent execution conventions
+
+Use the smallest execution surface that fits the task:
+
+- direct tool call: one-shot local checks, edits, commits, pushes, status reads
+- `delegate_task`: bounded research or code inspection that can return inside the parent session
+- background terminal process: long-running local commands that need monitoring
+- cron job: recurring, deterministic, silent-on-success maintenance
+- Kanban worker: durable multi-agent project coordination after the board is intentionally configured
+
+Telegram progress/completion updates should keep the user's numbered-prefix convention (`1`, `2`, etc. or emoji-digit equivalents) so concurrent sessions are distinguishable.
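Review note: one way the numbered-prefix convention could be applied in code is sketched below. The helper `prefixed` is hypothetical. It builds a keycap emoji from a digit followed by U+FE0F and U+20E3, and falls back to the plain number for session numbers above 9 (an assumption, since the convention text does not cover that case).

```python
KEYCAP_SUFFIX = "\ufe0f\u20e3"  # variation selector-16 + combining enclosing keycap


def prefixed(session_number: int, message: str, emoji: bool = True) -> str:
    """Prefix a Telegram update with its session number so concurrent sessions stay distinguishable."""
    if session_number < 1:
        raise ValueError("session numbers start at 1")
    if emoji and session_number <= 9:
        label = f"{session_number}{KEYCAP_SUFFIX}"  # renders as a keycap digit
    else:
        label = str(session_number)
    return f"{label} {message}"


print(prefixed(1, "backup audit started"))
print(prefixed(12, "watchdog check finished"))
```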
@@ -8,10 +8,18 @@
 
 ## Completion Status
 
-- **Overall checklist completion:** ~57% (`102/179` checked after auditing root-vs-Uma evidence on 2026-05-27).
-- **Credential-independent setup:** materially further along; remaining blockers are mostly provider/search credentials, tailnet login, GitHub/Gitea tokens, and policy decisions.
+- **Overall checklist completion:** ~68% (`122/179` checked after the 2026-05-27 dashboard/watchdog/runbook audit).
+- **Credential-independent setup:** materially further along; remaining blockers are mostly provider/search credentials, GitHub/Gitea tokens, Uma backup design, and policy decisions.
   - vijay: percentage is based on literal Markdown checklist boxes, including nested sub-items. It intentionally counts credential-dependent future work as incomplete.
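Review note: a figure like `122/179` can be reproduced by counting literal Markdown checkbox lines, nested items included. The sketch below is an assumed approach: the regex, the function name `checklist_progress`, and the sample text are all illustrative, and this is not necessarily how the audit computed its number.

```python
import re

# Matches "- [ ]" / "- [x]" list items at any indentation level.
CHECKBOX = re.compile(r"^[ \t]*[-*] \[( |x|X)\] ", re.MULTILINE)


def checklist_progress(markdown: str) -> tuple[int, int]:
    """Return (checked, total) for every checkbox item in the text."""
    marks = CHECKBOX.findall(markdown)
    checked = sum(1 for mark in marks if mark.lower() == "x")
    return checked, len(marks)


# Hypothetical roadmap excerpt, for illustration only.
SAMPLE = """\
- [x] Keep Telegram as the primary control plane.
  - vijay: a note line, not a checkbox
- [ ] Dashboard should show:
  - [ ] gateway status
  - [x] backup freshness
"""

checked, total = checklist_progress(SAMPLE)
print(f"{checked}/{total} checked (~{round(checked / total * 100)}%)")
```

Note lines such as `- vijay: ...` are deliberately not counted, which matches the stated rule that only literal checkbox items contribute to the percentage.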
 
+## Remaining Unchecked Item Classification
+
+- **Needs credentials/API keys:** fallback provider setup, web search/extract backend, GitHub/Gitea automation token, Browserbase/Browser Use, and provider fallback tests.
+- **Needs explicit policy decision:** Cloudflare Access/basic-auth public fallback, model-routing tiers, local browser automation, vision/image provider choice, `security.redact_secrets`, `privacy.redact_pii`, and credential rotation.
+- **Needs Uma backup design:** Uma/Bheem currently has a clean VM wrapper repo, but not a root-style sanitized Hermes persistent backup/restore workflow.
+- **Needs manual UX validation:** dashboard feature-by-feature checks, Telegram approval prompt flow, and Telegram media/file delivery.
+- **Needs future workflow adoption:** practicing `delegate_task`, spawned/tmux sessions, worktrees, and Kanban on real tasks before checking them as completed.
+
 ## Purpose
 
 Turn the Hermes setup ideas from the referenced video into a practical ByteLyst upgrade checklist for this VM-backed, Telegram-driven Hermes installation.
@@ -243,7 +251,8 @@ A healthy ByteLyst Hermes setup should be:
 ### Phase 6 — Telegram Gateway Workflow
 
 - [x] Keep Telegram as the primary control plane.
-  - vijay: watchdog delivery is configured to the origin Telegram conversation; dashboard remains private-only/pending.
+  - vijay: watchdog delivery is configured to the origin Telegram conversation; root dashboard is private-only over Tailscale.
+  - bheem: Uma gateway remains Telegram-driven; Uma dashboard is private-only over Tailscale.
 - [x] Preserve the user's preferred progress prefix convention: `1️⃣`, `2️⃣`, etc.
   - vijay: retained in roadmap and memory; use for progress/completion updates from Hermes sessions.
 - [x] Ensure home channel and allowed user settings are correct.
@@ -254,22 +263,32 @@ A healthy ByteLyst Hermes setup should be:
   - [x] outbound completion message
   - [ ] approval prompt flow
   - [ ] media/file delivery
-- [ ] Decide whether Telegram topic/session handling should be enabled or documented.
+- [x] Decide whether Telegram topic/session handling should be enabled or documented.
+  - vijay: documented current stance in `docs/hermes-operations.md`: keep default Telegram session handling unless a concrete topic-routing need appears.
+  - bheem: same default-session stance applies to Uma/Bheem.
 - [x] Add a runbook for gateway restart/recovery.
   - vijay: added gateway recovery section to `docs/hermes-operations.md`.
 
 ### Phase 7 — Memory, Skills, And Knowledge Capture
 
-- [ ] Review persistent memory for stale entries and trim anything no longer useful.
-- [ ] Keep memories declarative and durable; avoid storing task-completion artifacts.
+- [x] Review persistent memory for stale entries and trim anything no longer useful.
+  - vijay: reviewed root `MEMORY.md` and `USER.md`; entries are operationally relevant, no safe deletion needed.
+  - bheem: reviewed Uma `MEMORY.md` and `USER.md`; entries are current Bheem context, no safe deletion needed.
+- [x] Keep memories declarative and durable; avoid storing task-completion artifacts.
+  - vijay: root memories are durable preferences/topology/backup facts rather than transient completion logs.
+  - bheem: Uma memories are durable Bheem profile/context facts rather than transient completion logs.
 - [ ] Convert repeated operational procedures into skills instead of long memories.
 - [ ] Pin critical ByteLyst/Hermes skills that should not be archived.
 - [ ] Schedule or manually run curator reviews if enabled.
 - [ ] Add skills for recurring ByteLyst workflows:
-  - [ ] Gitea Actions troubleshooting
-  - [ ] Caddy + Docker routing changes
-  - [ ] Hermes backup/restore drill
-  - [ ] Telegram gateway recovery
+  - [x] Gitea Actions troubleshooting
+    - vijay: root has `devops/self-hosted-gitea-ci`.
+  - [x] Caddy + Docker routing changes
+    - vijay: root has `devops/caddy-subdomain-routing`.
+  - [x] Hermes backup/restore drill
+    - vijay: root has `devops/hermes-persistent-backup-ops`; Uma backup workflow remains separate and not equivalent.
+  - [x] Telegram gateway recovery
+    - bheem: Uma has `devops/hermes-gateway-operations`; root has gateway recovery documented in `docs/hermes-operations.md`.
   - [ ] safe multi-repo commit/push workflow
 
 ### Phase 8 — Cron, Watchdogs, And Autonomous Maintenance
@@ -282,8 +301,10 @@ A healthy ByteLyst Hermes setup should be:
   - [x] cron scheduler stale
   - [x] backup job failed or no fresh commit within threshold
   - [x] disk usage high
-  - [ ] memory pressure high
-  - [ ] Caddy/Gitea critical services down
+  - [x] memory pressure high
+    - vijay: added `/proc/meminfo` memory-pressure threshold check to `scripts/hermes-health-watchdog.py`, deployed to `~/.hermes/scripts/hermes_health_watchdog.py`, and verified silent-on-success.
+  - [x] Caddy/Gitea critical services down
+    - vijay: added critical Docker container checks for `caddy` and `gitea-npm-registry`; deployed watchdog remains silent on a healthy run.
 - [x] Prefer `no_agent=True` script-only watchdogs for fixed health checks.
   - vijay: watchdog cron is no-agent/script-only and silent on success.
 - [x] Keep noisy health checks silent on success.
@@ -298,8 +319,8 @@ A healthy ByteLyst Hermes setup should be:
 - [x] Do not expose Hermes dashboard publicly.
   - vijay: no public dashboard/API route added; private-only policy documented.
 - [x] If a dashboard is useful, make it private-only and operationally scoped.
-  - vijay: selected private-only dashboard direction; Tailscale is connected at `100.87.53.10`. Dashboard itself is not running and no `9119/9120` listener is exposed.
-  - bheem: Uma dashboard access should use the same private-only Tailscale host path; no Uma dashboard listener is exposed.
+  - vijay: root dashboard is running as `hermes-root-dashboard.service` at `http://100.87.53.10:9119/`, bound only to the Tailscale IP.
+  - bheem: Uma dashboard is running as `uma-hermes-dashboard.service` at `http://100.87.53.10:9120/`, bound only to the Tailscale IP.
 - [ ] Dashboard should show:
   - [ ] gateway status
   - [ ] active sessions
@@ -307,9 +328,11 @@ A healthy ByteLyst Hermes setup should be:
   - [ ] backup freshness
   - [ ] recent sanitized alerts
   - [ ] quick links to docs/runbooks
+  - vijay: root dashboard HTTP endpoint returns `200` over Tailscale; feature-by-feature UI validation remains pending.
+  - bheem: Uma dashboard HTTP endpoint returns `200` over Tailscale; feature-by-feature UI validation remains pending.
 - [x] Any dashboard actions must require authentication and ideally remain reachable only over private network/tunnel.
-  - vijay: standing decision is local/Tailscale/SSH-only. Tailnet login is complete; dashboard auth validation remains a future task if the dashboard is started.
-  - bheem: same standing decision for Uma; no public dashboard route should be added.
+  - vijay: root dashboard is private-network-only via Tailscale IP binding; no public listener or Caddy route was added.
+  - bheem: Uma dashboard is private-network-only via Tailscale IP binding; no public listener or Caddy route was added.
 - [x] Add a Caddy review step before adding any new hostname.
   - vijay: added Caddy/port review commands to `docs/hermes-operations.md`.
 
@@ -319,13 +342,16 @@ A healthy ByteLyst Hermes setup should be:
 - [ ] Use spawned Hermes/tmux sessions only for long-running missions that must outlive the parent turn.
 - [ ] Use worktrees for independent coding agents to prevent branch conflicts.
 - [ ] For durable multi-agent coordination, evaluate Hermes Kanban.
-- [ ] Document when to use:
-  - [ ] direct tool call
-  - [ ] delegate_task
-  - [ ] background terminal process
-  - [ ] cron job
-  - [ ] Kanban worker
-- [ ] Add a ByteLyst convention for progress/completion Telegram notifications from concurrent sessions.
+- [x] Document when to use:
+  - [x] direct tool call
+  - [x] delegate_task
+  - [x] background terminal process
+  - [x] cron job
+  - [x] Kanban worker
+  - vijay: added multi-agent execution convention guidance to `docs/hermes-operations.md`.
+- [x] Add a ByteLyst convention for progress/completion Telegram notifications from concurrent sessions.
+  - vijay: documented the numbered/emoji-prefix convention in `docs/hermes-operations.md`.
+  - bheem: Uma/Bheem follows the same convention.
 
 ### Phase 11 — Security And Secret Hygiene
 
@@ -348,8 +374,9 @@ A healthy ByteLyst Hermes setup should be:
   - vijay: created `docs/hermes-operations.md`.
 - [x] Link this roadmap from `docs/repo-map.md`.
   - vijay: roadmap was already listed; added `docs/hermes-operations.md` to repo map.
-- [ ] Create or update runbooks for:
-  - [ ] installing/upgrading Hermes
+- [x] Create or update runbooks for:
+  - [x] installing/upgrading Hermes
+    - vijay: `docs/hermes-operations.md` contains upgrade commands and late-upgrade verification notes.
   - [x] restarting the gateway
   - [x] restoring persistent data from backup
   - [x] configuring providers/models
@@ -384,9 +411,12 @@ A healthy ByteLyst Hermes setup should be:
 
 ### Medium-Term — This Month
 
-- [ ] Evaluate private-only dashboard/mission-control UX.
+- [x] Evaluate private-only dashboard/mission-control UX.
+  - vijay: root dashboard is reachable via Tailscale at `http://100.87.53.10:9119/`.
+  - bheem: Uma dashboard is reachable via Tailscale at `http://100.87.53.10:9120/`.
 - [ ] Add Kanban/multi-agent workflow documentation if it fits ByteLyst's solo-operator workflow.
-- [ ] Add silent-on-success system watchdogs.
+- [x] Add silent-on-success system watchdogs.
+  - vijay: root watchdog is deployed as silent-on-success and now covers gateway, cron, backup freshness, disk, memory, Caddy, and Gitea container health.
 - [ ] Clean up stale memory/skills and pin critical skills.
 - [ ] Schedule quarterly restore drills.
   - vijay: quarterly restore drill reminder cron is configured for root.
@@ -433,6 +463,12 @@ This roadmap is complete when:
 - vijay: confirmed root service is enabled and active.
 - bheem: confirmed Uma service is enabled and active; Docker-based Uma Hermes remains removed.
 - vijay: installed Tailscale `1.98.3`; `tailscaled` is enabled/running and authenticated to tailnet IP `100.87.53.10`.
+- vijay: installed permanent root dashboard service `hermes-root-dashboard.service` at `http://100.87.53.10:9119/`.
+- bheem: installed permanent Uma dashboard service `uma-hermes-dashboard.service` at `http://100.87.53.10:9120/`.
+- vijay: added dashboard service unit templates under `systemd/` for repo tracking.
+- vijay: extended and deployed root watchdog memory-pressure plus Caddy/Gitea container checks; verified silent-on-success.
+- vijay: reviewed root persistent memories and recurring workflow skills.
+- bheem: reviewed Uma persistent memories and recurring workflow skills.
 - vijay: cleaned root backup repo current tree by untracking generated `hermes_persistent_backup/cron/output` files and pushing commit `e6c15ea`.
 - bheem: confirmed Uma wrapper repo is clean at `7ee5720` after Docker deployment removal.
 - vijay: ran root restore rehearsal into `/tmp/hermes-restore-test-root`, verified portable restore content, and scanned restored config/template for common token patterns.
@@ -14,9 +14,15 @@ from datetime import datetime, timezone
 from pathlib import Path
 
 DISK_WARN_PERCENT = int(os.getenv("HERMES_WATCHDOG_DISK_WARN_PERCENT", "85"))
+MEMORY_WARN_PERCENT = int(os.getenv("HERMES_WATCHDOG_MEMORY_WARN_PERCENT", "90"))
 BACKUP_STALE_MINUTES = int(os.getenv("HERMES_WATCHDOG_BACKUP_STALE_MINUTES", "90"))
 BACKUP_JOB_NAME = os.getenv("HERMES_WATCHDOG_BACKUP_JOB_NAME", "Sync Hermes persistent-data backup to GitHub")
 GATEWAY_SERVICE = os.getenv("HERMES_WATCHDOG_GATEWAY_SERVICE", "hermes-gateway.service")
+DOCKER_CONTAINERS = [
+    item.strip()
+    for item in os.getenv("HERMES_WATCHDOG_DOCKER_CONTAINERS", "caddy,gitea-npm-registry").split(",")
+    if item.strip()
+]
 HERMES_HOME = Path(os.getenv("HERMES_HOME", str(Path.home() / ".hermes")))
 
 
@@ -73,9 +79,49 @@ def check_disk(alerts: list[str]) -> None:
         alerts.append(f"root disk usage is high: {pct}% used (threshold {DISK_WARN_PERCENT}%)")
 
 
+def check_memory(alerts: list[str]) -> None:
+    meminfo: dict[str, int] = {}
+    for line in Path("/proc/meminfo").read_text(encoding="utf-8").splitlines():
+        parts = line.split()
+        if len(parts) >= 2:
+            meminfo[parts[0].rstrip(":")] = int(parts[1])
+    total = meminfo.get("MemTotal", 0)
+    available = meminfo.get("MemAvailable", 0)
+    if total <= 0 or available <= 0:
+        alerts.append("could not read memory pressure from /proc/meminfo")
+        return
+    used_pct = int(round(((total - available) / total) * 100))
+    if used_pct >= MEMORY_WARN_PERCENT:
+        alerts.append(f"memory pressure is high: {used_pct}% used (threshold {MEMORY_WARN_PERCENT}%)")
+
+
+def check_docker_containers(alerts: list[str]) -> None:
+    if not DOCKER_CONTAINERS:
+        return
+    docker = shutil.which("docker")
+    if not docker:
+        alerts.append("docker executable not found; cannot verify critical containers")
+        return
+    result = run([docker, "ps", "--format", "{{.Names}}"], timeout=20)
+    if result.returncode != 0:
+        alerts.append(f"`docker ps` failed while checking critical containers: {result.stderr.strip() or result.stdout.strip()}")
+        return
+    running = set(result.stdout.splitlines())
+    missing = [name for name in DOCKER_CONTAINERS if name not in running]
+    if missing:
+        alerts.append(f"critical Docker container(s) not running: {', '.join(missing)}")
+
+
 def main() -> int:
     alerts: list[str] = []
-    for check in (check_gateway, check_backup_cron, check_backup_repo_freshness, check_disk):
+    for check in (
+        check_gateway,
+        check_backup_cron,
+        check_backup_repo_freshness,
+        check_disk,
+        check_memory,
+        check_docker_containers,
+    ):
         try:
             check(alerts)
         except Exception as exc:  # noqa: BLE001 - watchdog should alert, not crash silently
@@ -85,7 +131,10 @@ def main() -> int:
         print("🚨 ByteLyst Hermes watchdog alert")
         for item in alerts:
             print(f"- {item}")
-        print("\nSuggested first checks: `systemctl status hermes-gateway --no-pager`, `hermes cron list`, `df -h /`.")
+        print(
+            "\nSuggested first checks: `systemctl status hermes-gateway --no-pager`, "
+            "`hermes cron list`, `df -h /`, `free -h`, `docker ps`."
+        )
         return 0
     return 0
 
21
systemd/hermes-root-dashboard.service
Normal file
@@ -0,0 +1,21 @@
+[Unit]
+Description=Root Hermes Dashboard (Tailscale private)
+After=network-online.target tailscaled.service
+Wants=network-online.target
+
+[Service]
+Type=simple
+User=root
+Group=root
+WorkingDirectory=/usr/local/lib/hermes-agent
+Environment="HOME=/root"
+Environment="USER=root"
+Environment="LOGNAME=root"
+Environment="HERMES_HOME=/root/.hermes"
+Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
+ExecStart=/usr/local/lib/hermes-agent/venv/bin/hermes dashboard --host 100.87.53.10 --port 9119 --no-open --insecure --skip-build
+Restart=always
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target
21
systemd/uma-hermes-dashboard.service
Normal file
@@ -0,0 +1,21 @@
+[Unit]
+Description=Uma Hermes Dashboard (Tailscale private)
+After=network-online.target tailscaled.service
+Wants=network-online.target
+
+[Service]
+Type=simple
+User=uma
+Group=uma
+WorkingDirectory=/usr/local/lib/hermes-agent
+Environment="HOME=/home/uma"
+Environment="USER=uma"
+Environment="LOGNAME=uma"
+Environment="HERMES_HOME=/home/uma/.hermes"
+Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/home/uma/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
+ExecStart=/usr/local/lib/hermes-agent/venv/bin/hermes dashboard --host 100.87.53.10 --port 9120 --no-open --insecure --skip-build
+Restart=always
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target