bytelyst-devops-tools/systemd
Hermes VM d9618ba7b0 feat(vm): Phases 1.2, 1.4, 2.1 — steal time, swap pressure, health watchdog
Phase 1.2 — CPU steal time metric in vm-health-check.sh:
- Samples /proc/stat twice 1s apart for accurate current steal %
- Thresholds: >5% WARN, >15% CRIT (currently 0.8% on this host)
- Inserts before memory check so steal is visible alongside load
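
For illustration only, the two-sample steal calculation can be sketched as below. This is not the code in vm-health-check.sh; the function names steal_pct and steal_level are hypothetical, and the sketch assumes the standard /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
#!/bin/sh
# Hypothetical sketch, not the actual vm-health-check.sh implementation.

# steal_pct "CPU_LINE_1" "CPU_LINE_2": print steal % between two samples
# of the aggregate "cpu" line from /proc/stat, with one decimal place.
steal_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, " "); split(b, y, " ")
    # Fields 2..9: user nice system idle iowait irq softirq steal.
    # Guest time is already counted inside user/nice, so it is skipped.
    for (i = 2; i <= 9; i++) { t1 += x[i]; t2 += y[i] }
    dt = t2 - t1
    ds = y[9] - x[9]
    if (dt <= 0) { print "0.0"; exit }
    printf "%.1f\n", 100 * ds / dt
  }'
}

# steal_level PCT [WARN] [CRIT]: map a percentage to OK / WARN / CRIT.
# Defaults follow the commit message: >5% WARN, >15% CRIT.
steal_level() {
  awk -v p="$1" -v w="${2:-5}" -v c="${3:-15}" 'BEGIN {
    if (p > c) print "CRIT"
    else if (p > w) print "WARN"
    else print "OK"
  }'
}

# Sample twice, 1 s apart, when running on a Linux host.
if [ -r /proc/stat ]; then
  s1=$(grep '^cpu ' /proc/stat)
  sleep 1
  s2=$(grep '^cpu ' /proc/stat)
  pct=$(steal_pct "$s1" "$s2")
  echo "steal: ${pct}% ($(steal_level "$pct"))"
fi
```

Taking the difference of two samples gives the current steal share rather than the average since boot, which is what a single read of /proc/stat would report.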

Phase 1.4 — Swap pressure indicator:
- Reads SwapCached from /proc/meminfo as secondary metric
- Raises SWAP_USED_WARN_GB from 1 to 1.5 to reduce noise (current usage 0.6G)
- New WARN path: SwapCached > 200MB signals recent pressure even when
  current swap usage looks OK (catches post-spike state)
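
A rough sketch of the two swap checks follows. It is not taken from the script: the function name swap_status and the variable SWAP_CACHED_WARN_MB are hypothetical, and it assumes swap usage is computed as SwapTotal minus SwapFree, which the commit message does not state.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual vm-health-check.sh implementation.

# swap_status [MEMINFO_FILE]: print "WARN swap-used", "WARN swap-cached"
# or "OK". Values in /proc/meminfo are reported in kB.
swap_status() {
  awk -v warn_gb="${SWAP_USED_WARN_GB:-1.5}" \
      -v cached_mb="${SWAP_CACHED_WARN_MB:-200}" '
    $1 == "SwapTotal:"  { total  = $2 }
    $1 == "SwapFree:"   { swfree = $2 }
    $1 == "SwapCached:" { cached = $2 }
    END {
      used_kb = total - swfree            # assumption: used = total - free
      if (used_kb > warn_gb * 1024 * 1024)
        print "WARN swap-used"
      else if (cached > cached_mb * 1024)
        print "WARN swap-cached"          # recent pressure, usage now low
      else
        print "OK"
    }' "${1:-/proc/meminfo}"
}

# Report on the live host when /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
  echo "swap: $(swap_status)"
fi
```

Checking SwapCached second means it only fires when current usage is under the main threshold, which matches the post-spike case described above.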

Phase 2.1 — Docker health-check watchdog:
- docker-health-watchdog.sh: checks unhealthy containers every 10 min,
  restarts only after 3 consecutive failing health checks (30min grace)
- docker-health-watchdog.service + .timer: enabled, fires every 10 min
- Sends Telegram notification on each auto-restart
- Rollback: systemctl disable --now docker-health-watchdog.timer
  (--now also stops the running timer; disable alone only prevents it from starting at boot)
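
The "3 consecutive failing checks" rule could be tracked with one counter file per container, as in the sketch below. This is an assumption about one possible design, not the contents of docker-health-watchdog.sh; update_counts, THRESHOLD, STATE_DIR and the --run flag are hypothetical names, and the Telegram notification is left out.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual docker-health-watchdog.sh.

# update_counts STATE_DIR [NAME...]: NAMEs are the containers currently
# reported unhealthy. Bumps each one's counter, clears counters of
# containers no longer listed, and prints the names that have reached
# THRESHOLD (default 3) consecutive unhealthy runs.
update_counts() {
  dir=$1
  shift
  mkdir -p "$dir"
  # A container that recovered loses its streak.
  for f in "$dir"/*.count; do
    [ -e "$f" ] || continue
    name=$(basename "$f" .count)
    keep=0
    for n in "$@"; do
      if [ "$n" = "$name" ]; then keep=1; fi
    done
    if [ "$keep" -eq 0 ]; then rm -f "$f"; fi
  done
  for n in "$@"; do
    count=$(cat "$dir/$n.count" 2>/dev/null || echo 0)
    count=$((count + 1))
    if [ "$count" -ge "${THRESHOLD:-3}" ]; then
      echo "$n"
      rm -f "$dir/$n.count"   # start a fresh streak after the restart
    else
      echo "$count" > "$dir/$n.count"
    fi
  done
}

# Only touch Docker when explicitly asked to: sh watchdog.sh --run
if [ "${1:-}" = "--run" ]; then
  names=$(docker ps --filter health=unhealthy --format '{{.Names}}')
  # Container names contain no whitespace, so word splitting is safe here.
  for c in $(update_counts "${STATE_DIR:-/tmp/docker-health-watchdog}" $names); do
    docker restart "$c"
  done
fi
```

With the timer firing every 10 minutes, a container reaches the threshold on its third unhealthy run in a row, and a single healthy run in between resets the count.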

Phase 2.2 is already complete: sync_hermes_persistent_backup.py handles
divergence gracefully with a rebase/reset-hard fallback and is running successfully.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:31:09 +00:00
bytelyst-gitea-backup.service feat: add gitea backup timer assets 2026-05-27 18:53:20 +00:00
bytelyst-gitea-backup.timer feat: add gitea backup timer assets 2026-05-27 18:53:20 +00:00
docker-health-watchdog.service feat(vm): Phases 1.2, 1.4, 2.1 — steal time, swap pressure, health watchdog 2026-05-27 21:31:09 +00:00
docker-health-watchdog.timer feat(vm): Phases 1.2, 1.4, 2.1 — steal time, swap pressure, health watchdog 2026-05-27 21:31:09 +00:00
hermes-emergency-drive-upload.service Add Google Drive emergency bundle upload 2026-05-27 12:08:41 +00:00
hermes-emergency-drive-upload.timer Add Google Drive emergency bundle upload 2026-05-27 12:08:41 +00:00
hermes-gateway.service Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00
hermes-root-backup.service Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00
hermes-root-backup.timer Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00
hermes-root-dashboard.service Complete Hermes dashboard and watchdog roadmap audit 2026-05-27 10:45:29 +00:00
uma-hermes-backup.service Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00
uma-hermes-backup.timer Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00
uma-hermes-dashboard.service Complete Hermes dashboard and watchdog roadmap audit 2026-05-27 10:45:29 +00:00
uma-hermes-gateway.service Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00