Phase 1.2 — CPU steal time metric in vm-health-check.sh: - Samples /proc/stat twice 1s apart for accurate current steal % - Thresholds: >5% WARN, >15% CRIT (currently 0.8% on this host) - Inserts before memory check so steal is visible alongside load Phase 1.4 — Swap pressure indicator: - Reads SwapCached from /proc/meminfo as secondary metric - Raises SWAP_USED_WARN_GB 1→1.5 to reduce noise (current usage 0.6G) - New WARN path: SwapCached > 200MB signals recent pressure even when current swap usage looks ok (catches post-spike state) Phase 2.1 — Docker health-check watchdog: - docker-health-watchdog.sh: checks unhealthy containers every 10 min, restarts only after 3 consecutive failing health checks (30min grace) - docker-health-watchdog.service + .timer: enabled, fires every 10 min - Sends Telegram notification on each auto-restart - Rollback: systemctl disable docker-health-watchdog.timer Phase 2.2 already complete: sync_hermes_persistent_backup.py handles diverge gracefully with rebase/reset-hard fallback; running successfully. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
13 lines
313 B
Desktop File
13 lines
313 B
Desktop File
[Unit]
|
|
Description=Restart Docker containers stuck in unhealthy state
|
|
Documentation=file:///usr/local/bin/docker-health-watchdog.sh
|
|
After=docker.service
|
|
Requires=docker.service
|
|
|
|
[Service]
|
|
Type=oneshot
|
|
User=root
|
|
Group=root
|
|
Environment="HERMES_HOME=/root/.hermes"
|
|
ExecStart=/usr/local/bin/docker-health-watchdog.sh
|