bytelyst-devops-tools/systemd
Hermes VM 13a105ba23 feat(vm): Phase 5 closure — GPU/freshness checks, chaos validation, I/O alert
vm-health-check.sh:
- check_gpu(): nvidia-smi probe; "CPU-only" OK on this VM (no GPU)
- check_image_freshness(): flag containers running images >30d old.
  Skips third-party images (gitea, grafana, prom, mcr.microsoft, axllent,
  caddy, traefik, valkey, cadvisor) — they have their own rebuild cadence.
  Currently flags 19 stale product images (~60d old).
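
The freshness rule above can be sketched roughly as follows. This is an illustrative sketch, not the contents of vm-health-check.sh: the function names, the substring-based skip list, and the epoch-second inputs are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: names, patterns, and threshold handling are assumptions.

MAX_AGE_DAYS=30

# Hypothetical skip list mirroring the third-party names in the commit message.
SKIP_PATTERNS="gitea grafana prom mcr.microsoft axllent caddy traefik valkey cadvisor"

# Return 0 if the image reference contains any skip pattern.
is_third_party() {
  local image="$1" pat
  for pat in $SKIP_PATTERNS; do
    case "$image" in
      *"$pat"*) return 0 ;;
    esac
  done
  return 1
}

# Print the whole-day age of an image given its creation time and "now",
# both as Unix epoch seconds.
image_age_days() {
  local created="$1" now="$2"
  echo $(( (now - created) / 86400 ))
}

# Print "skipped", "stale", or "fresh" for one image.
classify_image() {
  local image="$1" created="$2" now="$3"
  if is_third_party "$image"; then
    echo "skipped"
  elif [ "$(image_age_days "$created" "$now")" -gt "$MAX_AGE_DAYS" ]; then
    echo "stale"
  else
    echo "fresh"
  fi
}
```

Substring matching is the simplest option but is broad: a pattern such as "prom" would also skip any product image whose name happens to contain it, so a real check might match on registry or repository prefix instead.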

chaos-validation.sh:
- Monthly chaos test: kill PID 1 in chronomind-web, wait up to 35 min
  for docker-health-watchdog to detect + restart. Telegram pass/fail.
- Refuses to run if the target is not healthy. The systemd timer fires on
  the 1st of each month at 10:00 UTC (after the 08:00 weekly digest).
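
The wait-for-restart step can be sketched as a poll loop with a deadline. This is a minimal sketch under assumptions: `wait_for_recovery` and `is_healthy` are hypothetical helpers, the 30-second poll interval is invented for the example, and the real chaos-validation.sh may detect recovery differently. A systemd calendar expression such as `OnCalendar=*-*-01 10:00:00 UTC` would match the schedule described, though the actual timer unit is not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_recovery and is_healthy are hypothetical helpers.

# Poll a check command until it succeeds or the timeout expires.
# Usage: wait_for_recovery TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_for_recovery() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical health probe: true when Docker reports the container healthy.
is_healthy() {
  [ "$(docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null)" = "healthy" ]
}

# Example call (35 min = 2100 s, polling every 30 s):
#   wait_for_recovery 2100 30 is_healthy chronomind-web
```

The same `is_healthy` probe could serve as the pre-flight guard, so the test refuses to start when the target is already unhealthy.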

vm-io-anomaly-check.sh:
- 6h avg sda write rate; transition alerts at WARN (1 GB/hr) /
  CRIT (2.5 GB/hr). De-dupes via /var/log/vm-io-anomaly-state so the
  alert fires once per transition, not every 6h. Current baseline:
  ~1.94 GB/hr (orphan-container state-file writes; see Phase 0.3).
- Reports recovery to OK when rate drops back.
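
The once-per-transition behaviour can be sketched as below. It is a sketch under assumptions, not the script itself: the function names are hypothetical, rates are integer MB/hr with decimal gigabytes, thresholds are inclusive, and a missing state file is treated as OK.

```shell
#!/usr/bin/env bash
# Sketch only: thresholds in MB/hr and the function names are assumptions.

WARN_MB_PER_HR=1000   # 1 GB/hr, assuming decimal gigabytes
CRIT_MB_PER_HR=2500   # 2.5 GB/hr

# Map an integer write rate (MB/hr) to OK, WARN, or CRIT.
classify_rate() {
  local rate="$1"
  if [ "$rate" -ge "$CRIT_MB_PER_HR" ]; then
    echo "CRIT"
  elif [ "$rate" -ge "$WARN_MB_PER_HR" ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}

# Compare the new level with the one saved in the state file.
# Prints "PREV->NEW" and updates the file only when the level changed;
# prints nothing when it is unchanged, so the caller sends no alert.
check_transition() {
  local state_file="$1" rate="$2" prev new
  new="$(classify_rate "$rate")"
  prev="OK"
  if [ -f "$state_file" ]; then
    prev="$(cat "$state_file")"
  fi
  if [ "$new" != "$prev" ]; then
    echo "$new" > "$state_file"
    echo "$prev->$new"
  fi
}
```

With the ~1.94 GB/hr baseline, the first run would emit `OK->WARN` and later runs at the same rate would stay silent. A drop below the WARN threshold would emit `WARN->OK`, which corresponds to the recovery report.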

vm/page.tsx: gpu + image_freshness added to CHECK_META so they render
with proper icon/label and slot into CHECK_ORDER.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
bytelyst-gitea-backup.service feat: add gitea backup timer assets 2026-05-27 18:53:20 +00:00
bytelyst-gitea-backup.timer feat: add gitea backup timer assets 2026-05-27 18:53:20 +00:00
chaos-validation.service feat(vm): Phase 5 closure — GPU/freshness checks, chaos validation, I/O alert 2026-05-30 05:26:49 +00:00
chaos-validation.timer feat(vm): Phase 5 closure — GPU/freshness checks, chaos validation, I/O alert 2026-05-30 05:26:49 +00:00
docker-health-watchdog.service feat(vm): Phases 1.2, 1.4, 2.1 — steal time, swap pressure, health watchdog 2026-05-27 21:31:09 +00:00
docker-health-watchdog.timer feat(vm): Phases 1.2, 1.4, 2.1 — steal time, swap pressure, health watchdog 2026-05-27 21:31:09 +00:00
hermes-emergency-drive-upload.service Add Google Drive emergency bundle upload 2026-05-27 12:08:41 +00:00
hermes-emergency-drive-upload.timer Add Google Drive emergency bundle upload 2026-05-27 12:08:41 +00:00
hermes-gateway.service Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00
hermes-root-backup.service Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00
hermes-root-backup.timer Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00
hermes-root-dashboard.service Complete Hermes dashboard and watchdog roadmap audit 2026-05-27 10:45:29 +00:00
uma-hermes-backup.service Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00
uma-hermes-backup.timer Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00
uma-hermes-dashboard.service Complete Hermes dashboard and watchdog roadmap audit 2026-05-27 10:45:29 +00:00
uma-hermes-gateway.service Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00
vm-io-anomaly-check.service feat(vm): Phase 5 closure — GPU/freshness checks, chaos validation, I/O alert 2026-05-30 05:26:49 +00:00
vm-io-anomaly-check.timer feat(vm): Phase 5 closure — GPU/freshness checks, chaos validation, I/O alert 2026-05-30 05:26:49 +00:00
vm-oom-watchdog.service feat(vm): Phase 2.3 closure — OOM watchdog + orphan-container docs 2026-05-30 05:26:49 +00:00
vm-oom-watchdog.timer feat(vm): Phase 2.3 closure — OOM watchdog + orphan-container docs 2026-05-30 05:26:49 +00:00
vm-weekly-digest.service feat(dashboard/vm): Phases 4.1-4.3 — Prometheus trends, sparklines, weekly digest 2026-05-30 05:26:49 +00:00
vm-weekly-digest.timer feat(dashboard/vm): Phases 4.1-4.3 — Prometheus trends, sparklines, weekly digest 2026-05-30 05:26:49 +00:00