bytelyst-devops-tools

Author	SHA1	Message	Date
Hermes VM	13a105ba23	feat(vm): Phase 5 closure - GPU/freshness checks, chaos validation, I/O alert	2026-05-30 05:26:49 +00:00
    vm-health-check.sh:
    - check_gpu(): nvidia-smi probe; "CPU-only" OK on this VM (no GPU)
    - check_image_freshness(): flag containers running images >30d old.
      Skips third-party images (gitea, grafana, prom, mcr.microsoft,
      axllent, caddy, traefik, valkey, cadvisor); they have their own
      rebuild cadence. Currently flags 19 stale product images (~60d old).
    chaos-validation.sh:
    - Monthly chaos test: kill PID 1 in chronomind-web, wait up to 35 min
      for docker-health-watchdog to detect + restart. Telegram pass/fail.
    - Refuses to run if target not healthy. systemd timer fires 1st of
      month at 10:00 UTC (after 08:00 weekly digest).
    vm-io-anomaly-check.sh:
    - 6h avg sda write rate; transition alerts at WARN (1 GB/hr) /
      CRIT (2.5 GB/hr). De-dupes via /var/log/vm-io-anomaly-state so the
      alert fires once per transition, not every 6h. Current baseline:
      ~1.94 GB/hr (orphan-container state-file writes; see Phase 0.3).
    - Reports recovery to OK when rate drops back.
    vm/page.tsx: gpu + image_freshness added to CHECK_META so they render
    with proper icon/label and slot into CHECK_ORDER.
    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
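The transition-alert behaviour described for vm-io-anomaly-check.sh in the commit above could be sketched roughly as follows. This is an illustrative Python sketch, not the shell script itself. Only the two thresholds and the state-file path come from the commit message; the function names, the inclusive threshold comparison, and the one-word state-file format are assumptions.

```python
from pathlib import Path

WARN_GB_PER_HR = 1.0   # from the commit message
CRIT_GB_PER_HR = 2.5   # from the commit message
STATE_FILE = Path("/var/log/vm-io-anomaly-state")  # path from the commit message


def classify(rate_gb_per_hr):
    """Map a 6h average write rate to OK/WARN/CRIT (inclusive bounds are an assumption)."""
    if rate_gb_per_hr >= CRIT_GB_PER_HR:
        return "CRIT"
    if rate_gb_per_hr >= WARN_GB_PER_HR:
        return "WARN"
    return "OK"


def next_alert(previous_state, rate_gb_per_hr):
    """Return (new_state, message); message is None when the state did not change."""
    current = classify(rate_gb_per_hr)
    if current == previous_state:
        return current, None  # same state as last run: stay silent
    if current == "OK":
        return current, f"I/O recovered to OK ({rate_gb_per_hr:.2f} GB/hr)"
    return current, f"I/O {current}: {rate_gb_per_hr:.2f} GB/hr sda writes (6h avg)"


def check_once(rate_gb_per_hr, state_file=STATE_FILE):
    """Compare against the persisted state, persist the new one, return any alert text."""
    # Hypothetical format: the file holds a single word, and a missing file means OK.
    previous = state_file.read_text().strip() if state_file.exists() else "OK"
    state, message = next_alert(previous, rate_gb_per_hr)
    state_file.write_text(state + "\n")
    return message
```

Under these assumptions the reported baseline of ~1.94 GB/hr produces one WARN message on the first run and nothing on later runs until the rate crosses another threshold or drops back below 1 GB/hr.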
Hermes VM	d9618ba7b0	feat(vm): Phases 1.2, 1.4, 2.1 - steal time, swap pressure, health watchdog	2026-05-27 21:31:09 +00:00
    Phase 1.2 - CPU steal time metric in vm-health-check.sh:
    - Samples /proc/stat twice 1s apart for accurate current steal %
    - Thresholds: >5% WARN, >15% CRIT (currently 0.8% on this host)
    - Inserts before memory check so steal is visible alongside load
    Phase 1.4 - Swap pressure indicator:
    - Reads SwapCached from /proc/meminfo as secondary metric
    - Raises SWAP_USED_WARN_GB 1→1.5 to reduce noise (current usage 0.6G)
    - New WARN path: SwapCached > 200MB signals recent pressure even when
      current swap usage looks ok (catches post-spike state)
    Phase 2.1 - Docker health-check watchdog:
    - docker-health-watchdog.sh: checks unhealthy containers every 10 min,
      restarts only after 3 consecutive failing health checks (30min grace)
    - docker-health-watchdog.service + .timer: enabled, fires every 10 min
    - Sends Telegram notification on each auto-restart
    - Rollback: systemctl disable docker-health-watchdog.timer
    Phase 2.2 already complete: sync_hermes_persistent_backup.py handles
    diverge gracefully with rebase/reset-hard fallback; running successfully.
    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
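The steal-time measurement from Phase 1.2 in the commit above (two /proc/stat samples taken 1s apart) can be illustrated with a minimal Python sketch. The real check lives in a shell script, so the function names here are hypothetical. The sketch also assumes the modern aggregate `cpu` line layout, in which steal is the eighth counter; the WARN/CRIT thresholds are the ones quoted in the commit message.

```python
import time


def parse_cpu_line(line):
    """Return (steal, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # user nice system idle iowait irq softirq steal; guest counters are
    # skipped because the kernel already includes them in user/nice.
    values = [int(v) for v in fields[1:9]]
    return values[7], sum(values)


def steal_percent(first_line, second_line):
    """Steal share of all CPU time elapsed between two samples, in percent."""
    steal_1, total_1 = parse_cpu_line(first_line)
    steal_2, total_2 = parse_cpu_line(second_line)
    elapsed = total_2 - total_1
    if elapsed <= 0:
        return 0.0
    return 100.0 * (steal_2 - steal_1) / elapsed


def classify_steal(percent):
    """Thresholds from the commit message: >5% WARN, >15% CRIT."""
    if percent > 15:
        return "CRIT"
    if percent > 5:
        return "WARN"
    return "OK"


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (the aggregate over all CPUs)."""
    with open(path) as handle:
        return handle.readline()


def sample_steal(interval_s=1.0):
    """Take two samples interval_s apart, as the commit describes."""
    first = read_cpu_line()
    time.sleep(interval_s)
    return steal_percent(first, read_cpu_line())
```

Working from the difference between two samples matters because the counters in /proc/stat are cumulative since boot; a single read would only give the long-run average, not the current steal rate.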
Hermes VM	9210a8890f	feat: detect stale VM automation	2026-05-27 21:00:43 +00:00
Hermes VM	0a2d303f93	add HostingerVM health-check and cleanup scripts	2026-05-27 18:53:20 +00:00
    - vm-health-check.sh: read-only checks for disk, load, RAM, swap,
      Docker containers (crash-loops + healthchecks), build cache, journal.
      Flags: --quiet, --json, --notify (Telegram).
      Exit 0/1/2 = OK/WARN/CRIT.
    - vm-cleanup.sh: safe periodic cleanup.
      Default (weekly): build cache, journal, apt, npm, .next/cache.
      --full (monthly): adds docker system prune, pnpm store, old logs,
      HOLD cleanup.
      --dry-run, --install-cron, --uninstall-cron.
      Logs to /var/log/vm-cleanup.log.
    Related: docs/hostinger-vm-maintenance.md,
    scripts/VMs/HostingerVM/CRON_SETUP.md
    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
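The exit-code convention in the commit above (0/1/2 = OK/WARN/CRIT) implies that the worst individual check decides the overall result. A small Python sketch of that aggregation follows, purely as an illustration. The check names and the `overall_exit_code` helper are hypothetical; only the status-to-code mapping comes from the commit message.

```python
EXIT_CODES = {"OK": 0, "WARN": 1, "CRIT": 2}  # mapping from the commit message


def overall_exit_code(results):
    """Return the exit code of the most severe status among all checks.

    `results` maps a check name to "OK", "WARN" or "CRIT"; an empty
    mapping is treated as OK (an assumption of this sketch).
    """
    return max((EXIT_CODES[status] for status in results.values()), default=0)


if __name__ == "__main__":
    # Hypothetical check results, for illustration only.
    checks = {"disk": "OK", "swap": "WARN", "journal": "OK"}
    raise SystemExit(overall_exit_code(checks))
```

A caller such as cron or a systemd unit can then branch on the exit status alone, without parsing the script's output.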

4 Commits