add HostingerVM health-check and cleanup scripts

- vm-health-check.sh: read-only checks for disk, load, RAM, swap, Docker containers (crash-loops + healthchecks), build cache, journal. Flags: --quiet, --json, --notify (Telegram). Exit 0/1/2 = OK/WARN/CRIT. - vm-cleanup.sh: safe periodic cleanup. Default (weekly): build cache, journal, apt, npm, .next/cache. --full (monthly): adds docker system prune, pnpm store, old logs, HOLD cleanup. --dry-run, --install-cron, --uninstall-cron. Logs to /var/log/vm-cleanup.log. Related: docs/hostinger-vm-maintenance.md, scripts/VMs/HostingerVM/CRON_SETUP.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 12:02:19 +00:00 · 2026-05-27 12:02:19 +00:00 · 0a2d303f93
commit 0a2d303f93
parent 4249b17afc
4 changed files with 1099 additions and 0 deletions
--- a/docs/hostinger-vm-maintenance.md
+++ b/docs/hostinger-vm-maintenance.md
@ -0,0 +1,235 @@
 # Hostinger VM — Maintenance & Incident Reference
 **VM:** `srv1491630.hstgr.cloud` · root · 4× AMD EPYC · 15 GB RAM · 193 GB disk
 **Key services:** `hermes-gateway`, `ollama`, Docker (~40 containers), `learning_ai_common_plat` stack
 ---
 ## Quick-start for day-to-day ops
 ```bash
 # Check VM health (read-only, safe any time)
 bash scripts/VMs/HostingerVM/vm-health-check.sh
 # Weekly safe cleanup
 bash scripts/VMs/HostingerVM/vm-cleanup.sh
 # Monthly deeper cleanup
 bash scripts/VMs/HostingerVM/vm-cleanup.sh --full
 # Cron setup (run once)
 bash scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron
 ```
 See [`CRON_SETUP.md`](../scripts/VMs/HostingerVM/CRON_SETUP.md) for full details.
 ---
 ## Incident Report — Load Average 1305 (2026-05-26)
 ### What happened
 The VM became completely unresponsive. Load average reached **1305** (normal < 4 on 4 CPUs).
 ```
 load average: 1305.54, 1339.23, 1302.41
 RAM: 13 / 15 GB used, ZERO swap configured
 ```
 **Single root cause:** one broken Docker container crash-looped **1,336 times** over ~25 hours.
 Container: `learning_ai_common_plat-admin-web-1`
 Error: `Cannot find module '/app/server.js'`
 Restart policy: `unless-stopped` (no backoff limit, retries forever)
 Each restart spawned ~3 OS processes:
 - `containerd-shim-runc-v2`
 - veth network interface creation
 - `networkctl` call for the new interface
 With 1,336 restarts × ~3 procs = **~4,000 processes** — the kernel scheduler thrashed → load 1305.
 ### Why the container was broken
 The `admin-web` Docker image had no `server.js` because its Next.js build failed silently. Three bugs stacked:
 | Bug | File | Detail |
 |-----|------|--------|
 | Missing build secret | `docker-compose.ecosystem.yml` | `admin-web` service was missing `<<: *product-build` anchor, so `GITEA_NPM_TOKEN` was never passed as a BuildKit secret → `pnpm install` of `@bytelyst/*` packages failed |
 | Missing COPY step | `dashboards/admin-web/Dockerfile` | `tsconfig.base.json` (monorepo root) was not copied into the build context → `tsc` couldn't find it → build failed |
 | Wrong pnpm flag | `dashboards/admin-web/Dockerfile` | `--legacy-peer-deps` is an npm flag, not valid in pnpm 10 → install step exited early |
 Because the build stage failed, `COPY --from=builder .next/standalone ./` copied nothing, leaving the runner stage with an empty `/app` — no `server.js`.
 ### Timeline
 | Time (UTC) | Event |
 |---|---|
 | 2026-05-26 04:43 | VM booted, Docker started |
 | 2026-05-26 04:56 | `admin-web` first restart (count=1) |
 | 2026-05-26 ~05:00–06:07 | Load climbs steadily, RAM fills |
 | 2026-05-26 ~ongoing | 1,336 restarts over 25 hours |
 | 2026-05-27 06:07 | VM rebooted (load avg recorded: 1305) |
 | 2026-05-27 06:28 | Diagnosis session started (load: 0.55 after reboot) |
 | 2026-05-27 08:20 | All fixes applied, cleanup complete |
 ### Secondary problems found
 | Issue | Detail |
 |---|---|
 | **No swap** | Zero swap configured — OOM kills inevitable under memory pressure |
 | **84 GB Docker build cache** | Never pruned; Next.js/TSC builds accumulate enormous layer cache |
 | **12 GB HOLD node_modules** | Archived projects in `/opt/bytelyst/HOLD` had deps never cleaned up |
 | **~3 GB .next/cache** | Build-time caches in active and HOLD repos |
 | **381 MB uncompressed logs** | `syslog.1`, `kern.log.1` not compressed; no size/retention limits on journal |
 | **No crash-loop detection** | Nothing alerting on containers restarting > N times |
 ---
 ## Fixes Applied (2026-05-27)
 ### 1. Crash loop — stopped
 Patched `/var/lib/docker/containers/2219091e.../hostconfig.json` while Docker was stopped:
 ```json
 "RestartPolicy": {"Name": "no", "MaximumRetryCount": 0}
 ```
 Container is now permanently stopped. Admin-web needs a proper rebuild before re-enabling.
 ### 2. Swap — added
 ```bash
 fallocate -l 4G /swapfile
 chmod 600 /swapfile
 mkswap /swapfile
 swapon /swapfile
 echo '/swapfile none swap sw 0 0' >> /etc/fstab
 sysctl vm.swappiness=10
 echo 'vm.swappiness=10' >> /etc/sysctl.conf
 ```
 ### 3. Disk — 79 GB reclaimed (70% → 27%)
 | Action | Freed |
 |---|---|
 | `docker builder prune -f` | 84 GB |
 | `docker system prune -f` | 107 MB |
 | HOLD node_modules deleted | ~12 GB |
 | HOLD `.next` build caches | ~1.2 GB |
 | Active `.next/cache` dirs | ~2.4 GB |
 | Old Claude CLI versions | ~940 MB |
 | npm cache clean | ~1.8 GB |
 | Journal vacuum | ~220 MB |
 | apt clean | ~280 MB |
 ### 4. Log management
 `/etc/systemd/journald.conf.d/size-limits.conf`:
 ```ini
 [Journal]
 SystemMaxUse=200M
 SystemKeepFree=1G
 MaxRetentionSec=7day
 MaxFileSec=1day
 ```
 `/etc/rsyslog.d/20-ufw-filter.conf`:
 ```
 :msg, contains, "[UFW BLOCK]" stop
 ```
 `/etc/logrotate.d/rsyslog-custom`: daily rotation, 7-day retention, compress-on-rotate.
 ### 5. Dockerfile fixes (ready, not yet deployed)
 `docker-compose.ecosystem.yml` — added `<<: *product-build` to `admin-web` build section
 `dashboards/admin-web/Dockerfile` — added `tsconfig.base.json` to COPY, removed `--legacy-peer-deps`
 ---
 ## Deploying admin-web (when ready)
 ```bash
 cd /opt/bytelyst/learning_ai_common_plat
 GITEA_NPM_TOKEN=$(cat ~/.gitea_npm_token) \
  docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
  build admin-web
 # Verify the standalone build was produced:
 docker run --rm --entrypoint ls \
  learning_ai_common_plat-admin-web:latest /app | grep server.js
 # Start it:
 docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
  up -d admin-web
 ```
 The container's restart policy will be set by the compose file (`unless-stopped`). Once the image is healthy, this is safe.
 ---
 ## Ongoing health targets
 | Metric | Healthy | Warning | Critical |
 |---|---|---|---|
 | Disk usage `/` | < 55% | 55–70% | > 70% |
 | Load average | < 4.0 | 4–8 | > 8 |
 | Available RAM | > 3 GB | 1–3 GB | < 1 GB |
 | Swap used | < 1 GB | 1–3 GB | > 3 GB |
 | Container restart count | < 5 | 5–20 | > 20 |
 | Docker build cache | < 5 GB | 5–20 GB | > 20 GB |
 ---
 ## Reference: safe cleanup commands
 ```bash
 # Always safe (just prunes unreferenced build layers)
 docker builder prune -f
 # Safe: removes stopped containers, unused networks, dangling images only
 docker system prune -f
 # Safe: removes packages not referenced by any installed node_modules
 pnpm store prune
 # Safe: vacuum journal to size limit
 journalctl --vacuum-size=200M
 # Safe: clear apt cache
 apt-get clean
 # Safe: clear npm cache
 npm cache clean --force
 # Careful: removes ALL images not used by a running container (rebuilds needed)
 docker image prune -a -f
 ```
 ---
 ## Crash-loop detection (manual check)
 ```bash
 # Show containers that have restarted more than 10 times
 docker ps -a --format '{{.Names}}\t{{.RestartCount}}' \
  | awk -F'\t' '$2 > 10 {print "⚠️ LOOP:", $1, "restarts:", $2}'
 # Show container logs for any that are restarting
 docker events --filter event=restart --since 1h
 ```
 The `vm-health-check.sh` script runs these checks automatically.
 ---
 ## Related scripts
 | Script | Purpose |
 |---|---|
 | `scripts/VMs/HostingerVM/vm-health-check.sh` | Daily read-only health check + alerts |
 | `scripts/VMs/HostingerVM/vm-cleanup.sh` | Periodic safe cleanup |
 | `scripts/VMs/HostingerVM/CRON_SETUP.md` | Cron wiring |
 | `scripts/ubuntu-vm-security-update.sh` | Security patching |
 | `scripts/VMs/HostingerVM/login.sh` | SSH into the VM |
--- a/scripts/VMs/HostingerVM/CRON_SETUP.md
+++ b/scripts/VMs/HostingerVM/CRON_SETUP.md
@ -0,0 +1,150 @@
 # Hostinger VM — Cron Setup
 Automated maintenance schedule for `srv1491630`.
 Scripts: `vm-health-check.sh` (read-only) + `vm-cleanup.sh` (safe cleanup).
 ---
 ## Quick install
 SSH into the VM and run:
 ```bash
 bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron
 ```
 This installs the full recommended schedule. To remove it:
 ```bash
 bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --uninstall-cron
 ```
 ---
 ## What gets scheduled
 | Schedule | Time (UTC) | Command | What it does |
 |---|---|---|---|
 | Daily | 07:00 | `vm-health-check.sh` | Read-only check; sends Telegram alert on WARNING/CRITICAL |
 | Daily | 03:00 | `vm-cleanup.sh` | Prune Docker build cache only (always safe) |
 | Weekly | Sun 02:00 | `vm-cleanup.sh` | Standard cleanup (see below) |
 | Monthly | 1st 01:00 | `vm-cleanup.sh --full` | Full cleanup (see below) |
 ---
 ## What each mode does
 ### Standard weekly cleanup (`vm-cleanup.sh`)
 All steps are labelled **SAFE** — they only remove regenerable caches.
 | Step | What's removed | Risk |
 |---|---|---|
 | Docker build cache | Layer cache from `docker build` runs | Zero — rebuilds just take longer next time |
 | Crash loop check | Detection only, no changes | Zero |
 | Journal vacuum | Old journal entries beyond 200MB / 7 days | Zero — logs are already captured in syslog |
 | APT cache | `/var/cache/apt/archives/` | Zero — packages can be re-downloaded |
 | NPM cache | `~/.npm/_cacache/` | Zero — cache is re-populated on next `npm install` |
 | `.next/cache` | Webpack/babel/TSC build cache dirs | Zero — rebuilt automatically on next `next build` |
 ### Monthly full cleanup (`vm-cleanup.sh --full`)
 Adds these **CAREFUL** steps on top of the standard run:
 | Step | What's removed | Risk |
 |---|---|---|
 | Docker system prune | Stopped containers, unused networks, dangling images | Low — does NOT remove images used by any container |
 | pnpm store prune | Packages not referenced by any `node_modules` | Low — only removes truly orphaned packages |
 | Old log files | `.gz` log rotations older than 30 days | Low — old compressed logs |
 | HOLD node_modules | `node_modules` in `/opt/bytelyst/HOLD` archived projects | Low — code intact, can reinstall with `pnpm install` |
 ### Never touched (by design)
 - `/opt/bytelyst/*/node_modules` (active repos)
 - `/opt/bytelyst/*/src`, `/app`, `/backend`, `/web` source code
 - `.next/standalone` (production Next.js builds)
 - Docker images used by currently configured containers
 - `/usr/local/lib/hermes-agent/`
 - `/usr/share/ollama/` (models)
 - `/swapfile`
 - Any database volumes
 ---
 ## Manual crontab (if you prefer not to use --install-cron)
 ```
 # Health check daily 07:00 UTC
 0 7 * * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-health-check.sh --quiet --notify 2>&1 | logger -t vm-health
 # Build cache prune daily 03:00 UTC
 0 3 * * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --quiet 2>&1 | logger -t vm-cleanup
 # Standard weekly cleanup Sunday 02:00 UTC
 0 2 * * 0 bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --quiet 2>&1 | logger -t vm-cleanup
 # Full monthly cleanup 1st of month 01:00 UTC
 0 1 1 * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --full --quiet 2>&1 | logger -t vm-cleanup
 ```
 Edit with: `crontab -e`
 ---
 ## Monitoring logs
 ```bash
 # Tail cleanup log
 tail -f /var/log/vm-cleanup.log
 # Tail health check log
 tail -f /var/log/vm-health-check.log
 # See all cron output via syslog
 grep vm-cleanup /var/log/syslog | tail -20
 grep vm-health /var/log/syslog | tail -20
 ```
 ---
 ## Telegram alerts
 The health check script sends a Telegram message when it detects WARNING or CRITICAL.
 It reads credentials from `$HERMES_HOME/.env` (usually `/root/.hermes/.env`).
 Required keys in that file:
 ```
 TELEGRAM_BOT_TOKEN=<your-bot-token>
 TELEGRAM_CHAT_ID=<your-chat-id>
 ```
 Both are already set if Hermes gateway is configured. Test with:
 ```bash
 bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-health-check.sh --notify
 ```
 ---
 ## Disk thresholds (from `vm-health-check.sh`)
 | Metric | WARNING | CRITICAL |
 |---|---|---|
 | Disk used `%` | > 55% | > 70% |
 | Load average | > 4.0 | > 8.0 |
 | RAM available | < 3 GB | < 1 GB |
 | Swap used | > 1 GB | > 3 GB |
 | Container restarts | > 10 | > 50 |
 | Build cache | > 5 GB | > 20 GB |
 Thresholds are constants at the top of each script — easy to adjust.
 ---
 ## What the May 2026 incident would have caught
 If this cron had been running during the May 26 incident:
 - **07:00 daily health check** → `container_loops CRIT: admin-web(50x)` → Telegram alert sent within hours of the loop starting
 - **03:00 daily build cache prune** → would have kept build cache under 5 GB instead of growing to 84 GB
 - **Monthly full cleanup** → would have cleared the HOLD node_modules and old logs before they became a storage crisis
--- a/scripts/VMs/HostingerVM/vm-cleanup.sh
+++ b/scripts/VMs/HostingerVM/vm-cleanup.sh
@ -0,0 +1,377 @@
 #!/usr/bin/env bash
 # =============================================================================
 # vm-cleanup.sh — Hostinger VM Safe Periodic Cleanup
 #
 # Designed to be run manually or via cron. All operations are either
 # completely safe (read-only builds) or will prompt for confirmation when
 # removing things that can't be trivially regenerated.
 #
 # Modes:
 #   (default)      Weekly safe cleanup — build cache, apt, npm, journal, .next/cache
 #   --full         Monthly deeper cleanup — adds: pnpm store, docker system prune,
 #                  old log files, Docker image dangling prune
 #   --dry-run      Print what would be done, make no changes
 #   --install-cron Install the recommended cron schedule for both scripts
 #   --uninstall-cron  Remove the installed cron jobs
 #
 # All destructive steps are gated behind SAFE / CAREFUL / MANUAL labels
 # in the output so you can audit what ran.
 #
 # Logs to: /var/log/vm-cleanup.log
 # =============================================================================
 set -Eeuo pipefail
 # ── Config ───────────────────────────────────────────────────────────────────
 LOG_FILE="/var/log/vm-cleanup.log"
 SCRIPT_PATH="$(realpath "${BASH_SOURCE[0]}")"
 SCRIPT_DIR="$(dirname "$SCRIPT_PATH")"
 HEALTH_CHECK="$SCRIPT_DIR/vm-health-check.sh"
 # Paths that must NEVER be deleted even in --full mode
 # shellcheck disable=SC2034
 PROTECTED_PATHS=(
  "/opt/bytelyst/learning_ai_common_plat"
  "/opt/bytelyst/learning_ai_devops_tools"
  "/usr/local/lib/hermes-agent"
  "/usr/share/ollama"
  "/swapfile"
 )
 # node_modules dirs in active (non-HOLD) repos to never touch
 # shellcheck disable=SC2034
 ACTIVE_NODE_MODULES=(
  "/opt/bytelyst/learning_ai_common_plat/node_modules"
  "/opt/bytelyst/learning_ai_flowmonk/node_modules"
  "/opt/bytelyst/learning_ai_clock/node_modules"
  "/opt/bytelyst/learning_ai_notes/node_modules"
  "/opt/bytelyst/learning_ai_devops_tools/dashboard/node_modules"
  "/opt/bytelyst/learning_ai_invt_trdg/node_modules"
 )
 # ── Colour codes ─────────────────────────────────────────────────────────────
 RED=$'\033[0;31m'
 YELLOW=$'\033[1;33m'
 GREEN=$'\033[0;32m'
 CYAN=$'\033[0;36m'
 BOLD=$'\033[1m'
 DIM=$'\033[2m'
 NC=$'\033[0m'
 # ── Flags ────────────────────────────────────────────────────────────────────
 FULL_MODE=false
 DRY_RUN=false
 INSTALL_CRON=false
 UNINSTALL_CRON=false
 QUIET=false
 for arg in "$@"; do
  case "$arg" in
    --full)            FULL_MODE=true ;;
    --dry-run)         DRY_RUN=true ;;
    --install-cron)    INSTALL_CRON=true ;;
    --uninstall-cron)  UNINSTALL_CRON=true ;;
    --quiet)           QUIET=true ;;
  esac
 done
 # ── Helpers ──────────────────────────────────────────────────────────────────
 log() {
  local msg
  msg="[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] $*"
  echo "$msg" >> "$LOG_FILE" 2>/dev/null || true
  $QUIET || echo -e "$*"
 }
 log_header() {
  $QUIET || echo -e "\n${BOLD}${CYAN}── $1 ──────────────────────────────────────${NC}"
  echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] === $1 ===" >> "$LOG_FILE" 2>/dev/null || true
 }
 log_step() {
  local label="$1" msg="$2"
  case "$label" in
    SAFE)    $QUIET || echo -e "  ${GREEN}[SAFE]${NC}    $msg" ;;
    CAREFUL) $QUIET || echo -e "  ${YELLOW}[CAREFUL]${NC} $msg" ;;
    SKIP)    $QUIET || echo -e "  ${DIM}[SKIP]${NC}    $msg" ;;
    DRY)     $QUIET || echo -e "  ${CYAN}[DRY-RUN]${NC} $msg" ;;
  esac
 }
 run_cmd() {
  # run_cmd LABEL "description" cmd args...
  local label="$1" desc="$2"
  shift 2
  log_step "$label" "$desc"
  if $DRY_RUN; then
    log_step DRY "would run: $*"
    return 0
  fi
  log "[CMD] $*"
  "$@" >> "$LOG_FILE" 2>&1 || true
 }
 disk_before=""
 record_disk_before() {
  disk_before=$(df -h / --output=used,avail,pcent | tail -1 | tr -s ' ')
 }
 report_disk_delta() {
  local disk_after
  disk_after=$(df -h / --output=used,avail,pcent | tail -1 | tr -s ' ')
  if ! $QUIET; then
    echo -e "\n  ${DIM}Before: $disk_before${NC}"
    echo -e "  ${GREEN}After:  $disk_after${NC}"
  fi
  log "[DISK] before=$disk_before after=$disk_after"
 }
 # ── Safety guard ─────────────────────────────────────────────────────────────
 require_root() {
  if [[ "$(id -u)" -ne 0 ]]; then
    echo -e "${RED}ERROR: This script must be run as root (use sudo)${NC}" >&2
    exit 1
  fi
 }
 # ── Cron install/uninstall ───────────────────────────────────────────────────
 do_install_cron() {
  echo -e "\n${BOLD}Installing cron schedule…${NC}\n"
  local cron_tag="# bytelyst-vm-maintenance"
  local tmp_cron
  tmp_cron=$(mktemp)
  # Export existing crontab (minus our managed block)
  crontab -l 2>/dev/null | grep -v "$cron_tag" > "$tmp_cron" || true
  cat >> "$tmp_cron" <<EOF
 $cron_tag — DO NOT EDIT this block manually, use --install-cron / --uninstall-cron
 # Daily health check at 07:00 UTC (read-only, sends Telegram alert on WARNING/CRITICAL)
 0 7 * * * bash $SCRIPT_PATH --quiet 2>&1 | logger -t vm-cleanup
 # Daily build cache prune at 03:00 UTC (always safe, never removes images)
 0 3 * * * bash $SCRIPT_PATH --quiet 2>&1 | logger -t vm-cleanup
 # Weekly cleanup (Sunday 02:00 UTC) — logs, apt, npm, .next/cache, build cache
 0 2 * * 0 bash $SCRIPT_PATH --quiet 2>&1 | logger -t vm-cleanup
 # Monthly full cleanup (1st of month 01:00 UTC) — adds pnpm store, docker system prune
 0 1 1 * * bash $SCRIPT_PATH --full --quiet 2>&1 | logger -t vm-cleanup
 EOF
  crontab "$tmp_cron"
  rm -f "$tmp_cron"
  echo -e "  ${GREEN}✓ Cron jobs installed. Current schedule:${NC}"
  echo ""
  crontab -l | grep -A20 "$cron_tag" || true
  echo ""
  echo -e "  View logs:  ${CYAN}tail -f $LOG_FILE${NC}"
  echo -e "  View cron:  ${CYAN}crontab -l${NC}"
  echo -e "  Remove:     ${CYAN}bash $SCRIPT_PATH --uninstall-cron${NC}"
 }
 do_uninstall_cron() {
  local cron_tag="# bytelyst-vm-maintenance"
  local tmp_cron
  tmp_cron=$(mktemp)
  crontab -l 2>/dev/null | grep -v "$cron_tag" > "$tmp_cron" || true
  # Also strip the actual cron lines we added (they follow the tag block)
  grep -v "vm-health-check.sh\|vm-cleanup.sh" "$tmp_cron" > "${tmp_cron}.clean" || true
  crontab "${tmp_cron}.clean"
  rm -f "$tmp_cron" "${tmp_cron}.clean"
  echo -e "  ${GREEN}✓ Cron jobs removed${NC}"
 }
 # ── Cleanup steps ─────────────────────────────────────────────────────────────
 step_docker_build_cache() {
  log_header "Docker Build Cache"
  if ! docker info &>/dev/null 2>&1; then
    log_step SKIP "Docker not running — skipping build cache prune"
    return
  fi
  local cache_size
  cache_size=$(docker system df 2>/dev/null | awk '/^Build Cache/ {print $3}' || echo "?")
  run_cmd SAFE "Prune Docker build cache (currently $cache_size)" \
    docker builder prune -f
 }
 step_docker_system_prune() {
  # Removes stopped containers, unused networks, dangling images ONLY
  # Does NOT remove images used by any existing container
  log_header "Docker System Prune (dangling only)"
  if ! docker info &>/dev/null 2>&1; then
    log_step SKIP "Docker not running"
    return
  fi
  run_cmd SAFE "Remove stopped containers, unused networks, dangling images" \
    docker system prune -f
 }
 step_docker_crash_loop_check() {
  log_header "Crash Loop Check"
  if ! docker info &>/dev/null 2>&1; then return; fi
  local looping=()
  while IFS=$'\t' read -r name restarts; do
    [[ -z "$name" || "$name" == "NAMES" ]] && continue
    restarts="${restarts:-0}"
    if (( restarts >= 20 )); then looping+=("$name(${restarts}x)"); fi
  done < <(docker ps -a --format $'{{.Names}}\t{{.RestartCount}}' 2>/dev/null || true)
  if (( ${#looping[@]} > 0 )); then
    echo -e "  ${RED}${BOLD}⚠  CRASH LOOPS DETECTED — manual fix required:${NC}"
    for c in "${looping[@]}"; do
      echo -e "  ${RED}→${NC} $c"
      echo -e "     ${DIM}fix: docker logs ${c%%(*)} | tail -20${NC}"
      echo -e "     ${DIM}stop loop: docker update --restart=no ${c%%(*)}${NC}"
    done
    log "[WARN] crash-looping containers: ${looping[*]}"
  else
    log_step SAFE "No crash-looping containers"
  fi
 }
 step_journal() {
  log_header "Journal Logs"
  run_cmd SAFE "Vacuum journal to 200MB" \
    journalctl --vacuum-size=200M
  run_cmd SAFE "Vacuum journal older than 7 days" \
    journalctl --vacuum-time=7d
 }
 step_apt_cache() {
  log_header "APT Cache"
  run_cmd SAFE "Clean apt package cache" \
    apt-get clean
 }
 step_npm_cache() {
  log_header "NPM Cache"
  if command -v npm &>/dev/null; then
    run_cmd SAFE "Clean npm cache" \
      npm cache clean --force
  fi
 }
 step_next_cache() {
  log_header ".next/cache Directories"
  # Only delete .next/cache (webpack/babel/tsbuild cache), NOT .next/standalone (prod build)
  local count=0
  while IFS= read -r cache_dir; do
    log_step SAFE "Remove $cache_dir"
    if ! $DRY_RUN; then rm -rf "$cache_dir"; fi
    (( count++ ))
  done < <(
    find /opt/bytelyst -name ".next" -maxdepth 7 -type d 2>/dev/null \
      | while read -r d; do
          [[ -d "$d/cache" ]] && echo "$d/cache"
        done
  )
  if (( count == 0 )); then log_step SKIP "No .next/cache dirs found"; fi
 }
 step_pnpm_store() {
  log_header "PNPM Store"
  if command -v pnpm &>/dev/null; then
    run_cmd SAFE "Prune unreferenced packages from pnpm store" \
      pnpm store prune
  else
    log_step SKIP "pnpm not found"
  fi
 }
 step_old_logs() {
  log_header "Old Log Files"
  # Compress any uncompressed .1 rotations that logrotate missed
  local count=0
  for f in /var/log/syslog.1 /var/log/kern.log.1 /var/log/ufw.log.1; do
    if [[ -f "$f" && ! -f "${f}.gz" ]]; then
      run_cmd SAFE "Compress $f" gzip -9 "$f"
      (( count++ ))
    fi
  done
  # Remove log rotations older than 30 days
  while IFS= read -r old_log; do
    run_cmd CAREFUL "Remove old log: $old_log" rm -f "$old_log"
  done < <(find /var/log -name "*.gz" -mtime +30 -type f 2>/dev/null || true)
  if (( count == 0 )); then log_step SKIP "No uncompressed rotations to compress"; fi
 }
 step_hold_cleanup() {
  log_header "HOLD Archived Projects"
  # node_modules in HOLD are safe to delete — code stays, can be reinstalled
  local total_freed=0
  local found=0
  while IFS= read -r nm; do
    local size
    size=$(du -sm "$nm" 2>/dev/null | cut -f1 || echo "0")
    run_cmd CAREFUL "Delete archived node_modules: $nm (~${size}MB)" rm -rf "$nm"
    total_freed=$(( total_freed + size ))
    (( found++ ))
  done < <(
    find /opt/bytelyst/HOLD -name "node_modules" -maxdepth 4 -type d 2>/dev/null || true
  )
  if (( found == 0 )); then
    log_step SKIP "HOLD node_modules already clean"
  else
    log "[INFO] Freed ~${total_freed}MB from HOLD node_modules"
  fi
  # .next build artifacts in HOLD
  while IFS= read -r next_dir; do
    run_cmd CAREFUL "Delete archived .next: $next_dir" rm -rf "$next_dir"
  done < <(
    find /opt/bytelyst/HOLD -name ".next" -maxdepth 6 -type d 2>/dev/null || true
  )
 }
 # ── Main ─────────────────────────────────────────────────────────────────────
 # Handle special modes first (no root needed for these)
 if $INSTALL_CRON;   then require_root; do_install_cron;   exit 0; fi
 if $UNINSTALL_CRON; then require_root; do_uninstall_cron; exit 0; fi
 require_root
 if ! $QUIET; then
  if $FULL_MODE; then MODE="FULL"; else MODE="STANDARD"; fi
  if $DRY_RUN;  then DRY=" (DRY-RUN)"; else DRY=""; fi
  echo -e "\n${BOLD}VM Cleanup — $(hostname) — ${MODE}${DRY}${NC}"
  echo -e "${DIM}$(date -u '+%Y-%m-%d %H:%M UTC')${NC}"
 fi
 if $FULL_MODE; then _mode="full"; else _mode="standard"; fi
 log "[START] mode=${_mode} dry=$DRY_RUN"
 record_disk_before
 # ── WEEKLY (always run) ──────────────────────────────────────────────────────
 step_docker_build_cache
 step_docker_crash_loop_check
 step_journal
 step_apt_cache
 step_npm_cache
 step_next_cache
 # ── MONTHLY (only with --full) ───────────────────────────────────────────────
 if $FULL_MODE; then
  step_docker_system_prune
  step_pnpm_store
  step_old_logs
  step_hold_cleanup
 fi
 # ── Final report ─────────────────────────────────────────────────────────────
 report_disk_delta
 if ! $QUIET; then
  echo -e "\n${GREEN}${BOLD}✓ Cleanup complete${NC}"
  echo -e "  Log: ${CYAN}tail -50 $LOG_FILE${NC}"
  if [[ -f "$HEALTH_CHECK" ]]; then
    echo ""
    echo -e "${DIM}Running health check…${NC}"
    bash "$HEALTH_CHECK" || true
  fi
 fi
 log "[END] cleanup complete"
--- a/scripts/VMs/HostingerVM/vm-health-check.sh
+++ b/scripts/VMs/HostingerVM/vm-health-check.sh
@ -0,0 +1,337 @@
 #!/usr/bin/env bash
 # =============================================================================
 # vm-health-check.sh — Hostinger VM Health Check (READ-ONLY)
 #
 # Checks disk, memory, load, swap, and Docker container health.
 # Prints a colour-coded report. Exits non-zero if any threshold is exceeded
 # so it can drive cron alerts or CI gates.
 #
 # Usage:
 #   bash vm-health-check.sh              # interactive report
 #   bash vm-health-check.sh --quiet      # only print problems (exit 1 if any)
 #   bash vm-health-check.sh --json       # machine-readable JSON output
 #   bash vm-health-check.sh --notify     # send Telegram alert on WARNING/CRITICAL
 #
 # Exit codes:
 #   0 — all green
 #   1 — at least one WARNING
 #   2 — at least one CRITICAL
 # =============================================================================
 set -Eeuo pipefail
 # shellcheck disable=SC2034
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 LOG_FILE="/var/log/vm-health-check.log"
 # ── Thresholds ──────────────────────────────────────────────────────────────
 DISK_WARN=55           # % used
 DISK_CRIT=70
 LOAD_WARN=4.0          # absolute (not per-CPU)
 LOAD_CRIT=8.0
 RAM_FREE_WARN_GB=3     # GB available
 RAM_FREE_CRIT_GB=1
 SWAP_USED_WARN_GB=1
 SWAP_USED_CRIT_GB=3
 CONTAINER_RESTART_WARN=10
 CONTAINER_RESTART_CRIT=50
 BUILD_CACHE_WARN_GB=5
 BUILD_CACHE_CRIT_GB=20
 # shellcheck disable=SC2034
 DOCKER_IMAGES_WARN_GB=15
 # shellcheck disable=SC2034
 DOCKER_IMAGES_CRIT_GB=25
 # ── Colour codes ────────────────────────────────────────────────────────────
 RED=$'\033[0;31m'
 YELLOW=$'\033[1;33m'
 GREEN=$'\033[0;32m'
 CYAN=$'\033[0;36m'
 BOLD=$'\033[1m'
 NC=$'\033[0m'
 # ── Flags ───────────────────────────────────────────────────────────────────
 QUIET=false
 JSON_MODE=false
 NOTIFY=false
 for arg in "$@"; do
  case "$arg" in
    --quiet)  QUIET=true ;;
    --json)   JSON_MODE=true ;;
    --notify) NOTIFY=true ;;
  esac
 done
 # ── State tracking ──────────────────────────────────────────────────────────
 WORST_LEVEL=0   # 0=OK, 1=WARN, 2=CRIT
 ISSUES=()
 declare -A JSON_DATA
 # ── Helpers ─────────────────────────────────────────────────────────────────
 log_to_file() {
  echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] $*" >> "$LOG_FILE" 2>/dev/null || true
 }
 status_icon() {
  case "$1" in
    OK)   echo -e "${GREEN}✓${NC}" ;;
    WARN) echo -e "${YELLOW}⚠${NC}" ;;
    CRIT) echo -e "${RED}✗${NC}" ;;
  esac
 }
 level_int() {
  case "$1" in OK) echo 0 ;; WARN) echo 1 ;; CRIT) echo 2 ;; *) echo 0 ;; esac
 }
 record() {
  # record NAME LEVEL VALUE MESSAGE
  local name="$1" level="$2" value="$3" message="$4"
  local lvl_int
  lvl_int=$(level_int "$level")
  if (( lvl_int > WORST_LEVEL )); then WORST_LEVEL=$lvl_int; fi
  if [[ "$level" != "OK" ]]; then ISSUES+=("[$level] $message"); fi
  JSON_DATA["$name"]=$(printf '{"level":"%s","value":"%s","message":"%s"}' \
    "$level" "$value" "$message")
  if $QUIET && [[ "$level" == "OK" ]]; then return; fi
  if ! $JSON_MODE; then
    printf "  %s  %-30s %s\n" "$(status_icon "$level")" "$message" "${CYAN}($value)${NC}"
  fi
 }
 header() {
  $JSON_MODE && return
  $QUIET && return
  echo -e "\n${BOLD}${CYAN}── $1 ──────────────────────────────────────${NC}"
 }
 # ── Checks ──────────────────────────────────────────────────────────────────
 check_disk() {
  header "DISK"
  local used_pct
  used_pct=$(df / --output=pcent | tail -1 | tr -d ' %')
  local avail_gb
  avail_gb=$(df / --output=avail -BG | tail -1 | tr -d ' G')
  if   (( used_pct >= DISK_CRIT )); then record disk CRIT "${used_pct}%" "Disk ${used_pct}% used — CRITICAL (>${DISK_CRIT}%)"
  elif (( used_pct >= DISK_WARN )); then record disk WARN "${used_pct}%" "Disk ${used_pct}% used — WARNING (>${DISK_WARN}%)"
  else                                   record disk OK   "${used_pct}% used, ${avail_gb}G free" "Disk OK (${used_pct}%)"
  fi
 }
 check_load() {
  header "LOAD"
  local load1 load5 _load15
  read -r load1 load5 _load15 _ < /proc/loadavg
  local ncpu
  ncpu=$(nproc)
  # compare as integers (multiply by 10 to avoid bc dependency)
  local load1_int
  load1_int=$(echo "$load1" | awk '{printf "%d", $1 * 10}')
  local warn_int crit_int
  warn_int=$(echo "$LOAD_WARN" | awk '{printf "%d", $1 * 10}')
  crit_int=$(echo "$LOAD_CRIT" | awk '{printf "%d", $1 * 10}')
  if   (( load1_int >= crit_int )); then record load CRIT "$load1" "Load avg $load1 — CRITICAL (>${LOAD_CRIT}, ${ncpu} CPUs)"
  elif (( load1_int >= warn_int )); then record load WARN "$load1" "Load avg $load1 — WARNING (>${LOAD_WARN})"
  else                                   record load OK   "$load1 (1m) / $load5 (5m)" "Load OK ($load1)"
  fi
 }
 check_memory() {
  header "MEMORY"
  local available_kb total_kb
  available_kb=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
  total_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
  local available_gb total_gb
  available_gb=$(( available_kb / 1024 / 1024 ))
  total_gb=$(( total_kb / 1024 / 1024 ))
  if   (( available_gb < RAM_FREE_CRIT_GB )); then record ram CRIT "${available_gb}G avail" "RAM available ${available_gb}G — CRITICAL (<${RAM_FREE_CRIT_GB}G)"
  elif (( available_gb < RAM_FREE_WARN_GB )); then record ram WARN "${available_gb}G avail" "RAM available ${available_gb}G — WARNING (<${RAM_FREE_WARN_GB}G)"
  else                                             record ram OK   "${available_gb}G / ${total_gb}G" "RAM OK (${available_gb}G available)"
  fi
 }
 check_swap() {
  header "SWAP"
  local swap_total_kb swap_used_kb
  swap_total_kb=$(awk '/^SwapTotal/ {print $2}' /proc/meminfo)
  swap_used_kb=$(awk '/^SwapFree/ {print $2}' /proc/meminfo)
  swap_used_kb=$(( swap_total_kb - swap_used_kb ))
  local swap_total_gb swap_used_gb
  swap_total_gb=$(( swap_total_kb / 1024 / 1024 ))
  swap_used_gb=$(( swap_used_kb / 1024 / 1024 ))
  if (( swap_total_kb == 0 )); then
    record swap CRIT "no swap" "NO SWAP configured — CRITICAL (add swapfile!)"
    return
  fi
  if   (( swap_used_gb >= SWAP_USED_CRIT_GB )); then record swap CRIT "${swap_used_gb}G used" "Swap ${swap_used_gb}G used — CRITICAL"
  elif (( swap_used_gb >= SWAP_USED_WARN_GB )); then record swap WARN "${swap_used_gb}G used" "Swap ${swap_used_gb}G used — WARNING"
  else                                               record swap OK   "${swap_used_gb}G / ${swap_total_gb}G" "Swap OK (${swap_used_gb}G used)"
  fi
 }
 check_docker_containers() {
  header "DOCKER CONTAINERS"
  if ! command -v docker &>/dev/null || ! docker info &>/dev/null 2>&1; then
    record docker_daemon WARN "not running" "Docker daemon is not running"
    return
  fi
  # Crash-looping containers
  local looping_warn=() looping_crit=()
  while IFS=$'\t' read -r name restarts; do
    [[ -z "$name" || "$name" == "NAMES" ]] && continue
    restarts="${restarts:-0}"
    if   (( restarts >= CONTAINER_RESTART_CRIT )); then looping_crit+=("$name(${restarts}x)")
    elif (( restarts >= CONTAINER_RESTART_WARN )); then looping_warn+=("$name(${restarts}x)")
    fi
  done < <(docker ps -a --format $'{{.Names}}\t{{.RestartCount}}' 2>/dev/null || true)
  if   (( ${#looping_crit[@]} > 0 )); then
    record container_loops CRIT "${looping_crit[*]}" "Crash-looping containers: ${looping_crit[*]}"
  elif (( ${#looping_warn[@]} > 0 )); then
    record container_loops WARN "${looping_warn[*]}" "Containers restarting: ${looping_warn[*]}"
  else
    record container_loops OK "0 looping" "No containers crash-looping"
  fi
  # Unhealthy containers (running but health check failing)
  local unhealthy
  unhealthy=$(docker ps --filter health=unhealthy --format '{{.Names}}' 2>/dev/null | paste -sd, || true)
  if [[ -n "$unhealthy" ]]; then
    record container_health WARN "$unhealthy" "Unhealthy containers: $unhealthy"
  else
    record container_health OK "all healthy" "All containers passing healthchecks"
  fi
 }
 check_docker_disk() {
  header "DOCKER DISK"
  if ! command -v docker &>/dev/null || ! docker info &>/dev/null 2>&1; then
    return
  fi
  # Build cache
  local cache_size_gb
  cache_size_gb=$(docker system df --format '{{.BuildCache}}' 2>/dev/null \
    | grep -oP '[0-9.]+(?=GB)' | head -1 || echo "0")
  cache_size_gb="${cache_size_gb:-0}"
  local cache_int
  cache_int=$(echo "$cache_size_gb" | awk '{printf "%d", $1}')
  if   (( cache_int >= BUILD_CACHE_CRIT_GB )); then record build_cache CRIT "${cache_size_gb}GB" "Docker build cache ${cache_size_gb}GB — CRITICAL (>${BUILD_CACHE_CRIT_GB}GB, run: docker builder prune -f)"
  elif (( cache_int >= BUILD_CACHE_WARN_GB )); then record build_cache WARN "${cache_size_gb}GB" "Docker build cache ${cache_size_gb}GB — WARNING (run: docker builder prune -f)"
  else                                              record build_cache OK   "${cache_size_gb}GB" "Build cache OK (${cache_size_gb}GB)"
  fi
  # Images total size
  local images_size
  images_size=$(docker system df 2>/dev/null | awk '/^Images/ {print $3}' || echo "?")
  record docker_images OK "$images_size" "Docker images: $images_size"
 }
 check_logs() {
  header "LOGS"
  local journal_mb
  journal_mb=$(journalctl --disk-usage 2>/dev/null \
    | grep -oP '[0-9.]+(?= M)' | head -1 || echo "0")
  journal_mb="${journal_mb:-0}"
  local syslog_mb=0
  [[ -f /var/log/syslog ]] && syslog_mb=$(du -sm /var/log/syslog 2>/dev/null | cut -f1 || echo "0")
  if   (( journal_mb > 300 )); then record journal WARN "${journal_mb}MB" "Journal ${journal_mb}MB — WARNING (run: journalctl --vacuum-size=200M)"
  else                               record journal OK   "${journal_mb}MB" "Journal OK (${journal_mb}MB)"
  fi
  if   (( syslog_mb > 100 )); then record syslog WARN "${syslog_mb}MB" "syslog ${syslog_mb}MB — WARNING"
  else                              record syslog OK   "${syslog_mb}MB" "syslog OK (${syslog_mb}MB)"
  fi
 }
 # ── Run all checks ───────────────────────────────────────────────────────────
 if ! $JSON_MODE && ! $QUIET; then
  echo -e "\n${BOLD}VM Health Check — $(hostname) — $(date -u '+%Y-%m-%d %H:%M UTC')${NC}"
 fi
 check_disk
 check_load
 check_memory
 check_swap
 check_docker_containers
 check_docker_disk
 check_logs
 # ── Summary ──────────────────────────────────────────────────────────────────
 if $JSON_MODE; then
  echo '{'
  echo "  \"timestamp\": \"$(date -u '+%Y-%m-%dT%H:%M:%SZ')\","
  echo "  \"hostname\": \"$(hostname)\","
  if   (( WORST_LEVEL >= 2 )); then _overall='"CRIT"'
  elif (( WORST_LEVEL == 1 )); then _overall='"WARN"'
  else                               _overall='"OK"'
  fi
  echo "  \"overall\": ${_overall},"
  echo "  \"checks\": {"
  local_keys=("${!JSON_DATA[@]}")
  for i in "${!local_keys[@]}"; do
    k="${local_keys[$i]}"
    if [[ $i -lt $(( ${#local_keys[@]} - 1 )) ]]; then comma=","; else comma=""; fi
    echo "    \"$k\": ${JSON_DATA[$k]}$comma"
  done
  echo "  }"
  echo '}'
 else
  echo ""
  if (( WORST_LEVEL == 0 )); then
    echo -e "  ${GREEN}${BOLD}✓ All checks passed${NC}"
  elif (( WORST_LEVEL == 1 )); then
    echo -e "  ${YELLOW}${BOLD}⚠ ${#ISSUES[@]} warning(s):${NC}"
    for issue in "${ISSUES[@]}"; do echo -e "    ${YELLOW}→${NC} $issue"; done
  else
    echo -e "  ${RED}${BOLD}✗ ${#ISSUES[@]} issue(s) — action required:${NC}"
    for issue in "${ISSUES[@]}"; do echo -e "    ${RED}→${NC} $issue"; done
  fi
  echo ""
 fi
 # ── Telegram notification ─────────────────────────────────────────────────
 if $NOTIFY && (( WORST_LEVEL > 0 )); then
  TOKEN_FILE="${HERMES_HOME:-/root/.hermes}/.env"
  TELEGRAM_TOKEN=""
  TELEGRAM_CHAT_ID=""
  if [[ -f "$TOKEN_FILE" ]]; then
    TELEGRAM_TOKEN=$(grep -oP '(?<=TELEGRAM_BOT_TOKEN=)\S+' "$TOKEN_FILE" || true)
    TELEGRAM_CHAT_ID=$(grep -oP '(?<=TELEGRAM_CHAT_ID=)\S+' "$TOKEN_FILE" || true)
  fi
  if [[ -n "$TELEGRAM_TOKEN" && -n "$TELEGRAM_CHAT_ID" ]]; then
    SEVERITY=$([[ $WORST_LEVEL -ge 2 ]] && echo "🚨 CRITICAL" || echo "⚠️ WARNING")
    MSG="$SEVERITY — $(hostname) VM health check
 $(date -u '+%Y-%m-%d %H:%M UTC')
 $(printf '%s\n' "${ISSUES[@]}")"
    curl -sf -X POST "https://api.telegram.org/bot${TELEGRAM_TOKEN}/sendMessage" \
      -d chat_id="$TELEGRAM_CHAT_ID" \
      -d text="$MSG" > /dev/null || true
  fi
 fi
 # ── Log result ───────────────────────────────────────────────────────────────
 if   (( WORST_LEVEL >= 2 )); then RESULT_STR="CRIT"
 elif (( WORST_LEVEL == 1 )); then RESULT_STR="WARN"
 else                               RESULT_STR="OK"
 fi
 log_to_file "health-check result=$RESULT_STR issues=${#ISSUES[@]}"
 exit "$WORST_LEVEL"