# Hostinger VM: Maintenance & Incident Reference

**VM:** `srv1491630.hstgr.cloud` · root · 4× AMD EPYC · 15 GB RAM · 193 GB disk

**Key services:** `hermes-gateway`, `ollama`, Docker (~40 containers), `learning_ai_common_plat` stack

---

## Quick-start for day-to-day ops

```bash
# Check VM health (read-only, safe any time)
bash scripts/VMs/HostingerVM/vm-health-check.sh

# Weekly safe cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh

# Monthly deeper cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh --full

# Cron setup (run once)
bash scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron
```

See [`CRON_SETUP.md`](../scripts/VMs/HostingerVM/CRON_SETUP.md) for full details.

---

## Incident Report: Load Average 1305 (2026-05-26)

### What happened

The VM became completely unresponsive. Load average reached **1305** (normal is < 4 on 4 CPUs).

```
load average: 1305.54, 1339.23, 1302.41
RAM: 13 / 15 GB used, ZERO swap configured
```

**Single root cause:** one broken Docker container crash-looped **1,336 times** over ~25 hours.

- Container: `learning_ai_common_plat-admin-web-1`
- Error: `Cannot find module '/app/server.js'`
- Restart policy: `unless-stopped` (no retry limit, restarts forever)

Each restart spawned ~3 OS processes:

- `containerd-shim-runc-v2`
- veth network interface creation
- `networkctl` call for the new interface

With 1,336 restarts × ~3 procs = **~4,000 processes**, the kernel scheduler thrashed → load 1305.

### Why the container was broken

The `admin-web` Docker image had no `server.js` because its Next.js build failed silently.
Three bugs stacked:

| Bug | File | Detail |
|-----|------|--------|
| Missing build secret | `docker-compose.ecosystem.yml` | `admin-web` service was missing the `<<: *product-build` anchor, so `GITEA_NPM_TOKEN` was never passed as a BuildKit secret → `pnpm install` of `@bytelyst/*` packages failed |
| Missing COPY step | `dashboards/admin-web/Dockerfile` | `tsconfig.base.json` (monorepo root) was not copied into the build context → `tsc` couldn't find it → build failed |
| Wrong pnpm flag | `dashboards/admin-web/Dockerfile` | `--legacy-peer-deps` is an npm flag, not valid in pnpm 10 → install step exited early |

Because the build stage failed, `COPY --from=builder .next/standalone ./` copied nothing, leaving the runner stage with an empty `/app` and no `server.js`.

### Timeline

| Time (UTC) | Event |
|---|---|
| 2026-05-26 04:43 | VM booted, Docker started |
| 2026-05-26 04:56 | `admin-web` first restart (count=1) |
| 2026-05-26 ~05:00–06:07 | Load climbs steadily, RAM fills |
| 2026-05-26 ~ongoing | 1,336 restarts over 25 hours |
| 2026-05-27 06:07 | VM rebooted (load avg recorded: 1305) |
| 2026-05-27 06:28 | Diagnosis session started (load: 0.55 after reboot) |
| 2026-05-27 08:20 | All fixes applied, cleanup complete |

### Secondary problems found

| Issue | Detail |
|---|---|
| **No swap** | Zero swap configured, so OOM kills are inevitable under memory pressure |
| **84 GB Docker build cache** | Never pruned; Next.js/TSC builds accumulate enormous layer cache |
| **12 GB HOLD node_modules** | Archived projects in `/opt/bytelyst/HOLD` had deps never cleaned up |
| **~3 GB .next/cache** | Build-time caches in active and HOLD repos |
| **381 MB uncompressed logs** | `syslog.1`, `kern.log.1` not compressed; no size/retention limits on journal |
| **No crash-loop detection** | Nothing alerting on containers restarting > N times |

---

## Fixes Applied (2026-05-27)

### 1. Crash loop: stopped

Patched `/var/lib/docker/containers/2219091e.../hostconfig.json` while Docker was stopped:

```json
"RestartPolicy": {"Name": "no", "MaximumRetryCount": 0}
```

Container is now permanently stopped. Admin-web needs a proper rebuild before re-enabling.

### 2. Swap: added

```bash
# Create and enable a 4 GB swap file
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Persist the swap file and a low swappiness across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab
sysctl vm.swappiness=10
echo 'vm.swappiness=10' >> /etc/sysctl.conf
```

### 3. Disk: 79 GB reclaimed (70% → 27%)

| Action | Freed |
|---|---|
| `docker builder prune -f` | 84 GB |
| `docker system prune -f` | 107 MB |
| HOLD node_modules deleted | ~12 GB |
| HOLD `.next` build caches | ~1.2 GB |
| Active `.next/cache` dirs | ~2.4 GB |
| Old Claude CLI versions | ~940 MB |
| npm cache clean | ~1.8 GB |
| Journal vacuum | ~220 MB |
| apt clean | ~280 MB |

### 4. Log management

`/etc/systemd/journald.conf.d/size-limits.conf`:

```ini
[Journal]
SystemMaxUse=200M
SystemKeepFree=1G
MaxRetentionSec=7day
MaxFileSec=1day
```

`/etc/rsyslog.d/20-ufw-filter.conf`:

```
:msg, contains, "[UFW BLOCK]" stop
```

`/etc/logrotate.d/rsyslog-custom`: daily rotation, 7-day retention, compress-on-rotate.

### 5. Dockerfile fixes (ready, not yet deployed)

- `docker-compose.ecosystem.yml`: added `<<: *product-build` to the `admin-web` build section
- `dashboards/admin-web/Dockerfile`: added `tsconfig.base.json` to COPY, removed `--legacy-peer-deps`

---

## Deploying admin-web (when ready)

```bash
cd /opt/bytelyst/learning_ai_common_plat

GITEA_NPM_TOKEN=$(cat ~/.gitea_npm_token) \
  docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
  build admin-web

# Verify the standalone build was produced:
docker run --rm --entrypoint ls \
  learning_ai_common_plat-admin-web:latest /app | grep server.js

# Start it:
docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
  up -d admin-web
```

The container's restart policy will be set by the compose file (`unless-stopped`). Once the image is healthy, this is safe.

---

## Ongoing health targets

| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Disk usage `/` | < 55% | 55–70% | > 70% |
| Load average | < 4.0 | 4–8 | > 8 |
| Available RAM | > 3 GB | 1–3 GB | < 1 GB |
| Swap used | < 1 GB | 1–3 GB | > 3 GB |
| Container restart count | < 5 | 5–20 | > 20 |
| Docker build cache | < 5 GB | 5–20 GB | > 20 GB |

---

## Reference: safe cleanup commands

```bash
# Always safe (just prunes unreferenced build layers)
docker builder prune -f

# Safe: removes stopped containers, unused networks, dangling images, and unused build cache
docker system prune -f

# Safe: removes packages not referenced by any installed node_modules
pnpm store prune

# Safe: vacuum journal to size limit
journalctl --vacuum-size=200M

# Safe: clear apt cache
apt-get clean

# Safe: clear npm cache
npm cache clean --force

# Careful: removes ALL images not used by any container, running or stopped (rebuilds needed)
docker image prune -a -f
```

---

## Crash-loop detection (manual check)

```bash
# Show containers that have restarted more than 10 times.
# RestartCount is read via `docker inspect`; `docker ps --format` does not expose it.
docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' \
  | awk '$2 > 10 {print "⚠️ LOOP:", $1, "restarts:", $2}'

# Show restart events from the last hour (--until makes the command exit instead of streaming)
docker events --filter event=restart --since 1h --until "$(date +%s)"
```

The `vm-health-check.sh` script runs these checks automatically.

---

## Related scripts

| Script | Purpose |
|---|---|
| `scripts/VMs/HostingerVM/vm-health-check.sh` | Daily read-only health check + alerts |
| `scripts/VMs/HostingerVM/vm-cleanup.sh` | Periodic safe cleanup |
| `scripts/VMs/HostingerVM/CRON_SETUP.md` | Cron wiring |
| `scripts/ubuntu-vm-security-update.sh` | Security patching |
| `scripts/VMs/HostingerVM/login.sh` | SSH into the VM |
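
---

## Sketch: classifying restart counts against the health targets

The "Container restart count" row of the health targets table (healthy < 5, warning 5–20, critical > 20) can be turned into a small shell check. The following is a minimal sketch, not the contents of `vm-health-check.sh`: the function names `classify_restarts` and `report_restarts` are hypothetical, and the input is assumed to be one `name count` pair per line, such as `docker inspect` can produce.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (illustration only, not taken from vm-health-check.sh).

# Map a restart count to a level using the health targets table:
# < 5 healthy, 5-20 warning, > 20 critical.
classify_restarts() {
  local count="$1"
  if [ "$count" -gt 20 ]; then
    echo "critical"
  elif [ "$count" -ge 5 ]; then
    echo "warning"
  else
    echo "healthy"
  fi
}

# Read "name count" pairs on stdin and print every container that is not healthy.
report_restarts() {
  local name count level
  while read -r name count; do
    # Skip blank or non-numeric counts instead of failing the comparison.
    case "$count" in
      ''|*[!0-9]*) continue ;;
    esac
    level="$(classify_restarts "$count")"
    if [ "$level" != "healthy" ]; then
      printf '%s %s restarts=%s\n' "$level" "$name" "$count"
    fi
  done
}

# Example wiring (assumes the Docker CLI is available on the VM):
#   docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' | report_restarts
```

With the incident's numbers, a container reporting 1,336 restarts would be printed as `critical`, while one with 3 restarts would produce no output. Keeping the thresholds in one function means the table and the check only need to be updated in one place if the targets change.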