Hostinger VM — Maintenance & Incident Reference
VM: srv1491630.hstgr.cloud · root · 4× AMD EPYC · 15 GB RAM · 193 GB disk
Key services: hermes-gateway, ollama, Docker (~40 containers), learning_ai_common_plat stack
Quick-start for day-to-day ops
# Check VM health (read-only, safe any time)
bash scripts/VMs/HostingerVM/vm-health-check.sh
# Weekly safe cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh
# Monthly deeper cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh --full
# Cron setup (run once)
bash scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron
See CRON_SETUP.md for full details.
Incident Report — Load Average 1305 (2026-05-26)
What happened
The VM became completely unresponsive. Load average reached 1305 (normal < 4 on 4 CPUs).
load average: 1305.54, 1339.23, 1302.41
RAM: 13 / 15 GB used, ZERO swap configured
Single root cause: one broken Docker container crash-looped 1,336 times over ~25 hours.
Container: learning_ai_common_plat-admin-web-1
Error: Cannot find module '/app/server.js'
Restart policy: unless-stopped (no backoff limit, retries forever)
Each restart spawned ~3 OS processes:
- containerd-shim-runc-v2
- veth network interface creation
- networkctl call for the new interface
1,336 restarts × ~3 processes each ≈ 4,000 processes. The kernel scheduler thrashed and the load average reached 1305.
Why the container was broken
The admin-web Docker image had no server.js because its Next.js build failed silently. Three bugs stacked:
| Bug | File | Detail |
|---|---|---|
| Missing build secret | docker-compose.ecosystem.yml | admin-web service was missing <<: *product-build anchor, so GITEA_NPM_TOKEN was never passed as a BuildKit secret → pnpm install of @bytelyst/* packages failed |
| Missing COPY step | dashboards/admin-web/Dockerfile | tsconfig.base.json (monorepo root) was not copied into the build context → tsc couldn't find it → build failed |
| Wrong pnpm flag | dashboards/admin-web/Dockerfile | --legacy-peer-deps is an npm flag, not valid in pnpm 10 → install step exited early |
Because the build stage failed, COPY --from=builder .next/standalone ./ copied nothing, leaving the runner stage with an empty /app — no server.js.
Timeline
| Time (UTC) | Event |
|---|---|
| 2026-05-26 04:43 | VM booted, Docker started |
| 2026-05-26 04:56 | admin-web first restart (count=1) |
| 2026-05-26 ~05:00–06:07 | Load climbs steadily, RAM fills |
| 2026-05-26 ~ongoing | 1,336 restarts over 25 hours |
| 2026-05-27 06:07 | VM rebooted (load avg recorded: 1305) |
| 2026-05-27 06:28 | Diagnosis session started (load: 0.55 after reboot) |
| 2026-05-27 08:20 | All fixes applied, cleanup complete |
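The timeline and the restart count together imply an average restart interval. The following is a minimal sketch of that arithmetic, assuming GNU date (for the -d flag) and using the first-restart and reboot timestamps from the table above; the function name restart_interval_secs is hypothetical.

```shell
#!/usr/bin/env bash
# Average seconds between restarts, given a start time, an end time,
# and a restart count. Assumes GNU date; timestamps are read as UTC.
restart_interval_secs() {
  local start="$1" end="$2" restarts="$3"
  local start_s end_s
  start_s=$(date -u -d "$start" +%s) || return 1
  end_s=$(date -u -d "$end" +%s) || return 1
  # Integer division: the result is rounded down to whole seconds.
  echo $(( (end_s - start_s) / restarts ))
}

# First restart 04:56 on 2026-05-26, reboot 06:07 on 2026-05-27, 1,336 restarts.
restart_interval_secs "2026-05-26 04:56" "2026-05-27 06:07" 1336
```

With these inputs the window is 25 h 11 min, which works out to roughly one restart every 67 seconds.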
Secondary problems found
| Issue | Detail |
|---|---|
| No swap | Zero swap configured — OOM kills inevitable under memory pressure |
| 84 GB Docker build cache | Never pruned; Next.js/TSC builds accumulate enormous layer cache |
| 12 GB HOLD node_modules | Archived projects in /opt/bytelyst/HOLD had deps never cleaned up |
| ~3 GB .next/cache | Build-time caches in active and HOLD repos |
| 381 MB uncompressed logs | syslog.1, kern.log.1 not compressed; no size/retention limits on journal |
| No crash-loop detection | Nothing alerting on containers restarting > N times |
Fixes Applied (2026-05-27)
1. Crash loop — stopped
Patched /var/lib/docker/containers/2219091e.../hostconfig.json while Docker was stopped:
"RestartPolicy": {"Name": "no", "MaximumRetryCount": 0}
Container is now permanently stopped. Admin-web needs a proper rebuild before re-enabling.
2. Swap — added
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
sysctl vm.swappiness=10
echo 'vm.swappiness=10' >> /etc/sysctl.conf
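To confirm that the swap file is active after these steps (and after a reboot), the SwapTotal line in /proc/meminfo can be checked. This is a minimal sketch, not part of the applied fix; the helper names swap_total_kb and swap_is_configured are hypothetical, and the optional file argument exists only so the parsing can be exercised against a sample file.

```shell
#!/usr/bin/env bash
# Print SwapTotal in kB from a meminfo-style file (defaults to /proc/meminfo).
# Returns non-zero if the file has no SwapTotal line or cannot be read.
swap_total_kb() {
  awk '$1 == "SwapTotal:" {print $2; found = 1} END {if (!found) exit 1}' \
    "${1:-/proc/meminfo}"
}

# Succeeds only if swap is configured (SwapTotal greater than zero).
swap_is_configured() {
  local kb
  kb=$(swap_total_kb "$@") || return 1
  [ "$kb" -gt 0 ]
}

if swap_is_configured; then
  echo "swap OK: $(swap_total_kb) kB"
else
  echo "no swap configured"
fi
```

On a host with the 4 GB swap file enabled, the first branch should report a SwapTotal of roughly 4 million kB.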
3. Disk — 79 GB reclaimed (70% → 27%)
| Action | Freed |
|---|---|
| docker builder prune -f | 84 GB |
| docker system prune -f | 107 MB |
| HOLD node_modules deleted | ~12 GB |
| HOLD .next build caches | ~1.2 GB |
| Active .next/cache dirs | ~2.4 GB |
| Old Claude CLI versions | ~940 MB |
| npm cache clean | ~1.8 GB |
| Journal vacuum | ~220 MB |
| apt clean | ~280 MB |
4. Log management
/etc/systemd/journald.conf.d/size-limits.conf:
[Journal]
SystemMaxUse=200M
SystemKeepFree=1G
MaxRetentionSec=7day
MaxFileSec=1day
/etc/rsyslog.d/20-ufw-filter.conf:
:msg, contains, "[UFW BLOCK]" stop
/etc/logrotate.d/rsyslog-custom: daily rotation, 7-day retention, compress-on-rotate.
5. Dockerfile fixes (ready, not yet deployed)
docker-compose.ecosystem.yml — added <<: *product-build to admin-web build section
dashboards/admin-web/Dockerfile — added tsconfig.base.json to COPY, removed --legacy-peer-deps
Deploying admin-web (when ready)
cd /opt/bytelyst/learning_ai_common_plat
GITEA_NPM_TOKEN=$(cat ~/.gitea_npm_token) \
docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
build admin-web
# Verify the standalone build was produced:
docker run --rm --entrypoint ls \
learning_ai_common_plat-admin-web:latest /app | grep server.js
# Start it:
docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
up -d admin-web
The container's restart policy will be set by the compose file (unless-stopped). Once the image is healthy, this is safe.
Ongoing health targets
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Disk usage / | < 55% | 55–70% | > 70% |
| Load average | < 4.0 | 4–8 | > 8 |
| Available RAM | > 3 GB | 1–3 GB | < 1 GB |
| Swap used | < 1 GB | 1–3 GB | > 3 GB |
| Container restart count | < 5 | 5–20 | > 20 |
| Docker build cache | < 5 GB | 5–20 GB | > 20 GB |
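The thresholds above can be applied mechanically. The sketch below classifies an integer metric where higher is worse (disk percentage, restart count, build cache in GB). It is an illustration, not the logic of vm-health-check.sh: the function name classify_metric, the 0/1/2 return codes, and the boundary handling (a value exactly on a threshold counts as WARN) are assumptions. Metrics where lower is worse, such as available RAM, would need the comparisons inverted.

```shell
#!/usr/bin/env bash
# Classify a higher-is-worse integer metric against warn/crit thresholds.
# Prints OK, WARN, or CRIT and returns 0, 1, or 2 respectively.
# Assumption: value < warn is OK, value > crit is CRIT, anything else is WARN.
classify_metric() {
  local value="$1" warn="$2" crit="$3"
  if [ "$value" -gt "$crit" ]; then
    echo CRIT
    return 2
  elif [ "$value" -ge "$warn" ]; then
    echo WARN
    return 1
  fi
  echo OK
  return 0
}

# Example: 62% disk usage against the 55/70 thresholds prints WARN.
classify_metric 62 55 70 || true
```

On GNU coreutils, a live disk percentage could be taken from `df --output=pcent /` with the non-digits stripped before it is passed in.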
Reference: safe cleanup commands
# Always safe (just prunes unreferenced build layers)
docker builder prune -f
# Safe: removes stopped containers, unused networks, dangling images, and unused build cache
docker system prune -f
# Safe: removes packages not referenced by any installed node_modules
pnpm store prune
# Safe: vacuum journal to size limit
journalctl --vacuum-size=200M
# Safe: clear apt cache
apt-get clean
# Safe: clear npm cache
npm cache clean --force
# Careful: removes ALL images not used by any existing container, running or stopped (rebuilds needed)
docker image prune -a -f
Crash-loop detection (manual check)
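# Show containers that have restarted more than 10 times
# (RestartCount comes from docker inspect; docker ps --format has no such field)
docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' \
| awk '$2 > 10 {print "⚠️ LOOP:", $1, "restarts:", $2}'
# Stream container exit and restart events from the last hour onward (Ctrl-C to stop)
docker events --filter event=die --filter event=restart --since 1h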
The vm-health-check.sh script runs these checks automatically.
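For ad hoc use, the restart-count thresholds from the health targets table (5 for warning, 20 for critical) can be applied with a small filter. This is a sketch under stated assumptions, not the implementation in vm-health-check.sh: the function name flag_crash_loops and its "name count" input format are hypothetical, and the docker call is skipped when the CLI is not installed.

```shell
#!/usr/bin/env bash
# Read "name restart_count" pairs on stdin and print a severity for each
# container at or above the warning threshold. Thresholds follow the health
# targets table: 5-20 restarts is WARN, more than 20 is CRIT.
flag_crash_loops() {
  awk '
    { name = $1; sub(/^\//, "", name) }   # docker inspect prefixes names with "/"
    $2 > 20 { print "CRIT", name, $2; next }
    $2 >= 5 { print "WARN", name, $2 }
  '
}

# Feed it from docker inspect, which exposes .RestartCount per container.
if command -v docker >/dev/null 2>&1; then
  docker ps -aq \
    | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' \
    | flag_crash_loops
fi
```

Keeping the filter separate from the docker call makes it possible to exercise the threshold logic with sample input, for example `printf '/web 1336\n' | flag_crash_loops`.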
Related scripts
| Script | Purpose |
|---|---|
| scripts/VMs/HostingerVM/vm-health-check.sh | Daily read-only health check + alerts |
| scripts/VMs/HostingerVM/vm-cleanup.sh | Periodic safe cleanup |
| scripts/VMs/HostingerVM/CRON_SETUP.md | Cron wiring |
| scripts/ubuntu-vm-security-update.sh | Security patching |
| scripts/VMs/HostingerVM/login.sh | SSH into the VM |