Hermes VM 0a2d303f93 add HostingerVM health-check and cleanup scripts

- vm-health-check.sh: read-only checks for disk, load, RAM, swap,
  Docker containers (crash-loops + healthchecks), build cache, journal.
  Flags: --quiet, --json, --notify (Telegram). Exit 0/1/2 = OK/WARN/CRIT.

- vm-cleanup.sh: safe periodic cleanup.
  Default (weekly): build cache, journal, apt, npm, .next/cache.
  --full (monthly): adds docker system prune, pnpm store, old logs, HOLD cleanup.
  --dry-run, --install-cron, --uninstall-cron.
  Logs to /var/log/vm-cleanup.log.

Related: docs/hostinger-vm-maintenance.md, scripts/VMs/HostingerVM/CRON_SETUP.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-27 18:53:20 +00:00


Hostinger VM — Maintenance & Incident Reference

VM: srv1491630.hstgr.cloud · root · 4× AMD EPYC · 15 GB RAM · 193 GB disk

Key services: hermes-gateway, ollama, Docker (~40 containers), learning_ai_common_plat stack

Quick-start for day-to-day ops

# Check VM health (read-only, safe any time)
bash scripts/VMs/HostingerVM/vm-health-check.sh

# Weekly safe cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh

# Monthly deeper cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh --full

# Cron setup (run once)
bash scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron

See CRON_SETUP.md for full details.
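
The health check is described as reporting its result through its exit status (0/1/2 = OK/WARN/CRIT), so a wrapper can branch on it. A minimal sketch under that assumption; `health_label` is a hypothetical helper, not part of the repository's scripts:

```shell
# Map the documented vm-health-check.sh exit codes to a label.
# health_label is a hypothetical helper used only for illustration.
health_label() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARN" ;;
    2) echo "CRIT" ;;
    *) echo "UNKNOWN" ;;
  esac
}

# Example usage (commented out so the sketch has no side effects):
# bash scripts/VMs/HostingerVM/vm-health-check.sh --quiet
# echo "VM health: $(health_label $?)"
```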

Incident Report — Load Average 1305 (2026-05-26)

What happened

The VM became completely unresponsive. Load average reached 1305 (normal < 4 on 4 CPUs).

load average: 1305.54, 1339.23, 1302.41
RAM: 13 / 15 GB used, ZERO swap configured

Single root cause: one broken Docker container crash-looped 1,336 times over ~25 hours.

Container: learning_ai_common_plat-admin-web-1
Error: Cannot find module '/app/server.js'
Restart policy: unless-stopped (no backoff limit, retries forever)

Each restart spawned ~3 OS processes:

containerd-shim-runc-v2
veth network interface creation
networkctl call for the new interface

At 1,336 restarts × ~3 processes each, roughly 4,000 processes were spawned. The kernel scheduler thrashed and the load average reached 1305.
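
The estimate above is plain arithmetic, reproduced here with the figures from this report (1,336 restarts, ~3 processes each, ~25 hours). The per-restart interval is an extra derived figure, not a measured one:

```shell
restarts=1336
procs_per_restart=3   # approximate, per the list above
hours=25              # approximate duration of the crash loop

total_procs=$(( restarts * procs_per_restart ))
seconds_per_restart=$(( hours * 3600 / restarts ))   # integer division

echo "processes spawned: ~${total_procs}"
echo "one restart roughly every ${seconds_per_restart} s"
```

That works out to about one restart per minute, sustained for more than a day.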

Why the container was broken

The admin-web Docker image had no server.js because its Next.js build failed silently. Three bugs stacked:

| Bug | File | Detail |
|---|---|---|
| Missing build secret | `docker-compose.ecosystem.yml` | `admin-web` service was missing the `<<: *product-build` merge, so `GITEA_NPM_TOKEN` was never passed as a BuildKit secret → `pnpm install` of `@bytelyst/` packages failed |
| Missing COPY step | `dashboards/admin-web/Dockerfile` | `tsconfig.base.json` (monorepo root) was not copied into the build context → `tsc` couldn't find it → build failed |
| Wrong pnpm flag | `dashboards/admin-web/Dockerfile` | `--legacy-peer-deps` is an npm flag, not valid in pnpm 10 → install step exited early |

Because the build stage failed, COPY --from=builder .next/standalone ./ copied nothing, leaving the runner stage with an empty /app — no server.js.
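
A guard that fails loudly when the standalone output is missing would have surfaced this at build time rather than at container start. A minimal sketch; `require_standalone` is a hypothetical helper, and the directory is passed in by the caller:

```shell
# Fail fast if a Next.js standalone build did not produce server.js.
# Usage: require_standalone <standalone-dir>
require_standalone() {
  if [ ! -f "$1/server.js" ]; then
    echo "missing $1/server.js: build did not produce standalone output" >&2
    return 1
  fi
}

# Example (commented out; path assumed from the COPY step described above):
# require_standalone .next/standalone || exit 1
```

In a Dockerfile, the equivalent would be a `RUN test -f .next/standalone/server.js` line at the end of the builder stage, assuming that is where the build writes its output.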

Timeline

| Time (UTC) | Event |
|---|---|
| 2026-05-26 04:43 | VM booted, Docker started |
| 2026-05-26 04:56 | `admin-web` first restart (count=1) |
| 2026-05-26 ~05:00–06:07 | Load climbs steadily, RAM fills |
| 2026-05-26 ~ongoing | 1,336 restarts over 25 hours |
| 2026-05-27 06:07 | VM rebooted (load avg recorded: 1305) |
| 2026-05-27 06:28 | Diagnosis session started (load: 0.55 after reboot) |
| 2026-05-27 08:20 | All fixes applied, cleanup complete |

Secondary problems found

| Issue | Detail |
|---|---|
| No swap | Zero swap configured; OOM kills inevitable under memory pressure |
| 84 GB Docker build cache | Never pruned; Next.js/TSC builds accumulate enormous layer cache |
| 12 GB HOLD node_modules | Archived projects in `/opt/bytelyst/HOLD` had deps never cleaned up |
| ~3 GB .next/cache | Build-time caches in active and HOLD repos |
| 381 MB uncompressed logs | `syslog.1`, `kern.log.1` not compressed; no size/retention limits on journal |
| No crash-loop detection | Nothing alerting on containers restarting > N times |

Fixes Applied (2026-05-27)

1. Crash loop — stopped

Patched /var/lib/docker/containers/2219091e.../hostconfig.json while Docker was stopped:

"RestartPolicy": {"Name": "no", "MaximumRetryCount": 0}

Container is now permanently stopped. Admin-web needs a proper rebuild before re-enabling.
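
As an alternative to editing `hostconfig.json` by hand, `docker update --restart` can change the policy of an existing container while Docker is running. This is a sketch, not what was done during the incident; `cap_restarts` is a hypothetical wrapper and the default cap of 5 retries is an assumed value:

```shell
# Replace a container's restart policy with a bounded one.
# Usage: cap_restarts <container> [max_retries]
cap_restarts() {
  docker update --restart="on-failure:${2:-5}" "$1"
}

# Example (commented out; requires a running Docker daemon):
# cap_restarts learning_ai_common_plat-admin-web-1 5
```

A later `docker compose up` that recreates the container applies the compose file's policy again, so the compose file is the durable place for a bounded policy.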

2. Swap — added

fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
sysctl vm.swappiness=10
echo 'vm.swappiness=10' >> /etc/sysctl.conf
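
The two `echo ... >>` lines above append unconditionally, so rerunning them would duplicate the entries. A minimal idempotent variant, with `append_once` as a hypothetical helper:

```shell
# Append a line to a file only if that exact line is not already present.
# Usage: append_once <line> <file>
append_once() {
  grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Example (commented out; these paths need root):
# append_once '/swapfile none swap sw 0 0' /etc/fstab
# append_once 'vm.swappiness=10' /etc/sysctl.conf
```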

3. Disk — 79 GB reclaimed (70% → 27%)

| Action | Freed |
|---|---|
| `docker builder prune -f` | 84 GB |
| `docker system prune -f` | 107 MB |
| HOLD node_modules deleted | ~12 GB |
| HOLD `.next` build caches | ~1.2 GB |
| Active `.next/cache` dirs | ~2.4 GB |
| Old Claude CLI versions | ~940 MB |
| npm cache clean | ~1.8 GB |
| Journal vacuum | ~220 MB |
| apt clean | ~280 MB |

4. Log management

/etc/systemd/journald.conf.d/size-limits.conf:

[Journal]
SystemMaxUse=200M
SystemKeepFree=1G
MaxRetentionSec=7day
MaxFileSec=1day

/etc/rsyslog.d/20-ufw-filter.conf:

:msg, contains, "[UFW BLOCK]" stop

/etc/logrotate.d/rsyslog-custom: daily rotation, 7-day retention, compress-on-rotate.
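
The logrotate policy described above (daily, 7 rotations, compressed on rotation) could look like the following. This is a sketch, not the file deployed on the VM: the log paths are inferred from the uncompressed files named earlier in this report, `write_logrotate_conf` is a hypothetical helper, and it assumes no other logrotate file already claims the same paths.

```shell
# Write a logrotate stanza matching the stated policy to the given path.
# Usage: write_logrotate_conf <output-file>
write_logrotate_conf() {
  cat > "$1" <<'EOF'
/var/log/syslog /var/log/kern.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}
EOF
}

# Example (commented out; needs root):
# write_logrotate_conf /etc/logrotate.d/rsyslog-custom
# logrotate --debug /etc/logrotate.d/rsyslog-custom   # dry run, changes nothing
```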

5. Dockerfile fixes (ready, not yet deployed)

- docker-compose.ecosystem.yml: added <<: *product-build to admin-web build section
- dashboards/admin-web/Dockerfile: added tsconfig.base.json to COPY, removed --legacy-peer-deps

Deploying admin-web (when ready)

cd /opt/bytelyst/learning_ai_common_plat
GITEA_NPM_TOKEN=$(cat ~/.gitea_npm_token) \
  docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
  build admin-web

# Verify the standalone build was produced:
docker run --rm --entrypoint ls \
  learning_ai_common_plat-admin-web:latest /app | grep server.js

# Start it:
docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
  up -d admin-web

Compose will recreate the container with the restart policy from the compose file (unless-stopped). That policy still retries indefinitely, so it is safe only once the image has been verified to start cleanly.

Ongoing health targets

| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Disk usage `/` | < 55% | 55–70% | > 70% |
| Load average | < 4.0 | 4–8 | > 8 |
| Available RAM | > 3 GB | 1–3 GB | < 1 GB |
| Swap used | < 1 GB | 1–3 GB | > 3 GB |
| Container restart count | < 5 | 5–20 | > 20 |
| Docker build cache | < 5 GB | 5–20 GB | > 20 GB |
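
The thresholds above map directly onto a three-way comparison. A minimal sketch for the "higher is worse" rows with integer values; `classify` is a hypothetical helper and not necessarily how vm-health-check.sh implements it:

```shell
# Classify a "higher is worse" integer metric against warn/crit thresholds.
# Usage: classify <value> <warn_at> <crit_above>
classify() {
  if [ "$1" -gt "$3" ]; then
    echo "CRIT"
  elif [ "$1" -ge "$2" ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}

# Example (commented out): disk usage of / as an integer percentage.
# disk_pct=$(df -P / | awk 'NR==2 {gsub("%", "", $5); print $5}')
# classify "$disk_pct" 55 70
```

Available RAM runs the other way (lower is worse), so it would need the comparisons inverted.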

Reference: safe cleanup commands

# Always safe (just prunes unreferenced build layers)
docker builder prune -f

# Safe: removes stopped containers, unused networks, dangling images, and unused build cache
docker system prune -f

# Safe: removes packages not referenced by any installed node_modules
pnpm store prune

# Safe: vacuum journal to size limit
journalctl --vacuum-size=200M

# Safe: clear apt cache
apt-get clean

# Safe: clear npm cache
npm cache clean --force

# Careful: removes ALL images not used by any container, running or stopped (rebuilds needed)
docker image prune -a -f
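
The commit notes mention a `--dry-run` flag for vm-cleanup.sh. One common way to support such a flag is a small wrapper that prints each command instead of running it. This is a sketch of that pattern, not the script's actual code; `run` and `DRY_RUN` are hypothetical names:

```shell
# Run a command, or only print it when DRY_RUN=1.
# Usage: run <command> [args...]
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '[dry-run] %s\n' "$*"
  else
    "$@"
  fi
}

# Example (commented out; would prune the build cache when DRY_RUN is unset):
# DRY_RUN=1
# run docker builder prune -f
```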

Crash-loop detection (manual check)

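# Show containers that have restarted more than 10 times.
# RestartCount is a `docker inspect` field; `docker ps --format` does not expose it.
docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' \
  | awk '$2 > 10 {sub("^/", "", $1); print "⚠️ LOOP:", $1, "restarts:", $2}'

# Show restart events from the last hour (events, not logs; streams until Ctrl-C).
# Restarts triggered by a restart policy typically appear as die/start events,
# so `--filter event=die` may be the more useful filter for crash loops.
docker events --filter event=restart --since 1h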

The vm-health-check.sh script runs these checks automatically.
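
For use from cron or another script, the manual check can be wrapped so that it also signals the result through its exit status. A sketch only, not the contents of vm-health-check.sh; `find_crash_loops` is a hypothetical helper that reads `<name> <restart_count>` lines on stdin:

```shell
# Print containers whose restart count exceeds a threshold.
# Returns 1 if any were found, 0 otherwise.
# Usage: ... | find_crash_loops [threshold]
find_crash_loops() {
  awk -v max="${1:-10}" '
    $2 + 0 > max + 0 { sub("^/", "", $1); print $1, $2; found = 1 }
    END { exit (found ? 1 : 0) }
  '
}

# Example (commented out; requires Docker):
# docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' \
#   | find_crash_loops 20 || echo "crash loop detected"
```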

| Script | Purpose |
|---|---|
| `scripts/VMs/HostingerVM/vm-health-check.sh` | Daily read-only health check + alerts |
| `scripts/VMs/HostingerVM/vm-cleanup.sh` | Periodic safe cleanup |
| `scripts/VMs/HostingerVM/CRON_SETUP.md` | Cron wiring |
| `scripts/ubuntu-vm-security-update.sh` | Security patching |
| `scripts/VMs/HostingerVM/login.sh` | SSH into the VM |
