bytelyst-devops-tools/docs/hostinger-vm-maintenance.md
Hermes VM 0a2d303f93 add HostingerVM health-check and cleanup scripts
- vm-health-check.sh: read-only checks for disk, load, RAM, swap,
  Docker containers (crash-loops + healthchecks), build cache, journal.
  Flags: --quiet, --json, --notify (Telegram). Exit 0/1/2 = OK/WARN/CRIT.

- vm-cleanup.sh: safe periodic cleanup.
  Default (weekly): build cache, journal, apt, npm, .next/cache.
  --full (monthly): adds docker system prune, pnpm store, old logs, HOLD cleanup.
  --dry-run, --install-cron, --uninstall-cron.
  Logs to /var/log/vm-cleanup.log.

Related: docs/hostinger-vm-maintenance.md, scripts/VMs/HostingerVM/CRON_SETUP.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00


Hostinger VM — Maintenance & Incident Reference

VM: srv1491630.hstgr.cloud · root · 4× AMD EPYC · 15 GB RAM · 193 GB disk
Key services: hermes-gateway, ollama, Docker (~40 containers), learning_ai_common_plat stack


Quick-start for day-to-day ops

# Check VM health (read-only, safe any time)
bash scripts/VMs/HostingerVM/vm-health-check.sh

# Weekly safe cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh

# Monthly deeper cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh --full

# Cron setup (run once)
bash scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron

See CRON_SETUP.md for full details.


Incident Report — Load Average 1305 (2026-05-26)

What happened

The VM became completely unresponsive. Load average reached 1305 (normal < 4 on 4 CPUs).

load average: 1305.54, 1339.23, 1302.41
RAM: 13 / 15 GB used, ZERO swap configured

Single root cause: one broken Docker container crash-looped 1,336 times over ~25 hours.

Container: learning_ai_common_plat-admin-web-1
Error: Cannot find module '/app/server.js'
Restart policy: unless-stopped (no backoff limit, retries forever)

Each restart spawned ~3 OS processes:

  • containerd-shim-runc-v2
  • veth network interface creation
  • networkctl call for the new interface

1,336 restarts × ~3 processes each ≈ 4,000 process spawns, enough to thrash the kernel scheduler and drive the load average to 1305.

Why the container was broken

The admin-web Docker image had no server.js because its Next.js build failed silently. Three bugs stacked:

| Bug | File | Detail |
| --- | --- | --- |
| Missing build secret | docker-compose.ecosystem.yml | admin-web service was missing the `<<: *product-build` anchor, so GITEA_NPM_TOKEN was never passed as a BuildKit secret → pnpm install of `@bytelyst/*` packages failed |
| Missing COPY step | dashboards/admin-web/Dockerfile | tsconfig.base.json (monorepo root) was not copied into the build context → tsc couldn't find it → build failed |
| Wrong pnpm flag | dashboards/admin-web/Dockerfile | --legacy-peer-deps is an npm flag, not valid in pnpm 10 → install step exited early |

Because the build stage failed, COPY --from=builder .next/standalone ./ copied nothing, leaving the runner stage with an empty /app — no server.js.
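
A cheap guard against this failure mode is to assert that the standalone output exists before the image is treated as built. The sketch below is a hypothetical helper (the name `assert_standalone_build` and its default path are assumptions, not part of the existing scripts). The same `test -f` style check could also run as a final step of the builder stage, so that a missing server.js fails the build instead of producing an empty /app.

```shell
# Hypothetical helper: fail loudly when the Next.js standalone output is missing.
# DIR defaults to .next/standalone (an assumed path, relative to the app root).
assert_standalone_build() {
  dir="${1:-.next/standalone}"
  if [ ! -f "$dir/server.js" ]; then
    echo "build check failed: $dir/server.js is missing" >&2
    return 1
  fi
  echo "build check ok: $dir/server.js"
}
```

Called right after the build step, a non-zero return stops the pipeline before a broken image is ever started.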

Timeline

| Time (UTC) | Event |
| --- | --- |
| 2026-05-26 04:43 | VM booted, Docker started |
| 2026-05-26 04:56 | admin-web first restart (count=1) |
| 2026-05-26 ~05:00–06:07 | Load climbs steadily, RAM fills |
| 2026-05-26 ~ongoing | 1,336 restarts over 25 hours |
| 2026-05-27 06:07 | VM rebooted (load avg recorded: 1305) |
| 2026-05-27 06:28 | Diagnosis session started (load: 0.55 after reboot) |
| 2026-05-27 08:20 | All fixes applied, cleanup complete |

Secondary problems found

| Issue | Detail |
| --- | --- |
| No swap | Zero swap configured; OOM kills inevitable under memory pressure |
| 84 GB Docker build cache | Never pruned; Next.js/TSC builds accumulate enormous layer cache |
| 12 GB HOLD node_modules | Archived projects in /opt/bytelyst/HOLD had deps never cleaned up |
| ~3 GB .next/cache | Build-time caches in active and HOLD repos |
| 381 MB uncompressed logs | syslog.1, kern.log.1 not compressed; no size/retention limits on journal |
| No crash-loop detection | Nothing alerting on containers restarting > N times |

Fixes Applied (2026-05-27)

1. Crash loop — stopped

Patched /var/lib/docker/containers/2219091e.../hostconfig.json while Docker was stopped:

"RestartPolicy": {"Name": "no", "MaximumRetryCount": 0}

Container is now permanently stopped. Admin-web needs a proper rebuild before re-enabling.
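
In a future incident, the restart policy can usually be changed without editing Docker's state files by hand: `docker update --restart` rewrites the policy of an existing container. The sketch below wraps it in a hypothetical helper, `cap_restarts` (the name and the default of 5 retries are assumptions), which switches a container to `on-failure` with a bounded retry count so that a broken image gives up instead of looping forever.

```shell
# Hypothetical helper: replace an unbounded restart policy with a bounded one.
# Usage: cap_restarts CONTAINER [MAX_RETRIES]
cap_restarts() {
  name="$1"
  max="${2:-5}"   # assumed default; pick a limit that suits the service
  docker update --restart "on-failure:${max}" "$name"
}

# To stop automatic restarts entirely:
#   docker update --restart no CONTAINER
```

A container recreated by `docker compose up` takes its policy from the compose file again, so a lasting change belongs in the compose file.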

2. Swap — added

fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
sysctl vm.swappiness=10
echo 'vm.swappiness=10' >> /etc/sysctl.conf
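
The two `echo ... >>` lines above append unconditionally, so re-running them would duplicate the entries in /etc/fstab and /etc/sysctl.conf. A minimal sketch of an idempotent alternative, using a hypothetical helper named `append_once` (not part of the existing scripts):

```shell
# Hypothetical helper: append LINE to FILE only if that exact line is not present.
append_once() {
  line="$1"
  file="$2"
  # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage on the VM (as root):
#   append_once '/swapfile none swap sw 0 0' /etc/fstab
#   append_once 'vm.swappiness=10' /etc/sysctl.conf
```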

3. Disk — 79 GB reclaimed (70% → 27%)

| Action | Freed |
| --- | --- |
| docker builder prune -f | 84 GB |
| docker system prune -f | 107 MB |
| HOLD node_modules deleted | ~12 GB |
| HOLD .next build caches | ~1.2 GB |
| Active .next/cache dirs | ~2.4 GB |
| Old Claude CLI versions | ~940 MB |
| npm cache clean | ~1.8 GB |
| Journal vacuum | ~220 MB |
| apt clean | ~280 MB |
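
Before deleting archived dependencies such as the HOLD node_modules, it helps to see what would be removed. The sketch below is a hypothetical dry-run helper (`list_node_modules` is an assumed name, not taken from vm-cleanup.sh). It only prints the top-level node_modules directories under a root and deletes nothing.

```shell
# Hypothetical dry-run helper: print top-level node_modules directories under ROOT.
# -prune stops find from descending into a node_modules once it is matched,
# so nested node_modules inside packages are not listed separately.
list_node_modules() {
  root="$1"
  find "$root" -type d -name node_modules -prune -print | sort
}

# Usage on the VM, with sizes:
#   list_node_modules /opt/bytelyst/HOLD | while IFS= read -r d; do du -sh "$d"; done
```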

4. Log management

/etc/systemd/journald.conf.d/size-limits.conf:

[Journal]
SystemMaxUse=200M
SystemKeepFree=1G
MaxRetentionSec=7day
MaxFileSec=1day

/etc/rsyslog.d/20-ufw-filter.conf:

:msg, contains, "[UFW BLOCK]" stop

/etc/logrotate.d/rsyslog-custom: daily rotation, 7-day retention, compress-on-rotate.

5. Dockerfile fixes (ready, not yet deployed)

docker-compose.ecosystem.yml: added <<: *product-build to the admin-web build section
dashboards/admin-web/Dockerfile: added tsconfig.base.json to COPY, removed --legacy-peer-deps


Deploying admin-web (when ready)

cd /opt/bytelyst/learning_ai_common_plat
GITEA_NPM_TOKEN=$(cat ~/.gitea_npm_token) \
  docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
  build admin-web

# Verify the standalone build was produced:
docker run --rm --entrypoint ls \
  learning_ai_common_plat-admin-web:latest /app | grep server.js

# Start it:
docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
  up -d admin-web

The container's restart policy will be set by the compose file (unless-stopped). Once the image is healthy, this is safe.


Ongoing health targets

| Metric | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Disk usage / | < 55% | 55–70% | > 70% |
| Load average | < 4.0 | 4–8 | > 8 |
| Available RAM | > 3 GB | 1–3 GB | < 1 GB |
| Swap used | < 1 GB | 1–3 GB | > 3 GB |
| Container restart count | < 5 | 5–20 | > 20 |
| Docker build cache | < 5 GB | 5–20 GB | > 20 GB |
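
The thresholds above map naturally onto a three-way classification. The following is a minimal sketch, not the code in vm-health-check.sh: `classify_metric` is a hypothetical helper that handles only "higher is worse" metrics (disk, load, swap, restart count, build cache) and returns 0, 1 or 2 for OK, WARN or CRIT. A "lower is worse" metric such as available RAM would need the comparisons inverted.

```shell
# Hypothetical helper: classify_metric VALUE WARN CRIT
# Prints OK, WARN or CRIT for a "higher is worse" metric and returns 0, 1 or 2.
# awk does the comparison so that fractional values (load average) work.
classify_metric() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 > c + 0)       { print "CRIT"; exit 2 }
    else if (v + 0 >= w + 0) { print "WARN"; exit 1 }
    else                     { print "OK";   exit 0 }
  }'
}

# Usage on the VM (disk usage of / as a bare number, GNU df):
#   classify_metric "$(df --output=pcent / | tail -n 1 | tr -dc '0-9')" 55 70
```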

Reference: safe cleanup commands

# Always safe (just prunes unreferenced build layers)
docker builder prune -f

# Safe if no stopped container needs to be kept: removes all stopped containers,
# unused networks, dangling images, and unused build cache
docker system prune -f

# Safe: removes packages not referenced by any installed node_modules
pnpm store prune

# Safe: vacuum journal to size limit
journalctl --vacuum-size=200M

# Safe: clear apt cache
apt-get clean

# Safe: clear npm cache
npm cache clean --force

# Careful: removes ALL images not used by any existing container (rebuilds needed)
docker image prune -a -f

Crash-loop detection (manual check)

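# Show containers that have restarted more than 10 times
# (RestartCount is a docker inspect field; docker ps --format does not provide it)
docker ps -aq \
  | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' \
  | awk '$2 > 10 {print "⚠️ LOOP:", $1, "restarts:", $2}'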

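# Stream container exit events, starting from one hour ago (Ctrl-C to stop).
# Policy-driven restarts appear as repeated die/start pairs rather than "restart" events.
docker events --filter event=die --since 1h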

The vm-health-check.sh script runs these checks automatically.
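
For alerting, the restart-count check can be wrapped in a small filter that is easy to test without Docker. This is a sketch under stated assumptions, not the logic shipped in vm-health-check.sh: `flag_crash_loops` is a hypothetical name, and the input is assumed to be one "NAME RESTART_COUNT" pair per line.

```shell
# Hypothetical filter: read "NAME RESTART_COUNT" lines on stdin and print the
# containers whose restart count exceeds THRESHOLD (default 10).
# Returns 1 if any container was flagged, 0 otherwise.
flag_crash_loops() {
  threshold="${1:-10}"
  awk -v t="$threshold" '
    $2 + 0 > t + 0 { print "LOOP:", $1, "restarts:", $2; found = 1 }
    END { exit (found ? 1 : 0) }
  '
}

# Usage on the VM (docker inspect exposes .RestartCount per container):
#   docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' \
#     | flag_crash_loops 10
```

The non-zero return on a match makes the filter usable directly in a cron job or an `if` statement that sends a notification.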


| Script | Purpose |
| --- | --- |
| scripts/VMs/HostingerVM/vm-health-check.sh | Daily read-only health check + alerts |
| scripts/VMs/HostingerVM/vm-cleanup.sh | Periodic safe cleanup |
| scripts/VMs/HostingerVM/CRON_SETUP.md | Cron wiring |
| scripts/ubuntu-vm-security-update.sh | Security patching |
| scripts/VMs/HostingerVM/login.sh | SSH into the VM |