- vm-health-check.sh: read-only checks for disk, load, RAM, swap, Docker containers (crash-loops + healthchecks), build cache, journal. Flags: --quiet, --json, --notify (Telegram). Exit 0/1/2 = OK/WARN/CRIT. - vm-cleanup.sh: safe periodic cleanup. Default (weekly): build cache, journal, apt, npm, .next/cache. --full (monthly): adds docker system prune, pnpm store, old logs, HOLD cleanup. --dry-run, --install-cron, --uninstall-cron. Logs to /var/log/vm-cleanup.log. Related: docs/hostinger-vm-maintenance.md, scripts/VMs/HostingerVM/CRON_SETUP.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Hostinger VM — Maintenance & Incident Reference

**VM:** `srv1491630.hstgr.cloud` · root · 4× AMD EPYC · 15 GB RAM · 193 GB disk
**Key services:** `hermes-gateway`, `ollama`, Docker (~40 containers), `learning_ai_common_plat` stack

---

## Quick-start for day-to-day ops

```bash
# Check VM health (read-only, safe any time)
bash scripts/VMs/HostingerVM/vm-health-check.sh

# Weekly safe cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh

# Monthly deeper cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh --full

# Cron setup (run once)
bash scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron
```

See [`CRON_SETUP.md`](../scripts/VMs/HostingerVM/CRON_SETUP.md) for full details.

---

## Incident Report — Load Average 1305 (2026-05-26)

### What happened

The VM became completely unresponsive. Load average reached **1305** (normal < 4 on 4 CPUs).

```
load average: 1305.54, 1339.23, 1302.41
RAM: 13 / 15 GB used, ZERO swap configured
```
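
Load average is usually read relative to the number of cores. The snippet below is a minimal sketch of that normalisation; `load_per_core` is an illustrative name and not part of the repo scripts, and the commented live call assumes a Linux host with `/proc/loadavg` and `nproc`.

```bash
# Divide a load average by the number of cores.
# Usage: load_per_core <load> <cores>
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", load / cores }'
}

# The 1-minute figure from the incident on the 4-core VM:
load_per_core 1305.54 4

# On a live Linux host (assumes /proc/loadavg and nproc exist):
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A result at or below roughly 1.0 per core corresponds to the "normal < 4 on 4 CPUs" guideline.
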

**Single root cause:** one broken Docker container crash-looped **1,336 times** over ~25 hours.

- Container: `learning_ai_common_plat-admin-web-1`
- Error: `Cannot find module '/app/server.js'`
- Restart policy: `unless-stopped` (no retry limit, so Docker retries forever)

Each restart spawned ~3 OS processes:
- `containerd-shim-runc-v2`
- veth network interface creation
- `networkctl` call for the new interface

With 1,336 restarts × ~3 processes each, roughly **4,000 processes** were spawned; the kernel scheduler thrashed and load reached 1305.
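
The arithmetic behind these figures, as a quick shell sketch using only the numbers reported above (integer division, so results are rounded down):

```bash
restarts=1336          # restart count from the incident
hours=25               # approximate duration of the crash loop
procs_per_restart=3    # approximate processes spawned per restart

total_procs=$((restarts * procs_per_restart))
restarts_per_hour=$((restarts / hours))
seconds_between=$((hours * 3600 / restarts))

echo "processes spawned:   ~${total_procs}"        # ~4008
echo "restarts per hour:   ~${restarts_per_hour}"  # ~53
echo "seconds per restart: ~${seconds_between}"    # ~67
```

That works out to about one restart per minute for the whole 25 hours.
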

### Why the container was broken

The `admin-web` Docker image had no `server.js` because its Next.js build failed silently. Three bugs stacked:

| Bug | File | Detail |
|-----|------|--------|
| Missing build secret | `docker-compose.ecosystem.yml` | `admin-web` service was missing `<<: *product-build` anchor, so `GITEA_NPM_TOKEN` was never passed as a BuildKit secret → `pnpm install` of `@bytelyst/*` packages failed |
| Missing COPY step | `dashboards/admin-web/Dockerfile` | `tsconfig.base.json` (monorepo root) was not copied into the build context → `tsc` couldn't find it → build failed |
| Wrong pnpm flag | `dashboards/admin-web/Dockerfile` | `--legacy-peer-deps` is an npm flag, not valid in pnpm 10 → install step exited early |

Because the build stage failed, `COPY --from=builder .next/standalone ./` copied nothing, leaving the runner stage with an empty `/app` and therefore no `server.js`.
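
One way to keep this kind of failure from going unnoticed is to assert that the standalone output exists before the image is assembled. The sketch below is an assumption, not what the repo currently does: `require_build_output` is a hypothetical helper, and the path `.next/standalone/server.js` is inferred from the `COPY` line above.

```bash
# Fail loudly when an expected build artifact is missing.
# Usage: require_build_output <path>
require_build_output() {
  if [ ! -f "$1" ]; then
    echo "build output missing: $1" >&2
    return 1
  fi
  echo "build output present: $1"
}

# Example (hypothetical), chained after whatever command runs the Next.js build:
# <build command> && require_build_output .next/standalone/server.js
```

With a check like this at the end of the build step, a missing `server.js` would stop the image build instead of surfacing later as a crash loop.
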

### Timeline

| Time (UTC) | Event |
|---|---|
| 2026-05-26 04:43 | VM booted, Docker started |
| 2026-05-26 04:56 | `admin-web` first restart (count=1) |
| 2026-05-26 ~05:00–06:07 | Load climbs steadily, RAM fills |
| 2026-05-26 ~ongoing | 1,336 restarts over 25 hours |
| 2026-05-27 06:07 | VM rebooted (load avg recorded: 1305) |
| 2026-05-27 06:28 | Diagnosis session started (load: 0.55 after reboot) |
| 2026-05-27 08:20 | All fixes applied, cleanup complete |

### Secondary problems found

| Issue | Detail |
|---|---|
| **No swap** | Zero swap configured; OOM kills are inevitable under memory pressure |
| **84 GB Docker build cache** | Never pruned; Next.js/TSC builds accumulate enormous layer cache |
| **12 GB HOLD node_modules** | Archived projects in `/opt/bytelyst/HOLD` had dependencies that were never cleaned up |
| **~3 GB .next/cache** | Build-time caches in active and HOLD repos |
| **381 MB uncompressed logs** | `syslog.1`, `kern.log.1` not compressed; no size/retention limits on journal |
| **No crash-loop detection** | Nothing alerted on containers restarting > N times |

---

## Fixes Applied (2026-05-27)

### 1. Crash loop — stopped

Patched `/var/lib/docker/containers/2219091e.../hostconfig.json` while Docker was stopped:

```json
"RestartPolicy": {"Name": "no", "MaximumRetryCount": 0}
```

The container no longer restarts automatically and stays stopped. `admin-web` needs a proper rebuild before it is re-enabled.

### 2. Swap — added

```bash
# Create and enable a 4 GB swap file
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Persist the swap file across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab

# Prefer RAM and swap only under memory pressure (applied now, then persisted)
sysctl vm.swappiness=10
echo 'vm.swappiness=10' >> /etc/sysctl.conf
```
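
To confirm that swap is active after these steps, the kernel's own counters can be checked. A minimal sketch, assuming a Linux host; `swap_total_mb` is an illustrative helper that parses `/proc/meminfo`-style input, and the sample value in the demonstration is made up.

```bash
# Print SwapTotal in MiB from /proc/meminfo-formatted input on stdin.
swap_total_mb() {
  awk '$1 == "SwapTotal:" { printf "%d\n", $2 / 1024 }'
}

# Demonstration with a sample line (4 GiB expressed in kB, illustrative value):
printf 'SwapTotal:       4194304 kB\n' | swap_total_mb

# On the VM itself:
# swap_total_mb < /proc/meminfo
# cat /proc/sys/vm/swappiness   # should print 10 after the sysctl above
```

`swapon --show` gives the same information in tabular form, including the backing file.
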
### 3. Disk — 79 GB reclaimed (70% → 27%)

| Action | Freed |
|---|---|
| `docker builder prune -f` | 84 GB |
| `docker system prune -f` | 107 MB |
| HOLD node_modules deleted | ~12 GB |
| HOLD `.next` build caches | ~1.2 GB |
| Active `.next/cache` dirs | ~2.4 GB |
| Old Claude CLI versions | ~940 MB |
| npm cache clean | ~1.8 GB |
| Journal vacuum | ~220 MB |
| apt clean | ~280 MB |

### 4. Log management

`/etc/systemd/journald.conf.d/size-limits.conf`:

```ini
[Journal]
SystemMaxUse=200M
SystemKeepFree=1G
MaxRetentionSec=7day
MaxFileSec=1day
```

`/etc/rsyslog.d/20-ufw-filter.conf`:

```
:msg, contains, "[UFW BLOCK]" stop
```

`/etc/logrotate.d/rsyslog-custom`: daily rotation, 7-day retention, compress-on-rotate.

### 5. Dockerfile fixes (ready, not yet deployed)

- `docker-compose.ecosystem.yml`: added `<<: *product-build` to the `admin-web` build section
- `dashboards/admin-web/Dockerfile`: added `tsconfig.base.json` to COPY, removed `--legacy-peer-deps`

---

## Deploying admin-web (when ready)

```bash
cd /opt/bytelyst/learning_ai_common_plat
GITEA_NPM_TOKEN=$(cat ~/.gitea_npm_token) \
  docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
  build admin-web

# Verify the standalone build was produced:
docker run --rm --entrypoint ls \
  learning_ai_common_plat-admin-web:latest /app | grep server.js

# Start it:
docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
  up -d admin-web
```
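
The verification step can also gate the start, so that a missing `server.js` stops the rollout instead of starting another crash loop. A sketch under that assumption; `start_if_built` is a hypothetical wrapper that reuses only the `docker run` and `docker compose` invocations shown above.

```bash
# Start admin-web only if the image contains /app/server.js.
start_if_built() {
  image="learning_ai_common_plat-admin-web:latest"
  if docker run --rm --entrypoint ls "$image" /app | grep -qx 'server.js'; then
    docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
      up -d admin-web
  else
    echo "refusing to start: /app/server.js not found in $image" >&2
    return 1
  fi
}

# Run from /opt/bytelyst/learning_ai_common_plat after the build step:
# start_if_built
```
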

The container's restart policy will be set by the compose file (`unless-stopped`). Once the image is healthy, this is safe.

---

## Ongoing health targets

| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Disk usage `/` | < 55% | 55–70% | > 70% |
| Load average | < 4.0 | 4–8 | > 8 |
| Available RAM | > 3 GB | 1–3 GB | < 1 GB |
| Swap used | < 1 GB | 1–3 GB | > 3 GB |
| Container restart count | < 5 | 5–20 | > 20 |
| Docker build cache | < 5 GB | 5–20 GB | > 20 GB |
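
These thresholds map naturally onto a 0/1/2 (OK/WARN/CRIT) exit-code convention. The function below is a sketch of that mapping for metrics where a higher value is worse (so not "Available RAM", which is inverted); `classify` is an illustrative name and not necessarily how `vm-health-check.sh` implements it.

```bash
# Map a "higher is worse" metric to OK/WARN/CRIT.
# Usage: classify <value> <warn_threshold> <crit_threshold>
# Prints the label and returns 0 (OK), 1 (WARN) or 2 (CRIT).
classify() {
  awk -v v="$1" -v warn="$2" -v crit="$3" 'BEGIN {
    if (v > crit)       { print "CRIT"; exit 2 }
    else if (v >= warn) { print "WARN"; exit 1 }
    print "OK"
  }'
}

# Examples using the thresholds from the table:
# classify 27 55 70     prints OK    (disk usage after cleanup)
# classify 12 5 20      prints WARN  (a container with 12 restarts)
# classify 1305 4 8     prints CRIT  (load average during the incident)
```
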

---

## Reference: safe cleanup commands

```bash
# Always safe (just prunes unreferenced build layers)
docker builder prune -f

# Safe: removes stopped containers, unused networks, dangling images and unused build cache
docker system prune -f

# Safe: removes packages not referenced by any installed node_modules
pnpm store prune

# Safe: vacuum journal to size limit
journalctl --vacuum-size=200M

# Safe: clear apt cache
apt-get clean

# Safe: clear npm cache
npm cache clean --force

# Careful: removes ALL images not referenced by any container (rebuilds needed)
docker image prune -a -f
```

---

## Crash-loop detection (manual check)

```bash
# Show containers that have restarted more than 10 times.
# RestartCount comes from `docker inspect`; `docker ps --format` does not expose it.
docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' \
  | awk '$2 > 10 {print "⚠️ LOOP:", $1, "restarts:", $2}'

# Show container exits from the last hour (keeps streaming until Ctrl-C).
# Policy-driven restarts are logged as die/start events.
docker events --filter event=die --since 1h
```
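
For cron or scripted use, the restart-count check can be wrapped so that it returns a non-zero status when a loop is found. A minimal sketch; `flag_crash_loops` is a hypothetical helper that reads `name count` pairs on stdin, such as those printed by `docker inspect --format '{{.Name}} {{.RestartCount}}'`.

```bash
# Read "name restart_count" lines on stdin and report containers
# whose count exceeds the threshold (default 10).
# Returns 1 if any container is flagged, 0 otherwise.
flag_crash_loops() {
  awk -v limit="${1:-10}" '
    $2 > limit { print "LOOP:", $1, "restarts:", $2; found = 1 }
    END { exit (found ? 1 : 0) }
  '
}

# Live usage (assumes Docker is installed; -r is the GNU xargs "skip if empty" flag):
# docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' \
#   | flag_crash_loops 10
```
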

The `vm-health-check.sh` script runs these checks automatically.

---

## Related scripts

| Script | Purpose |
|---|---|
| `scripts/VMs/HostingerVM/vm-health-check.sh` | Daily read-only health check + alerts |
| `scripts/VMs/HostingerVM/vm-cleanup.sh` | Periodic safe cleanup |
| `scripts/VMs/HostingerVM/CRON_SETUP.md` | Cron wiring |
| `scripts/ubuntu-vm-security-update.sh` | Security patching |
| `scripts/VMs/HostingerVM/login.sh` | SSH into the VM |