bytelyst-devops-tools/docs/hostinger-vm-maintenance.md
Hermes VM 0a2d303f93 add HostingerVM health-check and cleanup scripts
- vm-health-check.sh: read-only checks for disk, load, RAM, swap,
  Docker containers (crash-loops + healthchecks), build cache, journal.
  Flags: --quiet, --json, --notify (Telegram). Exit 0/1/2 = OK/WARN/CRIT.

- vm-cleanup.sh: safe periodic cleanup.
  Default (weekly): build cache, journal, apt, npm, .next/cache.
  --full (monthly): adds docker system prune, pnpm store, old logs, HOLD cleanup.
  --dry-run, --install-cron, --uninstall-cron.
  Logs to /var/log/vm-cleanup.log.

Related: docs/hostinger-vm-maintenance.md, scripts/VMs/HostingerVM/CRON_SETUP.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00

# Hostinger VM — Maintenance & Incident Reference
**VM:** `srv1491630.hstgr.cloud` · root · 4× AMD EPYC · 15 GB RAM · 193 GB disk
**Key services:** `hermes-gateway`, `ollama`, Docker (~40 containers), `learning_ai_common_plat` stack
---
## Quick-start for day-to-day ops
```bash
# Check VM health (read-only, safe any time)
bash scripts/VMs/HostingerVM/vm-health-check.sh
# Weekly safe cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh
# Monthly deeper cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh --full
# Cron setup (run once)
bash scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron
```
See [`CRON_SETUP.md`](../scripts/VMs/HostingerVM/CRON_SETUP.md) for full details.
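
The health check is described as exiting with 0, 1, or 2 for OK, WARN, or CRIT, which makes it easy to wire into other automation. The following is a minimal sketch of that wiring; `status_label` is a hypothetical helper and is not part of the repo scripts.

```bash
#!/usr/bin/env bash
# Sketch only: translate the health check's exit code into a label.
# status_label is a hypothetical helper name.
status_label() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARN" ;;
    2) echo "CRIT" ;;
    *) echo "UNKNOWN($1)" ;;
  esac
}

# Example wiring (commented out so the sketch has no side effects):
# rc=0
# bash scripts/VMs/HostingerVM/vm-health-check.sh --quiet || rc=$?
# echo "VM health: $(status_label "$rc")"
```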
---
## Incident Report — Load Average 1305 (2026-05-26)
### What happened
The VM became completely unresponsive. Load average reached **1305** (normal < 4 on 4 CPUs).
```
load average: 1305.54, 1339.23, 1302.41
RAM: 13 / 15 GB used, ZERO swap configured
```
**Single root cause:** one broken Docker container crash-looped **1,336 times** over ~25 hours.
- Container: `learning_ai_common_plat-admin-web-1`
- Error: `Cannot find module '/app/server.js'`
- Restart policy: `unless-stopped` (no retry limit, so Docker retries forever)

Each restart spawned ~3 OS processes:
- `containerd-shim-runc-v2`
- veth network interface creation
- `networkctl` call for the new interface
1,336 restarts × ~3 processes each ≈ **4,000 processes**. The kernel scheduler thrashed and the load average climbed to 1305.
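
As a quick sanity check of these figures, the arithmetic can be reproduced in the shell. The three inputs are the approximate values reported in this incident. The variable names are only illustrative.

```bash
restarts=1336        # restart count reported for the container
procs_per_restart=3  # approximate processes spawned per restart
hours=25             # approximate duration of the crash loop

echo "processes spawned: $(( restarts * procs_per_restart ))"
echo "seconds between restarts: $(( hours * 3600 / restarts ))"
```

With integer division this works out to roughly 4,000 processes in total and about one restart every 67 seconds.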
### Why the container was broken
The `admin-web` Docker image had no `server.js` because its Next.js build failed silently. Three bugs stacked:
| Bug | File | Detail |
|-----|------|--------|
| Missing build secret | `docker-compose.ecosystem.yml` | `admin-web` service was missing `<<: *product-build` anchor, so `GITEA_NPM_TOKEN` was never passed as a BuildKit secret → `pnpm install` of `@bytelyst/*` packages failed |
| Missing COPY step | `dashboards/admin-web/Dockerfile` | `tsconfig.base.json` (monorepo root) was not copied into the build context → `tsc` couldn't find it → build failed |
| Wrong pnpm flag | `dashboards/admin-web/Dockerfile` | `--legacy-peer-deps` is an npm flag, not valid in pnpm 10 → install step exited early |
Because the build stage failed, `COPY --from=builder .next/standalone ./` copied nothing, leaving the runner stage with an empty `/app` and therefore no `server.js`.
### Timeline
| Time (UTC) | Event |
|---|---|
| 2026-05-26 04:43 | VM booted, Docker started |
| 2026-05-26 04:56 | `admin-web` first restart (count=1) |
| 2026-05-26 ~05:00–06:07 | Load climbs steadily, RAM fills |
| 2026-05-26 ~ongoing | 1,336 restarts over 25 hours |
| 2026-05-27 06:07 | VM rebooted (load avg recorded: 1305) |
| 2026-05-27 06:28 | Diagnosis session started (load: 0.55 after reboot) |
| 2026-05-27 08:20 | All fixes applied, cleanup complete |
### Secondary problems found
| Issue | Detail |
|---|---|
| **No swap** | Zero swap configured → OOM kills inevitable under memory pressure |
| **84 GB Docker build cache** | Never pruned; Next.js/TSC builds accumulate enormous layer cache |
| **12 GB HOLD node_modules** | Archived projects in `/opt/bytelyst/HOLD` had deps never cleaned up |
| **~3 GB .next/cache** | Build-time caches in active and HOLD repos |
| **381 MB uncompressed logs** | `syslog.1`, `kern.log.1` not compressed; no size/retention limits on journal |
| **No crash-loop detection** | Nothing alerting on containers restarting > N times |
---
## Fixes Applied (2026-05-27)
### 1. Crash loop — stopped
Patched `/var/lib/docker/containers/2219091e.../hostconfig.json` while Docker was stopped:
```json
"RestartPolicy": {"Name": "no", "MaximumRetryCount": 0}
```
Container is now permanently stopped. Admin-web needs a proper rebuild before re-enabling.
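
When the Docker daemon is still responsive, a similar change can usually be made without editing files under `/var/lib/docker`, because `docker update --restart` changes the restart policy of an existing container. The sketch below caps retries with `on-failure:N` rather than disabling restarts. The helper name `cap_restarts`, the default of 5 retries, and the `DOCKER` override used for the dry run are all assumptions made for illustration.

```bash
# cap_restarts CONTAINER [MAX_RETRIES]
# Switch a container to on-failure with a retry cap, so a broken image
# gives up after MAX_RETRIES attempts instead of looping forever.
cap_restarts() {
  local container="$1" max="${2:-5}"
  "${DOCKER:-docker}" update --restart "on-failure:${max}" "$container"
}

# Dry run: with DOCKER=echo the docker arguments are printed, not executed.
DOCKER=echo cap_restarts learning_ai_common_plat-admin-web-1 5
```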
### 2. Swap — added
```bash
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
sysctl vm.swappiness=10
echo 'vm.swappiness=10' >> /etc/sysctl.conf
```
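
To confirm the swap file is active after running the commands above (or after a reboot), `swapon --show` and `free -h` give a quick view. The sketch below does the same check in a scriptable form by reading `SwapTotal` from `/proc/meminfo`. The helper name `swap_total_kb` is hypothetical.

```bash
# swap_total_kb [MEMINFO_FILE]: print SwapTotal in kB (defaults to /proc/meminfo).
swap_total_kb() {
  awk '/^SwapTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ] && [ "$(swap_total_kb)" -gt 0 ]; then
  echo "swap configured: $(swap_total_kb) kB"
else
  echo "WARNING: no swap configured"
fi
echo "swappiness: $(cat /proc/sys/vm/swappiness 2>/dev/null || echo unknown)"
```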
### 3. Disk — 79 GB reclaimed (70% → 27%)
| Action | Freed |
|---|---|
| `docker builder prune -f` | 84 GB |
| `docker system prune -f` | 107 MB |
| HOLD node_modules deleted | ~12 GB |
| HOLD `.next` build caches | ~1.2 GB |
| Active `.next/cache` dirs | ~2.4 GB |
| Old Claude CLI versions | ~940 MB |
| npm cache clean | ~1.8 GB |
| Journal vacuum | ~220 MB |
| apt clean | ~280 MB |
### 4. Log management
`/etc/systemd/journald.conf.d/size-limits.conf`:
```ini
[Journal]
SystemMaxUse=200M
SystemKeepFree=1G
MaxRetentionSec=7day
MaxFileSec=1day
```
`/etc/rsyslog.d/20-ufw-filter.conf`:
```
:msg, contains, "[UFW BLOCK]" stop
```
`/etc/logrotate.d/rsyslog-custom`: daily rotation, 7-day retention, compress-on-rotate.
### 5. Dockerfile fixes (ready, not yet deployed)
`docker-compose.ecosystem.yml` — added `<<: *product-build` to `admin-web` build section
`dashboards/admin-web/Dockerfile` — added `tsconfig.base.json` to COPY, removed `--legacy-peer-deps`
---
## Deploying admin-web (when ready)
```bash
cd /opt/bytelyst/learning_ai_common_plat
GITEA_NPM_TOKEN=$(cat ~/.gitea_npm_token) \
  docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
  build admin-web
# Verify the standalone build was produced:
docker run --rm --entrypoint ls \
  learning_ai_common_plat-admin-web:latest /app | grep server.js
# Start it:
docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
  up -d admin-web
```
The compose file sets the container's restart policy back to `unless-stopped`. That is safe only once the image is verified to contain `/app/server.js`; with a broken image, the crash loop would resume.
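
The verification step can also be turned into a guard that refuses to start the service when `server.js` is missing. This is a sketch under assumptions: `image_has_server_js` is a hypothetical helper, and the `DOCKER` override exists only so the function can be exercised without a Docker daemon.

```bash
# image_has_server_js IMAGE: succeed only if `ls /app` inside the image
# lists a file named exactly server.js.
image_has_server_js() {
  "${DOCKER:-docker}" run --rm --entrypoint ls "$1" /app | grep -qx 'server.js'
}

# Example wiring (commented out; it would start the service):
# if image_has_server_js learning_ai_common_plat-admin-web:latest; then
#   docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
#     up -d admin-web
# else
#   echo "refusing to start admin-web: /app/server.js is missing" >&2
# fi
```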
---
## Ongoing health targets
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Disk usage `/` | < 55% | 55–70% | > 70% |
| Load average | < 4.0 | 4–8 | > 8 |
| Available RAM | > 3 GB | 1–3 GB | < 1 GB |
| Swap used | < 1 GB | 1–3 GB | > 3 GB |
| Container restart count | < 5 | 5–20 | > 20 |
| Docker build cache | < 5 GB | 5–20 GB | > 20 GB |
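
The thresholds in this table map naturally onto a small classifier. The sketch below covers the metrics where a higher value is worse (disk usage, load, swap used, restart count, build cache). Available RAM runs the other way and would need the comparisons mirrored. `classify` is a hypothetical helper, and this is not necessarily how `vm-health-check.sh` implements its checks.

```bash
# classify VALUE WARN CRIT: for metrics where higher is worse.
# Prints OK below WARN, WARN from WARN up to and including CRIT, CRIT above CRIT.
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 > c + 0)       print "CRIT"
    else if (v + 0 >= w + 0) print "WARN"
    else                     print "OK"
  }'
}

classify 27 55 70      # disk usage % after the cleanup
classify 1305.54 4 8   # load average during the incident
```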
---
## Reference: safe cleanup commands
```bash
# Always safe (just prunes unreferenced build layers)
docker builder prune -f
# Safe: removes stopped containers, unused networks, dangling images and build cache
docker system prune -f
# Safe: removes packages not referenced by any installed node_modules
pnpm store prune
# Safe: vacuum journal to size limit
journalctl --vacuum-size=200M
# Safe: clear apt cache
apt-get clean
# Safe: clear npm cache
npm cache clean --force
# Careful: removes ALL images not used by a running container (rebuilds needed)
docker image prune -a -f
```
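
Before running any of these by hand, it can help to preview what would be executed. `vm-cleanup.sh` offers a `--dry-run` flag. Its implementation is not shown here, so the following is only one common pattern for such a switch, with a hypothetical `run` wrapper and `DRY_RUN` variable.

```bash
# run CMD [ARGS...]: execute the command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

DRY_RUN=1
run docker builder prune -f
run journalctl --vacuum-size=200M
```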
---
## Crash-loop detection (manual check)
```bash
# Show containers that have restarted more than 10 times
# (RestartCount comes from `docker inspect`; `docker ps --format` has no such field)
docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' \
  | awk '$2 > 10 {print "⚠️ LOOP:", $1, "restarts:", $2}'
# Show restart events from the last hour (keeps streaming; Ctrl-C to stop)
docker events --filter event=restart --since 1h
```
The `vm-health-check.sh` script runs these checks automatically.
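
For reference, the restart-count thresholds from the health targets (warning from 5, critical above 20) can be applied to `NAME COUNT` pairs as in the sketch below. `flag_restarts` is a hypothetical helper that reads from stdin so it can be tried without Docker. It is not taken from `vm-health-check.sh`.

```bash
# flag_restarts [WARN] [CRIT]: read "NAME COUNT" lines on stdin, print the
# containers at or above WARN, and return 0/1/2 for OK/WARN/CRIT.
flag_restarts() {
  awk -v warn="${1:-5}" -v crit="${2:-20}" '
    $2 + 0 > crit + 0  { print "CRIT", $1, $2; worst = 2; next }
    $2 + 0 >= warn + 0 { print "WARN", $1, $2; if (worst < 1) worst = 1 }
    END { exit worst + 0 }
  '
}

# Sample input; the names and counts other than 1336 are made up.
printf '%s\n' '/admin-web 1336' '/api 7' '/db 0' | flag_restarts \
  || echo "exit code: $?"

# Live wiring (commented out; needs a Docker daemon):
# docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' \
#   | flag_restarts
```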
---
## Related scripts
| Script | Purpose |
|---|---|
| `scripts/VMs/HostingerVM/vm-health-check.sh` | Daily read-only health check + alerts |
| `scripts/VMs/HostingerVM/vm-cleanup.sh` | Periodic safe cleanup |
| `scripts/VMs/HostingerVM/CRON_SETUP.md` | Cron wiring |
| `scripts/ubuntu-vm-security-update.sh` | Security patching |
| `scripts/VMs/HostingerVM/login.sh` | SSH into the VM |