Hermes VM 0a2d303f93 add HostingerVM health-check and cleanup scripts

- vm-health-check.sh: read-only checks for disk, load, RAM, swap,
  Docker containers (crash-loops + healthchecks), build cache, journal.
  Flags: --quiet, --json, --notify (Telegram). Exit 0/1/2 = OK/WARN/CRIT.

- vm-cleanup.sh: safe periodic cleanup.
  Default (weekly): build cache, journal, apt, npm, .next/cache.
  --full (monthly): adds docker system prune, pnpm store, old logs, HOLD cleanup.
  --dry-run, --install-cron, --uninstall-cron.
  Logs to /var/log/vm-cleanup.log.

Related: docs/hostinger-vm-maintenance.md, scripts/VMs/HostingerVM/CRON_SETUP.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-27 18:53:20 +00:00

Hostinger VM — Cron Setup

Automated maintenance schedule for srv1491630. It uses two scripts: vm-health-check.sh (read-only checks) and vm-cleanup.sh (safe cleanup).

Quick install

SSH into the VM and run:

bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron

This installs the full recommended schedule. To remove it:

bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --uninstall-cron
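To confirm that an install or uninstall took effect, one option is to count the active crontab lines that reference either script. The helper below is a minimal sketch; `count_vm_cron_lines` is a hypothetical name, not part of either script. It reads crontab text on stdin so it never modifies anything.

```shell
# Count non-comment crontab lines that reference either maintenance script.
# Reads crontab text on stdin; prints the count and always returns 0.
count_vm_cron_lines() {
  # grep -c exits 1 when the count is 0, so mask that with "|| true".
  grep -Ec '^[^#]*vm-(health-check|cleanup)\.sh' || true
}
```

Usage: `crontab -l 2>/dev/null | count_vm_cron_lines`. If the installer writes the same four entries listed in the manual crontab section, the result should be 4 after `--install-cron` and 0 after `--uninstall-cron`; that correspondence is an assumption.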

What gets scheduled

Schedule	Time (UTC)	Command	What it does
Daily	07:00	`vm-health-check.sh`	Read-only check; sends Telegram alert on WARNING/CRITICAL
Daily	03:00	`vm-cleanup.sh`	Prune Docker build cache only (always safe)
Weekly	Sun 02:00	`vm-cleanup.sh`	Standard cleanup (see below)
Monthly	1st 01:00	`vm-cleanup.sh --full`	Full cleanup (see below)

What each mode does

Standard weekly cleanup (`vm-cleanup.sh`)

All steps are labelled SAFE — they only remove regenerable caches.

Step	What's removed	Risk
Docker build cache	Layer cache from `docker build` runs	Zero — rebuilds just take longer next time
Crash loop check	Detection only, no changes	Zero
Journal vacuum	Old journal entries beyond 200MB / 7 days	Zero — logs are already captured in syslog
APT cache	`/var/cache/apt/archives/`	Zero — packages can be re-downloaded
NPM cache	`~/.npm/_cacache/`	Zero — cache is re-populated on next `npm install`
`.next/cache`	Webpack/babel/TSC build cache dirs	Zero — rebuilt automatically on next `next build`
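For orientation, the steps in the table map roughly onto the standard CLI commands below. This is a sketch under assumptions: `standard_cleanup_plan` is a hypothetical helper that only prints commands, and the commands are common equivalents, not necessarily what vm-cleanup.sh runs internally. The 200M and 7d values come from the table above.

```shell
# Print (never execute) one plausible command per standard cleanup step.
# Assumption: common equivalents only; vm-cleanup.sh may do this differently.
standard_cleanup_plan() {
  cat <<'EOF'
docker builder prune -f
journalctl --vacuum-size=200M --vacuum-time=7d
apt-get clean
npm cache clean --force
find /opt/bytelyst -type d -path '*/.next/cache' -prune -print
EOF
}
```

The last line only lists `.next/cache` directories. For an authoritative preview of what the script itself would remove, use its `--dry-run` flag.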

Monthly full cleanup (`vm-cleanup.sh --full`)

Adds these CAREFUL steps on top of the standard run:

Step	What's removed	Risk
Docker system prune	Stopped containers, unused networks, dangling images	Low — does NOT remove images used by any container
pnpm store prune	Packages not referenced by any `node_modules`	Low — only removes truly orphaned packages
Old log files	`.gz` log rotations older than 30 days	Low — old compressed logs
HOLD node_modules	`node_modules` in `/opt/bytelyst/HOLD` archived projects	Low — code intact, can reinstall with `pnpm install`
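To preview which compressed log rotations fall under the 30-day rule before a full run, a read-only listing such as the following may help. It is a sketch: `list_old_gz_logs` is a hypothetical helper, and the directory to scan (for example /var/log) is an assumption, since the table does not say where the script looks.

```shell
# List .gz files under DIR last modified more than DAYS days ago (default 30).
# Usage: list_old_gz_logs DIR [DAYS]
# Read-only: prints matching paths and deletes nothing.
list_old_gz_logs() {
  find "$1" -type f -name '*.gz' -mtime +"${2:-30}" -print
}
```

Usage: `list_old_gz_logs /var/log`. Note that `-mtime +30` counts whole 24-hour periods, so a file must be at least 31 days old to match.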

Never touched (by design)

/opt/bytelyst/*/node_modules (active repos)
/opt/bytelyst/*/src, /app, /backend, /web source code
.next/standalone (production Next.js builds)
Docker images used by currently configured containers
/usr/local/lib/hermes-agent/
/usr/share/ollama/ (models)
/swapfile
Any database volumes

Manual crontab (if you prefer not to use --install-cron)

# Health check daily 07:00 UTC
0 7 * * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-health-check.sh --quiet --notify 2>&1 | logger -t vm-health

# Build cache prune daily 03:00 UTC
0 3 * * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --quiet 2>&1 | logger -t vm-cleanup

# Standard weekly cleanup Sunday 02:00 UTC
0 2 * * 0 bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --quiet 2>&1 | logger -t vm-cleanup

# Full monthly cleanup 1st of month 01:00 UTC
0 1 1 * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --full --quiet 2>&1 | logger -t vm-cleanup

Edit with: crontab -e
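As an alternative to editing interactively, entries can be added non-interactively without creating duplicates. The helper below is a minimal sketch; `merge_cron_entry` is a hypothetical name. It takes the existing crontab on stdin, drops any line identical to the new entry, and appends the entry once.

```shell
# Emit the crontab from stdin with exactly one copy of the given entry appended.
# Usage: merge_cron_entry 'CRON LINE'
merge_cron_entry() {
  # -F: fixed string (the asterisks are literal), -x: whole-line match, -v: invert.
  # grep exits 1 when it prints nothing, so mask that with "|| true".
  grep -Fxv -- "$1" || true
  printf '%s\n' "$1"
}
```

Usage: `crontab -l 2>/dev/null | merge_cron_entry "$line" | crontab -`, where `$line` holds one of the entries above. Running it twice with the same line leaves a single copy.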

Monitoring logs

# Tail cleanup log
tail -f /var/log/vm-cleanup.log

# Tail health check log
tail -f /var/log/vm-health-check.log

# See all cron output via syslog
grep vm-cleanup /var/log/syslog | tail -20
grep vm-health /var/log/syslog | tail -20
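On hosts where /var/log/syslog is absent (journald only), the same output can usually be read with `journalctl -t`, because `logger -t` sets the syslog identifier that `-t` filters on. The wrapper below is a sketch; `show_tagged` is a hypothetical helper, and the third argument exists only so the syslog path can be overridden.

```shell
# Show the last N lines for a logger tag, from syslog if readable, else the journal.
# Usage: show_tagged TAG [N] [SYSLOG_PATH]
show_tagged() {
  tag="$1"; n="${2:-20}"; syslog="${3:-/var/log/syslog}"
  if [ -r "$syslog" ]; then
    grep -- "$tag" "$syslog" | tail -n "$n"
  else
    journalctl -t "$tag" -n "$n" --no-pager
  fi
}
```

Usage: `show_tagged vm-cleanup` or `show_tagged vm-health 50`.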

Telegram alerts

The health check script sends a Telegram message when it detects WARNING or CRITICAL. It reads credentials from $HERMES_HOME/.env (usually /root/.hermes/.env).

Required keys in that file:

TELEGRAM_BOT_TOKEN=<your-bot-token>
TELEGRAM_CHAT_ID=<your-chat-id>

Both are already set if the Hermes gateway is configured. Test with:

bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-health-check.sh --notify
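If no message arrives, it can help to check the credentials independently of the script. The sketch below verifies that both keys are present and sends a message through the Telegram Bot API `sendMessage` method with curl. The helper names are hypothetical, the parsing assumes plain unquoted `KEY=value` lines, and this is not necessarily how vm-health-check.sh sends its alerts.

```shell
# Print the value of KEY from an env file (last assignment wins).
# Assumes plain KEY=value lines without surrounding quotes.
env_value() {
  grep -E "^$1=" "$2" 2>/dev/null | tail -n 1 | cut -d= -f2-
}

# Return 0 only if both Telegram keys have non-empty values in the file.
has_telegram_keys() {
  [ -n "$(env_value TELEGRAM_BOT_TOKEN "$1")" ] &&
    [ -n "$(env_value TELEGRAM_CHAT_ID "$1")" ]
}

# Send a fixed test message; the env file defaults to $HERMES_HOME/.env.
send_telegram_test() {
  env_file="${1:-${HERMES_HOME:-/root/.hermes}/.env}"
  has_telegram_keys "$env_file" || {
    echo "missing Telegram keys in $env_file" >&2
    return 1
  }
  curl -fsS -X POST \
    "https://api.telegram.org/bot$(env_value TELEGRAM_BOT_TOKEN "$env_file")/sendMessage" \
    --data-urlencode "chat_id=$(env_value TELEGRAM_CHAT_ID "$env_file")" \
    --data-urlencode "text=vm-health-check test message" >/dev/null
}
```

Usage: `send_telegram_test` with no arguments reads the default env file; pass a path to test a different one.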

Health-check thresholds (from `vm-health-check.sh`)

Metric	WARNING	CRITICAL
Disk used `%`	> 55%	> 70%
Load average	> 4.0	> 8.0
RAM available	< 3 GB	< 1 GB
Swap used	> 1 GB	> 3 GB
Container restarts	> 10	> 50
Build cache	> 5 GB	> 20 GB

Thresholds are constants at the top of each script — easy to adjust.
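The way a measured value maps onto these thresholds and onto the script's 0/1/2 exit codes can be sketched as follows. `classify` is a hypothetical helper, not taken from vm-health-check.sh, and it compares integers only, so a fractional metric such as load average would need awk or similar.

```shell
# classify VALUE WARN CRIT [above|below]
# Prints OK, WARN or CRIT and returns 0, 1 or 2 respectively.
# "above" (default) alerts when VALUE exceeds a threshold (disk, swap, restarts);
# "below" alerts when VALUE drops under it (RAM available).
classify() {
  value="$1"; warn="$2"; crit="$3"; mode="${4:-above}"
  if [ "$mode" = "below" ]; then
    if [ "$value" -lt "$crit" ]; then echo CRIT; return 2; fi
    if [ "$value" -lt "$warn" ]; then echo WARN; return 1; fi
  else
    if [ "$value" -gt "$crit" ]; then echo CRIT; return 2; fi
    if [ "$value" -gt "$warn" ]; then echo WARN; return 1; fi
  fi
  echo OK
}
```

Usage, with disk usage read via GNU df: `classify "$(df --output=pcent / | tail -n 1 | tr -dc '0-9')" 55 70`. For RAM available in GB: `classify 2 3 1 below` prints WARN.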

What this schedule would have caught in the May 2026 incident

If this cron schedule had been running during the May 26 incident:

07:00 daily health check → container_loops CRIT: admin-web(50x) → Telegram alert sent within hours of the loop starting
03:00 daily build cache prune → would have kept build cache under 5 GB instead of growing to 84 GB
Monthly full cleanup → would have cleared the HOLD node_modules and old logs before they became a storage crisis
