bytelyst-devops-tools/scripts/VMs/HostingerVM/CRON_SETUP.md
Hermes VM 0a2d303f93 add HostingerVM health-check and cleanup scripts
- vm-health-check.sh: read-only checks for disk, load, RAM, swap,
  Docker containers (crash-loops + healthchecks), build cache, journal.
  Flags: --quiet, --json, --notify (Telegram). Exit 0/1/2 = OK/WARN/CRIT.

- vm-cleanup.sh: safe periodic cleanup.
  Default (weekly): build cache, journal, apt, npm, .next/cache.
  --full (monthly): adds docker system prune, pnpm store, old logs, HOLD cleanup.
  --dry-run, --install-cron, --uninstall-cron.
  Logs to /var/log/vm-cleanup.log.

Related: docs/hostinger-vm-maintenance.md, scripts/VMs/HostingerVM/CRON_SETUP.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00

# Hostinger VM — Cron Setup
Automated maintenance schedule for `srv1491630`.
Scripts: `vm-health-check.sh` (read-only) + `vm-cleanup.sh` (safe cleanup).

---
## Quick install
SSH into the VM and run:
```bash
bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron
```
This installs the full recommended schedule. To remove it:
```bash
bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --uninstall-cron
```
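To confirm what `--install-cron` actually wrote, the installed entries can be listed afterwards. The sketch below assumes the entries go into the invoking user's crontab (the `crontab -e` workflow this document describes) and that each entry references one of the two script names. `count_vm_jobs` is a hypothetical helper, not part of either script. If the installer writes the four entries listed under "What gets scheduled", the count would be 4, and 0 again after `--uninstall-cron`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count crontab lines on stdin that invoke either
# maintenance script. Comment lines (starting with #) are ignored.
count_vm_jobs() {
  grep -v '^[[:space:]]*#' | grep -cE 'vm-(cleanup|health-check)\.sh' || true
}

# Usage (run as the user whose crontab was modified):
#   crontab -l 2>/dev/null | count_vm_jobs
```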
---
## What gets scheduled
| Schedule | Time (UTC) | Command | What it does |
|---|---|---|---|
| Daily | 07:00 | `vm-health-check.sh` | Read-only check; sends Telegram alert on WARNING/CRITICAL |
| Daily | 03:00 | `vm-cleanup.sh` | Prune Docker build cache only (always safe) |
| Weekly | Sun 02:00 | `vm-cleanup.sh` | Standard cleanup (see below) |
| Monthly | 1st 01:00 | `vm-cleanup.sh --full` | Full cleanup (see below) |
---
## What each mode does
### Standard weekly cleanup (`vm-cleanup.sh`)
All steps are labelled **SAFE**: they remove only regenerable caches and old journal entries, and the crash-loop check changes nothing.
| Step | What's removed | Risk |
|---|---|---|
| Docker build cache | Layer cache from `docker build` runs | Zero — rebuilds just take longer next time |
| Crash loop check | Detection only, no changes | Zero |
| Journal vacuum | Old journal entries beyond 200MB / 7 days | Zero — logs are already captured in syslog |
| APT cache | `/var/cache/apt/archives/` | Zero — packages can be re-downloaded |
| NPM cache | `~/.npm/_cacache/` | Zero — cache is re-populated on next `npm install` |
| `.next/cache` | Webpack/babel/TSC build cache dirs | Zero — rebuilt automatically on next `next build` |
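
Before the first run it can help to see how much space these caches currently hold. The sketch below sizes two of the directories named in the table. `dir_size_kb` is a hypothetical helper, and the npm path assumes the cache lives under the invoking user's home directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the size of a directory in KiB,
# or 0 if the directory does not exist.
dir_size_kb() {
  local dir="$1"
  if [ -d "$dir" ]; then
    du -sk "$dir" 2>/dev/null | awk '{print $1}'
  else
    echo 0
  fi
}

# Cache locations taken from the table above.
for dir in /var/cache/apt/archives "$HOME/.npm/_cacache"; do
  printf '%10s KiB  %s\n' "$(dir_size_kb "$dir")" "$dir"
done
```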
### Monthly full cleanup (`vm-cleanup.sh --full`)
Adds these **CAREFUL** steps on top of the standard run:
| Step | What's removed | Risk |
|---|---|---|
| Docker system prune | Stopped containers, unused networks, dangling images | Low — does NOT remove images used by any container |
| pnpm store prune | Packages not referenced by any `node_modules` | Low — only removes truly orphaned packages |
| Old log files | `.gz` log rotations older than 30 days | Low — old compressed logs |
| HOLD node_modules | `node_modules` in `/opt/bytelyst/HOLD` archived projects | Low — code intact, can reinstall with `pnpm install` |
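
Because the `--full` steps delete more than caches, a preview run is a sensible first step. The script also accepts `--dry-run`; the sketch below assumes that flag can be combined with `--full`. `preview_full_cleanup` is a hypothetical wrapper, not part of the script.

```bash
#!/usr/bin/env bash
# Path as used throughout this document.
SCRIPT="/opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh"

# Hypothetical wrapper: run the cleanup script in preview mode.
# Assumes --full and --dry-run can be passed together.
preview_full_cleanup() {
  local script="${1:-$SCRIPT}"
  if [ ! -f "$script" ]; then
    echo "cleanup script not found: $script" >&2
    return 127
  fi
  bash "$script" --full --dry-run
}

# Usage on the VM:
#   preview_full_cleanup
```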
### Never touched (by design)
- `/opt/bytelyst/*/node_modules` (active repos)
- `/opt/bytelyst/*/src`, `/app`, `/backend`, `/web` source code
- `.next/standalone` (production Next.js builds)
- Docker images used by currently configured containers
- `/usr/local/lib/hermes-agent/`
- `/usr/share/ollama/` (models)
- `/swapfile`
- Any database volumes
---
## Manual crontab (if you prefer not to use `--install-cron`)
```
# Health check daily 07:00 UTC
0 7 * * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-health-check.sh --quiet --notify 2>&1 | logger -t vm-health
# Build cache prune daily 03:00 UTC
0 3 * * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --quiet 2>&1 | logger -t vm-cleanup
# Standard weekly cleanup Sunday 02:00 UTC
0 2 * * 0 bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --quiet 2>&1 | logger -t vm-cleanup
# Full monthly cleanup 1st of month 01:00 UTC
0 1 1 * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --full --quiet 2>&1 | logger -t vm-cleanup
```
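The same entries can also be added without opening an editor. A minimal sketch, assuming the target is the current user's crontab: `add_cron_line` is a hypothetical helper that appends an entry only if an identical line is not already present, so re-running it does not create duplicates.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read crontab text on stdin and print it back,
# appending "$1" unless an identical line already exists.
add_cron_line() {
  local line="$1" current
  current="$(cat)"
  if printf '%s\n' "$current" | grep -Fxq -- "$line"; then
    printf '%s\n' "$current"
  else
    if [ -n "$current" ]; then printf '%s\n' "$current"; fi
    printf '%s\n' "$line"
  fi
}

# Usage: feed the current crontab through the helper and install the result.
# `crontab -l` prints nothing on stdout when no crontab exists yet.
#   ENTRY='0 7 * * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-health-check.sh --quiet --notify 2>&1 | logger -t vm-health'
#   crontab -l 2>/dev/null | add_cron_line "$ENTRY" | crontab -
```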
Edit with: `crontab -e`

---
## Monitoring logs
```bash
# Tail cleanup log
tail -f /var/log/vm-cleanup.log
# Tail health check log
tail -f /var/log/vm-health-check.log
# See all cron output via syslog
grep vm-cleanup /var/log/syslog | tail -20
grep vm-health /var/log/syslog | tail -20
```
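A plain `grep vm-cleanup` can also match the lines cron itself logs when it starts a job, since those include the command text. A minimal sketch of a stricter filter, assuming the usual syslog layout where the `logger -t` tag is followed by an optional `[pid]` and a colon. `filter_tag` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep only syslog lines on stdin whose tag is exactly
# "$1", i.e. "<tag>:" or "<tag>[<pid>]:" preceded by whitespace.
filter_tag() {
  grep -E "[[:space:]]$1(\[[0-9]+\])?:"
}

# Usage:
#   filter_tag vm-cleanup < /var/log/syslog | tail -20
# On systemd hosts the journal can be queried by tag directly:
#   journalctl -t vm-health --since yesterday
```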
---
## Telegram alerts
When run with `--notify`, the health check script sends a Telegram message if it detects a WARNING or CRITICAL condition.
It reads credentials from `$HERMES_HOME/.env` (usually `/root/.hermes/.env`).
Required keys in that file:
```
TELEGRAM_BOT_TOKEN=<your-bot-token>
TELEGRAM_CHAT_ID=<your-chat-id>
```
Both are already set if the Hermes gateway is configured. Test with:
```bash
bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-health-check.sh --notify
```
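If the test message does not arrive, a first check is whether both keys are present. A minimal sketch, assuming the file uses plain `KEY=value` lines (no `export` prefix, no spaces around `=`). `missing_telegram_keys` is a hypothetical helper that prints key names only, never the values.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each required key that is absent or empty
# in the given env file. Values are never printed.
missing_telegram_keys() {
  local env_file="$1" key
  for key in TELEGRAM_BOT_TOKEN TELEGRAM_CHAT_ID; do
    grep -Eq "^${key}=.+" "$env_file" 2>/dev/null || echo "$key"
  done
}

# Falls back to the usual location when HERMES_HOME is unset.
# No output means both keys are set.
missing_telegram_keys "${HERMES_HOME:-/root/.hermes}/.env"
```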
---
## Health-check thresholds (from `vm-health-check.sh`)
| Metric | WARNING | CRITICAL |
|---|---|---|
| Disk used `%` | > 55% | > 70% |
| Load average | > 4.0 | > 8.0 |
| RAM available | < 3 GB | < 1 GB |
| Swap used | > 1 GB | > 3 GB |
| Container restarts | > 10 | > 50 |
| Build cache | > 5 GB | > 20 GB |
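
The comparisons in the table are strict, so a value exactly at a threshold stays in the lower band. A minimal sketch of that logic for the disk row, returning the 0/1/2 exit codes the health check uses for OK/WARN/CRIT. `classify` and `disk_used_pct` are hypothetical names, and this is not the script's actual implementation.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare an integer against WARN/CRIT thresholds.
# Prints the level and returns 0 (OK), 1 (WARN) or 2 (CRIT).
classify() {
  local value="$1" warn="$2" crit="$3"
  if [ "$value" -gt "$crit" ]; then
    echo "CRIT"; return 2
  elif [ "$value" -gt "$warn" ]; then
    echo "WARN"; return 1
  fi
  echo "OK"
}

# Hypothetical helper: used percentage of the root filesystem, without "%".
disk_used_pct() {
  df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Disk thresholds from the table: WARNING > 55, CRITICAL > 70.
status=0
level="$(classify "$(disk_used_pct)" 55 70)" || status=$?
echo "disk: ${level} (exit code ${status})"
```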

Thresholds are constants at the top of each script, so they are easy to adjust.

---
## What this schedule would have caught in the May 2026 incident
If this schedule had been running during the May 26 incident:
- **07:00 daily health check** → `container_loops CRIT: admin-web(50x)` → Telegram alert sent within hours of the loop starting
- **03:00 daily build cache prune** → would have kept the build cache under 5 GB instead of letting it grow to 84 GB
- **Monthly full cleanup** → would have cleared the HOLD node_modules and old logs before they became a storage crisis