VM Observability & Control Roadmap — v2
Status: Draft — Pending Approval
Last updated: 2026-05-27
Scope: srv1491630 (Hostinger VM) + DevOps Dashboard (devops.bytelyst.com)
Reviewed: Yes — v1 audited against live system; 11 issues corrected (see change log at bottom)
Current State Snapshot
| Layer | What exists today | Verified gap |
|---|---|---|
| Health check | vm-health-check.sh: disk, load, RAM, swap, Docker | No steal time metric; no per-container detail |
| Cleanup | vm-cleanup.sh: build cache, images, logs, apt, pnpm, HOLD | Runs silently; no structured outcome record |
| Cron | 4 scheduled jobs (daily / weekly / monthly) | No execution history; no "last ran / freed X" |
| Dashboard /vm | Health check + cleanup log tail + trigger button | VM module is non-functional: container has no host volume mounts; all backend calls to host scripts fail silently |
| Dashboard /system | CPU, RAM, disk, Docker stats | Missing steal %, container detail, unhealthy drill-down |
| Prometheus stack | Prometheus + cAdvisor + node-exporter + Loki, ~2 weeks history | No Grafana; trend data exists but no UI to query it |
| Alerting | Telegram on WARN/CRIT at 07:00 UTC | No steal time alert; no weekly digest; no cron failure alert |
| Container restart | 38/39 containers have unless-stopped | unless-stopped restarts on process exit only and does NOT react to health check failures. 7 containers running but unhealthy (process alive, health endpoint dead) |
| LLMs (Ollama) | 9 models on disk; qwen2.5-coder:1.5b currently loaded (1.1 GB, 100% CPU) | No RAM impact warning before loading; no dashboard visibility |
| I/O anomaly | invttrdg-backend writing ~22 GB/day to block storage | Unexplained; no alert, no investigation |
Architectural Decisions (settle these before building)
A. Trend chart data source
Options:
- ✅ Query existing Prometheus from DevOps dashboard (recommended) — data already there, no new store needed. Add Prometheus query endpoints to dashboard backend, render with a chart library.
- ➕ Add Grafana container alongside Prometheus — purpose-built for metrics UI, out-of-box dashboards. Extra 80–150 MB RAM.
- ❌ New Cosmos DB vm-metrics container — redundant with Prometheus; wrong tool for time-series.
Recommendation: Query Prometheus from the dashboard for Phase 4.2 charts (keeps everything in one UI). Add Grafana in Phase 5 only if dashboard charts feel limiting.
B. Dashboard → host script execution
The devops-backend container currently has no host volume mounts and no sudoers entry. Phase 3.2 "Run cleanup from dashboard" requires one of:
- ✅ Mount host script + Docker socket into devops-backend (simplest, lowest risk)
- ➕ Thin host-side agent (systemd socket-activated, receives commands via Unix socket)
- ❌ SSH from container to host — unnecessary complexity
Recommendation: Mount /opt/bytelyst/learning_ai_devops_tools/scripts read-only + /var/log for log reading into devops-backend. Add sudoers entry for the cleanup script only.
Phase 0 — Fix Broken Foundations (Day 1–2, prerequisite for all UI phases)
These are not new features — they are bugs in the current system.
0.1 Fix devops-backend VM module (host volume mounts)
Problem: GET /api/vm/health, GET /api/vm/cleanup-log, POST /api/vm/cleanup all fail because the container has no access to the host filesystem.
Fix: Update docker-compose.yml for devops-backend:
```yaml
volumes:
  - /opt/bytelyst/learning_ai_devops_tools/scripts:/scripts:ro
  - /var/log/vm-cleanup.log:/var/log/vm-cleanup.log:ro
  - /var/log/vm-health-check.log:/var/log/vm-health-check.log:ro
```
Update repository.ts to use /scripts/VMs/HostingerVM/vm-cleanup.sh path, or use env var VM_SCRIPTS_PATH.
Add sudoers entry: nobody ALL=(ALL) NOPASSWD: /scripts/VMs/HostingerVM/vm-cleanup.sh
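Risk: Low. As written above, scripts and logs are all mounted read-only. A cleanup run started from inside the container cannot append to /var/log/vm-cleanup.log through a :ro mount, so that one mount must be rw if the script logs to that path.
Validation: Run curl http://localhost:4004/api/vm/health and confirm a JSON response.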
0.2 Add logrotate entry for new log files
Problem: /var/log/vm-cleanup.log and /var/log/vm-health-check.log have no rotation policy. Will grow unbounded.
Fix: Create /etc/logrotate.d/bytelyst-vm:
```
/var/log/vm-cleanup.log /var/log/vm-health-check.log /var/log/docker-watchdog.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    create 0644 root root
}
```
0.3 Investigate invttrdg-backend I/O anomaly
Problem: 22.2 GB of block writes in 13 hours (~1.7 GB/hr). Sustained around the clock, that is about 41 GB/day, which would fill the 123 GB of free disk in ~3 days of heavy trading activity.
Fix path: Check what is being written (WAL logs? tick data? verbose debug logging?). The likely cause is a log level or persistence config issue. Add a disk usage alert specific to this container.
Risk of not fixing: Disk fills up, all services go down.
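The extrapolation in 0.3 can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the write rate stays constant and that the writes land on the same volume as the 123 GB of free space; the function name project_fill is hypothetical.

```shell
#!/bin/sh
# Extrapolate a sustained write rate and the time until free space is consumed.
# Arguments: GB written, observation window in hours, free disk in GB.
project_fill() {
  LC_ALL=C awk -v written="$1" -v hours="$2" -v free="$3" 'BEGIN {
    per_hour = written / hours
    per_day  = per_hour * 24
    printf "%.1f GB/hr, %.0f GB/day, %.1f days to fill\n", per_hour, per_day, free / per_day
  }'
}

# Figures from section 0.3
project_fill 22.2 13 123
```

If the service only writes during trading hours, the per-day figure is an upper bound, which is why the root-cause check matters more than the projection.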
Phase 1 — Observability Gaps (Week 1)
Read-only additions to existing scripts and the /vm dashboard page.
1.1 Cron Job Execution History Panel
Where: Dashboard /vm page — new "Maintenance Schedule" card
What: Add GET /api/vm/cron-status endpoint that:
- Parses crontab entries for the 4 managed jobs (look for the # bytelyst-vm-maintenance block)
- Parses /var/log/vm-cleanup.log into structured run objects: { timestamp, mode, diskBefore, diskAfter, freedMB, steps[], success }
- Calculates next run from cron expression
UI: Table with columns job name | schedule | last run | freed | status | next run. Expandable row shows step-by-step log.
Dependency: Requires Phase 0.1 (volume mount for log access).
1.2 CPU Steal Time Metric
Where: vm-health-check.sh + dashboard /vm health cards
What: Sample /proc/stat twice 1 second apart, compute steal %:
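```bash
# Print "<steal> <total>" jiffies from the aggregate cpu line of /proc/stat.
# Fields: $2 user, $3 nice, $4 system, $5 idle, $6 iowait, $7 irq, $8 softirq, $9 steal.
# Guest time ($10) is already included in user, so it is left out of the total.
read_steal() { awk '/^cpu /{print $9, $2+$3+$4+$5+$6+$7+$8+$9}' /proc/stat; }
s1=$(read_steal); sleep 1; s2=$(read_steal)
steal_pct=$(awk -v s1="$s1" -v s2="$s2" 'BEGIN{
  split(s1,a," "); split(s2,b," ")
  delta_steal=b[1]-a[1]; delta_total=b[2]-a[2]
  # Guard against a zero interval to avoid division by zero
  if (delta_total <= 0) { print "0.0"; exit }
  printf "%.1f", (delta_steal/delta_total)*100
}')
```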
Thresholds: > 5% = WARN, > 15% = CRIT.
Why: Currently at 8.2% — silently degrading every API response and LLM inference call.
Dependency: None. Self-contained script change.
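The thresholds above map onto a small classifier. A minimal sketch, assuming the health check passes the computed percentage as a plain decimal string; the function name classify_steal and the OK label are hypothetical, while WARN and CRIT follow the document.

```shell
#!/bin/sh
# Map a steal percentage to a status using the thresholds from 1.2:
# > 15% = CRIT, > 5% = WARN, otherwise OK.
classify_steal() {
  LC_ALL=C awk -v pct="$1" 'BEGIN {
    if (pct > 15)     print "CRIT"
    else if (pct > 5) print "WARN"
    else              print "OK"
  }'
}

# The value quoted in this roadmap
classify_steal 8.2
```

Both comparisons are strict, so exactly 5% stays OK and exactly 15% stays WARN.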
1.3 Unhealthy Container Detail Panel
Where: Dashboard /vm — expand container health card
What: New GET /api/vm/containers/unhealthy endpoint:
- Container name, unhealthy since (parse docker inspect .State.Health.Log[0].Start)
- Last 3 health check log lines
- Current restart count
UI: Expandable per-container row with one-click restart button (calls existing or new POST /api/vm/containers/:name/restart).
Dependency: Requires Phase 0.1.
1.4 Swap Pressure Indicator
Where: vm-health-check.sh + dashboard
What: Add SwapCached as secondary metric. High SwapCached relative to SwapUsed = system was recently under pressure even if swap looks ok now. Surface in daily Telegram alert even when overall = WARN not CRIT.
Threshold change: Current SWAP_USED_WARN_GB=1 triggers today (1.4 GB in use). Consider raising to 1.5 to reduce noise while keeping the SwapCached > 200MB as an early warning signal.
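A sketch of how the SwapCached signal could be read, assuming swap used is computed as SwapTotal minus SwapFree and that the 200 MB early-warning level above is applied as-is. /proc/meminfo reports these fields in kB. The function name swap_pressure, the EARLY_WARN label, and the output format are hypothetical.

```shell
#!/bin/sh
# Summarise swap usage and SwapCached from a meminfo-style file.
# $1 = path to read (defaults to /proc/meminfo); values in the file are in kB.
swap_pressure() {
  meminfo="${1:-/proc/meminfo}"
  LC_ALL=C awk '
    /^SwapTotal:/  { total  = $2 }
    /^SwapFree:/   { free   = $2 }
    /^SwapCached:/ { cached = $2 }
    END {
      used_mb   = (total - free) / 1024
      cached_mb = cached / 1024
      # Early warning when SwapCached exceeds 200 MB (threshold from 1.4)
      status = (cached_mb > 200) ? "EARLY_WARN" : "OK"
      printf "swap_used_mb=%.0f swap_cached_mb=%.0f status=%s\n", used_mb, cached_mb, status
    }' "$meminfo"
}
```

Taking the file path as an argument keeps the parsing testable against a saved snapshot instead of the live host.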
Phase 2 — Self-Healing Automation (Week 2)
Scripts that fix known recurring issues automatically.
2.1 Health-Check-Aware Container Watchdog
Why the existing policy isn't enough: 38 of the 39 containers already have unless-stopped. That policy restarts a container only when its process exits. When the web server process is alive but the health check endpoint returns Connection refused, Docker marks the container unhealthy but does not restart it, so it keeps running in a broken state indefinitely.
Fix: Systemd timer docker-health-watchdog.timer (runs every 10 minutes):
```bash
#!/bin/bash
# /usr/local/bin/docker-health-watchdog.sh
UNHEALTHY=$(docker ps --filter health=unhealthy --format '{{.Names}}')
for container in $UNHEALTHY; do
  # Only restart if the 3 most recent health probes all failed.
  # This equals 30 min only when the container's probe interval is 10 min.
  failures=$(docker inspect "$container" | \
    python3 -c "import json,sys; h=json.load(sys.stdin)[0]['State']['Health']['Log']; \
      print(sum(1 for l in h[-3:] if l['ExitCode']!=0))")
  if [[ "$failures" -eq 3 ]]; then
    docker restart "$container"
    echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Auto-restarted: $container (unhealthy 3x)" \
      >> /var/log/docker-watchdog.log
    # Telegram notify (reads token from $HERMES_HOME/.env)
  fi
done
```
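Safety: Only restarts when the three most recent health probes all failed, logs every restart, and targets only health-check failures, not intentionally stopped containers. Caveat: the probe interval comes from each container's HEALTHCHECK settings, not from the 10-minute timer, so three failed probes equal 30 minutes only when that interval is 10 minutes. Docker already requires consecutive failures (the retries setting, 3 by default) before it marks a container unhealthy, so a true 30-minute cooldown needs state kept between watchdog runs.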
Rollback: systemctl disable docker-health-watchdog.timer
2.2 Fix hermes-root-backup Git Diverge
Current failure: Git fast-forward fails every ~10 minutes since 16:25 today (~30+ silent failures).
Fix: Patch the backup script to handle divergence gracefully:
```bash
if ! git pull --ff-only 2>/dev/null; then
  # Log the diverge
  git log --oneline -3 HEAD > /tmp/hermes-diverge-before.txt
  git log --oneline -3 origin/main >> /tmp/hermes-diverge-before.txt
  # Try rebase first (preserves local commits if intentional)
  if ! git pull --rebase 2>/dev/null; then
    # A conflicted rebase leaves the repo mid-rebase; abort it before resetting
    git rebase --abort 2>/dev/null || true
    # If rebase fails, reset to origin (backup is the source of truth)
    git reset --hard origin/main
    notify_telegram "⚠️ hermes-root-backup: diverged branch reset to origin/main"
  fi
fi
```
Risk: git reset --hard loses any local-only commits on the backup repo. Acceptable here because the backup script's job is to push to origin — local commits shouldn't exist. Add a pre-check: if local commits exist that aren't on origin, alert instead of resetting.
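The pre-check described in the risk note could look like the sketch below. It assumes the remote branch is origin/main, as in the script above, and that a fetch has already run. The function names local_only_commits and diverge_action are hypothetical; notify_telegram is the helper the original script already calls.

```shell
#!/bin/sh
# Count commits reachable from HEAD but not from origin/main (local-only work).
local_only_commits() {
  git rev-list --count origin/main..HEAD
}

# Map that count to an action: never reset while local-only commits exist.
# $1 = number of local-only commits.
diverge_action() {
  if [ "$1" -gt 0 ]; then
    echo "alert"
  else
    echo "reset"
  fi
}

# Intended use inside the backup script:
# case "$(diverge_action "$(local_only_commits)")" in
#   alert) notify_telegram "hermes-root-backup: local-only commits found, manual review needed" ;;
#   reset) git reset --hard origin/main ;;
# esac
```

Keeping the decision separate from the git call makes the policy easy to test without a repository.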
2.3 Container Memory Limits
Validated against actual RSS data (Phase 2 data collected 2026-05-27):
| Category | Current RSS | Proposed Limit | Reservation | Notes |
|---|---|---|---|---|
| Next.js web frontends | 17–37 MB | 256m | 64m | 7× headroom for webpack spikes |
| Node/Fastify backends | 20–67 MB | 384m | 128m | Allows burst for LLM calls |
| invttrdg-backend | 107 MB | 512m | 256m | High I/O service; watch after 0.3 |
| trading-backend | 92 MB | 512m | 256m | Active algo trading service |
| platform-service | 66 MB | 384m | 128m | Shared auth/platform layer |
| CosmosDB emulator | 145 MB | 1g | 512m | Can spike on write bursts |
| Prometheus | 57 MB | 256m | 128m | Stable but grows with series |
| Loki | 53 MB | 256m | 128m | Log ingestion can spike |
| Caddy | 27 MB | 128m | 64m | Proxy, very stable |
| Valkey (Redis) | 3.5 MB | 128m | 32m | Cache, tiny |
| Gitea | 79 MB | 512m | 256m | Git operations can spike |
| Ollama | 130 MB idle | No limit | n/a | Must accommodate model load (up to 8 GB) |
Rollout strategy:
- Run docker stats baseline for 24h to confirm no container spikes beyond proposed limits
- Apply limits per stack in docker-compose files (not docker update; recreate on next deploy)
- Monitor for OOMKill events: dmesg | grep -i oom for 48h after rollout
- Never set limits on Ollama: model loading is unpredictable and limits would kill inference
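To verify the rollout, the containers that still run without a limit can be listed from docker inspect, where a HostConfig.Memory value of 0 means no limit is set. A minimal sketch; the function name unlimited_containers is hypothetical.

```shell
#!/bin/sh
# Read "<name> <memory_limit_bytes>" lines on stdin and print the names
# of containers whose limit is 0 (no memory limit configured).
unlimited_containers() {
  awk '$2 == 0 { sub(/^\//, "", $1); print $1 }'
}

# On the host, feed it from docker inspect (.Name carries a leading slash):
# docker ps -q | xargs docker inspect --format '{{.Name}} {{.HostConfig.Memory}}' | unlimited_containers
```

After the rollout the expected output is a single line, ollama, since that container is deliberately left unlimited.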
Phase 3 — Dashboard Control Plane (Weeks 3–4)
Prerequisite for all Phase 3 items: Phase 0.1 (host volume mount) must be complete.
3.1 VM Score Card (Automated)
Where: Dashboard /vm — top summary widget, auto-refreshes every 5 min
Scoring algorithm (0–100):
```
CPU efficiency:      20 pts (steal < 2% = 20, < 5% = 15, < 10% = 10, ≥ 10% = 5)
Memory pressure:     20 pts (available > 6 GB = 20, > 3 GB = 15, > 1 GB = 5, else = 0)
Disk health:         15 pts (< 40% used = 15, < 55% = 10, < 70% = 5, else = 0)
Service health:      20 pts (0 unhealthy = 20, 1–2 = 15, 3–5 = 8, 6+ = 2)
Maintenance hygiene: 15 pts (last cleanup < 7 days + freed > 0 = 15, < 30 days = 8, else = 0)
LLM readiness:       10 pts (> 8 GB free RAM = 10, > 4 GB = 7, > 2 GB = 4, else = 1)
```
Score = sum. Display as gauge. Each dimension clickable to drill into its data.
Dependencies: Phase 1.2 (steal time in health check output).
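The scoring rules can be sketched as one function. The function name vm_score and the argument order are hypothetical. The sketch also makes one interpretive assumption: a cleanup within 7 days that freed nothing falls through to the "< 30 days = 8" tier.

```shell
#!/bin/sh
# Compute the 0-100 VM score from the six dimensions in 3.1.
# Args: steal_pct avail_ram_gb disk_used_pct unhealthy_count
#       days_since_cleanup freed_mb free_ram_gb
vm_score() {
  LC_ALL=C awk -v steal="$1" -v avail="$2" -v disk="$3" -v bad="$4" \
      -v days="$5" -v freed="$6" -v free="$7" 'BEGIN {
    cpu = (steal < 2) ? 20 : ((steal < 5) ? 15 : ((steal < 10) ? 10 : 5))
    mem = (avail > 6) ? 20 : ((avail > 3) ? 15 : ((avail > 1) ? 5 : 0))
    dsk = (disk < 40) ? 15 : ((disk < 55) ? 10 : ((disk < 70) ? 5 : 0))
    svc = (bad == 0)  ? 20 : ((bad <= 2)  ? 15 : ((bad <= 5)   ? 8 : 2))
    mnt = (days < 7 && freed > 0) ? 15 : ((days < 30) ? 8 : 0)
    llm = (free > 8)  ? 10 : ((free > 4)  ? 7  : ((free > 2)   ? 4 : 1))
    print cpu + mem + dsk + svc + mnt + llm
  }'
}

# Figures quoted in this roadmap (8.2% steal, 10 GB RAM, 37% disk, 7 unhealthy,
# 257 MB freed); "3 days since cleanup" is an assumed value for illustration.
vm_score 8.2 10 37 7 3 257 10
```

With those inputs the dimensions come out as 10 + 20 + 15 + 2 + 15 + 10, so the unhealthy containers and steal time are what pull the score down.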
3.2 Cron Schedule & History Panel
Where: Dashboard /vm — "Maintenance" tab
What:
- Live table: 4 cron jobs × (last run, result, freed MB, next scheduled, "Run now" button)
- Last 30 cleanup runs as a sparkline: date vs MB freed
- One-click trigger for weekly / monthly / dry-run
Backend endpoint: GET /api/vm/cron-status — parse structured log + crontab
Dependency: Phase 0.1 (volume mount), Phase 1.1 (structured log parser).
3.3 Container Management Panel
Where: Dashboard /vm — "Containers" tab
What:
- Full list: name, stack, health status, uptime, CPU %, RAM, restart count
- Filter chips: All | Unhealthy | No Memory Limit | By stack
- Per-container: Restart, View last 50 log lines, Show health check history
- Bulk: "Restart all unhealthy" with confirmation modal
New backend endpoints: GET /api/vm/containers, POST /api/vm/containers/:name/restart, GET /api/vm/containers/:name/logs
3.4 Ollama / LLM Panel
Where: Dashboard /vm — "Models" tab
What:
- Models list: name, size, last used timestamp
- Currently loaded (from ollama ps): model name, RAM used, CPU %, expires in
- RAM visualisation bar: [used by system] [model if loaded] [free]
- Warning banner: "Loading llama3.2-vision (7.8 GB) will leave ~1.2 GB free — swap pressure likely"
- Load / Unload model buttons
Backend endpoints: GET /api/vm/ollama/models, POST /api/vm/ollama/load, DELETE /api/vm/ollama/unload
Note: qwen2.5-coder:1.5b is currently loaded — confirmed via ollama ps.
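The warning banner boils down to one subtraction and a threshold. A minimal sketch, assuming the backend knows the model size and the currently available RAM in GB. The function name load_warning and the 2 GB default for "swap pressure likely" are hypothetical; the roadmap does not fix that cut-off.

```shell
#!/bin/sh
# Build the banner text shown before loading a model.
# Args: model_name size_gb available_ram_gb [min_free_gb, default 2]
load_warning() {
  LC_ALL=C awk -v m="$1" -v s="$2" -v a="$3" -v min="${4:-2}" 'BEGIN {
    left = a - s
    if (left < min)
      printf "Loading %s (%.1f GB) will leave ~%.1f GB free: swap pressure likely\n", m, s, left
    else
      printf "Loading %s (%.1f GB) will leave ~%.1f GB free\n", m, s, left
  }'
}

# Example matching the banner text in 3.4 (9.0 GB available is implied by it)
load_warning llama3.2-vision 7.8 9.0
```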
Phase 4 — Trend Analysis (Weeks 5–6)
Key architectural note: Prometheus + cAdvisor + node-exporter are already running and storing ~2 weeks of metrics history including steal time, disk I/O, memory, container CPU/RAM. Do NOT create a separate Cosmos DB store. Query Prometheus directly.
4.1 Prometheus Query Endpoints in Dashboard Backend
Where: New GET /api/vm/metrics/trend endpoint group
What: Proxy queries to internal Prometheus (http://prometheus:9090 within Docker network):
```
/api/vm/metrics/trend/disk?range=7d        → disk usage % over time
/api/vm/metrics/trend/memory?range=7d      → available RAM + swap used over time
/api/vm/metrics/trend/steal?range=7d       → CPU steal % over time (once 1.2 is deployed)
/api/vm/metrics/trend/containers?range=7d  → unhealthy container count over time
/api/vm/metrics/trend/io?range=7d          → block write rate (flag invttrdg spikes)
```
Note: devops-backend is on dashboard_default network, Prometheus is on learning_ai_common_plat_default. Either add devops-backend to Prometheus network, or expose Prometheus on a host port (internal only, not via Caddy).
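Before wiring the proxy into the backend, the underlying queries can be tried from a shell that can reach Prometheus. This sketch uses the Prometheus HTTP API path /api/v1/query_range and standard node-exporter metric names. The function names, the PROM_URL variable, the fixed one-hour step, and the "/" mountpoint are assumptions for illustration.

```shell
#!/bin/sh
PROM_URL="${PROM_URL:-http://prometheus:9090}"

# Convert a range such as "7d" or "12h" into seconds.
range_to_seconds() {
  n="${1%[dh]}"
  case "$n" in
    ''|*[!0-9]*) echo "unsupported range: $1" >&2; return 1 ;;
  esac
  case "$1" in
    *d) echo $(( n * 86400 )) ;;
    *h) echo $(( n * 3600 )) ;;
    *)  echo "unsupported range: $1" >&2; return 1 ;;
  esac
}

# PromQL for two of the trend endpoints.
promql_for() {
  case "$1" in
    steal) echo 'avg(rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100' ;;
    disk)  echo '100 - 100 * node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}' ;;
    *)     return 1 ;;
  esac
}

# Run a range query, e.g. trend_query steal 7d
trend_query() {
  secs=$(range_to_seconds "$2") || return 1
  query=$(promql_for "$1") || return 1
  now=$(date +%s)
  curl -sG "$PROM_URL/api/v1/query_range" \
    --data-urlencode "query=$query" \
    --data-urlencode "start=$(( now - secs ))" \
    --data-urlencode "end=$now" \
    --data-urlencode "step=3600"
}
```

The same range parsing and query strings can then be carried over to the dashboard backend once the network question in the note above is settled.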
4.2 Trend Charts on Dashboard
Where: Dashboard /vm — collapsible "Trends" section below score card
What (7-day / 30-day toggle):
- Disk % over time + linear projection line → "estimated to hit 55% warning in X days"
- Swap used over time (detect slow memory leak)
- CPU steal % over time (detect host degradation trend)
- Unhealthy container count per day
- Block write rate: flag days with invttrdg-backend anomalies
Library recommendation: Recharts (already likely in the Next.js project) or lightweight Chart.js wrapper.
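The linear projection behind "estimated to hit 55% warning in X days" can be sketched as an ordinary least-squares fit over daily disk readings. The function name days_until_threshold and the input format (one "day_index disk_pct" pair per line) are hypothetical; in the dashboard the points would come from the Prometheus trend endpoint.

```shell
#!/bin/sh
# Fit a least-squares line through "<day_index> <disk_pct>" pairs on stdin and
# print how many days after the last sample the line reaches the threshold.
# $1 = threshold percent (for example 55).
days_until_threshold() {
  LC_ALL=C awk -v target="$1" '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2; last_x = $1 }
    END {
      denom = n * sxx - sx * sx
      if (n < 2 || denom == 0) { print "insufficient data"; exit }
      slope = (n * sxy - sx * sy) / denom
      icpt  = (sy - slope * sx) / n
      if (slope <= 0) { print "not growing"; exit }
      printf "%.1f\n", (target - icpt) / slope - last_x
    }'
}

# Four daily readings growing by one point per day
printf '0 37\n1 38\n2 39\n3 40\n' | days_until_threshold 55
```

A negative result means the fitted line is already past the threshold; a straight-line fit is only a rough guide when usage is bursty.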
4.3 Weekly Digest (Telegram)
Where: New cron job — Monday 08:00 UTC — vm-cleanup.sh --weekly-digest
What:
```
📊 Weekly VM Digest: srv1491630
Week ending 2026-06-01
🖥 CPU Steal: 8.2% avg ⚠️ (host contention; escalate if > 10%)
💾 Disk: 37% (freed 257 MB this week via cleanup)
🧠 RAM: 10 GB free avg ✓
🔄 Swap peak: 1.4 GB ⚠️
🐳 Containers: 7 unhealthy (action required)
🤖 LLMs run: qwen2.5-coder:1.5b (3 sessions this week)
🧹 Cleanups: 1 standard, 0 full
📅 Next full: 2026-06-01
Top action: Restart 7 unhealthy web containers
```
Dependency: Phase 4.1 (needs Prometheus for weekly averages), Phase 1.2 (steal metric must be in Prometheus).
Phase 5 — Advanced / Backlog
| Item | Description | Trigger condition |
|---|---|---|
| Add Grafana | Container alongside Prometheus for richer dashboards; pre-built node-exporter dashboards available | Phase 4 charts feel limited |
| Deployment ↔ health correlation | Mark deploys on trend charts; correlate health dips to specific releases | After Phase 4.2 exists |
| Multi-VM support | Extend all above to aggregate across VMs | Adding second VM |
| invttrdg-backend write audit | Persistent investigation: what generates 22 GB/day of block writes? Add per-container I/O alert | After Phase 0.3 |
| Chaos validation | Monthly: watchdog stops a test container, verify restart within 10 min, report result | After Phase 2.1 |
| Ollama GPU readiness check | Detect GPU availability, surface in LLM panel as "GPU: none — inference will be slow" | Before adding large models |
| Container image freshness | Alert when container is running image > 30 days old (not rebuilt) | When deploy pipeline matures |
| Cost attribution | Tag containers by product (trading, notes, clock...) — RAM/CPU cost per product | When billing needed |
| Backup health tracking | hermes-root-backup and uma-hermes-backup results surfaced in dashboard | After Phase 2.2 |
Implementation Order
```
Day 1–2    Phase 0 ── Fix broken foundations (VM module, logrotate, I/O investigation)
                      ⚠️ MUST complete before any Phase 3 dashboard work
Week 1     Phase 1 ── Observability (steal metric, cron history, unhealthy detail, swap)
                      1.2 (steal) → unblocks 3.1 (score card)
                      1.1 (cron log format) → unblocks 3.2 (cron panel)
Week 2     Phase 2 ── Self-healing (watchdog, hermes-backup fix, memory limits)
                      2.1 requires: logrotate entry (Phase 0.2)
                      2.3 requires: 24h baseline observation first
Weeks 3–4  Phase 3 ── Dashboard control (score card, cron panel, containers, Ollama)
                      All require: Phase 0.1 (host volume mount)
                      3.1 requires: Phase 1.2 deployed
                      3.2 requires: Phase 1.1 deployed
Weeks 5–6  Phase 4 ── Trend analysis (Prometheus queries, charts, weekly digest)
                      4.1 requires: devops-backend on same Docker network as Prometheus
                      4.2 requires: Phase 4.1
                      4.3 requires: Phase 4.1 + Phase 1.2
Backlog    Phase 5 ── Advanced items, trigger-based
```
Success Criteria (how to know each phase is done)
| Phase | Done when… |
|---|---|
| 0.1 | curl localhost:4004/api/vm/health returns valid JSON with disk/load/swap data |
| 0.2 | logrotate -d /etc/logrotate.d/bytelyst-vm exits 0; logs present in /var/log |
| 0.3 | Root cause of 22 GB/day writes identified + alert configured |
| 1.1 | Dashboard /vm shows "Last cleanup: [date], freed [MB]" parsed from log |
| 1.2 | vm-health-check.sh includes steal % in output; Telegram sends steal alert at > 5% |
| 1.3 | Dashboard shows each unhealthy container's last health log + restart button works |
| 2.1 | Watchdog restarts an intentionally-broken test container within 30 min |
| 2.2 | hermes-root-backup runs 10 times without failure after fix deployed |
| 2.3 | All containers show memory limits in docker inspect; 48h with 0 OOMKill events |
| 3.1 | Score card renders live score; each dimension links to its detail |
| 4.1 | /api/vm/metrics/trend/disk?range=7d returns valid Prometheus time-series JSON |
| 4.3 | Telegram receives weekly digest on Monday 08:00 UTC |
What This Roadmap Delivers
| Today | After roadmap |
|---|---|
| /api/vm/health silently fails | VM module works; health data feeds dashboard |
| 8.2% steal is invisible | Daily alert + trend chart + score card dimension |
| "7 unhealthy" — no context, no fix | Drill-down to health log; auto-restart within 30 min |
| Cleanup log is a raw text dump | Structured panel: when, what, how much freed |
| invttrdg writing 22 GB/day — undetected | I/O alert + investigation complete |
| No memory guardrails on 39 containers | Per-container limits; OOM events alerted |
| 2 weeks of Prometheus data — no UI | Trend charts: disk projection, swap, steal over time |
| Manual VM diagnosis = 30 min SSH session | Score card auto-refreshes every 5 min |
| Ollama loads silently, may cause swap storm | RAM impact warning before load |
Change Log (v1 → v2)
| # | What changed | Why |
|---|---|---|
| 1 | Added Phase 0 (fix broken foundations) | devops-backend VM module non-functional; must fix first |
| 2 | Phase 4.1 changed from Cosmos DB → Prometheus queries | Prometheus already running with 2 weeks of history; Cosmos would be duplicate |
| 3 | Phase 2.1 restart explanation corrected | unless-stopped does not react to health check failures; process is alive |
| 4 | Phase 1.2 steal time corrected | Requires 2 samples 1s apart, not single /proc/stat read |
| 5 | Phase 2.3 memory limits validated against actual RSS data | Prevents proposing limits that would OOM running services |
| 6 | Phase 5 added invttrdg I/O investigation + Grafana option | 22 GB/day block writes is the highest-risk untracked issue on the machine |
| 7 | Added Phase 0.2 logrotate for new log files | /var/log/docker-watchdog.log would grow unbounded |
| 8 | Added architectural decisions section (Prometheus vs Cosmos, host exec strategy) | Prevents wasted build on wrong approach |
| 9 | Added success criteria per phase | Makes "done" objective and testable |
| 10 | Added explicit phase dependency map | Phase 3 items would fail if built before Phase 0 |
| 11 | Corrected LLM status: qwen2.5-coder:1.5b is currently loaded | ollama ps confirmed; not idle as v1 stated |