# VM Observability & Control Roadmap (v2)

**Status:** Draft, pending approval

**Last updated:** 2026-05-27

**Scope:** `srv1491630` (Hostinger VM) + DevOps Dashboard (`devops.bytelyst.com`)

**Reviewed:** Yes. v1 was audited against the live system and 11 issues were corrected (see change log at bottom).

---

## Current State Snapshot

| Layer | What exists today | Verified gap |
|---|---|---|
| **Health check** | `vm-health-check.sh`: disk, load, RAM, swap, Docker | No steal time metric; no per-container detail |
| **Cleanup** | `vm-cleanup.sh`: build cache, images, logs, apt, pnpm, HOLD | Runs silently; no structured outcome record |
| **Cron** | 4 scheduled jobs (daily / weekly / monthly) | No execution history; no "last ran / freed X" |
| **Dashboard /vm** | Health check + cleanup log tail + trigger button | **VM module is non-functional**: container has no host volume mounts; all backend calls to host scripts fail silently |
| **Dashboard /system** | CPU, RAM, disk, Docker stats | Missing steal %, container detail, unhealthy drill-down |
| **Prometheus stack** | Prometheus + cAdvisor + node-exporter + Loki, ~2 weeks of history | **No Grafana**; trend data exists but no UI to query it |
| **Alerting** | Telegram on WARN/CRIT at 07:00 UTC | No steal time alert; no weekly digest; no cron failure alert |
| **Container restart** | 38/39 containers have `unless-stopped` | `unless-stopped` restarts on *process exit only* and does NOT react to health check failures. 7 containers running but unhealthy (process alive, health endpoint dead) |
| **LLMs (Ollama)** | 9 models on disk; `qwen2.5-coder:1.5b` currently loaded (1.1 GB, 100% CPU) | No RAM impact warning before loading; no dashboard visibility |
| **I/O anomaly** | `invttrdg-backend` writing ~22 GB/day to block storage | Unexplained; no alert, no investigation |

---

## Architectural Decisions (settle these before building)

### A. Trend chart data source

**Options:**

- ✅ **Query existing Prometheus** from DevOps dashboard (recommended): data is already there, no new store needed. Add Prometheus query endpoints to dashboard backend, render with a chart library.
- ➕ **Add Grafana container** alongside Prometheus: purpose-built for metrics UI, out-of-box dashboards. Extra 80–150 MB RAM.
- ❌ **New Cosmos DB vm-metrics container**: redundant with Prometheus; wrong tool for time-series.

**Recommendation:** Query Prometheus from the dashboard for Phase 4.2 charts (keeps everything in one UI). Add Grafana in Phase 5 only if dashboard charts feel limiting.

### B. Dashboard → host script execution

The `devops-backend` container currently has **no host volume mounts** and **no sudoers entry**. Phase 3.2 "Run cleanup from dashboard" requires one of:

- ✅ **Mount host script + Docker socket** into devops-backend (simplest, lowest risk)
- ➕ **Thin host-side agent** (systemd socket-activated, receives commands via Unix socket)
- ❌ **SSH from container to host**: unnecessary complexity

**Recommendation:** Mount `/opt/bytelyst/learning_ai_devops_tools/scripts` read-only + `/var/log` for log reading into devops-backend. Add sudoers entry for the cleanup script only.

---

## Phase 0: Fix Broken Foundations *(Day 1–2, prerequisite for all UI phases)*

These are not new features; they are bugs in the current system.

#### 0.1 Fix devops-backend VM module (host volume mounts)

**Problem:** `GET /api/vm/health`, `GET /api/vm/cleanup-log`, `POST /api/vm/cleanup` all fail because the container has no access to the host filesystem.

**Fix:** Update `docker-compose.yml` for devops-backend:

```yaml
volumes:
  - /opt/bytelyst/learning_ai_devops_tools/scripts:/scripts:ro
  - /var/log/vm-cleanup.log:/var/log/vm-cleanup.log:ro
  - /var/log/vm-health-check.log:/var/log/vm-health-check.log:ro
```

Update `repository.ts` to use the `/scripts/VMs/HostingerVM/vm-cleanup.sh` path, or use env var `VM_SCRIPTS_PATH`.
Add sudoers entry: `nobody ALL=(ALL) NOPASSWD: /scripts/VMs/HostingerVM/vm-cleanup.sh`

**Risk:** Low. The scripts and both log files are mounted read-only (`:ro`).

**Validation:** Run `curl http://localhost:4004/api/vm/health` and confirm a JSON response.

#### 0.2 Add logrotate entry for new log files

**Problem:** `/var/log/vm-cleanup.log` and `/var/log/vm-health-check.log` have no rotation policy and will grow unbounded.

**Fix:** Create `/etc/logrotate.d/bytelyst-vm`:

```
/var/log/vm-cleanup.log
/var/log/vm-health-check.log
/var/log/docker-watchdog.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    create 0644 root root
}
```

#### 0.3 Investigate `invttrdg-backend` I/O anomaly

**Problem:** 22.2 GB of block writes in 13 hours (~1.7 GB/hr). Sustained, that is about 41 GB/day, enough to fill the 123 GB of free disk in ~3 days of heavy trading activity. Other sections quote ~22 GB/day; confirming the actual daily rate is part of this investigation.

**Fix path:** Check what's being written (WAL logs? tick data? verbose debug logging?). Likely a log level or persistence config issue. Add a disk usage alert specific to this container.

**Risk of not fixing:** Disk fills up, all services go down.

---

## Phase 1: Observability Gaps *(Week 1)*

Read-only additions to existing scripts and the `/vm` dashboard page.

#### 1.1 Cron Job Execution History Panel

**Where:** Dashboard `/vm` page, new "Maintenance Schedule" card

**What:** Add `GET /api/vm/cron-status` endpoint that:

1. Parses crontab entries for the 4 managed jobs (look for `# bytelyst-vm-maintenance` block)
2. Parses `/var/log/vm-cleanup.log` into structured run objects: `{ timestamp, mode, diskBefore, diskAfter, freedMB, steps[], success }`
3. Calculates next run from cron expression

**UI:** Table with columns job name | schedule | last run | freed | status | next run. Expandable row shows step-by-step log.

**Dependency:** Requires Phase 0.1 (volume mount for log access).
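
A minimal sketch of the log-to-run-object step in 1.1, written in Python (the same `python3` the watchdog script already relies on) rather than in the dashboard backend's own language. The log line format is purely an assumption, since the real output of `vm-cleanup.sh` is not described here; the `RUN START` / `STEP` / `RUN END` markers, the `key=value` fields, and the `parse_cleanup_log` name are all hypothetical. Only the shape of the returned objects comes from the endpoint description above.

```python
import re
from typing import Dict, List

# ASSUMED log format (hypothetical; adapt to what vm-cleanup.sh really writes):
#   [2026-05-24T03:00:01Z] RUN START mode=weekly disk_before_mb=46200
#   [2026-05-24T03:00:09Z] STEP docker-build-cache freed_mb=180
#   [2026-05-24T03:01:12Z] RUN END disk_after_mb=45943 status=ok
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<kind>RUN START|STEP|RUN END) ?(?P<rest>.*)$"
)


def _fields(rest: str) -> Dict[str, str]:
    """Split 'key=value key=value' pairs into a dict."""
    return dict(pair.split("=", 1) for pair in rest.split() if "=" in pair)


def parse_cleanup_log(text: str) -> List[dict]:
    """Turn raw log text into run objects shaped like the ones in 1.1."""
    runs: List[dict] = []
    current = None
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip free-form lines the parser does not understand
        kind, rest = match["kind"], match["rest"]
        if kind == "RUN START":
            f = _fields(rest)
            current = {
                "timestamp": match["ts"],
                "mode": f.get("mode", "unknown"),
                "diskBefore": int(f.get("disk_before_mb", 0)),
                "diskAfter": None,
                "freedMB": 0,
                "steps": [],
                "success": False,  # stays False if the run never logs RUN END
            }
            runs.append(current)
        elif kind == "STEP" and current is not None:
            name, _, tail = rest.partition(" ")
            freed = int(_fields(tail).get("freed_mb", 0))
            current["steps"].append({"name": name, "freedMB": freed})
        elif kind == "RUN END" and current is not None:
            f = _fields(rest)
            current["diskAfter"] = int(f.get("disk_after_mb", current["diskBefore"]))
            current["freedMB"] = current["diskBefore"] - current["diskAfter"]
            current["success"] = f.get("status") == "ok"
            current = None
    return runs
```

A run that starts but never reaches `RUN END` is kept with `success` set to false, which is one way the panel could surface a cleanup that died midway.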
#### 1.2 CPU Steal Time Metric

**Where:** `vm-health-check.sh` + dashboard `/vm` health cards

**What:** Sample `/proc/stat` twice, 1 second apart, and compute steal %:

```bash
# Print "<steal> <total>" jiffies from the aggregate cpu line.
# Total sums user..steal ($2-$9); guest time is already counted inside user.
read_steal() { awk '/^cpu /{print $9" "$2+$3+$4+$5+$6+$7+$8+$9}' /proc/stat; }

s1=$(read_steal); sleep 1; s2=$(read_steal)
steal_pct=$(awk -v s1="$s1" -v s2="$s2" 'BEGIN{
  split(s1,a," "); split(s2,b," ")
  delta_steal=b[1]-a[1]; delta_total=b[2]-a[2]
  if (delta_total <= 0) { print "0.0"; exit }   # guard against divide-by-zero
  printf "%.1f", (delta_steal/delta_total)*100
}')
```

Thresholds: `> 5%` = WARN, `> 15%` = CRIT.

**Why:** Steal is currently at **8.2%**, silently degrading every API response and LLM inference call.

**Dependency:** None. Self-contained script change.

#### 1.3 Unhealthy Container Detail Panel

**Where:** Dashboard `/vm`, expand container health card

**What:** New `GET /api/vm/containers/unhealthy` endpoint:

- Container name, `unhealthy` since (approximated by parsing `docker inspect` `.State.Health.Log[0].Start`, the oldest retained health log entry)
- Last 3 health check log lines
- Current restart count

**UI:** Expandable per-container row with one-click restart button (calls existing or new `POST /api/vm/containers/:name/restart`).

**Dependency:** Requires Phase 0.1.

#### 1.4 Swap Pressure Indicator

**Where:** `vm-health-check.sh` + dashboard

**What:** Add `SwapCached` as a secondary metric. High SwapCached relative to SwapUsed means the system was recently under pressure even if swap looks fine now. Surface it in the daily Telegram alert even when the overall status is WARN rather than CRIT.

**Threshold change:** The current `SWAP_USED_WARN_GB=1` triggers today (1.4 GB in use). Consider raising it to `1.5` to reduce noise while keeping `SwapCached > 200MB` as an early warning signal.

---

## Phase 2: Self-Healing Automation *(Week 2)*

Scripts that fix known recurring issues automatically.

#### 2.1 Health-Check-Aware Container Watchdog

**Why the existing policy isn't enough:** 38 of 39 containers already have `unless-stopped`. That policy restarts on *container process exit* only.
When the web server process is alive but the health check endpoint returns `Connection refused`, Docker marks the container `unhealthy` but **does not restart it**; it keeps running, broken, indefinitely.

**Fix:** Systemd timer `docker-health-watchdog.timer` (runs every 10 minutes):

```bash
#!/bin/bash
# /usr/local/bin/docker-health-watchdog.sh
UNHEALTHY=$(docker ps --filter health=unhealthy --format '{{.Names}}')
for container in $UNHEALTHY; do
  # Only restart if the last 3 Docker health checks all failed.
  # These are Docker's own probes, so the window is 3x the container's
  # healthcheck interval, not 3 watchdog runs.
  failures=$(docker inspect "$container" | \
    python3 -c "import json,sys; h=json.load(sys.stdin)[0]['State']['Health']['Log']; \
print(sum(1 for l in h[-3:] if l['ExitCode']!=0))")
  if [[ "$failures" -eq 3 ]]; then
    docker restart "$container"
    echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Auto-restarted: $container (unhealthy 3x)" \
      >> /var/log/docker-watchdog.log
    # Telegram notify (reads token from $HERMES_HOME/.env)
  fi
done
```

**Safety:** Never restarts a container that just became unhealthy (it must fail 3 consecutive health checks first). Logs every restart. Only targets health-check failures, not intentionally stopped containers.

**Rollback:** `systemctl disable docker-health-watchdog.timer`

#### 2.2 Fix `hermes-root-backup` Git Diverge

**Current failure:** Git fast-forward fails every ~10 minutes since 16:25 today (30+ silent failures).

**Fix:** Patch the backup script to handle a diverged branch gracefully:

```bash
if ! git pull --ff-only 2>/dev/null; then
  # Log the diverge
  git log --oneline -3 HEAD > /tmp/hermes-diverge-before.txt
  git log --oneline -3 origin/main >> /tmp/hermes-diverge-before.txt
  # Try rebase first (preserves local commits if intentional)
  if ! git pull --rebase 2>/dev/null; then
    # A failed rebase leaves the repo mid-rebase; abort it before resetting
    git rebase --abort 2>/dev/null || true
    # If rebase fails, reset to origin (backup is the source of truth)
    git reset --hard origin/main
    notify_telegram "⚠️ hermes-root-backup: diverged branch reset to origin/main"
  fi
fi
```

**Risk:** `git reset --hard` loses any local-only commits on the backup repo.
Acceptable here because the backup script's job is to *push to* origin, so local commits shouldn't exist. Add a pre-check: if local commits exist that aren't on origin, alert instead of resetting.

#### 2.3 Container Memory Limits

**Validated against actual RSS data (Phase 2 data collected 2026-05-27):**

| Category | Current RSS | Proposed Limit | Reservation | Notes |
|---|---|---|---|---|
| Next.js web frontends | 17–37 MB | `256m` | `64m` | 7× headroom for webpack spikes |
| Node/Fastify backends | 20–67 MB | `384m` | `128m` | Allows burst for LLM calls |
| `invttrdg-backend` | 107 MB | `512m` | `256m` | High I/O service; watch after 0.3 |
| `trading-backend` | 92 MB | `512m` | `256m` | Active algo trading service |
| `platform-service` | 66 MB | `384m` | `128m` | Shared auth/platform layer |
| CosmosDB emulator | 145 MB | `1g` | `512m` | Can spike on write bursts |
| Prometheus | 57 MB | `256m` | `128m` | Stable but grows with series |
| Loki | 53 MB | `256m` | `128m` | Log ingestion can spike |
| Caddy | 27 MB | `128m` | `64m` | Proxy, very stable |
| Valkey (Redis) | 3.5 MB | `128m` | `32m` | Cache, tiny |
| Gitea | 79 MB | `512m` | `256m` | Git operations can spike |
| Ollama | 130 MB idle | **No limit** | n/a | Must accommodate model load (up to 8 GB) |

**Rollout strategy:**

1. Run `docker stats` baseline for 24h to confirm no container spikes beyond proposed limits
2. Apply limits per stack in docker-compose files (not `docker update`; recreate on next deploy)
3. Monitor for OOMKill events: `dmesg | grep -i oom` for 48h after rollout
4. **Never set limits on Ollama**: model loading is unpredictable and limits would kill inference

---

## Phase 3: Dashboard Control Plane *(Weeks 3–4)*

**Prerequisite for all Phase 3 items:** Phase 0.1 (host volume mount) must be complete.
#### 3.1 VM Score Card (Automated)

**Where:** Dashboard `/vm`, top summary widget, auto-refreshes every 5 min

**Scoring algorithm (0–100):**

```
CPU efficiency:       20 pts  (steal < 2% = 20, < 5% = 15, < 10% = 10, ≥ 10% = 5)
Memory pressure:      20 pts  (available > 6 GB = 20, > 3 GB = 15, > 1 GB = 5, else = 0)
Disk health:          15 pts  (< 40% used = 15, < 55% = 10, < 70% = 5, else = 0)
Service health:       20 pts  (0 unhealthy = 20, 1–2 = 15, 3–5 = 8, 6+ = 2)
Maintenance hygiene:  15 pts  (last cleanup < 7 days + freed > 0 = 15, < 30 days = 8, else = 0)
LLM readiness:        10 pts  (> 8 GB free RAM = 10, > 4 GB = 7, > 2 GB = 4, else = 1)
```

Score = sum. Display as a gauge. Each dimension is clickable to drill into its data.

**Dependencies:** Phase 1.2 (steal time in health check output).

#### 3.2 Cron Schedule & History Panel

**Where:** Dashboard `/vm`, "Maintenance" tab

**What:**

- Live table: 4 cron jobs × (last run, result, freed MB, next scheduled, "Run now" button)
- Last 30 cleanup runs as a sparkline: date vs MB freed
- One-click trigger for weekly / monthly / dry-run

**Backend endpoint:** `GET /api/vm/cron-status`, which parses the structured log + crontab

**Dependency:** Phase 0.1 (volume mount), Phase 1.1 (structured log parser).
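
The 3.1 scoring bands can be sketched as a pure function, shown here in Python for illustration only (the dashboard would implement the same bands in its own stack). The `VmSnapshot` fields and the `vm_score` name are hypothetical. Two points are assumptions where the table is ambiguous: a cleanup within 7 days that freed nothing falls through to the 30-day band, and "free RAM" for LLM readiness uses the same available-RAM figure as memory pressure.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class VmSnapshot:
    steal_pct: float                     # CPU steal %, from the Phase 1.2 health check
    available_ram_gb: float              # RAM currently available
    disk_used_pct: float                 # disk usage %
    unhealthy_containers: int            # containers Docker reports as unhealthy
    days_since_cleanup: Optional[float]  # None if no cleanup has been logged
    last_cleanup_freed_mb: float


def vm_score(s: VmSnapshot) -> Dict[str, int]:
    """Apply the 3.1 bands and return per-dimension points plus the total."""
    cpu = 20 if s.steal_pct < 2 else 15 if s.steal_pct < 5 else 10 if s.steal_pct < 10 else 5
    ram = s.available_ram_gb
    memory = 20 if ram > 6 else 15 if ram > 3 else 5 if ram > 1 else 0
    disk = 15 if s.disk_used_pct < 40 else 10 if s.disk_used_pct < 55 else 5 if s.disk_used_pct < 70 else 0
    n = s.unhealthy_containers
    service = 20 if n == 0 else 15 if n <= 2 else 8 if n <= 5 else 2
    days = s.days_since_cleanup
    if days is not None and days < 7 and s.last_cleanup_freed_mb > 0:
        hygiene = 15
    elif days is not None and days < 30:
        hygiene = 8  # assumption: a recent cleanup that freed nothing lands here
    else:
        hygiene = 0
    llm = 10 if ram > 8 else 7 if ram > 4 else 4 if ram > 2 else 1
    parts = {"cpu": cpu, "memory": memory, "disk": disk,
             "service": service, "hygiene": hygiene, "llm": llm}
    parts["total"] = sum(parts.values())
    return parts
```

Fed the figures quoted elsewhere in this roadmap (8.2% steal, 10 GB free, 37% disk, 7 unhealthy containers, 257 MB freed) and an assumed 3 days since the last cleanup, this returns a total of 72, with most of the lost points coming from service health.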
#### 3.3 Container Management Panel

**Where:** Dashboard `/vm`, "Containers" tab

**What:**

- Full list: name, stack, health status, uptime, CPU %, RAM, restart count
- Filter chips: All | Unhealthy | No Memory Limit | By stack
- Per-container: Restart, View last 50 log lines, Show health check history
- Bulk: "Restart all unhealthy" with confirmation modal

**New backend endpoints:** `GET /api/vm/containers`, `POST /api/vm/containers/:name/restart`, `GET /api/vm/containers/:name/logs`

#### 3.4 Ollama / LLM Panel

**Where:** Dashboard `/vm`, "Models" tab

**What:**

- Models list: name, size, last used timestamp
- Currently loaded (from `ollama ps`): model name, RAM used, CPU %, expires in
- RAM visualisation bar: [used by system] [model if loaded] [free]
- Warning banner: "Loading llama3.2-vision (7.8 GB) will leave ~1.2 GB free; swap pressure likely"
- Load / Unload model buttons

**Backend endpoints:** `GET /api/vm/ollama/models`, `POST /api/vm/ollama/load`, `DELETE /api/vm/ollama/unload`

**Note:** `qwen2.5-coder:1.5b` is currently loaded, confirmed via `ollama ps`.

---

## Phase 4: Trend Analysis *(Weeks 5–6)*

**Key architectural note:** Prometheus + cAdvisor + node-exporter are **already running and storing ~2 weeks of metrics history** including steal time, disk I/O, memory, container CPU/RAM. Do NOT create a separate Cosmos DB store. Query Prometheus directly.
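
As a rough illustration of querying Prometheus directly, the sketch below builds a range-query URL for Prometheus's HTTP API (`/api/v1/query_range` with `query`, `start`, `end`, `step`). It is in Python for brevity; the dashboard backend would do the equivalent in its own language. The PromQL strings use standard node-exporter metric names, but the label selectors (for example `mountpoint="/"`), the `PROMQL` map, the point budget, and the `build_query_range_url` name are assumptions to verify against the live Prometheus instance.

```python
from urllib.parse import urlencode

# Internal Prometheus address as quoted in section 4.1
PROMETHEUS_URL = "http://prometheus:9090"

RANGE_SECONDS = {"1d": 86_400, "7d": 7 * 86_400, "30d": 30 * 86_400}

# Assumed PromQL per trend; check metric and label names on the real instance.
PROMQL = {
    "steal": 'avg(rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100',
    "disk": '100 * (1 - node_filesystem_avail_bytes{mountpoint="/"}'
            ' / node_filesystem_size_bytes{mountpoint="/"})',
    "memory": "node_memory_MemAvailable_bytes",
}


def build_query_range_url(trend: str, range_key: str, now: float, points: int = 200) -> str:
    """Return a query_range URL covering the last `range_key` up to `now` (unix seconds)."""
    span = RANGE_SECONDS[range_key]
    params = {
        "query": PROMQL[trend],
        "start": int(now - span),
        "end": int(now),
        # Cap the number of samples per chart; never go below a 60 s resolution.
        "step": max(span // points, 60),
    }
    return f"{PROMETHEUS_URL}/api/v1/query_range?{urlencode(params)}"
```

Keeping URL construction separate from the HTTP call makes the range and step logic easy to unit-test without a running Prometheus.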
#### 4.1 Prometheus Query Endpoints in Dashboard Backend

**Where:** New `GET /api/vm/metrics/trend` endpoint group

**What:** Proxy queries to internal Prometheus (http://prometheus:9090 within Docker network):

```
/api/vm/metrics/trend/disk?range=7d        → disk usage % over time
/api/vm/metrics/trend/memory?range=7d      → available RAM + swap used over time
/api/vm/metrics/trend/steal?range=7d       → CPU steal % over time (once 1.2 is deployed)
/api/vm/metrics/trend/containers?range=7d  → unhealthy container count over time
/api/vm/metrics/trend/io?range=7d          → block write rate (flag invttrdg spikes)
```

**Note:** `devops-backend` is on the `dashboard_default` network, while Prometheus is on `learning_ai_common_plat_default`. Either add devops-backend to the Prometheus network, or expose Prometheus on a host port (internal only, not via Caddy).

#### 4.2 Trend Charts on Dashboard

**Where:** Dashboard `/vm`, collapsible "Trends" section below score card

**What (7-day / 30-day toggle):**

- Disk % over time + linear projection line → "estimated to hit 55% warning in X days"
- Swap used over time (detect slow memory leak)
- CPU steal % over time (detect host degradation trend)
- Unhealthy container count per day
- Block write rate: flag days with `invttrdg-backend` anomalies

**Library recommendation:** Recharts (likely already in the Next.js project) or a lightweight Chart.js wrapper.
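
The "estimated to hit 55% warning in X days" projection in 4.2 could be computed with an ordinary least-squares line through the recent disk samples. The sketch below is one possible approach in Python, not a prescribed implementation; the `days_until_threshold` name and the `(day_offset, disk_used_pct)` input shape are hypothetical.

```python
from typing import List, Optional, Tuple


def days_until_threshold(
    samples: List[Tuple[float, float]], threshold_pct: float = 55.0
) -> Optional[float]:
    """Estimate days from the latest sample until disk usage crosses the threshold.

    samples: (day_offset, disk_used_pct) pairs, e.g. one point per day.
    Returns 0.0 if the fitted line is already at or above the threshold,
    and None if there is too little data or usage is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None  # all samples share one timestamp; no slope to fit
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    intercept = mean_y - slope * mean_x
    last_x = max(x for x, _ in samples)
    current = intercept + slope * last_x  # fitted value at the latest sample
    if current >= threshold_pct:
        return 0.0
    if slope <= 0:
        return None
    return (threshold_pct - current) / slope
```

For example, five daily samples rising from 35% to 37% in 0.5-point steps give a slope of 0.5 %/day, so the 55% line is projected 36 days out. A straight-line fit will understate the risk if writes accelerate, so the chart should treat the figure as a rough estimate.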
#### 4.3 Weekly Digest (Telegram)

**Where:** New cron job, Monday 08:00 UTC: `vm-cleanup.sh --weekly-digest`

**What:**

```
📊 Weekly VM Digest - srv1491630
Week ending 2026-06-01

🖥 CPU Steal:   8.2% avg ⚠️ (host contention, escalate if > 10%)
💾 Disk:        37% (freed 257 MB this week via cleanup)
🧠 RAM:         10 GB free avg ✓
🔄 Swap peak:   1.4 GB ⚠️
🐳 Containers:  7 unhealthy (action required)
🤖 LLMs run:    qwen2.5-coder:1.5b (3 sessions this week)
🧹 Cleanups:    1 standard, 0 full
📅 Next full:   2026-06-01

Top action: Restart 7 unhealthy web containers
```

**Dependency:** Phase 4.1 (needs Prometheus for weekly averages), Phase 1.2 (steal metric must be in Prometheus).

---

## Phase 5: Advanced / Backlog

| Item | Description | Trigger condition |
|---|---|---|
| **Add Grafana** | Container alongside Prometheus for richer dashboards; pre-built node-exporter dashboards available | Phase 4 charts feel limited |
| **Deployment ↔ health correlation** | Mark deploys on trend charts; correlate health dips to specific releases | After Phase 4.2 exists |
| **Multi-VM support** | Extend all above to aggregate across VMs | Adding second VM |
| **`invttrdg-backend` write audit** | Persistent investigation: what generates 22 GB/day of block writes? Add per-container I/O alert | After Phase 0.3 |
| **Chaos validation** | Monthly: watchdog stops a test container, verify restart within 10 min, report result | After Phase 2.1 |
| **Ollama GPU readiness check** | Detect GPU availability, surface in LLM panel as "GPU: none, inference will be slow" | Before adding large models |
| **Container image freshness** | Alert when container is running image > 30 days old (not rebuilt) | When deploy pipeline matures |
| **Cost attribution** | Tag containers by product (trading, notes, clock...) to get RAM/CPU cost per product | When billing needed |
| **Backup health tracking** | `hermes-root-backup` and `uma-hermes-backup` results surfaced in dashboard | After Phase 2.2 |

---

## Implementation Order

```
Day 1–2    Phase 0 ── Fix broken foundations (VM module, logrotate, I/O investigation)
           ⚠️ MUST complete before any Phase 3 dashboard work

Week 1     Phase 1 ── Observability (steal metric, cron history, unhealthy detail, swap)
           1.2 (steal) → unblocks 3.1 (score card)
           1.1 (cron log format) → unblocks 3.2 (cron panel)

Week 2     Phase 2 ── Self-healing (watchdog, hermes-backup fix, memory limits)
           2.1 requires: logrotate entry (Phase 0.2)
           2.3 requires: 24h baseline observation first

Weeks 3–4  Phase 3 ── Dashboard control (score card, cron panel, containers, Ollama)
           All require: Phase 0.1 (host volume mount)
           3.1 requires: Phase 1.2 deployed
           3.2 requires: Phase 1.1 deployed

Weeks 5–6  Phase 4 ── Trend analysis (Prometheus queries, charts, weekly digest)
           4.1 requires: devops-backend on same Docker network as Prometheus
           4.2 requires: Phase 4.1
           4.3 requires: Phase 4.1 + Phase 1.2

Backlog    Phase 5 ── Advanced items, trigger-based
```

---

## Success Criteria (how to know each phase is done)

| Phase | Done when… |
|---|---|
| 0.1 | `curl localhost:4004/api/vm/health` returns valid JSON with disk/load/swap data |
| 0.2 | `logrotate -d /etc/logrotate.d/bytelyst-vm` exits 0; logs present in `/var/log` |
| 0.3 | Root cause of 22 GB/day writes identified + alert configured |
| 1.1 | Dashboard `/vm` shows "Last cleanup: [date], freed [MB]" parsed from log |
| 1.2 | `vm-health-check.sh` includes steal % in output; Telegram sends steal alert at > 5% |
| 1.3 | Dashboard shows each unhealthy container's last health log + restart button works |
| 2.1 | Watchdog restarts an intentionally-broken test container within 30 min |
| 2.2 | `hermes-root-backup` runs 10 times without failure after fix deployed |
| 2.3 | All containers except Ollama show memory limits in `docker inspect`; 48h with 0 OOMKill events |
| 3.1 | Score card renders live score; each dimension links to its detail |
| 4.1 | `/api/vm/metrics/trend/disk?range=7d` returns valid Prometheus time-series JSON |
| 4.3 | Telegram receives weekly digest on Monday 08:00 UTC |

---

## What This Roadmap Delivers

| Today | After roadmap |
|---|---|
| `/api/vm/health` silently fails | VM module works; health data feeds dashboard |
| 8.2% steal is invisible | Daily alert + trend chart + score card dimension |
| "7 unhealthy" with no context, no fix | Drill-down to health log; auto-restart within 30 min |
| Cleanup log is a raw text dump | Structured panel: when, what, how much freed |
| invttrdg writing 22 GB/day, undetected | I/O alert + investigation complete |
| No memory guardrails on 39 containers | Per-container limits; OOM events alerted |
| 2 weeks of Prometheus data, no UI | Trend charts: disk projection, swap, steal over time |
| Manual VM diagnosis = 30 min SSH session | Score card auto-refreshes every 5 min |
| Ollama loads silently, may cause swap storm | RAM impact warning before load |

---

## Change Log (v1 → v2)

| # | What changed | Why |
|---|---|---|
| 1 | Added **Phase 0** (fix broken foundations) | devops-backend VM module non-functional; must fix first |
| 2 | Phase 4.1 changed from Cosmos DB → **Prometheus queries** | Prometheus already running with 2 weeks of history; Cosmos would be duplicate |
| 3 | Phase 2.1 restart explanation corrected | `unless-stopped` does not react to health check failures; process is alive |
| 4 | Phase 1.2 steal time corrected | Requires **2 samples** 1s apart, not single `/proc/stat` read |
| 5 | Phase 2.3 memory limits **validated against actual RSS data** | Prevents proposing limits that would OOM running services |
| 6 | Phase 5 added **invttrdg I/O investigation** + Grafana option | 22 GB/day block writes is the highest-risk untracked issue on the machine |
| 7 | Added Phase 0.2 **logrotate** for new log files | `/var/log/docker-watchdog.log` would grow unbounded |
| 8 | Added **architectural decisions** section (Prometheus vs Cosmos, host exec strategy) | Prevents wasted build on wrong approach |
| 9 | Added **success criteria** per phase | Makes "done" objective and testable |
| 10 | Added explicit **phase dependency map** | Phase 3 items would fail if built before Phase 0 |
| 11 | Corrected LLM status: `qwen2.5-coder:1.5b` **is currently loaded** | `ollama ps` confirmed; not idle as v1 stated |