

VM Observability & Control Roadmap — v2

Status: Draft, pending approval
Last updated: 2026-05-27
Scope: srv1491630 (Hostinger VM) + DevOps Dashboard (devops.bytelyst.com)
Reviewed: Yes. v1 audited against live system; 11 issues corrected (see change log at bottom)


Current State Snapshot

| Layer | What exists today | Verified gap |
|---|---|---|
| Health check | vm-health-check.sh: disk, load, RAM, swap, Docker | No steal time metric; no per-container detail |
| Cleanup | vm-cleanup.sh: build cache, images, logs, apt, pnpm, HOLD | Runs silently; no structured outcome record |
| Cron | 4 scheduled jobs (daily / weekly / monthly) | No execution history; no "last ran / freed X" |
| Dashboard /vm | Health check + cleanup log tail + trigger button | VM module is non-functional: container has no host volume mounts; all backend calls to host scripts fail silently |
| Dashboard /system | CPU, RAM, disk, Docker stats | Missing steal %, container detail, unhealthy drill-down |
| Prometheus stack | Prometheus + cAdvisor + node-exporter + Loki, ~2 weeks history | No Grafana; trend data exists but no UI to query it |
| Alerting | Telegram on WARN/CRIT at 07:00 UTC | No steal time alert; no weekly digest; no cron failure alert |
| Container restart | 38/39 containers have unless-stopped | unless-stopped restarts on process exit only; it does NOT react to health check failures. 7 containers running but unhealthy (process alive, health endpoint dead) |
| LLMs (Ollama) | 9 models on disk; qwen2.5-coder:1.5b currently loaded (1.1 GB, 100% CPU) | No RAM impact warning before loading; no dashboard visibility |
| I/O anomaly | invttrdg-backend writing ~22 GB/day to block storage | Unexplained; no alert, no investigation |

Architectural Decisions (settle these before building)

A. Trend chart data source

Options:

  • Query existing Prometheus from DevOps dashboard (recommended) — data already there, no new store needed. Add Prometheus query endpoints to dashboard backend, render with a chart library.
  • Add Grafana container alongside Prometheus: purpose-built for metrics UI, with out-of-the-box dashboards. Costs an extra 80-150 MB RAM.
  • New Cosmos DB vm-metrics container — redundant with Prometheus; wrong tool for time-series.

Recommendation: Query Prometheus from the dashboard for Phase 4.2 charts (keeps everything in one UI). Add Grafana in Phase 5 only if dashboard charts feel limiting.

B. Dashboard → host script execution

The devops-backend container currently has no host volume mounts and no sudoers entry. Phase 3.2 "Run cleanup from dashboard" requires one of:

  • Mount host script + Docker socket into devops-backend (simplest, lowest risk)
  • Thin host-side agent (systemd socket-activated, receives commands via Unix socket)
  • SSH from container to host — unnecessary complexity

Recommendation: Mount /opt/bytelyst/learning_ai_devops_tools/scripts read-only + /var/log for log reading into devops-backend. Add sudoers entry for the cleanup script only.


Phase 0 — Fix Broken Foundations (Day 1-2, prerequisite for all UI phases)

These are not new features — they are bugs in the current system.

0.1 Fix devops-backend VM module (host volume mounts)

Problem: GET /api/vm/health, GET /api/vm/cleanup-log, and POST /api/vm/cleanup all fail because the container has no access to the host filesystem.

Fix: Update docker-compose.yml for devops-backend:

volumes:
  - /opt/bytelyst/learning_ai_devops_tools/scripts:/scripts:ro
  - /var/log/vm-cleanup.log:/var/log/vm-cleanup.log:ro
  - /var/log/vm-health-check.log:/var/log/vm-health-check.log:ro

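Update repository.ts to call /scripts/VMs/HostingerVM/vm-cleanup.sh, or read the base path from the env var VM_SCRIPTS_PATH.

Add sudoers entry: nobody ALL=(ALL) NOPASSWD: /scripts/VMs/HostingerVM/vm-cleanup.sh

Risk: Low. Scripts are mounted read-only. The log mounts above are also read-only (:ro), which is enough for reading; a cleanup run triggered from inside the container can only append to these logs if they are mounted read-write.

Validation: Run curl http://localhost:4004/api/vm/health and confirm a JSON response.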

0.2 Add logrotate entry for new log files

Problem: /var/log/vm-cleanup.log and /var/log/vm-health-check.log have no rotation policy and will grow unbounded.

Fix: Create /etc/logrotate.d/bytelyst-vm:

/var/log/vm-cleanup.log /var/log/vm-health-check.log /var/log/docker-watchdog.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    create 0644 root root
}

0.3 Investigate invttrdg-backend I/O anomaly

Problem: 22.2 GB of block writes in 13 hours (~1.7 GB/hr). At this rate that is ~41 GB/day, which would fill the 123 GB of free disk in ~3 days of heavy trading activity.

Fix path: Check what is being written (WAL logs? tick data? verbose debug logging?). Likely a log level or persistence config issue. Add a disk usage alert specific to this container.

Risk of not fixing: Disk fills up and all services go down.
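The projection above is plain rate arithmetic. A minimal sketch follows; the function name is illustrative, and it assumes every written byte consumes new disk space, which overstates growth if the service mostly overwrites existing blocks.

```python
def days_until_full(written_gb: float, window_hours: float, free_gb: float) -> float:
    """Days until free_gb is consumed if the observed write rate continues."""
    if written_gb <= 0 or window_hours <= 0:
        raise ValueError("written_gb and window_hours must be positive")
    gb_per_day = written_gb / window_hours * 24  # extrapolate the window to 24 h
    return free_gb / gb_per_day


# Figures quoted in 0.3: 22.2 GB in 13 h, 123 GB free
rate_per_day = 22.2 / 13 * 24
print(f"{rate_per_day:.1f} GB/day, full in {days_until_full(22.2, 13, 123):.1f} days")
```

Treat the result as a worst case until the investigation shows how much of the write volume is net growth.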


Phase 1 — Observability Gaps (Week 1)

Read-only additions to existing scripts and the /vm dashboard page.

1.1 Cron Job Execution History Panel

Where: Dashboard /vm page, new "Maintenance Schedule" card

What: Add GET /api/vm/cron-status endpoint that:

  1. Parses crontab entries for the 4 managed jobs (look for # bytelyst-vm-maintenance block)
  2. Parses /var/log/vm-cleanup.log into structured run objects: { timestamp, mode, diskBefore, diskAfter, freedMB, steps[], success }
  3. Calculates next run from cron expression

UI: Table with columns job name | schedule | last run | freed | status | next run. Expandable row shows step-by-step log.

Dependency: Requires Phase 0.1 (volume mount for log access).

1.2 CPU Steal Time Metric

Where: vm-health-check.sh + dashboard /vm health cards

What: Sample /proc/stat twice, 1 second apart, and compute steal %:

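# Print "<steal> <total>" jiffies from the aggregate cpu line of /proc/stat.
# Fields: $2 user ... $8 softirq, $9 steal. Guest time ($10, $11) is already
# counted inside user/nice, so it is left out of the total.
read_steal() { awk '/^cpu /{print $9, $2+$3+$4+$5+$6+$7+$8+$9}' /proc/stat; }
s1=$(read_steal); sleep 1; s2=$(read_steal)
steal_pct=$(awk -v s1="$s1" -v s2="$s2" 'BEGIN{
  split(s1,a," "); split(s2,b," ")
  delta_steal=b[1]-a[1]; delta_total=b[2]-a[2]
  # Guard against a zero delta so awk never divides by zero
  if (delta_total <= 0) { print "0.0"; exit }
  printf "%.1f", (delta_steal/delta_total)*100
}')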

Thresholds: > 5% = WARN, > 15% = CRIT.

Why: Steal is currently at 8.2%, silently degrading every API response and LLM inference call.

Dependency: None. Self-contained script change.
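For the dashboard side, the same two-sample calculation can be sketched in Python. The function names are illustrative and not part of the existing backend; the thresholds are the ones listed above. The total deliberately sums only the first eight fields, because guest time is already included in user and nice.

```python
def cpu_times(stat_text: str) -> tuple[int, int]:
    """Return (steal, total) jiffies from the aggregate 'cpu ' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            fields = [int(x) for x in line.split()[1:]]
            # Order: user nice system idle iowait irq softirq steal [guest guest_nice]
            return fields[7], sum(fields[:8])
    raise ValueError("no aggregate cpu line found")


def steal_percent(before: str, after: str) -> float:
    """Steal share of all CPU time elapsed between two /proc/stat samples."""
    s1, t1 = cpu_times(before)
    s2, t2 = cpu_times(after)
    if t2 <= t1:  # no elapsed jiffies: avoid dividing by zero
        return 0.0
    return round((s2 - s1) / (t2 - t1) * 100, 1)


def classify(steal_pct: float, warn: float = 5.0, crit: float = 15.0) -> str:
    """Map a steal percentage onto the OK / WARN / CRIT levels from section 1.2."""
    if steal_pct > crit:
        return "CRIT"
    return "WARN" if steal_pct > warn else "OK"
```

In practice the two inputs would be the contents of /proc/stat read one second apart; keeping the parsing separate from the sampling makes the arithmetic easy to unit test.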

1.3 Unhealthy Container Detail Panel

Where: Dashboard /vm, expand container health card

What: New GET /api/vm/containers/unhealthy endpoint:

  • Container name, unhealthy since (parse docker inspect .State.Health.Log[0].Start)
  • Last 3 health check log lines
  • Current restart count

UI: Expandable per-container row with one-click restart button (calls existing or new POST /api/vm/containers/:name/restart).

Dependency: Requires Phase 0.1.
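A minimal sketch of how the endpoint could shape one container's record from docker inspect output. The function and the output key names are hypothetical; the input fields (Name, RestartCount, State.Health.Log with Start, ExitCode, Output) are those emitted by docker inspect. Docker retains only a small number of recent health results, so the first log entry is best read as the oldest retained check rather than the exact moment the container turned unhealthy.

```python
import json


def summarize_unhealthy(inspect_json: str) -> dict:
    """Build the detail record for one container from `docker inspect <name>` output."""
    info = json.loads(inspect_json)[0]
    health = info["State"].get("Health") or {}
    log = health.get("Log") or []
    return {
        "name": info["Name"].lstrip("/"),  # docker inspect prefixes names with "/"
        "oldestRetainedCheck": log[0]["Start"] if log else None,
        "lastChecks": [
            {"exitCode": entry["ExitCode"], "output": entry["Output"].strip()}
            for entry in log[-3:]
        ],
        "restartCount": info["RestartCount"],
    }
```

A container without a HEALTHCHECK has no Health block, in which case the record simply comes back with an empty check list.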

1.4 Swap Pressure Indicator

Where: vm-health-check.sh + dashboard

What: Add SwapCached as a secondary metric. High SwapCached relative to SwapUsed means the system was recently under pressure even if swap looks OK now. Surface it in the daily Telegram alert even when the overall status is WARN rather than CRIT.

Threshold change: The current SWAP_USED_WARN_GB=1 triggers today (1.4 GB in use). Consider raising it to 1.5 to reduce noise while keeping SwapCached > 200 MB as an early warning signal.
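The SwapCached check reduces to parsing three fields from /proc/meminfo. A minimal sketch, assuming the 200 MB early-warning threshold mentioned above; the function name and result keys are illustrative.

```python
def swap_pressure(meminfo_text: str, cached_warn_mb: float = 200.0) -> dict:
    """Derive swap usage and the SwapCached early-warning flag from /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # /proc/meminfo reports sizes in kB
    swap_used_mb = (values["SwapTotal"] - values["SwapFree"]) / 1024
    swap_cached_mb = values["SwapCached"] / 1024
    return {
        "swapUsedMB": swap_used_mb,
        "swapCachedMB": swap_cached_mb,
        "earlyWarning": swap_cached_mb > cached_warn_mb,
    }
```

Passing the file contents in as text keeps the function testable without access to the host.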


Phase 2 — Self-Healing Automation (Week 2)

Scripts that fix known recurring issues automatically.

2.1 Health-Check-Aware Container Watchdog

Why the existing policy isn't enough: 38 of the 39 containers already have unless-stopped. That policy restarts a container only when its process exits. When the web server process is alive but the health check endpoint returns Connection refused, Docker marks the container unhealthy but does not restart it, so it keeps running in a broken state indefinitely.

Fix: Systemd timer docker-health-watchdog.timer (runs every 10 minutes):

#!/bin/bash
# /usr/local/bin/docker-health-watchdog.sh
# Restarts containers that Docker reports as unhealthy; invoked by the systemd timer.
UNHEALTHY=$(docker ps --filter health=unhealthy --format '{{.Names}}')
for container in $UNHEALTHY; do
  # Only restart if the last 3 entries in the container's health log all failed.
  # These are Docker's own health-check runs (interval is set per container),
  # not watchdog runs.
  failures=$(docker inspect "$container" | \
    python3 -c "import json,sys; h=json.load(sys.stdin)[0]['State']['Health']['Log']; \
    print(sum(1 for l in h[-3:] if l['ExitCode']!=0))")
  if [[ "$failures" -eq 3 ]]; then
    docker restart "$container"
    echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Auto-restarted: $container (unhealthy 3x)" \
      >> /var/log/docker-watchdog.log
    # Telegram notify (reads token from $HERMES_HOME/.env)
  fi
done

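Safety: Restarts only when the last 3 recorded health checks all failed, so a single failed probe never triggers a restart. How long 3 checks take depends on each container's health-check interval, not on the 10-minute timer. Logs every restart. Only targets health-check failures, not intentionally stopped containers.

Rollback: systemctl disable --now docker-health-watchdog.timer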
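The restart decision inside the watchdog can be isolated as a pure function, which makes the rule easy to test before it is allowed to restart anything. This is a sketch under the same rule as the script (last 3 health results all non-zero); the function name is hypothetical.

```python
import json


def should_restart(inspect_json: str, required_failures: int = 3) -> bool:
    """True when the container is unhealthy and its last N health checks all failed."""
    state = json.loads(inspect_json)[0]["State"]
    health = state.get("Health")
    if not health or health.get("Status") != "unhealthy":
        return False  # no health check configured, or not currently unhealthy
    recent = (health.get("Log") or [])[-required_failures:]
    # Require a full window of failures so a short log never triggers a restart
    return len(recent) == required_failures and all(
        entry["ExitCode"] != 0 for entry in recent
    )
```

The input is the raw output of docker inspect for one container, so the shell script could pipe into this instead of the inline one-liner.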

2.2 Fix hermes-root-backup Git Diverge

Current failure: Git fast-forward fails every ~10 minutes since 16:25 today (~30+ silent failures).

Fix: Patch the backup script to handle a diverged branch gracefully:

if ! git pull --ff-only 2>/dev/null; then
    # Record both sides of the divergence for later inspection
    git log --oneline -3 HEAD > /tmp/hermes-diverge-before.txt
    git log --oneline -3 origin/main >> /tmp/hermes-diverge-before.txt
    # Try rebase first (preserves local commits if intentional)
    if ! git pull --rebase 2>/dev/null; then
        # A failed rebase leaves the repo mid-rebase; abort it before resetting
        git rebase --abort 2>/dev/null
        # If rebase fails, reset to origin (backup is the source of truth)
        git reset --hard origin/main
        notify_telegram "⚠️ hermes-root-backup: diverged branch reset to origin/main"
    fi
fi

Risk: git reset --hard loses any local-only commits on the backup repo. Acceptable here because the backup script's job is to push to origin — local commits shouldn't exist. Add a pre-check: if local commits exist that aren't on origin, alert instead of resetting.
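The pre-check can be driven by `git rev-list --left-right --count origin/main...HEAD`, which prints two counts: commits only on origin/main (behind) and commits only on HEAD (ahead). A minimal sketch of the decision, with hypothetical function and action names:

```python
def parse_left_right(output: str) -> tuple[int, int]:
    """Parse '<behind> <ahead>' as printed by
    `git rev-list --left-right --count origin/main...HEAD`."""
    behind, ahead = (int(x) for x in output.split())
    return behind, ahead


def diverge_action(behind: int, ahead: int) -> str:
    """Choose what the backup script should do before any reset."""
    if ahead > 0:
        return "alert"  # local-only commits exist: never reset over them
    if behind > 0:
        return "fast-forward"  # nothing local to lose; safe to move to origin
    return "noop"
```

With this in place, the reset branch is reached only when there are no local-only commits to lose.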

2.3 Container Memory Limits

Validated against actual RSS data (Phase 2 data collected 2026-05-27):

| Category | Current RSS | Proposed Limit | Reservation | Notes |
|---|---|---|---|---|
| Next.js web frontends | 17-37 MB | 256m | 64m | 7× headroom for webpack spikes |
| Node/Fastify backends | 20-67 MB | 384m | 128m | Allows burst for LLM calls |
| invttrdg-backend | 107 MB | 512m | 256m | High I/O service; watch after 0.3 |
| trading-backend | 92 MB | 512m | 256m | Active algo trading service |
| platform-service | 66 MB | 384m | 128m | Shared auth/platform layer |
| CosmosDB emulator | 145 MB | 1g | 512m | Can spike on write bursts |
| Prometheus | 57 MB | 256m | 128m | Stable but grows with series |
| Loki | 53 MB | 256m | 128m | Log ingestion can spike |
| Caddy | 27 MB | 128m | 64m | Proxy, very stable |
| Valkey (Redis) | 3.5 MB | 128m | 32m | Cache, tiny |
| Gitea | 79 MB | 512m | 256m | Git operations can spike |
| Ollama | 130 MB idle | No limit | n/a | Must accommodate model load (up to 8 GB) |

Rollout strategy:

  1. Run docker stats baseline for 24h to confirm no container spikes beyond proposed limits
  2. Apply limits per stack in docker-compose files (not docker update — recreate on next deploy)
  3. Monitor for OOMKill events: dmesg | grep -i oom for 48h after rollout
  4. Never set limits on Ollama — model loading is unpredictable and limits would kill inference

Phase 3 — Dashboard Control Plane (Weeks 3-4)

Prerequisite for all Phase 3 items: Phase 0.1 (host volume mount) must be complete.

3.1 VM Score Card (Automated)

Where: Dashboard /vm, top summary widget, auto-refreshes every 5 min

Scoring algorithm (0-100):

CPU efficiency:     20 pts  (steal < 2% = 20, < 5% = 15, < 10% = 10, ≥ 10% = 5)
Memory pressure:    20 pts  (available > 6 GB = 20, > 3 GB = 15, > 1 GB = 5, else = 0)
Disk health:        15 pts  (< 40% used = 15, < 55% = 10, < 70% = 5, else = 0)
Service health:     20 pts  (0 unhealthy = 20, 1-2 = 15, 3-5 = 8, 6+ = 2)
Maintenance hygiene: 15 pts (last cleanup < 7 days + freed > 0 = 15, < 30 days = 8, else = 0)
LLM readiness:      10 pts  (> 8 GB free RAM = 10, > 4 GB = 7, > 2 GB = 4, else = 1)

Score = sum. Display as gauge. Each dimension is clickable to drill into its data.

Dependencies: Phase 1.2 (steal time in health check output).
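The scoring table translates directly into a small function. This is a sketch, not the dashboard's implementation: the VmSnapshot fields are hypothetical names, service health bands are read as 1-2 and 3-5 unhealthy containers, available RAM is used for both the memory and LLM dimensions, and the hygiene rule is interpreted as "15 if cleaned within 7 days and something was freed, 8 if cleaned within 30 days, else 0".

```python
from dataclasses import dataclass


@dataclass
class VmSnapshot:
    steal_pct: float
    available_ram_gb: float
    disk_used_pct: float
    unhealthy_containers: int
    days_since_cleanup: float
    freed_mb_last_cleanup: float


def score(s: VmSnapshot) -> dict:
    """Score each dimension per the table in 3.1 and add a 0-100 total."""
    cpu = 20 if s.steal_pct < 2 else 15 if s.steal_pct < 5 else 10 if s.steal_pct < 10 else 5
    ram = s.available_ram_gb
    memory = 20 if ram > 6 else 15 if ram > 3 else 5 if ram > 1 else 0
    disk = 15 if s.disk_used_pct < 40 else 10 if s.disk_used_pct < 55 else 5 if s.disk_used_pct < 70 else 0
    n = s.unhealthy_containers
    services = 20 if n == 0 else 15 if n <= 2 else 8 if n <= 5 else 2
    if s.days_since_cleanup < 7 and s.freed_mb_last_cleanup > 0:
        hygiene = 15
    elif s.days_since_cleanup < 30:
        hygiene = 8
    else:
        hygiene = 0
    llm = 10 if ram > 8 else 7 if ram > 4 else 4 if ram > 2 else 1
    parts = {"cpu": cpu, "memory": memory, "disk": disk,
             "services": services, "hygiene": hygiene, "llm": llm}
    parts["total"] = sum(parts.values())
    return parts


# Figures quoted elsewhere in this roadmap; days_since_cleanup=3 is assumed
print(score(VmSnapshot(8.2, 10, 37, 7, 3, 257)))
```

Returning the per-dimension points alongside the total is what lets each gauge segment link to its own drill-down.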

3.2 Cron Schedule & History Panel

Where: Dashboard /vm, "Maintenance" tab

What:

  • Live table: 4 cron jobs × (last run, result, freed MB, next scheduled, "Run now" button)
  • Last 30 cleanup runs as a sparkline: date vs MB freed
  • One-click trigger for weekly / monthly / dry-run

Backend endpoint: GET /api/vm/cron-status (parses structured log + crontab)

Dependency: Phase 0.1 (volume mount), Phase 1.1 (structured log parser).

3.3 Container Management Panel

Where: Dashboard /vm, "Containers" tab

What:

  • Full list: name, stack, health status, uptime, CPU %, RAM, restart count
  • Filter chips: All | Unhealthy | No Memory Limit | By stack
  • Per-container: Restart, View last 50 log lines, Show health check history
  • Bulk: "Restart all unhealthy" with confirmation modal

New backend endpoints: GET /api/vm/containers, POST /api/vm/containers/:name/restart, GET /api/vm/containers/:name/logs

3.4 Ollama / LLM Panel

Where: Dashboard /vm, "Models" tab

What:

  • Models list: name, size, last used timestamp
  • Currently loaded (from ollama ps): model name, RAM used, CPU %, expires in
  • RAM visualisation bar: [used by system] [model if loaded] [free]
  • Warning banner: "Loading llama3.2-vision (7.8 GB) will leave ~1.2 GB free — swap pressure likely"
  • Load / Unload model buttons

Backend endpoints: GET /api/vm/ollama/models, POST /api/vm/ollama/load, DELETE /api/vm/ollama/unload

Note: qwen2.5-coder:1.5b is currently loaded (confirmed via ollama ps).
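The warning banner comes down to subtracting the model size from currently available RAM. A minimal sketch, where the function name and the 2 GB minimum-free threshold are assumptions rather than values from this roadmap; the example reproduces the banner text above by assuming 9.0 GB available.

```python
from typing import Optional


def load_warning(model: str, model_gb: float, available_gb: float,
                 min_free_gb: float = 2.0) -> Optional[str]:
    """Return a banner message if loading the model would leave too little free RAM."""
    remaining = available_gb - model_gb
    if remaining >= min_free_gb:
        return None  # enough headroom: no banner
    if remaining <= 0:
        return (f"Loading {model} ({model_gb:.1f} GB) exceeds available RAM "
                f"by {-remaining:.1f} GB; heavy swapping likely")
    return (f"Loading {model} ({model_gb:.1f} GB) will leave "
            f"~{remaining:.1f} GB free; swap pressure likely")


print(load_warning("llama3.2-vision", 7.8, 9.0))
```

Model size on disk is only a proxy for resident memory once loaded, so the threshold is worth tuning against what ollama ps reports.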


Phase 4 — Trend Analysis (Weeks 5-6)

Key architectural note: Prometheus + cAdvisor + node-exporter are already running and storing ~2 weeks of metrics history including steal time, disk I/O, memory, container CPU/RAM. Do NOT create a separate Cosmos DB store. Query Prometheus directly.

4.1 Prometheus Query Endpoints in Dashboard Backend

Where: New GET /api/vm/metrics/trend endpoint group

What: Proxy queries to internal Prometheus (http://prometheus:9090 within Docker network):

/api/vm/metrics/trend/disk?range=7d       → disk usage % over time
/api/vm/metrics/trend/memory?range=7d     → available RAM + swap used over time
/api/vm/metrics/trend/steal?range=7d      → CPU steal % over time (once 1.2 is deployed)
/api/vm/metrics/trend/containers?range=7d → unhealthy container count over time
/api/vm/metrics/trend/io?range=7d         → block write rate (flag invttrdg spikes)

Note: devops-backend is on dashboard_default network, Prometheus is on learning_ai_common_plat_default. Either add devops-backend to Prometheus network, or expose Prometheus on a host port (internal only, not via Caddy).
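Each trend endpoint maps to one call against the Prometheus HTTP API's /api/v1/query_range, which takes query, start, end, and step parameters. The sketch below builds that URL; the PromQL expressions are assumptions based on standard node-exporter metric names, and the mountpoint="/" label in particular may differ depending on how node-exporter sees the host filesystem.

```python
from urllib.parse import urlencode

# Assumed PromQL per trend endpoint (node-exporter metric names)
PROMQL = {
    "steal": 'avg(rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100',
    "disk": ('100 - node_filesystem_avail_bytes{mountpoint="/"} '
             '/ node_filesystem_size_bytes{mountpoint="/"} * 100'),
}
UNIT_SECONDS = {"h": 3600, "d": 86400}


def range_seconds(range_str: str) -> int:
    """Convert a range such as '7d' or '12h' to seconds."""
    return int(range_str[:-1]) * UNIT_SECONDS[range_str[-1]]


def query_range_url(metric: str, range_str: str, now: int,
                    base: str = "http://prometheus:9090", points: int = 500) -> str:
    """Build the query_range URL for one trend chart ending at `now` (Unix seconds)."""
    span = range_seconds(range_str)
    step = max(span // points, 15)  # cap the number of returned points per series
    params = {"query": PROMQL[metric], "start": now - span, "end": now, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"
```

Keeping the PromQL in a fixed map, rather than accepting it from the client, means the dashboard endpoint never forwards arbitrary queries to Prometheus.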

4.2 Trend Charts on Dashboard

Where: Dashboard /vm, collapsible "Trends" section below score card

What (7-day / 30-day toggle):

  • Disk % over time + linear projection line → "estimated to hit 55% warning in X days"
  • Swap used over time (detect slow memory leak)
  • CPU steal % over time (detect host degradation trend)
  • Unhealthy container count per day
  • Block write rate: flag days with invttrdg-backend anomalies

Library recommendation: Recharts (likely already in the Next.js project) or a lightweight Chart.js wrapper.
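The disk projection line can be computed with an ordinary least-squares fit over the sampled points. A minimal sketch, assuming samples are (day, disk used %) pairs and using the 55% warning level from the chart description; the function name is illustrative.

```python
from typing import Optional


def days_until_threshold(samples: list[tuple[float, float]],
                         threshold_pct: float = 55.0) -> Optional[float]:
    """Fit y = slope*x + intercept to (day, disk_used_pct) samples and return the
    days from the last sample until the fitted line reaches threshold_pct.
    Returns None when usage is flat or shrinking, or there is too little data."""
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None  # all samples taken at the same time
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None  # no upward trend to project
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold_pct - intercept) / slope
    return max(crossing_day - samples[-1][0], 0.0)


# Illustrative data only: 35% rising by 0.5 points per day
print(days_until_threshold([(0, 35.0), (1, 35.5), (2, 36.0), (3, 36.5), (4, 37.0)]))
```

A straight line is a crude model for disk growth, so the estimate is most useful as a trend indicator rather than a deadline.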

4.3 Weekly Digest (Telegram)

Where: New cron job, Monday 08:00 UTC: vm-cleanup.sh --weekly-digest

What:

📊 Weekly VM Digest — srv1491630
Week ending 2026-06-01

🖥 CPU Steal:  8.2% avg  ⚠️ (host contention — escalate if > 10%)
💾 Disk:       37% (freed 257 MB this week via cleanup)
🧠 RAM:        10 GB free avg  ✓
🔄 Swap peak:  1.4 GB  ⚠️
🐳 Containers: 7 unhealthy (action required)
🤖 LLMs run:   qwen2.5-coder:1.5b (3 sessions this week)
🧹 Cleanups:   1 standard, 0 full
📅 Next full:  2026-06-01

Top action: Restart 7 unhealthy web containers

Dependency: Phase 4.1 (needs Prometheus for weekly averages), Phase 1.2 (steal metric must be in Prometheus).


Phase 5 — Advanced / Backlog

| Item | Description | Trigger condition |
|---|---|---|
| Add Grafana | Container alongside Prometheus for richer dashboards; pre-built node-exporter dashboards available | Phase 4 charts feel limited |
| Deployment ↔ health correlation | Mark deploys on trend charts; correlate health dips to specific releases | After Phase 4.2 exists |
| Multi-VM support | Extend all above to aggregate across VMs | Adding second VM |
| invttrdg-backend write audit | Persistent investigation: what generates 22 GB/day of block writes? Add per-container I/O alert | After Phase 0.3 |
| Chaos validation | Monthly: watchdog stops a test container, verify restart within 10 min, report result | After Phase 2.1 |
| Ollama GPU readiness check | Detect GPU availability, surface in LLM panel as "GPU: none, inference will be slow" | Before adding large models |
| Container image freshness | Alert when container is running image > 30 days old (not rebuilt) | When deploy pipeline matures |
| Cost attribution | Tag containers by product (trading, notes, clock...) for RAM/CPU cost per product | When billing needed |
| Backup health tracking | hermes-root-backup and uma-hermes-backup results surfaced in dashboard | After Phase 2.2 |

Implementation Order

Day 1-2   Phase 0  ── Fix broken foundations (VM module, logrotate, I/O investigation)
                       ⚠️ MUST complete before any Phase 3 dashboard work

Week 1    Phase 1  ── Observability (steal metric, cron history, unhealthy detail, swap)
                       1.2 (steal) → unblocks 3.1 (score card)
                       1.1 (cron log format) → unblocks 3.2 (cron panel)

Week 2    Phase 2  ── Self-healing (watchdog, hermes-backup fix, memory limits)
                       2.1 requires: logrotate entry (Phase 0.2)
                       2.3 requires: 24h baseline observation first

Weeks 3-4 Phase 3  ── Dashboard control (score card, cron panel, containers, Ollama)
                       All require: Phase 0.1 (host volume mount)
                       3.1 requires: Phase 1.2 deployed
                       3.2 requires: Phase 1.1 deployed

Weeks 5-6 Phase 4  ── Trend analysis (Prometheus queries, charts, weekly digest)
                       4.1 requires: devops-backend on same Docker network as Prometheus
                       4.2 requires: Phase 4.1
                       4.3 requires: Phase 4.1 + Phase 1.2

Backlog   Phase 5  ── Advanced items, trigger-based

Success Criteria (how to know each phase is done)

| Phase | Done when… |
|---|---|
| 0.1 | curl localhost:4004/api/vm/health returns valid JSON with disk/load/swap data |
| 0.2 | logrotate -d /etc/logrotate.d/bytelyst-vm exits 0; logs present in /var/log |
| 0.3 | Root cause of 22 GB/day writes identified + alert configured |
| 1.1 | Dashboard /vm shows "Last cleanup: [date], freed [MB]" parsed from log |
| 1.2 | vm-health-check.sh includes steal % in output; Telegram sends steal alert at > 5% |
| 1.3 | Dashboard shows each unhealthy container's last health log + restart button works |
| 2.1 | Watchdog restarts an intentionally-broken test container within 30 min |
| 2.2 | hermes-root-backup runs 10 times without failure after fix deployed |
| 2.3 | All containers show memory limits in docker inspect; 48h with 0 OOMKill events |
| 3.1 | Score card renders live score; each dimension links to its detail |
| 4.1 | /api/vm/metrics/trend/disk?range=7d returns valid Prometheus time-series JSON |
| 4.3 | Telegram receives weekly digest on Monday 08:00 UTC |

What This Roadmap Delivers

| Today | After roadmap |
|---|---|
| /api/vm/health silently fails | VM module works; health data feeds dashboard |
| 8.2% steal is invisible | Daily alert + trend chart + score card dimension |
| "7 unhealthy": no context, no fix | Drill-down to health log; auto-restart within 30 min |
| Cleanup log is a raw text dump | Structured panel: when, what, how much freed |
| invttrdg writing 22 GB/day, undetected | I/O alert + investigation complete |
| No memory guardrails on 39 containers | Per-container limits; OOM events alerted |
| 2 weeks of Prometheus data, no UI | Trend charts: disk projection, swap, steal over time |
| Manual VM diagnosis = 30 min SSH session | Score card auto-refreshes every 5 min |
| Ollama loads silently, may cause swap storm | RAM impact warning before load |

Change Log (v1 → v2)

| # | What changed | Why |
|---|---|---|
| 1 | Added Phase 0 (fix broken foundations) | devops-backend VM module non-functional; must fix first |
| 2 | Phase 4.1 changed from Cosmos DB → Prometheus queries | Prometheus already running with 2 weeks of history; Cosmos would be duplicate |
| 3 | Phase 2.1 restart explanation corrected | unless-stopped does not react to health check failures; process is alive |
| 4 | Phase 1.2 steal time corrected | Requires 2 samples 1s apart, not single /proc/stat read |
| 5 | Phase 2.3 memory limits validated against actual RSS data | Prevents proposing limits that would OOM running services |
| 6 | Phase 5 added invttrdg I/O investigation + Grafana option | 22 GB/day block writes is the highest-risk untracked issue on the machine |
| 7 | Added Phase 0.2 logrotate for new log files | /var/log/docker-watchdog.log would grow unbounded |
| 8 | Added architectural decisions section (Prometheus vs Cosmos, host exec strategy) | Prevents wasted build on wrong approach |
| 9 | Added success criteria per phase | Makes "done" objective and testable |
| 10 | Added explicit phase dependency map | Phase 3 items would fail if built before Phase 0 |
| 11 | Corrected LLM status: qwen2.5-coder:1.5b is currently loaded | ollama ps confirmed; not idle as v1 stated |