Hermes VM 2fc23d6baa feat(vm): fix devops-backend VM module — Phase 0.1 complete

- Switch backend runner from node:20-alpine to node:20-slim so GNU df
  flags (--output=pcent/avail) work inside the container
- Add volume mounts to docker-compose.yml: scripts (ro), VM logs (rw),
  docker.sock; set VM_SCRIPTS_PATH + VM_LOG_DIR env vars
- Rebuild repository.ts: env-configurable paths, cron history parser,
  unhealthy-container inspector, Ollama model endpoints
- Add routes: GET /api/vm/cron-status, unhealthy containers, Ollama
  models, container restart, model unload
- vm-cleanup.sh: add step_cosmos_pglog, step_docker_aged_images; fix
  (( count++ )) → count=$(( count + 1 )) for set -e compatibility
- Add docs/VM_OBSERVABILITY_ROADMAP.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-27 21:13:45 +00:00


VM Observability & Control Roadmap — v2

Status: Draft (pending approval)
Last updated: 2026-05-27
Scope: srv1491630 (Hostinger VM) + DevOps Dashboard (devops.bytelyst.com)
Reviewed: Yes. v1 was audited against the live system and 11 issues were corrected (see change log at bottom)

Current State Snapshot

Layer	What exists today	Verified gap
Health check	`vm-health-check.sh` — disk, load, RAM, swap, Docker	No steal time metric; no per-container detail
Cleanup	`vm-cleanup.sh` — build cache, images, logs, apt, pnpm, HOLD	Runs silently; no structured outcome record
Cron	4 scheduled jobs (daily / weekly / monthly)	No execution history; no "last ran / freed X"
Dashboard /vm	Health check + cleanup log tail + trigger button	VM module is non-functional — container has no host volume mounts; all backend calls to host scripts fail silently
Dashboard /system	CPU, RAM, disk, Docker stats	Missing steal %, container detail, unhealthy drill-down
Prometheus stack	Prometheus + cAdvisor + node-exporter + Loki — ~2 weeks history	No Grafana; trend data exists but no UI to query it
Alerting	Telegram on WARN/CRIT at 07:00 UTC	No steal time alert; no weekly digest; no cron failure alert
Container restart	38/39 containers have `unless-stopped`	`unless-stopped` restarts on process exit only — does NOT react to health check failures. 7 containers running but unhealthy (process alive, health endpoint dead)
LLMs (Ollama)	9 models on disk; `qwen2.5-coder:1.5b` currently loaded (1.1 GB, 100% CPU)	No RAM impact warning before loading; no dashboard visibility
I/O anomaly	`invttrdg-backend` wrote ~22 GB to block storage in 13 hours (~41 GB/day if sustained)	Unexplained: no alert, no investigation

Architectural Decisions (settle these before building)

A. Trend chart data source

Options:

✅ Query existing Prometheus from DevOps dashboard (recommended) — data already there, no new store needed. Add Prometheus query endpoints to dashboard backend, render with a chart library.
➕ Add Grafana container alongside Prometheus — purpose-built for metrics UI, out-of-box dashboards. Extra 80–150 MB RAM.
❌ New Cosmos DB vm-metrics container — redundant with Prometheus; wrong tool for time-series.

Recommendation: Query Prometheus from the dashboard for Phase 4.2 charts (keeps everything in one UI). Add Grafana in Phase 5 only if dashboard charts feel limiting.

B. Dashboard → host script execution

The devops-backend container currently has no host volume mounts and no sudoers entry. Phase 3.2 "Run cleanup from dashboard" requires one of:

✅ Mount host script + Docker socket into devops-backend (simplest, lowest risk)
➕ Thin host-side agent (systemd socket-activated, receives commands via Unix socket)
❌ SSH from container to host — unnecessary complexity

Recommendation: Mount /opt/bytelyst/learning_ai_devops_tools/scripts read-only + /var/log for log reading into devops-backend. Add sudoers entry for the cleanup script only.

Phase 0 — Fix Broken Foundations (Day 1–2, prerequisite for all UI phases)

These are not new features — they are bugs in the current system.

0.1 Fix devops-backend VM module (host volume mounts)

Problem: GET /api/vm/health, GET /api/vm/cleanup-log, and POST /api/vm/cleanup all fail because the container has no access to the host filesystem.
Fix: Update docker-compose.yml for devops-backend:

volumes:
  - /opt/bytelyst/learning_ai_devops_tools/scripts:/scripts:ro
  - /var/log/vm-cleanup.log:/var/log/vm-cleanup.log:ro
  - /var/log/vm-health-check.log:/var/log/vm-health-check.log:ro

Update repository.ts to use the /scripts/VMs/HostingerVM/vm-cleanup.sh path, or read the location from the env var VM_SCRIPTS_PATH.
Add sudoers entry: nobody ALL=(ALL) NOPASSWD: /scripts/VMs/HostingerVM/vm-cleanup.sh
Risk: Low. The scripts and both log files are mounted read-only (:ro), so the container cannot modify them.
Validates: Run curl http://localhost:4004/api/vm/health and confirm a JSON response.

0.2 Add logrotate entry for new log files

Problem: /var/log/vm-cleanup.log and /var/log/vm-health-check.log have no rotation policy and will grow unbounded.
Fix: Create /etc/logrotate.d/bytelyst-vm:

/var/log/vm-cleanup.log /var/log/vm-health-check.log /var/log/docker-watchdog.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    create 0644 root root
}

0.3 Investigate `invttrdg-backend` I/O anomaly

Problem: 22.2 GB of block writes in 13 hours (~1.7 GB/hr). At this rate that is ~41 GB/day, enough to fill the 123 GB of free disk in ~3 days of heavy trading activity.
Fix path: Check what is being written (WAL logs? tick data? verbose debug logging?). This is likely a log level or persistence config issue. Add a disk usage alert specific to this container.
Risk of not fixing: The disk fills up and all services go down.
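The extrapolation from the measured figures can be checked with a few lines of shell. This is a minimal sketch; the function name `project_disk_fill` is hypothetical, and it assumes the write rate stays constant.

```shell
# Sketch: extrapolate a measured write volume to a daily rate and days-until-full.
# Arguments: GB written, hours observed, GB free on disk.
# Assumes a constant write rate; the function name is hypothetical.
project_disk_fill() {
  awk -v gb="$1" -v hours="$2" -v free="$3" 'BEGIN {
    if (hours <= 0 || gb <= 0) { print "invalid input"; exit 1 }
    per_hour = gb / hours
    per_day  = per_hour * 24
    printf "%.1f GB/hr, %.1f GB/day, %.1f days until full\n", per_hour, per_day, free / per_day
  }'
}

project_disk_fill 22.2 13 123
```

With the numbers from the problem statement this prints `1.7 GB/hr, 41.0 GB/day, 3.0 days until full`.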

Phase 1 — Observability Gaps (Week 1)

Read-only additions to existing scripts and the /vm dashboard page.

1.1 Cron Job Execution History Panel

Where: Dashboard /vm page, new "Maintenance Schedule" card
What: Add a GET /api/vm/cron-status endpoint that:

Parses crontab entries for the 4 managed jobs (look for # bytelyst-vm-maintenance block)
Parses /var/log/vm-cleanup.log into structured run objects: { timestamp, mode, diskBefore, diskAfter, freedMB, steps[], success }
Calculates next run from cron expression
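The log-parsing step could look roughly like the sketch below. The log line format (`START`, `STEP`, and `END` records with `key=value` fields, disk figures in MB) is an assumption made for illustration, as is the function name `parse_cleanup_log`; the real vm-cleanup.log format would need to be matched or adjusted to something similarly structured.

```shell
# Sketch: turn a structured cleanup log into one JSON object per run.
# ASSUMED log format (hypothetical, not taken from vm-cleanup.sh):
#   <timestamp> START mode=<mode> disk_before_mb=<n>
#   <timestamp> STEP name=<step>
#   <timestamp> END status=<ok|fail> disk_after_mb=<n>
parse_cleanup_log() {
  awk '
    # Return the value part of a key=value field.
    function val(field,   kv) { split(field, kv, "="); return kv[2] }
    $2 == "START" { ts = $1; mode = val($3); before = val($4); steps = ""; next }
    $2 == "STEP"  { steps = steps (steps == "" ? "" : ",") "\"" val($3) "\""; next }
    $2 == "END"   {
      after = val($4)
      ok = (val($3) == "ok") ? "true" : "false"
      printf "{\"timestamp\":\"%s\",\"mode\":\"%s\",\"diskBefore\":%d,\"diskAfter\":%d,\"freedMB\":%d,\"steps\":[%s],\"success\":%s}\n", ts, mode, before, after, before - after, steps, ok
    }
  ' "$@"
}
```

Called as `parse_cleanup_log /var/log/vm-cleanup.log`, it emits one line per completed run, which the endpoint can wrap in a JSON array. Runs without an `END` record are skipped, so an interrupted cleanup does not show up as a success.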

1.2 CPU Steal Time Metric

Where: vm-health-check.sh + dashboard /vm health cards
What: Sample /proc/stat twice, 1 second apart, and compute steal %:

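# Print "<steal> <total>" jiffies from the aggregate cpu line of /proc/stat.
# $9 is steal; guest ($10) is already counted inside user ($2), so it is left out of the total.
read_steal() { awk '/^cpu /{print $9" "$2+$3+$4+$5+$6+$7+$8+$9}' /proc/stat; }
s1=$(read_steal); sleep 1; s2=$(read_steal)
steal_pct=$(awk -v s1="$s1" -v s2="$s2" 'BEGIN{
  split(s1,a," "); split(s2,b," ")
  delta_steal=b[1]-a[1]; delta_total=b[2]-a[2]
  # Guard against an empty interval to avoid division by zero
  if (delta_total <= 0) { print "0.0"; exit }
  printf "%.1f", (delta_steal/delta_total)*100
}')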

Thresholds: > 5% = WARN, > 15% = CRIT.
Why: Steal is currently at 8.2%, which slows API responses and LLM inference without showing up in any existing metric.
Dependency: None. Self-contained script change.
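The threshold mapping is small enough to express as a helper. A minimal sketch, assuming the health check wants a status string per metric; the function name `classify_steal` is hypothetical.

```shell
# Sketch: map a steal percentage to a status using the thresholds stated above.
# The function name is hypothetical.
classify_steal() {
  awk -v pct="$1" 'BEGIN {
    if (pct > 15)     print "CRIT"
    else if (pct > 5) print "WARN"
    else              print "OK"
  }'
}
```

For example, `classify_steal 8.2` prints `WARN`. The comparisons are strict, so exactly 5% still counts as OK.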

1.3 Unhealthy Container Detail Panel

Where: Dashboard /vm, expand container health card
What: New GET /api/vm/containers/unhealthy endpoint:

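Container name, unhealthy since (approximated from docker inspect .State.Health.Log[0].Start; Docker keeps only the last 5 health log entries, so this is the oldest retained check, not necessarily the first failure)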
Last 3 health check log lines
Current restart count

UI: Expandable per-container row with one-click restart button (calls existing or new POST /api/vm/containers/:name/restart).
Dependency: Requires Phase 0.1.

1.4 Swap Pressure Indicator

Where: vm-health-check.sh + dashboard
What: Add SwapCached as a secondary metric. High SwapCached relative to swap used means the system was recently under memory pressure even if swap looks fine now. Surface it in the daily Telegram alert even when the overall status is WARN rather than CRIT.
Threshold change: The current SWAP_USED_WARN_GB=1 triggers today (1.4 GB in use). Consider raising it to 1.5 to reduce noise while keeping SwapCached > 200 MB as an early warning signal.
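The SwapCached check can be read straight from /proc/meminfo, which reports values in kB. A minimal sketch; the function name `swap_pressure` and the output format are assumptions, and the 200 MB threshold is the one proposed above.

```shell
# Sketch: report swap used, SwapCached, and an early-warning flag.
# Reads /proc/meminfo by default; a different file can be passed for testing.
# Function name and output format are hypothetical.
swap_pressure() {
  awk '
    /^SwapTotal:/  { total  = $2 }   # values are in kB
    /^SwapFree:/   { free   = $2 }
    /^SwapCached:/ { cached = $2 }
    END {
      used_mb   = (total - free) / 1024
      cached_mb = cached / 1024
      status = (cached_mb > 200) ? "EARLY_WARN" : "OK"
      printf "swap_used_mb=%d swap_cached_mb=%d status=%s\n", used_mb, cached_mb, status
    }
  ' "${1:-/proc/meminfo}"
}
```

The `key=value` output is easy to append to the existing health check line and to parse on the dashboard side.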

Phase 2 — Self-Healing Automation (Week 2)

Scripts that fix known recurring issues automatically.

2.1 Health-Check-Aware Container Watchdog

Why the existing policy isn't enough: All 38 containers already have unless-stopped. That policy restarts on container process exit only. When the web server process is alive but the health check endpoint returns Connection refused, Docker marks the container unhealthy but does not restart it, so it keeps running in a broken state indefinitely.
Fix: Systemd timer docker-health-watchdog.timer (runs every 10 minutes):

#!/bin/bash
# /usr/local/bin/docker-health-watchdog.sh
UNHEALTHY=$(docker ps --filter health=unhealthy --format '{{.Names}}')
for container in $UNHEALTHY; do
  # Only restart if the last 3 entries in Docker's health log all failed.
  # Those entries are spaced by the container's own healthcheck interval,
  # not by this timer, so this check alone is not a 30 min wait.
  failures=$(docker inspect "$container" | \
    python3 -c "import json,sys; h=json.load(sys.stdin)[0]['State']['Health']['Log']; \
    print(sum(1 for l in h[-3:] if l['ExitCode']!=0))")
  if [[ "$failures" -eq 3 ]]; then
    # Log only restarts that actually succeeded
    if docker restart "$container" >/dev/null; then
      echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Auto-restarted: $container (unhealthy 3x)" \
        >> /var/log/docker-watchdog.log
    fi
    # Telegram notify (reads token from $HERMES_HOME/.env)
  fi
done

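Safety: Restarts only containers whose last 3 recorded health checks all failed. Docker itself marks a container unhealthy only after several consecutive failures (3 by default), so the effective grace period is set by each container's healthcheck interval and retries plus the 10-minute timer, not by a fixed 30 minutes. Logs every restart. Only targets health-check failures, not intentionally stopped containers.
Rollback: systemctl disable --now docker-health-watchdog.timer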
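Docker's health log entries are spaced by each container's healthcheck interval, so counting the last three entries does not by itself measure elapsed time. If a fixed grace period is wanted, one option is for the watchdog to remember when it first saw each container unhealthy. The sketch below is one way to do that, not part of the script above; `STATE_DIR` and both function names are hypothetical.

```shell
# Sketch: enforce a real grace period across watchdog runs, using one marker
# file per container. STATE_DIR and the function names are hypothetical.
STATE_DIR="${STATE_DIR:-/var/lib/docker-watchdog}"

# Returns 0 only when the container has been seen unhealthy for >= grace seconds.
# Arguments: container name, current epoch seconds, grace period in seconds.
unhealthy_long_enough() {
  marker="$STATE_DIR/$1.first_seen"
  if [ ! -f "$marker" ]; then
    mkdir -p "$STATE_DIR" && echo "$2" > "$marker"   # first sighting: record and wait
    return 1
  fi
  first_seen=$(cat "$marker")
  [ $(( $2 - first_seen )) -ge "$3" ]
}

# Call after a restart, or when the container reports healthy again.
clear_unhealthy_marker() { rm -f "$STATE_DIR/$1.first_seen"; }
```

Inside the watchdog loop this would be used as `unhealthy_long_enough "$container" "$(date +%s)" 1800 && docker restart "$container" && clear_unhealthy_marker "$container"`. Markers for containers that recover on their own also need clearing, for example by removing any marker whose container is no longer in the unhealthy list.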

2.2 Fix `hermes-root-backup` Git Diverge

Current failure: Git fast-forward has failed every ~10 minutes since 16:25 today (30+ silent failures).
Fix: Patch the backup script to handle the divergence gracefully:

if ! git pull --ff-only 2>/dev/null; then
    # Log the diverge
    git log --oneline -3 HEAD > /tmp/hermes-diverge-before.txt
    git log --oneline -3 origin/main >> /tmp/hermes-diverge-before.txt
    # Try rebase first (preserves local commits if intentional)
    if ! git pull --rebase 2>/dev/null; then
        # A failed rebase leaves the repo mid-rebase; abort it before resetting
        git rebase --abort 2>/dev/null || true
        # Reset to origin (origin is treated as the source of truth)
        git reset --hard origin/main
        notify_telegram "⚠️ hermes-root-backup: diverged branch reset to origin/main"
    fi
fi

Risk: git reset --hard loses any local-only commits on the backup repo. Acceptable here because the backup script's job is to push to origin — local commits shouldn't exist. Add a pre-check: if local commits exist that aren't on origin, alert instead of resetting.
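The pre-check can be built on `git rev-list --count`, which counts the commits reachable from HEAD but not from origin/main. A minimal sketch; both function names are hypothetical, and it assumes `origin/main` has already been fetched.

```shell
# Sketch of the pre-check: count commits that exist locally but not on origin/main.
# Function names are hypothetical; assumes origin/main is up to date (after a fetch).
local_only_commits() {
  git rev-list --count origin/main..HEAD
}

# Returns 0 when a hard reset would discard nothing.
# If git fails and prints nothing, the numeric test fails too, so the answer is "not safe".
safe_to_hard_reset() {
  [ "$(local_only_commits)" -eq 0 ]
}
```

In the backup script the reset would then be guarded with `if safe_to_hard_reset; then git reset --hard origin/main; else notify_telegram "..."; fi`, so that local-only commits trigger an alert instead of being dropped.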

2.3 Container Memory Limits

Validated against actual RSS data (Phase 2 data collected 2026-05-27):

Category	Current RSS	Proposed Limit	Reservation	Notes
Next.js web frontends	17–37 MB	`256m`	`64m`	7× headroom for webpack spikes
Node/Fastify backends	20–67 MB	`384m`	`128m`	Allows burst for LLM calls
`invttrdg-backend`	107 MB	`512m`	`256m`	High I/O service; watch after 0.3
`trading-backend`	92 MB	`512m`	`256m`	Active algo trading service
`platform-service`	66 MB	`384m`	`128m`	Shared auth/platform layer
CosmosDB emulator	145 MB	`1g`	`512m`	Can spike on write bursts
Prometheus	57 MB	`256m`	`128m`	Stable but grows with series
Loki	53 MB	`256m`	`128m`	Log ingestion can spike
Caddy	27 MB	`128m`	`64m`	Proxy, very stable
Valkey (Redis)	3.5 MB	`128m`	`32m`	Cache, tiny
Gitea	79 MB	`512m`	`256m`	Git operations can spike
Ollama	130 MB idle	No limit	—	Must accommodate model load (up to 8 GB)

Rollout strategy:

Run docker stats baseline for 24h to confirm no container spikes beyond proposed limits
Apply limits per stack in docker-compose files (not docker update — recreate on next deploy)
Monitor for OOMKill events: dmesg | grep -i oom for 48h after rollout
Never set limits on Ollama — model loading is unpredictable and limits would kill inference
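To see which containers still run without a limit during rollout, `docker inspect` exposes the configured limit as `.HostConfig.Memory`, which is 0 when no limit is set. A minimal sketch; the filter function name `without_memory_limit` is hypothetical, and it only processes text, so it can be tried on saved output.

```shell
# Sketch: list containers that have no memory limit.
# Input on stdin: one "<name> <limit-bytes>" pair per line, as produced by
#   docker ps -q | xargs docker inspect --format '{{.Name}} {{.HostConfig.Memory}}'
# The function name is hypothetical.
without_memory_limit() {
  # A limit of 0 means "unlimited"; strip the leading "/" Docker puts on names.
  awk '$2 == 0 { sub(/^\//, "", $1); print $1 }'
}
```

After the rollout, the expected output is a single line, `ollama`, since that container is deliberately left unlimited.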

Phase 3 — Dashboard Control Plane (Weeks 3–4)

Prerequisite for all Phase 3 items: Phase 0.1 (host volume mount) must be complete.

3.1 VM Score Card (Automated)

Where: Dashboard /vm, top summary widget, auto-refreshes every 5 min
Scoring algorithm (0–100):

CPU efficiency:     20 pts  (steal < 2% = 20, < 5% = 15, < 10% = 10, ≥ 10% = 5)
Memory pressure:    20 pts  (available > 6 GB = 20, > 3 GB = 15, > 1 GB = 5, else = 0)
Disk health:        15 pts  (< 40% used = 15, < 55% = 10, < 70% = 5, else = 0)
Service health:     20 pts  (0 unhealthy = 20, 1–2 = 15, 3–5 = 8, 6+ = 2)
Maintenance hygiene: 15 pts (last cleanup < 7 days + freed > 0 = 15, < 30 days = 8, else = 0)
LLM readiness:      10 pts  (> 8 GB free RAM = 10, > 4 GB = 7, > 2 GB = 4, else = 1)

Score = sum. Display as gauge. Each dimension clickable to drill into its data.
Dependencies: Phase 1.2 (steal time in health check output).
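The scoring table translates directly into code. The sketch below is one possible reading of it; the function name `vm_score`, the argument order, and the output format are assumptions. "Available" memory and "free RAM for LLMs" are taken as separate inputs because the table scores them separately.

```shell
# Sketch: compute the 0-100 VM score from the six dimensions in the table above.
# Arguments: steal %, available RAM (GB), disk used %, unhealthy containers,
#            days since last cleanup, MB freed by it, free RAM for LLMs (GB).
# Function name, argument order, and output format are hypothetical.
vm_score() {
  awk -v steal="$1" -v avail="$2" -v disk="$3" -v unhealthy="$4" \
      -v days="$5" -v freed="$6" -v llm_free="$7" 'BEGIN {
    cpu   = (steal < 2) ? 20 : ((steal < 5) ? 15 : ((steal < 10) ? 10 : 5))
    mem   = (avail > 6) ? 20 : ((avail > 3) ? 15 : ((avail > 1) ? 5 : 0))
    dsk   = (disk < 40) ? 15 : ((disk < 55) ? 10 : ((disk < 70) ? 5 : 0))
    svc   = (unhealthy == 0) ? 20 : ((unhealthy <= 2) ? 15 : ((unhealthy <= 5) ? 8 : 2))
    maint = (days < 7 && freed > 0) ? 15 : ((days < 30) ? 8 : 0)
    llm   = (llm_free > 8) ? 10 : ((llm_free > 4) ? 7 : ((llm_free > 2) ? 4 : 1))
    printf "score=%d cpu=%d mem=%d disk=%d svc=%d maint=%d llm=%d\n", \
      cpu + mem + dsk + svc + maint + llm, cpu, mem, dsk, svc, maint, llm
  }'
}
```

With figures close to the current state (8.2% steal, 10 GB available, 37% disk, 7 unhealthy containers, a cleanup 3 days ago that freed 257 MB), `vm_score 8.2 10 37 7 3 257 10` prints `score=72 cpu=10 mem=20 disk=15 svc=2 maint=15 llm=10`. The "3 days ago" input is an illustrative value, not a measurement.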

3.2 Cron Schedule & History Panel

Where: Dashboard /vm, "Maintenance" tab
What:

Live table: 4 cron jobs × (last run, result, freed MB, next scheduled, "Run now" button)
Last 30 cleanup runs as a sparkline: date vs MB freed
One-click trigger for weekly / monthly / dry-run

Backend endpoint: GET /api/vm/cron-status (parses structured log + crontab)
Dependency: Phase 0.1 (volume mount), Phase 1.1 (structured log parser).

3.3 Container Management Panel

Where: Dashboard /vm, "Containers" tab
What:

Full list: name, stack, health status, uptime, CPU %, RAM, restart count
Filter chips: All | Unhealthy | No Memory Limit | By stack
Per-container: Restart, View last 50 log lines, Show health check history
Bulk: "Restart all unhealthy" with confirmation modal

New backend endpoints: GET /api/vm/containers, POST /api/vm/containers/:name/restart, GET /api/vm/containers/:name/logs

3.4 Ollama / LLM Panel

Where: Dashboard /vm, "Models" tab
What:

Models list: name, size, last used timestamp
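Currently loaded (from ollama ps): model name, RAM used, processor placement (the PROCESSOR column, e.g. "100% CPU" means the model runs entirely on CPU, not that CPU utilisation is 100%), expires in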
RAM visualisation bar: [used by system] [model if loaded] [free]
Warning banner: "Loading llama3.2-vision (7.8 GB) will leave ~1.2 GB free — swap pressure likely"
Load / Unload model buttons

Backend endpoints: GET /api/vm/ollama/models, POST /api/vm/ollama/load, DELETE /api/vm/ollama/unload
Note: qwen2.5-coder:1.5b is currently loaded (confirmed via ollama ps).
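The RAM impact warning is a subtraction plus a threshold. A minimal sketch; the function name `ollama_load_check` and the default 2 GB floor below which swap pressure is flagged are assumptions, since the roadmap does not fix a threshold.

```shell
# Sketch: warn before loading a model if too little RAM would remain.
# Arguments: model name, model size (GB), currently free RAM (GB),
#            optional minimum free RAM after load (GB, ASSUMED default 2).
# The function name and the default floor are hypothetical.
ollama_load_check() {
  awk -v model="$1" -v size="$2" -v free="$3" -v min_free="${4:-2}" 'BEGIN {
    left = free - size
    if (left < min_free)
      printf "WARN: loading %s (%.1f GB) would leave ~%.1f GB free; swap pressure likely\n", model, size, left
    else
      printf "OK: loading %s (%.1f GB) would leave ~%.1f GB free\n", model, size, left
  }'
}
```

The banner example above corresponds to `ollama_load_check llama3.2-vision 7.8 9.0`, which reports ~1.2 GB left and a warning.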

Phase 4 — Trend Analysis (Weeks 5–6)

Key architectural note: Prometheus + cAdvisor + node-exporter are already running and storing ~2 weeks of metrics history including steal time, disk I/O, memory, container CPU/RAM. Do NOT create a separate Cosmos DB store. Query Prometheus directly.

4.1 Prometheus Query Endpoints in Dashboard Backend

Where: New GET /api/vm/metrics/trend endpoint group
What: Proxy queries to internal Prometheus (http://prometheus:9090 within Docker network):

/api/vm/metrics/trend/disk?range=7d       → disk usage % over time
/api/vm/metrics/trend/memory?range=7d     → available RAM + swap used over time
/api/vm/metrics/trend/steal?range=7d      → CPU steal % over time (once 1.2 is deployed)
/api/vm/metrics/trend/containers?range=7d → unhealthy container count over time
/api/vm/metrics/trend/io?range=7d         → block write rate (flag invttrdg spikes)

Note: devops-backend is on the dashboard_default network, while Prometheus is on learning_ai_common_plat_default. Either attach devops-backend to the Prometheus network, or expose Prometheus on a host port (internal only, not via Caddy).
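Before wiring the proxy into the backend, the underlying Prometheus range queries can be tried from a shell on the same network. The sketch below uses Prometheus's `/api/v1/query_range` HTTP endpoint; the helper names, the `PROMETHEUS_URL` variable, and the one-hour default step are assumptions.

```shell
# Sketch: translate a dashboard range such as "7d" into a Prometheus range query.
# Helper names, PROMETHEUS_URL, and the default step are hypothetical.
range_to_seconds() {
  case "$1" in
    *d) echo $(( ${1%d} * 86400 )) ;;
    *h) echo $(( ${1%h} * 3600 )) ;;
    *)  echo "unsupported range: $1" >&2; return 1 ;;
  esac
}

# Arguments: PromQL expression, range (e.g. 7d), optional step in seconds.
prom_query_range() {
  secs=$(range_to_seconds "$2") || return 1
  end=$(date +%s)
  start=$(( end - secs ))
  curl -sG "${PROMETHEUS_URL:-http://prometheus:9090}/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "start=$start" \
    --data-urlencode "end=$end" \
    --data-urlencode "step=${3:-3600}"
}
```

For the steal trend, a plausible query against node-exporter's standard metric is `prom_query_range 'avg(rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100' 7d`. The exact PromQL for each endpoint still has to be checked against the labels in this Prometheus instance.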

4.2 Trend Charts on Dashboard

Where: Dashboard /vm, collapsible "Trends" section below score card
What (7-day / 30-day toggle):

Disk % over time + linear projection line → "estimated to hit 55% warning in X days"
Swap used over time (detect slow memory leak)
CPU steal % over time (detect host degradation trend)
Unhealthy container count per day
Block write rate: flag days with invttrdg-backend anomalies

Library recommendation: Recharts (likely already in the Next.js project) or a lightweight Chart.js wrapper.
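The "estimated to hit 55% in X days" projection is an ordinary least-squares line through the daily disk readings, extended to the threshold. A minimal sketch of the arithmetic; the function name `days_until_threshold` and the input format are assumptions, and a real implementation would live in the dashboard backend.

```shell
# Sketch: least-squares linear projection of disk usage.
# stdin: one "<day-index> <disk-percent>" pair per line; argument: threshold percent.
# Prints the number of days after the last sample at which the fitted line
# reaches the threshold. Function name and input format are hypothetical.
days_until_threshold() {
  awk -v limit="$1" '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2; last_x = $1 }
    END {
      denom = n * sxx - sx * sx
      if (n < 2 || denom == 0) { print "insufficient data"; exit }
      slope     = (n * sxy - sx * sy) / denom
      intercept = (sy - slope * sx) / n
      if (slope <= 0) { print "not growing"; exit }
      printf "%.1f\n", (limit - intercept) / slope - last_x
    }'
}
```

With readings of 30, 31, 32, 33% on days 0 to 3, `days_until_threshold 55` prints `22.0`: the disk grows 1 point per day and needs 22 more days to reach 55%.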

4.3 Weekly Digest (Telegram)

Where: New cron job, Monday 08:00 UTC: vm-cleanup.sh --weekly-digest
What:

📊 Weekly VM Digest — srv1491630
Week ending 2026-06-01

🖥 CPU Steal:  8.2% avg  ⚠️ (host contention — escalate if > 10%)
💾 Disk:       37% (freed 257 MB this week via cleanup)
🧠 RAM:        10 GB free avg  ✓
🔄 Swap peak:  1.4 GB  ⚠️
🐳 Containers: 7 unhealthy (action required)
🤖 LLMs run:   qwen2.5-coder:1.5b (3 sessions this week)
🧹 Cleanups:   1 standard, 0 full
📅 Next full:  2026-06-01

Top action: Restart 7 unhealthy web containers

Dependency: Phase 4.1 (needs Prometheus for weekly averages), Phase 1.2 (steal metric must be in Prometheus).

Phase 5 — Advanced / Backlog

Item	Description	Trigger condition
Add Grafana	Container alongside Prometheus for richer dashboards; pre-built node-exporter dashboards available	Phase 4 charts feel limited
Deployment ↔ health correlation	Mark deploys on trend charts; correlate health dips to specific releases	After Phase 4.2 exists
Multi-VM support	Extend all above to aggregate across VMs	Adding second VM
`invttrdg-backend` write audit	Persistent investigation: what generates ~22 GB of block writes in 13 hours (~41 GB/day)? Add per-container I/O alert	After Phase 0.3
Chaos validation	Monthly: watchdog stops a test container, verify restart within 10 min, report result	After Phase 2.1
Ollama GPU readiness check	Detect GPU availability, surface in LLM panel as "GPU: none — inference will be slow"	Before adding large models
Container image freshness	Alert when container is running image > 30 days old (not rebuilt)	When deploy pipeline matures
Cost attribution	Tag containers by product (trading, notes, clock...) — RAM/CPU cost per product	When billing needed
Backup health tracking	`hermes-root-backup` and `uma-hermes-backup` results surfaced in dashboard	After Phase 2.2

Implementation Order

Day 1–2   Phase 0  ── Fix broken foundations (VM module, logrotate, I/O investigation)
                       ⚠️ MUST complete before any Phase 3 dashboard work

Week 1    Phase 1  ── Observability (steal metric, cron history, unhealthy detail, swap)
                       1.2 (steal) → unblocks 3.1 (score card)
                       1.1 (cron log format) → unblocks 3.2 (cron panel)

Week 2    Phase 2  ── Self-healing (watchdog, hermes-backup fix, memory limits)
                       2.1 requires: logrotate entry (Phase 0.2)
                       2.3 requires: 24h baseline observation first

Weeks 3–4 Phase 3  ── Dashboard control (score card, cron panel, containers, Ollama)
                       All require: Phase 0.1 (host volume mount)
                       3.1 requires: Phase 1.2 deployed
                       3.2 requires: Phase 1.1 deployed

Weeks 5–6 Phase 4  ── Trend analysis (Prometheus queries, charts, weekly digest)
                       4.1 requires: devops-backend on same Docker network as Prometheus
                       4.2 requires: Phase 4.1
                       4.3 requires: Phase 4.1 + Phase 1.2

Backlog   Phase 5  ── Advanced items, trigger-based

Success Criteria (how to know each phase is done)

Phase	Done when…
0.1	`curl localhost:4004/api/vm/health` returns valid JSON with disk/load/swap data
0.2	`logrotate -d /etc/logrotate.d/bytelyst-vm` exits 0; logs present in `/var/log`
0.3	Root cause of the block-write anomaly (~22 GB in 13 hours) identified + alert configured
1.1	Dashboard `/vm` shows "Last cleanup: [date], freed [MB]" parsed from log
1.2	`vm-health-check.sh` includes steal % in output; Telegram sends steal alert at > 5%
1.3	Dashboard shows each unhealthy container's last health log + restart button works
2.1	Watchdog restarts an intentionally-broken test container within 30 min
2.2	`hermes-root-backup` runs 10 times without failure after fix deployed
2.3	All containers show memory limits in `docker inspect`; 48h with 0 OOMKill events
3.1	Score card renders live score; each dimension links to its detail
4.1	`/api/vm/metrics/trend/disk?range=7d` returns valid Prometheus time-series JSON
4.3	Telegram receives weekly digest on Monday 08:00 UTC

What This Roadmap Delivers

Today	After roadmap
`/api/vm/health` silently fails	VM module works; health data feeds dashboard
8.2% steal is invisible	Daily alert + trend chart + score card dimension
"7 unhealthy" — no context, no fix	Drill-down to health log; auto-restart within 30 min
Cleanup log is a raw text dump	Structured panel: when, what, how much freed
invttrdg writing ~22 GB in 13 hours, undetected	I/O alert + investigation complete
No memory guardrails on 39 containers	Per-container limits; OOM events alerted
2 weeks of Prometheus data — no UI	Trend charts: disk projection, swap, steal over time
Manual VM diagnosis = 30 min SSH session	Score card auto-refreshes every 5 min
Ollama loads silently, may cause swap storm	RAM impact warning before load

Change Log (v1 → v2)

#	What changed	Why
1	Added Phase 0 (fix broken foundations)	devops-backend VM module non-functional; must fix first
2	Phase 4.1 changed from Cosmos DB → Prometheus queries	Prometheus already running with 2 weeks of history; Cosmos would be duplicate
3	Phase 2.1 restart explanation corrected	`unless-stopped` does not react to health check failures; process is alive
4	Phase 1.2 steal time corrected	Requires 2 samples 1s apart, not single `/proc/stat` read
5	Phase 2.3 memory limits validated against actual RSS data	Prevents proposing limits that would OOM running services
6	Phase 5 added invttrdg I/O investigation + Grafana option	Block writes of ~22 GB in 13 hours are the highest-risk untracked issue on the machine
7	Added Phase 0.2 logrotate for new log files	`/var/log/docker-watchdog.log` would grow unbounded
8	Added architectural decisions section (Prometheus vs Cosmos, host exec strategy)	Prevents wasted build on wrong approach
9	Added success criteria per phase	Makes "done" objective and testable
10	Added explicit phase dependency map	Phase 3 items would fail if built before Phase 0
11	Corrected LLM status: `qwen2.5-coder:1.5b` is currently loaded	`ollama ps` confirmed; not idle as v1 stated
