
# VM Observability & Control Roadmap — v2
**Status:** Draft — Pending Approval
**Last updated:** 2026-05-27
**Scope:** `srv1491630` (Hostinger VM) + DevOps Dashboard (`devops.bytelyst.com`)
**Reviewed:** Yes — v1 audited against live system; 11 issues corrected (see change log at bottom)

---
## Current State Snapshot
| Layer | What exists today | Verified gap |
|---|---|---|
| **Health check** | `vm-health-check.sh` — disk, load, RAM, swap, Docker | No steal time metric; no per-container detail |
| **Cleanup** | `vm-cleanup.sh` — build cache, images, logs, apt, pnpm, HOLD | Runs silently; no structured outcome record |
| **Cron** | 4 scheduled jobs (daily / weekly / monthly) | No execution history; no "last ran / freed X" |
| **Dashboard /vm** | Health check + cleanup log tail + trigger button | **VM module is non-functional** — container has no host volume mounts; all backend calls to host scripts fail silently |
| **Dashboard /system** | CPU, RAM, disk, Docker stats | Missing steal %, container detail, unhealthy drill-down |
| **Prometheus stack** | Prometheus + cAdvisor + node-exporter + Loki — ~2 weeks history | **No Grafana**; trend data exists but no UI to query it |
| **Alerting** | Telegram on WARN/CRIT at 07:00 UTC | No steal time alert; no weekly digest; no cron failure alert |
| **Container restart** | 38/39 containers have `unless-stopped` | `unless-stopped` restarts on *process exit only* — does NOT react to health check failures. 7 containers running but unhealthy (process alive, health endpoint dead) |
| **LLMs (Ollama)** | 9 models on disk; `qwen2.5-coder:1.5b` currently loaded (1.1 GB, 100% CPU) | No RAM impact warning before loading; no dashboard visibility |
| **I/O anomaly** | `invttrdg-backend` writing ~22 GB/day to block storage | Unexplained — no alert, no investigation |
---
## Architectural Decisions (settle these before building)
### A. Trend chart data source
**Options:**
- **Query existing Prometheus** from DevOps dashboard (recommended) — data already there, no new store needed. Add Prometheus query endpoints to dashboard backend, render with a chart library.
- **Add Grafana container** alongside Prometheus — purpose-built for metrics UI, out-of-box dashboards. Extra 80–150 MB RAM.
- **New Cosmos DB vm-metrics container** — redundant with Prometheus; wrong tool for time-series.
**Recommendation:** Query Prometheus from the dashboard for Phase 4.2 charts (keeps everything in one UI). Add Grafana in Phase 5 only if dashboard charts feel limiting.
### B. Dashboard → host script execution
The `devops-backend` container currently has **no host volume mounts** and **no sudoers entry**. Phase 3.2 "Run cleanup from dashboard" requires one of:
- **Mount host script + Docker socket** into devops-backend (simplest, lowest risk)
- **Thin host-side agent** (systemd socket-activated, receives commands via Unix socket)
- **SSH from container to host** — unnecessary complexity
**Recommendation:** Mount `/opt/bytelyst/learning_ai_devops_tools/scripts` read-only + `/var/log` for log reading into devops-backend. Add sudoers entry for the cleanup script only.

---
## Phase 0 — Fix Broken Foundations *(Day 1–2, prerequisite for all UI phases)*
These are not new features — they are bugs in the current system.
#### 0.1 Fix devops-backend VM module (host volume mounts)
**Problem:** `GET /api/vm/health`, `GET /api/vm/cleanup-log`, `POST /api/vm/cleanup` all fail because the container has no access to the host filesystem.
**Fix:** Update `docker-compose.yml` for devops-backend:
```yaml
volumes:
- /opt/bytelyst/learning_ai_devops_tools/scripts:/scripts:ro
- /var/log/vm-cleanup.log:/var/log/vm-cleanup.log:ro
- /var/log/vm-health-check.log:/var/log/vm-health-check.log:ro
```
Update `repository.ts` to use `/scripts/VMs/HostingerVM/vm-cleanup.sh` path, or use env var `VM_SCRIPTS_PATH`.
Add sudoers entry: `nobody ALL=(ALL) NOPASSWD: /scripts/VMs/HostingerVM/vm-cleanup.sh`
**Risk:** Low. Scripts and logs are both mounted read-only in the snippet above, so the container cannot modify host files through these mounts.
**Validates:** Run `curl http://localhost:4004/api/vm/health` and confirm JSON response.
#### 0.2 Add logrotate entry for new log files
**Problem:** `/var/log/vm-cleanup.log` and `/var/log/vm-health-check.log` have no rotation policy. Will grow unbounded.
**Fix:** Create `/etc/logrotate.d/bytelyst-vm`:
```
/var/log/vm-cleanup.log /var/log/vm-health-check.log /var/log/docker-watchdog.log {
weekly
rotate 8
compress
delaycompress
missingok
notifempty
create 0644 root root
}
```
#### 0.3 Investigate `invttrdg-backend` I/O anomaly
**Problem:** 22.2 GB of block writes in 13 hours (~1.7 GB/hr, or roughly 40 GB/day if sustained). At that rate the 123 GB of free disk fills in ~3 days of heavy trading activity.
**Fix path:** Check what's being written (WAL logs? tick data? verbose debug logging?). Likely a log level or persistence config issue. Add disk usage alert specific to this container.
**Risk of not fixing:** Disk fills up, all services go down.
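The fill-time estimate above is plain arithmetic. A minimal sketch that reproduces it from the figures quoted in this section (`days_until_full` is an illustrative helper, not part of the existing scripts):

```bash
#!/usr/bin/env bash
# days_until_full FREE_GB WRITTEN_GB OVER_HOURS
# Projects how many days of free space remain if writes continue at the
# observed rate. Hypothetical helper name, for illustration only.
days_until_full() {
  awk -v free="$1" -v written="$2" -v hours="$3" 'BEGIN {
    per_day = written / hours * 24      # GB written per day at this rate
    printf "%.1f\n", free / per_day     # days until the free space is gone
  }'
}

days_until_full 123 22.2 13   # prints 3.0
```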
---
## Phase 1 — Observability Gaps *(Week 1)*
Read-only additions to existing scripts and the `/vm` dashboard page.
#### 1.1 Cron Job Execution History Panel
**Where:** Dashboard `/vm` page — new "Maintenance Schedule" card
**What:** Add `GET /api/vm/cron-status` endpoint that:
1. Parses crontab entries for the 4 managed jobs (look for `# bytelyst-vm-maintenance` block)
2. Parses `/var/log/vm-cleanup.log` into structured run objects: `{ timestamp, mode, diskBefore, diskAfter, freedMB, steps[], success }`
3. Calculates next run from cron expression
**UI:** Table — job name | schedule | last run | freed | status | next run. Expandable row shows step-by-step log.
**Dependency:** Requires Phase 0.1 (volume mount for log access).
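A minimal sketch of step 2, turning the cleanup log into structured run objects. The log line format used here (`<timestamp> START|STEP|END key=value ...`) is an assumption made for illustration; the actual output of `vm-cleanup.sh` may differ and the parser would need adjusting to match. `parse_cleanup_log` is a hypothetical name.

```bash
#!/usr/bin/env bash
# parse_cleanup_log [LOG_FILE]
# Emits one JSON object per cleanup run. ASSUMED log format (hypothetical):
#   2026-05-24T03:00:01Z START mode=weekly disk_used_mb=46210
#   2026-05-24T03:00:05Z STEP docker_build_cache ok
#   2026-05-24T03:02:40Z END mode=weekly disk_used_mb=45953 status=ok
parse_cleanup_log() {
  awk '
    # Return the value of key=value on the current line, or "" if absent.
    function kv(key,   i, p) {
      for (i = 3; i <= NF; i++) { split($i, p, "="); if (p[1] == key) return p[2] }
      return ""
    }
    $2 == "START" { ts = $1; mode = kv("mode"); before = kv("disk_used_mb"); steps = "" }
    $2 == "STEP"  { steps = steps (steps == "" ? "" : ",") "\"" $3 "\"" }
    $2 == "END"   {
      after = kv("disk_used_mb")
      printf "{\"timestamp\":\"%s\",\"mode\":\"%s\",\"diskBefore\":%d,\"diskAfter\":%d,\"freedMB\":%d,\"steps\":[%s],\"success\":%s}\n",
        ts, mode, before, after, before - after, steps,
        (kv("status") == "ok" ? "true" : "false")
    }
  ' "${1:-/var/log/vm-cleanup.log}"
}
```

Calling `parse_cleanup_log /var/log/vm-cleanup.log` would print one JSON line per run, which the endpoint could return as an array.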
#### 1.2 CPU Steal Time Metric
**Where:** `vm-health-check.sh` + dashboard `/vm` health cards
**What:** Sample `/proc/stat` twice 1 second apart, compute steal %:
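```bash
# Print "<steal> <total>" jiffies from the aggregate cpu line of /proc/stat.
# Fields: $2 user ... $9 steal. Guest time ($10, $11) is already included in
# user/nice, so it is left out of the total to avoid double counting.
read_steal() { awk '/^cpu /{print $9, $2+$3+$4+$5+$6+$7+$8+$9}' /proc/stat; }
s1=$(read_steal); sleep 1; s2=$(read_steal)
steal_pct=$(awk -v s1="$s1" -v s2="$s2" 'BEGIN{
  split(s1,a," "); split(s2,b," ")
  delta_steal=b[1]-a[1]; delta_total=b[2]-a[2]
  # Guard against a zero delta (no ticks recorded between the two samples)
  if (delta_total <= 0) { print "0.0"; exit }
  printf "%.1f", (delta_steal/delta_total)*100
}')
```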
Thresholds: `> 5%` = WARN, `> 15%` = CRIT.
**Why:** Currently at **8.2%** — silently degrading every API response and LLM inference call.
**Dependency:** None. Self-contained script change.
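The thresholds above map directly onto the script's existing WARN/CRIT levels. A minimal sketch of that mapping (`steal_level` is an illustrative name, not an existing function in `vm-health-check.sh`):

```bash
#!/usr/bin/env bash
# steal_level PCT -> prints OK, WARN (> 5%) or CRIT (> 15%).
# Hypothetical helper; awk is used because bash has no float comparison.
steal_level() {
  awk -v pct="$1" 'BEGIN {
    if (pct + 0 > 15)      print "CRIT"
    else if (pct + 0 > 5)  print "WARN"
    else                   print "OK"
  }'
}

steal_level 8.2   # prints WARN for the current reading
```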
#### 1.3 Unhealthy Container Detail Panel
**Where:** Dashboard `/vm` — expand container health card
**What:** New `GET /api/vm/containers/unhealthy` endpoint:
- Container name, `unhealthy` since (parse `docker inspect .State.Health.Log[0].Start`)
- Last 3 health check log lines
- Current restart count
**UI:** Expandable per-container row with one-click restart button (calls existing or new `POST /api/vm/containers/:name/restart`).
**Dependency:** Requires Phase 0.1.
#### 1.4 Swap Pressure Indicator
**Where:** `vm-health-check.sh` + dashboard
**What:** Add `SwapCached` as secondary metric. High SwapCached relative to SwapUsed = system was recently under pressure even if swap looks ok now. Surface in daily Telegram alert even when overall = WARN not CRIT.
**Threshold change:** Current `SWAP_USED_WARN_GB=1` triggers today (1.4 GB in use). Consider raising to `1.5` to reduce noise while keeping the `SwapCached > 200MB` as an early warning signal.
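A minimal sketch of the `SwapCached` early-warning check, assuming the value is read from `/proc/meminfo` (where the kernel reports it in kB). The function names are illustrative, and the optional file argument exists only so the logic can be exercised against a sample file.

```bash
#!/usr/bin/env bash
# swap_cached_mb [MEMINFO_FILE]
# Prints SwapCached in whole MB (the kernel reports the value in kB).
swap_cached_mb() {
  awk '/^SwapCached:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# swap_early_warning [MEMINFO_FILE]
# Succeeds (exit 0) when SwapCached exceeds the 200 MB early-warning threshold.
swap_early_warning() {
  local cached_mb
  cached_mb=$(swap_cached_mb "${1:-/proc/meminfo}")
  [ "${cached_mb:-0}" -gt 200 ]
}
```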
---
## Phase 2 — Self-Healing Automation *(Week 2)*
Scripts that fix known recurring issues automatically.
#### 2.1 Health-Check-Aware Container Watchdog
**Why the existing policy isn't enough:** 38 of the 39 containers already have `unless-stopped`. That policy restarts on *container process exit* only. When the web server process is alive but the health check endpoint returns `Connection refused`, Docker marks the container `unhealthy` but **does not restart it** — it keeps running indefinitely broken.
**Fix:** Systemd timer `docker-health-watchdog.timer` (runs every 10 minutes):
```bash
#!/bin/bash
# /usr/local/bin/docker-health-watchdog.sh
UNHEALTHY=$(docker ps --filter health=unhealthy --format '{{.Names}}')
for container in $UNHEALTHY; do
  # Only restart if the 3 most recent health probes all failed.
  # Note: Docker records probes at the container's own healthcheck interval,
  # so how long this window covers depends on that interval.
  failures=$(docker inspect "$container" | \
    python3 -c "import json,sys; h=json.load(sys.stdin)[0]['State']['Health']['Log']; \
    print(sum(1 for l in h[-3:] if l['ExitCode']!=0))")
  if [[ "$failures" -eq 3 ]]; then
    docker restart "$container"
    echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Auto-restarted: $container (unhealthy 3x)" \
      >> /var/log/docker-watchdog.log
    # Telegram notify (reads token from $HERMES_HOME/.env)
  fi
done
```
**Safety:** Never restarts a container that just became unhealthy (3-check cooldown). Logs every restart. Only targets health-check failures, not intentionally stopped containers.
**Rollback:** `systemctl disable docker-health-watchdog.timer`
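A minimal sketch of the two systemd units behind `docker-health-watchdog.timer`, written as shell functions that print the unit text. The `OnBootSec=5min` delay and the descriptions are assumptions; only the 10-minute cadence and the script path come from this section.

```bash
#!/usr/bin/env bash
# Print the service unit that runs the watchdog script once per activation.
render_watchdog_service() {
  cat <<'EOF'
[Unit]
Description=Restart Docker containers that stay unhealthy

[Service]
Type=oneshot
ExecStart=/usr/local/bin/docker-health-watchdog.sh
EOF
}

# Print the timer unit. A timer activates the service of the same name.
# OnBootSec is an assumed grace period after boot, not taken from the roadmap.
render_watchdog_timer() {
  cat <<'EOF'
[Unit]
Description=Run docker-health-watchdog every 10 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=10min

[Install]
WantedBy=timers.target
EOF
}
```

The output would be written to `/etc/systemd/system/docker-health-watchdog.service` and `/etc/systemd/system/docker-health-watchdog.timer`, followed by `systemctl daemon-reload` and `systemctl enable --now docker-health-watchdog.timer`.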
#### 2.2 Fix `hermes-root-backup` Git Diverge
**Current failure:** Git fast-forward fails every ~10 minutes since 16:25 today (~30+ silent failures).
**Fix:** Patch the backup script to handle diverge gracefully:
```bash
if ! git pull --ff-only 2>/dev/null; then
  # Log the diverge
  git log --oneline -3 HEAD > /tmp/hermes-diverge-before.txt
  git log --oneline -3 origin/main >> /tmp/hermes-diverge-before.txt
  # Try rebase first (preserves local commits if intentional)
  if ! git pull --rebase 2>/dev/null; then
    # A failed rebase leaves the repo mid-rebase; abort it before resetting
    git rebase --abort 2>/dev/null || true
    # If rebase fails, reset to origin (backup is the source of truth)
    git reset --hard origin/main
    notify_telegram "⚠️ hermes-root-backup: diverged branch reset to origin/main"
  fi
fi
```
**Risk:** `git reset --hard` loses any local-only commits on the backup repo. Acceptable here because the backup script's job is to *push to* origin — local commits shouldn't exist. Add a pre-check: if local commits exist that aren't on origin, alert instead of resetting.
#### 2.3 Container Memory Limits
**Validated against actual RSS data (Phase 2 data collected 2026-05-27):**
| Category | Current RSS | Proposed Limit | Reservation | Notes |
|---|---|---|---|---|
| Next.js web frontends | 17–37 MB | `256m` | `64m` | 7× headroom for webpack spikes |
| Node/Fastify backends | 20–67 MB | `384m` | `128m` | Allows burst for LLM calls |
| `invttrdg-backend` | 107 MB | `512m` | `256m` | High I/O service; watch after 0.3 |
| `trading-backend` | 92 MB | `512m` | `256m` | Active algo trading service |
| `platform-service` | 66 MB | `384m` | `128m` | Shared auth/platform layer |
| CosmosDB emulator | 145 MB | `1g` | `512m` | Can spike on write bursts |
| Prometheus | 57 MB | `256m` | `128m` | Stable but grows with series |
| Loki | 53 MB | `256m` | `128m` | Log ingestion can spike |
| Caddy | 27 MB | `128m` | `64m` | Proxy, very stable |
| Valkey (Redis) | 3.5 MB | `128m` | `32m` | Cache, tiny |
| Gitea | 79 MB | `512m` | `256m` | Git operations can spike |
| Ollama | 130 MB idle | **No limit** | — | Must accommodate model load (up to 8 GB) |
**Rollout strategy:**
1. Run `docker stats` baseline for 24h to confirm no container spikes beyond proposed limits
2. Apply limits per stack in docker-compose files (not `docker update` — recreate on next deploy)
3. Monitor for OOMKill events: `dmesg | grep -i oom` for 48h after rollout
4. **Never set limits on Ollama** — model loading is unpredictable and limits would kill inference
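For step 1, a minimal sketch of checking an observed peak against a proposed limit. The helper names are illustrative, and the peak values would come from the 24h `docker stats` baseline; the two calls at the end use RSS figures from the table above.

```bash
#!/usr/bin/env bash
# limit_to_mb LIMIT -> converts a Docker-style limit such as 256m or 1g to MB.
limit_to_mb() {
  case "$1" in
    *g|*G) echo $(( ${1%[gG]} * 1024 )) ;;
    *m|*M) echo "${1%[mM]}" ;;
    *)     echo "unsupported limit: $1" >&2; return 1 ;;
  esac
}

# fits_limit PEAK_MB LIMIT -> succeeds when the observed peak is below the limit.
fits_limit() {
  local limit_mb
  limit_mb=$(limit_to_mb "$2") || return 1
  awk -v peak="$1" -v limit="$limit_mb" 'BEGIN { exit !(peak + 0 < limit + 0) }'
}

fits_limit 27  128m && echo "Caddy: ok"
fits_limit 145 1g   && echo "CosmosDB emulator: ok"
```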
---
## Phase 3 — Dashboard Control Plane *(Weeks 3–4)*
**Prerequisite for all Phase 3 items:** Phase 0.1 (host volume mount) must be complete.
#### 3.1 VM Score Card (Automated)
**Where:** Dashboard `/vm` — top summary widget, auto-refreshes every 5 min
**Scoring algorithm (0–100):**
```
CPU efficiency: 20 pts (steal < 2% = 20, < 5% = 15, < 10% = 10, ≥ 10% = 5)
Memory pressure: 20 pts (available > 6 GB = 20, > 3 GB = 15, > 1 GB = 5, else = 0)
Disk health: 15 pts (< 40% used = 15, < 55% = 10, < 70% = 5, else = 0)
Service health: 20 pts (0 unhealthy = 20, 1–2 = 15, 3–5 = 8, 6+ = 2)
Maintenance hygiene: 15 pts (last cleanup < 7 days + freed > 0 = 15, < 30 days = 8, else = 0)
LLM readiness: 10 pts (> 8 GB free RAM = 10, > 4 GB = 7, > 2 GB = 4, else = 1)
```
Score = sum. Display as gauge. Each dimension clickable to drill into its data.
**Dependencies:** Phase 1.2 (steal time in health check output).
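The point bands above can be expressed as a single function. This is a sketch under stated assumptions: `vm_score` and its argument order are illustrative, and "available" and "free" RAM are passed as separate inputs because the table names them separately.

```bash
#!/usr/bin/env bash
# vm_score STEAL_PCT AVAIL_GB DISK_PCT UNHEALTHY CLEANUP_AGE_DAYS FREED_MB FREE_GB
# Applies the six point bands from the scoring table and prints the total.
vm_score() {
  awk -v steal="$1" -v avail="$2" -v disk="$3" -v unhealthy="$4" \
      -v age="$5" -v freed="$6" -v free="$7" 'BEGIN {
    cpu = (steal < 2) ? 20 : (steal < 5) ? 15 : (steal < 10) ? 10 : 5
    mem = (avail > 6) ? 20 : (avail > 3) ? 15 : (avail > 1) ? 5 : 0
    dsk = (disk < 40) ? 15 : (disk < 55) ? 10 : (disk < 70) ? 5 : 0
    svc = (unhealthy == 0) ? 20 : (unhealthy <= 2) ? 15 : (unhealthy <= 5) ? 8 : 2
    mnt = (age < 7 && freed > 0) ? 15 : (age < 30) ? 8 : 0
    llm = (free > 8) ? 10 : (free > 4) ? 7 : (free > 2) ? 4 : 1
    print cpu + mem + dsk + svc + mnt + llm
  }'
}

# 8.2% steal, 10 GB available, 37% disk, 7 unhealthy, cleanup 3 days ago
# (assumed age) that freed 257 MB, 10 GB free.
vm_score 8.2 10 37 7 3 257 10   # prints 72
```

With those inputs the breakdown is 10 + 20 + 15 + 2 + 15 + 10, so the unhealthy containers and steal time account for nearly all of the lost points.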
#### 3.2 Cron Schedule & History Panel
**Where:** Dashboard `/vm` — "Maintenance" tab
**What:**
- Live table: 4 cron jobs × (last run, result, freed MB, next scheduled, "Run now" button)
- Last 30 cleanup runs as a sparkline: date vs MB freed
- One-click trigger for weekly / monthly / dry-run
**Backend endpoint:** `GET /api/vm/cron-status` — parse structured log + crontab
**Dependency:** Phase 0.1 (volume mount), Phase 1.1 (structured log parser).
#### 3.3 Container Management Panel
**Where:** Dashboard `/vm` — "Containers" tab
**What:**
- Full list: name, stack, health status, uptime, CPU %, RAM, restart count
- Filter chips: All | Unhealthy | No Memory Limit | By stack
- Per-container: Restart, View last 50 log lines, Show health check history
- Bulk: "Restart all unhealthy" with confirmation modal
**New backend endpoints:** `GET /api/vm/containers`, `POST /api/vm/containers/:name/restart`, `GET /api/vm/containers/:name/logs`
#### 3.4 Ollama / LLM Panel
**Where:** Dashboard `/vm` — "Models" tab
**What:**
- Models list: name, size, last used timestamp
- Currently loaded (from `ollama ps`): model name, RAM used, CPU %, expires in
- RAM visualisation bar: [used by system] [model if loaded] [free]
- Warning banner: "Loading llama3.2-vision (7.8 GB) will leave ~1.2 GB free — swap pressure likely"
- Load / Unload model buttons
**Backend endpoints:** `GET /api/vm/ollama/models`, `POST /api/vm/ollama/load`, `DELETE /api/vm/ollama/unload`
**Note:** `qwen2.5-coder:1.5b` is currently loaded — confirmed via `ollama ps`.

---
## Phase 4 — Trend Analysis *(Weeks 5–6)*
**Key architectural note:** Prometheus + cAdvisor + node-exporter are **already running and storing ~2 weeks of metrics history** including steal time, disk I/O, memory, container CPU/RAM. Do NOT create a separate Cosmos DB store. Query Prometheus directly.
#### 4.1 Prometheus Query Endpoints in Dashboard Backend
**Where:** New `GET /api/vm/metrics/trend` endpoint group
**What:** Proxy queries to internal Prometheus (http://prometheus:9090 within Docker network):
```
/api/vm/metrics/trend/disk?range=7d → disk usage % over time
/api/vm/metrics/trend/memory?range=7d → available RAM + swap used over time
/api/vm/metrics/trend/steal?range=7d → CPU steal % over time (once 1.2 is deployed)
/api/vm/metrics/trend/containers?range=7d → unhealthy container count over time
/api/vm/metrics/trend/io?range=7d → block write rate (flag invttrdg spikes)
```
**Note:** `devops-backend` is on `dashboard_default` network, Prometheus is on `learning_ai_common_plat_default`. Either add devops-backend to Prometheus network, or expose Prometheus on a host port (internal only, not via Caddy).
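A minimal sketch of the query each trend route would proxy, shown in shell so it can be tried with `curl` before any backend code exists. It uses the Prometheus `/api/v1/query_range` endpoint and standard node-exporter metric names; the function names, the `mountpoint="/"` selector, the 5-minute rate window and the 1-hour step are assumptions.

```bash
#!/usr/bin/env bash
PROM_URL="${PROM_URL:-http://prometheus:9090}"

# range_to_seconds 7d|24h|30m -> seconds
range_to_seconds() {
  local n="${1%[dhm]}"
  case "$1" in
    *d) echo $(( n * 86400 )) ;;
    *h) echo $(( n * 3600 )) ;;
    *m) echo $(( n * 60 )) ;;
    *)  echo "unsupported range: $1" >&2; return 1 ;;
  esac
}

# promql_for TREND -> PromQL expression over node-exporter metrics
promql_for() {
  case "$1" in
    disk)   echo '100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})' ;;
    memory) echo 'node_memory_MemAvailable_bytes' ;;
    steal)  echo 'avg(rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100' ;;
    *)      echo "unknown trend: $1" >&2; return 1 ;;
  esac
}

# fetch_trend TREND RANGE -> raw JSON from Prometheus (requires network access)
fetch_trend() {
  local query span end
  query=$(promql_for "$1") || return 1
  span=$(range_to_seconds "$2") || return 1
  end=$(date +%s)
  curl -sG "$PROM_URL/api/v1/query_range" \
    --data-urlencode "query=$query" \
    --data-urlencode "start=$(( end - span ))" \
    --data-urlencode "end=$end" \
    --data-urlencode "step=3600"
}
```

Running `fetch_trend disk 7d` from a container on the Prometheus network should return a `matrix` result with one sample per hour; the dashboard endpoint would forward the same parameters.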
#### 4.2 Trend Charts on Dashboard
**Where:** Dashboard `/vm` — collapsible "Trends" section below score card
**What (7-day / 30-day toggle):**
- Disk % over time + linear projection line → "estimated to hit 55% warning in X days"
- Swap used over time (detect slow memory leak)
- CPU steal % over time (detect host degradation trend)
- Unhealthy container count per day
- Block write rate: flag days with `invttrdg-backend` anomalies
**Library recommendation:** Recharts (likely already in the Next.js project) or a lightweight Chart.js wrapper.
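The disk projection line in the list above amounts to a least-squares fit over the daily samples. A minimal sketch, assuming the input is one `day pct` pair per line (`days_until_threshold` is an illustrative name):

```bash
#!/usr/bin/env bash
# days_until_threshold THRESHOLD_PCT   (reads "day pct" lines on stdin)
# Fits a least-squares line through the samples and prints the number of days
# from the last sample until the line reaches the threshold. Prints "never"
# when there are too few samples or usage is flat or falling.
days_until_threshold() {
  awk -v limit="$1" '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2; last = $1 }
    END {
      denom = n * sxx - sx * sx
      if (n < 2 || denom == 0) { print "never"; exit }
      slope = (n * sxy - sx * sy) / denom
      intercept = (sy - slope * sx) / n
      if (slope <= 0) { print "never"; exit }
      printf "%.1f\n", (limit - intercept) / slope - last
    }'
}

# Four daily samples growing 1 point per day, currently at 40%.
printf '0 37\n1 38\n2 39\n3 40\n' | days_until_threshold 55   # prints 15.0
```

A straight-line fit is the simplest choice and matches the "linear projection" wording; it will overestimate the time remaining if growth is accelerating.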
#### 4.3 Weekly Digest (Telegram)
**Where:** New cron job — Monday 08:00 UTC — `vm-cleanup.sh --weekly-digest`
**What:**
```
📊 Weekly VM Digest — srv1491630
Week ending 2026-06-01
🖥 CPU Steal: 8.2% avg ⚠️ (host contention — escalate if > 10%)
💾 Disk: 37% (freed 257 MB this week via cleanup)
🧠 RAM: 10 GB free avg ✓
🔄 Swap peak: 1.4 GB ⚠️
🐳 Containers: 7 unhealthy (action required)
🤖 LLMs run: qwen2.5-coder:1.5b (3 sessions this week)
🧹 Cleanups: 1 standard, 0 full
📅 Next full: 2026-06-01
Top action: Restart 7 unhealthy web containers
```
**Dependency:** Phase 4.1 (needs Prometheus for weekly averages), Phase 1.2 (steal metric must be in Prometheus).
---
## Phase 5 — Advanced / Backlog
| Item | Description | Trigger condition |
|---|---|---|
| **Add Grafana** | Container alongside Prometheus for richer dashboards; pre-built node-exporter dashboards available | Phase 4 charts feel limited |
| **Deployment ↔ health correlation** | Mark deploys on trend charts; correlate health dips to specific releases | After Phase 4.2 exists |
| **Multi-VM support** | Extend all above to aggregate across VMs | Adding second VM |
| **`invttrdg-backend` write audit** | Persistent investigation: what generates 22 GB/day of block writes? Add per-container I/O alert | After Phase 0.3 |
| **Chaos validation** | Monthly: watchdog stops a test container, verify restart within 10 min, report result | After Phase 2.1 |
| **Ollama GPU readiness check** | Detect GPU availability, surface in LLM panel as "GPU: none — inference will be slow" | Before adding large models |
| **Container image freshness** | Alert when container is running image > 30 days old (not rebuilt) | When deploy pipeline matures |
| **Cost attribution** | Tag containers by product (trading, notes, clock...) — RAM/CPU cost per product | When billing needed |
| **Backup health tracking** | `hermes-root-backup` and `uma-hermes-backup` results surfaced in dashboard | After Phase 2.2 |
---
## Implementation Order
```
Day 1–2    Phase 0 ── Fix broken foundations (VM module, logrotate, I/O investigation)
                      ⚠️ MUST complete before any Phase 3 dashboard work
Week 1     Phase 1 ── Observability (steal metric, cron history, unhealthy detail, swap)
                      1.2 (steal) → unblocks 3.1 (score card)
                      1.1 (cron log format) → unblocks 3.2 (cron panel)
Week 2     Phase 2 ── Self-healing (watchdog, hermes-backup fix, memory limits)
                      2.1 requires: logrotate entry (Phase 0.2)
                      2.3 requires: 24h baseline observation first
Weeks 3–4  Phase 3 ── Dashboard control (score card, cron panel, containers, Ollama)
                      All require: Phase 0.1 (host volume mount)
                      3.1 requires: Phase 1.2 deployed
                      3.2 requires: Phase 1.1 deployed
Weeks 5–6  Phase 4 ── Trend analysis (Prometheus queries, charts, weekly digest)
                      4.1 requires: devops-backend on same Docker network as Prometheus
                      4.2 requires: Phase 4.1
                      4.3 requires: Phase 4.1 + Phase 1.2
Backlog    Phase 5 ── Advanced items, trigger-based
```
---
## Success Criteria (how to know each phase is done)
| Phase | Done when… |
|---|---|
| 0.1 | `curl localhost:4004/api/vm/health` returns valid JSON with disk/load/swap data |
| 0.2 | `logrotate -d /etc/logrotate.d/bytelyst-vm` exits 0; logs present in `/var/log` |
| 0.3 | Root cause of 22 GB/day writes identified + alert configured |
| 1.1 | Dashboard `/vm` shows "Last cleanup: [date], freed [MB]" parsed from log |
| 1.2 | `vm-health-check.sh` includes steal % in output; Telegram sends steal alert at > 5% |
| 1.3 | Dashboard shows each unhealthy container's last health log + restart button works |
| 2.1 | Watchdog restarts an intentionally-broken test container within 30 min |
| 2.2 | `hermes-root-backup` runs 10 times without failure after fix deployed |
| 2.3 | All containers show memory limits in `docker inspect`; 48h with 0 OOMKill events |
| 3.1 | Score card renders live score; each dimension links to its detail |
| 4.1 | `/api/vm/metrics/trend/disk?range=7d` returns valid Prometheus time-series JSON |
| 4.3 | Telegram receives weekly digest on Monday 08:00 UTC |
---
## What This Roadmap Delivers
| Today | After roadmap |
|---|---|
| `/api/vm/health` silently fails | VM module works; health data feeds dashboard |
| 8.2% steal is invisible | Daily alert + trend chart + score card dimension |
| "7 unhealthy" — no context, no fix | Drill-down to health log; auto-restart within 30 min |
| Cleanup log is a raw text dump | Structured panel: when, what, how much freed |
| invttrdg writing 22 GB/day — undetected | I/O alert + investigation complete |
| No memory guardrails on 39 containers | Per-container limits; OOM events alerted |
| 2 weeks of Prometheus data — no UI | Trend charts: disk projection, swap, steal over time |
| Manual VM diagnosis = 30 min SSH session | Score card auto-refreshes every 5 min |
| Ollama loads silently, may cause swap storm | RAM impact warning before load |
---
## Change Log (v1 → v2)
| # | What changed | Why |
|---|---|---|
| 1 | Added **Phase 0** (fix broken foundations) | devops-backend VM module non-functional; must fix first |
| 2 | Phase 4.1 changed from Cosmos DB → **Prometheus queries** | Prometheus already running with 2 weeks of history; Cosmos would be duplicate |
| 3 | Phase 2.1 restart explanation corrected | `unless-stopped` does not react to health check failures; process is alive |
| 4 | Phase 1.2 steal time corrected | Requires **2 samples** 1s apart, not single `/proc/stat` read |
| 5 | Phase 2.3 memory limits **validated against actual RSS data** | Prevents proposing limits that would OOM running services |
| 6 | Phase 5 added **invttrdg I/O investigation** + Grafana option | 22 GB/day block writes is the highest-risk untracked issue on the machine |
| 7 | Added Phase 0.2 **logrotate** for new log files | `/var/log/docker-watchdog.log` would grow unbounded |
| 8 | Added **architectural decisions** section (Prometheus vs Cosmos, host exec strategy) | Prevents wasted build on wrong approach |
| 9 | Added **success criteria** per phase | Makes "done" objective and testable |
| 10 | Added explicit **phase dependency map** | Phase 3 items would fail if built before Phase 0 |
| 11 | Corrected LLM status: `qwen2.5-coder:1.5b` **is currently loaded** | `ollama ps` confirmed; not idle as v1 stated |