bytelyst-devops-tools/docs/hermes_dashboard_v2_roadmap.md
Hermes VM c6ec1a06ea docs(dashboard): Phase 5 P1 — document privilege surface; gate /code-quality/check
Closes the final Phase 5 P1 checkbox and REVIEW_ACTIONS #6.

The backend container has root-equivalent host access via the docker
socket, host log mounts, and the VM scripts mount, but until now the
"who can do what to the host?" answer was scattered across compose
files and route handlers. This commit centralizes it.

DEPLOYMENT.md gains a "Privilege Surface" section that lists:

  - every host mount + container path + mode + purpose
  - every shell-outing route, the actual commands it runs, and the
    auth gate on each
  - what an admin token can do today (≈ host shell)
  - five known sharp edges (un-allow-listed container names, unvalidated
    projectPath, no per-route audit-log on shell-outs, container runs
    as root, global rate-limit only)
  - a P1 → P3 mitigation roadmap (allow-list wrapper around shell-outs,
    projectPath validation, audit-logging shell-outs, drop root in
    container, replace docker.sock with a verb-restricted proxy)

Concurrent code fix: `POST /code-quality/check` was reachable
**unauthenticated** despite shelling out to `npm run typecheck/lint/
build/test:run` in a caller-supplied `projectPath`. Added
`preHandler: requireAdmin` to bring it in line with every other
shell-outing route in the dashboard. Same commit because the
documentation table promises this gate exists.

REVIEW_ACTIONS #6 marked RESOLVED with the rationale; roadmap checkbox
ticked. Tests, typecheck, lint (0 errors), build, and coverage gate
(≥95% lines on every gated file) all stay green.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 07:05:51 +00:00


Hermes Mission Control v2 — Two-Instance Dashboard Roadmap

Date: 2026-05-30
Owner: ByteLyst / S (saravanakumardb)
Repo: learning_ai_devops_tools (GitHub remote: bytelyst-devops-tools)
Dashboard: dashboard/ (Next.js 16 web in web/, container port 3000 / Compose host port 3049; Fastify 5 backend in backend/, port 4004)

What This Roadmap Is

The two existing roadmaps are effectively complete for their original scope:

  • docs/hermes_dashboard_roadmap.md — built the 7-pane Hermes Mission Control UI. All checklist items are checked, but every pane except the live ops panel renders mock/seed data from web/src/lib/hermes.
  • docs/hermes-setup-upgrade-roadmap.md — stood up and hardened the two Hermes instances on the VM (~68% checked; the open items are mostly Uma parity, credentials, and policy decisions).

This v2 roadmap supersedes the open dashboard-related items in both and adds the missing theme: power the unified dashboard with real data from BOTH Hermes instances — Vijay/root and Bheem/Uma — and close every known gap (mock data, backend hardening, two-instance parity, app/CI hygiene, UX polish, security, and notifications).

It does not re-do anything already verified in v1. It builds on the one piece that is already real and two-instance-aware: the backend hermes-ops module.

The Two Instances (authoritative topology)

Source of truth is dashboard/backend/src/modules/hermes-ops/repository.ts.

| Codename | OS user | HERMES_HOME | Gateway service | Private dashboard | Backup timer | Backup repo (local → GitHub) | Drive folder |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Vijay | root | /root/.hermes | hermes-gateway.service (system) | hermes-root-dashboard.service → :9119 | hermes-root-backup.timer | /root/repos/bytelyst_hostinger_hermes_vm → saravanakumardb/bytelyst_hostinger_hermes_vm | Vijay Drive |
| Bheem | uma | /home/uma/.hermes | uma-hermes-gateway.service (uma user systemd) | uma-hermes-dashboard.service → :9120 | uma-hermes-backup.timer | /home/uma/repos/uma_hostinger_hermes_vm → umadev0931/uma_hostinger_hermes_vm | Bheem Drive |

  • Both reachable privately only over Tailscale 100.87.53.10 (:9119 Vijay, :9120 Bheem). No public Caddy route. This is a hard guardrail.
  • Plus a root-level hermes-emergency-drive-upload.timer that pushes encrypted bundles to each instance's Google Drive folder.

Three dashboard surfaces exist today

  1. Native per-instance Hermes dashboards on :9119 (Vijay) and :9120 (Bheem), one per user, over Tailscale. Operationally scoped, separate from this codebase.
  2. ByteLyst Mission Control — the /hermes/* suite in this repo's DevOps dashboard (7 panes). Intended to be the unified pane-of-glass over both instances.
  3. The live hermes-ops panel — embedded in the Mission Control overview (web/src/components/hermes-ops-panel.tsx), already rendering real, both-instance status: gateways, private dashboards, backup timers, repo HEAD/cleanliness, Google token, restore-payload counts, cron timers, emergency drive, Tailscale IP, active session count, and warnings.

Decision baked into this roadmap: invest in surface #2/#3 — make unified Mission Control the real two-instance command center — rather than expanding the native per-instance dashboards. The hermes-ops module is the seed; everything below extends it.

Goal / Target State

A single private dashboard where, for both Vijay and Bheem, S can see at a glance:

  • live instance health (gateway, dashboard, cron, backup freshness, disk/mem, Google auth) — real, cached, robust
  • everything each Hermes is doing / did / failed / is blocked on — from real session/cron/task telemetry, filterable by instance
  • backup & disaster-recovery posture at parity across both instances
  • what needs founder attention, pushed to the right Telegram chat

…with the whole thing private-only, authenticated, tested, and in CI.


Gap Inventory (consolidated)

| ID | Gap | Source | Severity |
| --- | --- | --- | --- |
| G1 | 6 of 7 Mission Control panes are mock (web/src/lib/hermes) | v1 roadmap / README "read-only mock" | High |
| G2 | Tasks/Products/History/Agents have no instance dimension (Vijay vs Bheem) | this review | High |
| G3 | hermes-ops backend not hardened: no cache (~20 execFile per 60s poll), brittle Uma checks (ps string-match + hardcoded existsSync), errors swallowed to null, no tests | REVIEW_ACTIONS P1 #3 | High |
| G4 | No real telemetry ingestion (sessions, cron, memory, skills, alerts, backup history, task events) | v1 roadmap "real telemetry plan" | High |
| G5 | App/CI hygiene: CI path wrong (P0), lint is a no-op echo (P0), thin tests (P1), SSE disabled (P1), doc drift 3000 vs 3049 (P1), privileged docker socket/host mounts undocumented (P1), no prod logging (P2), E2E not wired (P2) | REVIEW_ACTIONS | P0–P2 |
| G6 | Mission Control polish: warning severity filters, trend cards, ops→ledger deep links, per-instance action rows, theme toggle | v1 "Next Improvements" | Med |
| G7 | Bheem/Uma parity: no persistent backup repo + timer equivalent to root, no watchdog cron, restore never tested, no quarterly drill | setup roadmap (lines 20, 146, 172, 185, 432, 447, 461) | High |
| G8 | Security/access: devops dashboard hermes routes need auth; security.redact_secrets / privacy.redact_pii undecided; GitHub/Gitea least-privilege audit + rotation pending | setup roadmap Phase 11 | High |
| G9 | Notifications: dashboard warnings not pushed to Telegram; approval-prompt flow + media/file delivery UX unvalidated | setup roadmap Phase 6 | Med |

Phase 0 — Guardrails (must hold throughout)

  • No public Caddy route or public listener for any Hermes dashboard, the hermes-ops API, or the DevOps dashboard's hermes data. Private-only via Tailscale / SSH tunnel / loopback.
  • Keep Hermes command approvals at manual or smart; no gateway approval bypass.
  • No raw secrets, tokens, OAuth files, state.db, or SQLite WAL/SHM in any git backup or in this repo.
  • Re-run the Caddy/port review (docs/hermes-operations.md) before adding any route or hostname.

Phase 1 — Make the unified backend authoritative and hardened (G3)

The hermes-ops snapshot becomes the single source of truth for live status. Before building UI on it, harden it.

  • Add a short-TTL cache (mirror the health module's 30s cache) so the 60s panel poll doesn't fan out ~20 systemctl/git/ps/du subprocesses every refresh; serve cached snapshot with generatedAt.
  • Replace the brittle Bheem/Uma checks in repository.ts with runuser + systemctl --user probes, keeping the ps/existsSync checks as a fallback so that a failed probe degrades to the legacy check rather than reporting a false "down":
    • isUmaGatewayActive() (currently ps -eo string match) → runuser -u uma -- systemctl --user is-active uma-hermes-gateway.service (or --machine=uma@.host).
    • isUmaGatewayEnabled() (currently hardcoded existsSync of a wants-symlink) → systemctl --user is-enabled via the same path.
  • Stop swallowing every failure to null indiscriminately: distinguish "unit inactive" from "probe failed/timed out" and surface per-field status so the UI can show unknown vs down.
  • Add Zod validation + a stable typed contract for HermesOpsSnapshot on the route.
  • Add unit tests for the hermes-ops repository (mock execFile/fs) — closes the REVIEW_ACTIONS "only services has tests" gap for this module.
  • Read Bheem/Uma state via a self-reporting ops exporter (Decision #2): a read-only uma user-systemd timer writes a sanitized JSON snapshot to a known path; the root backend reads + aggregates it (Vijay gets a symmetric exporter). Interim stopgap until it ships: runuser -u uma -- systemctl --user is-active/is-enabled instead of the ps/existsSync checks.
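The short-TTL cache from the first bullet could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's actual code: `createSnapshotCache`, the `load` callback, and the injectable `now` clock are hypothetical names, and the 30-second default simply mirrors the health-module figure quoted above. Concurrent callers share one in-flight load, so a burst of panel polls triggers a single subprocess fan-out.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not the real hermes-ops API.
interface CachedSnapshot<T> {
  generatedAt: string; // ISO timestamp of when the snapshot was actually collected
  data: T;
}

function createSnapshotCache<T>(
  load: () => Promise<T>,       // the expensive collector (systemctl/git/ps/du fan-out)
  ttlMs: number = 30_000,       // 30s, mirroring the health module's cache
  now: () => number = Date.now, // injectable clock so the cache can be unit-tested
): () => Promise<CachedSnapshot<T>> {
  let cached: CachedSnapshot<T> | undefined;
  let cachedAt = 0;
  let inFlight: Promise<CachedSnapshot<T>> | undefined;

  return async () => {
    // Serve the cached snapshot while it is younger than the TTL.
    if (cached !== undefined && now() - cachedAt < ttlMs) return cached;
    // Share a load that is already running instead of starting another one.
    if (inFlight !== undefined) return inFlight;

    inFlight = load()
      .then((data) => {
        cachedAt = now();
        const fresh: CachedSnapshot<T> = {
          generatedAt: new Date(cachedAt).toISOString(),
          data,
        };
        cached = fresh;
        return fresh;
      })
      .finally(() => {
        // Clear the marker on success and on failure so the next call can retry.
        inFlight = undefined;
      });
    return inFlight;
  };
}
```

A failed load is not cached in this sketch, so the next poll retries immediately; whether to serve the last good snapshot instead is a design choice the roadmap leaves open.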

Phase 2 — Instance dimension across Mission Control (G2)

  • Add instanceId: 'vijay' | 'bheem' to the core types in web/src/lib/hermes (HermesTask, HermesProduct, HermesEvent, HermesRun, agent/overview models) and to the backend contracts.
  • Add a global instance switcher in HermesShell (All / Vijay (root) / Bheem (uma)) with persisted selection; thread it through every pane.
  • Overview: show per-instance cards and a combined roll-up (extend the existing "Healthy instances 2/2" pattern from the ops panel to the whole overview).
  • Ledger / Products / History / Agents: filter and badge by instance.

Phase 3 — Real per-instance telemetry, replacing mock pane by pane (G1, G4)

Define the ingestion contract first, then convert panes. Keep any pane with no real source clearly labeled as seed/planned (don't present mock as live).

  • Primary source = real artifacts (Decision #1): sessions, cron, watchdog alerts, backup history — read-only and cached. Treat a Hermes session as the work unit. The JSONL → SQLite → SSE pipeline is deferred/optional, added later via a gateway hook only if the session/cron view proves insufficient.
  • Backend endpoints per instance, reading real Hermes state:
    • Sessions + stats (hermes sessions stats — baseline today: Vijay 59 sessions/5225 msgs, Bheem 18/635).
    • Cron jobs (hermes cron list) including backup + watchdog timers.
    • Memory + skills inventory.
    • Watchdog alerts feed (from hermes-health-watchdog.py output / logs).
    • Backup history (git log of each backup repo: HEAD, last-commit age, freshness).
  • Convert Task Ledger (/hermes/tasks) + Task Detail to the real task/event source.
  • Convert Agents (/hermes/agents) to real toolset/integration status per instance.
  • Convert History (/hermes/history) to real session/cron/backup trends.
  • Products (/hermes/products): repoint at the real service registry (backend/src/modules/services/) + health module (Decision #3); drop the fabricated 50-item mock. Optional manual entries for not-yet-deployed products come later.
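For the backup-history item, the freshness calculation might look like the sketch below. It assumes the backend obtains one line per repo in the shape printed by `git log -1 --format="%h %ct"` (abbreviated hash plus committer timestamp in epoch seconds). The helper names and the 26-hour staleness threshold are assumptions for illustration; the roadmap does not specify a threshold.

```typescript
// Hypothetical sketch: derive backup freshness from the last commit of a backup repo.
type Freshness = 'fresh' | 'stale' | 'unknown';

interface BackupHead {
  head: string;           // abbreviated commit hash
  committedAtSec: number; // committer timestamp, seconds since the Unix epoch
}

interface BackupStatus {
  head: string | null;
  ageHours: number | null;
  freshness: Freshness;
}

// Parses a single "<hash> <epoch-seconds>" line; anything else (for example an
// error message from git) yields null instead of a guessed value.
function parseHeadLine(line: string): BackupHead | null {
  const match = /^([0-9a-f]{4,40}) (\d+)$/.exec(line.trim());
  if (match === null) return null;
  return { head: match[1], committedAtSec: Number(match[2]) };
}

// maxAgeHours defaults to an ASSUMED 26h (daily timer plus slack); tune per instance.
function backupStatus(
  head: BackupHead | null,
  nowMs: number,
  maxAgeHours: number = 26,
): BackupStatus {
  if (head === null) return { head: null, ageHours: null, freshness: 'unknown' };
  // Clamp at zero so small clock skew never produces a negative age.
  const ageHours = Math.max(0, (nowMs / 1000 - head.committedAtSec) / 3600);
  return {
    head: head.head,
    ageHours,
    freshness: ageHours <= maxAgeHours ? 'fresh' : 'stale',
  };
}
```

An unreadable repo maps to `unknown` rather than `stale`, which keeps the "probe failed" case distinct from "backup is old".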

Phase 4 — Bheem/Uma parity so the dashboard shows two equal instances (G7)

This is the biggest operational asymmetry and the reason half the ops-panel warnings are Bheem-only.

  • Stand up a Uma persistent backup repo + uma-hermes-backup.timer mirroring the root design (sanitized hermes_persistent_backup/, secrets and state.db excluded), pushing to umadev0931/uma_hostinger_hermes_vm with a Uma-owned, repo-scoped token (Bheem self-pushes; root no longer pushes Uma's backup — Decision #5).
  • Install a Uma health watchdog (mirror scripts/hermes-health-watchdog.py), silent-on-success, alerting Uma's Telegram.
  • Run the first Uma restore rehearsal into a temporary HERMES_HOME; document in docs/hermes-operations.md / docs/hermes-disaster-recovery.md.
  • Schedule a quarterly Uma restore-drill reminder (parity with root).
  • Confirm these close the corresponding Bheem warnings emitted by getHermesOpsSnapshot() (backup timer active, repo HEAD readable + clean, Google token present).

Phase 5 — Dashboard app hardening (G5)

  • P0: Fix the CI workspace path (${{ gitea.workspace }}) in .gitea/workflows/ci.yml, DEPLOYMENT.md, and scripts/deploy-hotcopy.sh (which currently point at the non-existent /opt/bytelyst/bytelyst-devops-tools/...).
  • P0: Replace the no-op lint echo with real linting (next lint for web, minimal ESLint for backend); make pnpm lint fail on bad code.
  • P1: Add tests for auth, csrf, deployments/orchestrator, health, and hermes-ops; add pnpm test:coverage gate. (35 new unit tests; v8 coverage thresholds gated on the six tested files in backend/vitest.config.ts (≥85% lines/funcs/stmts, ≥65% branches), wired into Gitea CI as a dedicated step. Today's actuals: ≥95% lines on every gated file. Ratchet up as more modules get tested.)
  • P1: Resolve the SSE TODO — either ship a Fastify-5-compatible log-stream or remove the SSE claim from docs/UI. (Chose remove: dropped fastify-sse-v2 dep, deleted commented-out plugin import + TODO from server.ts and deployments/routes.ts, rewrote the README/DEPLOYMENT.md "Log Streaming" section as "Logs (JSON-polled, no SSE)". Web client already polls /deployments/:id/logs via apiRequest — no UI change needed. If a real-time stream is wanted later, implement via reply.raw and update docs in the same change.)
  • P1: Fix doc drift (web port 3000 vs 3049; endpoint URLs; merge duplicate deployment docs). (DEPLOYMENT.md is now canonical; DEPLOYMENT_GUIDE.md reduced to a redirect stub; deploy.sh updated. Added an explicit "Ports — quick reference" table to DEPLOYMENT.md distinguishing container :3000, Compose host :3049, Traefik production. README and ENDPOINTS.md cross-link to it. Marks REVIEW_ACTIONS #5 resolved.)
  • P1: Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths — blast radius must be written down; consider an allow-list wrapper over the raw socket). (New "Privilege Surface" section in dashboard/DEPLOYMENT.md enumerating every mount, every shell-outing route + commands + auth gate, the blast-radius if an admin token leaks, five known sharp edges, and a P1→P3 mitigation roadmap. Concurrent fix: /code-quality/check was reachable unauthenticated despite shelling out to npm run in a caller-supplied path — requireAdmin added. Allow-list wrapper around docker/bash/npm invocations and projectPath validation are queued as the next P1s; running the container as non-root and replacing the raw docker.sock with a verb-restricted proxy are P2/P3.)
  • P2: Structured backend logging (pino → stdout); wire E2E (hermes.spec.ts) into CI with a started stack.
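The allow-list wrapper queued as the next P1 could start from something like the sketch below. It is an assumption-laden illustration: `assertAllowedContainer`, `dockerLogsArgs`, the character check, and the tail limits are all hypothetical, and `docker logs --tail` is used only as an example of a shell-out target. The point it shows is that the name is validated against an explicit list and passed as an argv element, never interpolated into a shell string.

```typescript
// Hypothetical sketch of an allow-list check in front of a docker invocation.

// Conservative character check (an assumption for this sketch): rejects whitespace,
// shell metacharacters, and leading dashes that could be read as flags.
const SAFE_NAME = /^[a-zA-Z0-9][a-zA-Z0-9_.-]*$/;

function assertAllowedContainer(name: string, allowList: ReadonlySet<string>): string {
  if (!SAFE_NAME.test(name)) {
    throw new Error(`rejected container name: ${JSON.stringify(name)}`);
  }
  if (!allowList.has(name)) {
    throw new Error(`container not on allow-list: ${name}`);
  }
  return name;
}

// Builds an argv array suitable for execFile('docker', args); no shell is involved.
function dockerLogsArgs(
  name: string,
  allowList: ReadonlySet<string>,
  tail: number = 200,
): string[] {
  const safeName = assertAllowedContainer(name, allowList);
  // Clamp the line count to an ASSUMED 1..5000 range; fall back to 200 on NaN/Infinity.
  const safeTail = Number.isFinite(tail)
    ? Math.min(Math.max(Math.trunc(tail), 1), 5000)
    : 200;
  return ['logs', '--tail', String(safeTail), safeName];
}
```

The same pattern would apply to `projectPath`: resolve it, compare it against an explicit set of known project roots, and reject anything else before running `npm`.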

Phase 6 — Mission Control UX polish (G6)

  • Severity-tag warnings (info/warn/critical) and add a severity filter to the ops panel.
  • Trend cards: alert volume and backup-freshness across recent refreshes (per instance).
  • Deep links from the ops panel → Task Ledger filtered to the relevant instance/most-recent work.
  • Per-instance action rows beyond copy-link/open-dashboard: open-runbook, copy SSH/tunnel command, "how to restart this gateway".
  • Optional dark/light theme toggle if the shell supports it.
  • Unified alerts feed across both instances on the overview.
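Severity tagging and the ops-panel filter can be sketched as below. The `OpsWarning` shape, the rank table, and both helper names are assumptions made for this example; only the three severity levels come from the list above.

```typescript
// Hypothetical sketch of severity-tagged warnings and a minimum-severity filter.
type HermesInstanceId = 'vijay' | 'bheem';
type Severity = 'info' | 'warn' | 'critical';

interface OpsWarning {
  instanceId: HermesInstanceId;
  message: string;
  severity: Severity;
}

// Numeric rank so severities can be compared and sorted.
const SEVERITY_RANK: Record<Severity, number> = { info: 0, warn: 1, critical: 2 };

// Keeps warnings at or above `minimum`, most severe first. `filter` returns a new
// array, so sorting it does not mutate the caller's list.
function filterBySeverity(warnings: readonly OpsWarning[], minimum: Severity): OpsWarning[] {
  return warnings
    .filter((w) => SEVERITY_RANK[w.severity] >= SEVERITY_RANK[minimum])
    .sort((a, b) => SEVERITY_RANK[b.severity] - SEVERITY_RANK[a.severity]);
}

// Counts per severity, usable for filter chips or trend cards.
function countBySeverity(warnings: readonly OpsWarning[]): Record<Severity, number> {
  const counts: Record<Severity, number> = { info: 0, warn: 0, critical: 0 };
  for (const w of warnings) counts[w.severity] += 1;
  return counts;
}
```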

Phase 7 — Security & access (G8)

  • Require authentication on the DevOps dashboard's hermes routes/endpoints (reuse platform-service auth already used elsewhere).
  • Decide and document security.redact_secrets and privacy.redact_pii for gateway sessions (per instance).
  • Finish the GitHub/Gitea least-privilege token audit (root currently pushes both repos) and rotate any migrated/exposed credentials — completed naturally by Decision #5 (Bheem self-pushes with its own scoped token).
  • Keep all hermes data private-only; never expose the hermes-ops snapshot or task data on a public route.

Phase 8 — Notifications & Telegram loop (G9)

  • Push new dashboard-detected warnings to the correct Telegram (Vijay → root chat, Bheem → Uma chat), reusing the watchdog delivery path; silent on healthy.
  • Validate the Telegram approval-prompt flow and media/file delivery end-to-end (the two unchecked v1 items).
  • Preserve the numbered-emoji progress convention (1️⃣, 2️⃣, …) for completion updates.
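The "push only new warnings, to the right chat, silent on healthy" rule can be expressed as a pure planning step, sketched below. `planNotifications`, the `Outgoing` shape, and the chat identifiers are hypothetical; actual delivery is assumed to reuse the existing watchdog path and is deliberately left out.

```typescript
// Hypothetical sketch: decide which Telegram messages to send after a snapshot refresh.
type HermesInstanceId = 'vijay' | 'bheem';

interface InstanceWarning {
  instanceId: HermesInstanceId;
  message: string;
}

interface Outgoing {
  chat: string; // opaque chat identifier supplied by configuration
  text: string;
}

// Compares the previous and current warning sets and emits one message per warning
// that was not present before. No new warnings means an empty plan (silent on healthy).
function planNotifications(
  previous: readonly InstanceWarning[],
  current: readonly InstanceWarning[],
  chatByInstance: Record<HermesInstanceId, string>,
): Outgoing[] {
  const key = (w: InstanceWarning): string => `${w.instanceId}:${w.message}`;
  const seen = new Set(previous.map(key));
  const plan: Outgoing[] = [];
  for (const warning of current) {
    if (seen.has(key(warning))) continue; // already reported, or duplicated in `current`
    seen.add(key(warning));
    plan.push({
      chat: chatByInstance[warning.instanceId],
      text: `[${warning.instanceId}] ${warning.message}`,
    });
  }
  return plan;
}
```

Keeping the diff separate from the sender makes the routing rule (Vijay to the root chat, Bheem to the Uma chat) testable without touching Telegram.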

Data Model Additions

// web/src/lib/hermes (and mirrored in backend contracts)
export type HermesInstanceId = 'vijay' | 'bheem';

export interface HermesInstanceRef {
  id: HermesInstanceId;
  label: string;        // "Vijay / root", "Bheem / Uma"
  user: string;         // "root" | "uma"
  hermesHome: string;
}

// add `instanceId: HermesInstanceId` to:
//   HermesTask, HermesProduct, HermesEvent, HermesRun, HermesAgentStatus, HermesOverview
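As a usage illustration of these types, a registry keyed by instance id plus a tagging helper might look like the sketch below. `HERMES_INSTANCES`, `WithInstance`, and `tagWithInstance` are hypothetical names; the user and HERMES_HOME values are taken from the topology table earlier in this document.

```typescript
// Hypothetical sketch built on the data-model types above (repeated for self-containment).
type HermesInstanceId = 'vijay' | 'bheem';

interface HermesInstanceRef {
  id: HermesInstanceId;
  label: string;
  user: string;
  hermesHome: string;
}

// Single lookup table so labels and paths are not re-typed in every pane.
const HERMES_INSTANCES: Record<HermesInstanceId, HermesInstanceRef> = {
  vijay: { id: 'vijay', label: 'Vijay / root', user: 'root', hermesHome: '/root/.hermes' },
  bheem: { id: 'bheem', label: 'Bheem / Uma', user: 'uma', hermesHome: '/home/uma/.hermes' },
};

// Adds the instance dimension to an existing record type without redefining it.
type WithInstance<T> = T & { instanceId: HermesInstanceId };

function tagWithInstance<T extends object>(
  item: T,
  instanceId: HermesInstanceId,
): WithInstance<T> {
  return { ...item, instanceId };
}

// Human-readable badge text for a tagged record.
function instanceLabel(item: { instanceId: HermesInstanceId }): string {
  return HERMES_INSTANCES[item.instanceId].label;
}
```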

Acceptance Criteria

This roadmap is complete when:

  • The overview, ledger, agents, and history panes render real data for both Vijay and Bheem, filterable by instance; any pane that still lacks a real source stays on seed data and is clearly labeled as such.
  • hermes-ops is cached, uses robust Uma user-systemd checks, distinguishes unknown vs down, and has unit tests.
  • Bheem has a persistent backup repo + timer, a watchdog, and one completed restore rehearsal — and the dashboard shows 2/2 healthy with zero standing Bheem warnings.
  • CI is green on the correct path, lint is real, and coverage includes auth/csrf/orchestrator/health/hermes-ops.
  • Hermes routes require auth and remain private-only; redact policies are decided and documented.
  • Dashboard warnings reach the correct Telegram chat per instance.

Implementation Status Checklist

Update only with evidence (source review, tests, build output, or browser/VM verification).

  • Phase 0 — Guardrails reconfirmed
  • Phase 1 — hermes-ops hardened + tested
  • Phase 2 — Instance dimension + switcher
  • Phase 3 — Real telemetry ingestion + panes converted
  • Phase 4 — Bheem/Uma parity (backup, watchdog, restore drill)
  • Phase 5 — App/CI hardening (P0 and P1 done; P2 pending)
  • Phase 6 — UX polish
  • Phase 7 — Security & access
  • Phase 8 — Notifications & Telegram

Decisions (resolved 2026-05-30)

  1. Task data source — derive from real artifacts now; defer the JSONL pipeline. Hermes' real unit of work is the session (+ cron jobs), and there's no evidence the agent emits a task-level JSONL ledger today. Build the ledger/activity views from what already exists — hermes sessions (+ stats), hermes cron list (+ last-run), watchdog alerts, and backup git history. Add a JSONL session/event pipeline → SQLite later and only if the session/cron view proves insufficient (via a gateway hook that appends records). Do not fabricate a task store.
  2. Reading Bheem state — self-reporting ops exporter per instance. Each instance runs a tiny read-only exporter (Bheem as a uma user-systemd timer, Vijay symmetrically) that writes a sanitized JSON snapshot (booleans, counts, timestamps, short HEADs — no secrets) to a known path; the unified backend just reads and aggregates the two files. No cross-user command execution or reaching into /home/uma/.hermes. Interim stopgap until the exporter ships: replace the brittle ps/existsSync Uma checks with runuser -u uma -- systemctl --user is-active/is-enabled.
  3. Products — repoint at the real service registry; drop the fabricated mock. The dashboard already has a live service registry (backend/src/modules/services/, with health). Back the Products pane with that real data instead of a 50-item fiction; allow optional manual entries later for not-yet-deployed products. Relabel clearly until the mapping lands.
  4. Auth — reuse platform-service JWT, defense-in-depth. Put the hermes routes behind the same platform-service auth (backend/src/lib/auth.ts) the rest of the dashboard uses; keep the network private (Tailscale/loopback) as a second layer. No separate basic-auth gate (that's only the never-used "if forced public" path).
  5. Bheem backup — same repo + Drive, but Uma-owned least-privilege token; Bheem self-pushes. Keep umadev0931/uma_hostinger_hermes_vm and Bheem Drive, but give Bheem its own repo-scoped credential so it backs itself up rather than depending on root's broad credential. Root stops pushing Uma's backup; this also closes the standing GitHub least-privilege audit item.
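Decision 2 (self-reporting exporters) implies a small read-and-aggregate step in the unified backend, sketched below under stated assumptions. The snapshot field names (`generatedAt`, `gatewayActive`, `backupTimerActive`), the five-minute staleness window, and all function names are hypothetical; the exporter's real schema is not defined in this roadmap. The sketch takes the raw file text (or `null` when the file is missing) so that a missing, malformed, or stale snapshot surfaces as `unknown` instead of a false `down`.

```typescript
// Hypothetical sketch: aggregate two exporter snapshots into per-instance gateway state.
type HermesInstanceId = 'vijay' | 'bheem';
type ProbeState = 'up' | 'down' | 'unknown';

interface ExporterSnapshot {
  generatedAt: string;        // ISO timestamp written by the exporter
  gatewayActive: boolean;
  backupTimerActive: boolean;
}

// Validates the sanitized JSON by hand; any deviation yields null rather than a guess.
function parseSnapshot(raw: string | null): ExporterSnapshot | null {
  if (raw === null) return null;
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== 'object' || value === null) return null;
  const v = value as Record<string, unknown>;
  if (typeof v.generatedAt !== 'string' || Number.isNaN(Date.parse(v.generatedAt))) return null;
  if (typeof v.gatewayActive !== 'boolean' || typeof v.backupTimerActive !== 'boolean') return null;
  return {
    generatedAt: v.generatedAt,
    gatewayActive: v.gatewayActive,
    backupTimerActive: v.backupTimerActive,
  };
}

// A snapshot older than maxAgeMs (ASSUMED 5 minutes) is treated as unknown, because a
// dead exporter says nothing about the gateway itself.
function gatewayState(
  snapshot: ExporterSnapshot | null,
  nowMs: number,
  maxAgeMs: number = 5 * 60_000,
): ProbeState {
  if (snapshot === null) return 'unknown';
  if (nowMs - Date.parse(snapshot.generatedAt) > maxAgeMs) return 'unknown';
  return snapshot.gatewayActive ? 'up' : 'down';
}

function aggregateGatewayStates(
  rawByInstance: Record<HermesInstanceId, string | null>,
  nowMs: number,
): Record<HermesInstanceId, ProbeState> {
  return {
    vijay: gatewayState(parseSnapshot(rawByInstance.vijay), nowMs),
    bheem: gatewayState(parseSnapshot(rawByInstance.bheem), nowMs),
  };
}
```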

Suggested Execution Order

  1. Phase 5 P0 (CI path + lint) — unblocks everything.
  2. Phase 1 (harden hermes-ops) — the foundation the real UI sits on.
  3. Phase 2 (instance dimension) + Phase 4 (Bheem parity) in parallel — make "two instances" first-class in both data and ops.
  4. Phase 3 (real telemetry, pane by pane).
  5. Phase 7 (auth) before any wider access; Phase 8 (Telegram) and Phase 6 (polish) last.

Each item is sized to land as a single PR with incremental commits to main.