# Hermes Mission Control v2 — Two-Instance Dashboard Roadmap

**Date:** 2026-05-30
**Owner:** ByteLyst / S (`saravanakumardb`)
**Repo:** `learning_ai_devops_tools` (GitHub remote: `bytelyst-devops-tools`)
**Dashboard:** `dashboard/` — Next.js 16 web (`web/`, port 3000 / container 3049) + Fastify 5 backend (`backend/`, port 4004)

## What This Roadmap Is

The two existing roadmaps are effectively complete for their original scope:

- `docs/hermes_dashboard_roadmap.md` — built the 7-pane **Hermes Mission Control** UI. All checklist items are checked, **but every pane except the live ops panel renders mock/seed data** from `web/src/lib/hermes`.
- `docs/hermes-setup-upgrade-roadmap.md` — stood up and hardened the two Hermes instances on the VM (~68% checked; the open items are mostly Uma parity, credentials, and policy decisions).

This v2 roadmap **supersedes the open dashboard-related items in both** and adds the missing theme: **power the unified dashboard with real data from BOTH Hermes instances — Vijay/root and Bheem/Uma — and close every known gap** (mock data, backend hardening, two-instance parity, app/CI hygiene, UX polish, security, and notifications).

It does **not** re-do anything already verified in v1. It builds on the one piece that is already real and two-instance-aware: the backend `hermes-ops` module.

## The Two Instances (authoritative topology)

Source of truth is `dashboard/backend/src/modules/hermes-ops/repository.ts`.

| Codename | OS user | `HERMES_HOME` | Gateway service | Private dashboard | Backup timer | Backup repo (local → GitHub) | Drive folder |
|----------|---------|---------------|-----------------|-------------------|--------------|------------------------------|--------------|
| **Vijay** | `root` | `/root/.hermes` | `hermes-gateway.service` (system) | `hermes-root-dashboard.service` → :9119 | `hermes-root-backup.timer` | `/root/repos/bytelyst_hostinger_hermes_vm` → `saravanakumardb/bytelyst_hostinger_hermes_vm` | Vijay Drive |
| **Bheem** | `uma` | `/home/uma/.hermes` | `uma-hermes-gateway.service` (uma **user** systemd) | `uma-hermes-dashboard.service` → :9120 | `uma-hermes-backup.timer` | `/home/uma/repos/uma_hostinger_hermes_vm` → `umadev0931/uma_hostinger_hermes_vm` | Bheem Drive |

- Both reachable **privately only** over Tailscale `100.87.53.10` (`:9119` Vijay, `:9120` Bheem). No public Caddy route. This is a hard guardrail.
- Plus a root-level `hermes-emergency-drive-upload.timer` that pushes encrypted bundles to each instance's Google Drive folder.

### Three dashboard surfaces exist today

1. **Native per-instance Hermes dashboards** — `:9119` (Vijay) and `:9120` (Bheem), one per user, over Tailscale. Operationally scoped, separate from this codebase.
2. **ByteLyst Mission Control** — the `/hermes/*` suite in this repo's DevOps dashboard (7 panes). **Intended to be the unified pane-of-glass over both instances.**
3. **The live `hermes-ops` panel** — embedded in the Mission Control overview (`web/src/components/hermes-ops-panel.tsx`), already rendering **real, both-instance** status: gateways, private dashboards, backup timers, repo HEAD/cleanliness, Google token, restore-payload counts, cron timers, emergency drive, Tailscale IP, active session count, and warnings.

**Decision baked into this roadmap:** invest in surface #2/#3 — make unified Mission Control the real two-instance command center — rather than expanding the native per-instance dashboards.
The `hermes-ops` module is the seed; everything below extends it.

## Goal / Target State

A single private dashboard where, for **both Vijay and Bheem**, S can see at a glance:

- live instance health (gateway, dashboard, cron, backup freshness, disk/mem, Google auth) — **real, cached, robust**
- everything each Hermes is doing / did / failed / is blocked on — **from real session/cron/task telemetry, filterable by instance**
- backup & disaster-recovery posture **at parity across both instances**
- what needs founder attention, pushed to the right Telegram chat

…with the whole thing private-only, authenticated, tested, and in CI.

---

## Gap Inventory (consolidated)

| ID | Gap | Source | Severity |
|----|-----|--------|----------|
| G1 | 6 of 7 Mission Control panes are mock (`web/src/lib/hermes`) | v1 roadmap / README "read-only mock" | High |
| G2 | Tasks/Products/History/Agents have **no instance dimension** (Vijay vs Bheem) | this review | High |
| G3 | `hermes-ops` backend not hardened: no cache (~20 `execFile` per 60s poll), brittle Uma checks (`ps` string-match + hardcoded `existsSync`), errors swallowed to `null`, **no tests** | REVIEW_ACTIONS P1 #3 | High |
| G4 | No real telemetry ingestion (sessions, cron, memory, skills, alerts, backup history, task events) | v1 roadmap "real telemetry plan" | High |
| G5 | App/CI hygiene: CI path wrong (P0), lint is a no-op echo (P0), thin tests (P1), SSE disabled (P1), doc drift 3000 vs 3049 (P1), privileged docker socket/host mounts undocumented (P1), no prod logging (P2), E2E not wired (P2) | REVIEW_ACTIONS | P0–P2 |
| G6 | Mission Control polish: warning severity filters, trend cards, ops→ledger deep links, per-instance action rows, theme toggle | v1 "Next Improvements" | Med |
| G7 | **Bheem/Uma parity:** no persistent backup repo + timer equivalent to root, no watchdog cron, restore never tested, no quarterly drill | setup roadmap (lines 20, 146, 172, 185, 432, 447, 461) | High |
| G8 | Security/access: devops dashboard hermes routes need auth; `security.redact_secrets` / `privacy.redact_pii` undecided; GitHub/Gitea least-privilege audit + rotation pending | setup roadmap Phase 11 | High |
| G9 | Notifications: dashboard warnings not pushed to Telegram; approval-prompt flow + media/file delivery UX unvalidated | setup roadmap Phase 6 | Med |

---

## Phase 0 — Guardrails (must hold throughout)

- [ ] No public Caddy route or public listener for **any** Hermes dashboard, the `hermes-ops` API, or the DevOps dashboard's hermes data. Private-only via Tailscale / SSH tunnel / loopback.
- [ ] Keep Hermes command approvals at `manual` or `smart`; no gateway approval bypass.
- [ ] No raw secrets, tokens, OAuth files, `state.db`, or SQLite WAL/SHM in any git backup or in this repo.
- [ ] Re-run the Caddy/port review (`docs/hermes-operations.md`) before adding any route or hostname.

## Phase 1 — Make the unified backend authoritative and hardened (G3)

The `hermes-ops` snapshot becomes the single source of truth for live status. Before building UI on it, harden it.

- [ ] Add a short-TTL cache (mirror the health module's 30s cache) so the 60s panel poll doesn't fan out ~20 `systemctl`/`git`/`ps`/`du` subprocesses every refresh; serve cached snapshot with `generatedAt`.
- [ ] Replace brittle Bheem/Uma checks in `repository.ts`:
  - `isUmaGatewayActive()` (currently `ps -eo` string match) → `runuser -u uma -- systemctl --user is-active uma-hermes-gateway.service` (or `--machine=uma@.host`).
  - `isUmaGatewayEnabled()` (currently hardcoded `existsSync` of a wants-symlink) → `systemctl --user is-enabled` via the same path.
- [ ] Stop swallowing every failure to `null` indiscriminately: distinguish "unit inactive" from "probe failed/timed out" and surface per-field status so the UI can show *unknown* vs *down*.
- [ ] Add Zod validation + a stable typed contract for `HermesOpsSnapshot` on the route.
- [ ] **Add unit tests for the `hermes-ops` repository** (mock `execFile`/fs) — closes the REVIEW_ACTIONS "only `services` has tests" gap for this module.
- [ ] Read Bheem/Uma state via a **self-reporting ops exporter** (Decision #2): a read-only `uma` user-systemd timer writes a sanitized JSON snapshot to a known path; the root backend reads + aggregates it (Vijay gets a symmetric exporter). Interim stopgap until it ships: `runuser -u uma -- systemctl --user is-active/is-enabled` instead of the `ps`/`existsSync` checks.

## Phase 2 — Instance dimension across Mission Control (G2)

- [ ] Add `instanceId: 'vijay' | 'bheem'` to the core types in `web/src/lib/hermes` (`HermesTask`, `HermesProduct`, `HermesEvent`, `HermesRun`, agent/overview models) and to the backend contracts.
- [ ] Add a global **instance switcher** in `HermesShell` (`All` / `Vijay (root)` / `Bheem (uma)`) with persisted selection; thread it through every pane.
- [ ] Overview: show per-instance cards **and** a combined roll-up (extend the existing "Healthy instances 2/2" pattern from the ops panel to the whole overview).
- [ ] Ledger / Products / History / Agents: filter and badge by instance.

## Phase 3 — Real per-instance telemetry, replacing mock pane by pane (G1, G4)

Define the ingestion contract first, then convert panes. Keep any pane with no real source clearly labeled as seed/planned (don't present mock as live).

- [ ] **Primary source = real artifacts** (Decision #1): sessions, cron, watchdog alerts, backup history — read-only and cached. Treat a Hermes *session* as the work unit. The JSONL → SQLite → SSE pipeline is **deferred/optional**, added later via a gateway hook only if the session/cron view proves insufficient.
- [ ] Backend endpoints per instance, reading real Hermes state:
  - [ ] Sessions + stats (`hermes sessions stats` — baseline today: Vijay 59 sessions/5225 msgs, Bheem 18/635).
  - [ ] Cron jobs (`hermes cron list`) including backup + watchdog timers.
  - [ ] Memory + skills inventory.
  - [ ] Watchdog alerts feed (from `hermes-health-watchdog.py` output / logs).
  - [ ] Backup history (git log of each backup repo: HEAD, last-commit age, freshness).
- [ ] Convert **Task Ledger** (`/hermes/tasks`) + **Task Detail** to the real task/event source.
- [ ] Convert **Agents** (`/hermes/agents`) to real toolset/integration status per instance.
- [ ] Convert **History** (`/hermes/history`) to real session/cron/backup trends.
- [ ] **Products** (`/hermes/products`): repoint at the real service registry (`backend/src/modules/services/`) + health module (Decision #3); drop the fabricated 50-item mock. Optional manual entries for not-yet-deployed products come later.

## Phase 4 — Bheem/Uma parity so the dashboard shows two equal instances (G7)

This is the biggest operational asymmetry and the reason half the ops-panel warnings are Bheem-only.

- [ ] Stand up a **Uma persistent backup repo + `uma-hermes-backup.timer`** mirroring the root design (sanitized `hermes_persistent_backup/`, secrets and `state.db` excluded), pushing to `umadev0931/uma_hostinger_hermes_vm` **with a Uma-owned, repo-scoped token (Bheem self-pushes; root no longer pushes Uma's backup — Decision #5)**.
- [ ] Install a **Uma health watchdog** (mirror `scripts/hermes-health-watchdog.py`), silent-on-success, alerting Uma's Telegram.
- [ ] Run the **first Uma restore rehearsal** into a temporary `HERMES_HOME`; document in `docs/hermes-operations.md` / `docs/hermes-disaster-recovery.md`.
- [ ] Schedule a **quarterly Uma restore-drill reminder** (parity with root).
- [ ] Confirm these close the corresponding Bheem warnings emitted by `getHermesOpsSnapshot()` (backup timer active, repo HEAD readable + clean, Google token present).

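
The Phase 4 confirmation step (backup timer active, repo HEAD readable and clean, Google token present) together with the Phase 3 backup-freshness item could be checked by a small pure function like the sketch below. This is illustrative only: the `BackupPosture` field names, the `backupWarnings` helper, and the 26-hour staleness threshold are assumptions, not the actual shape returned by `getHermesOpsSnapshot()`.

```ts
// Hypothetical per-instance posture; the real fields live in hermes-ops/repository.ts.
type InstanceId = 'vijay' | 'bheem';

interface BackupPosture {
  instanceId: InstanceId;
  backupTimerActive: boolean | null;  // null = probe could not determine
  repoHead: string | null;            // short HEAD, null if unreadable
  repoClean: boolean | null;
  googleTokenPresent: boolean | null;
  lastCommitAt: string | null;        // ISO timestamp of the last backup commit
}

// Assumed threshold; the roadmap does not define what "fresh" means.
const MAX_BACKUP_AGE_HOURS = 26;

// Returns one human-readable warning per failed check, prefixed with the instance id.
// An empty array means the instance has no standing backup warnings.
function backupWarnings(p: BackupPosture, now: Date): string[] {
  const warnings: string[] = [];
  if (p.backupTimerActive !== true) warnings.push('backup timer not active');
  if (p.repoHead === null) warnings.push('backup repo HEAD unreadable');
  if (p.repoClean === false) warnings.push('backup repo has uncommitted changes');
  if (p.googleTokenPresent !== true) warnings.push('Google token missing');
  if (p.lastCommitAt === null) {
    warnings.push('no backup commit found');
  } else {
    const ageHours = (now.getTime() - Date.parse(p.lastCommitAt)) / 3600000;
    if (ageHours > MAX_BACKUP_AGE_HOURS) {
      warnings.push(`last backup commit is ${Math.floor(ageHours)}h old`);
    }
  }
  return warnings.map((w) => `${p.instanceId}: ${w}`);
}
```

Running the same function over both instances and requiring an empty result for Bheem would be one way to evidence the "zero standing Bheem warnings" outcome, provided the real snapshot exposes equivalent fields.
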
## Phase 5 — Dashboard app hardening (G5)

- [ ] **P0:** Fix the CI workspace path (`${{ gitea.workspace }}`) in `.gitea/workflows/ci.yml`, `DEPLOYMENT.md`, `scripts/deploy-hotcopy.sh` (currently point at non-existent `/opt/bytelyst/bytelyst-devops-tools/...`).
- [ ] **P0:** Replace the no-op `lint` echo with real linting (`next lint` for web, minimal ESLint for backend); make `pnpm lint` fail on bad code.
- [ ] **P1:** Add tests for `auth`, `csrf`, `deployments/orchestrator`, `health`, **and `hermes-ops`**; add `pnpm test:coverage` gate.
- [ ] **P1:** Resolve the SSE TODO — either ship a Fastify-5-compatible log-stream or remove the SSE claim from docs/UI.
- [ ] **P1:** Fix doc drift (web port 3000 vs 3049; endpoint URLs; merge duplicate deployment docs).
- [ ] **P1:** Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths — blast radius must be written down; consider an allow-list wrapper over the raw socket).
- [ ] **P2:** Structured backend logging (pino → stdout); wire E2E (`hermes.spec.ts`) into CI with a started stack.

## Phase 6 — Mission Control UX polish (G6)

- [ ] Severity-tag warnings (info/warn/critical) and add a severity filter to the ops panel.
- [ ] Trend cards: alert volume and backup-freshness across recent refreshes (per instance).
- [ ] Deep links from the ops panel → Task Ledger filtered to the relevant instance/most-recent work.
- [ ] Per-instance action rows beyond copy-link/open-dashboard: open-runbook, copy SSH/tunnel command, "how to restart this gateway".
- [ ] Optional dark/light theme toggle if the shell supports it.
- [ ] Unified alerts feed across both instances on the overview.

## Phase 7 — Security & access (G8)

- [ ] Require authentication on the DevOps dashboard's hermes routes/endpoints (reuse platform-service auth already used elsewhere).
- [ ] Decide and document `security.redact_secrets` and `privacy.redact_pii` for gateway sessions (per instance).
- [ ] Finish the GitHub/Gitea **least-privilege token audit** (root currently pushes both repos) and rotate any migrated/exposed credentials — completed naturally by Decision #5 (Bheem self-pushes with its own scoped token).
- [ ] Keep all hermes data private-only; never expose the `hermes-ops` snapshot or task data on a public route.

## Phase 8 — Notifications & Telegram loop (G9)

- [ ] Push new dashboard-detected warnings to the correct Telegram (Vijay → root chat, Bheem → Uma chat), reusing the watchdog delivery path; silent on healthy.
- [ ] Validate the Telegram approval-prompt flow and media/file delivery end-to-end (the two unchecked v1 items).
- [ ] Preserve the numbered-emoji progress convention (`1️⃣`, `2️⃣`, …) for completion updates.

---

## Data Model Additions

```ts
// web/src/lib/hermes (and mirrored in backend contracts)
export type HermesInstanceId = 'vijay' | 'bheem';

export interface HermesInstanceRef {
  id: HermesInstanceId;
  label: string; // "Vijay / root", "Bheem / Uma"
  user: string; // "root" | "uma"
  hermesHome: string;
}

// add `instanceId: HermesInstanceId` to:
//   HermesTask, HermesProduct, HermesEvent, HermesRun, HermesAgentStatus, HermesOverview
```

## Acceptance Criteria

This roadmap is complete when:

- [ ] The overview, ledger, agents, and history panes render **real data for both Vijay and Bheem**, filterable by instance; only panes without a real source remain (clearly labeled) seed data.
- [ ] `hermes-ops` is cached, uses robust Uma user-systemd checks, distinguishes unknown vs down, and has unit tests.
- [ ] Bheem has a persistent backup repo + timer, a watchdog, and one completed restore rehearsal — and the dashboard shows **2/2 healthy** with zero standing Bheem warnings.
- [ ] CI is green on the correct path, lint is real, and coverage includes auth/csrf/orchestrator/health/hermes-ops.
- [ ] Hermes routes require auth and remain private-only; redact policies are decided and documented.
- [ ] Dashboard warnings reach the correct Telegram chat per instance.

## Implementation Status Checklist

Update only with evidence (source review, tests, build output, or browser/VM verification).

- [ ] Phase 0 — Guardrails reconfirmed
- [ ] Phase 1 — `hermes-ops` hardened + tested
- [ ] Phase 2 — Instance dimension + switcher
- [ ] Phase 3 — Real telemetry ingestion + panes converted
- [ ] Phase 4 — Bheem/Uma parity (backup, watchdog, restore drill)
- [ ] Phase 5 — App/CI hardening (P0 → P2)
- [ ] Phase 6 — UX polish
- [ ] Phase 7 — Security & access
- [ ] Phase 8 — Notifications & Telegram

## Decisions (resolved 2026-05-30)

1. **Task data source — derive from real artifacts now; defer the JSONL pipeline.** Hermes' real unit of work is the *session* (+ cron jobs), and there's no evidence the agent emits a task-level JSONL ledger today. Build the ledger/activity views from what already exists — `hermes sessions` (+ stats), `hermes cron list` (+ last-run), watchdog alerts, and backup git history. Add a JSONL session/event pipeline → SQLite **later and only if** the session/cron view proves insufficient (via a gateway hook that appends records). Do not fabricate a task store.
2. **Reading Bheem state — self-reporting ops exporter per instance.** Each instance runs a tiny read-only exporter (Bheem as a `uma` user-systemd timer, Vijay symmetrically) that writes a **sanitized** JSON snapshot (booleans, counts, timestamps, short HEADs — no secrets) to a known path; the unified backend just reads and aggregates the two files. No cross-user command execution or reaching into `/home/uma/.hermes`. **Interim stopgap** until the exporter ships: replace the brittle `ps`/`existsSync` Uma checks with `runuser -u uma -- systemctl --user is-active/is-enabled`.
3. **Products — repoint at the real service registry; drop the fabricated mock.** The dashboard already has a live service registry (`backend/src/modules/services/`, with health).
   Back the Products pane with that real data instead of a 50-item fiction; allow optional manual entries later for not-yet-deployed products. Relabel clearly until the mapping lands.
4. **Auth — reuse platform-service JWT, defense-in-depth.** Put the hermes routes behind the same platform-service auth (`backend/src/lib/auth.ts`) the rest of the dashboard uses; keep the network private (Tailscale/loopback) as a second layer. No separate basic-auth gate (that's only the never-used "if forced public" path).
5. **Bheem backup — same repo + Drive, but Uma-owned least-privilege token; Bheem self-pushes.** Keep `umadev0931/uma_hostinger_hermes_vm` and Bheem Drive, but give Bheem its own repo-scoped credential so it backs itself up rather than depending on root's broad credential. Root stops pushing Uma's backup; this also closes the standing GitHub least-privilege audit item.

## Suggested Execution Order

1. Phase 5 P0 (CI path + lint) — unblocks everything.
2. Phase 1 (harden `hermes-ops`) — the foundation the real UI sits on.
3. Phase 2 (instance dimension) + Phase 4 (Bheem parity) in parallel — make "two instances" first-class in both data and ops.
4. Phase 3 (real telemetry, pane by pane).
5. Phase 7 (auth) before any wider access; Phase 8 (Telegram) and Phase 6 (polish) last.

Each item is sized to land as a single PR with incremental commits to `main`.
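
Step 2 of the execution order (Phase 1) asks for a short-TTL cache and for probes that report *unknown* separately from *down*. The sketch below shows one possible shape for both, under stated assumptions: `cached`, `classifyIsActive`, `ProbeOutcome`, and `FieldStatus` are hypothetical names, and the classifier assumes the caller has already captured the trimmed stdout of a `systemctl is-active` call plus whether the subprocess timed out or failed to start.

```ts
type FieldStatus = 'up' | 'down' | 'unknown';

// Hypothetical summary of one subprocess probe, filled in by the caller.
interface ProbeOutcome {
  stdout: string;       // e.g. output of `systemctl is-active <unit>`
  timedOut: boolean;    // subprocess was killed by its timeout
  spawnFailed: boolean; // binary missing, permission denied, etc.
}

// Maps a probe outcome to a per-field status instead of collapsing failures to null.
function classifyIsActive(o: ProbeOutcome): FieldStatus {
  if (o.timedOut || o.spawnFailed) return 'unknown';
  const state = o.stdout.trim();
  if (state === 'active') return 'up';
  if (state === 'inactive' || state === 'failed') return 'down';
  return 'unknown'; // activating, deactivating, empty output, and so on
}

// Wraps an async loader in a TTL cache. Concurrent callers share one in-flight load,
// so a burst of panel polls triggers a single round of subprocess probes.
function cached<T>(
  ttlMs: number,
  load: () => Promise<T>,
  now: () => number = Date.now,
): () => Promise<T> {
  let entry: { value: T; expiresAt: number } | null = null;
  let inflight: Promise<T> | null = null;

  return () => {
    if (entry && now() < entry.expiresAt) return Promise.resolve(entry.value);
    if (inflight) return inflight;

    const p: Promise<T> = Promise.resolve()
      .then(load)
      .then((value) => {
        entry = { value, expiresAt: now() + ttlMs };
        return value;
      });
    inflight = p;
    // Clear the in-flight slot on success or failure so a failed load can be retried.
    const clear = () => {
      if (inflight === p) inflight = null;
    };
    p.then(clear, clear);
    return p;
  };
}
```

A route handler could hold one `cached(30000, buildSnapshot)` instance, where `buildSnapshot` stands in for whatever assembles the snapshot, and stamp `generatedAt` inside the loader so clients can see how old the served data is. Injecting `now` keeps the expiry logic testable without real timers.
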