# Hermes Mission Control v2 — Two-Instance Dashboard Roadmap

**Date:** 2026-05-30
**Owner:** ByteLyst / S (`saravanakumardb`)
**Repo:** `learning_ai_devops_tools` (GitHub remote: `bytelyst-devops-tools`)
**Dashboard:** `dashboard/` — Next.js 16 web (`web/`, port 3000 / container 3049) + Fastify 5 backend (`backend/`, port 4004)

## What This Roadmap Is

The two existing roadmaps are effectively complete for their original scope:

- `docs/hermes_dashboard_roadmap.md` — built the 7-pane **Hermes Mission Control** UI. All checklist items are checked, **but every pane except the live ops panel renders mock/seed data** from `web/src/lib/hermes`.
- `docs/hermes-setup-upgrade-roadmap.md` — stood up and hardened the two Hermes instances on the VM (~68% checked; the open items are mostly Uma parity, credentials, and policy decisions).

This v2 roadmap **supersedes the open dashboard-related items in both** and adds the missing theme: **power the unified dashboard with real data from BOTH Hermes instances — Vijay/root and Bheem/Uma — and close every known gap** (mock data, backend hardening, two-instance parity, app/CI hygiene, UX polish, security, and notifications).

It does **not** re-do anything already verified in v1. It builds on the one piece that is already real and two-instance-aware: the backend `hermes-ops` module.

## The Two Instances (authoritative topology)

Source of truth is `dashboard/backend/src/modules/hermes-ops/repository.ts`.

| Codename | OS user | `HERMES_HOME` | Gateway service | Private dashboard | Backup timer | Backup repo (local → GitHub) | Drive folder |
|----------|---------|---------------|-----------------|-------------------|--------------|------------------------------|--------------|
| **Vijay** | `root` | `/root/.hermes` | `hermes-gateway.service` (system) | `hermes-root-dashboard.service` → :9119 | `hermes-root-backup.timer` | `/root/repos/bytelyst_hostinger_hermes_vm` → `saravanakumardb/bytelyst_hostinger_hermes_vm` | Vijay Drive |
| **Bheem** | `uma` | `/home/uma/.hermes` | `uma-hermes-gateway.service` (uma **user** systemd) | `uma-hermes-dashboard.service` → :9120 | `uma-hermes-backup.timer` | `/home/uma/repos/uma_hostinger_hermes_vm` → `umadev0931/uma_hostinger_hermes_vm` | Bheem Drive |

- Both are reachable **privately only** over Tailscale at `100.87.53.10` (`:9119` Vijay, `:9120` Bheem). There is no public Caddy route. This is a hard guardrail.
- Plus a root-level `hermes-emergency-drive-upload.timer` that pushes encrypted bundles to each instance's Google Drive folder.

### Three dashboard surfaces exist today

1. **Native per-instance Hermes dashboards** — `:9119` (Vijay) and `:9120` (Bheem), one per user, over Tailscale. Operationally scoped, separate from this codebase.
2. **ByteLyst Mission Control** — the `/hermes/*` suite in this repo's DevOps dashboard (7 panes). **Intended to be the unified pane-of-glass over both instances.**
3. **The live `hermes-ops` panel** — embedded in the Mission Control overview (`web/src/components/hermes-ops-panel.tsx`), already rendering **real, both-instance** status: gateways, private dashboards, backup timers, repo HEAD/cleanliness, Google token, restore-payload counts, cron timers, emergency drive, Tailscale IP, active session count, and warnings.

**Decision baked into this roadmap:** invest in surface #2/#3 — make unified Mission Control the real two-instance command center — rather than expanding the native per-instance dashboards. The `hermes-ops` module is the seed; everything below extends it.

## Goal / Target State

A single private dashboard where, for **both Vijay and Bheem**, S can see at a glance:

- live instance health (gateway, dashboard, cron, backup freshness, disk/mem, Google auth) — **real, cached, robust**
- everything each Hermes is doing / did / failed / is blocked on — **from real session/cron/task telemetry, filterable by instance**
- backup & disaster-recovery posture **at parity across both instances**
- what needs founder attention, pushed to the right Telegram chat

…with the whole thing private-only, authenticated, tested, and in CI.

---

## Gap Inventory (consolidated)

| ID | Gap | Source | Severity |
|----|-----|--------|----------|
| G1 | 6 of 7 Mission Control panes are mock (`web/src/lib/hermes`) | v1 roadmap / README "read-only mock" | High |
| G2 | Tasks/Products/History/Agents have **no instance dimension** (Vijay vs Bheem) | this review | High |
| G3 | `hermes-ops` backend not hardened: no cache (~20 `execFile` per 60s poll), brittle Uma checks (`ps` string-match + hardcoded `existsSync`), errors swallowed to `null`, **no tests** | REVIEW_ACTIONS P1 #3 | High |
| G4 | No real telemetry ingestion (sessions, cron, memory, skills, alerts, backup history, task events) | v1 roadmap "real telemetry plan" | High |
| G5 | App/CI hygiene: CI path wrong (P0), lint is a no-op echo (P0), thin tests (P1), SSE disabled (P1), doc drift 3000 vs 3049 (P1), privileged docker socket/host mounts undocumented (P1), no prod logging (P2), E2E not wired (P2) | REVIEW_ACTIONS | P0–P2 |
| G6 | Mission Control polish: warning severity filters, trend cards, ops→ledger deep links, per-instance action rows, theme toggle | v1 "Next Improvements" | Med |
| G7 | **Bheem/Uma parity:** no persistent backup repo + timer equivalent to root, no watchdog cron, restore never tested, no quarterly drill | setup roadmap (lines 20, 146, 172, 185, 432, 447, 461) | High |
| G8 | Security/access: devops dashboard hermes routes need auth; `security.redact_secrets` / `privacy.redact_pii` undecided; GitHub/Gitea least-privilege audit + rotation pending | setup roadmap Phase 11 | High |
| G9 | Notifications: dashboard warnings not pushed to Telegram; approval-prompt flow + media/file delivery UX unvalidated | setup roadmap Phase 6 | Med |

---

## Phase 0 — Guardrails (must hold throughout)

- [ ] No public Caddy route or public listener for **any** Hermes dashboard, the `hermes-ops` API, or the DevOps dashboard's hermes data. Private-only via Tailscale / SSH tunnel / loopback.
- [ ] Keep Hermes command approvals at `manual` or `smart`; no gateway approval bypass.
- [ ] No raw secrets, tokens, OAuth files, `state.db`, or SQLite WAL/SHM in any git backup or in this repo.
- [ ] Re-run the Caddy/port review (`docs/hermes-operations.md`) before adding any route or hostname.

## Phase 1 — Make the unified backend authoritative and hardened (G3)

The `hermes-ops` snapshot becomes the single source of truth for live status. Before building UI on it, harden it.

- [x] Add a short-TTL cache (mirror the health module's 30s cache) so the 60s panel poll doesn't fan out ~20 `systemctl`/`git`/`ps`/`du` subprocesses every refresh; serve cached snapshot with `generatedAt`.
- [x] Replace brittle Bheem/Uma checks in `repository.ts` *(runuser `systemctl --user` with ps/existsSync fallback so a failed probe degrades to the legacy check, not a false "down")*:
  - `isUmaGatewayActive()` (currently `ps -eo` string match) → `runuser -u uma -- systemctl --user is-active uma-hermes-gateway.service` (or `--machine=uma@.host`).
  - `isUmaGatewayEnabled()` (currently hardcoded `existsSync` of a wants-symlink) → `systemctl --user is-enabled` via the same path.
- [x] Stop swallowing every failure to `null` indiscriminately: distinguish "unit inactive" from "probe failed/timed out" and surface per-field status so the UI can show *unknown* vs *down*.
- [x] Add Zod validation + a stable typed contract for `HermesOpsSnapshot` on the route.
- [x] **Add unit tests for the `hermes-ops` repository** (mock `execFile`/fs) — closes the REVIEW_ACTIONS "only `services` has tests" gap for this module.
- [ ] Read Bheem/Uma state via a **self-reporting ops exporter** (Decision #2): a read-only `uma` user-systemd timer writes a sanitized JSON snapshot to a known path; the root backend reads + aggregates it (Vijay gets a symmetric exporter). Interim stopgap until it ships: `runuser -u uma -- systemctl --user is-active/is-enabled` instead of the `ps`/`existsSync` checks.
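
A minimal sketch of two of the hardening items above: mapping a probe result to *up* / *down* / *unknown* instead of `null`, and a short-TTL cache that stamps `generatedAt`. The names `ProbeResult`, `classifyProbe`, and `createSnapshotCache` are hypothetical and not taken from `repository.ts`. The sketch assumes the probe captures the state word printed by `systemctl is-active` (`active`, `inactive`, `failed`, and so on) and treats any other state as unknown.

```ts
// Hypothetical names throughout; not the actual repository.ts API.
export type ProbeStatus = 'up' | 'down' | 'unknown';

export interface ProbeResult {
  stdout: string;      // output of e.g. `systemctl --user is-active <unit>`
  timedOut: boolean;   // the subprocess was killed by a timeout
  spawnError: boolean; // the command could not be run at all
}

// A failed or timed-out probe is "unknown", never a false "down".
export function classifyProbe(result: ProbeResult): ProbeStatus {
  if (result.timedOut || result.spawnError) return 'unknown';
  const state = result.stdout.trim();
  if (state === 'active') return 'up';
  if (state === 'inactive' || state === 'failed') return 'down';
  return 'unknown'; // transitional or unexpected states
}

export interface Cached<T> {
  value: T;
  generatedAt: string; // ISO timestamp of when the snapshot was built
}

// Serves the cached snapshot until the TTL expires; concurrent callers
// share one in-flight build so a poll never fans out twice.
export function createSnapshotCache<T>(
  build: () => Promise<T>,
  ttlMs = 30_000,
  now: () => number = Date.now,
): () => Promise<Cached<T>> {
  let entry: { cached: Cached<T>; expiresAt: number } | undefined;
  let inFlight: Promise<Cached<T>> | undefined;

  return async () => {
    if (entry && now() < entry.expiresAt) return entry.cached;
    if (inFlight) return inFlight;

    const pending = build().then((value) => {
      const cached: Cached<T> = { value, generatedAt: new Date(now()).toISOString() };
      entry = { cached, expiresAt: now() + ttlMs };
      return cached;
    });
    inFlight = pending;
    const clear = () => { inFlight = undefined; };
    pending.then(clear, clear); // clear on success and on failure
    return pending;
  };
}
```

A failed build leaves the previous entry untouched and rejects only the callers waiting on that build, so the next poll retries rather than caching the error.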

## Phase 2 — Instance dimension across Mission Control (G2)

- [x] Add `instanceId: 'vijay' | 'bheem'` to the core types in `web/src/lib/hermes` (`HermesTask`, `HermesProduct`, `HermesEvent`, `HermesRun`, agent/overview models) and to the backend contracts. *(Web: `instanceId` now on `HermesProduct`, `HermesTask`, `HermesEvent`, `HermesRun`, `HermesAgentStatus` (with an `'all'` literal for cross-cutting agents like Hermes Core / GitHub link). Seed data deterministically split ~50/50 across instances. Backend ops contract already carried per-instance shape under `HermesOpsSnapshot.instances` from Phase 1; no separate backend change needed for this slice.)*
- [x] Add a global **instance switcher** in `HermesShell` (`All` / `Vijay (root)` / `Bheem (uma)`) with persisted selection; thread it through every pane. *(New `HermesInstanceProvider` (React context, localStorage-backed under key `hermes.instanceFilter.v1`, with SSR-safe default to avoid hydration mismatch) mounted in `app/hermes/layout.tsx`. New `HermesInstanceSwitcher` segmented control rendered in the layout header above every pane. Every pane reads `useHermesInstance()` and threads the value into the data-fetcher.)*
- [x] Overview: show per-instance cards **and** a combined roll-up. *(New "Per-instance roll-up" section on `/hermes` always shows Vijay and Bheem side-by-side with active/blocked/failed/success-rate cells regardless of the switcher state — that's the "always cross-instance" comparison view, while the eight metric cards above it are filtered by the switcher.)*
- [x] Ledger / Products / History / Agents: filter and badge by instance. *(`HermesInstanceBadge` component shipped; tasks (Active Missions + Task Ledger), product cards (overview minicards + portfolio cards), and agent rows now show their instance. Filter helpers `getHermesTasks({instance})`, `getHermesProducts(view, instance)`, `getHermesAgents(instance)`, `getHermesHistory(instance)`, `getHermesOverview(instance)` all accept the filter and short-circuit `'all'`. New unit tests in `lib/hermes.test.ts` cover the filter semantics. New E2E test asserts the switcher's radiogroup, default selection, and persistence-friendly state change. 7/7 E2E + 13/13 web unit tests green.)*
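
The filter semantics described above can be sketched as a single generic helper. `filterByInstance` and `HermesInstanceFilter` are hypothetical names; the shipped helpers are `getHermesTasks({instance})` and friends, whose internals are not shown here. Whether cross-cutting `'all'` items stay visible under a specific instance filter is an assumption of this sketch.

```ts
export type HermesInstanceId = 'vijay' | 'bheem';
export type HermesInstanceFilter = HermesInstanceId | 'all'; // hypothetical name

interface HasInstance {
  // 'all' marks cross-cutting items (e.g. agents shared by both instances)
  instanceId: HermesInstanceId | 'all';
}

export function filterByInstance<T extends HasInstance>(
  items: readonly T[],
  filter: HermesInstanceFilter = 'all',
): T[] {
  // Short-circuit: the 'all' filter returns everything unchanged.
  if (filter === 'all') return [...items];
  // Assumption: cross-cutting items remain visible under any instance filter.
  return items.filter((item) => item.instanceId === filter || item.instanceId === 'all');
}
```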

## Phase 3 — Real per-instance telemetry, replacing mock pane by pane (G1, G4)

Define the ingestion contract first, then convert panes. Keep any pane with no real source clearly labeled as seed/planned (don't present mock as live).

- [ ] **Primary source = real artifacts** (Decision #1): sessions, cron, watchdog alerts, backup history — read-only and cached. Treat a Hermes *session* as the work unit. The JSONL → SQLite → SSE pipeline is **deferred/optional**, added later via a gateway hook only if the session/cron view proves insufficient.
- [ ] Backend endpoints per instance, reading real Hermes state:
  - [ ] Sessions + stats (`hermes sessions stats` — baseline today: Vijay 59 sessions/5225 msgs, Bheem 18/635).
  - [ ] Cron jobs (`hermes cron list`) including backup + watchdog timers.
  - [ ] Memory + skills inventory.
  - [ ] Watchdog alerts feed (from `hermes-health-watchdog.py` output / logs).
  - [ ] Backup history (git log of each backup repo: HEAD, last-commit age, freshness).
- [ ] Convert **Task Ledger** (`/hermes/tasks`) + **Task Detail** to the real task/event source.
- [ ] Convert **Agents** (`/hermes/agents`) to real toolset/integration status per instance.
- [ ] Convert **History** (`/hermes/history`) to real session/cron/backup trends.
- [ ] **Products** (`/hermes/products`): repoint at the real service registry (`backend/src/modules/services/`) + health module (Decision #3); drop the fabricated 50-item mock. Optional manual entries for not-yet-deployed products come later.
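
For the backup-history item, freshness can be derived from the last commit timestamp alone. The sketch below assumes the backend has already obtained an ISO timestamp (for example from `git log -1 --format=%cI`). The `backupFreshness` name and the 26-hour default threshold are illustrative assumptions, not values from this roadmap.

```ts
export type Freshness = 'fresh' | 'stale' | 'unknown';

export interface BackupFreshness {
  ageHours: number | null; // null when the timestamp is missing or unparseable
  status: Freshness;
}

// Hypothetical helper: classify a backup repo by the age of its last commit.
export function backupFreshness(
  lastCommitIso: string | null,
  nowMs: number,
  maxAgeHours = 26, // assumed threshold, tune to the timer cadence
): BackupFreshness {
  if (!lastCommitIso) return { ageHours: null, status: 'unknown' };

  const committedMs = Date.parse(lastCommitIso);
  if (Number.isNaN(committedMs)) return { ageHours: null, status: 'unknown' };

  // Clamp at zero so small clock skew never yields a negative age.
  const ageHours = Math.max(0, (nowMs - committedMs) / 3_600_000);
  return { ageHours, status: ageHours <= maxAgeHours ? 'fresh' : 'stale' };
}
```

Returning `unknown` for an unreadable repo keeps the pane consistent with the Phase 1 rule that a failed probe is not reported as a failure of the thing being probed.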

## Phase 4 — Bheem/Uma parity so the dashboard shows two equal instances (G7)

This is the biggest operational asymmetry and the reason half the ops-panel warnings are Bheem-only.

- [ ] Stand up a **Uma persistent backup repo + `uma-hermes-backup.timer`** mirroring the root design (sanitized `hermes_persistent_backup/`, secrets and `state.db` excluded), pushing to `umadev0931/uma_hostinger_hermes_vm` **with a Uma-owned, repo-scoped token (Bheem self-pushes; root no longer pushes Uma's backup — Decision #5)**.
- [ ] Install a **Uma health watchdog** (mirror `scripts/hermes-health-watchdog.py`), silent-on-success, alerting Uma's Telegram.
- [ ] Run the **first Uma restore rehearsal** into a temporary `HERMES_HOME`; document in `docs/hermes-operations.md` / `docs/hermes-disaster-recovery.md`.
- [ ] Schedule a **quarterly Uma restore-drill reminder** (parity with root).
- [ ] Confirm these close the corresponding Bheem warnings emitted by `getHermesOpsSnapshot()` (backup timer active, repo HEAD readable + clean, Google token present).

## Phase 5 — Dashboard app hardening (G5)

- [x] **P0:** Fix the CI workspace path (`${{ gitea.workspace }}`) in `.gitea/workflows/ci.yml`, `DEPLOYMENT.md`, `scripts/deploy-hotcopy.sh` (currently point at non-existent `/opt/bytelyst/bytelyst-devops-tools/...`).
- [x] **P0:** Replace the no-op `lint` echo with real linting (`next lint` for web, minimal ESLint for backend); make `pnpm lint` fail on bad code.
- [x] **P1:** Add tests for `auth`, `csrf`, `deployments/orchestrator`, `health`, **and `hermes-ops`**; add `pnpm test:coverage` gate. *(35 new unit tests; v8 coverage thresholds gated on the six tested files in `backend/vitest.config.ts` (≥85% lines/funcs/stmts, ≥65% branches), wired into Gitea CI as a dedicated step. Today's actuals: ≥95% lines on every gated file. Ratchet up as more modules get tested.)*
- [x] **P1:** Resolve the SSE TODO — either ship a Fastify-5-compatible log-stream or remove the SSE claim from docs/UI. *(Chose **remove**: dropped `fastify-sse-v2` dep, deleted commented-out plugin import + TODO from `server.ts` and `deployments/routes.ts`, rewrote the README/DEPLOYMENT.md "Log Streaming" section as "Logs (JSON-polled, no SSE)". Web client already polls `/deployments/:id/logs` via `apiRequest` — no UI change needed. If a real-time stream is wanted later, implement via `reply.raw` and update docs in the same change.)*
- [x] **P1:** Fix doc drift (web port 3000 vs 3049; endpoint URLs; merge duplicate deployment docs). *(`DEPLOYMENT.md` is now canonical; `DEPLOYMENT_GUIDE.md` reduced to a redirect stub; `deploy.sh` updated. Added an explicit "Ports — quick reference" table to `DEPLOYMENT.md` distinguishing container `:3000`, Compose host `:3049`, Traefik production. README and ENDPOINTS.md cross-link to it. Marks REVIEW_ACTIONS #5 resolved.)*
- [x] **P1:** Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths — blast radius must be written down; consider an allow-list wrapper over the raw socket). *(New "Privilege Surface" section in `dashboard/DEPLOYMENT.md` enumerating every mount, every shell-outing route + commands + auth gate, the blast-radius if an admin token leaks, five known sharp edges, and a P1→P3 mitigation roadmap. Concurrent fix: `/code-quality/check` was reachable unauthenticated despite shelling out to `npm run` in a caller-supplied path — `requireAdmin` added. Allow-list wrapper around `docker`/`bash`/`npm` invocations and `projectPath` validation are queued as the next P1s; running the container as non-root and replacing the raw `docker.sock` with a verb-restricted proxy are P2/P3.)*
- [x] **P2:** Structured backend logging (pino → stdout); wire E2E (`hermes.spec.ts`) into CI with a started stack. *(Two commits: (1) `lib/logger.ts` exposes a configured pino instance shared between Fastify (via `loggerInstance`) and any non-request code path, with `LOG_LEVEL` env knob and built-in redaction for Authorization/Cookie headers + common secret-shaped field names; runtime `console.error` sites in deployments/orchestrator, system, backup, and vm modules ported over to structured logs. (2) E2E in CI: hermes spec now intercepts `/api/hermes/ops` with a fixture snapshot so it's deterministic without a live backend; CI workflow runs `playwright install --with-deps chromium` then `pnpm test:e2e` (web suite starts its own Next dev via Playwright's `webServer` config). Verified locally: 6/6 E2E green, 51/51 unit tests green, coverage gate ≥95% lines.)*

## Phase 6 — Mission Control UX polish (G6)

- [ ] Severity-tag warnings (info/warn/critical) and add a severity filter to the ops panel.
- [ ] Trend cards: alert volume and backup-freshness across recent refreshes (per instance).
- [ ] Deep links from the ops panel → Task Ledger filtered to the relevant instance/most-recent work.
- [ ] Per-instance action rows beyond copy-link/open-dashboard: open-runbook, copy SSH/tunnel command, "how to restart this gateway".
- [ ] Optional dark/light theme toggle if the shell supports it.
- [ ] Unified alerts feed across both instances on the overview.
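
One possible shape for the severity tagging and filter, assuming the filter means "show this severity and above". `OpsWarning`, `filterBySeverity`, and `countBySeverity` are hypothetical names; the current ops panel's warning type is not shown in this roadmap.

```ts
export type Severity = 'info' | 'warn' | 'critical';

export interface OpsWarning {
  instanceId: 'vijay' | 'bheem';
  message: string;
  severity: Severity;
}

const SEVERITY_RANK: Record<Severity, number> = { info: 0, warn: 1, critical: 2 };

// Keep warnings at or above the chosen minimum severity.
export function filterBySeverity(warnings: readonly OpsWarning[], min: Severity): OpsWarning[] {
  return warnings.filter((w) => SEVERITY_RANK[w.severity] >= SEVERITY_RANK[min]);
}

// Per-severity counts, e.g. for badges next to the filter control.
export function countBySeverity(warnings: readonly OpsWarning[]): Record<Severity, number> {
  const counts: Record<Severity, number> = { info: 0, warn: 0, critical: 0 };
  for (const w of warnings) counts[w.severity] += 1;
  return counts;
}
```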

## Phase 7 — Security & access (G8)

- [ ] Require authentication on the DevOps dashboard's hermes routes/endpoints (reuse platform-service auth already used elsewhere).
- [ ] Decide and document `security.redact_secrets` and `privacy.redact_pii` for gateway sessions (per instance).
- [ ] Finish the GitHub/Gitea **least-privilege token audit** (root currently pushes both repos) and rotate any migrated/exposed credentials — completed naturally by Decision #5 (Bheem self-pushes with its own scoped token).
- [ ] Keep all hermes data private-only; never expose the `hermes-ops` snapshot or task data on a public route.

## Phase 8 — Notifications & Telegram loop (G9)

- [ ] Push new dashboard-detected warnings to the correct Telegram (Vijay → root chat, Bheem → Uma chat), reusing the watchdog delivery path; silent on healthy.
- [ ] Validate the Telegram approval-prompt flow and media/file delivery end-to-end (the two unchecked items from setup roadmap Phase 6).
- [ ] Preserve the numbered-emoji progress convention (`1️⃣`, `2️⃣`, …) for completion updates.
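
The routing rule above (new warnings only, one chat per instance, silent on healthy) can be sketched as a pure planning step that runs before any delivery. All names here are hypothetical, the chat IDs are supplied by the caller, and the sketch deliberately does not call Telegram; delivery is assumed to reuse the existing watchdog path.

```ts
export type HermesInstanceId = 'vijay' | 'bheem';

export interface DashboardWarning {
  instanceId: HermesInstanceId;
  key: string;     // stable identifier used for de-duplication
  message: string;
}

export interface PlannedMessage {
  chatId: string;
  text: string;
}

// Returns the messages to send; an empty array means stay silent.
export function planTelegramMessages(
  warnings: readonly DashboardWarning[],
  alreadySent: ReadonlySet<string>,
  chatByInstance: Record<HermesInstanceId, string>,
): PlannedMessage[] {
  return warnings
    // Only push warnings that have not been delivered before.
    .filter((w) => !alreadySent.has(`${w.instanceId}:${w.key}`))
    // Route each warning to the chat that owns the instance.
    .map((w) => ({
      chatId: chatByInstance[w.instanceId],
      text: `[${w.instanceId}] ${w.message}`,
    }));
}
```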

---

## Data Model Additions

```ts
// web/src/lib/hermes (and mirrored in backend contracts)
export type HermesInstanceId = 'vijay' | 'bheem';

export interface HermesInstanceRef {
  id: HermesInstanceId;
  label: string;        // "Vijay / root", "Bheem / Uma"
  user: string;         // "root" | "uma"
  hermesHome: string;
}

// add `instanceId: HermesInstanceId` to:
//   HermesTask, HermesProduct, HermesEvent, HermesRun, HermesAgentStatus, HermesOverview
// (per Phase 2, HermesAgentStatus additionally allows 'all' for cross-cutting agents)
```
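
Because the switcher persists its selection in localStorage under `hermes.instanceFilter.v1`, the stored string should be validated before it is trusted. A minimal sketch, with `HermesInstanceFilter` and `parseInstanceFilter` as hypothetical names:

```ts
export type HermesInstanceId = 'vijay' | 'bheem';
export type HermesInstanceFilter = HermesInstanceId | 'all'; // hypothetical name

const FILTERS: readonly HermesInstanceFilter[] = ['all', 'vijay', 'bheem'];

// Falls back to 'all' for missing, stale, or tampered values.
export function parseInstanceFilter(raw: string | null | undefined): HermesInstanceFilter {
  return FILTERS.find((f) => f === raw) ?? 'all';
}
```

Defaulting to `'all'` also matches the SSR-safe default described in Phase 2, so the first render is the same on server and client.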

## Acceptance Criteria

This roadmap is complete when:

- [ ] The overview, ledger, agents, and history panes render **real data for both Vijay and Bheem**, filterable by instance; any pane that still lacks a real source stays clearly labeled as seed data.
- [ ] `hermes-ops` is cached, uses robust Uma user-systemd checks, distinguishes unknown vs down, and has unit tests.
- [ ] Bheem has a persistent backup repo + timer, a watchdog, and one completed restore rehearsal — and the dashboard shows **2/2 healthy** with zero standing Bheem warnings.
- [ ] CI is green on the correct path, lint is real, and coverage includes auth/csrf/orchestrator/health/hermes-ops.
- [ ] Hermes routes require auth and remain private-only; redact policies are decided and documented.
- [ ] Dashboard warnings reach the correct Telegram chat per instance.

## Implementation Status Checklist

Update only with evidence (source review, tests, build output, or browser/VM verification).

- [ ] Phase 0 — Guardrails reconfirmed
- [x] Phase 1 — `hermes-ops` hardened + tested (interim `runuser` checks shipped; the self-reporting exporter from Decision #2 is still open)
- [x] Phase 2 — Instance dimension + switcher
- [ ] Phase 3 — Real telemetry ingestion + panes converted
- [ ] Phase 4 — Bheem/Uma parity (backup, watchdog, restore drill)
- [x] Phase 5 — App/CI hardening (P0/P1/P2 items done; the P1→P3 follow-ups in the `DEPLOYMENT.md` mitigation roadmap remain)
- [ ] Phase 6 — UX polish
- [ ] Phase 7 — Security & access
- [ ] Phase 8 — Notifications & Telegram

## Decisions (resolved 2026-05-30)

1. **Task data source — derive from real artifacts now; defer the JSONL pipeline.** Hermes' real unit of work is the *session* (+ cron jobs), and there's no evidence the agent emits a task-level JSONL ledger today. Build the ledger/activity views from what already exists — `hermes sessions` (+ stats), `hermes cron list` (+ last-run), watchdog alerts, and backup git history. Add a JSONL session/event pipeline → SQLite **later and only if** the session/cron view proves insufficient (via a gateway hook that appends records). Do not fabricate a task store.
2. **Reading Bheem state — self-reporting ops exporter per instance.** Each instance runs a tiny read-only exporter (Bheem as a `uma` user-systemd timer, Vijay symmetrically) that writes a **sanitized** JSON snapshot (booleans, counts, timestamps, short HEADs — no secrets) to a known path; the unified backend just reads and aggregates the two files. No cross-user command execution or reaching into `/home/uma/.hermes`. **Interim stopgap** until the exporter ships: replace the brittle `ps`/`existsSync` Uma checks with `runuser -u uma -- systemctl --user is-active/is-enabled`.
3. **Products — repoint at the real service registry; drop the fabricated mock.** The dashboard already has a live service registry (`backend/src/modules/services/`, with health). Back the Products pane with that real data instead of a 50-item fiction; allow optional manual entries later for not-yet-deployed products. Relabel clearly until the mapping lands.
4. **Auth — reuse platform-service JWT, defense-in-depth.** Put the hermes routes behind the same platform-service auth (`backend/src/lib/auth.ts`) the rest of the dashboard uses; keep the network private (Tailscale/loopback) as a second layer. No separate basic-auth gate (that's only the never-used "if forced public" path).
5. **Bheem backup — same repo + Drive, but Uma-owned least-privilege token; Bheem self-pushes.** Keep `umadev0931/uma_hostinger_hermes_vm` and Bheem Drive, but give Bheem its own repo-scoped credential so it backs itself up rather than depending on root's broad credential. Root stops pushing Uma's backup; this also closes the standing GitHub least-privilege audit item.
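
To illustrate Decision #2, here is a sketch of how the unified backend might read one exporter file. The snapshot fields, the `readExporterSnapshot` name, and the five-minute staleness window are assumptions, since the exporter contract is not defined yet. The roadmap uses Zod for route validation; this sketch uses a hand-written guard only to stay dependency-free.

```ts
// Hypothetical sanitized snapshot: booleans, counts, timestamps, short HEADs only.
export interface ExporterSnapshot {
  instanceId: 'vijay' | 'bheem';
  generatedAt: string;       // ISO timestamp written by the exporter
  gatewayActive: boolean;
  backupTimerActive: boolean;
  sessionCount: number;
  repoHead: string;          // short commit hash, never a token or path
}

export type SnapshotRead =
  | { status: 'ok' | 'stale'; snapshot: ExporterSnapshot }
  | { status: 'missing' | 'invalid' };

function isSnapshot(v: unknown): v is ExporterSnapshot {
  if (typeof v !== 'object' || v === null) return false;
  const o = v as Record<string, unknown>;
  return (
    (o.instanceId === 'vijay' || o.instanceId === 'bheem') &&
    typeof o.generatedAt === 'string' &&
    !Number.isNaN(Date.parse(o.generatedAt)) &&
    typeof o.gatewayActive === 'boolean' &&
    typeof o.backupTimerActive === 'boolean' &&
    typeof o.sessionCount === 'number' &&
    typeof o.repoHead === 'string'
  );
}

// `rawJson` is the file content, or null when the file does not exist.
export function readExporterSnapshot(
  rawJson: string | null,
  nowMs: number,
  maxAgeMs = 5 * 60_000, // assumed staleness window
): SnapshotRead {
  if (rawJson === null) return { status: 'missing' };
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return { status: 'invalid' };
  }
  if (!isSnapshot(parsed)) return { status: 'invalid' };
  const ageMs = nowMs - Date.parse(parsed.generatedAt);
  return { status: ageMs <= maxAgeMs ? 'ok' : 'stale', snapshot: parsed };
}
```

Keeping `missing`, `invalid`, and `stale` distinct lets the UI show *unknown* when an exporter stops writing, instead of reporting the instance as down.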

## Suggested Execution Order

1. Phase 5 P0 (CI path + lint) — unblocks everything.
2. Phase 1 (harden `hermes-ops`) — the foundation the real UI sits on.
3. Phase 2 (instance dimension) + Phase 4 (Bheem parity) in parallel — make "two instances" first-class in both data and ops.
4. Phase 3 (real telemetry, pane by pane).
5. Phase 7 (auth) before any wider access; Phase 8 (Telegram) and Phase 6 (polish) last.

Each item is sized to land as a single PR with incremental commits to `main`.