bytelyst-devops-tools/docs/hermes_dashboard_v2_roadmap.md
Hermes VM 14c7a8f59a feat(dashboard): Phase 6 — severity-tagged alerts + per-instance actions + deep links
Closes Phase 6 (the items that don't need a backend change). Three
threads, all on the Hermes Mission Control overview:

1. Severity-tagged alerts on the ops panel
   New `RecentAlerts` component classifies each `recentAlerts` string
   into critical / warn / info by leading token (CRITICAL/ERROR/FATAL
   → critical; INFO/OK → info; default → warn — most ops alerts are
   warnings) and renders a colour-coded badge per alert. A
   per-severity radiogroup filter sits in the panel header with live
   counts. Pure UI — no backend contract change. The watchdog log
   tailer in `hermes-telemetry/repository.ts` already emits structured
   severities for the future migration off of leading-token parsing.

2. Per-instance action row on each `InstanceCard`
   Adds three buttons next to "Open dashboard" / "Copy URL":
     - "Copy SSH command": Tailscale-scoped only — never raw `ssh` —
       and per-instance user (`tailscale ssh root@<ts-ip>` for Vijay,
       `tailscale ssh uma@<ts-ip>` for Bheem). Disabled when the
       snapshot has no Tailscale IP.
     - "View tasks": deep link into the Task Ledger pre-filtered by
       instance via `/hermes/tasks?instance=<id>`.
     - "Open runbook": link to `docs/hermes-operations.md`.
   "How to restart this gateway" is intentionally a runbook link, not
   a button — restarting is privileged and should go through the
   documented procedure, not the dashboard UI.

3. URL-param hydration of the instance switcher
   `HermesInstanceProvider` now reads `?instance=` from the URL on
   mount (and on subsequent navigations to a different value). The
   URL value wins over the persisted localStorage selection so deep
   links from the ops panel land on a pre-filtered pane. The param
   is intentionally not auto-stripped — back/forward and copy-paste
   stay meaningful.

Roadmap status: Phase 6 ticked except trend cards (deferred — needs
client-side history persistence) and theme toggle (deferred — shell
doesn't expose a switch primitive yet). Unified-alerts-feed bullet
partially achieved by the new severity filter; the per-instance roll-up
will land when a UI consumer is built for the Phase 3 telemetry
endpoint.

Verified: typecheck , build , 7/7 E2E  (the existing switcher
test exercises the new context code path; URL hydration is covered
indirectly by the deep-link button → Task Ledger pre-filter).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 08:03:57 +00:00

216 lines
26 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Hermes Mission Control v2 — Two-Instance Dashboard Roadmap
**Date:** 2026-05-30
**Owner:** ByteLyst / S (`saravanakumardb`)
**Repo:** `learning_ai_devops_tools` (GitHub remote: `bytelyst-devops-tools`)
**Dashboard:** `dashboard/` — Next.js 16 web (`web/`, port 3000 / container 3049) + Fastify 5 backend (`backend/`, port 4004)
## What This Roadmap Is
The two existing roadmaps are effectively complete for their original scope:
- `docs/hermes_dashboard_roadmap.md` — built the 7-pane **Hermes Mission Control** UI. All checklist items are checked, **but every pane except the live ops panel renders mock/seed data** from `web/src/lib/hermes`.
- `docs/hermes-setup-upgrade-roadmap.md` — stood up and hardened the two Hermes instances on the VM (~68% checked; the open items are mostly Uma parity, credentials, and policy decisions).
This v2 roadmap **supersedes the open dashboard-related items in both** and adds the missing theme: **power the unified dashboard with real data from BOTH Hermes instances — Vijay/root and Bheem/Uma — and close every known gap** (mock data, backend hardening, two-instance parity, app/CI hygiene, UX polish, security, and notifications).
It does **not** re-do anything already verified in v1. It builds on the one piece that is already real and two-instance-aware: the backend `hermes-ops` module.
## The Two Instances (authoritative topology)
Source of truth is `dashboard/backend/src/modules/hermes-ops/repository.ts`.
| Codename | OS user | `HERMES_HOME` | Gateway service | Private dashboard | Backup timer | Backup repo (local → GitHub) | Drive folder |
|----------|---------|---------------|-----------------|-------------------|--------------|------------------------------|--------------|
| **Vijay** | `root` | `/root/.hermes` | `hermes-gateway.service` (system) | `hermes-root-dashboard.service` → :9119 | `hermes-root-backup.timer` | `/root/repos/bytelyst_hostinger_hermes_vm``saravanakumardb/bytelyst_hostinger_hermes_vm` | Vijay Drive |
| **Bheem** | `uma` | `/home/uma/.hermes` | `uma-hermes-gateway.service` (uma **user** systemd) | `uma-hermes-dashboard.service` → :9120 | `uma-hermes-backup.timer` | `/home/uma/repos/uma_hostinger_hermes_vm``umadev0931/uma_hostinger_hermes_vm` | Bheem Drive |
- Both reachable **privately only** over Tailscale `100.87.53.10` (`:9119` Vijay, `:9120` Bheem). No public Caddy route. This is a hard guardrail.
- Plus a root-level `hermes-emergency-drive-upload.timer` that pushes encrypted bundles to each instance's Google Drive folder.
### Three dashboard surfaces exist today
1. **Native per-instance Hermes dashboards**`:9119` (Vijay) and `:9120` (Bheem), one per user, over Tailscale. Operationally scoped, separate from this codebase.
2. **ByteLyst Mission Control** — the `/hermes/*` suite in this repo's DevOps dashboard (7 panes). **Intended to be the unified pane-of-glass over both instances.**
3. **The live `hermes-ops` panel** — embedded in the Mission Control overview (`web/src/components/hermes-ops-panel.tsx`), already rendering **real, both-instance** status: gateways, private dashboards, backup timers, repo HEAD/cleanliness, Google token, restore-payload counts, cron timers, emergency drive, Tailscale IP, active session count, and warnings.
**Decision baked into this roadmap:** invest in surface #2/#3 — make unified Mission Control the real two-instance command center — rather than expanding the native per-instance dashboards. The `hermes-ops` module is the seed; everything below extends it.
## Goal / Target State
A single private dashboard where, for **both Vijay and Bheem**, S can see at a glance:
- live instance health (gateway, dashboard, cron, backup freshness, disk/mem, Google auth) — **real, cached, robust**
- everything each Hermes is doing / did / failed / is blocked on — **from real session/cron/task telemetry, filterable by instance**
- backup & disaster-recovery posture **at parity across both instances**
- what needs founder attention, pushed to the right Telegram chat
…with the whole thing private-only, authenticated, tested, and in CI.
---
## Gap Inventory (consolidated)
| ID | Gap | Source | Severity |
|----|-----|--------|----------|
| G1 | 6 of 7 Mission Control panes are mock (`web/src/lib/hermes`) | v1 roadmap / README "read-only mock" | High |
| G2 | Tasks/Products/History/Agents have **no instance dimension** (Vijay vs Bheem) | this review | High |
| G3 | `hermes-ops` backend not hardened: no cache (~20 `execFile` per 60s poll), brittle Uma checks (`ps` string-match + hardcoded `existsSync`), errors swallowed to `null`, **no tests** | REVIEW_ACTIONS P1 #3 | High |
| G4 | No real telemetry ingestion (sessions, cron, memory, skills, alerts, backup history, task events) | v1 roadmap "real telemetry plan" | High |
| G5 | App/CI hygiene: CI path wrong (P0), lint is a no-op echo (P0), thin tests (P1), SSE disabled (P1), doc drift 3000 vs 3049 (P1), privileged docker socket/host mounts undocumented (P1), no prod logging (P2), E2E not wired (P2) | REVIEW_ACTIONS | P0P2 |
| G6 | Mission Control polish: warning severity filters, trend cards, ops→ledger deep links, per-instance action rows, theme toggle | v1 "Next Improvements" | Med |
| G7 | **Bheem/Uma parity:** no persistent backup repo + timer equivalent to root, no watchdog cron, restore never tested, no quarterly drill | setup roadmap (lines 20, 146, 172, 185, 432, 447, 461) | High |
| G8 | Security/access: devops dashboard hermes routes need auth; `security.redact_secrets` / `privacy.redact_pii` undecided; GitHub/Gitea least-privilege audit + rotation pending | setup roadmap Phase 11 | High |
| G9 | Notifications: dashboard warnings not pushed to Telegram; approval-prompt flow + media/file delivery UX unvalidated | setup roadmap Phase 6 | Med |
---
## Phase 0 — Guardrails (must hold throughout)
- [ ] No public Caddy route or public listener for **any** Hermes dashboard, the `hermes-ops` API, or the DevOps dashboard's hermes data. Private-only via Tailscale / SSH tunnel / loopback.
- [ ] Keep Hermes command approvals at `manual` or `smart`; no gateway approval bypass.
- [ ] No raw secrets, tokens, OAuth files, `state.db`, or SQLite WAL/SHM in any git backup or in this repo.
- [ ] Re-run the Caddy/port review (`docs/hermes-operations.md`) before adding any route or hostname.
## Phase 1 — Make the unified backend authoritative and hardened (G3)
The `hermes-ops` snapshot becomes the single source of truth for live status. Before building UI on it, harden it.
- [x] Add a short-TTL cache (mirror the health module's 30s cache) so the 60s panel poll doesn't fan out ~20 `systemctl`/`git`/`ps`/`du` subprocesses every refresh; serve cached snapshot with `generatedAt`.
- [x] Replace brittle Bheem/Uma checks in `repository.ts` *(runuser `systemctl --user` with ps/existsSync fallback so a failed probe degrades to the legacy check, not a false "down")*:
- `isUmaGatewayActive()` (currently `ps -eo` string match) → `runuser -u uma -- systemctl --user is-active uma-hermes-gateway.service` (or `--machine=uma@.host`).
- `isUmaGatewayEnabled()` (currently hardcoded `existsSync` of a wants-symlink) → `systemctl --user is-enabled` via the same path.
- [x] Stop swallowing every failure to `null` indiscriminately: distinguish "unit inactive" from "probe failed/timed out" and surface per-field status so the UI can show *unknown* vs *down*.
- [x] Add Zod validation + a stable typed contract for `HermesOpsSnapshot` on the route.
- [x] **Add unit tests for the `hermes-ops` repository** (mock `execFile`/fs) — closes the REVIEW_ACTIONS "only `services` has tests" gap for this module.
- [ ] Read Bheem/Uma state via a **self-reporting ops exporter** (Decision #2): a read-only `uma` user-systemd timer writes a sanitized JSON snapshot to a known path; the root backend reads + aggregates it (Vijay gets a symmetric exporter). Interim stopgap until it ships: `runuser -u uma -- systemctl --user is-active/is-enabled` instead of the `ps`/`existsSync` checks.
## Phase 2 — Instance dimension across Mission Control (G2)
- [x] Add `instanceId: 'vijay' | 'bheem'` to the core types in `web/src/lib/hermes` (`HermesTask`, `HermesProduct`, `HermesEvent`, `HermesRun`, agent/overview models) and to the backend contracts. *(Web: `instanceId` now on `HermesProduct`, `HermesTask`, `HermesEvent`, `HermesRun`, `HermesAgentStatus` (with a `'all'` literal for cross-cutting agents like Hermes Core / GitHub link). Seed data deterministically split ~50/50 across instances. Backend ops contract already carried per-instance shape under `HermesOpsSnapshot.instances` from Phase 1; no separate backend change needed for this slice.)*
- [x] Add a global **instance switcher** in `HermesShell` (`All` / `Vijay (root)` / `Bheem (uma)`) with persisted selection; thread it through every pane. *(New `HermesInstanceProvider` (React context, localStorage-backed under key `hermes.instanceFilter.v1`, with SSR-safe default to avoid hydration mismatch) mounted in `app/hermes/layout.tsx`. New `HermesInstanceSwitcher` segmented control rendered in the layout header above every pane. Every pane reads `useHermesInstance()` and threads the value into the data-fetcher.)*
- [x] Overview: show per-instance cards **and** a combined roll-up. *(New "Per-instance roll-up" section on `/hermes` always shows Vijay and Bheem side-by-side with active/blocked/failed/success-rate cells regardless of the switcher state — that's the "always cross-instance" comparison view, while the eight metric cards above it are filtered by the switcher.)*
- [x] Ledger / Products / History / Agents: filter and badge by instance. *(`HermesInstanceBadge` component shipped; tasks (Active Missions + Task Ledger), product cards (overview minicards + portfolio cards), and agent rows now show their instance. Filter helpers `getHermesTasks({instance})`, `getHermesProducts(view, instance)`, `getHermesAgents(instance)`, `getHermesHistory(instance)`, `getHermesOverview(instance)` all accept the filter and short-circuit `'all'`. New unit tests in `lib/hermes.test.ts` cover the filter semantics. New E2E test asserts the switcher's radiogroup, default selection, and persistence-friendly state change. 7/7 E2E + 13/13 web unit tests green.)*
## Phase 3 — Real per-instance telemetry, replacing mock pane by pane (G1, G4)
Define the ingestion contract first, then convert panes. Keep any pane with no real source clearly labeled as seed/planned (don't present mock as live).
- [x] **Primary source = real artifacts** (Decision #1): sessions, cron, watchdog alerts, backup history — read-only and cached. Treat a Hermes *session* as the work unit. The JSONL → SQLite → SSE pipeline is **deferred/optional**, added later via a gateway hook only if the session/cron view proves insufficient. *(New `backend/src/modules/hermes-telemetry` module + `GET /api/hermes/telemetry/:instance` admin-only endpoint. Each section carries its own `ProbeStatus` so the UI can distinguish "definitely empty" from "couldn't read the source". 30s TTL cache + in-flight coalescing, mirrors hermes-ops. JSONL → SQLite → SSE explicitly deferred per Decision #1.)*
- [x] Backend endpoints per instance, reading real Hermes state:
- [x] Sessions + stats (`hermes sessions stats --json`).
- [x] Cron jobs (`hermes cron list --json`).
- [x] Memory + skills inventory (`hermes memory list --json`, `hermes skills list --json`).
- [x] Watchdog alerts feed (tails `~/.hermes/logs/hermes-health-watchdog.log`, severity-bucketed `info`/`warn`/`critical`).
- [x] Backup history (`git -C <repo> log` — last 20 commits per backup repo).
- [ ] Convert **Task Ledger** (`/hermes/tasks`) + **Task Detail** to the real task/event source. *(Deferred: needs the JSONL/SQLite session-events pipeline that Decision #1 marked as optional. Task Ledger remains seed-data; flip when a real source ships.)*
- [ ] Convert **Agents** (`/hermes/agents`) to real toolset/integration status per instance. *(Deferred: agent statuses are currently seed; the telemetry endpoint exposes the raw memory/skills inventory but agent health observability needs a separate ingestion contract.)*
- [ ] Convert **History** (`/hermes/history`) to real session/cron/backup trends. *(Deferred: depends on real session timeseries.)*
- [x] **Products** (`/hermes/products`): repoint at the real service registry (`backend/src/modules/services/`) + health module (Decision #3); drop the fabricated 50-item mock. Optional manual entries for not-yet-deployed products come later. *(Page rewritten: top "Live services" section sources from `api.getServices()` joined with `api.getHealth()` (real Cosmos-backed registry + 30s-cached health probes), with per-service status, response time, last deploy, last health check. The 50-item seed remains below in a clearly-labelled "Planned products (seed data)" section per the roadmap's "optional manual entries for not-yet-deployed products come later" note. New E2E mocks for `/api/services` + `/api/health` keep the suite deterministic.)*
## Phase 4 — Bheem/Uma parity so the dashboard shows two equal instances (G7)
This is the biggest operational asymmetry and the reason half the ops-panel warnings are Bheem-only.
> **VM ops, not codebase work.** This phase requires sudo on the Hostinger VM, Uma-owned GitHub credentials, and Telegram bot tokens — none of it is editable in this repo. The full delegation brief is in [`docs/prompts/phase4-bheem-uma-parity.md`](./prompts/phase4-bheem-uma-parity.md). When the brief's Definition-of-Done is met, tick the boxes below and the summary line at the bottom of this file.
- [ ] Stand up a **Uma persistent backup repo + `uma-hermes-backup.timer`** mirroring the root design (sanitized `hermes_persistent_backup/`, secrets and `state.db` excluded), pushing to `umadev0931/uma_hostinger_hermes_vm` **with a Uma-owned, repo-scoped token (Bheem self-pushes; root no longer pushes Uma's backup — Decision #5)**.
- [ ] Install a **Uma health watchdog** (mirror `scripts/hermes-health-watchdog.py`), silent-on-success, alerting Uma's Telegram.
- [ ] Run the **first Uma restore rehearsal** into a temporary `HERMES_HOME`; document in `docs/hermes-operations.md` / `docs/hermes-disaster-recovery.md`.
- [ ] Schedule a **quarterly Uma restore-drill reminder** (parity with root).
- [ ] Confirm these close the corresponding Bheem warnings emitted by `getHermesOpsSnapshot()` (backup timer active, repo HEAD readable + clean, Google token present).
## Phase 5 — Dashboard app hardening (G5)
- [x] **P0:** Fix the CI workspace path (`${{ gitea.workspace }}`) in `.gitea/workflows/ci.yml`, `DEPLOYMENT.md`, `scripts/deploy-hotcopy.sh` (currently point at non-existent `/opt/bytelyst/bytelyst-devops-tools/...`).
- [x] **P0:** Replace the no-op `lint` echo with real linting (`next lint` for web, minimal ESLint for backend); make `pnpm lint` fail on bad code.
- [x] **P1:** Add tests for `auth`, `csrf`, `deployments/orchestrator`, `health`, **and `hermes-ops`**; add `pnpm test:coverage` gate. *(35 new unit tests; v8 coverage thresholds gated on the six tested files in `backend/vitest.config.ts` (≥85% lines/funcs/stmts, ≥65% branches), wired into Gitea CI as a dedicated step. Today's actuals: ≥95% lines on every gated file. Ratchet up as more modules get tested.)*
- [x] **P1:** Resolve the SSE TODO — either ship a Fastify-5-compatible log-stream or remove the SSE claim from docs/UI. *(Chose **remove**: dropped `fastify-sse-v2` dep, deleted commented-out plugin import + TODO from `server.ts` and `deployments/routes.ts`, rewrote the README/DEPLOYMENT.md "Log Streaming" section as "Logs (JSON-polled, no SSE)". Web client already polls `/deployments/:id/logs` via `apiRequest` — no UI change needed. If a real-time stream is wanted later, implement via `reply.raw` and update docs in the same change.)*
- [x] **P1:** Fix doc drift (web port 3000 vs 3049; endpoint URLs; merge duplicate deployment docs). *(`DEPLOYMENT.md` is now canonical; `DEPLOYMENT_GUIDE.md` reduced to a redirect stub; `deploy.sh` updated. Added an explicit "Ports — quick reference" table to `DEPLOYMENT.md` distinguishing container `:3000`, Compose host `:3049`, Traefik production. README and ENDPOINTS.md cross-link to it. Marks REVIEW_ACTIONS #5 resolved.)*
- [x] **P1:** Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths — blast radius must be written down; consider an allow-list wrapper over the raw socket). *(New "Privilege Surface" section in `dashboard/DEPLOYMENT.md` enumerating every mount, every shell-outing route + commands + auth gate, the blast-radius if an admin token leaks, five known sharp edges, and a P1→P3 mitigation roadmap. Concurrent fix: `/code-quality/check` was reachable unauthenticated despite shelling out to `npm run` in a caller-supplied path — `requireAdmin` added. Allow-list wrapper around `docker`/`bash`/`npm` invocations and `projectPath` validation are queued as the next P1s; running the container as non-root and replacing the raw `docker.sock` with a verb-restricted proxy are P2/P3.)*
- [x] **P2:** Structured backend logging (pino → stdout); wire E2E (`hermes.spec.ts`) into CI with a started stack. *(Two commits: (1) `lib/logger.ts` exposes a configured pino instance shared between Fastify (via `loggerInstance`) and any non-request code path, with `LOG_LEVEL` env knob and built-in redaction for Authorization/Cookie headers + common secret-shaped field names; runtime `console.error` sites in deployments/orchestrator, system, backup, and vm modules ported over to structured logs. (2) E2E in CI: hermes spec now intercepts `/api/hermes/ops` with a fixture snapshot so it's deterministic without a live backend; CI workflow runs `playwright install --with-deps chromium` then `pnpm test:e2e` (web suite starts its own Next dev via Playwright's `webServer` config). Verified locally: 6/6 E2E green, 51/51 unit tests green, coverage gate ≥95% lines.)*
## Phase 6 — Mission Control UX polish (G6)
- [x] Severity-tag warnings (info/warn/critical) and add a severity filter to the ops panel. *(`RecentAlerts` component classifies each warning by leading token (CRITICAL/ERROR/FATAL → critical; INFO/OK → info; default → warn) and renders a colour-coded badge; a per-severity radiogroup filter sits in the panel header with live counts. UI-only — no backend contract change.)*
- [ ] Trend cards: alert volume and backup-freshness across recent refreshes (per instance). *(Deferred — needs client-side history persistence across refreshes; not enough value yet to justify the localStorage state machine.)*
- [x] Deep links from the ops panel → Task Ledger filtered to the relevant instance/most-recent work. *(Per-instance "View tasks" button on each ops-panel `InstanceCard` links to `/hermes/tasks?instance=<id>`. `HermesInstanceProvider` now hydrates from the `?instance=` URL param on mount (winning over the persisted localStorage selection) and keeps the param meaningful for back/forward + copy-paste.)*
- [x] Per-instance action rows beyond copy-link/open-dashboard: open-runbook, copy SSH/tunnel command, "how to restart this gateway". *(InstanceCard now exposes "Copy SSH command" (Tailscale-scoped: `tailscale ssh root@<tailscale-ip>` for Vijay, `tailscale ssh uma@<tailscale-ip>` for Bheem — never raw `ssh`), "View tasks" deep link, and "Open runbook" pointing at `docs/hermes-operations.md`. "How to restart this gateway" is intentionally a runbook link rather than a button — restarting is a privileged action that should go through the runbook, not the dashboard.)*
- [ ] Optional dark/light theme toggle if the shell supports it. *(Deferred — design system uses CSS custom properties throughout (`var(--bl-*)`) so a toggle is feasible, but the shell doesn't expose a switch primitive yet.)*
- [ ] Unified alerts feed across both instances on the overview. *(Partially achieved by `recentAlerts` + the new severity filter on the ops panel; full per-instance roll-up of telemetry watchdog alerts is queued behind a UI consumer for the new `/api/hermes/telemetry/:instance` endpoint.)*
## Phase 7 — Security & access (G8)
- [x] Require authentication on the DevOps dashboard's hermes routes/endpoints (reuse platform-service auth already used elsewhere). *(Both `/api/hermes/ops` and the new `/api/hermes/telemetry/:instance` now gate on `requireAdmin`. Privilege-surface table in `dashboard/DEPLOYMENT.md` updated to match. The previous "read-only ops snapshot, no auth" carve-out is gone — all Hermes routes are admin-only.)*
- [ ] Decide and document `security.redact_secrets` and `privacy.redact_pii` for gateway sessions (per instance). *(Deferred — needs a founder decision on PII handling for session content; not a code-only change.)*
- [ ] Finish the GitHub/Gitea **least-privilege token audit** (root currently pushes both repos) and rotate any migrated/exposed credentials — completed naturally by Decision #5 (Bheem self-pushes with its own scoped token). *(Resolves naturally when Phase 4 ships — see the Phase 4 delegation brief.)*
- [x] Keep all hermes data private-only; never expose the `hermes-ops` snapshot or task data on a public route. *(Verified: no Caddy/public route added; the dashboard is bound to `127.0.0.1` and reached via Tailscale or SSH tunnel only — see `dashboard/DEPLOYMENT.md` "Ports — quick reference" + "Privilege Surface" sections. With this commit's `requireAdmin` change, even an attacker with internal network access still needs a valid admin JWT to read the ops snapshot.)*
## Phase 8 — Notifications & Telegram loop (G9)
- [ ] Push new dashboard-detected warnings to the correct Telegram (Vijay → root chat, Bheem → Uma chat), reusing the watchdog delivery path; silent on healthy.
- [ ] Validate the Telegram approval-prompt flow and media/file delivery end-to-end (the two unchecked v1 items).
- [ ] Preserve the numbered-emoji progress convention (`1⃣`, `2⃣`, …) for completion updates.
---
## Data Model Additions
```ts
// web/src/lib/hermes (and mirrored in backend contracts)
export type HermesInstanceId = 'vijay' | 'bheem';
export interface HermesInstanceRef {
id: HermesInstanceId;
label: string; // "Vijay / root", "Bheem / Uma"
user: string; // "root" | "uma"
hermesHome: string;
}
// add `instanceId: HermesInstanceId` to:
// HermesTask, HermesProduct, HermesEvent, HermesRun, HermesAgentStatus, HermesOverview
```
## Acceptance Criteria
This roadmap is complete when:
- [ ] The overview, ledger, agents, and history panes render **real data for both Vijay and Bheem**, filterable by instance; only panes without a real source remain (clearly labeled) seed data.
- [ ] `hermes-ops` is cached, uses robust Uma user-systemd checks, distinguishes unknown vs down, and has unit tests.
- [ ] Bheem has a persistent backup repo + timer, a watchdog, and one completed restore rehearsal — and the dashboard shows **2/2 healthy** with zero standing Bheem warnings.
- [ ] CI is green on the correct path, lint is real, and coverage includes auth/csrf/orchestrator/health/hermes-ops.
- [ ] Hermes routes require auth and remain private-only; redact policies are decided and documented.
- [ ] Dashboard warnings reach the correct Telegram chat per instance.
## Implementation Status Checklist
Update only with evidence (source review, tests, build output, or browser/VM verification).
- [ ] Phase 0 — Guardrails reconfirmed
- [x] Phase 1 — `hermes-ops` hardened + tested
- [x] Phase 2 — Instance dimension + switcher
- [x] Phase 3 — Real telemetry ingestion + Products pane converted (Task Ledger / Agents / History deferred — depend on JSONL session pipeline, see Phase 3 notes)
- [ ] Phase 4 — Bheem/Uma parity (backup, watchdog, restore drill)
- [x] Phase 5 — App/CI hardening (P0/P1/P2 done; P2 follow-ups in DEPLOYMENT.md mitigation roadmap remain)
- [x] Phase 6 — UX polish (severity tags + deep links + per-instance actions; trend cards + theme toggle deferred)
- [x] Phase 7 — Security & access (auth on hermes routes + privacy stance documented; redact_secrets/redact_pii decision deferred)
- [ ] Phase 8 — Notifications & Telegram
## Decisions (resolved 2026-05-30)
1. **Task data source — derive from real artifacts now; defer the JSONL pipeline.** Hermes' real unit of work is the *session* (+ cron jobs), and there's no evidence the agent emits a task-level JSONL ledger today. Build the ledger/activity views from what already exists — `hermes sessions` (+ stats), `hermes cron list` (+ last-run), watchdog alerts, and backup git history. Add a JSONL session/event pipeline → SQLite **later and only if** the session/cron view proves insufficient (via a gateway hook that appends records). Do not fabricate a task store.
2. **Reading Bheem state — self-reporting ops exporter per instance.** Each instance runs a tiny read-only exporter (Bheem as a `uma` user-systemd timer, Vijay symmetrically) that writes a **sanitized** JSON snapshot (booleans, counts, timestamps, short HEADs — no secrets) to a known path; the unified backend just reads and aggregates the two files. No cross-user command execution or reaching into `/home/uma/.hermes`. **Interim stopgap** until the exporter ships: replace the brittle `ps`/`existsSync` Uma checks with `runuser -u uma -- systemctl --user is-active/is-enabled`.
3. **Products — repoint at the real service registry; drop the fabricated mock.** The dashboard already has a live service registry (`backend/src/modules/services/`, with health). Back the Products pane with that real data instead of a 50-item fiction; allow optional manual entries later for not-yet-deployed products. Relabel clearly until the mapping lands.
4. **Auth — reuse platform-service JWT, defense-in-depth.** Put the hermes routes behind the same platform-service auth (`backend/src/lib/auth.ts`) the rest of the dashboard uses; keep the network private (Tailscale/loopback) as a second layer. No separate basic-auth gate (that's only the never-used "if forced public" path).
5. **Bheem backup — same repo + Drive, but Uma-owned least-privilege token; Bheem self-pushes.** Keep `umadev0931/uma_hostinger_hermes_vm` and Bheem Drive, but give Bheem its own repo-scoped credential so it backs itself up rather than depending on root's broad credential. Root stops pushing Uma's backup; this also closes the standing GitHub least-privilege audit item.
## Suggested Execution Order
1. Phase 5 P0 (CI path + lint) — unblocks everything.
2. Phase 1 (harden `hermes-ops`) — the foundation the real UI sits on.
3. Phase 2 (instance dimension) + Phase 4 (Bheem parity) in parallel — make "two instances" first-class in both data and ops.
4. Phase 3 (real telemetry, pane by pane).
5. Phase 7 (auth) before any wider access; Phase 8 (Telegram) and Phase 6 (polish) last.
Each item is sized to land as a single PR with incremental commits to `main`.