Hermes VM 14c7a8f59a feat(dashboard): Phase 6 — severity-tagged alerts + per-instance actions + deep links
Closes Phase 6 (the items that don't need a backend change). Three
threads, all on the Hermes Mission Control overview:

1. Severity-tagged alerts on the ops panel
   New `RecentAlerts` component classifies each `recentAlerts` string
   into critical / warn / info by leading token (CRITICAL/ERROR/FATAL
   → critical; INFO/OK → info; default → warn — most ops alerts are
   warnings) and renders a colour-coded badge per alert. A
   per-severity radiogroup filter sits in the panel header with live
   counts. Pure UI — no backend contract change. The watchdog log
   tailer in `hermes-telemetry/repository.ts` already emits structured
   severities for the future migration off of leading-token parsing.

2. Per-instance action row on each `InstanceCard`
   Adds three buttons next to "Open dashboard" / "Copy URL":
     - "Copy SSH command": Tailscale-scoped only — never raw `ssh` —
       and per-instance user (`tailscale ssh root@<ts-ip>` for Vijay,
       `tailscale ssh uma@<ts-ip>` for Bheem). Disabled when the
       snapshot has no Tailscale IP.
     - "View tasks": deep link into the Task Ledger pre-filtered by
       instance via `/hermes/tasks?instance=<id>`.
     - "Open runbook": link to `docs/hermes-operations.md`.
   "How to restart this gateway" is intentionally a runbook link, not
   a button — restarting is privileged and should go through the
   documented procedure, not the dashboard UI.

3. URL-param hydration of the instance switcher
   `HermesInstanceProvider` now reads `?instance=` from the URL on
   mount (and on subsequent navigations to a different value). The
   URL value wins over the persisted localStorage selection so deep
   links from the ops panel land on a pre-filtered pane. The param
   is intentionally not auto-stripped — back/forward and copy-paste
   stay meaningful.
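
A minimal sketch of how threads 2 and 3 might fit together. It assumes the filter takes the values `all` / `vijay` / `bheem` and that `all` is the fallback default. The helper names (`buildSshCommand`, `buildTasksLink`, `resolveInstanceFilter`) are hypothetical and not taken from the commit; only the `tailscale ssh <user>@<ts-ip>` form, the `/hermes/tasks?instance=<id>` link, and the "URL wins over localStorage" rule come from the notes above.

```typescript
// Hypothetical helpers illustrating the per-instance action row and the
// URL-param hydration rule. Names and the 'all' default are assumptions.
type InstanceId = 'vijay' | 'bheem';
type InstanceFilter = InstanceId | 'all';

// Per-instance SSH user, as described in the commit notes.
const SSH_USER: Record<InstanceId, string> = { vijay: 'root', bheem: 'uma' };

function isInstanceFilter(value: string | null | undefined): value is InstanceFilter {
  return value === 'all' || value === 'vijay' || value === 'bheem';
}

/** Returns null when there is no Tailscale IP, so the button can render disabled. */
function buildSshCommand(instance: InstanceId, tailscaleIp: string | null): string | null {
  if (!tailscaleIp) return null;
  return `tailscale ssh ${SSH_USER[instance]}@${tailscaleIp}`;
}

/** Deep link into the Task Ledger, pre-filtered by instance. */
function buildTasksLink(instance: InstanceId): string {
  return `/hermes/tasks?instance=${encodeURIComponent(instance)}`;
}

/** The URL param wins over the persisted value; invalid values fall through. */
function resolveInstanceFilter(search: string, persisted: string | null): InstanceFilter {
  const fromUrl = new URLSearchParams(search).get('instance');
  if (isInstanceFilter(fromUrl)) return fromUrl;
  if (isInstanceFilter(persisted)) return persisted;
  return 'all';
}
```

Keeping the precedence rule in a pure function like this would let the provider be unit-tested without a browser, which could close the "covered indirectly" gap noted under verification.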

Roadmap status: Phase 6 ticked except trend cards (deferred — needs
client-side history persistence) and theme toggle (deferred — shell
doesn't expose a switch primitive yet). Unified-alerts-feed bullet
partially achieved by the new severity filter; the per-instance roll-up
will land when a UI consumer is built for the Phase 3 telemetry
endpoint.

Verified: typecheck, build, and 7/7 E2E all pass (the existing switcher
test exercises the new context code path; URL hydration is covered
indirectly by the deep-link button → Task Ledger pre-filter).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 08:03:57 +00:00

Hermes Mission Control v2 — Two-Instance Dashboard Roadmap

Date: 2026-05-30
Owner: ByteLyst / S (saravanakumardb)
Repo: learning_ai_devops_tools (GitHub remote: bytelyst-devops-tools)
Dashboard: dashboard/, Next.js 16 web (web/, port 3000 / container 3049) + Fastify 5 backend (backend/, port 4004)

What This Roadmap Is

The two existing roadmaps are effectively complete for their original scope:

  • docs/hermes_dashboard_roadmap.md — built the 7-pane Hermes Mission Control UI. All checklist items are checked, but every pane except the live ops panel renders mock/seed data from web/src/lib/hermes.
  • docs/hermes-setup-upgrade-roadmap.md — stood up and hardened the two Hermes instances on the VM (~68% checked; the open items are mostly Uma parity, credentials, and policy decisions).

This v2 roadmap supersedes the open dashboard-related items in both and adds the missing theme: power the unified dashboard with real data from BOTH Hermes instances — Vijay/root and Bheem/Uma — and close every known gap (mock data, backend hardening, two-instance parity, app/CI hygiene, UX polish, security, and notifications).

It does not re-do anything already verified in v1. It builds on the one piece that is already real and two-instance-aware: the backend hermes-ops module.

The Two Instances (authoritative topology)

Source of truth is dashboard/backend/src/modules/hermes-ops/repository.ts.

| Codename | OS user | HERMES_HOME | Gateway service | Private dashboard | Backup timer | Backup repo (local → GitHub) | Drive folder |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Vijay | root | /root/.hermes | hermes-gateway.service (system) | hermes-root-dashboard.service → :9119 | hermes-root-backup.timer | /root/repos/bytelyst_hostinger_hermes_vm → saravanakumardb/bytelyst_hostinger_hermes_vm | Vijay Drive |
| Bheem | uma | /home/uma/.hermes | uma-hermes-gateway.service (uma user systemd) | uma-hermes-dashboard.service → :9120 | uma-hermes-backup.timer | /home/uma/repos/uma_hostinger_hermes_vm → umadev0931/uma_hostinger_hermes_vm | Bheem Drive |

  • Both reachable privately only over Tailscale 100.87.53.10 (:9119 Vijay, :9120 Bheem). No public Caddy route. This is a hard guardrail.
  • Plus a root-level hermes-emergency-drive-upload.timer that pushes encrypted bundles to each instance's Google Drive folder.

Three dashboard surfaces exist today

  1. Native per-instance Hermes dashboards on :9119 (Vijay) and :9120 (Bheem), one per user, over Tailscale. Operationally scoped, separate from this codebase.
  2. ByteLyst Mission Control — the /hermes/* suite in this repo's DevOps dashboard (7 panes). Intended to be the unified pane-of-glass over both instances.
  3. The live hermes-ops panel — embedded in the Mission Control overview (web/src/components/hermes-ops-panel.tsx), already rendering real, both-instance status: gateways, private dashboards, backup timers, repo HEAD/cleanliness, Google token, restore-payload counts, cron timers, emergency drive, Tailscale IP, active session count, and warnings.

Decision baked into this roadmap: invest in surface #2/#3 — make unified Mission Control the real two-instance command center — rather than expanding the native per-instance dashboards. The hermes-ops module is the seed; everything below extends it.

Goal / Target State

A single private dashboard where, for both Vijay and Bheem, S can see at a glance:

  • live instance health (gateway, dashboard, cron, backup freshness, disk/mem, Google auth) — real, cached, robust
  • everything each Hermes is doing / did / failed / is blocked on — from real session/cron/task telemetry, filterable by instance
  • backup & disaster-recovery posture at parity across both instances
  • what needs founder attention, pushed to the right Telegram chat

…with the whole thing private-only, authenticated, tested, and in CI.


Gap Inventory (consolidated)

| ID | Gap | Source | Severity |
| --- | --- | --- | --- |
| G1 | 6 of 7 Mission Control panes are mock (web/src/lib/hermes) | v1 roadmap / README "read-only mock" | High |
| G2 | Tasks/Products/History/Agents have no instance dimension (Vijay vs Bheem) | this review | High |
| G3 | hermes-ops backend not hardened: no cache (~20 execFile per 60s poll), brittle Uma checks (ps string-match + hardcoded existsSync), errors swallowed to null, no tests | REVIEW_ACTIONS P1 #3 | High |
| G4 | No real telemetry ingestion (sessions, cron, memory, skills, alerts, backup history, task events) | v1 roadmap "real telemetry plan" | High |
| G5 | App/CI hygiene: CI path wrong (P0), lint is a no-op echo (P0), thin tests (P1), SSE disabled (P1), doc drift 3000 vs 3049 (P1), privileged docker socket/host mounts undocumented (P1), no prod logging (P2), E2E not wired (P2) | REVIEW_ACTIONS | P0-P2 |
| G6 | Mission Control polish: warning severity filters, trend cards, ops→ledger deep links, per-instance action rows, theme toggle | v1 "Next Improvements" | Med |
| G7 | Bheem/Uma parity: no persistent backup repo + timer equivalent to root, no watchdog cron, restore never tested, no quarterly drill | setup roadmap (lines 20, 146, 172, 185, 432, 447, 461) | High |
| G8 | Security/access: devops dashboard hermes routes need auth; security.redact_secrets / privacy.redact_pii undecided; GitHub/Gitea least-privilege audit + rotation pending | setup roadmap Phase 11 | High |
| G9 | Notifications: dashboard warnings not pushed to Telegram; approval-prompt flow + media/file delivery UX unvalidated | setup roadmap Phase 6 | Med |

Phase 0 — Guardrails (must hold throughout)

  • No public Caddy route or public listener for any Hermes dashboard, the hermes-ops API, or the DevOps dashboard's hermes data. Private-only via Tailscale / SSH tunnel / loopback.
  • Keep Hermes command approvals at manual or smart; no gateway approval bypass.
  • No raw secrets, tokens, OAuth files, state.db, or SQLite WAL/SHM in any git backup or in this repo.
  • Re-run the Caddy/port review (docs/hermes-operations.md) before adding any route or hostname.

Phase 1 — Make the unified backend authoritative and hardened (G3)

The hermes-ops snapshot becomes the single source of truth for live status. Before building UI on it, harden it.

  • Add a short-TTL cache (mirror the health module's 30s cache) so the 60s panel poll doesn't fan out ~20 systemctl/git/ps/du subprocesses every refresh; serve cached snapshot with generatedAt.
  • Replace brittle Bheem/Uma checks in repository.ts (runuser systemctl --user with ps/existsSync fallback so a failed probe degrades to the legacy check, not a false "down"):
    • isUmaGatewayActive() (currently ps -eo string match) → runuser -u uma -- systemctl --user is-active uma-hermes-gateway.service (or --machine=uma@.host).
    • isUmaGatewayEnabled() (currently hardcoded existsSync of a wants-symlink) → systemctl --user is-enabled via the same path.
  • Stop swallowing every failure to null indiscriminately: distinguish "unit inactive" from "probe failed/timed out" and surface per-field status so the UI can show unknown vs down.
  • Add Zod validation + a stable typed contract for HermesOpsSnapshot on the route.
  • Add unit tests for the hermes-ops repository (mock execFile/fs) — closes the REVIEW_ACTIONS "only services has tests" gap for this module.
  • Read Bheem/Uma state via a self-reporting ops exporter (Decision #2): a read-only uma user-systemd timer writes a sanitized JSON snapshot to a known path; the root backend reads + aggregates it (Vijay gets a symmetric exporter). Interim stopgap until it ships: runuser -u uma -- systemctl --user is-active/is-enabled instead of the ps/existsSync checks.
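
The short-TTL cache from the first bullet could look roughly like the sketch below. It is an illustration, not the module's actual code: `createSnapshotCache`, its parameters, and the `Cached` wrapper are hypothetical names, and the 30s default simply mirrors the figure quoted for the health module. Concurrent callers share one in-flight load, so a burst of panel polls triggers at most one round of subprocess probes per TTL window.

```typescript
// Illustrative TTL cache with in-flight coalescing. All names are assumptions.
interface Cached<T> {
  value: T;
  generatedAt: number; // epoch ms, surfaced to the UI alongside the snapshot
}

function createSnapshotCache<T>(
  load: () => Promise<T>,          // e.g. the expensive systemctl/git/du fan-out
  ttlMs = 30_000,
  now: () => number = Date.now,    // injectable clock for tests
): () => Promise<Cached<T>> {
  let cached: Cached<T> | null = null;
  let inFlight: Promise<Cached<T>> | null = null;

  return function get(): Promise<Cached<T>> {
    // Fresh enough: serve the cached snapshot without touching the host.
    if (cached && now() - cached.generatedAt < ttlMs) {
      return Promise.resolve(cached);
    }
    // A refresh is already running: piggyback on it instead of starting another.
    if (inFlight) return inFlight;

    inFlight = load()
      .then((value) => {
        cached = { value, generatedAt: now() };
        return cached;
      })
      .finally(() => {
        inFlight = null; // allow the next refresh, whether the load succeeded or failed
      });
    return inFlight;
  };
}
```

A failed load is not cached in this sketch, so the next caller retries immediately; whether to serve a stale snapshot on failure instead is a design choice the roadmap leaves open.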

Phase 2 — Instance dimension across Mission Control (G2)

  • Add instanceId: 'vijay' | 'bheem' to the core types in web/src/lib/hermes (HermesTask, HermesProduct, HermesEvent, HermesRun, agent/overview models) and to the backend contracts. (Web: instanceId now on HermesProduct, HermesTask, HermesEvent, HermesRun, HermesAgentStatus (with an 'all' literal for cross-cutting agents like Hermes Core / GitHub link). Seed data deterministically split ~50/50 across instances. Backend ops contract already carried per-instance shape under HermesOpsSnapshot.instances from Phase 1; no separate backend change needed for this slice.)
  • Add a global instance switcher in HermesShell (All / Vijay (root) / Bheem (uma)) with persisted selection; thread it through every pane. (New HermesInstanceProvider (React context, localStorage-backed under key hermes.instanceFilter.v1, with SSR-safe default to avoid hydration mismatch) mounted in app/hermes/layout.tsx. New HermesInstanceSwitcher segmented control rendered in the layout header above every pane. Every pane reads useHermesInstance() and threads the value into the data-fetcher.)
  • Overview: show per-instance cards and a combined roll-up. (New "Per-instance roll-up" section on /hermes always shows Vijay and Bheem side-by-side with active/blocked/failed/success-rate cells regardless of the switcher state — that's the "always cross-instance" comparison view, while the eight metric cards above it are filtered by the switcher.)
  • Ledger / Products / History / Agents: filter and badge by instance. (HermesInstanceBadge component shipped; tasks (Active Missions + Task Ledger), product cards (overview minicards + portfolio cards), and agent rows now show their instance. Filter helpers getHermesTasks({instance}), getHermesProducts(view, instance), getHermesAgents(instance), getHermesHistory(instance), getHermesOverview(instance) all accept the filter and short-circuit 'all'. New unit tests in lib/hermes.test.ts cover the filter semantics. New E2E test asserts the switcher's radiogroup, default selection, and persistence-friendly state change. 7/7 E2E + 13/13 web unit tests green.)
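
The filter semantics described above (accept an instance, short-circuit on 'all') can be sketched as one generic helper. This is an assumption-laden illustration: `filterByInstance` is a hypothetical name, and the choice to keep items tagged 'all' (cross-cutting agents) visible under every instance filter is a guess at the intended behaviour rather than something the roadmap states.

```typescript
// Hypothetical generic filter; the real helpers are getHermesTasks, getHermesAgents, etc.
type HermesInstanceId = 'vijay' | 'bheem';
type InstanceFilter = HermesInstanceId | 'all';

interface HasInstance {
  instanceId: HermesInstanceId | 'all'; // 'all' marks cross-cutting items
}

function filterByInstance<T extends HasInstance>(
  items: readonly T[],
  filter: InstanceFilter = 'all',
): T[] {
  // Short-circuit: no filtering work when the switcher is on "All".
  if (filter === 'all') return [...items];
  // Assumption: cross-cutting items stay visible under any single-instance filter.
  return items.filter((item) => item.instanceId === filter || item.instanceId === 'all');
}
```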

Phase 3 — Real per-instance telemetry, replacing mock pane by pane (G1, G4)

Define the ingestion contract first, then convert panes. Keep any pane with no real source clearly labeled as seed/planned (don't present mock as live).

  • Primary source = real artifacts (Decision #1): sessions, cron, watchdog alerts, backup history — read-only and cached. Treat a Hermes session as the work unit. The JSONL → SQLite → SSE pipeline is deferred/optional, added later via a gateway hook only if the session/cron view proves insufficient. (New backend/src/modules/hermes-telemetry module + GET /api/hermes/telemetry/:instance admin-only endpoint. Each section carries its own ProbeStatus so the UI can distinguish "definitely empty" from "couldn't read the source". 30s TTL cache + in-flight coalescing, mirrors hermes-ops. JSONL → SQLite → SSE explicitly deferred per Decision #1.)
  • Backend endpoints per instance, reading real Hermes state:
    • Sessions + stats (hermes sessions stats --json).
    • Cron jobs (hermes cron list --json).
    • Memory + skills inventory (hermes memory list --json, hermes skills list --json).
    • Watchdog alerts feed (tails ~/.hermes/logs/hermes-health-watchdog.log, severity-bucketed info/warn/critical).
    • Backup history (git -C <repo> log — last 20 commits per backup repo).
  • Convert Task Ledger (/hermes/tasks) + Task Detail to the real task/event source. (Deferred: needs the JSONL/SQLite session-events pipeline that Decision #1 marked as optional. Task Ledger remains seed-data; flip when a real source ships.)
  • Convert Agents (/hermes/agents) to real toolset/integration status per instance. (Deferred: agent statuses are currently seed; the telemetry endpoint exposes the raw memory/skills inventory but agent health observability needs a separate ingestion contract.)
  • Convert History (/hermes/history) to real session/cron/backup trends. (Deferred: depends on real session timeseries.)
  • Products (/hermes/products): repoint at the real service registry (backend/src/modules/services/) + health module (Decision #3); drop the fabricated 50-item mock. Optional manual entries for not-yet-deployed products come later. (Page rewritten: top "Live services" section sources from api.getServices() joined with api.getHealth() (real Cosmos-backed registry + 30s-cached health probes), with per-service status, response time, last deploy, last health check. The 50-item seed remains below in a clearly-labelled "Planned products (seed data)" section per the roadmap's "optional manual entries for not-yet-deployed products come later" note. New E2E mocks for /api/services + /api/health keep the suite deterministic.)
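
To illustrate the per-section ProbeStatus idea from the first bullet, the sketch below wraps each reader so that a failure becomes an explicit status instead of an empty list. The status values (`ok` / `unavailable`), the `TelemetrySection` shape, and both function names are assumptions; the actual contract in the hermes-telemetry module may differ.

```typescript
// Illustrative only: distinguishes "definitely empty" from "couldn't read the source".
type ProbeStatus = 'ok' | 'unavailable'; // assumed values

interface TelemetrySection<T> {
  status: ProbeStatus;
  items: T[];
  detail?: string; // short reason when the probe failed
}

/** Runs one reader (sessions, cron, alerts, ...) and never throws. */
async function readSection<T>(read: () => Promise<T[]>): Promise<TelemetrySection<T>> {
  try {
    return { status: 'ok', items: await read() };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { status: 'unavailable', items: [], detail };
  }
}

/** What a pane might render for a section header. */
function sectionLabel(section: TelemetrySection<unknown>, noun: string): string {
  if (section.status === 'unavailable') return `${noun}: unknown (source unreadable)`;
  return section.items.length === 0 ? `No ${noun}` : `${section.items.length} ${noun}`;
}
```

With this shape, a pane showing "No cron jobs" can be trusted to mean the probe succeeded and found nothing.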

Phase 4 — Bheem/Uma parity so the dashboard shows two equal instances (G7)

This is the biggest operational asymmetry and the reason half the ops-panel warnings are Bheem-only.

VM ops, not codebase work. This phase requires sudo on the Hostinger VM, Uma-owned GitHub credentials, and Telegram bot tokens — none of it is editable in this repo. The full delegation brief is in docs/prompts/phase4-bheem-uma-parity.md. When the brief's Definition-of-Done is met, tick the boxes below and the summary line at the bottom of this file.

  • Stand up a Uma persistent backup repo + uma-hermes-backup.timer mirroring the root design (sanitized hermes_persistent_backup/, secrets and state.db excluded), pushing to umadev0931/uma_hostinger_hermes_vm with a Uma-owned, repo-scoped token (Bheem self-pushes; root no longer pushes Uma's backup — Decision #5).
  • Install a Uma health watchdog (mirror scripts/hermes-health-watchdog.py), silent-on-success, alerting Uma's Telegram.
  • Run the first Uma restore rehearsal into a temporary HERMES_HOME; document in docs/hermes-operations.md / docs/hermes-disaster-recovery.md.
  • Schedule a quarterly Uma restore-drill reminder (parity with root).
  • Confirm these close the corresponding Bheem warnings emitted by getHermesOpsSnapshot() (backup timer active, repo HEAD readable + clean, Google token present).

Phase 5 — Dashboard app hardening (G5)

  • P0: Fix the CI workspace path (${{ gitea.workspace }}) in .gitea/workflows/ci.yml, DEPLOYMENT.md, and scripts/deploy-hotcopy.sh (all three currently point at the non-existent /opt/bytelyst/bytelyst-devops-tools/...).
  • P0: Replace the no-op lint echo with real linting (next lint for web, minimal ESLint for backend); make pnpm lint fail on bad code.
  • P1: Add tests for auth, csrf, deployments/orchestrator, health, and hermes-ops; add pnpm test:coverage gate. (35 new unit tests; v8 coverage thresholds gated on the six tested files in backend/vitest.config.ts (≥85% lines/funcs/stmts, ≥65% branches), wired into Gitea CI as a dedicated step. Today's actuals: ≥95% lines on every gated file. Ratchet up as more modules get tested.)
  • P1: Resolve the SSE TODO — either ship a Fastify-5-compatible log-stream or remove the SSE claim from docs/UI. (Chose remove: dropped fastify-sse-v2 dep, deleted commented-out plugin import + TODO from server.ts and deployments/routes.ts, rewrote the README/DEPLOYMENT.md "Log Streaming" section as "Logs (JSON-polled, no SSE)". Web client already polls /deployments/:id/logs via apiRequest — no UI change needed. If a real-time stream is wanted later, implement via reply.raw and update docs in the same change.)
  • P1: Fix doc drift (web port 3000 vs 3049; endpoint URLs; merge duplicate deployment docs). (DEPLOYMENT.md is now canonical; DEPLOYMENT_GUIDE.md reduced to a redirect stub; deploy.sh updated. Added an explicit "Ports — quick reference" table to DEPLOYMENT.md distinguishing container :3000, Compose host :3049, Traefik production. README and ENDPOINTS.md cross-link to it. Marks REVIEW_ACTIONS #5 resolved.)
  • P1: Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths — blast radius must be written down; consider an allow-list wrapper over the raw socket). (New "Privilege Surface" section in dashboard/DEPLOYMENT.md enumerating every mount, every shell-outing route + commands + auth gate, the blast-radius if an admin token leaks, five known sharp edges, and a P1→P3 mitigation roadmap. Concurrent fix: /code-quality/check was reachable unauthenticated despite shelling out to npm run in a caller-supplied path — requireAdmin added. Allow-list wrapper around docker/bash/npm invocations and projectPath validation are queued as the next P1s; running the container as non-root and replacing the raw docker.sock with a verb-restricted proxy are P2/P3.)
  • P2: Structured backend logging (pino → stdout); wire E2E (hermes.spec.ts) into CI with a started stack. (Two commits: (1) lib/logger.ts exposes a configured pino instance shared between Fastify (via loggerInstance) and any non-request code path, with LOG_LEVEL env knob and built-in redaction for Authorization/Cookie headers + common secret-shaped field names; runtime console.error sites in deployments/orchestrator, system, backup, and vm modules ported over to structured logs. (2) E2E in CI: hermes spec now intercepts /api/hermes/ops with a fixture snapshot so it's deterministic without a live backend; CI workflow runs playwright install --with-deps chromium then pnpm test:e2e (web suite starts its own Next dev via Playwright's webServer config). Verified locally: 6/6 E2E green, 51/51 unit tests green, coverage gate ≥95% lines.)

Phase 6 — Mission Control UX polish (G6)

  • Severity-tag warnings (info/warn/critical) and add a severity filter to the ops panel. (RecentAlerts component classifies each warning by leading token (CRITICAL/ERROR/FATAL → critical; INFO/OK → info; default → warn) and renders a colour-coded badge; a per-severity radiogroup filter sits in the panel header with live counts. UI-only — no backend contract change.)
  • Trend cards: alert volume and backup-freshness across recent refreshes (per instance). (Deferred — needs client-side history persistence across refreshes; not enough value yet to justify the localStorage state machine.)
  • Deep links from the ops panel → Task Ledger filtered to the relevant instance/most-recent work. (Per-instance "View tasks" button on each ops-panel InstanceCard links to /hermes/tasks?instance=<id>. HermesInstanceProvider now hydrates from the ?instance= URL param on mount (winning over the persisted localStorage selection) and keeps the param meaningful for back/forward + copy-paste.)
  • Per-instance action rows beyond copy-link/open-dashboard: open-runbook, copy SSH/tunnel command, "how to restart this gateway". (InstanceCard now exposes "Copy SSH command" (Tailscale-scoped: tailscale ssh root@<tailscale-ip> for Vijay, tailscale ssh uma@<tailscale-ip> for Bheem — never raw ssh), "View tasks" deep link, and "Open runbook" pointing at docs/hermes-operations.md. "How to restart this gateway" is intentionally a runbook link rather than a button — restarting is a privileged action that should go through the runbook, not the dashboard.)
  • Optional dark/light theme toggle if the shell supports it. (Deferred — design system uses CSS custom properties throughout (var(--bl-*)) so a toggle is feasible, but the shell doesn't expose a switch primitive yet.)
  • Unified alerts feed across both instances on the overview. (Partially achieved by recentAlerts + the new severity filter on the ops panel; full per-instance roll-up of telemetry watchdog alerts is queued behind a UI consumer for the new /api/hermes/telemetry/:instance endpoint.)
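
The leading-token classification in the first bullet can be sketched as below. The token sets and the default-to-warn rule come from the notes above; the function names, the case-insensitive match, and the decision to skip leading punctuation such as `[` are assumptions about how the real `RecentAlerts` component parses strings.

```typescript
// Sketch of leading-token severity classification. Parsing details are assumed.
type Severity = 'critical' | 'warn' | 'info';

const CRITICAL_TOKENS = new Set(['CRITICAL', 'ERROR', 'FATAL']);
const INFO_TOKENS = new Set(['INFO', 'OK']);

function classifyAlert(alert: string): Severity {
  // First alphabetic word, ignoring leading punctuation/whitespace (e.g. "[ERROR] ...").
  const token = /^\W*([A-Za-z]+)/.exec(alert)?.[1]?.toUpperCase() ?? '';
  if (CRITICAL_TOKENS.has(token)) return 'critical';
  if (INFO_TOKENS.has(token)) return 'info';
  return 'warn'; // default: most ops alerts are warnings
}

/** Live counts for the per-severity filter in the panel header. */
function countBySeverity(alerts: readonly string[]): Record<Severity, number> {
  const counts: Record<Severity, number> = { critical: 0, warn: 0, info: 0 };
  for (const alert of alerts) counts[classifyAlert(alert)] += 1;
  return counts;
}
```

An alert that begins with a timestamp or other digits matches no token in this sketch and falls into the default `warn` bucket.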

Phase 7 — Security & access (G8)

  • Require authentication on the DevOps dashboard's hermes routes/endpoints (reuse platform-service auth already used elsewhere). (Both /api/hermes/ops and the new /api/hermes/telemetry/:instance now gate on requireAdmin. Privilege-surface table in dashboard/DEPLOYMENT.md updated to match. The previous "read-only ops snapshot, no auth" carve-out is gone — all Hermes routes are admin-only.)
  • Decide and document security.redact_secrets and privacy.redact_pii for gateway sessions (per instance). (Deferred — needs a founder decision on PII handling for session content; not a code-only change.)
  • Finish the GitHub/Gitea least-privilege token audit (root currently pushes both repos) and rotate any migrated/exposed credentials — completed naturally by Decision #5 (Bheem self-pushes with its own scoped token). (Resolves naturally when Phase 4 ships — see the Phase 4 delegation brief.)
  • Keep all hermes data private-only; never expose the hermes-ops snapshot or task data on a public route. (Verified: no Caddy/public route added; the dashboard is bound to 127.0.0.1 and reached via Tailscale or SSH tunnel only — see dashboard/DEPLOYMENT.md "Ports — quick reference" + "Privilege Surface" sections. With this commit's requireAdmin change, even an attacker with internal network access still needs a valid admin JWT to read the ops snapshot.)

Phase 8 — Notifications & Telegram loop (G9)

  • Push new dashboard-detected warnings to the correct Telegram (Vijay → root chat, Bheem → Uma chat), reusing the watchdog delivery path; silent on healthy.
  • Validate the Telegram approval-prompt flow and media/file delivery end-to-end (the two unchecked v1 items).
  • Preserve the numbered-emoji progress convention (1, 2, …) for completion updates.
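
One way to read the first bullet is as a pure planning step: compare the current warnings with those already sent, and emit one message per instance only when something is new. The sketch below assumes that shape. The `InstanceWarnings` input, the `root` / `uma` chat labels, the message text format, and the `planNotifications` name are all hypothetical; actual delivery would reuse the watchdog path and is not shown.

```typescript
// Hypothetical planner: decides what to send, leaves delivery to the watchdog path.
type HermesInstanceId = 'vijay' | 'bheem';
type ChatTarget = 'root' | 'uma';

interface InstanceWarnings {
  instanceId: HermesInstanceId;
  warnings: string[];
}

interface OutgoingMessage {
  chat: ChatTarget;
  text: string;
}

// Vijay warnings go to the root chat, Bheem warnings to the Uma chat.
const CHAT_FOR: Record<HermesInstanceId, ChatTarget> = { vijay: 'root', bheem: 'uma' };

function warningKey(instanceId: HermesInstanceId, warning: string): string {
  return `${instanceId}:${warning}`;
}

function planNotifications(
  alreadySent: ReadonlySet<string>,
  current: readonly InstanceWarnings[],
): OutgoingMessage[] {
  const messages: OutgoingMessage[] = [];
  for (const { instanceId, warnings } of current) {
    const fresh = warnings.filter((w) => !alreadySent.has(warningKey(instanceId, w)));
    if (fresh.length === 0) continue; // silent on healthy, and silent on repeats
    messages.push({ chat: CHAT_FOR[instanceId], text: fresh.join('\n') });
  }
  return messages;
}
```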

Data Model Additions

// web/src/lib/hermes (and mirrored in backend contracts)
export type HermesInstanceId = 'vijay' | 'bheem';

export interface HermesInstanceRef {
  id: HermesInstanceId;
  label: string;        // "Vijay / root", "Bheem / Uma"
  user: string;         // "root" | "uma"
  hermesHome: string;
}

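// add `instanceId: HermesInstanceId` to:
//   HermesTask, HermesProduct, HermesEvent, HermesRun, HermesOverview
// add `instanceId: HermesInstanceId | 'all'` to:
//   HermesAgentStatus ('all' marks cross-cutting agents such as Hermes Core / GitHub link)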

Acceptance Criteria

This roadmap is complete when:

  • The overview, ledger, agents, and history panes render real data for both Vijay and Bheem, filterable by instance; only panes without a real source remain (clearly labeled) seed data.
  • hermes-ops is cached, uses robust Uma user-systemd checks, distinguishes unknown vs down, and has unit tests.
  • Bheem has a persistent backup repo + timer, a watchdog, and one completed restore rehearsal — and the dashboard shows 2/2 healthy with zero standing Bheem warnings.
  • CI is green on the correct path, lint is real, and coverage includes auth/csrf/orchestrator/health/hermes-ops.
  • Hermes routes require auth and remain private-only; redact policies are decided and documented.
  • Dashboard warnings reach the correct Telegram chat per instance.
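
For the "unknown vs down" criterion, the sketch below shows one possible mapping from a `systemctl --user is-active` probe result to a three-state health value. It relies on the documented behaviour that `is-active` prints the unit state and exits 0 only when the unit is active. The `ProbeOutcome` shape and the function name are hypothetical, and treating transitional states such as `activating` as `unknown` is an assumption.

```typescript
// Hypothetical interpreter for a `systemctl [--user] is-active <unit>` probe.
interface ProbeOutcome {
  stdout: string;          // state printed by systemctl, e.g. "active", "inactive", "failed"
  exitCode: number | null; // null when the process never exited normally
  timedOut: boolean;
}

type UnitHealth = 'up' | 'down' | 'unknown';

function interpretIsActive(outcome: ProbeOutcome): UnitHealth {
  // The probe itself failed: report "unknown", never a false "down".
  if (outcome.timedOut || outcome.exitCode === null) return 'unknown';

  const state = outcome.stdout.trim();
  if (outcome.exitCode === 0 && state === 'active') return 'up';
  if (state === 'inactive' || state === 'failed') return 'down';

  // Empty output (for example, no user bus reachable) or a transitional state.
  return 'unknown';
}
```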

Implementation Status Checklist

Update only with evidence (source review, tests, build output, or browser/VM verification).

  • Phase 0 — Guardrails reconfirmed
  • Phase 1 — hermes-ops hardened + tested
  • Phase 2 — Instance dimension + switcher
  • Phase 3 — Real telemetry ingestion + Products pane converted (Task Ledger / Agents / History deferred — depend on JSONL session pipeline, see Phase 3 notes)
  • Phase 4 — Bheem/Uma parity (backup, watchdog, restore drill)
  • Phase 5 — App/CI hardening (P0/P1/P2 done; P2 follow-ups in DEPLOYMENT.md mitigation roadmap remain)
  • Phase 6 — UX polish (severity tags + deep links + per-instance actions; trend cards + theme toggle deferred)
  • Phase 7 — Security & access (auth on hermes routes + privacy stance documented; redact_secrets/redact_pii decision deferred)
  • Phase 8 — Notifications & Telegram

Decisions (resolved 2026-05-30)

  1. Task data source — derive from real artifacts now; defer the JSONL pipeline. Hermes' real unit of work is the session (+ cron jobs), and there's no evidence the agent emits a task-level JSONL ledger today. Build the ledger/activity views from what already exists — hermes sessions (+ stats), hermes cron list (+ last-run), watchdog alerts, and backup git history. Add a JSONL session/event pipeline → SQLite later and only if the session/cron view proves insufficient (via a gateway hook that appends records). Do not fabricate a task store.
  2. Reading Bheem state — self-reporting ops exporter per instance. Each instance runs a tiny read-only exporter (Bheem as a uma user-systemd timer, Vijay symmetrically) that writes a sanitized JSON snapshot (booleans, counts, timestamps, short HEADs — no secrets) to a known path; the unified backend just reads and aggregates the two files. No cross-user command execution or reaching into /home/uma/.hermes. Interim stopgap until the exporter ships: replace the brittle ps/existsSync Uma checks with runuser -u uma -- systemctl --user is-active/is-enabled.
  3. Products — repoint at the real service registry; drop the fabricated mock. The dashboard already has a live service registry (backend/src/modules/services/, with health). Back the Products pane with that real data instead of a 50-item fiction; allow optional manual entries later for not-yet-deployed products. Relabel clearly until the mapping lands.
  4. Auth — reuse platform-service JWT, defense-in-depth. Put the hermes routes behind the same platform-service auth (backend/src/lib/auth.ts) the rest of the dashboard uses; keep the network private (Tailscale/loopback) as a second layer. No separate basic-auth gate (that's only the never-used "if forced public" path).
  5. Bheem backup — same repo + Drive, but Uma-owned least-privilege token; Bheem self-pushes. Keep umadev0931/uma_hostinger_hermes_vm and Bheem Drive, but give Bheem its own repo-scoped credential so it backs itself up rather than depending on root's broad credential. Root stops pushing Uma's backup; this also closes the standing GitHub least-privilege audit item.
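
Decision 2's "read and aggregate two files" step might be sketched as follows. Everything specific here is an assumption: the `ExporterSnapshot` field names, the five-minute staleness window, and the function names are illustrative, chosen only to match the "booleans, counts, timestamps, short HEADs" description. The backend would pass in the raw file contents (or `null` if the file is missing), so this part stays free of filesystem access.

```typescript
// Hypothetical exporter snapshot contract and a defensive reader for it.
interface ExporterSnapshot {
  instanceId: 'vijay' | 'bheem';
  generatedAt: string;        // ISO timestamp written by the exporter
  gatewayActive: boolean;
  backupTimerActive: boolean;
  repoHead: string;           // short git HEAD, no secrets
  activeSessions: number;
}

type InstanceReading =
  | { status: 'ok'; snapshot: ExporterSnapshot }
  | { status: 'unknown'; reason: string };

function isExporterSnapshot(value: unknown): value is ExporterSnapshot {
  if (typeof value !== 'object' || value === null) return false;
  const s = value as Record<string, unknown>;
  return (
    (s.instanceId === 'vijay' || s.instanceId === 'bheem') &&
    typeof s.generatedAt === 'string' &&
    !Number.isNaN(Date.parse(s.generatedAt)) &&
    typeof s.gatewayActive === 'boolean' &&
    typeof s.backupTimerActive === 'boolean' &&
    typeof s.repoHead === 'string' &&
    typeof s.activeSessions === 'number'
  );
}

function readExporterSnapshot(
  raw: string | null,
  nowMs: number,
  maxAgeMs = 5 * 60_000, // assumed staleness window
): InstanceReading {
  if (raw === null) return { status: 'unknown', reason: 'snapshot file missing' };
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { status: 'unknown', reason: 'snapshot is not valid JSON' };
  }
  if (!isExporterSnapshot(parsed)) return { status: 'unknown', reason: 'snapshot failed shape check' };
  if (nowMs - Date.parse(parsed.generatedAt) > maxAgeMs) {
    return { status: 'unknown', reason: 'snapshot is stale' };
  }
  return { status: 'ok', snapshot: parsed };
}
```

A missing, malformed, or stale file yields `unknown` rather than a healthy or down reading, so a stopped exporter shows up as a gap rather than as false status.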

Suggested Execution Order

  1. Phase 5 P0 (CI path + lint) — unblocks everything.
  2. Phase 1 (harden hermes-ops) — the foundation the real UI sits on.
  3. Phase 2 (instance dimension) + Phase 4 (Bheem parity) in parallel — make "two instances" first-class in both data and ops.
  4. Phase 3 (real telemetry, pane by pane).
  5. Phase 7 (auth) before any wider access; Phase 8 (Telegram) and Phase 6 (polish) last.

Each item is sized to land as a single PR with incremental commits to main.