Closes the Phase 5 P2 checkbox (second half — first half: pino logging
in 1e64d75). Phase 5 is now fully green.
Two changes:
1. `web/e2e/hermes.spec.ts` now intercepts `/api/hermes/ops` with a
fixture snapshot. The backend's hermes-ops endpoint shells out to
`systemctl` / `git` / `ps` / `du` on the live VM and is therefore
neither available nor deterministic in CI. Mocking it lets the
suite run against the web stack alone (no backend, no live VM).
Fixture shape mirrors the Zod schema in
`backend/src/modules/hermes-ops/types.ts`.
2. `.gitea/workflows/ci.yml` re-enables the previously-commented-out
E2E step. Adds a preceding `playwright install --with-deps
chromium` step so the runner pulls the browser fresh per run.
The web suite starts its own Next dev server via Playwright's
`webServer` config (`pnpm exec next dev -p 3200`), so we do NOT
start the backend in CI — every backend route used by the suite
is mocked via `page.route` (auth, csrf, services, deployments,
health/cache, seed, hermes-ops).
Verified locally: `pnpm exec playwright test` → 6 passed in 19.5s
(2 hermes specs + 4 dashboard/login specs across desktop + mobile).
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Hermes Mission Control v2 — Two-Instance Dashboard Roadmap
Date: 2026-05-30
Owner: ByteLyst / S (saravanakumardb)
Repo: learning_ai_devops_tools (GitHub remote: bytelyst-devops-tools)
Dashboard: dashboard/ — Next.js 16 web (web/, port 3000 / container 3049) + Fastify 5 backend (backend/, port 4004)
What This Roadmap Is
The two existing roadmaps are effectively complete for their original scope:
- `docs/hermes_dashboard_roadmap.md`: built the 7-pane Hermes Mission Control UI. All checklist items are checked, but every pane except the live ops panel renders mock/seed data from `web/src/lib/hermes`.
- `docs/hermes-setup-upgrade-roadmap.md`: stood up and hardened the two Hermes instances on the VM (~68% checked; the open items are mostly Uma parity, credentials, and policy decisions).
This v2 roadmap supersedes the open dashboard-related items in both and adds the missing theme: power the unified dashboard with real data from BOTH Hermes instances — Vijay/root and Bheem/Uma — and close every known gap (mock data, backend hardening, two-instance parity, app/CI hygiene, UX polish, security, and notifications).
It does not re-do anything already verified in v1. It builds on the one piece that is already real and two-instance-aware: the backend hermes-ops module.
The Two Instances (authoritative topology)
Source of truth is dashboard/backend/src/modules/hermes-ops/repository.ts.
| Codename | OS user | HERMES_HOME | Gateway service | Private dashboard | Backup timer | Backup repo (local → GitHub) | Drive folder |
|---|---|---|---|---|---|---|---|
| Vijay | root | `/root/.hermes` | `hermes-gateway.service` (system) | `hermes-root-dashboard.service` → `:9119` | `hermes-root-backup.timer` | `/root/repos/bytelyst_hostinger_hermes_vm` → `saravanakumardb/bytelyst_hostinger_hermes_vm` | Vijay Drive |
| Bheem | uma | `/home/uma/.hermes` | `uma-hermes-gateway.service` (uma user systemd) | `uma-hermes-dashboard.service` → `:9120` | `uma-hermes-backup.timer` | `/home/uma/repos/uma_hostinger_hermes_vm` → `umadev0931/uma_hostinger_hermes_vm` | Bheem Drive |
- Both reachable privately only over Tailscale `100.87.53.10` (`:9119` Vijay, `:9120` Bheem). No public Caddy route. This is a hard guardrail.
- Plus a root-level `hermes-emergency-drive-upload.timer` that pushes encrypted bundles to each instance's Google Drive folder.
Three dashboard surfaces exist today
1. Native per-instance Hermes dashboards: `:9119` (Vijay) and `:9120` (Bheem), one per user, over Tailscale. Operationally scoped, separate from this codebase.
2. ByteLyst Mission Control: the `/hermes/*` suite in this repo's DevOps dashboard (7 panes). Intended to be the unified pane of glass over both instances.
3. The live `hermes-ops` panel, embedded in the Mission Control overview (`web/src/components/hermes-ops-panel.tsx`), already renders real, both-instance status: gateways, private dashboards, backup timers, repo HEAD/cleanliness, Google token, restore-payload counts, cron timers, emergency drive, Tailscale IP, active session count, and warnings.
Decision baked into this roadmap: invest in surface #2/#3 — make unified Mission Control the real two-instance command center — rather than expanding the native per-instance dashboards. The hermes-ops module is the seed; everything below extends it.
Goal / Target State
A single private dashboard where, for both Vijay and Bheem, S can see at a glance:
- live instance health (gateway, dashboard, cron, backup freshness, disk/mem, Google auth) — real, cached, robust
- everything each Hermes is doing / did / failed / is blocked on — from real session/cron/task telemetry, filterable by instance
- backup & disaster-recovery posture at parity across both instances
- what needs founder attention, pushed to the right Telegram chat
…with the whole thing private-only, authenticated, tested, and in CI.
Gap Inventory (consolidated)
| ID | Gap | Source | Severity |
|---|---|---|---|
| G1 | 6 of 7 Mission Control panes are mock (`web/src/lib/hermes`) | v1 roadmap / README "read-only mock" | High |
| G2 | Tasks/Products/History/Agents have no instance dimension (Vijay vs Bheem) | this review | High |
| G3 | `hermes-ops` backend not hardened: no cache (~20 `execFile` per 60s poll), brittle Uma checks (`ps` string-match + hardcoded `existsSync`), errors swallowed to `null`, no tests | REVIEW_ACTIONS P1 #3 | High |
| G4 | No real telemetry ingestion (sessions, cron, memory, skills, alerts, backup history, task events) | v1 roadmap "real telemetry plan" | High |
| G5 | App/CI hygiene: CI path wrong (P0), lint is a no-op echo (P0), thin tests (P1), SSE disabled (P1), doc drift 3000 vs 3049 (P1), privileged docker socket/host mounts undocumented (P1), no prod logging (P2), E2E not wired (P2) | REVIEW_ACTIONS | P0–P2 |
| G6 | Mission Control polish: warning severity filters, trend cards, ops→ledger deep links, per-instance action rows, theme toggle | v1 "Next Improvements" | Med |
| G7 | Bheem/Uma parity: no persistent backup repo + timer equivalent to root, no watchdog cron, restore never tested, no quarterly drill | setup roadmap (lines 20, 146, 172, 185, 432, 447, 461) | High |
| G8 | Security/access: devops dashboard hermes routes need auth; `security.redact_secrets` / `privacy.redact_pii` undecided; GitHub/Gitea least-privilege audit + rotation pending | setup roadmap Phase 11 | High |
| G9 | Notifications: dashboard warnings not pushed to Telegram; approval-prompt flow + media/file delivery UX unvalidated | setup roadmap Phase 6 | Med |
Phase 0 — Guardrails (must hold throughout)
- No public Caddy route or public listener for any Hermes dashboard, the `hermes-ops` API, or the DevOps dashboard's hermes data. Private-only via Tailscale / SSH tunnel / loopback.
- Keep Hermes command approvals at `manual` or `smart`; no gateway approval bypass.
- No raw secrets, tokens, OAuth files, `state.db`, or SQLite WAL/SHM in any git backup or in this repo.
- Re-run the Caddy/port review (`docs/hermes-operations.md`) before adding any route or hostname.
Phase 1 — Make the unified backend authoritative and hardened (G3)
The hermes-ops snapshot becomes the single source of truth for live status. Before building UI on it, harden it.
- Add a short-TTL cache (mirror the health module's 30s cache) so the 60s panel poll doesn't fan out ~20 `systemctl`/`git`/`ps`/`du` subprocesses every refresh; serve cached snapshot with `generatedAt`.
- Replace brittle Bheem/Uma checks in `repository.ts` (`runuser` + `systemctl --user`, with a `ps`/`existsSync` fallback so a failed probe degrades to the legacy check, not a false "down"):
  - `isUmaGatewayActive()` (currently `ps -eo` string match) → `runuser -u uma -- systemctl --user is-active uma-hermes-gateway.service` (or `--machine=uma@.host`).
  - `isUmaGatewayEnabled()` (currently hardcoded `existsSync` of a wants-symlink) → `systemctl --user is-enabled` via the same path.
- Stop swallowing every failure to `null` indiscriminately: distinguish "unit inactive" from "probe failed/timed out" and surface per-field status so the UI can show unknown vs down.
- Add Zod validation + a stable typed contract for `HermesOpsSnapshot` on the route.
- Add unit tests for the `hermes-ops` repository (mock `execFile`/`fs`); this closes the REVIEW_ACTIONS "only `services` has tests" gap for this module.
- Read Bheem/Uma state via a self-reporting ops exporter (Decision #2): a read-only `uma` user-systemd timer writes a sanitized JSON snapshot to a known path; the root backend reads + aggregates it (Vijay gets a symmetric exporter). Interim stopgap until it ships: `runuser -u uma -- systemctl --user is-active`/`is-enabled` instead of the `ps`/`existsSync` checks.
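
Two of the hardening steps above (tri-state probe results and a short-TTL snapshot cache) can be sketched in a few lines. This is a minimal illustration, not the module's actual code: `ProbeResult`, `classifyIsActive`, and `createTtlCache` are hypothetical names, and treating transitional states such as `activating` as "unknown" is an assumption.

```typescript
// Sketch only: every name here is hypothetical, not the repo's real API.

export type ProbeStatus = 'up' | 'down' | 'unknown';

export interface ProbeResult {
  stdout: string;       // raw stdout of `systemctl is-active <unit>`
  timedOut: boolean;    // the subprocess exceeded its timeout
  spawnFailed: boolean; // the command could not be started at all
}

// Map a probe outcome to a tri-state so "probe failed" never reads as "down".
export function classifyIsActive(result: ProbeResult): ProbeStatus {
  if (result.timedOut || result.spawnFailed) return 'unknown';
  const state = result.stdout.trim();
  if (state === 'active') return 'up';
  if (state === 'inactive' || state === 'failed') return 'down';
  return 'unknown'; // transitional states or unexpected output (assumption)
}

export interface TtlCache<T> {
  get(): T;
  invalidate(): void;
}

// Cache whatever `load` returns for `ttlMs`. If `load` returns a Promise, the
// promise itself is cached, so concurrent polls share one in-flight probe run.
export function createTtlCache<T>(
  load: () => T,
  ttlMs: number,
  now: () => number = Date.now, // injectable clock, useful in unit tests
): TtlCache<T> {
  let entry: { value: T; expiresAt: number } | undefined;
  return {
    get(): T {
      const t = now();
      if (entry === undefined || t >= entry.expiresAt) {
        entry = { value: load(), expiresAt: t + ttlMs };
      }
      return entry.value;
    },
    invalidate(): void {
      entry = undefined;
    },
  };
}
```

With a 30s TTL, a 60s panel poll would trigger at most one subprocess fan-out per refresh regardless of how many browser tabs are open. One caveat of caching the promise: a rejected promise stays cached until the TTL expires, so a caller would likely want to call `invalidate()` on failure.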
Phase 2 — Instance dimension across Mission Control (G2)
- Add `instanceId: 'vijay' | 'bheem'` to the core types in `web/src/lib/hermes` (`HermesTask`, `HermesProduct`, `HermesEvent`, `HermesRun`, agent/overview models) and to the backend contracts.
- Add a global instance switcher in `HermesShell` (`All` / `Vijay (root)` / `Bheem (uma)`) with persisted selection; thread it through every pane.
- Overview: show per-instance cards and a combined roll-up (extend the existing "Healthy instances 2/2" pattern from the ops panel to the whole overview).
- Ledger / Products / History / Agents: filter and badge by instance.
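
The switcher and per-pane filtering could reduce to two small pure helpers shared by every pane. The sketch below assumes a selection type of `'all' | 'vijay' | 'bheem'`; `InstanceSelection`, `filterByInstance`, `countByInstance`, and the sample rows are hypothetical and only illustrate the idea.

```typescript
export type HermesInstanceId = 'vijay' | 'bheem';
export type InstanceSelection = 'all' | HermesInstanceId;

// Anything that carries the new instance dimension.
interface InstanceScoped {
  instanceId: HermesInstanceId;
}

// Apply the global switcher selection to any instance-scoped list.
export function filterByInstance<T extends InstanceScoped>(
  items: readonly T[],
  selection: InstanceSelection,
): T[] {
  return selection === 'all'
    ? [...items]
    : items.filter((item) => item.instanceId === selection);
}

// Per-instance counts for badges and the combined overview roll-up.
export function countByInstance(
  items: readonly InstanceScoped[],
): Record<HermesInstanceId, number> {
  const counts: Record<HermesInstanceId, number> = { vijay: 0, bheem: 0 };
  for (const item of items) counts[item.instanceId] += 1;
  return counts;
}

// Hypothetical sample rows, only to exercise the helpers.
const sampleTasks = [
  { id: 't1', instanceId: 'vijay' as const },
  { id: 't2', instanceId: 'bheem' as const },
  { id: 't3', instanceId: 'vijay' as const },
];
```

Keeping the filter outside the components means the persisted selection only has to be read once in the shell and passed down as a plain value.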
Phase 3 — Real per-instance telemetry, replacing mock pane by pane (G1, G4)
Define the ingestion contract first, then convert panes. Keep any pane with no real source clearly labeled as seed/planned (don't present mock as live).
- Primary source = real artifacts (Decision #1): sessions, cron, watchdog alerts, backup history — read-only and cached. Treat a Hermes session as the work unit. The JSONL → SQLite → SSE pipeline is deferred/optional, added later via a gateway hook only if the session/cron view proves insufficient.
- Backend endpoints per instance, reading real Hermes state:
  - Sessions + stats (`hermes sessions stats`; baseline today: Vijay 59 sessions/5225 msgs, Bheem 18/635).
  - Cron jobs (`hermes cron list`) including backup + watchdog timers.
  - Memory + skills inventory.
  - Watchdog alerts feed (from `hermes-health-watchdog.py` output / logs).
  - Backup history (git log of each backup repo: HEAD, last-commit age, freshness).
- Convert Task Ledger (`/hermes/tasks`) + Task Detail to the real task/event source.
- Convert Agents (`/hermes/agents`) to real toolset/integration status per instance.
- Convert History (`/hermes/history`) to real session/cron/backup trends.
- Products (`/hermes/products`): repoint at the real service registry (`backend/src/modules/services/`) + health module (Decision #3); drop the fabricated 50-item mock. Optional manual entries for not-yet-deployed products come later.
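
As one concrete case of reading a real artifact, the backup-history item (HEAD, last-commit age, freshness) could be derived from a single line of `git log -1 --format='%h %ct'` output, where `%h` is the abbreviated hash and `%ct` the committer time in Unix seconds. The sketch below is an assumption-laden illustration: `parseHeadLine`, `backupFreshness`, and the 26-hour default threshold are hypothetical, not values from the roadmap.

```typescript
export type Freshness = 'fresh' | 'stale' | 'unknown';

export interface BackupHead {
  shortSha: string;
  committedAt: Date;
}

// Parse one line of `git log -1 --format='%h %ct'`; returns null on anything else.
export function parseHeadLine(line: string): BackupHead | null {
  const match = /^([0-9a-f]{4,40}) (\d+)$/.exec(line.trim());
  if (match === null) return null;
  const [, shortSha = '', seconds = ''] = match;
  return { shortSha, committedAt: new Date(Number(seconds) * 1000) };
}

// Classify freshness by last-commit age. An unreadable HEAD is "unknown",
// not "stale", so a failed probe is not reported as an old backup.
export function backupFreshness(
  head: BackupHead | null,
  now: Date,
  maxAgeHours = 26, // assumed: a daily timer plus some slack
): Freshness {
  if (head === null) return 'unknown';
  const ageHours = (now.getTime() - head.committedAt.getTime()) / 3_600_000;
  return ageHours <= maxAgeHours ? 'fresh' : 'stale';
}
```

Passing `now` in explicitly keeps the function pure, so trend cards and unit tests can evaluate freshness at arbitrary points in time.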
Phase 4 — Bheem/Uma parity so the dashboard shows two equal instances (G7)
This is the biggest operational asymmetry and the reason half the ops-panel warnings are Bheem-only.
- Stand up a Uma persistent backup repo + `uma-hermes-backup.timer` mirroring the root design (sanitized `hermes_persistent_backup/`, secrets and `state.db` excluded), pushing to `umadev0931/uma_hostinger_hermes_vm` with a Uma-owned, repo-scoped token (Bheem self-pushes; root no longer pushes Uma's backup; see Decision #5).
- Install a Uma health watchdog (mirror `scripts/hermes-health-watchdog.py`), silent-on-success, alerting Uma's Telegram.
- Run the first Uma restore rehearsal into a temporary `HERMES_HOME`; document in `docs/hermes-operations.md` / `docs/hermes-disaster-recovery.md`.
- Schedule a quarterly Uma restore-drill reminder (parity with root).
- Confirm these close the corresponding Bheem warnings emitted by `getHermesOpsSnapshot()` (backup timer active, repo HEAD readable + clean, Google token present).
Phase 5 — Dashboard app hardening (G5)
- P0: Fix the CI workspace path (`${{ gitea.workspace }}`) in `.gitea/workflows/ci.yml`, `DEPLOYMENT.md`, `scripts/deploy-hotcopy.sh` (these currently point at the non-existent `/opt/bytelyst/bytelyst-devops-tools/...`).
- P0: Replace the no-op `lint` echo with real linting (`next lint` for web, minimal ESLint for backend); make `pnpm lint` fail on bad code.
- P1: Add tests for `auth`, `csrf`, `deployments/orchestrator`, `health`, and `hermes-ops`; add `pnpm test:coverage` gate. (35 new unit tests; v8 coverage thresholds gated on the six tested files in `backend/vitest.config.ts` (≥85% lines/funcs/stmts, ≥65% branches), wired into Gitea CI as a dedicated step. Today's actuals: ≥95% lines on every gated file. Ratchet up as more modules get tested.)
- P1: Resolve the SSE TODO: either ship a Fastify-5-compatible log-stream or remove the SSE claim from docs/UI. (Chose remove: dropped `fastify-sse-v2` dep, deleted commented-out plugin import + TODO from `server.ts` and `deployments/routes.ts`, rewrote the README/DEPLOYMENT.md "Log Streaming" section as "Logs (JSON-polled, no SSE)". Web client already polls `/deployments/:id/logs` via `apiRequest`, so no UI change is needed. If a real-time stream is wanted later, implement via `reply.raw` and update docs in the same change.)
- P1: Fix doc drift (web port 3000 vs 3049; endpoint URLs; merge duplicate deployment docs). (`DEPLOYMENT.md` is now canonical; `DEPLOYMENT_GUIDE.md` reduced to a redirect stub; `deploy.sh` updated. Added an explicit "Ports — quick reference" table to `DEPLOYMENT.md` distinguishing container `:3000`, Compose host `:3049`, and Traefik production. README and ENDPOINTS.md cross-link to it. Marks REVIEW_ACTIONS #5 resolved.)
- P1: Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths, so the blast radius must be written down; consider an allow-list wrapper over the raw socket). (New "Privilege Surface" section in `dashboard/DEPLOYMENT.md` enumerating every mount, every route that shells out with its commands and auth gate, the blast radius if an admin token leaks, five known sharp edges, and a P1→P3 mitigation roadmap. Concurrent fix: `/code-quality/check` was reachable unauthenticated despite shelling out to `npm run` in a caller-supplied path; `requireAdmin` added. Allow-list wrapper around `docker`/`bash`/`npm` invocations and `projectPath` validation are queued as the next P1s; running the container as non-root and replacing the raw `docker.sock` with a verb-restricted proxy are P2/P3.)
- P2: Structured backend logging (pino → stdout); wire E2E (`hermes.spec.ts`) into CI with a started stack. (Two commits: (1) `lib/logger.ts` exposes a configured pino instance shared between Fastify (via `loggerInstance`) and any non-request code path, with `LOG_LEVEL` env knob and built-in redaction for Authorization/Cookie headers + common secret-shaped field names; runtime `console.error` sites in deployments/orchestrator, system, backup, and vm modules ported over to structured logs. (2) E2E in CI: hermes spec now intercepts `/api/hermes/ops` with a fixture snapshot so it's deterministic without a live backend; CI workflow runs `playwright install --with-deps chromium` then `pnpm test:e2e` (web suite starts its own Next dev via Playwright's `webServer` config). Verified locally: 6/6 E2E green, 51/51 unit tests green, coverage gate ≥95% lines.)
Phase 6 — Mission Control UX polish (G6)
- Severity-tag warnings (info/warn/critical) and add a severity filter to the ops panel.
- Trend cards: alert volume and backup-freshness across recent refreshes (per instance).
- Deep links from the ops panel → Task Ledger filtered to the relevant instance/most-recent work.
- Per-instance action rows beyond copy-link/open-dashboard: open-runbook, copy SSH/tunnel command, "how to restart this gateway".
- Optional dark/light theme toggle if the shell supports it.
- Unified alerts feed across both instances on the overview.
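
The severity tagging and filter in the first item could be modelled with a small ranked enum. This is a sketch under assumptions: the `OpsWarning` shape and the helper names are hypothetical, and only the three severity levels come from the list above.

```typescript
export type Severity = 'info' | 'warn' | 'critical';

export interface OpsWarning {
  instanceId: 'vijay' | 'bheem';
  severity: Severity;
  message: string;
}

// Numeric rank so severities can be compared and sorted.
const SEVERITY_RANK: Record<Severity, number> = { info: 0, warn: 1, critical: 2 };

// Keep only warnings at or above the selected minimum severity.
export function atLeast(warnings: readonly OpsWarning[], min: Severity): OpsWarning[] {
  return warnings.filter((w) => SEVERITY_RANK[w.severity] >= SEVERITY_RANK[min]);
}

// Copy and sort so the most severe warnings are listed first.
export function mostSevereFirst(warnings: readonly OpsWarning[]): OpsWarning[] {
  return [...warnings].sort(
    (a, b) => SEVERITY_RANK[b.severity] - SEVERITY_RANK[a.severity],
  );
}
```

The same helpers would serve both the ops-panel filter and a unified alerts feed, since each warning already carries its `instanceId`.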
Phase 7 — Security & access (G8)
- Require authentication on the DevOps dashboard's hermes routes/endpoints (reuse platform-service auth already used elsewhere).
- Decide and document `security.redact_secrets` and `privacy.redact_pii` for gateway sessions (per instance).
- Finish the GitHub/Gitea least-privilege token audit (root currently pushes both repos) and rotate any migrated/exposed credentials; this is completed naturally by Decision #5 (Bheem self-pushes with its own scoped token).
- Keep all hermes data private-only; never expose the `hermes-ops` snapshot or task data on a public route.
Phase 8 — Notifications & Telegram loop (G9)
- Push new dashboard-detected warnings to the correct Telegram (Vijay → root chat, Bheem → Uma chat), reusing the watchdog delivery path; silent on healthy.
- Validate the Telegram approval-prompt flow and media/file delivery end-to-end (the two unchecked v1 items).
- Preserve the numbered-emoji progress convention (`1️⃣`, `2️⃣`, …) for completion updates.
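
The per-instance routing and "silent on healthy" behaviour can be planned as a pure function before anything is sent. In the sketch below, `planNotifications`, the `code` field used for de-duplication, and the chat identifiers are all hypothetical; actual delivery would go through the existing watchdog path, which is not shown.

```typescript
export type HermesInstanceId = 'vijay' | 'bheem';

export interface OpsWarning {
  instanceId: HermesInstanceId;
  code: string; // assumed stable identifier, used to detect "new" warnings
  message: string;
}

export interface OutboundMessage {
  chatId: string;
  text: string;
}

// Decide which warnings are new and which chat each one belongs to.
// Returns no messages when there is nothing new, which keeps healthy runs silent.
export function planNotifications(
  warnings: readonly OpsWarning[],
  alreadySent: ReadonlySet<string>,
  chatByInstance: Record<HermesInstanceId, string>,
): { messages: OutboundMessage[]; sentKeys: string[] } {
  const seen = new Set<string>(alreadySent);
  const messages: OutboundMessage[] = [];
  const sentKeys: string[] = [];
  for (const warning of warnings) {
    const key = `${warning.instanceId}:${warning.code}`;
    if (seen.has(key)) continue; // already pushed, or duplicated in this batch
    seen.add(key);
    messages.push({
      chatId: chatByInstance[warning.instanceId],
      text: `[${warning.instanceId}] ${warning.message}`,
    });
    sentKeys.push(key);
  }
  return { messages, sentKeys };
}
```

The caller would persist `sentKeys` after a successful send, so a warning that stays active across refreshes is pushed once rather than on every poll.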
Data Model Additions

    // web/src/lib/hermes (and mirrored in backend contracts)
    export type HermesInstanceId = 'vijay' | 'bheem';

    export interface HermesInstanceRef {
      id: HermesInstanceId;
      label: string;      // "Vijay / root", "Bheem / Uma"
      user: string;       // "root" | "uma"
      hermesHome: string;
    }

    // add `instanceId: HermesInstanceId` to:
    // HermesTask, HermesProduct, HermesEvent, HermesRun, HermesAgentStatus, HermesOverview

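
A small registry plus a type guard might accompany these types, so a persisted switcher value or query parameter can be validated before use. The labels, users, and `HERMES_HOME` paths below are taken from the topology table; `HERMES_INSTANCES` and `isHermesInstanceId` are hypothetical names.

```typescript
export type HermesInstanceId = 'vijay' | 'bheem';

export interface HermesInstanceRef {
  id: HermesInstanceId;
  label: string;
  user: string;
  hermesHome: string;
}

// Static registry of the two instances (values from the topology table).
export const HERMES_INSTANCES: Record<HermesInstanceId, HermesInstanceRef> = {
  vijay: { id: 'vijay', label: 'Vijay / root', user: 'root', hermesHome: '/root/.hermes' },
  bheem: { id: 'bheem', label: 'Bheem / Uma', user: 'uma', hermesHome: '/home/uma/.hermes' },
};

// Narrow untrusted input (localStorage, query string) to a known instance id.
export function isHermesInstanceId(value: unknown): value is HermesInstanceId {
  return value === 'vijay' || value === 'bheem';
}
```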
Acceptance Criteria
This roadmap is complete when:
- The overview, ledger, agents, and history panes render real data for both Vijay and Bheem, filterable by instance; only panes without a real source remain (clearly labeled) seed data.
- `hermes-ops` is cached, uses robust Uma user-systemd checks, distinguishes unknown vs down, and has unit tests.
- Bheem has a persistent backup repo + timer, a watchdog, and one completed restore rehearsal, and the dashboard shows 2/2 healthy with zero standing Bheem warnings.
- CI is green on the correct path, lint is real, and coverage includes auth/csrf/orchestrator/health/hermes-ops.
- Hermes routes require auth and remain private-only; redact policies are decided and documented.
- Dashboard warnings reach the correct Telegram chat per instance.
Implementation Status Checklist
Update only with evidence (source review, tests, build output, or browser/VM verification).
- Phase 0: Guardrails reconfirmed
- Phase 1: `hermes-ops` hardened + tested
- Phase 2: Instance dimension + switcher
- Phase 3: Real telemetry ingestion + panes converted
- Phase 4: Bheem/Uma parity (backup, watchdog, restore drill)
- Phase 5: App/CI hardening (P0/P1/P2 done; P2 follow-ups in DEPLOYMENT.md mitigation roadmap remain)
- Phase 6: UX polish
- Phase 7: Security & access
- Phase 8: Notifications & Telegram
Decisions (resolved 2026-05-30)
1. Task data source: derive from real artifacts now; defer the JSONL pipeline. Hermes' real unit of work is the session (+ cron jobs), and there's no evidence the agent emits a task-level JSONL ledger today. Build the ledger/activity views from what already exists: `hermes sessions` (+ stats), `hermes cron list` (+ last-run), watchdog alerts, and backup git history. Add a JSONL session/event pipeline → SQLite later and only if the session/cron view proves insufficient (via a gateway hook that appends records). Do not fabricate a task store.
2. Reading Bheem state: self-reporting ops exporter per instance. Each instance runs a tiny read-only exporter (Bheem as a `uma` user-systemd timer, Vijay symmetrically) that writes a sanitized JSON snapshot (booleans, counts, timestamps, short HEADs; no secrets) to a known path; the unified backend just reads and aggregates the two files. No cross-user command execution or reaching into `/home/uma/.hermes`. Interim stopgap until the exporter ships: replace the brittle `ps`/`existsSync` Uma checks with `runuser -u uma -- systemctl --user is-active`/`is-enabled`.
3. Products: repoint at the real service registry; drop the fabricated mock. The dashboard already has a live service registry (`backend/src/modules/services/`, with health). Back the Products pane with that real data instead of a 50-item fiction; allow optional manual entries later for not-yet-deployed products. Relabel clearly until the mapping lands.
4. Auth: reuse platform-service JWT, defense-in-depth. Put the hermes routes behind the same platform-service auth (`backend/src/lib/auth.ts`) the rest of the dashboard uses; keep the network private (Tailscale/loopback) as a second layer. No separate basic-auth gate (that's only the never-used "if forced public" path).
5. Bheem backup: same repo + Drive, but Uma-owned least-privilege token; Bheem self-pushes. Keep `umadev0931/uma_hostinger_hermes_vm` and Bheem Drive, but give Bheem its own repo-scoped credential so it backs itself up rather than depending on root's broad credential. Root stops pushing Uma's backup; this also closes the standing GitHub least-privilege audit item.
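
For the exporter decision, the backend side amounts to reading a small JSON file per instance and refusing to trust it when it is missing, malformed, or stale. The following is a sketch under assumptions: the snapshot fields, the 5-minute staleness window, and all function names are hypothetical, and the hand-written guard stands in for the Zod schema the roadmap calls for so that the example stays dependency-free.

```typescript
export interface ExporterSnapshot {
  instanceId: 'vijay' | 'bheem';
  generatedAt: string; // ISO 8601 timestamp written by the exporter
  gatewayActive: boolean;
  backupTimerActive: boolean;
}

export type ExporterRead =
  | { status: 'ok'; snapshot: ExporterSnapshot }
  | { status: 'missing' | 'invalid' | 'stale' };

// Minimal structural check; a real implementation would likely use Zod.
function isSnapshot(value: unknown): value is ExporterSnapshot {
  if (typeof value !== 'object' || value === null) return false;
  const o = value as Record<string, unknown>;
  return (
    (o.instanceId === 'vijay' || o.instanceId === 'bheem') &&
    typeof o.generatedAt === 'string' &&
    typeof o.gatewayActive === 'boolean' &&
    typeof o.backupTimerActive === 'boolean'
  );
}

// `raw` is the file content, or null when the file could not be read.
export function readExporterSnapshot(
  raw: string | null,
  now: Date,
  maxAgeMs = 5 * 60_000, // assumed staleness window
): ExporterRead {
  if (raw === null) return { status: 'missing' };
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { status: 'invalid' };
  }
  if (!isSnapshot(data)) return { status: 'invalid' };
  const ageMs = now.getTime() - Date.parse(data.generatedAt);
  if (Number.isNaN(ageMs) || ageMs > maxAgeMs) return { status: 'stale' };
  return { status: 'ok', snapshot: data };
}

// Only a fresh, valid snapshot may say "down"; everything else is "unknown".
export function gatewayStatus(read: ExporterRead): 'up' | 'down' | 'unknown' {
  if (read.status !== 'ok') return 'unknown';
  return read.snapshot.gatewayActive ? 'up' : 'down';
}
```

Because the backend only parses a file, it needs no cross-user command execution, and a dead exporter shows up as "unknown" rather than as a false "down".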
Suggested Execution Order
- Phase 5 P0 (CI path + lint): unblocks everything.
- Phase 1 (harden `hermes-ops`): the foundation the real UI sits on.
- Phase 2 (instance dimension) + Phase 4 (Bheem parity) in parallel: make "two instances" first-class in both data and ops.
- Phase 3 (real telemetry, pane by pane).
- Phase 7 (auth) before any wider access; Phase 8 (Telegram) and Phase 6 (polish) last.
Each item is sized to land as a single PR with incremental commits to main.