Closes the remaining tractable items from the carry-forward queue.
1. Drop-root scaffold for the backend container (P2 mitigation)
`backend/Dockerfile` adds non-root `app` user (uid 1001) + `docker`
group (gid via `DOCKER_GID` build arg, default 999). `BACKEND_USER`
build arg defaults to `root` so existing deployments keep working;
set it to `app` plus `DOCKER_GID=$(getent group docker | cut -d: -f3)`
to flip the runtime non-root. `dashboard/DEPLOYMENT.md` gets a new
"Running non-root" section with the exact `chgrp`/`chmod` recipe
for the bind-mounted log files (the host-side prep that pairs with
the build flip). DEPLOYMENT.md mitigation roadmap updated.
2. Phase 6 trend cards
`lib/hermes-ops-history.ts` keeps the last 24 ops snapshots in
localStorage (de-duped on `generatedAt`, schema-guarded on read,
degrades silently on quota exceeded). Three trend cards in the
ops panel:
- Warning-volume sparkline + current count
- Healthy-instance count sparkline (X/2)
- Per-instance "minutes since last backup commit" with a 30m
stale threshold
SVG polyline sparklines, no chart library — `<svg viewBox="0 0
100 100" preserveAspectRatio="none">` with `vector-effect:
non-scaling-stroke` so the line stays 2px regardless of the
parent's width.
3. Phase 6 theme toggle
`components/theme-toggle.tsx` Sun/Moon button mounted in the
Hermes layout next to the instance switcher. Persists in
localStorage `bytelyst.theme.v1`. The design system already
defined `[data-theme="light"]` overrides in `styles/tokens.css`;
the toggle just sets the attribute. FOUC-prevention inline script
in the root layout reads the same key BEFORE React hydrates so
the first paint matches the user's last choice.
4. Phase 3 partial close: Agents pane → telemetry inventory
`/hermes/agents` now renders a "Memory & Skills inventory (live)"
SectionCard backed by the Phase 3 telemetry endpoint per instance
— `hermes memory list` and `hermes skills list` rendered with
per-section probe-status badges (`up`/`unknown`), item counts,
and the first N entries each. Agent **health** statuses (latency,
failure rate, last-success/failure) stay seed-data — observability
for those needs a separate ingestion contract that the telemetry
endpoint doesn't provide today.
5. Phase 0 reconfirmation
Roadmap Phase 0 ticked with explicit verification notes for each
guardrail (no public listener, manual approvals, secret hygiene,
Caddy review). Remains "must hold throughout" — the ticks reflect
today's verified state, not single-checkbox completion.
Verified: backend typecheck ✅, 74/74 backend unit tests ✅, web
typecheck ✅, 7/7 E2E ✅, lint 0 errors, build green, coverage gate
≥95% lines on every gated file.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Hermes Mission Control v2 — Two-Instance Dashboard Roadmap
Date: 2026-05-30
Owner: ByteLyst / S (saravanakumardb)
Repo: learning_ai_devops_tools (GitHub remote: bytelyst-devops-tools)
Dashboard: dashboard/ — Next.js 16 web (web/, port 3000 / container 3049) + Fastify 5 backend (backend/, port 4004)
What This Roadmap Is
The two existing roadmaps are effectively complete for their original scope:
- `docs/hermes_dashboard_roadmap.md`: built the 7-pane Hermes Mission Control UI. All checklist items are checked, but every pane except the live ops panel renders mock/seed data from `web/src/lib/hermes`.
- `docs/hermes-setup-upgrade-roadmap.md`: stood up and hardened the two Hermes instances on the VM (~68% checked; the open items are mostly Uma parity, credentials, and policy decisions).
This v2 roadmap supersedes the open dashboard-related items in both and adds the missing theme: power the unified dashboard with real data from BOTH Hermes instances — Vijay/root and Bheem/Uma — and close every known gap (mock data, backend hardening, two-instance parity, app/CI hygiene, UX polish, security, and notifications).
It does not re-do anything already verified in v1. It builds on the one piece that is already real and two-instance-aware: the backend hermes-ops module.
The Two Instances (authoritative topology)
Source of truth is dashboard/backend/src/modules/hermes-ops/repository.ts.
| Codename | OS user | HERMES_HOME | Gateway service | Private dashboard | Backup timer | Backup repo (local → GitHub) | Drive folder |
|---|---|---|---|---|---|---|---|
| Vijay | root | `/root/.hermes` | `hermes-gateway.service` (system) | `hermes-root-dashboard.service` → `:9119` | `hermes-root-backup.timer` | `/root/repos/bytelyst_hostinger_hermes_vm` → `saravanakumardb/bytelyst_hostinger_hermes_vm` | Vijay Drive |
| Bheem | uma | `/home/uma/.hermes` | `uma-hermes-gateway.service` (uma user systemd) | `uma-hermes-dashboard.service` → `:9120` | `uma-hermes-backup.timer` | `/home/uma/repos/uma_hostinger_hermes_vm` → `umadev0931/uma_hostinger_hermes_vm` | Bheem Drive |
- Both reachable privately only over Tailscale `100.87.53.10` (`:9119` Vijay, `:9120` Bheem). No public Caddy route. This is a hard guardrail.
- Plus a root-level `hermes-emergency-drive-upload.timer` that pushes encrypted bundles to each instance's Google Drive folder.
Three dashboard surfaces exist today
1. Native per-instance Hermes dashboards: `:9119` (Vijay) and `:9120` (Bheem), one per user, over Tailscale. Operationally scoped, separate from this codebase.
2. ByteLyst Mission Control: the `/hermes/*` suite in this repo's DevOps dashboard (7 panes). Intended to be the unified pane-of-glass over both instances.
3. The live `hermes-ops` panel, embedded in the Mission Control overview (`web/src/components/hermes-ops-panel.tsx`), already rendering real, both-instance status: gateways, private dashboards, backup timers, repo HEAD/cleanliness, Google token, restore-payload counts, cron timers, emergency drive, Tailscale IP, active session count, and warnings.
Decision baked into this roadmap: invest in surface #2/#3 — make unified Mission Control the real two-instance command center — rather than expanding the native per-instance dashboards. The hermes-ops module is the seed; everything below extends it.
Goal / Target State
A single private dashboard where, for both Vijay and Bheem, S can see at a glance:
- live instance health (gateway, dashboard, cron, backup freshness, disk/mem, Google auth) — real, cached, robust
- everything each Hermes is doing / did / failed / is blocked on — from real session/cron/task telemetry, filterable by instance
- backup & disaster-recovery posture at parity across both instances
- what needs founder attention, pushed to the right Telegram chat
…with the whole thing private-only, authenticated, tested, and in CI.
Gap Inventory (consolidated)
| ID | Gap | Source | Severity |
|---|---|---|---|
| G1 | 6 of 7 Mission Control panes are mock (`web/src/lib/hermes`) | v1 roadmap / README "read-only mock" | High |
| G2 | Tasks/Products/History/Agents have no instance dimension (Vijay vs Bheem) | this review | High |
| G3 | `hermes-ops` backend not hardened: no cache (~20 `execFile` per 60s poll), brittle Uma checks (`ps` string-match + hardcoded `existsSync`), errors swallowed to `null`, no tests | REVIEW_ACTIONS P1 #3 | High |
| G4 | No real telemetry ingestion (sessions, cron, memory, skills, alerts, backup history, task events) | v1 roadmap "real telemetry plan" | High |
| G5 | App/CI hygiene: CI path wrong (P0), lint is a no-op echo (P0), thin tests (P1), SSE disabled (P1), doc drift 3000 vs 3049 (P1), privileged docker socket/host mounts undocumented (P1), no prod logging (P2), E2E not wired (P2) | REVIEW_ACTIONS | P0–P2 |
| G6 | Mission Control polish: warning severity filters, trend cards, ops→ledger deep links, per-instance action rows, theme toggle | v1 "Next Improvements" | Med |
| G7 | Bheem/Uma parity: no persistent backup repo + timer equivalent to root, no watchdog cron, restore never tested, no quarterly drill | setup roadmap (lines 20, 146, 172, 185, 432, 447, 461) | High |
| G8 | Security/access: devops dashboard hermes routes need auth; `security.redact_secrets` / `privacy.redact_pii` undecided; GitHub/Gitea least-privilege audit + rotation pending | setup roadmap Phase 11 | High |
| G9 | Notifications: dashboard warnings not pushed to Telegram; approval-prompt flow + media/file delivery UX unvalidated | setup roadmap Phase 6 | Med |
Phase 0 — Guardrails (must hold throughout)
Reconfirmation pass 2026-05-30 (this session): all four guardrails still hold. Each item below carries the current verification state — they remain "must hold throughout", not single-checkbox completions.
- No public Caddy route or public listener for any Hermes dashboard, the `hermes-ops` API, or the DevOps dashboard's hermes data. Private-only via Tailscale / SSH tunnel / loopback. (Verified: `dashboard/docker-compose.yml` binds backend `127.0.0.1:4004:4004` and web `127.0.0.1:3049:3000` (loopback only). The backend listens on `0.0.0.0:4004` inside the container; that's the standard pattern and isn't reachable from outside the host. `/api/hermes/ops` and `/api/hermes/telemetry/:instance` both gate on `requireAdmin` (Phase 7). No new Caddy/Traefik label exposes a hermes path publicly.)
- Keep Hermes command approvals at `manual` or `smart`; no gateway approval bypass. (Out of scope for this codebase: gateway approval lives in Hermes itself, not the dashboard. The dashboard never originates an approval bypass; the new `/code-quality/check` change tightened auth + path validation rather than loosening any approval flow.)
- No raw secrets, tokens, OAuth files, `state.db`, or SQLite WAL/SHM in any git backup or in this repo. (`pnpm secret-scan` runs on every CI build (`.gitea/workflows/ci.yml` "Secret scan" step). Backend's `lib/logger.ts` redacts `Authorization`/`Cookie`/`*.token`/`JWT_SECRET`/`COSMOS_KEY`/`AZURE_CLIENT_SECRET` from any logged object. No `.env`/`state.db` tracked. Telegram convention doc explicitly says "don't paste tokens".)
- Re-run the Caddy/port review (`docs/hermes-operations.md`) before adding any route or hostname. (No new public routes/hostnames added this session. The `dashboard/DEPLOYMENT.md` "Ports — quick reference" table is the single source of truth and matches `docker-compose.yml`. If Phase 4 (Bheem/Uma parity) introduces a new Uma dashboard URL, the brief explicitly requires updating this section in the same change.)
Phase 1 — Make the unified backend authoritative and hardened (G3)
The hermes-ops snapshot becomes the single source of truth for live status. Before building UI on it, harden it.
- Add a short-TTL cache (mirror the health module's 30s cache) so the 60s panel poll doesn't fan out ~20 `systemctl`/`git`/`ps`/`du` subprocesses every refresh; serve cached snapshot with `generatedAt`.
- Replace brittle Bheem/Uma checks in `repository.ts` (`runuser` `systemctl --user` with ps/existsSync fallback so a failed probe degrades to the legacy check, not a false "down"):
  - `isUmaGatewayActive()` (currently `ps -eo` string match) → `runuser -u uma -- systemctl --user is-active uma-hermes-gateway.service` (or `--machine=uma@.host`).
  - `isUmaGatewayEnabled()` (currently hardcoded `existsSync` of a wants-symlink) → `systemctl --user is-enabled` via the same path.
- Stop swallowing every failure to `null` indiscriminately: distinguish "unit inactive" from "probe failed/timed out" and surface per-field status so the UI can show unknown vs down.
- Add Zod validation + a stable typed contract for `HermesOpsSnapshot` on the route.
- Add unit tests for the `hermes-ops` repository (mock `execFile`/`fs`); this closes the REVIEW_ACTIONS "only `services` has tests" gap for this module.
- Read Bheem/Uma state via a self-reporting ops exporter (Decision #2): a read-only `uma` user-systemd timer writes a sanitized JSON snapshot to a known path; the root backend reads + aggregates it (Vijay gets a symmetric exporter). Interim stopgap until it ships: `runuser -u uma -- systemctl --user is-active`/`is-enabled` instead of the `ps`/`existsSync` checks.
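The short-TTL cache in the first item can be sketched as a small wrapper around the snapshot loader. This is a minimal illustration, not the module's actual code: `createTtlCache` and its parameters are hypothetical names, and the in-flight coalescing mirrors what the Phase 3 notes describe for the telemetry module.

```typescript
// Hypothetical helper: wraps an expensive async loader (for example the
// function that fans out the systemctl/git/ps/du probes) behind a TTL cache.
type Loader<T> = () => Promise<T>;

function createTtlCache<T>(
  load: Loader<T>,
  ttlMs: number,
  now: () => number = Date.now, // injectable clock so tests can control time
): () => Promise<T> {
  let cached: { value: T; expiresAt: number } | undefined;
  let inFlight: Promise<T> | undefined;

  return function get(): Promise<T> {
    // Serve the cached value while it is still fresh.
    if (cached && now() < cached.expiresAt) return Promise.resolve(cached.value);
    // Coalesce concurrent callers onto the single pending load.
    if (inFlight) return inFlight;

    const pending = load().then(
      (value) => {
        cached = { value, expiresAt: now() + ttlMs };
        inFlight = undefined;
        return value;
      },
      (err) => {
        // Do not cache failures; the next caller retries.
        inFlight = undefined;
        throw err;
      },
    );
    inFlight = pending;
    return pending;
  };
}
```

With a 30s TTL and a 60s panel poll, each poll still triggers one refresh, but extra tabs or overlapping requests inside the window reuse the same snapshot instead of spawning another round of subprocesses.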
Phase 2 — Instance dimension across Mission Control (G2)
- Add `instanceId: 'vijay' | 'bheem'` to the core types in `web/src/lib/hermes` (`HermesTask`, `HermesProduct`, `HermesEvent`, `HermesRun`, agent/overview models) and to the backend contracts. (Web: `instanceId` now on `HermesProduct`, `HermesTask`, `HermesEvent`, `HermesRun`, `HermesAgentStatus` (with an `'all'` literal for cross-cutting agents like Hermes Core / GitHub link). Seed data deterministically split ~50/50 across instances. Backend ops contract already carried per-instance shape under `HermesOpsSnapshot.instances` from Phase 1; no separate backend change needed for this slice.)
- Add a global instance switcher in `HermesShell` (`All` / `Vijay (root)` / `Bheem (uma)`) with persisted selection; thread it through every pane. (New `HermesInstanceProvider` (React context, localStorage-backed under key `hermes.instanceFilter.v1`, with SSR-safe default to avoid hydration mismatch) mounted in `app/hermes/layout.tsx`. New `HermesInstanceSwitcher` segmented control rendered in the layout header above every pane. Every pane reads `useHermesInstance()` and threads the value into the data-fetcher.)
- Overview: show per-instance cards and a combined roll-up. (New "Per-instance roll-up" section on `/hermes` always shows Vijay and Bheem side-by-side with active/blocked/failed/success-rate cells regardless of the switcher state; that's the "always cross-instance" comparison view, while the eight metric cards above it are filtered by the switcher.)
- Ledger / Products / History / Agents: filter and badge by instance. (`HermesInstanceBadge` component shipped; tasks (Active Missions + Task Ledger), product cards (overview minicards + portfolio cards), and agent rows now show their instance. Filter helpers `getHermesTasks({instance})`, `getHermesProducts(view, instance)`, `getHermesAgents(instance)`, `getHermesHistory(instance)`, `getHermesOverview(instance)` all accept the filter and short-circuit `'all'`. New unit tests in `lib/hermes.test.ts` cover the filter semantics. New E2E test asserts the switcher's radiogroup, default selection, and persistence-friendly state change. 7/7 E2E + 13/13 web unit tests green.)
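The filter semantics described above ("accept the filter and short-circuit `'all'`") might look roughly like the following. This is a sketch under assumptions: `filterByInstance` and `HasInstance` are hypothetical names, and keeping cross-cutting rows (those tagged `'all'`) visible under a single-instance filter is an assumed behaviour, not something the roadmap states.

```typescript
type HermesInstanceId = 'vijay' | 'bheem';
// 'all' doubles as the switcher's "All" position and the tag for cross-cutting rows.
type HermesInstanceFilter = HermesInstanceId | 'all';

// Hypothetical minimal shape shared by tasks, products, agents, and so on.
interface HasInstance {
  instanceId: HermesInstanceId | 'all';
}

function filterByInstance<T extends HasInstance>(
  rows: readonly T[],
  filter: HermesInstanceFilter = 'all',
): T[] {
  // Short-circuit: "All" returns every row without inspecting it.
  if (filter === 'all') return [...rows];
  // Assumption: cross-cutting rows stay visible under either instance.
  return rows.filter((row) => row.instanceId === filter || row.instanceId === 'all');
}

// Usage with made-up seed rows.
const agents = [
  { name: 'Gateway', instanceId: 'vijay' as const },
  { name: 'Gateway', instanceId: 'bheem' as const },
  { name: 'Hermes Core', instanceId: 'all' as const },
];
const bheemOnly = filterByInstance(agents, 'bheem'); // Bheem's gateway + Hermes Core
```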
Phase 3 — Real per-instance telemetry, replacing mock pane by pane (G1, G4)
Define the ingestion contract first, then convert panes. Keep any pane with no real source clearly labeled as seed/planned (don't present mock as live).
- Primary source = real artifacts (Decision #1): sessions, cron, watchdog alerts, backup history; read-only and cached. Treat a Hermes session as the work unit. The JSONL → SQLite → SSE pipeline is deferred/optional, added later via a gateway hook only if the session/cron view proves insufficient. (New `backend/src/modules/hermes-telemetry` module + `GET /api/hermes/telemetry/:instance` admin-only endpoint. Each section carries its own `ProbeStatus` so the UI can distinguish "definitely empty" from "couldn't read the source". 30s TTL cache + in-flight coalescing, mirrors hermes-ops. JSONL → SQLite → SSE explicitly deferred per Decision #1.)
- Backend endpoints per instance, reading real Hermes state:
  - Sessions + stats (`hermes sessions stats --json`).
  - Cron jobs (`hermes cron list --json`).
  - Memory + skills inventory (`hermes memory list --json`, `hermes skills list --json`).
  - Watchdog alerts feed (tails `~/.hermes/logs/hermes-health-watchdog.log`, severity-bucketed `info`/`warn`/`critical`).
  - Backup history (`git -C <repo> log`, last 20 commits per backup repo).
- Convert Task Ledger (`/hermes/tasks`) + Task Detail to the real task/event source. (Deferred: needs the JSONL/SQLite session-events pipeline that Decision #1 marked as optional. Task Ledger remains seed-data; flip when a real source ships.)
- [~] Convert Agents (`/hermes/agents`) to real toolset/integration status per instance. (Partial: `/hermes/agents` now renders a "Memory & Skills inventory (live)" SectionCard backed by the Phase 3 telemetry endpoint per instance, with `hermes memory list` / `hermes skills list` rendered with per-section probe-status badges, item counts, and the first N entries each. Agent health statuses (latency, failure rate, last-success/failure) are still seed-data; lighting those up needs a separate observability contract, because telemetry only exposes inventory today.)
- Convert History (`/hermes/history`) to real session/cron/backup trends. (Deferred: depends on real session timeseries.)
- Products (`/hermes/products`): repoint at the real service registry (`backend/src/modules/services/`) + health module (Decision #3); drop the fabricated 50-item mock. Optional manual entries for not-yet-deployed products come later. (Page rewritten: top "Live services" section sources from `api.getServices()` joined with `api.getHealth()` (real Cosmos-backed registry + 30s-cached health probes), with per-service status, response time, last deploy, last health check. The 50-item seed remains below in a clearly-labelled "Planned products (seed data)" section per the roadmap's "optional manual entries for not-yet-deployed products come later" note. New E2E mocks for `/api/services` + `/api/health` keep the suite deterministic.)
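The per-section `ProbeStatus` idea (telling "definitely empty" apart from "couldn't read the source") can be illustrated with a small wrapper. The `up`/`unknown` values come from the badges described for the Agents pane; the `TelemetrySection` shape and the `probeSection` helper are hypothetical and only sketch the pattern.

```typescript
type ProbeStatus = 'up' | 'unknown';

// Hypothetical section envelope; the real contract may carry more fields.
interface TelemetrySection<T> {
  status: ProbeStatus;
  items: T[];
  count: number;
  error?: string;
}

// Runs one read-only probe and never throws: a failed probe becomes
// status "unknown" instead of being reported as an empty list.
async function probeSection<T>(read: () => Promise<T[]>): Promise<TelemetrySection<T>> {
  try {
    const items = await read();
    return { status: 'up', items, count: items.length };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { status: 'unknown', items: [], count: 0, error: message };
  }
}
```

A section with `status: 'up'` and `count: 0` means the source was read and is empty, while `status: 'unknown'` tells the UI to show a neutral badge rather than implying there is nothing to list.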
Phase 4 — Bheem/Uma parity so the dashboard shows two equal instances (G7)
This is the biggest operational asymmetry and the reason half the ops-panel warnings are Bheem-only.
VM ops, not codebase work. This phase requires sudo on the Hostinger VM, Uma-owned GitHub credentials, and Telegram bot tokens — none of it is editable in this repo. The full delegation brief is in
docs/prompts/phase4-bheem-uma-parity.md. When the brief's Definition-of-Done is met, tick the boxes below and the summary line at the bottom of this file.
- Stand up a Uma persistent backup repo + `uma-hermes-backup.timer` mirroring the root design (sanitized `hermes_persistent_backup/`, secrets and `state.db` excluded), pushing to `umadev0931/uma_hostinger_hermes_vm` with a Uma-owned, repo-scoped token (Bheem self-pushes; root no longer pushes Uma's backup, per Decision #5).
- Install a Uma health watchdog (mirror `scripts/hermes-health-watchdog.py`), silent-on-success, alerting Uma's Telegram.
- Run the first Uma restore rehearsal into a temporary `HERMES_HOME`; document in `docs/hermes-operations.md` / `docs/hermes-disaster-recovery.md`.
- Schedule a quarterly Uma restore-drill reminder (parity with root).
- Confirm these close the corresponding Bheem warnings emitted by `getHermesOpsSnapshot()` (backup timer active, repo HEAD readable + clean, Google token present).
Phase 5 — Dashboard app hardening (G5)
- P0: Fix the CI workspace path (`${{ gitea.workspace }}`) in `.gitea/workflows/ci.yml`, `DEPLOYMENT.md`, `scripts/deploy-hotcopy.sh` (currently point at non-existent `/opt/bytelyst/bytelyst-devops-tools/...`).
- P0: Replace the no-op `lint` echo with real linting (`next lint` for web, minimal ESLint for backend); make `pnpm lint` fail on bad code.
- P1: Add tests for `auth`, `csrf`, `deployments/orchestrator`, `health`, and `hermes-ops`; add `pnpm test:coverage` gate. (35 new unit tests; v8 coverage thresholds gated on the six tested files in `backend/vitest.config.ts` (≥85% lines/funcs/stmts, ≥65% branches), wired into Gitea CI as a dedicated step. Today's actuals: ≥95% lines on every gated file. Ratchet up as more modules get tested.)
- P1: Resolve the SSE TODO: either ship a Fastify-5-compatible log-stream or remove the SSE claim from docs/UI. (Chose remove: dropped `fastify-sse-v2` dep, deleted commented-out plugin import + TODO from `server.ts` and `deployments/routes.ts`, rewrote the README/DEPLOYMENT.md "Log Streaming" section as "Logs (JSON-polled, no SSE)". Web client already polls `/deployments/:id/logs` via `apiRequest`, so no UI change is needed. If a real-time stream is wanted later, implement via `reply.raw` and update docs in the same change.)
- P1: Fix doc drift (web port 3000 vs 3049; endpoint URLs; merge duplicate deployment docs). (`DEPLOYMENT.md` is now canonical; `DEPLOYMENT_GUIDE.md` reduced to a redirect stub; `deploy.sh` updated. Added an explicit "Ports — quick reference" table to `DEPLOYMENT.md` distinguishing container `:3000`, Compose host `:3049`, Traefik production. README and ENDPOINTS.md cross-link to it. Marks REVIEW_ACTIONS #5 resolved.)
- P1: Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths, so the blast radius must be written down; consider an allow-list wrapper over the raw socket). (New "Privilege Surface" section in `dashboard/DEPLOYMENT.md` enumerating every mount, every shell-outing route + commands + auth gate, the blast-radius if an admin token leaks, five known sharp edges, and a P1→P3 mitigation roadmap. Concurrent fix: `/code-quality/check` was reachable unauthenticated despite shelling out to `npm run` in a caller-supplied path; `requireAdmin` added. Allow-list wrapper around `docker`/`bash`/`npm` invocations and `projectPath` validation are queued as the next P1s; running the container as non-root and replacing the raw `docker.sock` with a verb-restricted proxy are P2/P3.)
- P2: Structured backend logging (pino → stdout); wire E2E (`hermes.spec.ts`) into CI with a started stack. (Two commits: (1) `lib/logger.ts` exposes a configured pino instance shared between Fastify (via `loggerInstance`) and any non-request code path, with `LOG_LEVEL` env knob and built-in redaction for Authorization/Cookie headers + common secret-shaped field names; runtime `console.error` sites in deployments/orchestrator, system, backup, and vm modules ported over to structured logs. (2) E2E in CI: hermes spec now intercepts `/api/hermes/ops` with a fixture snapshot so it's deterministic without a live backend; CI workflow runs `playwright install --with-deps chromium` then `pnpm test:e2e` (web suite starts its own Next dev via Playwright's `webServer` config). Verified locally: 6/6 E2E green, 51/51 unit tests green, coverage gate ≥95% lines.)
Phase 6 — Mission Control UX polish (G6)
- Severity-tag warnings (info/warn/critical) and add a severity filter to the ops panel. (`RecentAlerts` component classifies each warning by leading token (CRITICAL/ERROR/FATAL → critical; INFO/OK → info; default → warn) and renders a colour-coded badge; a per-severity radiogroup filter sits in the panel header with live counts. UI-only, with no backend contract change.)
- Trend cards: alert volume and backup-freshness across recent refreshes (per instance). (`lib/hermes-ops-history.ts` keeps the last 24 snapshots in localStorage (de-duped on `generatedAt`, schema-guarded on read); the ops panel renders three trend cards inline: warning-volume sparkline, healthy-instance sparkline, per-instance "minutes since last backup commit" with a 30-minute stale threshold. SVG polyline sparklines, no chart library.)
- Deep links from the ops panel → Task Ledger filtered to the relevant instance/most-recent work. (Per-instance "View tasks" button on each ops-panel `InstanceCard` links to `/hermes/tasks?instance=<id>`. `HermesInstanceProvider` now hydrates from the `?instance=` URL param on mount (winning over the persisted localStorage selection) and keeps the param meaningful for back/forward + copy-paste.)
- Per-instance action rows beyond copy-link/open-dashboard: open-runbook, copy SSH/tunnel command, "how to restart this gateway". (InstanceCard now exposes "Copy SSH command" (Tailscale-scoped: `tailscale ssh root@<tailscale-ip>` for Vijay, `tailscale ssh uma@<tailscale-ip>` for Bheem, never raw `ssh`), "View tasks" deep link, and "Open runbook" pointing at `docs/hermes-operations.md`. "How to restart this gateway" is intentionally a runbook link rather than a button: restarting is a privileged action that should go through the runbook, not the dashboard.)
- Optional dark/light theme toggle if the shell supports it. (`components/theme-toggle.tsx` Sun/Moon button mounted in the Hermes layout next to the instance switcher. Persists in localStorage `bytelyst.theme.v1`; an inline FOUC-prevention script in the root layout reads the same key and applies `data-theme` to `<html>` before React hydrates so the first paint matches the user's last choice. The design system already had `[data-theme="light"]` overrides in `styles/tokens.css`; the toggle just flips them on.)
- Unified alerts feed across both instances on the overview. (Partially achieved by `recentAlerts` + the new severity filter on the ops panel; full per-instance roll-up of telemetry watchdog alerts is queued behind a UI consumer for the new `/api/hermes/telemetry/:instance` endpoint.)
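Three of the Phase 6 pieces above are pure data transforms and can be sketched without React or localStorage: classifying a warning by its leading token, keeping a bounded history de-duped on `generatedAt`, and mapping values onto a `0 0 100 100` viewBox for a polyline. All function names and the `HistoryPoint` shape are hypothetical, and splitting the leading token on whitespace or a colon is an assumption about the warning format.

```typescript
type Severity = 'info' | 'warn' | 'critical';

// Leading-token rule from the roadmap: CRITICAL/ERROR/FATAL -> critical,
// INFO/OK -> info, anything else -> warn.
function classifySeverity(warning: string): Severity {
  const token = (warning.trim().split(/[\s:]+/, 1)[0] ?? '').toUpperCase();
  if (token === 'CRITICAL' || token === 'ERROR' || token === 'FATAL') return 'critical';
  if (token === 'INFO' || token === 'OK') return 'info';
  return 'warn';
}

// Hypothetical reduced snapshot; the real history entry likely holds more.
interface HistoryPoint {
  generatedAt: string;
  warningCount: number;
}

const MAX_POINTS = 24;

// Appends a point unless its generatedAt is already present, then trims
// to the most recent MAX_POINTS entries.
function appendPoint(history: readonly HistoryPoint[], next: HistoryPoint): HistoryPoint[] {
  if (history.some((p) => p.generatedAt === next.generatedAt)) return [...history];
  return [...history, next].slice(-MAX_POINTS);
}

// Maps values to "x,y" pairs for <polyline points="..."> in a 100x100 viewBox.
function toSparklinePoints(values: readonly number[]): string {
  if (values.length === 0) return '';
  const min = Math.min(...values);
  const max = Math.max(...values);
  const range = max - min || 1; // avoid dividing by zero on a flat series
  const step = values.length > 1 ? 100 / (values.length - 1) : 0;
  return values
    .map((v, i) => {
      const x = i * step;
      const y = 100 - ((v - min) / range) * 100; // SVG y grows downward
      return `${x.toFixed(1)},${y.toFixed(1)}`;
    })
    .join(' ');
}
```

For example, `toSparklinePoints([0, 5, 10])` yields `0.0,100.0 50.0,50.0 100.0,0.0`, a line rising from the bottom-left to the top-right of the viewBox.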
Phase 7 — Security & access (G8)
- Require authentication on the DevOps dashboard's hermes routes/endpoints (reuse platform-service auth already used elsewhere). (Both `/api/hermes/ops` and the new `/api/hermes/telemetry/:instance` now gate on `requireAdmin`. Privilege-surface table in `dashboard/DEPLOYMENT.md` updated to match. The previous "read-only ops snapshot, no auth" carve-out is gone; all Hermes routes are admin-only.)
- Decide and document `security.redact_secrets` and `privacy.redact_pii` for gateway sessions (per instance). (Deferred: needs a founder decision on PII handling for session content; not a code-only change.)
- Finish the GitHub/Gitea least-privilege token audit (root currently pushes both repos) and rotate any migrated/exposed credentials; completed naturally by Decision #5 (Bheem self-pushes with its own scoped token). (Resolves naturally when Phase 4 ships; see the Phase 4 delegation brief.)
- Keep all hermes data private-only; never expose the `hermes-ops` snapshot or task data on a public route. (Verified: no Caddy/public route added; the dashboard is bound to `127.0.0.1` and reached via Tailscale or SSH tunnel only; see `dashboard/DEPLOYMENT.md` "Ports — quick reference" + "Privilege Surface" sections. With this commit's `requireAdmin` change, even an attacker with internal network access still needs a valid admin JWT to read the ops snapshot.)
Phase 8 — Notifications & Telegram loop (G9)
Mostly VM ops + bot-token configuration, with two small backend hooks. Full delegation brief in `docs/prompts/phase8-telegram-loop.md`. The dashboard's documentation half is already done; see `docs/hermes-operations.md` "Telegram Notification Convention".
- Push new dashboard-detected warnings to the correct Telegram (Vijay → root chat, Bheem → Uma chat), reusing the watchdog delivery path; silent on healthy. (Design captured in the brief: `lib/dashboard-alerts.ts` writes new warnings to a tag-prefixed log; both watchdogs tail it. Implementation gated on Phase 4 (Uma watchdog must exist first) and on bot tokens.)
- Validate the Telegram approval-prompt flow and media/file delivery end-to-end (the two unchecked v1 items). (Brief item 3.)
- Preserve the numbered-emoji progress convention (1️⃣, 2️⃣, …) for completion updates. (Codified in `docs/hermes-operations.md` under a new "Telegram Notification Convention" section, alongside the routing-per-instance, silent-on-healthy, and never-paste-secrets rules. The brief references this as the source of truth so VM-side implementers stay consistent.)
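The "new warnings only, silent on healthy" rule in the first item can be sketched as a diff between two consecutive warning lists. This is an illustration only: the function name is hypothetical, and the `[instance] text` line format is an assumed stand-in for the tag prefix, which the brief defines and this document does not.

```typescript
type HermesInstanceId = 'vijay' | 'bheem';

// Returns log lines for warnings present now but absent from the previous
// snapshot. Assumption: the tag prefix is "[<instance>] "; the real format
// lives in the Phase 8 brief.
function newWarningLines(
  instance: HermesInstanceId,
  previous: readonly string[],
  current: readonly string[],
): string[] {
  const seen = new Set(previous);
  return current.filter((w) => !seen.has(w)).map((w) => `[${instance}] ${w}`);
}
```

A healthy snapshot (empty `current`) produces no lines, and a warning that persists across refreshes is written once rather than on every poll.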
Data Model Additions
// web/src/lib/hermes (and mirrored in backend contracts)
export type HermesInstanceId = 'vijay' | 'bheem';
export interface HermesInstanceRef {
id: HermesInstanceId;
label: string; // "Vijay / root", "Bheem / Uma"
user: string; // "root" | "uma"
hermesHome: string;
}
// add `instanceId: HermesInstanceId` to:
// HermesTask, HermesProduct, HermesEvent, HermesRun, HermesAgentStatus, HermesOverview
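Because the instance filter arrives as untrusted strings (the `?instance=` URL param and the value persisted in localStorage), a type guard over `HermesInstanceId` is one way to narrow it safely. The sketch below is an assumption-laden illustration: `isHermesInstanceId` and `resolveInstanceFilter` are hypothetical helpers, and falling back to `'all'` is an assumed default. The URL-over-storage precedence follows the Phase 6 deep-link note.

```typescript
type HermesInstanceId = 'vijay' | 'bheem';
type HermesInstanceFilter = HermesInstanceId | 'all';

const INSTANCE_IDS: readonly HermesInstanceId[] = ['vijay', 'bheem'];

// Narrows an arbitrary value to a known instance id.
function isHermesInstanceId(value: unknown): value is HermesInstanceId {
  return typeof value === 'string' && INSTANCE_IDS.some((id) => id === value);
}

// URL param wins over the persisted selection; anything unrecognised is
// skipped, and the (assumed) default is 'all'.
function resolveInstanceFilter(
  urlParam: string | null,
  stored: string | null,
): HermesInstanceFilter {
  for (const candidate of [urlParam, stored]) {
    if (candidate === 'all' || isHermesInstanceId(candidate)) return candidate;
  }
  return 'all';
}
```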
Acceptance Criteria
This roadmap is complete when:
- The overview, ledger, agents, and history panes render real data for both Vijay and Bheem, filterable by instance; only panes without a real source remain (clearly labeled) seed data.
- `hermes-ops` is cached, uses robust Uma user-systemd checks, distinguishes unknown vs down, and has unit tests.
- Bheem has a persistent backup repo + timer, a watchdog, and one completed restore rehearsal, and the dashboard shows 2/2 healthy with zero standing Bheem warnings.
- CI is green on the correct path, lint is real, and coverage includes auth/csrf/orchestrator/health/hermes-ops.
- Hermes routes require auth and remain private-only; redact policies are decided and documented.
- Dashboard warnings reach the correct Telegram chat per instance.
Implementation Status Checklist
Update only with evidence (source review, tests, build output, or browser/VM verification).
- Phase 0 — Guardrails reconfirmed (2026-05-30 pass; remains "must hold throughout")
- Phase 1 — `hermes-ops` hardened + tested
- Phase 2 — Instance dimension + switcher
- Phase 3 — Real telemetry ingestion + Products pane converted (Agents partially converted: live Memory & Skills inventory; Task Ledger / History deferred, depend on JSONL session pipeline, see Phase 3 notes)
- Phase 4 — Bheem/Uma parity (backup, watchdog, restore drill)
- Phase 5 — App/CI hardening (P0/P1/P2 done; P2 follow-ups in DEPLOYMENT.md mitigation roadmap remain)
- Phase 6 — UX polish (severity tags + trend cards + deep links + per-instance actions + theme toggle; unified alerts feed partial)
- Phase 7 — Security & access (auth on hermes routes + privacy stance documented; redact_secrets/redact_pii decision deferred)
- Phase 8 — Notifications & Telegram (convention codified; delivery loop is VM ops, see brief)
Decisions (resolved 2026-05-30)
1. Task data source: derive from real artifacts now; defer the JSONL pipeline. Hermes' real unit of work is the session (+ cron jobs), and there's no evidence the agent emits a task-level JSONL ledger today. Build the ledger/activity views from what already exists: `hermes sessions` (+ stats), `hermes cron list` (+ last-run), watchdog alerts, and backup git history. Add a JSONL session/event pipeline → SQLite later and only if the session/cron view proves insufficient (via a gateway hook that appends records). Do not fabricate a task store.
2. Reading Bheem state: self-reporting ops exporter per instance. Each instance runs a tiny read-only exporter (Bheem as a `uma` user-systemd timer, Vijay symmetrically) that writes a sanitized JSON snapshot (booleans, counts, timestamps, short HEADs; no secrets) to a known path; the unified backend just reads and aggregates the two files. No cross-user command execution or reaching into `/home/uma/.hermes`. Interim stopgap until the exporter ships: replace the brittle `ps`/`existsSync` Uma checks with `runuser -u uma -- systemctl --user is-active`/`is-enabled`.
3. Products: repoint at the real service registry; drop the fabricated mock. The dashboard already has a live service registry (`backend/src/modules/services/`, with health). Back the Products pane with that real data instead of a 50-item fiction; allow optional manual entries later for not-yet-deployed products. Relabel clearly until the mapping lands.
4. Auth: reuse platform-service JWT, defense-in-depth. Put the hermes routes behind the same platform-service auth (`backend/src/lib/auth.ts`) the rest of the dashboard uses; keep the network private (Tailscale/loopback) as a second layer. No separate basic-auth gate (that's only the never-used "if forced public" path).
5. Bheem backup: same repo + Drive, but Uma-owned least-privilege token; Bheem self-pushes. Keep `umadev0931/uma_hostinger_hermes_vm` and Bheem Drive, but give Bheem its own repo-scoped credential so it backs itself up rather than depending on root's broad credential. Root stops pushing Uma's backup; this also closes the standing GitHub least-privilege audit item.
Suggested Execution Order
1. Phase 5 P0 (CI path + lint): unblocks everything.
2. Phase 1 (harden `hermes-ops`): the foundation the real UI sits on.
3. Phase 2 (instance dimension) + Phase 4 (Bheem parity) in parallel: make "two instances" first-class in both data and ops.
4. Phase 3 (real telemetry, pane by pane).
5. Phase 7 (auth) before any wider access; Phase 8 (Telegram) and Phase 6 (polish) last.
Each item is sized to land as a single PR with incremental commits to main.