Closes the remaining tractable items from the carry-forward queue.

1. Drop-root scaffold for the backend container (P2 mitigation)
   `backend/Dockerfile` adds a non-root `app` user (uid 1001) and a
   `docker` group (gid via the `DOCKER_GID` build arg, default 999).
   The `BACKEND_USER` build arg defaults to `root` so existing
   deployments keep working; set it to `app` plus
   `DOCKER_GID=$(getent group docker | cut -d: -f3)` to run the
   container as non-root. `dashboard/DEPLOYMENT.md` gets a new
   "Running non-root" section with the exact `chgrp`/`chmod` recipe
   for the bind-mounted log files (the host-side prep that pairs with
   the build flip). DEPLOYMENT.md mitigation roadmap updated.

2. Phase 6 trend cards
   `lib/hermes-ops-history.ts` keeps the last 24 ops snapshots in
   localStorage (de-duped on `generatedAt`, schema-guarded on read,
   degrades silently on quota exceeded). Three trend cards in the
   ops panel:
   - Warning-volume sparkline + current count
   - Healthy-instance count sparkline (X/2)
   - Per-instance "minutes since last backup commit" with a 30m
     stale threshold
   SVG polyline sparklines, no chart library:
   `<svg viewBox="0 0 100 100" preserveAspectRatio="none">` with
   `vector-effect: non-scaling-stroke` so the line stays 2px
   regardless of the parent's width.

3. Phase 6 theme toggle
   `components/theme-toggle.tsx` Sun/Moon button mounted in the
   Hermes layout next to the instance switcher. Persists in
   localStorage `bytelyst.theme.v1`. The design system already
   defined `[data-theme="light"]` overrides in `styles/tokens.css`;
   the toggle just sets the attribute. FOUC-prevention inline script
   in the root layout reads the same key BEFORE React hydrates so
   the first paint matches the user's last choice.

4. Phase 3 partial close: Agents pane → telemetry inventory
   `/hermes/agents` now renders a "Memory & Skills inventory (live)"
   SectionCard backed by the Phase 3 telemetry endpoint per instance:
   `hermes memory list` and `hermes skills list` rendered with
   per-section probe-status badges (`up`/`unknown`), item counts,
   and the first N entries each. Agent **health** statuses (latency,
   failure rate, last-success/failure) stay seed-data; observability
   for those needs a separate ingestion contract that the telemetry
   endpoint doesn't provide today.

5. Phase 0 reconfirmation
   Roadmap Phase 0 ticked with explicit verification notes for each
   guardrail (no public listener, manual approvals, secret hygiene,
   Caddy review). Remains "must hold throughout"; the ticks reflect
   today's verified state, not single-checkbox completion.

Verified: backend typecheck ✅, 74/74 backend unit tests ✅, web
typecheck ✅, 7/7 E2E ✅, lint 0 errors, build green, coverage gate
≥95% lines on every gated file.

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
# Hermes Mission Control v2 — Two-Instance Dashboard Roadmap

**Date:** 2026-05-30
**Owner:** ByteLyst / S (`saravanakumardb`)
**Repo:** `learning_ai_devops_tools` (GitHub remote: `bytelyst-devops-tools`)
**Dashboard:** `dashboard/` — Next.js 16 web (`web/`, container port 3000 / Compose host port 3049) + Fastify 5 backend (`backend/`, port 4004)

## What This Roadmap Is

The two existing roadmaps are effectively complete for their original scope:

- `docs/hermes_dashboard_roadmap.md` — built the 7-pane **Hermes Mission Control** UI. All checklist items are checked, **but every pane except the live ops panel renders mock/seed data** from `web/src/lib/hermes`.
- `docs/hermes-setup-upgrade-roadmap.md` — stood up and hardened the two Hermes instances on the VM (~68% checked; the open items are mostly Uma parity, credentials, and policy decisions).

This v2 roadmap **supersedes the open dashboard-related items in both** and adds the missing theme: **power the unified dashboard with real data from BOTH Hermes instances — Vijay/root and Bheem/Uma — and close every known gap** (mock data, backend hardening, two-instance parity, app/CI hygiene, UX polish, security, and notifications).

It does **not** re-do anything already verified in v1. It builds on the one piece that is already real and two-instance-aware: the backend `hermes-ops` module.

## The Two Instances (authoritative topology)

Source of truth is `dashboard/backend/src/modules/hermes-ops/repository.ts`.

| Codename | OS user | `HERMES_HOME` | Gateway service | Private dashboard | Backup timer | Backup repo (local → GitHub) | Drive folder |
|----------|---------|---------------|-----------------|-------------------|--------------|------------------------------|--------------|
| **Vijay** | `root` | `/root/.hermes` | `hermes-gateway.service` (system) | `hermes-root-dashboard.service` → :9119 | `hermes-root-backup.timer` | `/root/repos/bytelyst_hostinger_hermes_vm` → `saravanakumardb/bytelyst_hostinger_hermes_vm` | Vijay Drive |
| **Bheem** | `uma` | `/home/uma/.hermes` | `uma-hermes-gateway.service` (uma **user** systemd) | `uma-hermes-dashboard.service` → :9120 | `uma-hermes-backup.timer` | `/home/uma/repos/uma_hostinger_hermes_vm` → `umadev0931/uma_hostinger_hermes_vm` | Bheem Drive |
- Both reachable **privately only** over Tailscale `100.87.53.10` (`:9119` Vijay, `:9120` Bheem). No public Caddy route. This is a hard guardrail.
- Plus a root-level `hermes-emergency-drive-upload.timer` that pushes encrypted bundles to each instance's Google Drive folder.

### Three dashboard surfaces exist today

1. **Native per-instance Hermes dashboards** — `:9119` (Vijay) and `:9120` (Bheem), one per user, over Tailscale. Operationally scoped, separate from this codebase.
2. **ByteLyst Mission Control** — the `/hermes/*` suite in this repo's DevOps dashboard (7 panes). **Intended to be the unified pane-of-glass over both instances.**
3. **The live `hermes-ops` panel** — embedded in the Mission Control overview (`web/src/components/hermes-ops-panel.tsx`), already rendering **real, both-instance** status: gateways, private dashboards, backup timers, repo HEAD/cleanliness, Google token, restore-payload counts, cron timers, emergency drive, Tailscale IP, active session count, and warnings.

**Decision baked into this roadmap:** invest in surface #2/#3 — make unified Mission Control the real two-instance command center — rather than expanding the native per-instance dashboards. The `hermes-ops` module is the seed; everything below extends it.

## Goal / Target State

A single private dashboard where, for **both Vijay and Bheem**, S can see at a glance:

- live instance health (gateway, dashboard, cron, backup freshness, disk/mem, Google auth) — **real, cached, robust**
- everything each Hermes is doing / did / failed / is blocked on — **from real session/cron/task telemetry, filterable by instance**
- backup & disaster-recovery posture **at parity across both instances**
- what needs founder attention, pushed to the right Telegram chat

…with the whole thing private-only, authenticated, tested, and in CI.

---

## Gap Inventory (consolidated)

| ID | Gap | Source | Severity |
|----|-----|--------|----------|
| G1 | 6 of 7 Mission Control panes are mock (`web/src/lib/hermes`) | v1 roadmap / README "read-only mock" | High |
| G2 | Tasks/Products/History/Agents have **no instance dimension** (Vijay vs Bheem) | this review | High |
| G3 | `hermes-ops` backend not hardened: no cache (~20 `execFile` per 60s poll), brittle Uma checks (`ps` string-match + hardcoded `existsSync`), errors swallowed to `null`, **no tests** | REVIEW_ACTIONS P1 #3 | High |
| G4 | No real telemetry ingestion (sessions, cron, memory, skills, alerts, backup history, task events) | v1 roadmap "real telemetry plan" | High |
| G5 | App/CI hygiene: CI path wrong (P0), lint is a no-op echo (P0), thin tests (P1), SSE disabled (P1), doc drift 3000 vs 3049 (P1), privileged docker socket/host mounts undocumented (P1), no prod logging (P2), E2E not wired (P2) | REVIEW_ACTIONS | P0–P2 |
| G6 | Mission Control polish: warning severity filters, trend cards, ops→ledger deep links, per-instance action rows, theme toggle | v1 "Next Improvements" | Med |
| G7 | **Bheem/Uma parity:** no persistent backup repo + timer equivalent to root, no watchdog cron, restore never tested, no quarterly drill | setup roadmap (lines 20, 146, 172, 185, 432, 447, 461) | High |
| G8 | Security/access: devops dashboard hermes routes need auth; `security.redact_secrets` / `privacy.redact_pii` undecided; GitHub/Gitea least-privilege audit + rotation pending | setup roadmap Phase 11 | High |
| G9 | Notifications: dashboard warnings not pushed to Telegram; approval-prompt flow + media/file delivery UX unvalidated | setup roadmap Phase 6 | Med |

---

## Phase 0 — Guardrails (must hold throughout)

> Reconfirmation pass 2026-05-30 (this session): all four guardrails still
> hold. Each item below carries the current verification state — they remain
> "must hold throughout", not single-checkbox completions.
- [x] No public Caddy route or public listener for **any** Hermes dashboard, the `hermes-ops` API, or the DevOps dashboard's hermes data. Private-only via Tailscale / SSH tunnel / loopback. *(Verified: `dashboard/docker-compose.yml` binds backend `127.0.0.1:4004:4004` and web `127.0.0.1:3049:3000` (loopback only). The backend listens on `0.0.0.0:4004` **inside the container** — that's the standard pattern and isn't reachable from outside the host. `/api/hermes/ops` and `/api/hermes/telemetry/:instance` both gate on `requireAdmin` (Phase 7). No new Caddy/Traefik label exposes a hermes path publicly.)*
- [x] Keep Hermes command approvals at `manual` or `smart`; no gateway approval bypass. *(Out of scope for this codebase — gateway approval lives in Hermes itself, not the dashboard. The dashboard never originates an approval bypass; the new `/code-quality/check` change tightened auth + path validation rather than loosening any approval flow.)*
- [x] No raw secrets, tokens, OAuth files, `state.db`, or SQLite WAL/SHM in any git backup or in this repo. *(`pnpm secret-scan` runs on every CI build (`.gitea/workflows/ci.yml` "Secret scan" step). Backend's `lib/logger.ts` redacts `Authorization`/`Cookie`/`*.token`/`JWT_SECRET`/`COSMOS_KEY`/`AZURE_CLIENT_SECRET` from any logged object. No `.env`/`state.db` tracked. Telegram convention doc explicitly says "don't paste tokens".)*
- [x] Re-run the Caddy/port review (`docs/hermes-operations.md`) before adding any route or hostname. *(No new public routes/hostnames added this session. The `dashboard/DEPLOYMENT.md` "Ports — quick reference" table is the single source of truth and matches `docker-compose.yml`. If Phase 4 (Bheem/Uma parity) introduces a new Uma dashboard URL, the brief explicitly requires updating this section in the same change.)*

## Phase 1 — Make the unified backend authoritative and hardened (G3)

The `hermes-ops` snapshot becomes the single source of truth for live status. Before building UI on it, harden it.

- [x] Add a short-TTL cache (mirror the health module's 30s cache) so the 60s panel poll doesn't fan out ~20 `systemctl`/`git`/`ps`/`du` subprocesses every refresh; serve cached snapshot with `generatedAt`.
- [x] Replace brittle Bheem/Uma checks in `repository.ts` *(runuser `systemctl --user` with ps/existsSync fallback so a failed probe degrades to the legacy check, not a false "down")*:
  - `isUmaGatewayActive()` (currently `ps -eo` string match) → `runuser -u uma -- systemctl --user is-active uma-hermes-gateway.service` (or `--machine=uma@.host`).
  - `isUmaGatewayEnabled()` (currently hardcoded `existsSync` of a wants-symlink) → `systemctl --user is-enabled` via the same path.
- [x] Stop swallowing every failure to `null` indiscriminately: distinguish "unit inactive" from "probe failed/timed out" and surface per-field status so the UI can show *unknown* vs *down*.
- [x] Add Zod validation + a stable typed contract for `HermesOpsSnapshot` on the route.
- [x] **Add unit tests for the `hermes-ops` repository** (mock `execFile`/fs) — closes the REVIEW_ACTIONS "only `services` has tests" gap for this module.
- [ ] Read Bheem/Uma state via a **self-reporting ops exporter** (Decision #2): a read-only `uma` user-systemd timer writes a sanitized JSON snapshot to a known path; the root backend reads + aggregates it (Vijay gets a symmetric exporter). Interim stopgap until it ships: `runuser -u uma -- systemctl --user is-active/is-enabled` instead of the `ps`/`existsSync` checks.
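
The *unknown* vs *down* distinction in the list above can be sketched as a small classifier over a probe result. This is a minimal sketch and not the code in `repository.ts`: the `ProbeOutcome` shape, the `FieldStatus` name, and `classifyIsActive` are hypothetical, and it assumes only that `systemctl is-active` prints the unit state and exits `0` when the unit is active.

```typescript
// Hypothetical shapes for illustration; the real contract lives in the hermes-ops module.
export type FieldStatus = 'up' | 'down' | 'unknown';

export interface ProbeOutcome {
  exitCode: number | null; // null when the process never exited normally (spawn error, kill)
  stdout: string;
  timedOut: boolean;
}

export function classifyIsActive(outcome: ProbeOutcome): FieldStatus {
  // A probe that failed or timed out says nothing about the unit itself.
  if (outcome.timedOut || outcome.exitCode === null) return 'unknown';

  const state = outcome.stdout.trim();
  if (outcome.exitCode === 0 && state === 'active') return 'up';

  // Only states that positively report "not running" count as down.
  if (state === 'inactive' || state === 'failed') return 'down';

  // Empty output, transitional states, or bus errors stay unknown.
  return 'unknown';
}
```

Keeping the "probe failed" branch first means a timed-out `runuser` call can never be reported as a stopped gateway, which is the false "down" the list above is trying to avoid.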

## Phase 2 — Instance dimension across Mission Control (G2)

- [x] Add `instanceId: 'vijay' | 'bheem'` to the core types in `web/src/lib/hermes` (`HermesTask`, `HermesProduct`, `HermesEvent`, `HermesRun`, agent/overview models) and to the backend contracts. *(Web: `instanceId` now on `HermesProduct`, `HermesTask`, `HermesEvent`, `HermesRun`, `HermesAgentStatus` (with an `'all'` literal for cross-cutting agents like Hermes Core / GitHub link). Seed data deterministically split ~50/50 across instances. Backend ops contract already carried per-instance shape under `HermesOpsSnapshot.instances` from Phase 1; no separate backend change needed for this slice.)*
- [x] Add a global **instance switcher** in `HermesShell` (`All` / `Vijay (root)` / `Bheem (uma)`) with persisted selection; thread it through every pane. *(New `HermesInstanceProvider` (React context, localStorage-backed under key `hermes.instanceFilter.v1`, with SSR-safe default to avoid hydration mismatch) mounted in `app/hermes/layout.tsx`. New `HermesInstanceSwitcher` segmented control rendered in the layout header above every pane. Every pane reads `useHermesInstance()` and threads the value into the data-fetcher.)*
- [x] Overview: show per-instance cards **and** a combined roll-up. *(New "Per-instance roll-up" section on `/hermes` always shows Vijay and Bheem side-by-side with active/blocked/failed/success-rate cells regardless of the switcher state — that's the "always cross-instance" comparison view, while the eight metric cards above it are filtered by the switcher.)*
- [x] Ledger / Products / History / Agents: filter and badge by instance. *(`HermesInstanceBadge` component shipped; tasks (Active Missions + Task Ledger), product cards (overview minicards + portfolio cards), and agent rows now show their instance. Filter helpers `getHermesTasks({instance})`, `getHermesProducts(view, instance)`, `getHermesAgents(instance)`, `getHermesHistory(instance)`, `getHermesOverview(instance)` all accept the filter and short-circuit `'all'`. New unit tests in `lib/hermes.test.ts` cover the filter semantics. New E2E test asserts the switcher's radiogroup, default selection, and persistence-friendly state change. 7/7 E2E + 13/13 web unit tests green.)*
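
The filter semantics described above (helpers accept an instance filter and short-circuit `'all'`) can be illustrated with one generic function. This is a sketch under assumptions: `filterByInstance` and `InstanceScoped` are hypothetical names rather than the helpers in `lib/hermes`, and keeping `'all'`-tagged items visible under a specific instance is an assumed behavior.

```typescript
export type HermesInstanceId = 'vijay' | 'bheem';
export type HermesInstanceFilter = HermesInstanceId | 'all';

// Items may be tagged 'all' when they are cross-cutting (for example shared agents).
export interface InstanceScoped {
  instanceId: HermesInstanceId | 'all';
}

export function filterByInstance<T extends InstanceScoped>(
  items: readonly T[],
  filter: HermesInstanceFilter = 'all',
): T[] {
  // Short-circuit: 'all' returns every item without inspecting it.
  if (filter === 'all') return [...items];

  // Assumption: cross-cutting items stay visible under a specific instance.
  return items.filter((item) => item.instanceId === filter || item.instanceId === 'all');
}
```

For example, `filterByInstance(tasks, 'bheem')` would keep Bheem-tagged tasks plus anything tagged `'all'`, while the default argument leaves the list untouched.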

## Phase 3 — Real per-instance telemetry, replacing mock pane by pane (G1, G4)

Define the ingestion contract first, then convert panes. Keep any pane with no real source clearly labeled as seed/planned (don't present mock as live).

- [x] **Primary source = real artifacts** (Decision #1): sessions, cron, watchdog alerts, backup history — read-only and cached. Treat a Hermes *session* as the work unit. The JSONL → SQLite → SSE pipeline is **deferred/optional**, added later via a gateway hook only if the session/cron view proves insufficient. *(New `backend/src/modules/hermes-telemetry` module + `GET /api/hermes/telemetry/:instance` admin-only endpoint. Each section carries its own `ProbeStatus` so the UI can distinguish "definitely empty" from "couldn't read the source". 30s TTL cache + in-flight coalescing, mirrors hermes-ops. JSONL → SQLite → SSE explicitly deferred per Decision #1.)*
- [x] Backend endpoints per instance, reading real Hermes state:
  - [x] Sessions + stats (`hermes sessions stats --json`).
  - [x] Cron jobs (`hermes cron list --json`).
  - [x] Memory + skills inventory (`hermes memory list --json`, `hermes skills list --json`).
  - [x] Watchdog alerts feed (tails `~/.hermes/logs/hermes-health-watchdog.log`, severity-bucketed `info`/`warn`/`critical`).
  - [x] Backup history (`git -C <repo> log` — last 20 commits per backup repo).
- [ ] Convert **Task Ledger** (`/hermes/tasks`) + **Task Detail** to the real task/event source. *(Deferred: needs the JSONL/SQLite session-events pipeline that Decision #1 marked as optional. Task Ledger remains seed-data; flip when a real source ships.)*
- [~] Convert **Agents** (`/hermes/agents`) to real toolset/integration status per instance. *(Partial: `/hermes/agents` now renders a "Memory & Skills inventory (live)" SectionCard backed by the Phase 3 telemetry endpoint per instance — `hermes memory list` / `hermes skills list` rendered with per-section probe-status badges, item counts, and the first N entries each. Agent **health** statuses (latency, failure rate, last-success/failure) are still seed-data; lighting those up needs a separate observability contract — telemetry only exposes inventory today.)*
- [ ] Convert **History** (`/hermes/history`) to real session/cron/backup trends. *(Deferred: depends on real session timeseries.)*
- [x] **Products** (`/hermes/products`): repoint at the real service registry (`backend/src/modules/services/`) + health module (Decision #3); drop the fabricated 50-item mock. Optional manual entries for not-yet-deployed products come later. *(Page rewritten: top "Live services" section sources from `api.getServices()` joined with `api.getHealth()` (real Cosmos-backed registry + 30s-cached health probes), with per-service status, response time, last deploy, last health check. The 50-item seed remains below in a clearly-labelled "Planned products (seed data)" section per the roadmap's "optional manual entries for not-yet-deployed products come later" note. New E2E mocks for `/api/services` + `/api/health` keep the suite deterministic.)*
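
The "30s TTL cache + in-flight coalescing" pattern mentioned for the telemetry and `hermes-ops` modules can be sketched as follows. This is an illustrative sketch only: `createSnapshotCache` and its injectable clock are hypothetical and are not taken from the backend source.

```typescript
// Hypothetical helper: wraps an expensive async loader (for example one that
// shells out to several subprocesses) behind a TTL cache with coalescing.
export function createSnapshotCache<T>(
  load: () => Promise<T>,
  ttlMs: number,
  now: () => number = Date.now,
): () => Promise<T> {
  let cached: { value: T; expiresAt: number } | undefined;
  let inFlight: Promise<T> | undefined;

  return function get(): Promise<T> {
    // Fresh value: serve it without touching the loader.
    if (cached && now() < cached.expiresAt) return Promise.resolve(cached.value);

    // A load is already running: every concurrent caller shares it.
    if (inFlight) return inFlight;

    inFlight = load().then(
      (value) => {
        cached = { value, expiresAt: now() + ttlMs };
        inFlight = undefined;
        return value;
      },
      (error) => {
        // Failed loads are not cached; the next call retries.
        inFlight = undefined;
        throw error;
      },
    );
    return inFlight;
  };
}
```

With a 30 000 ms TTL, a 60s panel poll plus any number of overlapping requests would trigger at most one loader run per expiry window.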

## Phase 4 — Bheem/Uma parity so the dashboard shows two equal instances (G7)

This is the biggest operational asymmetry and the reason half the ops-panel warnings are Bheem-only.

> **VM ops, not codebase work.** This phase requires sudo on the Hostinger VM, Uma-owned GitHub credentials, and Telegram bot tokens — none of it is editable in this repo. The full delegation brief is in [`docs/prompts/phase4-bheem-uma-parity.md`](./prompts/phase4-bheem-uma-parity.md). When the brief's Definition-of-Done is met, tick the boxes below and the summary line at the bottom of this file.

- [ ] Stand up a **Uma persistent backup repo + `uma-hermes-backup.timer`** mirroring the root design (sanitized `hermes_persistent_backup/`, secrets and `state.db` excluded), pushing to `umadev0931/uma_hostinger_hermes_vm` **with a Uma-owned, repo-scoped token (Bheem self-pushes; root no longer pushes Uma's backup — Decision #5)**.
- [ ] Install a **Uma health watchdog** (mirror `scripts/hermes-health-watchdog.py`), silent-on-success, alerting Uma's Telegram.
- [ ] Run the **first Uma restore rehearsal** into a temporary `HERMES_HOME`; document in `docs/hermes-operations.md` / `docs/hermes-disaster-recovery.md`.
- [ ] Schedule a **quarterly Uma restore-drill reminder** (parity with root).
- [ ] Confirm these close the corresponding Bheem warnings emitted by `getHermesOpsSnapshot()` (backup timer active, repo HEAD readable + clean, Google token present).

## Phase 5 — Dashboard app hardening (G5)

- [x] **P0:** Fix the CI workspace path (`${{ gitea.workspace }}`) in `.gitea/workflows/ci.yml`, `DEPLOYMENT.md`, and `scripts/deploy-hotcopy.sh` (these previously pointed at the non-existent `/opt/bytelyst/bytelyst-devops-tools/...`).
- [x] **P0:** Replace the no-op `lint` echo with real linting (`next lint` for web, minimal ESLint for backend); make `pnpm lint` fail on bad code.
- [x] **P1:** Add tests for `auth`, `csrf`, `deployments/orchestrator`, `health`, **and `hermes-ops`**; add `pnpm test:coverage` gate. *(35 new unit tests; v8 coverage thresholds gated on the six tested files in `backend/vitest.config.ts` (≥85% lines/funcs/stmts, ≥65% branches), wired into Gitea CI as a dedicated step. Today's actuals: ≥95% lines on every gated file. Ratchet up as more modules get tested.)*
- [x] **P1:** Resolve the SSE TODO — either ship a Fastify-5-compatible log-stream or remove the SSE claim from docs/UI. *(Chose **remove**: dropped `fastify-sse-v2` dep, deleted commented-out plugin import + TODO from `server.ts` and `deployments/routes.ts`, rewrote the README/DEPLOYMENT.md "Log Streaming" section as "Logs (JSON-polled, no SSE)". Web client already polls `/deployments/:id/logs` via `apiRequest` — no UI change needed. If a real-time stream is wanted later, implement via `reply.raw` and update docs in the same change.)*
- [x] **P1:** Fix doc drift (web port 3000 vs 3049; endpoint URLs; merge duplicate deployment docs). *(`DEPLOYMENT.md` is now canonical; `DEPLOYMENT_GUIDE.md` reduced to a redirect stub; `deploy.sh` updated. Added an explicit "Ports — quick reference" table to `DEPLOYMENT.md` distinguishing container `:3000`, Compose host `:3049`, Traefik production. README and ENDPOINTS.md cross-link to it. Marks REVIEW_ACTIONS #5 resolved.)*
- [x] **P1:** Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths — blast radius must be written down; consider an allow-list wrapper over the raw socket). *(New "Privilege Surface" section in `dashboard/DEPLOYMENT.md` enumerating every mount, every shell-outing route + commands + auth gate, the blast-radius if an admin token leaks, five known sharp edges, and a P1→P3 mitigation roadmap. Concurrent fix: `/code-quality/check` was reachable unauthenticated despite shelling out to `npm run` in a caller-supplied path — `requireAdmin` added. Allow-list wrapper around `docker`/`bash`/`npm` invocations and `projectPath` validation are queued as the next P1s; running the container as non-root and replacing the raw `docker.sock` with a verb-restricted proxy are P2/P3.)*
- [x] **P2:** Structured backend logging (pino → stdout); wire E2E (`hermes.spec.ts`) into CI with a started stack. *(Two commits: (1) `lib/logger.ts` exposes a configured pino instance shared between Fastify (via `loggerInstance`) and any non-request code path, with `LOG_LEVEL` env knob and built-in redaction for Authorization/Cookie headers + common secret-shaped field names; runtime `console.error` sites in deployments/orchestrator, system, backup, and vm modules ported over to structured logs. (2) E2E in CI: hermes spec now intercepts `/api/hermes/ops` with a fixture snapshot so it's deterministic without a live backend; CI workflow runs `playwright install --with-deps chromium` then `pnpm test:e2e` (web suite starts its own Next dev via Playwright's `webServer` config). Verified locally: 6/6 E2E green, 51/51 unit tests green, coverage gate ≥95% lines.)*

## Phase 6 — Mission Control UX polish (G6)

- [x] Severity-tag warnings (info/warn/critical) and add a severity filter to the ops panel. *(`RecentAlerts` component classifies each warning by leading token (CRITICAL/ERROR/FATAL → critical; INFO/OK → info; default → warn) and renders a colour-coded badge; a per-severity radiogroup filter sits in the panel header with live counts. UI-only — no backend contract change.)*
- [x] Trend cards: alert volume and backup-freshness across recent refreshes (per instance). *(`lib/hermes-ops-history.ts` keeps the last 24 snapshots in localStorage (de-duped on `generatedAt`, schema-guarded on read); the ops panel renders three trend cards inline — warning-volume sparkline, healthy-instance sparkline, per-instance "minutes since last backup commit" with a 30-minute stale threshold. SVG polyline sparklines, no chart library.)*
- [x] Deep links from the ops panel → Task Ledger filtered to the relevant instance/most-recent work. *(Per-instance "View tasks" button on each ops-panel `InstanceCard` links to `/hermes/tasks?instance=<id>`. `HermesInstanceProvider` now hydrates from the `?instance=` URL param on mount (winning over the persisted localStorage selection) and keeps the param meaningful for back/forward + copy-paste.)*
- [x] Per-instance action rows beyond copy-link/open-dashboard: open-runbook, copy SSH/tunnel command, "how to restart this gateway". *(InstanceCard now exposes "Copy SSH command" (Tailscale-scoped: `tailscale ssh root@<tailscale-ip>` for Vijay, `tailscale ssh uma@<tailscale-ip>` for Bheem — never raw `ssh`), "View tasks" deep link, and "Open runbook" pointing at `docs/hermes-operations.md`. "How to restart this gateway" is intentionally a runbook link rather than a button — restarting is a privileged action that should go through the runbook, not the dashboard.)*
- [x] Optional dark/light theme toggle if the shell supports it. *(`components/theme-toggle.tsx` Sun/Moon button mounted in the Hermes layout next to the instance switcher. Persists in localStorage `bytelyst.theme.v1`; an inline FOUC-prevention script in the root layout reads the same key and applies `data-theme` to `<html>` before React hydrates so the first paint matches the user's last choice. The design system already had `[data-theme="light"]` overrides in `styles/tokens.css`; the toggle just flips them on.)*
- [ ] Unified alerts feed across both instances on the overview. *(Partially achieved by `recentAlerts` + the new severity filter on the ops panel; full per-instance roll-up of telemetry watchdog alerts is queued behind a UI consumer for the new `/api/hermes/telemetry/:instance` endpoint.)*
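
The leading-token severity rule described for `RecentAlerts` (CRITICAL/ERROR/FATAL → critical; INFO/OK → info; default → warn) can be sketched as a pure function. The names `classifyWarning` and `countBySeverity` are hypothetical, and the case-insensitive match and tolerance for leading punctuation such as `[` are assumptions of this sketch rather than documented behavior of the component.

```typescript
export type Severity = 'info' | 'warn' | 'critical';

const CRITICAL_TOKENS = new Set(['CRITICAL', 'ERROR', 'FATAL']);
const INFO_TOKENS = new Set(['INFO', 'OK']);

export function classifyWarning(message: string): Severity {
  // Take the first alphabetic word, skipping leading punctuation or spaces.
  const match = /^\W*([A-Za-z]+)/.exec(message);
  const token = (match?.[1] ?? '').toUpperCase();

  if (CRITICAL_TOKENS.has(token)) return 'critical';
  if (INFO_TOKENS.has(token)) return 'info';
  // Anything without a recognised leading token is treated as a warning.
  return 'warn';
}

// Live counts for a per-severity filter header.
export function countBySeverity(messages: readonly string[]): Record<Severity, number> {
  const counts: Record<Severity, number> = { info: 0, warn: 0, critical: 0 };
  for (const message of messages) counts[classifyWarning(message)] += 1;
  return counts;
}
```

Because the rule only looks at the message text, it stays UI-only, consistent with the "no backend contract change" note above.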

## Phase 7 — Security & access (G8)

- [x] Require authentication on the DevOps dashboard's hermes routes/endpoints (reuse platform-service auth already used elsewhere). *(Both `/api/hermes/ops` and the new `/api/hermes/telemetry/:instance` now gate on `requireAdmin`. Privilege-surface table in `dashboard/DEPLOYMENT.md` updated to match. The previous "read-only ops snapshot, no auth" carve-out is gone — all Hermes routes are admin-only.)*
- [ ] Decide and document `security.redact_secrets` and `privacy.redact_pii` for gateway sessions (per instance). *(Deferred — needs a founder decision on PII handling for session content; not a code-only change.)*
- [ ] Finish the GitHub/Gitea **least-privilege token audit** (root currently pushes both repos) and rotate any migrated/exposed credentials — completed naturally by Decision #5 (Bheem self-pushes with its own scoped token). *(Resolves naturally when Phase 4 ships — see the Phase 4 delegation brief.)*
- [x] Keep all hermes data private-only; never expose the `hermes-ops` snapshot or task data on a public route. *(Verified: no Caddy/public route added; the dashboard is bound to `127.0.0.1` and reached via Tailscale or SSH tunnel only — see `dashboard/DEPLOYMENT.md` "Ports — quick reference" + "Privilege Surface" sections. With this commit's `requireAdmin` change, even an attacker with internal network access still needs a valid admin JWT to read the ops snapshot.)*

## Phase 8 — Notifications & Telegram loop (G9)

> **Mostly VM ops + bot-token configuration**, with two small backend hooks. Full delegation brief in [`docs/prompts/phase8-telegram-loop.md`](./prompts/phase8-telegram-loop.md). The dashboard's documentation half is already done — see `docs/hermes-operations.md` "Telegram Notification Convention".

- [ ] Push new dashboard-detected warnings to the correct Telegram (Vijay → root chat, Bheem → Uma chat), reusing the watchdog delivery path; silent on healthy. *(Design captured in the brief: `lib/dashboard-alerts.ts` writes new warnings to a tag-prefixed log; both watchdogs tail it. Implementation gated on Phase 4 (Uma watchdog must exist first) and on bot tokens.)*
- [ ] Validate the Telegram approval-prompt flow and media/file delivery end-to-end (the two unchecked v1 items). *(Brief item 3.)*
- [x] Preserve the numbered-emoji progress convention (`1️⃣`, `2️⃣`, …) for completion updates. *(Codified in `docs/hermes-operations.md` under a new "Telegram Notification Convention" section, alongside the routing-per-instance, silent-on-healthy, and never-paste-secrets rules. The brief references this as the source of truth so VM-side implementers stay consistent.)*

---

## Data Model Additions

```ts
// web/src/lib/hermes (and mirrored in backend contracts)
export type HermesInstanceId = 'vijay' | 'bheem';

export interface HermesInstanceRef {
  id: HermesInstanceId;
  label: string; // "Vijay / root", "Bheem / Uma"
  user: string; // "root" | "uma"
  hermesHome: string;
}

// add `instanceId: HermesInstanceId` to:
// HermesTask, HermesProduct, HermesEvent, HermesRun, HermesAgentStatus, HermesOverview
```

## Acceptance Criteria

This roadmap is complete when:

- [ ] The overview, ledger, agents, and history panes render **real data for both Vijay and Bheem**, filterable by instance; only panes without a real source remain (clearly labeled) seed data.
- [ ] `hermes-ops` is cached, uses robust Uma user-systemd checks, distinguishes unknown vs down, and has unit tests.
- [ ] Bheem has a persistent backup repo + timer, a watchdog, and one completed restore rehearsal — and the dashboard shows **2/2 healthy** with zero standing Bheem warnings.
- [ ] CI is green on the correct path, lint is real, and coverage includes auth/csrf/orchestrator/health/hermes-ops.
- [ ] Hermes routes require auth and remain private-only; redact policies are decided and documented.
- [ ] Dashboard warnings reach the correct Telegram chat per instance.

## Implementation Status Checklist

Update only with evidence (source review, tests, build output, or browser/VM verification).

- [x] Phase 0 — Guardrails reconfirmed (2026-05-30 pass; remains "must hold throughout")
- [x] Phase 1 — `hermes-ops` hardened + tested
- [x] Phase 2 — Instance dimension + switcher
- [x] Phase 3 — Real telemetry ingestion + Products pane converted (Agents partially converted: live memory/skills inventory, health statuses still seed data; Task Ledger / History deferred pending the JSONL session pipeline, see Phase 3 notes)
- [ ] Phase 4 — Bheem/Uma parity (backup, watchdog, restore drill)
- [x] Phase 5 — App/CI hardening (P0/P1/P2 done; P2 follow-ups in DEPLOYMENT.md mitigation roadmap remain)
- [x] Phase 6 — UX polish (severity tags + trend cards + deep links + per-instance actions + theme toggle; unified alerts feed still open)
- [x] Phase 7 — Security & access (auth on hermes routes + privacy stance documented; redact_secrets/redact_pii decision deferred)
- [ ] Phase 8 — Notifications & Telegram (convention codified; delivery loop is VM ops, see brief)

## Decisions (resolved 2026-05-30)

1. **Task data source — derive from real artifacts now; defer the JSONL pipeline.** Hermes' real unit of work is the *session* (+ cron jobs), and there's no evidence the agent emits a task-level JSONL ledger today. Build the ledger/activity views from what already exists — `hermes sessions` (+ stats), `hermes cron list` (+ last-run), watchdog alerts, and backup git history. Add a JSONL session/event pipeline → SQLite **later and only if** the session/cron view proves insufficient (via a gateway hook that appends records). Do not fabricate a task store.
2. **Reading Bheem state — self-reporting ops exporter per instance.** Each instance runs a tiny read-only exporter (Bheem as a `uma` user-systemd timer, Vijay symmetrically) that writes a **sanitized** JSON snapshot (booleans, counts, timestamps, short HEADs — no secrets) to a known path; the unified backend just reads and aggregates the two files. No cross-user command execution or reaching into `/home/uma/.hermes`. **Interim stopgap** until the exporter ships: replace the brittle `ps`/`existsSync` Uma checks with `runuser -u uma -- systemctl --user is-active/is-enabled`.
3. **Products — repoint at the real service registry; drop the fabricated mock.** The dashboard already has a live service registry (`backend/src/modules/services/`, with health). Back the Products pane with that real data instead of a 50-item fiction; allow optional manual entries later for not-yet-deployed products. Relabel clearly until the mapping lands.
4. **Auth — reuse platform-service JWT, defense-in-depth.** Put the hermes routes behind the same platform-service auth (`backend/src/lib/auth.ts`) the rest of the dashboard uses; keep the network private (Tailscale/loopback) as a second layer. No separate basic-auth gate (that's only the never-used "if forced public" path).
5. **Bheem backup — same repo + Drive, but Uma-owned least-privilege token; Bheem self-pushes.** Keep `umadev0931/uma_hostinger_hermes_vm` and Bheem Drive, but give Bheem its own repo-scoped credential so it backs itself up rather than depending on root's broad credential. Root stops pushing Uma's backup; this also closes the standing GitHub least-privilege audit item.
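
Decision #2's exporter file can be read defensively along these lines. Everything here is a sketch under assumptions: the `ExporterSnapshot` fields are hypothetical examples of "booleans, counts, timestamps, short HEADs", and a hand-written guard stands in for whatever schema validation the backend actually uses.

```typescript
export type HermesInstanceId = 'vijay' | 'bheem';

// Hypothetical sanitized snapshot: no secrets, only coarse status fields.
export interface ExporterSnapshot {
  instanceId: HermesInstanceId;
  generatedAt: string; // ISO timestamp
  gatewayActive: boolean;
  backupTimerActive: boolean;
  activeSessions: number;
  backupHead: string; // short git HEAD
}

export type ExporterRead =
  | { status: 'ok'; snapshot: ExporterSnapshot }
  | { status: 'unknown'; reason: string };

export function parseExporterSnapshot(raw: string, expected: HermesInstanceId): ExporterRead {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { status: 'unknown', reason: 'invalid JSON' };
  }
  if (typeof data !== 'object' || data === null) {
    return { status: 'unknown', reason: 'not an object' };
  }

  const { instanceId, generatedAt, gatewayActive, backupTimerActive, activeSessions, backupHead } =
    data as Record<string, unknown>;

  if (instanceId !== expected) return { status: 'unknown', reason: 'instance mismatch' };
  if (typeof generatedAt !== 'string' || Number.isNaN(Date.parse(generatedAt))) {
    return { status: 'unknown', reason: 'bad generatedAt' };
  }
  if (typeof gatewayActive !== 'boolean' || typeof backupTimerActive !== 'boolean') {
    return { status: 'unknown', reason: 'bad status flags' };
  }
  if (typeof activeSessions !== 'number' || !Number.isInteger(activeSessions) || activeSessions < 0) {
    return { status: 'unknown', reason: 'bad session count' };
  }
  if (typeof backupHead !== 'string' || !/^[0-9a-f]{7,12}$/.test(backupHead)) {
    return { status: 'unknown', reason: 'bad backup HEAD' };
  }

  // Copy only the known fields so anything unexpected in the file is dropped.
  return {
    status: 'ok',
    snapshot: { instanceId: expected, generatedAt, gatewayActive, backupTimerActive, activeSessions, backupHead },
  };
}
```

A missing or malformed file maps to `unknown` rather than to a false "down", and copying only the listed fields keeps the aggregate sanitized even if an exporter writes extra keys.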

## Suggested Execution Order

1. Phase 5 P0 (CI path + lint) — unblocks everything.
2. Phase 1 (harden `hermes-ops`) — the foundation the real UI sits on.
3. Phase 2 (instance dimension) + Phase 4 (Bheem parity) in parallel — make "two instances" first-class in both data and ops.
4. Phase 3 (real telemetry, pane by pane).
5. Phase 7 (auth) before any wider access; Phase 8 (Telegram) and Phase 6 (polish) last.

Each item is sized to land as a single PR with incremental commits to `main`.