diff --git a/docs/hermes_dashboard_v2_roadmap.md b/docs/hermes_dashboard_v2_roadmap.md new file mode 100644 index 0000000..dfe33c8 --- /dev/null +++ b/docs/hermes_dashboard_v2_roadmap.md @@ -0,0 +1,213 @@ +# Hermes Mission Control v2 — Two-Instance Dashboard Roadmap + +**Date:** 2026-05-30 +**Owner:** ByteLyst / S (`saravanakumardb`) +**Repo:** `learning_ai_devops_tools` (GitHub remote: `bytelyst-devops-tools`) +**Dashboard:** `dashboard/` — Next.js 16 web (`web/`, port 3000 / container 3049) + Fastify 5 backend (`backend/`, port 4004) + +## What This Roadmap Is + +The two existing roadmaps are effectively complete for their original scope: + +- `docs/hermes_dashboard_roadmap.md` — built the 7-pane **Hermes Mission Control** UI. All checklist items are checked, **but every pane except the live ops panel renders mock/seed data** from `web/src/lib/hermes`. +- `docs/hermes-setup-upgrade-roadmap.md` — stood up and hardened the two Hermes instances on the VM (~68% checked; the open items are mostly Uma parity, credentials, and policy decisions). + +This v2 roadmap **supersedes the open dashboard-related items in both** and adds the missing theme: **power the unified dashboard with real data from BOTH Hermes instances — Vijay/root and Bheem/Uma — and close every known gap** (mock data, backend hardening, two-instance parity, app/CI hygiene, UX polish, security, and notifications). + +It does **not** re-do anything already verified in v1. It builds on the one piece that is already real and two-instance-aware: the backend `hermes-ops` module. + +## The Two Instances (authoritative topology) + +Source of truth is `dashboard/backend/src/modules/hermes-ops/repository.ts`. 
+ +| Codename | OS user | `HERMES_HOME` | Gateway service | Private dashboard | Backup timer | Backup repo (local → GitHub) | Drive folder | +|----------|---------|---------------|-----------------|-------------------|--------------|------------------------------|--------------| +| **Vijay** | `root` | `/root/.hermes` | `hermes-gateway.service` (system) | `hermes-root-dashboard.service` → :9119 | `hermes-root-backup.timer` | `/root/repos/bytelyst_hostinger_hermes_vm` → `saravanakumardb/bytelyst_hostinger_hermes_vm` | Vijay Drive | +| **Bheem** | `uma` | `/home/uma/.hermes` | `uma-hermes-gateway.service` (uma **user** systemd) | `uma-hermes-dashboard.service` → :9120 | `uma-hermes-backup.timer` | `/home/uma/repos/uma_hostinger_hermes_vm` → `umadev0931/uma_hostinger_hermes_vm` | Bheem Drive | + +- Both reachable **privately only** over Tailscale `100.87.53.10` (`:9119` Vijay, `:9120` Bheem). No public Caddy route. This is a hard guardrail. +- Plus a root-level `hermes-emergency-drive-upload.timer` that pushes encrypted bundles to each instance's Google Drive folder. + +### Three dashboard surfaces exist today + +1. **Native per-instance Hermes dashboards** — `:9119` (Vijay) and `:9120` (Bheem), one per user, over Tailscale. Operationally scoped, separate from this codebase. +2. **ByteLyst Mission Control** — the `/hermes/*` suite in this repo's DevOps dashboard (7 panes). **Intended to be the unified pane-of-glass over both instances.** +3. **The live `hermes-ops` panel** — embedded in the Mission Control overview (`web/src/components/hermes-ops-panel.tsx`), already rendering **real, both-instance** status: gateways, private dashboards, backup timers, repo HEAD/cleanliness, Google token, restore-payload counts, cron timers, emergency drive, Tailscale IP, active session count, and warnings. 
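The topology table above can also be written down as typed configuration, which may be handy once the instance switcher needs a single lookup. This is only a sketch: every value is copied from the table, but the names `InstanceTopology`, `INSTANCES`, and `privateDashboardAddress` are assumptions for illustration and are not taken from `repository.ts`.

```ts
// Illustrative sketch: values come from the topology table; all identifiers
// here are hypothetical and not taken from the hermes-ops module.
type InstanceId = "vijay" | "bheem";

interface InstanceTopology {
  id: InstanceId;
  osUser: string;
  hermesHome: string;
  gatewayUnit: string;
  gatewayScope: "system" | "user"; // system unit vs. per-user systemd unit
  dashboardUnit: string;
  dashboardPort: number;
  backupTimer: string;
  backupRepoPath: string;
  backupRemote: string;
}

// Both private dashboards are reachable only over this Tailscale address.
const TAILSCALE_IP = "100.87.53.10";

const INSTANCES: Record<InstanceId, InstanceTopology> = {
  vijay: {
    id: "vijay",
    osUser: "root",
    hermesHome: "/root/.hermes",
    gatewayUnit: "hermes-gateway.service",
    gatewayScope: "system",
    dashboardUnit: "hermes-root-dashboard.service",
    dashboardPort: 9119,
    backupTimer: "hermes-root-backup.timer",
    backupRepoPath: "/root/repos/bytelyst_hostinger_hermes_vm",
    backupRemote: "saravanakumardb/bytelyst_hostinger_hermes_vm",
  },
  bheem: {
    id: "bheem",
    osUser: "uma",
    hermesHome: "/home/uma/.hermes",
    gatewayUnit: "uma-hermes-gateway.service",
    gatewayScope: "user",
    dashboardUnit: "uma-hermes-dashboard.service",
    dashboardPort: 9120,
    backupTimer: "uma-hermes-backup.timer",
    backupRepoPath: "/home/uma/repos/uma_hostinger_hermes_vm",
    backupRemote: "umadev0931/uma_hostinger_hermes_vm",
  },
};

// Returns "host:port" for an instance's private dashboard (no scheme assumed).
function privateDashboardAddress(id: InstanceId): string {
  return `${TAILSCALE_IP}:${INSTANCES[id].dashboardPort}`;
}
```

Keeping the table in one typed record would let the web switcher and the backend probes share the same source, though whether that record lives in the web or backend package is left open here.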
+ +**Decision baked into this roadmap:** invest in surface #2/#3 — make unified Mission Control the real two-instance command center — rather than expanding the native per-instance dashboards. The `hermes-ops` module is the seed; everything below extends it. + +## Goal / Target State + +A single private dashboard where, for **both Vijay and Bheem**, S can see at a glance: + +- live instance health (gateway, dashboard, cron, backup freshness, disk/mem, Google auth) — **real, cached, robust** +- everything each Hermes is doing / did / failed / is blocked on — **from real session/cron/task telemetry, filterable by instance** +- backup & disaster-recovery posture **at parity across both instances** +- what needs founder attention, pushed to the right Telegram chat + +…with the whole thing private-only, authenticated, tested, and in CI. + +--- + +## Gap Inventory (consolidated) + +| ID | Gap | Source | Severity | +|----|-----|--------|----------| +| G1 | 6 of 7 Mission Control panes are mock (`web/src/lib/hermes`) | v1 roadmap / README "read-only mock" | High | +| G2 | Tasks/Products/History/Agents have **no instance dimension** (Vijay vs Bheem) | this review | High | +| G3 | `hermes-ops` backend not hardened: no cache (~20 `execFile` per 60s poll), brittle Uma checks (`ps` string-match + hardcoded `existsSync`), errors swallowed to `null`, **no tests** | REVIEW_ACTIONS P1 #3 | High | +| G4 | No real telemetry ingestion (sessions, cron, memory, skills, alerts, backup history, task events) | v1 roadmap "real telemetry plan" | High | +| G5 | App/CI hygiene: CI path wrong (P0), lint is a no-op echo (P0), thin tests (P1), SSE disabled (P1), doc drift 3000 vs 3049 (P1), privileged docker socket/host mounts undocumented (P1), no prod logging (P2), E2E not wired (P2) | REVIEW_ACTIONS | P0–P2 | +| G6 | Mission Control polish: warning severity filters, trend cards, ops→ledger deep links, per-instance action rows, theme toggle | v1 "Next Improvements" | Med | +| G7 | **Bheem/Uma 
parity:** no persistent backup repo + timer equivalent to root, no watchdog cron, restore never tested, no quarterly drill | setup roadmap (lines 20, 146, 172, 185, 432, 447, 461) | High | +| G8 | Security/access: devops dashboard hermes routes need auth; `security.redact_secrets` / `privacy.redact_pii` undecided; GitHub/Gitea least-privilege audit + rotation pending | setup roadmap Phase 11 | High | +| G9 | Notifications: dashboard warnings not pushed to Telegram; approval-prompt flow + media/file delivery UX unvalidated | setup roadmap Phase 6 | Med | + +--- + +## Phase 0 — Guardrails (must hold throughout) + +- [ ] No public Caddy route or public listener for **any** Hermes dashboard, the `hermes-ops` API, or the DevOps dashboard's hermes data. Private-only via Tailscale / SSH tunnel / loopback. +- [ ] Keep Hermes command approvals at `manual` or `smart`; no gateway approval bypass. +- [ ] No raw secrets, tokens, OAuth files, `state.db`, or SQLite WAL/SHM in any git backup or in this repo. +- [ ] Re-run the Caddy/port review (`docs/hermes-operations.md`) before adding any route or hostname. + +## Phase 1 — Make the unified backend authoritative and hardened (G3) + +The `hermes-ops` snapshot becomes the single source of truth for live status. Before building UI on it, harden it. + +- [ ] Add a short-TTL cache (mirror the health module's 30s cache) so the 60s panel poll doesn't fan out ~20 `systemctl`/`git`/`ps`/`du` subprocesses every refresh; serve cached snapshot with `generatedAt`. +- [ ] Replace brittle Bheem/Uma checks in `repository.ts`: + - `isUmaGatewayActive()` (currently `ps -eo` string match) → `runuser -u uma -- systemctl --user is-active uma-hermes-gateway.service` (or `--machine=uma@.host`). + - `isUmaGatewayEnabled()` (currently hardcoded `existsSync` of a wants-symlink) → `systemctl --user is-enabled` via the same path. 
+- [ ] Stop swallowing every failure to `null` indiscriminately: distinguish "unit inactive" from "probe failed/timed out" and surface per-field status so the UI can show *unknown* vs *down*. +- [ ] Add Zod validation + a stable typed contract for `HermesOpsSnapshot` on the route. +- [ ] **Add unit tests for the `hermes-ops` repository** (mock `execFile`/fs) — closes the REVIEW_ACTIONS "only `services` has tests" gap for this module. +- [ ] Read Bheem/Uma state via a **self-reporting ops exporter** (Decision #2): a read-only `uma` user-systemd timer writes a sanitized JSON snapshot to a known path; the root backend reads + aggregates it (Vijay gets a symmetric exporter). Interim stopgap until it ships: `runuser -u uma -- systemctl --user is-active/is-enabled` instead of the `ps`/`existsSync` checks. + +## Phase 2 — Instance dimension across Mission Control (G2) + +- [ ] Add `instanceId: 'vijay' | 'bheem'` to the core types in `web/src/lib/hermes` (`HermesTask`, `HermesProduct`, `HermesEvent`, `HermesRun`, agent/overview models) and to the backend contracts. +- [ ] Add a global **instance switcher** in `HermesShell` (`All` / `Vijay (root)` / `Bheem (uma)`) with persisted selection; thread it through every pane. +- [ ] Overview: show per-instance cards **and** a combined roll-up (extend the existing "Healthy instances 2/2" pattern from the ops panel to the whole overview). +- [ ] Ledger / Products / History / Agents: filter and badge by instance. + +## Phase 3 — Real per-instance telemetry, replacing mock pane by pane (G1, G4) + +Define the ingestion contract first, then convert panes. Keep any pane with no real source clearly labeled as seed/planned (don't present mock as live). + +- [ ] **Primary source = real artifacts** (Decision #1): sessions, cron, watchdog alerts, backup history — read-only and cached. Treat a Hermes *session* as the work unit. 
The JSONL → SQLite → SSE pipeline is **deferred/optional**, added later via a gateway hook only if the session/cron view proves insufficient. +- [ ] Backend endpoints per instance, reading real Hermes state: + - [ ] Sessions + stats (`hermes sessions stats` — baseline today: Vijay 59 sessions/5225 msgs, Bheem 18/635). + - [ ] Cron jobs (`hermes cron list`) including backup + watchdog timers. + - [ ] Memory + skills inventory. + - [ ] Watchdog alerts feed (from `hermes-health-watchdog.py` output / logs). + - [ ] Backup history (git log of each backup repo: HEAD, last-commit age, freshness). +- [ ] Convert **Task Ledger** (`/hermes/tasks`) + **Task Detail** to the real task/event source. +- [ ] Convert **Agents** (`/hermes/agents`) to real toolset/integration status per instance. +- [ ] Convert **History** (`/hermes/history`) to real session/cron/backup trends. +- [ ] **Products** (`/hermes/products`): repoint at the real service registry (`backend/src/modules/services/`) + health module (Decision #3); drop the fabricated 50-item mock. Optional manual entries for not-yet-deployed products come later. + +## Phase 4 — Bheem/Uma parity so the dashboard shows two equal instances (G7) + +This is the biggest operational asymmetry and the reason half the ops-panel warnings are Bheem-only. + +- [ ] Stand up a **Uma persistent backup repo + `uma-hermes-backup.timer`** mirroring the root design (sanitized `hermes_persistent_backup/`, secrets and `state.db` excluded), pushing to `umadev0931/uma_hostinger_hermes_vm` **with a Uma-owned, repo-scoped token (Bheem self-pushes; root no longer pushes Uma's backup — Decision #5)**. +- [ ] Install a **Uma health watchdog** (mirror `scripts/hermes-health-watchdog.py`), silent-on-success, alerting Uma's Telegram. +- [ ] Run the **first Uma restore rehearsal** into a temporary `HERMES_HOME`; document in `docs/hermes-operations.md` / `docs/hermes-disaster-recovery.md`. 
+- [ ] Schedule a **quarterly Uma restore-drill reminder** (parity with root). +- [ ] Confirm these close the corresponding Bheem warnings emitted by `getHermesOpsSnapshot()` (backup timer active, repo HEAD readable + clean, Google token present). + +## Phase 5 — Dashboard app hardening (G5) + +- [ ] **P0:** Fix the CI workspace path (`${{ gitea.workspace }}`) in `.gitea/workflows/ci.yml`, `DEPLOYMENT.md`, `scripts/deploy-hotcopy.sh` (currently point at non-existent `/opt/bytelyst/bytelyst-devops-tools/...`). +- [ ] **P0:** Replace the no-op `lint` echo with real linting (`next lint` for web, minimal ESLint for backend); make `pnpm lint` fail on bad code. +- [ ] **P1:** Add tests for `auth`, `csrf`, `deployments/orchestrator`, `health`, **and `hermes-ops`**; add `pnpm test:coverage` gate. +- [ ] **P1:** Resolve the SSE TODO — either ship a Fastify-5-compatible log-stream or remove the SSE claim from docs/UI. +- [ ] **P1:** Fix doc drift (web port 3000 vs 3049; endpoint URLs; merge duplicate deployment docs). +- [ ] **P1:** Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths — blast radius must be written down; consider an allow-list wrapper over the raw socket). +- [ ] **P2:** Structured backend logging (pino → stdout); wire E2E (`hermes.spec.ts`) into CI with a started stack. + +## Phase 6 — Mission Control UX polish (G6) + +- [ ] Severity-tag warnings (info/warn/critical) and add a severity filter to the ops panel. +- [ ] Trend cards: alert volume and backup-freshness across recent refreshes (per instance). +- [ ] Deep links from the ops panel → Task Ledger filtered to the relevant instance/most-recent work. +- [ ] Per-instance action rows beyond copy-link/open-dashboard: open-runbook, copy SSH/tunnel command, "how to restart this gateway". +- [ ] Optional dark/light theme toggle if the shell supports it. +- [ ] Unified alerts feed across both instances on the overview. 
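The severity tagging and filtering items in Phase 6 could look roughly like the sketch below. The `OpsWarning` shape, the `SEVERITY_RANK` table, and `filterWarnings` are assumptions made for illustration; the roadmap only says that warnings get an info/warn/critical tag and that the ops panel gets a severity filter.

```ts
// Hypothetical shapes: the real warning contract in hermes-ops is not shown
// in this roadmap, so everything below is an illustrative assumption.
type Severity = "info" | "warn" | "critical";
type InstanceId = "vijay" | "bheem";

interface OpsWarning {
  instanceId: InstanceId;
  severity: Severity;
  message: string;
}

// Numeric rank so severities can be compared and sorted.
const SEVERITY_RANK: Record<Severity, number> = { info: 0, warn: 1, critical: 2 };

// Keeps warnings at or above `minSeverity`, optionally limited to one
// instance, and returns them most severe first. The input array is not mutated
// because `filter` produces a new array before `sort` runs.
function filterWarnings(
  warnings: OpsWarning[],
  minSeverity: Severity,
  instance: InstanceId | "all" = "all",
): OpsWarning[] {
  return warnings
    .filter((w) => SEVERITY_RANK[w.severity] >= SEVERITY_RANK[minSeverity])
    .filter((w) => instance === "all" || w.instanceId === instance)
    .sort((a, b) => SEVERITY_RANK[b.severity] - SEVERITY_RANK[a.severity]);
}
```

The same `instance` argument could be fed from the Phase 2 instance switcher, so one filter function would serve both the ops panel and the unified alerts feed.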
+ +## Phase 7 — Security & access (G8) + +- [ ] Require authentication on the DevOps dashboard's hermes routes/endpoints (reuse platform-service auth already used elsewhere). +- [ ] Decide and document `security.redact_secrets` and `privacy.redact_pii` for gateway sessions (per instance). +- [ ] Finish the GitHub/Gitea **least-privilege token audit** (root currently pushes both repos) and rotate any migrated/exposed credentials — completed naturally by Decision #5 (Bheem self-pushes with its own scoped token). +- [ ] Keep all hermes data private-only; never expose the `hermes-ops` snapshot or task data on a public route. + +## Phase 8 — Notifications & Telegram loop (G9) + +- [ ] Push new dashboard-detected warnings to the correct Telegram (Vijay → root chat, Bheem → Uma chat), reusing the watchdog delivery path; silent on healthy. +- [ ] Validate the Telegram approval-prompt flow and media/file delivery end-to-end (the two unchecked v1 items). +- [ ] Preserve the numbered-emoji progress convention (`1️⃣`, `2️⃣`, …) for completion updates. + +--- + +## Data Model Additions + +```ts +// web/src/lib/hermes (and mirrored in backend contracts) +export type HermesInstanceId = 'vijay' | 'bheem'; + +export interface HermesInstanceRef { + id: HermesInstanceId; + label: string; // "Vijay / root", "Bheem / Uma" + user: string; // "root" | "uma" + hermesHome: string; +} + +// add `instanceId: HermesInstanceId` to: +// HermesTask, HermesProduct, HermesEvent, HermesRun, HermesAgentStatus, HermesOverview +``` + +## Acceptance Criteria + +This roadmap is complete when: + +- [ ] The overview, ledger, agents, and history panes render **real data for both Vijay and Bheem**, filterable by instance; only panes without a real source remain (clearly labeled) seed data. +- [ ] `hermes-ops` is cached, uses robust Uma user-systemd checks, distinguishes unknown vs down, and has unit tests. 
+- [ ] Bheem has a persistent backup repo + timer, a watchdog, and one completed restore rehearsal — and the dashboard shows **2/2 healthy** with zero standing Bheem warnings. +- [ ] CI is green on the correct path, lint is real, and coverage includes auth/csrf/orchestrator/health/hermes-ops. +- [ ] Hermes routes require auth and remain private-only; redact policies are decided and documented. +- [ ] Dashboard warnings reach the correct Telegram chat per instance. + +## Implementation Status Checklist + +Update only with evidence (source review, tests, build output, or browser/VM verification). + +- [ ] Phase 0 — Guardrails reconfirmed +- [ ] Phase 1 — `hermes-ops` hardened + tested +- [ ] Phase 2 — Instance dimension + switcher +- [ ] Phase 3 — Real telemetry ingestion + panes converted +- [ ] Phase 4 — Bheem/Uma parity (backup, watchdog, restore drill) +- [ ] Phase 5 — App/CI hardening (P0 → P2) +- [ ] Phase 6 — UX polish +- [ ] Phase 7 — Security & access +- [ ] Phase 8 — Notifications & Telegram + +## Decisions (resolved 2026-05-30) + +1. **Task data source — derive from real artifacts now; defer the JSONL pipeline.** Hermes' real unit of work is the *session* (+ cron jobs), and there's no evidence the agent emits a task-level JSONL ledger today. Build the ledger/activity views from what already exists — `hermes sessions` (+ stats), `hermes cron list` (+ last-run), watchdog alerts, and backup git history. Add a JSONL session/event pipeline → SQLite **later and only if** the session/cron view proves insufficient (via a gateway hook that appends records). Do not fabricate a task store. +2. **Reading Bheem state — self-reporting ops exporter per instance.** Each instance runs a tiny read-only exporter (Bheem as a `uma` user-systemd timer, Vijay symmetrically) that writes a **sanitized** JSON snapshot (booleans, counts, timestamps, short HEADs — no secrets) to a known path; the unified backend just reads and aggregates the two files. 
No cross-user command execution or reaching into `/home/uma/.hermes`. **Interim stopgap** until the exporter ships: replace the brittle `ps`/`existsSync` Uma checks with `runuser -u uma -- systemctl --user is-active/is-enabled`. +3. **Products — repoint at the real service registry; drop the fabricated mock.** The dashboard already has a live service registry (`backend/src/modules/services/`, with health). Back the Products pane with that real data instead of a 50-item fiction; allow optional manual entries later for not-yet-deployed products. Relabel clearly until the mapping lands. +4. **Auth — reuse platform-service JWT, defense-in-depth.** Put the hermes routes behind the same platform-service auth (`backend/src/lib/auth.ts`) the rest of the dashboard uses; keep the network private (Tailscale/loopback) as a second layer. No separate basic-auth gate (that's only the never-used "if forced public" path). +5. **Bheem backup — same repo + Drive, but Uma-owned least-privilege token; Bheem self-pushes.** Keep `umadev0931/uma_hostinger_hermes_vm` and Bheem Drive, but give Bheem its own repo-scoped credential so it backs itself up rather than depending on root's broad credential. Root stops pushing Uma's backup; this also closes the standing GitHub least-privilege audit item. + +## Suggested Execution Order + +1. Phase 5 P0 (CI path + lint) — unblocks everything. +2. Phase 1 (harden `hermes-ops`) — the foundation the real UI sits on. +3. Phase 2 (instance dimension) + Phase 4 (Bheem parity) in parallel — make "two instances" first-class in both data and ops. +4. Phase 3 (real telemetry, pane by pane). +5. Phase 7 (auth) before any wider access; Phase 8 (Telegram) and Phase 6 (polish) last. + +Each item is sized to land as a single PR with incremental commits to `main`. 
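Decision #2, combined with the Phase 1 requirement to separate *unknown* from *down*, could be aggregated along the lines of the sketch below. The field names in `ExporterSnapshot`, the five-minute staleness window, and the function names are assumptions; the roadmap specifies only that the snapshot holds sanitized booleans, counts, timestamps, and short HEADs. The hand-written validation stands in for the Zod schema that Phase 1 calls for, and reading the file from its known path is left to the caller.

```ts
// Hypothetical contract: field names are illustrative, not the real exporter schema.
type InstanceId = "vijay" | "bheem";
type ProbeStatus = "up" | "down" | "unknown";

interface ExporterSnapshot {
  generatedAt: string; // ISO-8601 timestamp written by the exporter
  gatewayActive: boolean;
  backupTimerActive: boolean;
  repoHead: string; // short git HEAD
  sessionCount: number;
}

interface InstanceView {
  id: InstanceId;
  status: ProbeStatus;
  snapshot: ExporterSnapshot | null;
  reason?: string;
}

// Parses and validates exporter output; returns null for anything malformed.
function parseSnapshot(raw: string): ExporterSnapshot | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { generatedAt, gatewayActive, backupTimerActive, repoHead, sessionCount } =
    data as Record<string, unknown>;
  if (typeof generatedAt !== "string" || Number.isNaN(Date.parse(generatedAt))) return null;
  if (typeof gatewayActive !== "boolean" || typeof backupTimerActive !== "boolean") return null;
  if (typeof repoHead !== "string" || typeof sessionCount !== "number") return null;
  return { generatedAt, gatewayActive, backupTimerActive, repoHead, sessionCount };
}

// `raw` is the snapshot file's text, or null if the file was missing or unreadable.
// A missing, invalid, or stale snapshot is "unknown", never "down".
function classify(
  id: InstanceId,
  raw: string | null,
  nowMs: number,
  maxAgeMs: number = 5 * 60_000, // assumed staleness window
): InstanceView {
  if (raw === null) return { id, status: "unknown", snapshot: null, reason: "snapshot missing" };
  const snapshot = parseSnapshot(raw);
  if (snapshot === null) return { id, status: "unknown", snapshot: null, reason: "snapshot invalid" };
  if (nowMs - Date.parse(snapshot.generatedAt) > maxAgeMs) {
    return { id, status: "unknown", snapshot, reason: "snapshot stale" };
  }
  return { id, status: snapshot.gatewayActive ? "up" : "down", snapshot };
}

// Roll-up in the style of the ops panel's "Healthy instances 2/2" card.
function healthyCount(views: InstanceView[]): string {
  return `${views.filter((v) => v.status === "up").length}/${views.length}`;
}
```

Treating a stale file as *unknown* rather than *down* means a stopped exporter timer shows up as a probe problem instead of a false gateway outage, which appears to be the distinction Phase 1 is after.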
diff --git a/docs/prompts/ci-e2e-hardening.md b/docs/prompts/ci-e2e-hardening.md new file mode 100644 index 0000000..766c13a --- /dev/null +++ b/docs/prompts/ci-e2e-hardening.md @@ -0,0 +1,88 @@ +# Delegation Brief — Dashboard CI + E2E Hardening + +> Self-contained task brief for a delegated agent (Hermes `delegate_task`, a fresh +> Claude Code / Devin session, etc.). Generated 2026-05-30. Execute the whole thing +> end-to-end and report per the Final Report section. +> +> Related: `docs/hermes_dashboard_v2_roadmap.md` (Phase 5), `dashboard/REVIEW_ACTIONS.md` (#2, #11). + +--- + +ROLE: Senior fullstack + DevOps engineer. + +OBJECTIVE: Make the ByteLyst DevOps Dashboard's end-to-end (Playwright) test +suite complete and GREEN inside the self-hosted Gitea CI. Fixing the broken CI +workflow is a prerequisite and is in scope. + +REPO & STACK (start by reading, don't assume): +- Repo: /opt/bytelyst/learning_ai_devops_tools (GitHub remote: bytelyst-devops-tools) +- App: dashboard/ pnpm workspace — backend/ (Fastify 5, @bytelyst/devops-backend, :4004) + and web/ (Next.js 16, @bytelyst/devops-web, :3000 / container 3049). +- Tooling: Node 22+, pnpm 10.6.5. Vitest = unit, Playwright = E2E. +- Read first: docs/hermes_dashboard_v2_roadmap.md (Phase 5), dashboard/REVIEW_ACTIONS.md + (#2 lint, #11 E2E), and dashboard/.gitea/workflows/ci.yml. + +CONTEXT YOU NEED: +- CI runs on a LOCAL self-hosted Gitea act_runner (/opt/bytelyst/.act_runner). Keep it + local — do NOT move CI to GitHub-hosted runners. +- ci.yml is currently broken: every step `cd`s into the non-existent path + /opt/bytelyst/bytelyst-devops-tools/dashboard and does `git reset --hard origin/main` + on the live host checkout. The real path is .../learning_ai_devops_tools/dashboard. +- Existing E2E specs: web/e2e/dashboard.spec.ts and web/e2e/hermes.spec.ts. + `pnpm test:e2e` is defined but not reliably wired (no started stack). 
+- Hermes Mission Control routes to cover: /hermes, /hermes/tasks, /hermes/tasks/[id] + (use a real seed id from web/src/lib/hermes), /hermes/products, /hermes/history, + /hermes/agents, /hermes/settings. Plus / (dashboard) and /login. +- IMPORTANT determinism note: the 6 Mission Control panes render from client-side seed + data (web/src/lib/hermes) and need NO backend. BUT the /hermes overview also mounts + the live ops panel (web/src/components/hermes-ops-panel.tsx), which calls api.getHermesOps() → backend (which shells out to + systemctl/git/ps on the VM and WON'T work in CI). Make E2E deterministic by + intercepting that API call with Playwright page.route() and returning a fixture + snapshot. Do the same for any other live backend call. +- Auth: app may redirect to /login (platform-service). Use a Playwright global-setup + that authenticates once and reuses storageState, OR a documented test auth bypass. + Do not hardcode real credentials — read from env / .env.example placeholders. + +TASKS: +1. Fix dashboard/.gitea/workflows/ci.yml: + - Replace host-path `cd` + `git reset --hard` with checkout into ${{ gitea.workspace }} + (actions/checkout) + `working-directory: dashboard` on steps. CI must NOT mutate the + live host checkout. + - Use `pnpm install:gitea` (local Gitea registry mode) so CI is self-contained instead + of depending on a sibling learning_ai_common_plat checkout. + - Keep the existing build/typecheck/test/secret-scan/docker steps working. + - Fix the same stale path in DEPLOYMENT.md and scripts/deploy-hotcopy.sh. +2. Make `lint` real (it's currently `echo`): `next lint` for web, minimal ESLint + + @typescript-eslint for backend. `pnpm lint` must fail on bad code. +3. Complete the Playwright E2E suite so it loads every route above, asserts key content + renders (headings, the ops-panel "Healthy instances X/2" card, the task table, filters), + and asserts ZERO console errors per page. Add at least one mobile-width run. +4. 
Wire E2E into CI with a started stack: prefer Playwright's `webServer` config to build + + start web (and backend if needed) before tests, so the Gitea "E2E tests" step is + self-contained. Document any required env in .env.example, not real secrets. + +GUARDRAILS: +- Private-only project: do not add any public route/listener; do not expose anything. +- Do not commit secrets, tokens, OAuth files, state.db, or SQLite WAL/SHM. +- Keep changes minimal and reuse existing conventions; do not rewrite the app. + +VERIFY (all must pass from dashboard/): +- pnpm install:gitea (or install:common-plat for local dev) +- pnpm secret-scan +- pnpm --filter @bytelyst/devops-backend typecheck && build && test:run +- pnpm --filter @bytelyst/devops-web typecheck && build && test:run +- pnpm lint (now real, must pass) +- pnpm --filter @bytelyst/devops-web test:e2e (must pass, no console errors) + +GIT FLOW: +- Work on branch `ci/e2e-hardening`. Commit incrementally with clear messages. +- Push the branch and trigger one Gitea CI run; confirm it goes GREEN end-to-end + (build → test → lint → E2E → docker). Only then open a PR to main. + +DEFINITION OF DONE: +- Gitea CI is green on the corrected workflow, lint is real, and the E2E step runs the + full suite (all 7 Hermes routes + dashboard + login) deterministically with no console + errors, including a mobile-width pass. + +FINAL REPORT: summarize files changed, how E2E is started in CI, what was mocked and why, +the green CI run link/id, and any gaps left.
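The determinism note in CONTEXT (intercept the ops API call and return a fixture) might be sketched as follows. The `PageLike` and `RouteLike` interfaces are structural stand-ins for the small slice of Playwright's `page.route()` and `route.fulfill()` used here, so the helper type-checks without `@playwright/test`; in a real spec you would pass Playwright's own `page`. The URL glob `**/hermes-ops*` and the fixture fields are assumptions, since neither the backend route path nor the `HermesOpsSnapshot` contract appears in this brief.

```ts
// Structural stand-ins for the part of Playwright's Route/Page API used below.
interface RouteLike {
  fulfill(response: { status: number; contentType: string; body: string }): Promise<void>;
}

interface PageLike {
  route(url: string, handler: (route: RouteLike) => Promise<void>): Promise<void>;
}

// Hypothetical URL glob: adjust to whatever api.getHermesOps() actually requests.
const OPS_URL_GLOB = "**/hermes-ops*";

// Hypothetical fixture: the real snapshot contract is not shown in this brief.
const OPS_FIXTURE = {
  generatedAt: "2026-05-30T00:00:00.000Z",
  healthyInstances: 2,
  totalInstances: 2,
  warnings: [] as string[],
};

// Registers an interception so the overview never reaches the live backend,
// which shells out to systemctl/git/ps and cannot run inside CI.
async function mockHermesOps(page: PageLike, fixture: object = OPS_FIXTURE): Promise<void> {
  await page.route(OPS_URL_GLOB, async (route) => {
    await route.fulfill({
      status: 200,
      contentType: "application/json",
      body: JSON.stringify(fixture),
    });
  });
}
```

Calling `mockHermesOps(page)` before `page.goto('/hermes')` in each spec (or in a shared `beforeEach`) should keep the "Healthy instances X/2" assertion stable, provided the fixture is updated to match the real response shape.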