Hermes VM eaaa545e6c feat(dashboard): close Phase 6 (trend cards + theme toggle), drop-root scaffold, Agents inventory, Phase 0 reconfirm

Closes the remaining tractable items from the carry-forward queue.

1. Drop-root scaffold for the backend container (P2 mitigation)
   `backend/Dockerfile` adds non-root `app` user (uid 1001) + `docker`
   group (gid via `DOCKER_GID` build arg, default 999). `BACKEND_USER`
   build arg defaults to `root` so existing deployments keep working;
   set it to `app` plus `DOCKER_GID=$(getent group docker | cut -d: -f3)`
   to flip the runtime non-root. `dashboard/DEPLOYMENT.md` gets a new
   "Running non-root" section with the exact `chgrp`/`chmod` recipe
   for the bind-mounted log files (the host-side prep that pairs with
   the build flip). DEPLOYMENT.md mitigation roadmap updated.

2. Phase 6 trend cards
   `lib/hermes-ops-history.ts` keeps the last 24 ops snapshots in
   localStorage (de-duped on `generatedAt`, schema-guarded on read,
   degrades silently on quota exceeded). Three trend cards in the
   ops panel:
     - Warning-volume sparkline + current count
     - Healthy-instance count sparkline (X/2)
     - Per-instance "minutes since last backup commit" with a 30m
       stale threshold
   SVG polyline sparklines, no chart library — `<svg viewBox="0 0
   100 100" preserveAspectRatio="none">` with `vector-effect:
   non-scaling-stroke` so the line stays 2px regardless of the
   parent's width.

3. Phase 6 theme toggle
   `components/theme-toggle.tsx` Sun/Moon button mounted in the
   Hermes layout next to the instance switcher. Persists in
   localStorage `bytelyst.theme.v1`. The design system already
   defined `[data-theme="light"]` overrides in `styles/tokens.css`;
   the toggle just sets the attribute. FOUC-prevention inline script
   in the root layout reads the same key BEFORE React hydrates so
   the first paint matches the user's last choice.

4. Phase 3 partial close: Agents pane → telemetry inventory
   `/hermes/agents` now renders a "Memory & Skills inventory (live)"
   SectionCard backed by the Phase 3 telemetry endpoint per instance
   — `hermes memory list` and `hermes skills list` rendered with
   per-section probe-status badges (`up`/`unknown`), item counts,
   and the first N entries each. Agent **health** statuses (latency,
   failure rate, last-success/failure) stay seed-data — observability
   for those needs a separate ingestion contract that the telemetry
   endpoint doesn't provide today.

5. Phase 0 reconfirmation
   Roadmap Phase 0 ticked with explicit verification notes for each
   guardrail (no public listener, manual approvals, secret hygiene,
   Caddy review). Remains "must hold throughout" — the ticks reflect
   today's verified state, not single-checkbox completion.

Verified: backend typecheck ✅, 74/74 backend unit tests ✅, web
typecheck ✅, 7/7 E2E ✅, lint 0 errors, build green, coverage gate
≥95% lines on every gated file.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

2026-05-30 08:26:26 +00:00


DevOps & Admin Dashboard Deployment Guide

Canonical deployment doc for dashboard/. The previous DEPLOYMENT_GUIDE.md has been folded into this file; it remains as a one-line redirect for backwards compatibility with deploy.sh and external links.

Overview

This guide covers deploying both the DevOps Dashboard and Platform Admin Dashboard using the existing Traefik gateway infrastructure, following the same pattern as the trading dashboard (https://invttrdg.bytelyst.com).

Public URLs

DevOps Dashboard: https://devops.bytelyst.com
Admin Dashboard: https://admin.bytelyst.com
API Gateway: https://api.bytelyst.com
- Platform API: https://api.bytelyst.com/platform/api
- DevOps API: https://api.bytelyst.com/api/devops

Ports — quick reference

The web container always listens on 3000 internally; what changes is what the host exposes. Use the row for the deployment mode you're in:

Mode	Web (host)	Backend (host)	Notes
Local dev (`pnpm dev`)	`localhost:3000`	`localhost:4004`	Next listens directly on 3000.
Docker Compose (this repo)	`localhost:3049`	`localhost:4004`	`docker-compose.yml` maps `127.0.0.1:3049:3000` (loopback only).
Production (Traefik)	`https://devops.bytelyst.com`	`https://api.bytelyst.com/api/devops`	Traefik label `loadbalancer.server.port=3000` targets the container port.

Whenever a doc says "the dashboard runs on port 3000", it means the container port seen by Traefik / Next dev mode — not the host port for the deployed stack. Use the table above instead of relying on prose.

Architecture

Internet → Traefik Gateway → Services
                              ├─ DevOps Web      (container :3000, host :3049)
                              ├─ DevOps Backend  (:4004)
                              ├─ Admin Web       (:3001)
                              ├─ Platform Service (:4003)
                              └─ Trading Dashboard (:3085)

Traefik: API gateway and reverse proxy.
Docker network: All services connect via learning_ai_common_plat_default.
Domain routing: Traefik routes by host header.
SSL/TLS: Managed by Traefik with Let's Encrypt.

Prerequisites

Platform stack running with Traefik gateway.
Docker and Docker Compose installed.
Domain names configured with DNS pointing to your server.
Azure Cosmos DB account (shared with platform-service).
Platform Service running and accessible.

Quick Start

1. Start the platform stack (if not running)

cd /opt/bytelyst/learning_ai_common_plat
docker-compose up -d

2. Deploy the dashboards

cd /opt/bytelyst/learning_ai_devops_tools/dashboard
./deploy.sh

This will:

Deploy the DevOps Dashboard (backend + web)
Deploy the Admin Dashboard via the platform stack
Run health checks
Print deployment information

Local development

If you only need a non-containerized iteration loop (no Traefik, no Docker):

cd /opt/bytelyst/learning_ai_devops_tools/dashboard

# Resolve workspace deps
pnpm install:common-plat   # uses sibling learning_ai_common_plat checkout
# or
pnpm install:gitea         # uses local Gitea registry at localhost:3300

pnpm dev                   # backend on 4004, web on 3000 (NOT 3049)

Required env vars are documented under Environment Configuration below; for local dev a minimal .env with JWT_SECRET, COSMOS_*, and PLATFORM_SERVICE_URL is enough.

Manual Docker deployment

Deploy DevOps Dashboard

cd /opt/bytelyst/learning_ai_devops_tools/dashboard
docker-compose up -d --build

Deploy Admin Dashboard

cd /opt/bytelyst/learning_ai_common_plat
docker-compose up -d admin-web

Environment Configuration

DevOps Dashboard (`.env`)

# Backend
PORT=4004
PLATFORM_SERVICE_URL=http://platform-service:4003
COSMOS_ENDPOINT=https://your-cosmos-account.documents.azure.com:443/
COSMOS_KEY=your-cosmos-primary-key
COSMOS_DATABASE=bytelyst-platform
JWT_SECRET=your-production-jwt-secret
CSRF_SECRET=your-production-csrf-secret
ENCRYPTION_KEY=your-production-encryption-key
PRODUCT_ID=bytelyst-devops
PRODUCT_NAME=ByteLyst DevOps Dashboard

# Azure Key Vault (optional)
AZURE_TENANT_ID=your-tenant-id
AZURE_CLIENT_ID=your-client-id
AZURE_CLIENT_SECRET=your-client-secret
AZURE_KEY_VAULT_URL=https://your-keyvault.vault.azure.net/

# Frontend
NEXT_PUBLIC_DEVOPS_API_URL=https://api.bytelyst.com/api/devops
NEXT_PUBLIC_PLATFORM_URL=https://api.bytelyst.com/platform/api
NEXT_PUBLIC_ADMIN_WEB_URL=https://admin.bytelyst.com
NEXT_PUBLIC_PRODUCT_ID=bytelyst-devops
NEXT_PUBLIC_PRODUCT_NAME=ByteLyst DevOps Dashboard

Platform Dashboard (`.env`)

Add to your platform .env:

# Admin Web Dashboard
NEXT_PUBLIC_PLATFORM_URL=https://api.bytelyst.com/platform/api
NEXT_PUBLIC_DEVOPS_WEB_URL=https://devops.bytelyst.com

Traefik Configuration

Both dashboards use Traefik labels for routing.

DevOps Web

labels:
  - 'traefik.enable=true'
  - 'traefik.http.routers.devops-web.rule=Host(`devops.bytelyst.com`)'
  - 'traefik.http.services.devops-web.loadbalancer.server.port=3000'   # container port

DevOps Backend API

labels:
  - 'traefik.enable=true'
  - 'traefik.http.routers.devops-api.rule=PathPrefix(`/api/devops`)'
  - 'traefik.http.services.devops-api.loadbalancer.server.port=4004'

Admin Web

labels:
  - 'traefik.enable=true'
  - 'traefik.http.routers.admin-web.rule=Host(`admin.bytelyst.com`)'
  - 'traefik.http.services.admin-web.loadbalancer.server.port=3001'

DNS Configuration

Add DNS records pointing to your Traefik gateway server:

devops.bytelyst.com      A  <your-server-ip>
admin.bytelyst.com       A  <your-server-ip>
api.bytelyst.com         A  <your-server-ip>

SSL/TLS Configuration

Traefik can automatically handle SSL certificates with Let's Encrypt:

command:
  - '--certificatesresolvers.myresolver.acme.tlschallenge=true'
  - '--certificatesresolvers.myresolver.acme.email=admin@bytelyst.com'
  - '--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json'

Then update router labels:

labels:
  - 'traefik.http.routers.devops-web.tls=true'
  - 'traefik.http.routers.devops-web.tls.certresolver=myresolver'

DevOps Dashboard → Admin Dashboard

Header includes a "Platform Admin" link with Shield icon.
Opens admin dashboard in a new tab.
Uses NEXT_PUBLIC_ADMIN_WEB_URL.

Admin Dashboard → DevOps Dashboard

Sidebar includes a "DevOps Dashboard" link with Server icon.
Opens devops dashboard in a new tab.
Uses NEXT_PUBLIC_DEVOPS_WEB_URL.

Shared Authentication

Platform Service Auth: Both authenticate against platform-service.
JWT Tokens: Same JWT_SECRET validates tokens across services.
Per-Product Access: Admin access is checked per-product via membership roles.
Single Sign-On: Users stay logged in across both dashboards.
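
To illustrate why a shared JWT_SECRET lets either dashboard accept a token minted elsewhere, here is a minimal sketch. It assumes HS256-signed tokens, which this guide does not state, and `signHs256` / `verifyHs256` are hypothetical helper names rather than the project's API. It checks the signature only; expiry and audience checks are left out.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// base64url without padding, the encoding JWT segments use.
const b64url = (buf: Buffer): string => buf.toString("base64url");

// Hypothetical helper: mint an HS256 token (stands in for the issuing service).
export function signHs256(payload: object, secret: string): string {
  const header = b64url(Buffer.from(JSON.stringify({ alg: "HS256", typ: "JWT" }), "utf8"));
  const body = b64url(Buffer.from(JSON.stringify(payload), "utf8"));
  const sig = createHmac("sha256", secret).update(`${header}.${body}`).digest();
  return `${header}.${body}.${b64url(sig)}`;
}

// Hypothetical helper: recompute the HMAC with the local secret and compare.
export function verifyHs256(token: string, secret: string): boolean {
  const parts = token.split(".");
  if (parts.length !== 3) return false;
  const [header, body, sig] = parts as [string, string, string];
  const expected = createHmac("sha256", secret).update(`${header}.${body}`).digest();
  const actual = Buffer.from(sig, "base64url");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return actual.length === expected.length && timingSafeEqual(actual, expected);
}
```

A token signed with one secret verifies only where the same secret is configured, which is why a mismatched JWT_SECRET shows up as authentication failures on one dashboard but not the other.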

Granting Access

To grant a user access to both dashboards:

Ensure user exists in platform-service.
Add admin membership for both products:

{
  "memberships": [
    { "productId": "bytelyst-devops",   "role": "admin", "plan": "pro" },
    { "productId": "bytelyst-platform", "role": "admin", "plan": "pro" }
  ]
}
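
As a rough illustration of the per-product check, the sketch below tests the membership payload above for the `admin` role on each product. `Membership` and `isProductAdmin` are hypothetical names; the real check lives in platform-service and may consider more than the role.

```typescript
// Shape of one entry in the "memberships" array shown above.
type Membership = { productId: string; role: string; plan: string };

// Hypothetical helper: true when the user holds the "admin" role for the product.
export function isProductAdmin(memberships: Membership[], productId: string): boolean {
  return memberships.some((m) => m.productId === productId && m.role === "admin");
}

const memberships: Membership[] = [
  { productId: "bytelyst-devops", role: "admin", plan: "pro" },
  { productId: "bytelyst-platform", role: "admin", plan: "pro" },
];

// Access to both dashboards requires an admin membership for each product.
export const canOpenBoth =
  isProductAdmin(memberships, "bytelyst-devops") &&
  isProductAdmin(memberships, "bytelyst-platform");
```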

Health Checks

DevOps Backend: http://localhost:4004/health
DevOps Web: http://localhost:3049 (Docker Compose host port; container :3000)
Admin Web: http://localhost:3001
Traefik Dashboard: http://localhost:8080

Troubleshooting

Network issues

# Check if the platform network exists
docker network inspect learning_ai_common_plat_default

# Check container connectivity
docker network inspect learning_ai_common_plat_default | grep devops

Traefik routing

# Traefik dashboard (open in a browser)
#   http://localhost:8080

# Traefik logs
docker logs $(docker ps -q -f name=gateway)

# Router config for the devops web container
docker inspect devops-web | grep -A 10 Labels

Authentication failures

Verify JWT_SECRET matches across all services.
Check platform-service is accessible: curl http://localhost:4003/health.
Ensure the user has the right product memberships.

Service not starting

docker logs devops-backend
docker logs devops-web
docker logs admin-web
docker ps
docker inspect devops-backend | grep -A 5 Health

Workspace dependency errors

pnpm install:common-plat   # local sibling checkout
pnpm install:gitea         # local Gitea registry

Service Management

Stop services

cd /opt/bytelyst/learning_ai_devops_tools/dashboard
docker-compose down

cd /opt/bytelyst/learning_ai_common_plat
docker-compose stop admin-web

Restart services

cd /opt/bytelyst/learning_ai_devops_tools/dashboard
docker-compose restart

cd /opt/bytelyst/learning_ai_common_plat
docker-compose restart admin-web

View logs

# DevOps
docker logs -f devops-backend
docker logs -f devops-web

# Admin
docker logs -f admin-web

# Traefik
docker logs -f gateway

Comparison with Trading Dashboard

Feature	Trading	DevOps	Admin
Domain	invttrdg.bytelyst.com	devops.bytelyst.com	admin.bytelyst.com
Web Port	3085 (host)	3049 (host) / 3000 (container)	3001 (host)
Backend Port	4018	4004	N/A
Network	platform_net	platform_net	default
Traefik	Yes	Yes	Yes
Auth	Platform	Platform	Platform

Privilege Surface (Docker socket + host mounts)

The devops-backend container has root-equivalent access to the host. This section documents exactly what is mounted, which routes use each mount, and what the blast radius looks like if an admin token leaks. It exists so reviewers don't have to reverse-engineer this from docker-compose.yml and the route handlers — and so any future change to the mount set is reviewed against this list rather than slipped in.

Mounts (from `docker-compose.yml`)

Host path	Container path	Mode	Purpose
`/var/run/docker.sock`	`/var/run/docker.sock`	rw	Allows `docker` CLI inside the container to control the host daemon. Used by the `system` and `vm` modules. Equivalent to root on the host.
`/opt/bytelyst/learning_ai_devops_tools/scripts`	`/vm-scripts`	ro	Bash scripts the `vm` module shells out to (`HostingerVM/*.sh`). Read-only mount; the container cannot modify the script set.
`/var/log/vm-cleanup.log`	`/host-logs/vm-cleanup.log`	rw	The `vm` cleanup script appends here; backend reads it via `/api/vm/cleanup-log`.
`/var/log/vm-health-check.log`	`/host-logs/vm-health-check.log`	rw	Health-check probe output; backend reads it via `/api/vm/health`.
`/var/log/docker-watchdog.log`	`/host-logs/docker-watchdog.log`	rw	Watchdog tail used by the VM panel.
`extra_hosts: host-gateway`	`host.docker.internal`-equivalent	—	Lets the container reach `host:11434` (Ollama) and other host-only services. Not a filesystem mount, but a privilege-relevant capability — the container can talk to anything bound to `127.0.0.1` on the host.

The container's listening port (4004) is bound to 127.0.0.1 only, so the API is not exposed to the public internet by this compose file — access is expected via Tailscale or an SSH tunnel. Any reverse proxy in front of it (Traefik in production) is responsible for its own auth + TLS.

What shells out + which routes (auth column = effective gate)

Route	Handler module	What it executes	Auth
`GET /system/metrics`	`system/repository.ts`	`df -h ...`	`requireAdmin`
`GET /docker/stats`	`system/repository.ts`	`docker images / ps / volume ls / system df` (read-only)	`requireAdmin`
`POST /docker/cleanup`	`system/repository.ts`	`docker container prune -f`, `docker image prune -a -f`, `docker volume prune -f`, `docker builder prune -f` (a fixed allow-list — request body picks one of the four "types")	`requireAdmin`
`GET /vm/health`	`vm/repository.ts`	`bash $VM_SCRIPTS_PATH/vm-health-check.sh --json`	`requireAdmin`
`GET /vm/cleanup-log`	`vm/repository.ts`	reads `/host-logs/vm-cleanup.log`	`requireAdmin`
`GET /vm/cron-status`	`vm/repository.ts`	`crontab -l`	`requireAdmin`
`POST /vm/cleanup`	`vm/repository.ts`	`bash $VM_SCRIPTS_PATH/vm-cleanup.sh`	`requireAdmin`
`GET /vm/containers`, `.../unhealthy`, `.../:name/logs`	`vm/repository.ts`	`docker ps`, `docker inspect`, `docker stats`, `docker logs`	`requireAdmin`
`POST /vm/containers/:name/restart`	`vm/repository.ts`	`docker restart "<name>"` (name is a path param — see "Known sharp edges" below)	`requireAdmin`
`GET /vm/ollama/models`, `DELETE /vm/ollama/models/:name`	`vm/repository.ts`	HTTP-only (talks to host Ollama via `host-gateway`). No shell-out.	`requireAdmin`
`POST /code-quality/check`	`code-quality/repository.ts`	`npm run typecheck`, `npm run lint`, `npm run build`, `npm run test:run` in the request-supplied `projectPath`.	`requireAdmin` (added concurrently with this doc; previously unauthenticated — see the Phase 5 P1 commit)
`POST /deployments/trigger/:serviceId`	`deployments/orchestrator.ts`	`bash <service.scriptPath>` from the registered service registry (paths are stored at create-time, not request-time).	`requireAdmin`
`/hermes/ops` (snapshot)	`hermes-ops/repository.ts`	Read-only probes: `systemctl is-active/is-enabled`, `git status`, `du -sh`, `ps`, `tailscale ip`, `runuser -u uma -- systemctl --user ...`. No state-changing commands.	`requireAdmin` (Phase 7 — private-only)
`/hermes/telemetry/:instance`	`hermes-telemetry/repository.ts`	Read-only: `runuser -u <user> -- hermes sessions/cron/memory/skills list --json`, `git -C <backup-repo> log`, tail of the watchdog log. No state-changing commands.	`requireAdmin`

Blast radius if an admin token is leaked

Anyone holding a valid admin JWT for this product can, today:

Run any of the four pre-defined docker prune commands (data loss for containers/images/volumes), restart any container, read any container's logs.
Trigger the host VM cleanup script and list the crontab.
Trigger any deployment script registered in the service registry.
Run `npm run` lifecycle scripts in any directory the container can read (since code-quality/check takes a caller-supplied projectPath).
Read the three host logs that are mounted in.

In other words, an admin token is equivalent to a host shell, modulo the specific commands the codebase chooses to wrap. There is currently no allow-list wrapper between the backend and the docker socket; the backend constructs docker ... shell strings directly with execAsync.

Known sharp edges (track and shrink)

Container name is interpolated into a shell string. docker restart "${name}" and similar paths in vm/repository.ts use execAsync with a template literal. The :name path parameter is admin-only but is not validated against a ^[a-zA-Z0-9._-]+$ allow-list. Lock this down before exposing the dashboard to a wider admin pool.
projectPath for /code-quality/check is unvalidated. The handler passes the caller-supplied path straight into execAsync({ cwd }). Even with requireAdmin added, this should be constrained to a known set of project roots (or rejected if it escapes the workspace).
No per-route audit-log on shell-outs. audit/repository.ts records deployment triggers but not /docker/cleanup or /vm/cleanup. A leaked token's actions are reconstructable only from container stdout + host logs.
The container runs as root. Both the backend Dockerfile and the bind-mounts assume root. A non-root user with docker group membership would shrink the in-container blast radius without losing functionality (the socket is still root on the host); revisit when ready.
fastify-rate-limit is global, not per-route. A leaked admin token currently isn't slowed down on the destructive endpoints any more than it is on read-only ones.
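
The first sharp edge above (a container name interpolated into a shell string) can be narrowed along the following lines. This is a minimal sketch, not the project's code: `isValidContainerName`, `restartArgv`, and `restartContainer` are hypothetical names, and the pattern is the stricter one quoted in the mitigation roadmap, anchored so the whole name must match.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Allow-list pattern: starts with an alphanumeric, at most 128 characters total.
const CONTAINER_NAME = /^[a-zA-Z0-9][a-zA-Z0-9._-]{0,127}$/;

export function isValidContainerName(name: string): boolean {
  return CONTAINER_NAME.test(name);
}

// Hypothetical helper: build the argv for `docker restart`, rejecting bad names.
export function restartArgv(name: string): string[] {
  if (!isValidContainerName(name)) {
    throw new Error(`invalid container name: ${JSON.stringify(name)}`);
  }
  return ["restart", name];
}

// execFile hands argv straight to the binary, so no shell ever parses the name.
export async function restartContainer(name: string): Promise<string> {
  const { stdout } = await execFileAsync("docker", restartArgv(name));
  return stdout.trim();
}
```

Validation and argv passing are complementary: the allow-list rejects surprising names early, and `execFile` removes the shell, so even a name that slipped through would reach `docker` as a single argument.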

Mitigation roadmap (incremental, not all at once)

P1: Allow-list wrapper around shell-outs. (lib/shell.ts ships execAllowed, which runs execFile with an explicit argv and no shell, plus per-command helpers. dockerRestart(name) validates the name against [a-zA-Z0-9][a-zA-Z0-9._-]{0,127}. dockerPrune(kind, {all?}) validates kind ∈ {container,image,volume,builder} and rejects --all for anything other than image. runBashScript(path, args, {allowedRoots}) and runNpmScript(script, {cwd, allowedRoots}) restrict the script path and cwd to a configured set of roots. 17 unit tests cover the rejection paths; vm/restartContainer and system/dockerCleanup have been migrated. The module is covered by the test:coverage gate (≥95% lines).)
P1: Validate /code-quality/check's projectPath against a configured set of allowed roots. (runCodeQualityCheck now calls assertPathInAllowedRoots(projectPath, getAllowedRoots()) before any lifecycle script runs; getAllowedRoots() reads CODE_QUALITY_ALLOWED_ROOTS (colon-separated) with a default of /opt/bytelyst. The path is also re-resolved (normalised, .. collapsed) before being passed to runNpmScript, which lifts it to its own argv slot — no shell interpolation.)
P2: Audit-log every shell-out (command + arg vector + actor + result). (Audit schema extended with action: 'shell-exec' + entityType: 'host'. POST /docker/cleanup, POST /vm/cleanup, POST /vm/containers/:name/restart now write a Cosmos audit row including the actor (authUserId/authRole), entity id (docker-cleanup:<type> etc.), and a sanitized details payload. Audit writes are best-effort — a Cosmos hiccup logs a warn but never fails the request.)
P2: Run the backend container as a non-root user with docker group membership; rebuild the Dockerfile accordingly. (Dockerfile scaffolds a non-root app user (uid 1001) with docker group membership at a build-arg-configurable GID. Default BACKEND_USER=root preserves the current behaviour so existing deployments don't break; set BACKEND_USER=app and DOCKER_GID=$(getent group docker | cut -d: -f3) to flip it on. Requires host-side prep on the bind-mounted log files — see "Running non-root" below for the exact chmod/chgrp recipe.)
P3: Move from docker.sock to a thin daemon (docker-proxy-style) that exposes only the verbs the dashboard actually needs (stats, restart, logs, the four prune variants).
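
For the projectPath item in the roadmap above, a containment check can look roughly like this. It is a minimal sketch under stated assumptions: POSIX paths, no symlink resolution, and hypothetical helper names (`parseAllowedRoots`, `resolveInsideRoots`) that are not the project's `getAllowedRoots` / `assertPathInAllowedRoots` implementations. Only the colon-separated format and the `/opt/bytelyst` default come from this guide.

```typescript
import { resolve, sep } from "node:path";

// Parse a colon-separated list of roots; fall back to the documented default.
export function parseAllowedRoots(raw: string | undefined): string[] {
  const roots = (raw ?? "")
    .split(":")
    .map((r) => r.trim())
    .filter((r) => r.length > 0);
  return (roots.length > 0 ? roots : ["/opt/bytelyst"]).map((r) => resolve(r));
}

// Hypothetical helper: normalise the candidate (collapsing "..") and require it
// to be a root itself or a descendant of one. Symlinks are not followed here.
export function resolveInsideRoots(candidate: string, roots: string[]): string {
  const resolved = resolve(candidate);
  const ok = roots.some((root) => resolved === root || resolved.startsWith(root + sep));
  if (!ok) throw new Error(`path escapes allowed roots: ${resolved}`);
  return resolved;
}
```

Comparing against `root + sep` rather than the bare root matters: a plain prefix test would accept a sibling directory such as `/opt/bytelyst-evil`.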

Running non-root

Concrete recipe to flip the backend off root:

# 1. Find the host's docker group GID
DOCKER_GID=$(getent group docker | cut -d: -f3)

# 2. Make the bind-mounted log files group-owned by docker and group-writable
#    so the in-container `app` user (gid=$DOCKER_GID) can read/write them.
sudo chgrp docker /var/log/vm-cleanup.log /var/log/vm-health-check.log /var/log/docker-watchdog.log
sudo chmod g+rw /var/log/vm-cleanup.log /var/log/vm-health-check.log /var/log/docker-watchdog.log

# 3. Confirm the VM scripts mount is world-readable (it's read-only inside
#    the container, so mode 0755 on the directories is enough).
sudo chmod -R o+rX /opt/bytelyst/learning_ai_devops_tools/scripts

# 4. Rebuild the backend image with BACKEND_USER=app and the host's GID.
cd /opt/bytelyst/learning_ai_devops_tools/dashboard
docker compose build --build-arg BACKEND_USER=app --build-arg DOCKER_GID="$DOCKER_GID" backend

# 5. Restart and verify
docker compose up -d backend
docker exec devops-backend whoami   # → app
docker exec devops-backend id       # uid=1001(app) gid=$DOCKER_GID(docker)
curl -fsS http://localhost:4004/health

If the backend can't reach the docker socket after the flip, double-check that the gid of the in-container docker group matches the output of getent group docker | cut -d: -f3 on the host. The docker.sock bind-mount keeps its host ownership inside the container, so the in-container gid must match.
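
The gid comparison can also be checked from inside the container with Node. This is a hedged diagnostic sketch, not part of the backend: `hasGroupRw` and `canUseDockerSocket` are hypothetical names, and the check looks only at group permissions (it ignores owner and other bits, ACLs, and any extra restrictions on the socket).

```typescript
import { statSync } from "node:fs";

// Pure check: does a process with these group ids get group read+write on the file?
export function hasGroupRw(fileGid: number, mode: number, processGids: number[]): boolean {
  const groupRw = (mode & 0o060) === 0o060;
  return groupRw && processGids.includes(fileGid);
}

// Hypothetical diagnostic, meant to run inside the container (POSIX only).
export function canUseDockerSocket(path = "/var/run/docker.sock"): boolean {
  const st = statSync(path);
  // Primary gid plus supplementary groups of the current process.
  const gids = [process.getgid?.() ?? -1, ...(process.getgroups?.() ?? [])];
  return hasGroupRw(st.gid, st.mode, gids);
}
```

If `canUseDockerSocket()` returns false while the socket is group-writable on the host, the in-container docker group was most likely built with a different gid than the host's.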

Operators reviewing whether to grant a new admin should read this whole section before doing so. Adding a new shell-out path in code is a privilege change and must update the route table above in the same commit.

Production Checklist

Platform stack running with Traefik.
DNS records configured.
SSL/TLS certificates configured in Traefik.
Environment variables set for production.
Cosmos DB connection configured.
JWT_SECRET matches across all services.
User memberships configured for access.
Health checks passing.
Cross-navigation links working.
Monitoring and logging configured.

Features Implemented

Backend (port 4004)

✅ CI/CD pipeline with Gitea Actions
✅ E2E tests with Playwright (gated; see .gitea/workflows/ci.yml)
✅ Telemetry integration
✅ Error boundary
✅ CSRF protection with token refresh
✅ Service CRUD operations
✅ Deployment log retrieval (JSON polling — no SSE; see backend README)
✅ Audit logging
✅ Structured logging
✅ Database migrations
✅ Backup/restore functionality
✅ Performance monitoring (APM)
✅ System metrics (CPU, memory, disk)
✅ Docker cleanup endpoints
✅ OpenAPI/Swagger documentation at /docs

Frontend (container :3000, host :3049 under Compose)

✅ Service management UI
✅ Deployment monitoring
✅ Health dashboard
✅ Metrics/charts page
✅ System management page
✅ Log viewer (poll-based)
✅ Accessibility features (ARIA, keyboard nav)
✅ PWA manifest
✅ Responsive design
