Closes the three Phase 5 P2 follow-ups from the DEPLOYMENT.md
mitigation roadmap that don't need infra changes. Two P2 items remain
(non-root container, docker-proxy daemon) — both genuinely need
container/orchestration work and stay queued.
1. Allow-list shell wrapper (P1)
New `lib/shell.ts`:
- `execAllowed(cmd, args, opts)` — `execFile`-only, no shell, no
interpolation. Single escape hatch for ad-hoc invocations.
- `dockerRestart(name)` — name validated against
`[a-zA-Z0-9][a-zA-Z0-9._-]{0,127}`; throws InvalidShellArgError
on anything else (including non-strings, shell metacharacters,
command-substitution attempts). Tests cover all of these.
- `dockerPrune(kind, {all?})` — kind constrained to
{container,image,volume,builder}; `--all` only valid for image.
- `runBashScript(path, args, {allowedRoots})` — script path AND
cwd both checked against allowed roots; rejects `..` escapes
and prefix-matching siblings (`/opt/projects-evil` vs
`/opt/projects`).
- `runNpmScript(script, {cwd, allowedRoots})` — script ∈
{typecheck,lint,build,test,test:run,start}; cwd inside roots.
17 unit tests cover every rejection path. Module added to the
coverage gate (≥95% lines).
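As a rough illustration of the `dockerRestart` rule above, here is a minimal sketch assuming Node's built-in `child_process.execFile`. The regex and the `InvalidShellArgError` name come from this commit message; the class body and the `assertContainerName` helper are hypothetical and not necessarily how `lib/shell.ts` is written.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Error name taken from the commit text; this class body is an assumption.
export class InvalidShellArgError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InvalidShellArgError";
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Pattern from the commit text, anchored so the whole string must match.
const CONTAINER_NAME = /^[a-zA-Z0-9][a-zA-Z0-9._-]{0,127}$/;

// Hypothetical helper: returns the name unchanged or throws.
export function assertContainerName(name: unknown): string {
  if (typeof name !== "string" || !CONTAINER_NAME.test(name)) {
    throw new InvalidShellArgError("invalid container name");
  }
  return name;
}

export async function dockerRestart(name: unknown): Promise<string> {
  const safe = assertContainerName(name);
  // execFile passes argv straight to the binary: no shell, no interpolation.
  const { stdout } = await execFileAsync("docker", ["restart", safe]);
  return stdout.trim();
}
```

Because the name travels in its own argv slot, input such as `$(id)` would reach `docker` as a literal string even if validation were skipped; the regex is a second layer on top of that.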
Migrated highest-risk callers off template-literal `exec`:
- `vm/repository.ts:restartContainer` → `dockerRestart`. Was
previously `await execAsync(\`docker restart "${name}"\`)`
with only a regex check; now goes through the wrapper.
- `system/repository.ts:dockerCleanup` → `dockerPrune` per kind
+ `execAllowed` for `docker system df`. Drops the array of
template-literal command strings entirely.
- `code-quality/repository.ts` → `runNpmScript` for every
lifecycle invocation. cwd is now the resolved (normalised,
`..`-collapsed) path, not the raw input.
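The `dockerPrune` constraints can be pictured as building an argv array instead of a command string. This is a sketch only: `buildPruneArgv` is a hypothetical name, and the `-f` / `-a` flags are taken from the prune commands listed in DEPLOYMENT.md rather than from the actual wrapper.

```typescript
// Kinds listed in the commit text.
const PRUNE_KINDS: readonly string[] = ["container", "image", "volume", "builder"];

// Hypothetical helper: validates the inputs and returns the argv for `docker`.
export function buildPruneArgv(kind: unknown, opts: { all?: boolean } = {}): string[] {
  if (typeof kind !== "string" || PRUNE_KINDS.indexOf(kind) === -1) {
    throw new Error("invalid prune kind");
  }
  // Per the commit text, --all is only accepted for images.
  if (opts.all && kind !== "image") {
    throw new Error("--all is only valid for image prune");
  }
  const argv = [kind, "prune", "-f"];
  if (opts.all) argv.push("-a");
  return argv;
}
```

The result would then be handed to an `execFile`-style call, for example `execAllowed("docker", buildPruneArgv("image", { all: true }))`, so no command string is ever assembled.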
2. projectPath validation for /code-quality/check (P1)
`runCodeQualityCheck` now calls
`assertPathInAllowedRoots(projectPath, getAllowedRoots())` before
any subprocess spawns. `getAllowedRoots()` reads
`CODE_QUALITY_ALLOWED_ROOTS` (colon-separated env, defaults to
`/opt/bytelyst`). Rejection happens with a clear error message
listing the configured roots so operators know what to allow.
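A possible shape for the two helpers named above, assuming POSIX paths and Node's `path` module. The function names, the env variable, and the default root come from this commit message; the bodies are an illustrative guess. Note that this sketch does not resolve symlinks.

```typescript
import { posix as path } from "node:path";

// Reads the colon-separated env value; falls back to the documented default.
export function getAllowedRoots(
  env: Record<string, string | undefined> = process.env,
): string[] {
  const raw = env.CODE_QUALITY_ALLOWED_ROOTS ?? "";
  const roots = raw.split(":").map((r) => r.trim()).filter((r) => r.length > 0);
  return (roots.length > 0 ? roots : ["/opt/bytelyst"]).map((r) => path.resolve(r));
}

// Returns the normalised path, or throws with the configured roots listed.
export function assertPathInAllowedRoots(candidate: string, roots: string[]): string {
  const resolved = path.resolve(candidate); // collapses `..` segments
  const inside = roots.some((root) => {
    const rel = path.relative(root, resolved);
    // A relative path that climbs out of the root starts with `..`;
    // comparing whole segments avoids the `/opt/projects-evil` prefix trap.
    return rel === "" || (rel !== ".." && !rel.startsWith("../") && !path.isAbsolute(rel));
  });
  if (!inside) {
    throw new Error(`path ${resolved} is outside allowed roots: ${roots.join(", ")}`);
  }
  return resolved;
}
```

Comparing via `path.relative` rather than `startsWith(root)` is what rejects sibling directories that merely share a prefix with an allowed root.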
3. Audit-log every privileged shell-out (P2)
`audit/types.ts` extended: `action` now includes `'shell-exec'`,
`entityType` includes `'host'`. The migration is additive — old
audit rows still validate.
Three privileged routes now write a `shell-exec` audit row with
actor (authUserId / authRole), entity id, and a sanitized details
payload before responding:
- `POST /docker/cleanup` — `entityId: docker-cleanup:<type>`,
details include {type, force, freedSpace}.
- `POST /vm/cleanup` — `entityId: vm-cleanup:<mode>`.
- `POST /vm/containers/:name/restart` — `entityId:
container-restart:<name>`, details include {success, message}.
Audited even on failure so attempted privileged actions are
still recorded.
Audit writes are best-effort — a Cosmos hiccup logs a warn but
never fails the request the operator was running.
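The best-effort behaviour can be sketched as a small wrapper that swallows write failures and logs a warning. The `action` and `entityType` values and the actor fields are from this commit message; the `ShellExecAudit` shape, the `auditBestEffort` name, and the injected writer and logger are assumptions for illustration.

```typescript
// Assumed row shape; only the field values named in the commit are certain.
export interface ShellExecAudit {
  action: "shell-exec";
  entityType: "host";
  entityId: string;
  actor: { authUserId: string; authRole: string };
  details: Record<string, unknown>;
}

type AuditWriter = (row: ShellExecAudit) => Promise<void>;
type WarnLogger = (msg: string, err: unknown) => void;

// Hypothetical helper: resolves to false instead of throwing when the store fails.
export async function auditBestEffort(
  write: AuditWriter,
  row: ShellExecAudit,
  warn: WarnLogger,
): Promise<boolean> {
  try {
    await write(row);
    return true;
  } catch (err) {
    // The operator's request must not fail because the audit store hiccupped.
    warn("audit write failed", err);
    return false;
  }
}
```

A route handler would call this in both its success and failure branches, which matches the "audited even on failure" behaviour described for the restart route.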
Verified: backend typecheck ✅, 74/74 unit tests ✅ (17 new for
shell.ts + audit changes), 7/7 E2E ✅, lint 0 errors, coverage gate
≥95% lines on every gated file (which now includes shell.ts).
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
DevOps & Admin Dashboard Deployment Guide
Canonical deployment doc for `dashboard/`. The previous `DEPLOYMENT_GUIDE.md` has been folded into this file; it remains as a one-line redirect for backwards compatibility with `deploy.sh` and external links.
Overview
This guide covers deploying both the DevOps Dashboard and Platform Admin Dashboard using the existing Traefik gateway infrastructure, following the same pattern as the trading dashboard (https://invttrdg.bytelyst.com).
Public URLs
- DevOps Dashboard: https://devops.bytelyst.com
- Admin Dashboard: https://admin.bytelyst.com
- API Gateway: https://api.bytelyst.com
  - Platform API: https://api.bytelyst.com/platform/api
  - DevOps API: https://api.bytelyst.com/api/devops
Ports — quick reference
The web container always listens on 3000 internally; what changes is what the host exposes. Memorize the column for the deployment mode you're in:
| Mode | Web (host) | Backend (host) | Notes |
|---|---|---|---|
| Local dev (`pnpm dev`) | `localhost:3000` | `localhost:4004` | Next listens directly on 3000. |
| Docker Compose (this repo) | `localhost:3049` | `localhost:4004` | `docker-compose.yml` maps `127.0.0.1:3049:3000` (loopback only). |
| Production (Traefik) | `https://devops.bytelyst.com` | `https://api.bytelyst.com/api/devops` | Traefik label `loadbalancer.server.port=3000` targets the container port. |
Whenever a doc says "the dashboard runs on port 3000", it means the container port seen by Traefik / Next dev mode — not the host port for the deployed stack. Use the table above instead of relying on prose.
Architecture
Internet → Traefik Gateway → Services
├─ DevOps Web (container :3000, host :3049)
├─ DevOps Backend (:4004)
├─ Admin Web (:3001)
├─ Platform Service (:4003)
└─ Trading Dashboard (:3085)
- Traefik: API gateway and reverse proxy.
- Docker network: All services connect via `learning_ai_common_plat_default`.
- Domain routing: Traefik routes by host header.
- SSL/TLS: Managed by Traefik with Let's Encrypt.
Prerequisites
- Platform stack running with Traefik gateway.
- Docker and Docker Compose installed.
- Domain names configured with DNS pointing to your server.
- Azure Cosmos DB account (shared with platform-service).
- Platform Service running and accessible.
Quick Start
1. Start the platform stack (if not running)
cd /opt/bytelyst/learning_ai_common_plat
docker-compose up -d
2. Deploy the dashboards
cd /opt/bytelyst/learning_ai_devops_tools/dashboard
./deploy.sh
This will:
- Deploy the DevOps Dashboard (backend + web)
- Deploy the Admin Dashboard via the platform stack
- Run health checks
- Print deployment information
Local development
If you only need a non-containerized iteration loop (no Traefik, no Docker):
cd /opt/bytelyst/learning_ai_devops_tools/dashboard
# Resolve workspace deps
pnpm install:common-plat # uses sibling learning_ai_common_plat checkout
# or
pnpm install:gitea # uses local Gitea registry at localhost:3300
pnpm dev # backend on 4004, web on 3000 (NOT 3049)
Required env vars are documented under Environment Configuration below; for
local dev a minimal .env with JWT_SECRET, COSMOS_*, and
PLATFORM_SERVICE_URL is enough.
Manual Docker deployment
Deploy DevOps Dashboard
cd /opt/bytelyst/learning_ai_devops_tools/dashboard
docker-compose up -d --build
Deploy Admin Dashboard
cd /opt/bytelyst/learning_ai_common_plat
docker-compose up -d admin-web
Environment Configuration
DevOps Dashboard (.env)
# Backend
PORT=4004
PLATFORM_SERVICE_URL=http://platform-service:4003
COSMOS_ENDPOINT=https://your-cosmos-account.documents.azure.com:443/
COSMOS_KEY=your-cosmos-primary-key
COSMOS_DATABASE=bytelyst-platform
JWT_SECRET=your-production-jwt-secret
CSRF_SECRET=your-production-csrf-secret
ENCRYPTION_KEY=your-production-encryption-key
PRODUCT_ID=bytelyst-devops
PRODUCT_NAME=ByteLyst DevOps Dashboard
# Azure Key Vault (optional)
AZURE_TENANT_ID=your-tenant-id
AZURE_CLIENT_ID=your-client-id
AZURE_CLIENT_SECRET=your-client-secret
AZURE_KEY_VAULT_URL=https://your-keyvault.vault.azure.net/
# Frontend
NEXT_PUBLIC_DEVOPS_API_URL=https://api.bytelyst.com/devops
NEXT_PUBLIC_PLATFORM_URL=https://api.bytelyst.com/platform/api
NEXT_PUBLIC_ADMIN_WEB_URL=https://admin.bytelyst.com
NEXT_PUBLIC_PRODUCT_ID=bytelyst-devops
NEXT_PUBLIC_PRODUCT_NAME=ByteLyst DevOps Dashboard
Platform Dashboard (.env)
Add to your platform .env:
# Admin Web Dashboard
NEXT_PUBLIC_PLATFORM_URL=https://api.bytelyst.com/platform/api
NEXT_PUBLIC_DEVOPS_WEB_URL=https://devops.bytelyst.com
Traefik Configuration
Both dashboards use Traefik labels for routing.
DevOps Web
labels:
- 'traefik.enable=true'
- 'traefik.http.routers.devops-web.rule=Host(`devops.bytelyst.com`)'
- 'traefik.http.services.devops-web.loadbalancer.server.port=3000' # container port
DevOps Backend API
labels:
- 'traefik.enable=true'
- 'traefik.http.routers.devops-api.rule=PathPrefix(`/api/devops`)'
- 'traefik.http.services.devops-api.loadbalancer.server.port=4004'
Admin Web
labels:
- 'traefik.enable=true'
- 'traefik.http.routers.admin-web.rule=Host(`admin.bytelyst.com`)'
- 'traefik.http.services.admin-web.loadbalancer.server.port=3001'
DNS Configuration
Add DNS records pointing to your Traefik gateway server:
devops.bytelyst.com A <your-server-ip>
admin.bytelyst.com A <your-server-ip>
api.bytelyst.com A <your-server-ip>
SSL/TLS Configuration
Traefik can automatically handle SSL certificates with Let's Encrypt:
command:
- '--certificatesresolvers.myresolver.acme.tlschallenge=true'
- '--certificatesresolvers.myresolver.acme.email=admin@bytelyst.com'
- '--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json'
Then update router labels:
labels:
- 'traefik.http.routers.devops-web.tls=true'
- 'traefik.http.routers.devops-web.tls.certresolver=myresolver'
Cross-Navigation
DevOps Dashboard → Admin Dashboard
- Header includes a "Platform Admin" link with Shield icon.
- Opens admin dashboard in a new tab.
- Uses `NEXT_PUBLIC_ADMIN_WEB_URL`.
Admin Dashboard → DevOps Dashboard
- Sidebar includes a "DevOps Dashboard" link with Server icon.
- Opens devops dashboard in a new tab.
- Uses `NEXT_PUBLIC_DEVOPS_WEB_URL`.
Shared Authentication
- Platform Service Auth: Both authenticate against platform-service.
- JWT Tokens: Same `JWT_SECRET` validates tokens across services.
- Per-Product Access: Admin access is checked per-product via membership roles.
- Single Sign-On: Users stay logged in across both dashboards.
Granting Access
To grant a user access to both dashboards:
- Ensure user exists in platform-service.
- Add admin membership for both products:
{
"memberships": [
{ "productId": "bytelyst-devops", "role": "admin", "plan": "pro" },
{ "productId": "bytelyst-platform", "role": "admin", "plan": "pro" }
]
}
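To make the per-product check concrete, here is a minimal sketch of how a membership list like the one above might be evaluated. The field names mirror the JSON payload; `hasAdminAccess` and the `Membership` type are hypothetical and do not describe platform-service's real implementation.

```typescript
// Shape mirrors the membership objects in the payload above.
export interface Membership {
  productId: string;
  role: string;
  plan: string;
}

// Hypothetical helper: true when the user holds an admin role for the product.
export function hasAdminAccess(
  memberships: readonly Membership[],
  productId: string,
): boolean {
  return memberships.some((m) => m.productId === productId && m.role === "admin");
}
```

Under this reading, a user with only the `bytelyst-platform` membership could open the Admin Dashboard but not the DevOps Dashboard, which is why both entries are needed.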
Health Checks
- DevOps Backend: `http://localhost:4004/health`
- DevOps Web: `http://localhost:3049` (Docker Compose host port; container :3000)
- Admin Web: `http://localhost:3001`
- Traefik Dashboard: `http://localhost:8080`
Troubleshooting
Network issues
# Check if the platform network exists
docker network inspect learning_ai_common_plat_default
# Check container connectivity
docker network inspect learning_ai_common_plat_default | grep devops
Traefik routing
# Traefik dashboard
http://localhost:8080
# Traefik logs
docker logs $(docker ps -q -f name=gateway)
# Router config for the devops web container
docker inspect devops-web | grep -A 10 Labels
Authentication failures
- Verify `JWT_SECRET` matches across all services.
- Check platform-service is accessible: `curl http://localhost:4003/health`.
- Ensure the user has the right product memberships.
Service not starting
docker logs devops-backend
docker logs devops-web
docker logs admin-web
docker ps
docker inspect devops-backend | grep -A 5 Health
Workspace dependency errors
pnpm install:common-plat # local sibling checkout
pnpm install:gitea # local Gitea registry
Service Management
Stop services
cd /opt/bytelyst/learning_ai_devops_tools/dashboard
docker-compose down
cd /opt/bytelyst/learning_ai_common_plat
docker-compose stop admin-web
Restart services
cd /opt/bytelyst/learning_ai_devops_tools/dashboard
docker-compose restart
cd /opt/bytelyst/learning_ai_common_plat
docker-compose restart admin-web
View logs
# DevOps
docker logs -f devops-backend
docker logs -f devops-web
# Admin
docker logs -f admin-web
# Traefik
docker logs -f gateway
Comparison with Trading Dashboard
| Feature | Trading | DevOps | Admin |
|---|---|---|---|
| Domain | invttrdg.bytelyst.com | devops.bytelyst.com | admin.bytelyst.com |
| Web Port | 3085 (host) | 3049 (host) / 3000 (ctr) | 3001 (host) |
| Backend Port | 4018 | 4004 | N/A |
| Network | platform_net | platform_net | default |
| Traefik | Yes | Yes | Yes |
| Auth | Platform | Platform | Platform |
Privilege Surface (Docker socket + host mounts)
The devops-backend container has root-equivalent access to the host. This
section documents exactly what is mounted, which routes use each mount, and
what the blast radius looks like if an admin token leaks. It exists so reviewers
don't have to reverse-engineer this from docker-compose.yml and the route
handlers — and so any future change to the mount set is reviewed against this
list rather than slipped in.
Mounts (from docker-compose.yml)
| Host path | Container path | Mode | Purpose |
|---|---|---|---|
| `/var/run/docker.sock` | `/var/run/docker.sock` | rw | Allows the `docker` CLI inside the container to control the host daemon. Used by the system and vm modules. Equivalent to root on the host. |
| `/opt/bytelyst/learning_ai_devops_tools/scripts` | `/vm-scripts` | ro | Bash scripts the vm module shells out to (`HostingerVM/*.sh`). Read-only mount; the container cannot modify the script set. |
| `/var/log/vm-cleanup.log` | `/host-logs/vm-cleanup.log` | rw | The vm cleanup script appends here; backend reads it via `/api/vm/cleanup-log`. |
| `/var/log/vm-health-check.log` | `/host-logs/vm-health-check.log` | rw | Health-check probe output; backend reads it via `/api/vm/health`. |
| `/var/log/docker-watchdog.log` | `/host-logs/docker-watchdog.log` | rw | Watchdog tail used by the VM panel. |
| `extra_hosts: host-gateway` | `host.docker.internal`-equivalent | n/a | Lets the container reach `host:11434` (Ollama) and other host-only services. Not a filesystem mount, but a privilege-relevant capability: the container can talk to anything bound to 127.0.0.1 on the host. |
The container's listening port (4004) is bound to 127.0.0.1 only, so the
API is not exposed to the public internet by this compose file — access is
expected via Tailscale or an SSH tunnel. Any reverse proxy in front of it
(Traefik in production) is responsible for its own auth + TLS.
What shells out + which routes (auth column = effective gate)
| Route | Handler module | What it executes | Auth |
|---|---|---|---|
| `GET /system/metrics` | `system/repository.ts` | `df -h ...` | `requireAdmin` |
| `GET /docker/stats` | `system/repository.ts` | `docker images / ps / volume ls / system df` (read-only) | `requireAdmin` |
| `POST /docker/cleanup` | `system/repository.ts` | `docker container prune -f`, `docker image prune -a -f`, `docker volume prune -f`, `docker builder prune -f` (a fixed allow-list; the request body picks one of the four "types") | `requireAdmin` |
| `GET /vm/health` | `vm/repository.ts` | `bash $VM_SCRIPTS_PATH/vm-health-check.sh --json` | `requireAdmin` |
| `GET /vm/cleanup-log` | `vm/repository.ts` | reads `/host-logs/vm-cleanup.log` | `requireAdmin` |
| `GET /vm/cron-status` | `vm/repository.ts` | `crontab -l` | `requireAdmin` |
| `POST /vm/cleanup` | `vm/repository.ts` | `bash $VM_SCRIPTS_PATH/vm-cleanup.sh` | `requireAdmin` |
| `GET /vm/containers`, `.../unhealthy`, `.../:name/logs` | `vm/repository.ts` | `docker ps`, `docker inspect`, `docker stats`, `docker logs` | `requireAdmin` |
| `POST /vm/containers/:name/restart` | `vm/repository.ts` | `docker restart "<name>"` (name is a path param; see "Known sharp edges" below) | `requireAdmin` |
| `GET /vm/ollama/models`, `DELETE /vm/ollama/models/:name` | `vm/repository.ts` | HTTP-only (talks to host Ollama via host-gateway). No shell-out. | `requireAdmin` |
| `POST /code-quality/check` | `code-quality/repository.ts` | `npm run typecheck`, `npm run lint`, `npm run build`, `npm run test:run` in the request-supplied `projectPath`. | `requireAdmin` (added concurrently with this doc; previously unauthenticated; see the Phase 5 P1 commit) |
| `POST /deployments/trigger/:serviceId` | `deployments/orchestrator.ts` | `bash <service.scriptPath>` from the registered service registry (paths are stored at create-time, not request-time). | `requireAdmin` |
| `/hermes/ops` (snapshot) | `hermes-ops/repository.ts` | Read-only probes: `systemctl is-active/is-enabled`, `git status`, `du -sh`, `ps`, `tailscale ip`, `runuser -u uma -- systemctl --user ...`. No state-changing commands. | `requireAdmin` (Phase 7, private-only) |
| `/hermes/telemetry/:instance` | `hermes-telemetry/repository.ts` | Read-only: `runuser -u <user> -- hermes sessions/cron/memory/skills list --json`, `git -C <backup-repo> log`, tail of the watchdog log. No state-changing commands. | `requireAdmin` |
Blast radius if an admin token is leaked
Anyone holding a valid admin JWT for this product can, today:
- Run any of the four pre-defined `docker prune` commands (data loss for containers/images/volumes), restart any container, read any container's logs.
- Trigger the host VM cleanup script and crontab listing.
- Trigger any deployment script registered in the service registry.
- Run `npm run` lifecycle scripts in any directory the container can read (since `code-quality/check` takes a caller-supplied `projectPath`).
- Read the three host logs that are mounted in.
In other words, an admin token is equivalent to a host shell, modulo the
specific commands the codebase chooses to wrap. There is currently no
allow-list wrapper between the backend and the docker socket; the backend
constructs docker ... shell strings directly with execAsync.
Known sharp edges (track and shrink)
- Container name is interpolated into a shell string. `docker restart "${name}"` and similar paths in `vm/repository.ts` use `execAsync` with a template literal. The `:name` path parameter is admin-only but is not validated against a `^[a-zA-Z0-9._-]+$` allow-list. Lock this down before exposing the dashboard to a wider admin pool.
- `projectPath` for `/code-quality/check` is unvalidated. The handler passes the caller-supplied path straight into `execAsync({ cwd })`. Even with `requireAdmin` added, this should be constrained to a known set of project roots (or rejected if it escapes the workspace).
- No per-route audit-log on shell-outs. `audit/repository.ts` records deployment triggers but not `/docker/cleanup` or `/vm/cleanup`. A leaked token's actions are reconstructable only from container stdout + host logs.
- The container runs as root. Both the backend `Dockerfile` and the bind-mounts assume root. A non-root user with `docker` group membership would shrink the in-container blast radius without losing functionality (the socket is still root on the host); revisit when ready.
- `fastify-rate-limit` is global, not per-route. A leaked admin token currently isn't slowed down on the destructive endpoints any more than it is on read-only ones.
Mitigation roadmap (incremental, not all at once)
- P1: Allow-list wrapper around shell-outs. (`lib/shell.ts` ships with `execAllowed` (no shell, just `execFile` with an explicit argv) plus per-command helpers: `dockerRestart(name)` validates against `[a-zA-Z0-9][a-zA-Z0-9._-]{0,127}`, `dockerPrune(kind, {all?})` validates kind ∈ {container,image,volume,builder} and rejects `--all` on non-image, `runBashScript(path, args, {allowedRoots})` and `runNpmScript(script, {cwd, allowedRoots})` lock both the script path and cwd to a configured set of roots. 17 unit tests cover the rejection paths; `vm/restartContainer` and `system/dockerCleanup` migrated. Module covered by the test:coverage gate (≥95% lines).)
- P1: Validate `/code-quality/check`'s `projectPath` against a configured set of allowed roots. (`runCodeQualityCheck` now calls `assertPathInAllowedRoots(projectPath, getAllowedRoots())` before any lifecycle script runs; `getAllowedRoots()` reads `CODE_QUALITY_ALLOWED_ROOTS` (colon-separated) with a default of `/opt/bytelyst`. The path is also re-resolved (normalised, `..` collapsed) before being passed to `runNpmScript`, which lifts it to its own argv slot, so there is no shell interpolation.)
- P2: Audit-log every shell-out (command + arg vector + actor + result). (Audit schema extended with `action: 'shell-exec'` + `entityType: 'host'`. `POST /docker/cleanup`, `POST /vm/cleanup`, `POST /vm/containers/:name/restart` now write a Cosmos audit row including the actor (`authUserId` / `authRole`), entity id (`docker-cleanup:<type>` etc.), and a sanitized details payload. Audit writes are best-effort: a Cosmos hiccup logs a warn but never fails the request.)
- P2: Run the backend container as a non-root user with `docker` group membership; rebuild the Dockerfile accordingly.
- P3: Move from `docker.sock` to a thin daemon (docker-proxy-style) that exposes only the verbs the dashboard actually needs (`stats`, `restart`, `logs`, the four `prune` variants).
Operators reviewing whether to grant a new admin should read this whole section before doing so. Adding a new shell-out path in code is a privilege change and must update this table in the same commit.
Production Checklist
- Platform stack running with Traefik.
- DNS records configured.
- SSL/TLS certificates configured in Traefik.
- Environment variables set for production.
- Cosmos DB connection configured.
- `JWT_SECRET` matches across all services.
- User memberships configured for access.
- Health checks passing.
- Cross-navigation links working.
- Monitoring and logging configured.
Features Implemented
Backend (port 4004)
- ✅ CI/CD pipeline with Gitea Actions
- ✅ E2E tests with Playwright (gated; see `.gitea/workflows/ci.yml`)
- ✅ Telemetry integration
- ✅ Error boundary
- ✅ CSRF protection with token refresh
- ✅ Service CRUD operations
- ✅ Deployment log retrieval (JSON polling — no SSE; see backend README)
- ✅ Audit logging
- ✅ Structured logging
- ✅ Database migrations
- ✅ Backup/restore functionality
- ✅ Performance monitoring (APM)
- ✅ System metrics (CPU, memory, disk)
- ✅ Docker cleanup endpoints
- ✅ OpenAPI/Swagger documentation at `/docs`
Frontend (container :3000, host :3049 under Compose)
- ✅ Service management UI
- ✅ Deployment monitoring
- ✅ Health dashboard
- ✅ Metrics/charts page
- ✅ System management page
- ✅ Log viewer (poll-based)
- ✅ Accessibility features (ARIA, keyboard nav)
- ✅ PWA manifest
- ✅ Responsive design